Joined: 7 Sep 05
I run Einstein@Home on a very large bunch of computers using a 'much overlooked and under-appreciated' Linux distro called PCLinuxOS - PCLOS to its friends. It's a "rolling-release" distro - install once, update forever. The potential show stopper with that release model is that a poorly implemented update can wreck the whole system. I've been running it since 2006 and I've never had that problem happen to me.
There is a single, well maintained repository. Nothing gets into it without the say-so of the boss who does the bulk of the packaging. I don't know how he finds the time, but the quality of the packages is spot on and issues get resolved extremely quickly. To overcome any potential 'rolling-release' headaches I maintain my own full local copy of the PCLOS repo and clone it at times of known good stability. Currently, a clone copy is ~36GB and I have about 20 dated clone copies going way back to around 2012, all on a 2TB external USB drive that's only half full.
I created a new clone (dated 29 Nov 2020) and chose to test it by updating a machine that was last updated in March 2018, and had been working fine. It was one of a bunch I'd decided to shut down to limit my summer heat problem. I don't usually try updating after such a long interval (I usually fresh install from a fully kitted out remaster) so I was interested to see what happened. I actually expected it to fail. There wasn't a single problem. I chose several others (different hardware) with the same result. 29 Nov 2020 is obviously a keeper.
PCLOS refuses to package BOINC. They did many years ago but the boss was frustrated with what he referred to as "crap that isn't even alpha quality". I've always used the self-extracting archive on the BOINC website until they stopped making those. The last one I used was 7.2.42. In early 2017 I bit the bullet and followed the instructions for building BOINC on unix that I found on the website and was able to work out the full list of development packages needed. That first one was quite an adventure but it's very simple these days. I had built 7.16.5 earlier this year and with the PCLOS 29 Nov 20 repo a keeper, it seemed like the perfect time to build something a little more recent. I chose 7.16.11 which I built yesterday.
I tested the new build on two widely different machines and had no issues with the way I launch the client or with a local or remote manager connecting with the running client. I also updated the OpenCL capability on those machines using the Red Hat flavour of amdgpu-pro 20.40 from which I had extracted the OpenCL libs. One machine was using an RX 460 GPU, the other an RX 570 and everything was normal - so I added the 2 machines to the list of hosts for automatic control and went home.
I use a series of scripts that check all aspects of all 'production' machines. The two prime functions are to monitor regularly for 'misbehaviour' (once per hour, 24/7) and to control work fetch and data file download behaviour. To find hosts that need attention, the first thing is to make sure the host is 'pingable' and that both BOINC and the science apps are running. To make sure nothing is 'spinning its wheels', I use the kernel's tools for tracking the use of CPU cycles. My main interest is in GPU tasks and the kernel can show exactly how many 'clock ticks' per second get used by the CPU in supporting a GPU task. This has proved extremely reliable for detecting a stuck GPU.
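For anyone wanting to try the same kind of check, here is a minimal sketch (not the actual monitoring script) of reading a process's CPU tick count on Linux. It assumes the counters come from /proc/<pid>/stat, where fields 14 and 15 hold user and system time in clock ticks; the function names cpu_ticks and ticks_per_second are made up for illustration.

```shell
#!/bin/bash
# Sketch only: total CPU clock ticks (user + system) a process has used,
# read from /proc/<pid>/stat on Linux. Function names are hypothetical.
cpu_ticks() {
    local pid=$1 stat rest
    stat=$(cat "/proc/$pid/stat") || return 1
    # Field 2 (the command name) sits in parentheses and may contain spaces,
    # so drop everything up to the last ") " before splitting into fields.
    rest=${stat##*) }
    set -- $rest
    # $1 is now field 3 (state), so utime (field 14) and stime (field 15)
    # are at positions 12 and 13.
    echo $(( ${12} + ${13} ))
}

# Average ticks used per second over a sampling interval (default 10 s).
ticks_per_second() {
    local pid=$1 interval=${2:-10} before after
    before=$(cpu_ticks "$pid") || return 1
    sleep "$interval"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( (after - before) / interval ))
}
```

Calling ticks_per_second with the PID of a science app gives a number to compare against what a healthy task normally uses on that host. What counts as 'stuck' depends on the app and the hardware, so any threshold would have to be worked out by observation.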
For work fetch aspects, I cache data files so that any particular file gets downloaded once and deployed to all machines that subsequently might need it. This works because the control script manipulates the work cache setting. Normal is 0.05 days. Six times a day, the control script verifies a host has all current data and then increases the cache setting to the desired value (currently ~1.5 days) to trigger a work fetch. There is a suitable interval between each host and once all have had time to finish 'feeding', they have the cache size returned to 0.05 days. If any host receives new data, it is deployed to all others and also sent to the cache. The stats tell me that many hundreds of potential data file downloads get saved every day so I feel this is worthwhile.
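The raise-then-restore cycle described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual control script: it assumes the setting lives in a work_buf_min_days element of global_prefs_override.xml, it uses GNU sed, and the function name set_cache_days is hypothetical. Unlike a fixed 0.05-to-1.50 substitution, this variation overwrites whatever value is currently there.

```shell
#!/bin/bash
# Sketch only: set the work cache size in a prefs override file.
# set_cache_days is a hypothetical helper; GNU sed is assumed for -i.
set_cache_days() {
    local file=$1 days=$2
    # Only lines mentioning min_days are touched; the numeric text between
    # '>' and '<' is replaced with the requested value.
    sed -i "/min_days/s/>[0-9.]*</>${days}</" "$file"
}

# Example cycle on one file (values taken from the post):
#   set_cache_days BOINC/global_prefs_override.xml 1.50   # trigger work fetch
#   ... wait for the host to finish 'feeding' ...
#   set_cache_days BOINC/global_prefs_override.xml 0.05   # back to normal
```

Because the value is overwritten rather than matched, running it twice in a row is harmless. The client still has to be told to re-read the file after each change.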
After going home last night, the script that controls work fetch reported an error with both the hosts that had the new 7.16.11 BOINC version. The particular error message I got was:
Can't get RPC password: gui_rpc_auth.cfg not found. Try reinstalling BOINC.
This was a bit of a surprise because the file has always existed and never previously caused a problem with any self-built BOINC versions. I found several reports (here and here) but these seemed more to do with the Manager not being able to connect to the client and I'd already confirmed that I had no such issue, since I'd run the manager (local and remote) before adding the machines to the auto-control group. Error messages like this get logged and usually it means that there has been a failure. In this case the message was quite bogus because, on closely inspecting the two hosts involved, both had downloaded new work at the appropriate time and both had been successfully returned to the default 0.05 days setting as per normal.
Since I use boinccmd over the LAN to control each client, I went and read the boinccmd docs again and the following quote gave me a clue:-
If you run boinccmd in the same directory as the BOINC client, you don't need to supply either a host name or a password.
I decided I'd better look at exactly how my bash script uses boinccmd to make the cache size adjustments. Firstly there is the command to adjust the value in the global_prefs_override.xml file. The secure shell (ssh) is used. $ip is the variable containing the target host's IP address and sed is the stream editor that finds the appropriate field and changes the default value (0.05) into the desired value (1.50). There is no problem with that.
ssh $ip "sed -i /min_days/s/0.05/1.50/ BOINC/global_prefs_override.xml"
Secondly, there is the command to get the client to take notice of the change. Once again ssh delivers the command which launches boinccmd from the home directory by specifying the BOINC data directory where the boinccmd app resides. The variable $passwd contains the contents of gui_rpc_auth.cfg.
ssh $ip "BOINC/boinccmd --passwd $passwd --read_global_prefs_override"
I've never had a problem with this before but was immediately struck by the above quote from the docs. So I made the following change to the command string that the secure shell launches and it has completely solved the problem. Essentially, the fix was to cd into the data directory before launching boinccmd.
ssh $ip "cd BOINC ; ./boinccmd --passwd $passwd --read_global_prefs_override"
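One plausible reading of why the cd matters is that the file lookup is relative to the current working directory rather than to where the executable lives. The following sketch shows that general effect with a stand-in function; fake_cmd is hypothetical and is not boinccmd, and the temporary directory layout is invented for the demonstration.

```shell
#!/bin/bash
# Sketch only: a stand-in for a tool that looks for its password file in
# the current working directory. fake_cmd is NOT boinccmd.
fake_cmd() {
    if [ -f gui_rpc_auth.cfg ]; then echo "found"; else echo "not found"; fi
}

# Throwaway layout: a parent directory containing a BOINC data directory.
demo=$(mktemp -d)
mkdir "$demo/BOINC"
echo "secret" > "$demo/BOINC/gui_rpc_auth.cfg"

# Run from the parent directory: the relative lookup misses the file.
from_parent=$(cd "$demo" && fake_cmd)

# Run after cd into the data directory: the same lookup now succeeds.
from_datadir=$(cd "$demo/BOINC" && fake_cmd)

echo "from parent:   $from_parent"
echo "from data dir: $from_datadir"
rm -rf "$demo"
```

If that reading is right, a small hardening of the fixed command would be to join the two steps with && instead of ; so that boinccmd is not launched from the wrong directory if the cd fails for any reason.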
I decided to post all this just in case anyone else happens to run across these error messages and might not realise that boinccmd can trigger them as well. But, as Richard mentioned in one of his posts about this that I'd read, the message was bogus for me because the operation actually succeeded, despite the error message.
Joined: 31 Dec 18
An excellent description of your set up, very interesting so thank you.
I suspect that your workaround works because you have removed the need to look up the file and that you could remove the --passwd $passwd from the command string. It would still be good to see the new feature revert to the old code and stop the problem once and for all.
Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.