Detecting situations where "Scheduler RPC deferred for xx:yy:zz" has been issued

Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89147 - Posted: 9 Dec 2018, 4:10:39 UTC

I'm currently running the Einstein project on 94 separate hosts, most of which have a range of different AMD GPUs. All hosts run Linux. First some background.

Just over half of them use client version 7.2.42 - the final version that was provided in shell archive format on this website. They do this because the OS is a mid-2016 version, current at the time the fglrx driver was being deprecated. The GPUs are all GCN 1.0 - Southern Islands series - which are not yet fully supported by the latest amdgpu open source drivers and OpenCL libs. So these hosts stay at the last working version of fglrx, which still works very satisfactorily.

The remainder are on 7.6.33. My distro of choice is PCLinuxOS and the Devs there don't (and will not) package BOINC so I built it myself by downloading the source. It was a bit of an adventure since I really had no clue about what I was doing but fortunately I was able to find all the necessary -devel libs in the repository. I haven't had any problems with 7.6.33. I have the build procedure sorted now so I'll build later versions as I need them.

This second half of the fleet is fully up to date, OS-wise. I keep a fully updated copy of the PCLOS repo on an external drive, with a script that refreshes it from one of the local PCLOS mirrors every two weeks. All these hosts run mainly Polaris GPUs, from RX 460 to RX 580. The amdgpu kernel module is under active development so I sometimes get performance improvements, and often bugfixes, by updating to the latest versions.

To manage this number of hosts, I have written quite a number of bash scripts which run on a central server machine and communicate sequentially with individual LAN hosts using ssh. One of those scripts controls work fetch and caches the associated data files that would otherwise need to be downloaded individually by each host.

To make sure individual hosts don't ask for work at random times and fetch their own copies of the same file, all hosts have a work cache setting of 0.05 days by default. Periodically (and well before any host runs out of work), a script visits each host, makes sure it has a fully updated set of data files (some get deleted when tasks finish) and only then adjusts the work cache setting, usually to 1.0 days, to trigger work fetch. If a brand new data file happens to be downloaded at this time, it is automatically detected, cached and immediately distributed to any earlier hosts in the overall sequence. On completion of work fetch, the cache setting is returned to 0.05 days until the next visit.

boinccmd is the central player these scripts use to manipulate the client in this way. I've had this working very well for quite a while now and it makes a significant saving in bandwidth.

Communications with the Einstein servers are often slow but mostly tolerable. At times, the project can become unresponsive and comms can even drop out for short periods. In periods of heavy congestion, a BOINC client can defer communications, sometimes for considerable periods. I've actually seen examples of up to 24 hour backoffs. I have a separate script that can detect these but only when the backoff has been in place for about 2 hours. I use the --get_project_status option to boinccmd and then parse the 'last RPC:' line in the output to work out how long it's been since there was a scheduler contact. If that's longer than say 2 hours, it's virtually guaranteed that a backoff is underway.
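The 2-hour heuristic can be sketched along these lines (a minimal sketch, not my actual script: the boinccmd invocation shown in the comment and the threshold are illustrative, and the 'last RPC:' field is assumed to be seconds since the epoch, as 7.2.42 prints it):

```shell
# Minimal sketch of the backoff heuristic described above.
# Assumption: the argument is the 'last RPC:' field as epoch seconds,
# extracted with something like
#   boinccmd --host "$host" --get_project_status | awk '/last RPC:/ {print $3}'

THRESHOLD=7200   # 2 hours - longer than this and a backoff is almost certain

# rpc_age EPOCH_SECONDS -> seconds since the last scheduler contact
rpc_age() {
    echo $(( $(date +%s) - ${1%.*} ))   # strip any fractional part first
}

check_backoff() {   # check_backoff LAST_RPC_EPOCH -> "probable backoff" | "ok"
    if [ "$(rpc_age "$1")" -gt "$THRESHOLD" ]; then
        echo "probable backoff"
    else
        echo "ok"
    fi
}
```

A wrapper would loop over the fleet with ssh, the same way the other scripts do.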

So now to the question. How can I find out immediately if a client has suffered a significant deferral? In playing with such a host yesterday using BOINC Manager, if I select the project and click 'properties', the properties page has an entry, "Scheduler RPC deferred for xx:yy:zz", which seems perfect. That line is missing from the page when the host is not deferred. So if boincmgr can get this info from the client, presumably boinccmd could as well, and if so it could just include an extra line of output for the --get_project_status option. I looked at all the likely options - --get_state, --get_project_status, --get_cc_status, --get_simple_gui_info - and couldn't find any sign of deferral info. Did I miss something? If not, does anyone know whether this might be pretty simple to implement as part of an existing option? Who would be the best person to ask? Thanks.
Cheers,
Gary.
ID: 89147
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89150 - Posted: 9 Dec 2018, 10:11:55 UTC - in response to Message 89147.  

Now there's a proper question to get the juices flowing!

Some background to get us started. BOINC has several components. The key one is the 'client' or 'daemon': all the really important stuff happens inside here, but you can't see it - it has no user interface. Instead, there are several alternative stand-alone interface modules which communicate with the client and with the outside world. There's BOINC Manager, of course, and boinccmd - and also independently-written tools like BoincTasks and BoincView.

All of these rely on a common communications standard called an RPC (Remote Procedure Call). The principle is that the interface tool says 'Send me all you've got on project status' (or whatever), and gets a great wodge of XML in return. Then, it's up to the interface module to pluck the interesting or important bits out of the wodge and present them to the user in its own particular way.

In development terms, this process tends to be driven by BOINC Manager. Somebody wants to display new information, so they add a display in the Manager, and perforce they have to add the underlying data to the RPC. But there's been no pressure to keep every interface tool up-to-date with new features. I suspect that we'll find that the backoff time is in the XML sent from the client to boinccmd, but the code to transfer it from XML to output simply hasn't been updated. boinccmd is pretty simple, so I'll have a go at that this afternoon: since you already have the build tools to make an operational version of boinccmd, you can update your working copy. We will have to double-check when this particular element was added to the client and Manager, in case you need to update the sender as well as the receiver.

"Who would be the best person to ask?", indeed. BOINC these days is maintained by a group of volunteer developers, and I operate on the fringes of that - I've even had a few minor patches accepted by the core group. So, adding extra code is a matter of opening an issue or pull request on Github, and seeing whether it sinks or swims.

But, I'm not a great fan of that process. It lacks any overarching design stage before we start lobbing code into the pot. There have been too many cases recently where somebody has had a great idea and coded it, only to stir up a howl of protest from people who rely on the old way of doing things - some of my work has been to integrate the best of the old and the new, and the measure of my success has been that absolutely no-one has noticed or commented - it just works the way you want it to, whichever side of the fence you're on.

I have a slight worry that this may be one of those cases. Some sysadmins may have an existing investment in scripts which rely on boinccmd outputting in a rigidly familiar format - exactly X lines in a get_project_status report, with my data on line Y, for example. It's not good practice - the XML 'search and report if found' technique is better - but it's quick. And dirty. And common. So, we need to have a meta-discussion about the design, before we start adding extra data in the public source. I can try and feed that into the group.
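The same 'search and report if found' idea applies to boinccmd's text output too - extract fields by label rather than by line position, so an added line breaks nothing (a sketch; `get_field` is a hypothetical helper, not part of boinccmd):

```shell
# Pull a labelled value out of a boinccmd report by name, not line number,
# so the check keeps working if extra lines (like a deferral) appear later.
get_field() {   # get_field LABEL   (report text on stdin)
    sed -n "s/^ *$1: *//p"
}

# usage (illustrative):
#   boinccmd --get_project_status | get_field "last RPC"
```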

Going back to your original use-case: there are many reasons for scheduler communications to be deferred. Some are dictated by the project - in Einstein's case, delay 60 seconds between any two contacts, and wait an hour if the request was during maintenance. Others are made up by the client, and are for variable times: they tend to be triggered (initially) by communications failures, but also by project task shortages ("don't bother asking for work if there wasn't any last time"). You perhaps need to investigate whether all of these trigger the 'Scheduler RPC deferred for' line in the Manager property page - project-requested backoffs certainly do, but I don't know about the internally-generated ones.

That's enough for this cup of coffee - I'll dig further into the code later.
ID: 89150
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89152 - Posted: 9 Dec 2018, 16:52:15 UTC

As always when doing this sort of work: AAAARRRRGGGGHHHH that bastard C++ language again

First, the easy bit: what you see in --get_project_status is written by

https://github.com/BOINC/boinc/blob/master/lib/gui_rpc_client_print.cpp#L81

Sure enough, "last RPC:" is printed on line 104, and there's no reference to "RPC deferred until:". There's a perfectly good variable 'min_rpc_time', and we can display that the same way as 'last_rpc_time' - that works nicely.

But both are written as absolute times since Unix year dot. What we're interested in is the difference between 'until' and now: how much longer have we got to wait?

And when is now? The client uses gstate.now, the Manager uses dtime(), and boinccmd uses ... I haven't the f'ing foggiest idea. Everything I throw at it is off by billions of seconds. But if someone could write me a nice formatted "RPC deferred for: HH:MM:SS" function using those values, it would be a piece of piss to drop it in. They all seem to be declared as doubles, but some of them might contain tm structures. When I tried using a nice simple difftime, it crashed boinccmd.
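For what it's worth, the arithmetic being asked for is just integer division and remainders - here it is in shell terms (the C++ version would use the same breakdown; the function name is made up):

```shell
# Format a count of seconds as HH:MM:SS, the way the Manager shows deferrals.
hms() {   # hms SECONDS
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}
```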

Oh, but I did find a bug. You see a blank line after the 'last RPC' line because ctime supplies its own \n. You can have that one for free.

Why is project_files_downloaded (time) always zero?

More coffee, Jeeves.
ID: 89152
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89153 - Posted: 9 Dec 2018, 16:58:31 UTC

FWIW:

   name: Einstein@Home
   master URL: http://einstein.phys.uwm.edu/
   user_name: Richard Haselgrove
   team_name: Raccoon Lovers
   resource share: 100.000000
   user_total_credit: 112876275.710468
   user_expavg_credit: 275403.015925
   host_total_credit: 12692614.418247
   host_expavg_credit: 8478.457242
   nrpc_failures: 0
   master_fetch_failures: 2
   master fetch pending: no
   scheduler RPC pending: no
   trickle upload pending: no
   attached via Account Manager: no
   ended: no
   suspended via GUI: no
   don't request more work: no
   disk usage: 0.000000
   last RPC: Sun Dec 09 16:56:10 2018
   RPC deferred until: Sun Dec 09 16:57:10 2018
   project files downloaded: 0.000000
GUI URL:
   name: Common questions
   description: Read the Einstein@Home Frequently Asked Question list
   URL: https://einsteinathome.org/faq
ID: 89153
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89155 - Posted: 10 Dec 2018, 0:43:26 UTC - in response to Message 89150.  

Now there's a proper question to get the juices flowing!

WOW!! I didn't expect that! I'll have to file that one away in the little black book for future reference :-).

Seriously though, thanks very much for the very complete summary. What you describe makes perfect sense and boinccmd being less used and 'falling behind' was something I sort of expected.

Going back to your original use-case: there are many reasons for scheduler communications to be deferred. Some are dictated by the project - in Einstein's case, delay 60 seconds between any two contacts, and wait an hour if the request was during maintenance. Others are made up by the client, and are for variable times: they tend to be triggered (initially) by communications failures, but also by project task shortages ("don't bother asking for work if there wasn't any last time"). You perhaps need to investigate whether all of these trigger the 'Scheduler RPC deferred for' line in the Manager property page - project-requested backoffs certainly do, but I don't know about the internally-generated ones.

As it so happens (and right on cue) a new 24hr backoff occurred last night - together with a couple of other things. I launch an overnight monitoring script which loops through all the hosts at hourly intervals and leaves a log of any issues detected. Here is last night's log. Two key parameters are monitored - the last RPC time interval and the CPU clock ticks used to support a GPU task. If a GPU crashes, those ticks drop to zero. This has proved to be a very reliable indicator of a crashed GPU: the host itself soldiers on but GPU tasks consume no CPU support. I use an adjustable interval (default=2s) to monitor clock ticks (100Hz clock); TPI is ticks per interval.

Basically, there were 3 issues last night: the one 24 hr backoff (items 1,2,3,5,7), a GPU crash (items 4,6,9) and a host crashing and not being contactable (item 10). The missing item 8 reflects the fact that item 9 now has a high RPC time as well as needing a reboot :-). Item 11 is an artifact that I'll get around to curing eventually. The crashed host (item 10) was rebooted when I arrived just before 6.00am and was immediately caught in the next loop, before an update had occurred. With no update I think the unix epoch time is used, so the RPC interval is rather huge - which the script dutifully detects and complains about :-). Interestingly, that crashed host had an uptime of 244 days when it went down. I guess it was getting a bit tired :-).

[gary@eros ~]$ sleep 420 ; gpu_chk -h"8 9" -tmaxa50 -s10
gpu_chk: New run started at Sun Dec  9 19:00:01 EST 2018.
Loop Item  Time      Hostname     Octet  Uptime  KDE    RPC  Status   TPI  Status
==== ==== ========  ========     =====  ======  ===    ===  ======   ===  ======================
  7.   1.  01:02:47  g3260-02    ( .96)   19.3d   v5  9461s   HIGH      7  Ticks OK
  8.   2.  02:02:51  g3260-02    ( .96)   19.4d   v5 13065s   HIGH      8  Ticks OK
  9.   3.  03:02:44  g3260-02    ( .96)   19.4d   v5 16658s   HIGH      7  Ticks OK
  9.   4.  03:02:55  g4560-03    ( .99)   38.4d   v5  2933s    OK       0  Err:  Low ticks 1st=0 2nd=0 - reboot ...
 10.   5.  04:02:48  g3260-02    ( .96)   19.5d   v5 20263s   HIGH      7  Ticks OK
 10.   6.  04:03:00  g4560-03    ( .99)   38.5d   v5  2831s    OK       0  Err:  Low ticks 1st=0 2nd=0 - reboot ...
 11.   7.  05:02:49  g3260-02    ( .96)   19.5d   v5 23863s   HIGH      7  Ticks OK
 11.   9.  05:03:01  g4560-03    ( .99)   38.5d   v5  6431s   HIGH      0  Err:  Low ticks 1st=0 2nd=0 - reboot ...
 11.  10.  05:03:42  q8400-04    ( .14)  Err: Attempt to get uptime -> ssh: connect to host 192.168.0.14 port 22: No route to host
 12.  11.  06:03:30  q8400-04    ( .14)    0.0d   v41544385810s   HIGH     37  Ticks OK
06:07:03: Run finished after 12 loops.  11 items - 4 to check and 7 for info only.
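For anyone curious, the tick sampling behind the TPI column can be sketched like this (an assumption-laden sketch, not the actual script: it reads utime+stime from /proc/PID/stat, which is what the 100Hz clock ticks count, and assumes the process name contains no spaces):

```shell
# Sample a process's accumulated CPU clock ticks (utime + stime, fields
# 14 and 15 of /proc/PID/stat) over an interval. A GPU support process
# whose ticks stay at 0 between samples indicates a crashed GPU.
ticks() {
    awk '{print $14 + $15}' "/proc/$1/stat"
}

tpi() {   # tpi PID INTERVAL_SECONDS -> ticks consumed during the interval
    t1=$(ticks "$1")
    sleep "$2"
    t2=$(ticks "$1")
    echo $(( t2 - t1 ))
}
```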

On investigating the host that had the high RPC time status, detected just after 1.00am, here is the entry in the event log that created the problem.

Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Sending scheduler request: To report completed tasks.]]>
Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Reporting 5 completed tasks]]>
Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: job cache full)]]>
Sun 09 Dec 2018 10:25:06 PM EST | Einstein@Home | <![CDATA[Scheduler request completed]]>
Sun 09 Dec 2018 10:25:06 PM EST | Einstein@Home | <![CDATA[platform 'x86_64-pc-linux-gnu' not found]]>

The host was reporting completed work at 10:25pm. Everything was normal, the report succeeded but for some unknown reason, the scheduler sent the extra message in the last line. It resulted in a 24 hour backoff.

Now that I see this, I know I've seen it previously. My assumption is that there is some sort of race condition where the scheduler responds to the client without having first been told what the currently accepted platforms are (just in case there has been a change), so it assumes that this platform is no longer relevant and tells the client to please cop a 24 hr backoff!! Yeah, good one!! I think it's just one of perhaps several 'congestion induced' server-side bugs that can bite you.

Once again, thanks for looking into this. Having responded to your first message, I'll now digest the other two :-).
Cheers,
Gary.
ID: 89155
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89156 - Posted: 10 Dec 2018, 6:16:41 UTC - in response to Message 89152.  

Oh, but I did find a bug. You see a blank line after the 'last RPC' line because ctime supplies its own \n. You can have that one for free.

I was (sort of) aware of that - at least I am now that I've gone and studied how my script produces the last RPC time in seconds. During development of the script I remember having to deal with a difference between V7.2.42 and V7.6.33 behaviour: 7.2.42 gives 'last RPC' as seconds since the epoch, whilst 7.6.33 gives a human readable date/time followed by the blank line. Since I wanted seconds from the epoch, I had to add some extra logic to detect and reconvert any human readable dates like this :-). I remember seeing the blank line but just ignored it since I was grepping for the "last RPC:" string and then extracting the time field. If that field was just digits, it was seconds from the epoch. If not, treat it as a date and convert it to seconds from the epoch.
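That dual-format handling can be sketched like this (a sketch with a hypothetical helper name; the date branch assumes GNU date's -d option):

```shell
# Normalise the 'last RPC:' field to seconds since the epoch:
# 7.2.42 already prints epoch seconds, 7.6.33 prints a ctime()-style date.
to_epoch() {
    case "$1" in
        *[!0-9.]*) date -d "$1" +%s ;;   # has non-digits: parse as a date (GNU date)
        *)         echo "${1%.*}" ;;     # digits only: already epoch seconds
    esac
}
```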

Why is project_files_downloaded (time) always zero?

I have no idea - and no real inkling as to why you might want to know that, sorry :-). Maybe it's supposed to be an indicator of how long files associated with a work fetch took to download so as to give an idea of how slow your connection is :-).

And in your final message with the FWIW example - how the hell did you get it to list that extra 'RPC deferred until' line? I understand you must have used the one minute deferral after a scheduler request/reply cycle to get something to show, but what did you use that would report that short deferral? Did you build a new test version of boinccmd or did you perhaps discover an existing later version that already shows it?

Assuming this example means that there will be a way to get this information, I guess I'll need to parse that date/time and convert it to seconds since the epoch. It will be a time in the future; if the deferral is longer than, say, 300-500 seconds, I could try to clear it with an 'update' (at the risk of producing an even larger deferral). In any case this would be very nice since an on-screen log entry would bring it to my attention in a timely manner.
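Assuming the line does come through in that form, the parsing step could look like this (a sketch: the function name is made up, the arithmetic relies on GNU date, and the boinccmd update call in the comment is illustrative):

```shell
# Seconds of deferral remaining, given the 'RPC deferred until:' date.
# A negative result means the deferral has already expired.
deferral_left() {
    echo $(( $(date -d "$1" +%s) - $(date +%s) ))
}

# usage (illustrative):
#   left=$(deferral_left "Fri Dec 14 14:00:03 2018")
#   [ "$left" -gt 300 ] && boinccmd --project "$url" update
```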

Thanks very much for making such nice progress :-).
Cheers,
Gary.
ID: 89156
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89157 - Posted: 10 Dec 2018, 12:53:56 UTC
Last modified: 10 Dec 2018, 13:28:12 UTC

Having a second bite at this. The good news is - for Einstein, I'm getting

   last RPC: Mon Dec 10 11:26:15 2018
   RPC deferred until: Mon Dec 10 13:24:26 2018

There's been no contact with the scheduler for over an hour because of

10/12/2018 12:46:26 | Einstein@Home | Sending scheduler request: Requested by user.
10/12/2018 12:46:26 | Einstein@Home | Reporting 8 completed tasks
10/12/2018 12:46:26 | Einstein@Home | Requesting new tasks for Intel GPU
10/12/2018 12:46:27 | Einstein@Home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates
so the client has invented this backoff by itself - and it shows up. That's one of my questions answered.

And I've given notice of "Proposed work on boinccmd output" in #2904
ID: 89157
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89162 - Posted: 11 Dec 2018, 12:44:26 UTC

I think this might be good enough:

name: Einstein@Home
master URL: http://einstein.phys.uwm.edu/
user_name: Richard Haselgrove
team_name: Raccoon Lovers
resource share: 100.000000
user_total_credit: 113462608.210468
user_expavg_credit: 283178.052641
host_total_credit: 12705426.918247
host_expavg_credit: 8245.675016
nrpc_failures: 0
master_fetch_failures: 2
master fetch pending: no
scheduler RPC pending: no
trickle upload pending: no
attached via Account Manager: no
ended: no
suspended via GUI: no
don't request more work: no
disk usage: 0.000000
last RPC: Tue Dec 11 12:28:17 2018
Scheduler RPC deferred for: 00:00:53
project files downloaded: 0.000000

Find the equivalent of https://github.com/BOINC/boinc/blob/master/lib/gui_rpc_client_print.cpp#L81 in your working copy of the code - that's the function void PROJECT::print() in the library file /lib/gui_rpc_client_print.cpp.

Look for the two lines

    time_t foo = (time_t)last_rpc_time;
    printf("   last RPC: %s\n", ctime(&foo));
and replace them with

    time_t foo = (time_t)last_rpc_time;
    printf("   last RPC: %s", ctime(&foo));  // ctime supplies its own \n
    if (min_rpc_time > dtime()) {
        time_t foo = (time_t)min_rpc_time - (time_t)dtime();
        printf("   Scheduler RPC deferred for: %02.0f:%02.0f:%02.0f\n", floor(foo/3600.0), floor(fmod(foo,3600.0)/60.0), fmod(foo,60.0));
    }
That's a small change to clean up the blank line, and (conditionally) show the deferral in HH:MM:SS, like BOINC Manager. Build a new boinccmd, and enjoy.

Thanks to Stack Overflow.
ID: 89162
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 89178 - Posted: 11 Dec 2018, 20:52:43 UTC - in response to Message 89152.  

Why is project_files_downloaded (time) always zero?

I don't think you know what the project files are. I feel they're the executables, library files and pictures from the project, not the work data files. So unless you just downloaded a new science application, it'll always be zero.

From gui_rpc_client.h:
double project_files_downloaded_time;
        // when the last project file download was finished
        // (i.e. the time when ALL project files were finished downloading)
And for that matter:
double last_rpc_time;
        // when the last successful scheduler RPC finished
From project.h:
double project_files_downloaded_time;
        // when last project file download finished
    void update_project_files_downloaded_time();
        // called when a project file download finishes.
        // If it's the last one, set project_files_downloaded_time to now

ID: 89178
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89186 - Posted: 11 Dec 2018, 23:03:57 UTC - in response to Message 89178.  

Could well be. I'll start backtracking to find where that number might be set - whether it's the executables or the most recent batch of datafiles.

But not tonight.
ID: 89186
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89188 - Posted: 12 Dec 2018, 10:10:19 UTC - in response to Message 89162.  

I think this might be good enough:

I'm sure it will be, thanks very much. I appreciate the effort you have put into all this.

I dusted off my notes for when I built 7.9.3 - back at the end of April. The 7.6.33 builds were done in 2017 on a different machine. I had forgotten that I had installed an RX 580 in the latest build machine in August which involved a new OS install as well. The BOINC build tree was not disturbed (different partition) so all I had to do was add all the build tools and -devel libs to the new OS install (and bring it all right up to date) - about 350 packages in total. I've now fetched the latest stuff from git and checked out the 7.14.2 version. That seems to have been successful and I've found the file you pointed to and performed the edit. That part was easy enough.

I still have the old 7.6.33 tree on the earlier machine. I could also edit the file there and then rebuild the 7.6.33 boinccmd, if I really needed to. The question I have is: seeing as I intend to build V7.14.2 anyway, could I run that boinccmd on any existing 7.6.33 machine and so save the trouble of resurrecting the old 7.6.33 build system? In thinking about that, I'm wondering what to do with all the 7.2.42 hosts that currently use fglrx. The 4.19.8 kernel, which has a nasty ext4 filesystem corruption bug fixed, is now out, and I suspect it must be getting pretty close to being able to cope with GCN 1.0 (Southern Islands) GPUs, so I might be in a position to upgrade all the 7.2.42 systems very soon anyway.

That's always been the plan. Test each kernel/amdgpu module series as they develop to see if Southern Islands GPUs will eventually work. I have a test setup on a separate hard disk. The last test used 4.18.16 and the SI GPU can be detected and BOINC reports it as usable. Einstein tasks download fine but crash immediately when attempting to run. There is something still not quite right. Perhaps 4.19.8 might finally be the one that works :-). I might have to wait for 4.20.x but that's fairly close anyway.

This is why I'm happy to build 7.14.2 and have it ready to go. I'm just thinking out loud here - I have no concerns about sticking a 7.14.2 boinccmd in a 7.2.42 host and seeing what happens. I can easily download a task or two and test on the 1min deferral. If the 7.14.2 boinccmd chokes, I'll just reverse the change. So I'm not looking for you to do any more or provide further answers unless you know for certain there would be problems with running a 7.14.2 boinccmd in the older BOINC versions. I'm quite happy to test that. If there are problems, I'll probably just continue on with the current setup and slowly upgrade all the 7.6.33 machines to 7.14.2, by which time maybe there might be a kernel/amdgpu module/OpenCL libs combination that will finally be usable with SI GPUs and allow me to upgrade the balance of the fleet.
Cheers,
Gary.
ID: 89188
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89189 - Posted: 12 Dec 2018, 10:56:15 UTC - in response to Message 89188.  

This is why I'm happy to build 7.14.2 and have it ready to go. I'm just thinking out loud here - I have no concerns about sticking a 7.14.2 boinccmd in a 7.2.42 host and seeing what happens. I can easily download a task or two and test on the 1min deferral. If the 7.14.2 boinccmd chokes, I'll just reverse the change. So I'm not looking for you to do any more or provide further answers unless you know for certain there would be problems with running a 7.14.2 boinccmd in the older BOINC versions. I'm quite happy to test that. If there are problems, I'll probably just continue on with the current setup and slowly upgrade all the 7.6.33 machines to 7.14.2, by which time maybe there might be a kernel/amdgpu module/OpenCL libs combination that will finally be usable with SI GPUs and allow me to upgrade the balance of the fleet.

There's absolutely no reason not to try. If I were you, I'd make a backup copy of the working boinccmd binary on any machine you want to test, and then replace it with the boinccmd from your newly-built (and hacked) v7.14.2.

There are only three possibilities:

1. It crashes, because of a different library dependency between the build machine and the older Linux on the test machine.
2. It runs, but produces dodgy output (either because of the older BOINC client - unlikely - or because I've written Windows-only code).
3. It works.

If (1) - revert to backup!
ID: 89189
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89198 - Posted: 14 Dec 2018, 8:08:35 UTC - in response to Message 89189.  

There's absolutely no reason not to try.

I figured that :-).

I couldn't get onto this yesterday - too much real life stuff to deal with :-). I was very happy this morning to see that my scripts reported everything OK with the fleet even after 24 hours of no oversight. Storms were predicted yesterday but didn't eventuate - they're supposed to really happen today. There's a fairly destructive cyclone in the Gulf of Carpentaria that was heading towards the Northern Territory but has turned around and is gathering steam as it approaches the western side of Cape York. It's predicted to cross to the east coast and head all the way down the east coast of Queensland, possibly ending up affecting Brisbane in about 4 days' time. Hopefully it will make a further unpredictable westward turn and give the poor farmers in western Queensland a bit of drought relief instead :-).

I built 7.14.2 this morning. It choked on the edited file, as per the following:-
gui_rpc_client_print.cpp: In member function ‘void PROJECT::print()’:
gui_rpc_client_print.cpp:106:10: warning: declaration of ‘foo’ shadows a previous local [-Wshadow]
   time_t foo = (time_t)min_rpc_time - (time_t)dtime();
          ^~~
gui_rpc_client_print.cpp:103:12: note: shadowed declaration is here
     time_t foo = (time_t)last_rpc_time;
            ^~~
gui_rpc_client_print.cpp:107:67: error: ‘floor’ was not declared in this scope
   printf("   Scheduler RPC feferred for: %02.0f:%02.0f:%02.0f\n", floor(foo/3600.0), floor(fmod(foo,3600.0)/60.0), fmod(foo,60.0));
                                                                   ^~~~~
gui_rpc_client_print.cpp:107:67: note: suggested alternative: ‘foo’
   printf("   Scheduler RPC feferred for: %02.0f:%02.0f:%02.0f\n", floor(foo/3600.0), floor(fmod(foo,3600.0)/60.0), fmod(foo,60.0));
                                                                   ^~~~~
                                                                   foo
gui_rpc_client_print.cpp:107:92: error: ‘fmod’ was not declared in this scope
 printf("   Scheduler RPC feferred for: %02.0f:%02.0f:%02.0f\n", floor(foo/3600.0), floor(fmod(foo,3600.0)/60.0), fmod(foo,60.0));
                                                                                          ^~~~
gui_rpc_client_print.cpp:107:92: note: suggested alternative: ‘feof’
 printf("   Scheduler RPC feferred for: %02.0f:%02.0f:%02.0f\n", floor(foo/3600.0), floor(fmod(foo,3600.0)/60.0), fmod(foo,60.0));
                                                                                          ^~~~
                                                                                          feof

Rather than spend time working out how to deal with that, I patched your patch to be as follows:-
    time_t foo = (time_t)last_rpc_time;
    printf("   last RPC: %s", ctime(&foo));  // ctime supplies its own \n
    if (min_rpc_time > dtime()) {
        time_t fop = (time_t)min_rpc_time;
        printf("   Scheduler RPC deferred until: %s", ctime(&fop));  // ctime supplies its own \n
    }

Essentially, seeing as I was editing anyway, I changed 'foo' to 'fop' to get rid of the warning and then used the same logic as for 'last_rpc_time' so as to produce the time (in the future) when the deferral would end, with the output string now reading " .. deferred until:" to make this clear. This works for me as the meaning is quite clear and now the whole thing builds.

I've installed the new boinccmd on a machine running 7.6.33, created a 1 min backoff by an 'update' and run the new boinccmd. Here's the output.
1) -----------
   name: Einstein@Home
   master URL: http://einstein.phys.uwm.edu/
   user_name: Gary Roberts
   team_name: Ellison Crunchers
   resource share: 900.000000
   user_total_credit: 19280502471.245922
   user_expavg_credit: 31593932.414089
   host_total_credit: 70909091.012384
   host_expavg_credit: 558756.202637
   nrpc_failures: 0
   master_fetch_failures: 0
   master fetch pending: no
   scheduler RPC pending: no
   trickle upload pending: no
   attached via Account Manager: no
   ended: no
   suspended via GUI: no
   don't request more work: no
   disk usage: 0.000000
   last RPC: Fri Dec 14 13:59:03 2018
   Scheduler RPC deferred until: Fri Dec 14 14:00:03 2018
   project files downloaded: 0.000000

Unfortunately, it doesn't work on a system running 7.2.42. This appears to be because the 7.14.2 stuff was built on an up-to-date machine with a later version of libstdc++.so.6. The specific complaints about that lib were "CXXABI_1.3.9 not found" and "GLIBCXX_3.4.21 not found". I can't upgrade the OS on those older installs as I still need to run the old fglrx driver. I imagine I could try building the 7.14.2 boinccmd on an old machine so that it would use the appropriate version of libstdc++.so.6 for the build. I might consider that since I clone my copy of the repo every 6 months, so I would have all the -devel libs as they existed in the latter part of 2016, which should therefore be compatible with what is on those older installs. It would be relatively easy to set up another build machine and try building 7.14.2 to see if there are any problems with compilation.

I'm sure you guys probably won't be all that impressed with my 'deferred until' hack but it's absolutely fine for my purposes that way. The next thing I'll try is my 'extra disk test install' to see if I can get a Southern Islands GPU working with the latest kernel/amdgpu module. If I can, I won't need to build BOINC on an old install. I'll have the much bigger job of upgrading all the 'oldies' to use my shiny new 7.14.2 stuff :-).

Once again, thanks very much for your assistance.
Cheers,
Gary.
ID: 89198
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 89199 - Posted: 14 Dec 2018, 8:51:16 UTC - in response to Message 89198.  

That's a pity - I can't submit that code as it stands, then.

The duplicate definition warning is a doddle - I can use unique names. I suspect that the Windows build system (VS2013) has warnings suppressed to avoid frightening the horses.

'Not declared in scope' is more annoying, because I lifted the code from a recognised source. It would be interesting to know if the code builds with unique names but leaving the maths alone.

The official way to fix it would be to add

#include <math.h>

in the list of includes at the top of the file.

There is an alternative time formatting function where I can print HH:MM:SS by using format %T - but that requires an intermediate buffer to hold the output, and the whole app crashed when I tried it...
ID: 89199
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89218 - Posted: 15 Dec 2018, 8:42:22 UTC - in response to Message 89199.  

That's a pity - I can't submit that code as it stands, then.

If including math.h would solve the problem, can't you just add that?

The duplicate definition warning is a doddle ....

Yes, of course, but there are zillions of warnings throughout the code when it's built on Linux, so I don't think you need to worry about this one. They certainly don't bother me.

It would be interesting to know if the code builds with unique names but leaving the maths alone.

I'm not a programmer, but why would a mere warning about a duplicate definition have anything to do with whether or not some math function can be used? Surely the warning just means that if you were to refer to that variable somewhere later on, its value wouldn't be what it was originally?

If you want me to test that, I could go back, restore your patch as delivered and test it out. However, I've moved on quite a bit :-). I decided to build 7.14.2 on one of my 'oldies' (mid-2016, still running fglrx). The build system was happy with all bar 2 of the -devel libs already installed, and those missing 2 were found in my end-of-2016 repo clone. So I backported them to where they were needed and the build was successful. I've just tested the boinccmd that was produced on an old 7.2.42 install and it gives the full output exactly as shown in my previous message. So I can now install the appropriate boinccmd on all the hosts in the fleet and easily detect any deferral that happens to be triggered. I can also modify my scripts to fire off an update for any deferral greater than, say, 30 mins. A small deferral should sort itself out; for longer deferrals, if the update still leaves the client deferred, the script can log the situation as something that needs investigating.
Cheers,
Gary.
ID: 89218
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 89289 - Posted: 19 Dec 2018, 5:15:20 UTC

I've had another 24 hr deferral earlier today. As the log snip below shows, it's the same deal as the one I documented earlier. Again, it's on another up-to-date machine running my compiled version of 7.6.33. As before, going back through the event log identified the smoking gun - another "platform .... not found".

....
Wed 19 Dec 2018 08:46:46 AM EST | Einstein@Home | <![CDATA[Starting task LATeah2002L_652.0_0_0.0_1146941_0]]>
Wed 19 Dec 2018 08:46:48 AM EST | Einstein@Home | <![CDATA[Started upload of LATeah2002L_652.0_0_0.0_1266834_0_0]]>
Wed 19 Dec 2018 08:46:48 AM EST | Einstein@Home | <![CDATA[Started upload of LATeah2002L_652.0_0_0.0_1266834_0_1]]>
Wed 19 Dec 2018 08:46:51 AM EST | Einstein@Home | <![CDATA[Finished upload of LATeah2002L_652.0_0_0.0_1266834_0_0]]>
Wed 19 Dec 2018 08:46:51 AM EST | Einstein@Home | <![CDATA[Finished upload of LATeah2002L_652.0_0_0.0_1266834_0_1]]>
Wed 19 Dec 2018 08:54:55 AM EST | Einstein@Home | <![CDATA[Sending scheduler request: To report completed tasks.]]>
Wed 19 Dec 2018 08:54:55 AM EST | Einstein@Home | <![CDATA[Reporting 8 completed tasks]]>
Wed 19 Dec 2018 08:54:55 AM EST | Einstein@Home | <![CDATA[Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: job cache full)]]>
Wed 19 Dec 2018 08:55:05 AM EST | Einstein@Home | <![CDATA[Scheduler request completed]]>
Wed 19 Dec 2018 08:55:05 AM EST | Einstein@Home | <![CDATA[platform 'x86_64-pc-linux-gnu' not found]]>
Wed 19 Dec 2018 08:57:34 AM EST | Einstein@Home | <![CDATA[General prefs: from Einstein@Home (last modified ---)]]>
Wed 19 Dec 2018 08:57:34 AM EST | Einstein@Home | <![CDATA[Computer location: school]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[General prefs: using separate prefs for school]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[Reading preferences override file]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[Preferences:]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[ max memory usage when active: 7544.65MB]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[ max memory usage when idle: 7941.73MB]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[ max disk usage: 20.00GB]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[ max CPUs used: 8]]>
Wed 19 Dec 2018 08:57:34 AM EST | | <![CDATA[ (to change preferences, visit a project web site or select Preferences in the Manager)]]>
Wed 19 Dec 2018 09:02:29 AM EST | Einstein@Home | <![CDATA[Computation for task LATeah2002L_652.0_0_0.0_1265773_0 finished]]>
....

This was found after a 2 hr delay, as before, since I haven't yet deployed any of the hacked boinccmds that could have identified the problem earlier. I don't particularly like the idea of having to document, for future reference, the installation of non-standard stuff like different versions of BOINC components. Since I have 7.14.2 built for both the mid-2016 machines and the current ones, I figure I should test the complete package I've built, just to be sure everything really does work properly, and then upgrade every single machine to the appropriate 7.14.2 build.

The amount of work needed to upgrade the fleet manually, machine by machine, has been deterring me for a while, so I've decided to create a script that will do it automatically. I've got a lot of it written already. Looping through the entire fleet is no problem, as is looping through any particular subset of machines. For each machine in the loop, all I need to do is stop the currently running version of BOINC. The old stuff being upgraded gets saved, just in case. The appropriate set of 7.14.2 files gets copied across from a file share on the server machine. The newly installed version of BOINC then gets restarted and simply resumes the crunching from where the former version left off. Now what could go wrong with that? :-)

I'd written a function just to deploy the new boinccmd but, after thinking about it for a bit, I figured I may as well extend the functionality so that all files and locations (with proper error checking and recovery) can be handled.
Cheers,
Gary.
ID: 89289

