Thread 'GPUGrid'

Message boards : Projects : GPUGrid

Zalster

Joined: 8 Aug 14
Posts: 135
United States
Message 89196 - Posted: 14 Dec 2018, 0:22:56 UTC

Looks like it's down
ID: 89196
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89197 - Posted: 14 Dec 2018, 2:44:51 UTC - in response to Message 89196.  
Last modified: 14 Dec 2018, 2:45:31 UTC

And I discovered that it is the first project to attempt communication when BOINC is started, which I think is because projects are contacted in alphabetical order. That promptly stalls any further communication for my other projects. I didn't realize this was the way BOINC worked. Until GPUGrid's connection attempt finally timed out, it held off every other project's normal connections. This let my Seti cache fall by a hundred tasks before Seti was finally given the chance to report and request work.

As far as I am concerned this is a flaw in the BOINC communications protocol. No misbehaving project should prevent any normally running project from communicating.
ID: 89197
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89200 - Posted: 14 Dec 2018, 9:16:19 UTC - in response to Message 89197.  

BOINC is not designed to handle multiple overlapping scheduler requests to different projects. Looking at my overnight logs (GPUGrid was indeed down), I see BOINC waiting 22 or 23 seconds for the reply, and trying at most two consecutive times before going into backoff:

13/12/2018 23:00:43 | GPUGRID | Sending scheduler request: Requested by project.
13/12/2018 23:01:05 | GPUGRID | Scheduler request failed: Couldn't connect to server
13/12/2018 23:01:05 | GPUGRID | Sending scheduler request: Requested by project.
13/12/2018 23:01:28 | GPUGRID | [sched_op] Deferring communication for 00:01:43
13/12/2018 23:01:28 | GPUGRID | [sched_op] Reason: Scheduler request failed
So there's a maximum delay of 45 seconds before another project gets a chance: if you can complete 100 SETI tasks in 45 seconds, you're on your own.
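
The 45-second figure can be checked directly from the timestamps in the excerpt above. This is a minimal sketch, assuming the Event Log keeps the `timestamp | project | message` layout shown and a day/month/year date; the helper names `parse_line` and `blocked_seconds` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Log lines copied from the excerpt above (dates are day/month/year).
LOG = """\
13/12/2018 23:00:43 | GPUGRID | Sending scheduler request: Requested by project.
13/12/2018 23:01:05 | GPUGRID | Scheduler request failed: Couldn't connect to server
13/12/2018 23:01:05 | GPUGRID | Sending scheduler request: Requested by project.
13/12/2018 23:01:28 | GPUGRID | [sched_op] Deferring communication for 00:01:43
"""


def parse_line(line):
    """Split one Event Log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def blocked_seconds(log_text):
    """Seconds from the first scheduler request to the backoff announcement."""
    events = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    start = next(t for t, _, msg in events
                 if msg.startswith("Sending scheduler request"))
    end = next(t for t, _, msg in events if "Deferring communication" in msg)
    return (end - start).total_seconds()


print(blocked_seconds(LOG))  # 45.0
```

Logs written with a different locale (for example the 12-hour format with a weekday that appears later in this thread) would need a different `strptime` pattern.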

This is perhaps specific to GPUGrid because it's one of the few projects which requests a scheduler contact every hour (as shown in the log above), so I don't mind discussing it in this thread for the time being. But, on the other hand, my BOINC clients run typically for a month at a time: you mention yours was starting up. Could that, perhaps, be because you had shut it down to run an unsupported external tool like a rescheduler? Was BOINC trying to do something different at startup from the normal scheduled checkin? As always, supply the full context with supporting evidence if you want to lodge a feature request for development.
ID: 89200
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89214 - Posted: 14 Dec 2018, 22:57:11 UTC

It is impossible to document the fault now that GPUGrid has returned, but that was not what I saw on my client. I had simply shut down BOINC to do some stability testing after changing from UMA to NUMA memory access modes. I had not run any rescheduler; I was simply returning to running BOINC. The GPUGrid request just kept happening over and over. Each one took something like a minute and a half to time out, and then another request started. It must have been half an hour before GPUGrid finally started incrementing its server connect backoffs, which finally let my other projects in.

As I stated, no supporting documentation available as I have stopped and restarted BOINC too many times now and don't have any Event Log entries from that day.
ID: 89214
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89215 - Posted: 14 Dec 2018, 23:14:33 UTC - in response to Message 89214.  

Not even in stdoutdae.txt?

I try not to over-fill my logs (just a couple of fairly quiet debug options), and I retain 50 MB - so my 'old' file currently takes me back to mid-July.
ID: 89215
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15565
Netherlands
Message 89224 - Posted: 15 Dec 2018, 18:31:13 UTC - in response to Message 89215.  

I use
<max_stdout_file_size>20119200</max_stdout_file_size>

Works fine. Especially since I no longer run BOINC on my PC. ;-)
ID: 89224
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89232 - Posted: 16 Dec 2018, 18:27:19 UTC
Last modified: 16 Dec 2018, 18:46:02 UTC

Richard, how do I document the project request failure for you? Right now MilkyWay has its database down, and I just had its server connect hang up the machine for over five minutes. I thought you said the server connects can only last 45 seconds before timing out. This prevented Seti from contacting the scheduler.

From the Event Log just now.

Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] Starting scheduler request
Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Sending scheduler request: To report completed tasks.
Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Reporting 10 completed tasks
Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Requesting new tasks for NVIDIA GPU
Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] NVIDIA GPU work request: 189859.02 seconds; 0.00 devices

bunch of Seti uploads

Sun 16 Dec 2018 10:09:58 AM PST | | Project communication failed: attempting access to reference site
Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | Scheduler request failed: Timeout was reached
Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | [sched_op] Deferring communication for 00:38:41
Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | [sched_op] Reason: Scheduler request failed
Sun 16 Dec 2018 10:10:00 AM PST | | Internet access OK - project servers may be temporarily down.


Seti finally gets a chance to connect

Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] Starting scheduler request
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Sending scheduler request: To fetch work.
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Reporting 36 completed tasks
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Requesting new tasks for CPU and NVIDIA GPU
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] CPU work request: 1037353.42 seconds; 0.00 devices
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] NVIDIA GPU work request: 191643.44 seconds; 0.00 devices


My stdoutdae.txt is currently 2.7MB but that only covers one day. So by the time you asked for documentation, the events had been purged from that file. I have the standard log flags plus only one other sched_op_debug so I can see how many seconds of work I request each connection. That is the only "extra" information included in the log other than the normal stuff.

[Edit] Curious as to why my stdoutdae.txt is only 2.7MB and covers only a day. I have <max_stdout_file_size>0</max_stdout_file_size> and I believe that 0 means no limit.
ID: 89232
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89233 - Posted: 16 Dec 2018, 18:47:02 UTC - in response to Message 89232.  

If you're seeing, specifically, timeouts - you could try a couple of config options:

        <dont_contact_ref_site>1</dont_contact_ref_site>
Cuts out the 'internet access' check (doesn't pester Google). I guess we tend to know whether the internet is up without BOINC telling us...

        <http_transfer_timeout>60</http_transfer_timeout>
I think this controls scheduler requests as well. Default is 300 seconds - I think 60 is plenty on a decent connection (if it ain't happened by then, it ain't going to happen).
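
For anyone who prefers to generate the file rather than edit it by hand, the two options above could be assembled like this. It is only a sketch: it assumes the usual cc_config.xml layout with an `<options>` element inside `<cc_config>`, and `build_cc_config` is a hypothetical helper, not something BOINC provides. It also builds a fresh document, so it would overwrite any log flags or other options already in an existing file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(timeout_seconds=60, skip_ref_site=True):
    """Return cc_config.xml text containing only the two options discussed above."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # 1 switches off the 'internet access' check against the reference site.
    ET.SubElement(options, "dont_contact_ref_site").text = "1" if skip_ref_site else "0"
    # Timeout in seconds; the thread reports 300 as the default.
    ET.SubElement(options, "http_transfer_timeout").text = str(timeout_seconds)
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(60))
```

The output is a single unindented line, which is still valid XML; merge the two elements into your own `<options>` section rather than replacing the whole file.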

But with GPUGrid, I was getting a more proactive 'Couldn't connect to server' before the timeout. I'll compare your dropbox cc_config.xml with mine more thoroughly after dinner.
ID: 89233
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89236 - Posted: 16 Dec 2018, 19:37:16 UTC - in response to Message 89233.  

If you're seeing, specifically, timeouts - you could try a couple of config options:

        <dont_contact_ref_site>1</dont_contact_ref_site>
Cuts out the 'internet access' check (doesn't pester Google). I guess we tend to know whether the internet is up without BOINC telling us...

        <http_transfer_timeout>60</http_transfer_timeout>
I think this controls scheduler requests as well. Default is 300 seconds - I think 60 is plenty on a decent connection (if it ain't happened by then, it ain't going to happen).

But with GPUGrid, I was getting a more proactive 'Couldn't connect to server' before the timeout. I'll compare your dropbox cc_config.xml with mine more thoroughly after dinner.

Just looked, and yes, <http_transfer_timeout>300</http_transfer_timeout> is the default, i.e. 5 minutes, so that explains the long timeout. I agree: if it doesn't happen in 60 seconds, it ain't going to happen. I see that I will have to change the default on all my hosts.
ID: 89236
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89243 - Posted: 16 Dec 2018, 22:11:18 UTC - in response to Message 89233.  

I'll compare your dropbox cc_config.xml with mine more thoroughly after dinner.
Can't see any significant differences - not at this time of night, anyway.
ID: 89243
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15565
Netherlands
Message 89244 - Posted: 16 Dec 2018, 22:54:21 UTC - in response to Message 89232.  

[Edit] Curious as to why my stdoutdae.txt is only 2.7MB and covers only a day. I have <max_stdout_file_size>0</max_stdout_file_size> and I believe that 0 means no limit.
No, zero here means that the line in cc_config.xml isn't used and BOINC's default maximum applies instead, which is 2048 KB.
ID: 89244
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89245 - Posted: 17 Dec 2018, 1:01:33 UTC - in response to Message 89244.  

Thanks for the clarification Jord. I guess I should increase from the BOINC default. Looks like docs say the value is in bytes. Looks like there isn't any consistency in what 0 means. For example the docs say 0 means no limit for the Event Log lines.
<max_event_log_lines>N</max_event_log_lines>
Maximum number of lines to display in BOINC Manager's Event Log window (default 2000, 0 means no limit).
ID: 89245
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89249 - Posted: 17 Dec 2018, 10:45:01 UTC - in response to Message 89245.  

In general, if something is written in the docs, it can be trusted. Occasionally, things get out of whack (something gets changed in code without updating the docs, for example): if something like that comes to our attention, either Jord or I will try to correct it.

If nothing is written in the docs, it is not safe to make any assumptions: consistency is not guaranteed, however desirable. I've had a search through the code, and I can't find anywhere where the specific case of max_stdout_file_size=0 is handled. My best guess is that a new file would be created every time BOINC is restarted, and just one single run would be kept as the 'old' file. That's probably not what you wanted.
ID: 89249
MarkJ
Volunteer tester
Help desk expert

Joined: 5 Mar 08
Posts: 272
Australia
Message 89252 - Posted: 17 Dec 2018, 13:22:49 UTC

I have seen situations where the project comms time out after 5 minutes, which is the default setting unless changed via cc_config. It's possible that it would make a first attempt, time out after 5 minutes, and then have another go before going into project backoff. That would mean at least 10 minutes plus the default backoff interval before it goes into project backoff.
MarkJ
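
The arithmetic behind that estimate, as a small sketch. It assumes two consecutive attempts before backoff (as in the GPUGrid log earlier in the thread) and that each attempt runs for the full timeout; `worst_case_hold` is a made-up name for illustration, not a BOINC function.

```python
def worst_case_hold(timeout_seconds, attempts=2):
    """Upper bound, in seconds, on how long one unreachable project holds up the others."""
    return timeout_seconds * attempts


# 300 is the default timeout reported in the thread; 60 is the suggested value.
for timeout in (300, 60):
    held = worst_case_hold(timeout)
    print(f"timeout {timeout:>3} s -> up to {held} s ({held / 60:.0f} min)")
```

With the default the bound is 600 seconds (10 minutes); with a 60-second timeout it drops to 120 seconds.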
ID: 89252
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15565
Netherlands
Message 89256 - Posted: 17 Dec 2018, 13:50:14 UTC - in response to Message 89245.  

In the Options section it says "(default values will be used for any options not specified)". I concur that could be said better, but I'll have to figure out how.
ID: 89256
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89259 - Posted: 17 Dec 2018, 18:06:32 UTC - in response to Message 89249.  

In general, if something is written in the docs, it can be trusted. Occasionally, things get out of whack (something gets changed in code without updating the docs, for example): if something like that comes to our attention, either Jord or I will try to correct it.

If nothing is written in the docs, it is not safe to make any assumptions: consistency is not guaranteed, however desirable. I've had a search through the code, and I can't find anywhere where the specific case of max_stdout_file_size=0 is handled. My best guess is that a new file would be created every time BOINC is restarted, and just one single run would be kept as the 'old' file. That's probably not what you wanted.


But -

<max_stderr_file_size>0</max_stderr_file_size>
<max_stdout_file_size>0</max_stdout_file_size>

appear to be the default values written into cc_config.xml any time it gets fully populated. I certainly did not change these parameters on any of my machines. I never even bothered to look at that parameter until you mentioned it. So I assume the developers intend the file to be recreated every time BOINC is started. They only need to be changed if you are aware of these parameters and their defaults and you intend to use the files for history and troubleshooting.
ID: 89259
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89260 - Posted: 17 Dec 2018, 18:10:48 UTC - in response to Message 89252.  

I have seen situations where the project comms time out after 5 minutes, which is the default setting unless changed via cc_config. It's possible that it would make a first attempt, time out after 5 minutes, and then have another go before going into project backoff. That would mean at least 10 minutes plus the default backoff interval before it goes into project backoff.

Which was exactly the situation I faced with the GPUGrid project being down. It held off Seti connects for at least ten minutes; I'd say it was a lot longer than that. Don't have any proof though. All I know is that even in ten minutes I can process and report dozens of tasks that aren't being replaced because the scheduler connection is being thwarted by another project's connection attempts.
ID: 89260
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 89261 - Posted: 17 Dec 2018, 19:20:49 UTC - in response to Message 89259.  

<max_stderr_file_size>0</max_stderr_file_size>
<max_stdout_file_size>0</max_stdout_file_size>

appear to be the default values written into cc_config.xml any time it gets fully populated.
Yes, confirmed. It all seems to happen in lib/diagnostics.cpp: I see

static double      stderr_file_size = 0;
static double      max_stderr_file_size = 2048*1024;
static double      stdout_file_size = 0;
static double      max_stdout_file_size = 2048*1024;
but I had real difficulty following how those variables are set or used this morning.
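
Since the option is given in bytes, a quick conversion helps when comparing the values mentioned in this thread. This sketch only does unit arithmetic and says nothing about how diagnostics.cpp actually uses the variables; `to_mib` and `mib_to_bytes` are hypothetical helper names, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
BYTES_PER_MIB = 1024 * 1024


def to_mib(size_bytes):
    """Convert a byte count to mebibytes."""
    return size_bytes / BYTES_PER_MIB


def mib_to_bytes(size_mib):
    """Convert mebibytes to the byte count a max_stdout_file_size entry would need."""
    return int(size_mib * BYTES_PER_MIB)


print(to_mib(2048 * 1024))          # the initial value in the snippet above
print(round(to_mib(20119200), 2))   # the value quoted earlier in the thread
print(mib_to_bytes(50))             # bytes for a 50 MiB limit (assumed unit)
```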

We ought to move this conversation out of the projects area if it's going to continue.
ID: 89261
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 890
United States
Message 89263 - Posted: 18 Dec 2018, 4:09:18 UTC - in response to Message 89261.  

Don't think it is necessary to pursue this any further in this thread. I understand where and why I need to change the defaults to enable troubleshooting. Both projects are connecting now. I have changed the connect interval timeout to 60 seconds so shouldn't have any further issues.
ID: 89263

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.