Message boards : Projects : GPUGrid
Author | Message |
---|---|
Send message Joined: 8 Aug 14 Posts: 135 |
Looks like it's down |
Send message Joined: 17 Nov 16 Posts: 890 |
And I discovered that it is the first project to attempt communication when BOINC is started, which I think is because the order is alphabetical. That promptly stalls any further communication for my other projects. I didn't realize this was the way BOINC worked. Until GPUGrid's attempt to connect finally timed out, it held off every other project's normal connections. This caused my Seti cache to fall by a hundred tasks until it was finally given the chance to report and request work. As far as I am concerned, this is a flaw in the BOINC communications protocol. No misbehaving project should prevent any normally running project from communicating. |
Send message Joined: 5 Oct 06 Posts: 5129 |
BOINC is not designed to handle multiple overlapping scheduler requests to different projects. Looking at my overnight logs (GPUGrid was indeed down), I see BOINC waiting 22 or 23 seconds for the reply, and trying at most two consecutive times before going into backoff: 13/12/2018 23:00:43 | GPUGRID | Sending scheduler request: Requested by project. 13/12/2018 23:01:05 | GPUGRID | Scheduler request failed: Couldn't connect to server 13/12/2018 23:01:05 | GPUGRID | Sending scheduler request: Requested by project. 13/12/2018 23:01:28 | GPUGRID | [sched_op] Deferring communication for 00:01:43 13/12/2018 23:01:28 | GPUGRID | [sched_op] Reason: Scheduler request failed. So there's a maximum delay of 45 seconds before another project gets a chance: if you can complete 100 SETI tasks in 45 seconds, you're on your own. This is perhaps specific to GPUGrid because it's one of the few projects which requests a scheduler contact every hour (as shown in the log above), so I don't mind discussing it in this thread for the time being. But, on the other hand, my BOINC clients run typically for a month at a time: you mention yours was starting up. Could that, perhaps, be because you had shut it down to run an unsupported external tool like a rescheduler? Was BOINC trying to do something different at startup from the normal scheduled checkin? As always, supply the full context with supporting evidence if you want to lodge a feature request for development. |
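The 45-second figure can be checked against the timestamps in the log excerpt above. This is a minimal Python sketch, assuming the day/month/year timestamp format shown in those log lines; the helper name `elapsed_seconds` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

# Timestamps copied from the GPUGRID log excerpt above (day/month/year).
FIRST_REQUEST = "13/12/2018 23:00:43"
FIRST_FAILURE = "13/12/2018 23:01:05"  # the second request starts at the same second
BACKOFF_START = "13/12/2018 23:01:28"


def elapsed_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two Event Log timestamps."""
    fmt = "%d/%m/%Y %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


first_attempt = elapsed_seconds(FIRST_REQUEST, FIRST_FAILURE)
second_attempt = elapsed_seconds(FIRST_FAILURE, BACKOFF_START)
total_delay = first_attempt + second_attempt

print(first_attempt, second_attempt, total_delay)
```

The two attempts together give the time other projects had to wait in that particular log; it says nothing about the case where a request runs all the way to the transfer timeout.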
Send message Joined: 17 Nov 16 Posts: 890 |
Impossible to document the fault now that GPUGrid has returned. But that was not what I saw on my client. I had simply shut down BOINC to do some stability testing after changing from UMA to NUMA memory access modes. I had not run any rescheduler. I was simply returning to running BOINC. The GPUGrid request just kept happening over and over. Each request took something like a minute and a half to time out, and then another one started. Over and over again. It must have been half an hour before GPUGrid finally started incrementing its server connect backoffs, which finally let my other projects in. As I stated, no supporting documentation is available, as I have stopped and restarted BOINC too many times now and don't have any Event Log entries from that day. |
Send message Joined: 5 Oct 06 Posts: 5129 |
Not even in stdoutdae.txt? I try not to over-fill my logs (just a couple of fairly quiet debug options), and I retain 50 MB - so my 'old' file currently takes me back to mid-July. |
Send message Joined: 29 Aug 05 Posts: 15565 |
I use <max_stdout_file_size>20119200</max_stdout_file_size>. Works fine. Especially since I no longer run BOINC on my PC. ;-) |
Send message Joined: 17 Nov 16 Posts: 890 |
Richard, how do I document the project request failure for you? Right now MilkyWay has the database down and I just had its server connect hang up the machine for over five minutes. I thought you said the server connects can only last 45 seconds before timing out. This prevented Seti from contacting the scheduler. From the Event Log just now: Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] Starting scheduler request Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Sending scheduler request: To report completed tasks. Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Reporting 10 completed tasks Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | Requesting new tasks for NVIDIA GPU Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices Sun 16 Dec 2018 10:04:49 AM PST | Milkyway@Home | [sched_op] NVIDIA GPU work request: 189859.02 seconds; 0.00 devices [bunch of Seti uploads] Sun 16 Dec 2018 10:09:58 AM PST | | Project communication failed: attempting access to reference site Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | Scheduler request failed: Timeout was reached Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | [sched_op] Deferring communication for 00:38:41 Sun 16 Dec 2018 10:09:58 AM PST | Milkyway@Home | [sched_op] Reason: Scheduler request failed Sun 16 Dec 2018 10:10:00 AM PST | | Internet access OK - project servers may be temporarily down. [Seti finally gets a chance to connect] Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] Starting scheduler request Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Sending scheduler request: To fetch work.
Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Reporting 36 completed tasks Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | Requesting new tasks for CPU and NVIDIA GPU Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] CPU work request: 1037353.42 seconds; 0.00 devices Sun 16 Dec 2018 10:10:03 AM PST | SETI@home | [sched_op] NVIDIA GPU work request: 191643.44 seconds; 0.00 devices My stdoutdae.txt is currently 2.7MB but that only covers one day. So by the time you asked for documentation, the events had been purged from that file. I have the standard log flags plus only one other sched_op_debug so I can see how many seconds of work I request each connection. That is the only "extra" information included in the log other than the normal stuff. [Edit] Curious as to why my stdoutdae.txt is only 2.7MB and covers only a day. I have <max_stdout_file_size>0</max_stdout_file_size> and I believe that 0 means no limit. |
Send message Joined: 5 Oct 06 Posts: 5129 |
If you're seeing, specifically, timeouts - you could try a couple of config options: <dont_contact_ref_site>1</dont_contact_ref_site> - Cuts out the 'internet access' check (doesn't pester Google). I guess we tend to know whether the internet is up without BOINC telling us... <http_transfer_timeout>60</http_transfer_timeout> - I think this controls scheduler requests as well. Default is 300 seconds - I think 60 is plenty on a decent connection (if it ain't happened by then, it ain't going to happen). But with GPUGrid, I was getting a more proactive 'Couldn't connect to server' before the timeout. I'll compare your dropbox cc_config.xml with mine more thoroughly after dinner. |
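For anyone wanting to try the two options suggested above, here is a hedged Python sketch that produces a cc_config.xml fragment containing them. The `<cc_config>` root and `<options>` wrapper are an assumption about the usual file layout, and `build_cc_config` is a hypothetical helper, not a BOINC tool; only the two option names and values come from the post above.

```python
import xml.etree.ElementTree as ET


def build_cc_config(timeout_seconds: int = 60, skip_ref_site: bool = True) -> str:
    """Build a minimal cc_config.xml document holding the two options above."""
    root = ET.Element("cc_config")            # assumed root element
    options = ET.SubElement(root, "options")  # assumed options section
    ET.SubElement(options, "dont_contact_ref_site").text = "1" if skip_ref_site else "0"
    ET.SubElement(options, "http_transfer_timeout").text = str(timeout_seconds)
    if hasattr(ET, "indent"):                 # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config()
print(config_text)
```

On a real host it is probably safer to merge these two elements into the existing cc_config.xml by hand rather than overwrite the file, since it may already hold log flags and other options.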
Send message Joined: 17 Nov 16 Posts: 890 |
If you're seeing, specifically, timeouts - you could try a couple of config options: Just looked, and yes, <http_transfer_timeout>300</http_transfer_timeout> is the default, which is 5 minutes, so that explains the long timeout. I agree: if it doesn't happen in 60 seconds, it ain't going to happen. I see that I will have to change the default on all my hosts. |
Send message Joined: 5 Oct 06 Posts: 5129 |
I'll compare your dropbox cc_config.xml with mine more thoroughly after dinner. Can't see any significant differences - not at this time of night, anyway. |
Send message Joined: 29 Aug 05 Posts: 15565 |
[Edit] Curious as to why my stdoutdae.txt is only 2.7MB and covers only a day. I have <code>max_stdout_file_size</code> set to zero and I believe that 0 means no limit. No, zero here means that the line isn't used in cc_config.xml and that BOINC's default maximum is used instead, which is 2048KB. |
Send message Joined: 17 Nov 16 Posts: 890 |
Thanks for the clarification Jord. I guess I should increase from the BOINC default. Looks like docs say the value is in bytes. Looks like there isn't any consistency in what 0 means. For example the docs say 0 means no limit for the Event Log lines. <max_event_log_lines>N</max_event_log_lines> Maximum number of lines to display in BOINC Manager's Event Log window (default 2000, 0 means no limit). |
Send message Joined: 5 Oct 06 Posts: 5129 |
In general, if something is written in the docs, it can be trusted. Occasionally, things get out of whack (something gets changed in code without updating the docs, for example): if something like that comes to our attention, either Jord or I will try to correct it. If nothing is written in the docs, it is not safe to make any assumptions: consistency is not guaranteed, however desirable. I've had a search through the code, and I can't find anywhere where the specific case of max_stdout_file_size=0 is handled. My best guess is that a new file would be created every time BOINC is restarted, and just one single run would be kept as the 'old' file. That's probably not what you wanted. |
Send message Joined: 5 Mar 08 Posts: 272 |
I have seen situations where the project comms will time out after 5 minutes, which is the default setting unless changed via cc_config. It's possible that it would make a first attempt, time out after 5 minutes, and then have another go before going into project backoff. That would mean at least 10 minutes plus the default backoff interval before it goes into project backoff. MarkJ |
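The arithmetic in the post above can be written out as a small Python sketch. It assumes, as an illustration only, that scheduler requests are strictly sequential, that each of two consecutive attempts runs to the full timeout, and that nothing else is waited for in between; `worst_case_stall` is a hypothetical name.

```python
def worst_case_stall(timeout_seconds: int, attempts: int = 2) -> int:
    """Seconds other projects wait if every attempt runs to the full timeout."""
    return timeout_seconds * attempts


default_stall = worst_case_stall(300)  # default http_transfer_timeout
tuned_stall = worst_case_stall(60)     # the value suggested earlier in the thread

print(default_stall / 60, "minutes with the default timeout")
print(tuned_stall / 60, "minutes with a 60-second timeout")
```

Under those assumptions, lowering the timeout from 300 to 60 seconds shortens the stall from ten minutes to two.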
Send message Joined: 29 Aug 05 Posts: 15565 |
At the Options section it says "(default values will be used for any options not specified)". I concur that could be said better, but I'll have to figure out how. |
Send message Joined: 17 Nov 16 Posts: 890 |
In general, if something is written in the docs, it can be trusted. Occasionally, things get out of whack (something gets changed in code without updating the docs, for example): if something like that comes to our attention, either Jord or I will try to correct it. But - <max_stderr_file_size>0</max_stderr_file_size> <max_stdout_file_size>0</max_stdout_file_size> appear to be the default values written into cc_config.xml any time it gets fully populated. I certainly did not make any change to these parameters on any of my machines. I never even bothered to look at that parameter until you mentioned it. So I assume the developers intend the file to be recreated every time BOINC is started. Only if you are aware of these parameters and their defaults and you intend to use the files for history and troubleshooting does it need to get changed. |
Send message Joined: 17 Nov 16 Posts: 890 |
I have seen situations where the project comms will time out after 5 minutes, which is the default setting unless changed via cc_config. It's possible that it would make a first attempt, time out after 5 minutes, and then have another go before going into project backoff. That would mean at least 10 minutes plus the default backoff interval before it goes into project backoff. Which was exactly the situation I faced with the GPUGrid project being down. It held off Seti connects for at least ten minutes; I'd say it was a lot longer than that. Don't have any proof though. All I know is that even in ten minutes I can process and report dozens of tasks that aren't being replaced because the scheduler connection is being thwarted by another project's connection attempts. |
Send message Joined: 5 Oct 06 Posts: 5129 |
<max_stderr_file_size>0</max_stderr_file_size> Yes, confirmed. It all seems to happen in lib/diagnostics.cpp: I see static double stderr_file_size = 0; static double max_stderr_file_size = 2048*1024; static double stdout_file_size = 0; static double max_stdout_file_size = 2048*1024; but I had real difficulty following how those variables are set or used this morning. We ought to move this conversation out of the projects area if it's going to continue. |
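To pick a larger log limit, it helps to see the sizes mentioned in this thread side by side. The Python sketch below assumes the option value is in bytes (as the docs are reported to say earlier in the thread) and treats "MB" as 1024*1024 bytes, which is also an assumption; the helper names are made up.

```python
# Default maximum quoted above from lib/diagnostics.cpp.
DEFAULT_MAX_BYTES = 2048 * 1024

BYTES_PER_MIB = 1024 * 1024


def mib_to_bytes(mib: float) -> int:
    """Convert a size in MiB to a byte count for max_stdout_file_size."""
    return int(mib * BYTES_PER_MIB)


def bytes_to_mib(size_bytes: int) -> float:
    """Convert a byte count back to MiB for easier reading."""
    return size_bytes / BYTES_PER_MIB


default_mib = bytes_to_mib(DEFAULT_MAX_BYTES)  # the built-in default
fifty_mib = mib_to_bytes(50)                   # the 50 MB retention mentioned earlier
jord_mib = bytes_to_mib(20119200)              # the value posted earlier in the thread

print(default_mib, fifty_mib, round(jord_mib, 1))
```

A log that fills roughly 2.7MB per day, as reported above, would cover a couple of weeks or more at the 50 MB setting.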
Send message Joined: 17 Nov 16 Posts: 890 |
Don't think it is necessary to pursue this any further in this thread. I understand where and why I need to change the defaults to enable troubleshooting. Both projects are connecting now. I have changed the connect interval timeout to 60 seconds so shouldn't have any further issues. |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.