Not getting new WU's for CPU projects after upgrade to 6.6.20

Chimmy

Joined: 27 Nov 08
Posts: 3
United States
Message 24359 - Posted: 18 Apr 2009, 1:39:24 UTC

Ever since I upgraded to 6.6.20, I have to suspend all of the other projects to get one project to download new tasks. I have 6 projects: 2 GPU, 4 CPU. The GPU ones run fine, but eventually the CPU tasks finish up and my CPUs are all idle. When in this state, I constantly get the following for all of the CPU projects:
<project name> Sending scheduler request: To fetch work.
<project name> Requesting new tasks
<project name> Scheduler request completed: got 0 new tasks

Then I'll pause all other projects except one and it will pull new WU's for that project.

I've tried resetting each project but still run into the exact same thing after the initial download of WUs.

I'm on a dual-proc, dual-core AMD running WinXP x64, with tons of free memory/disk/etc.

Any ideas? Need more info, let me know.

Thanks,

Jim
ZPM
Joined: 14 Mar 09
Posts: 215
United States
Message 24362 - Posted: 18 Apr 2009, 2:57:58 UTC - in response to Message 24359.  
Last modified: 18 Apr 2009, 3:02:29 UTC

Yours is doing what mine did; give it a day or so. Eventually you'll get a bunch from each.

If you haven't already, try restarting the PC. Let BOINC load itself if you have it set to, and let it fetch work by itself. I've learned that when I press Update all the time, the servers like to ignore my requests... machines are alive...

Oh yeah, you could try detaching and reattaching the projects. Just make sure you have no work units in progress or waiting to report. Name your projects... some are busy doing stuff right now, and work is sporadic.
Aurora Borealis
Joined: 8 Jan 06
Posts: 448
Canada
Message 24373 - Posted: 18 Apr 2009, 4:05:11 UTC
Last modified: 18 Apr 2009, 4:13:02 UTC

Forcing downloads from individual projects is rarely a good idea; it usually only makes matters worse. If a project is not downloading work, it is usually because it has already received more processing time than you allocated in your resource share, and BOINC needs to let the other projects catch up. By forcing a download, the project just gets deeper into debt to the other projects, and it will take longer next time for BOINC to automatically fetch from that project.

I have 10 active projects, about half with a low resource share. Despite this, 8 projects currently have WUs on my system. I let BOINC do its thing and it maintains a good variety of work all the time. From time to time I won't get work from a certain project for a week, especially the low-resource ones which have been overworked, but that is to be expected.

Resource share works over the long term. It is normal that in the short term some projects get more work; otherwise BOINC wouldn't be able to maintain the resource shares you allocated.
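The long-term debt bookkeeping described above can be sketched in a few lines. This is an illustrative simplification of the idea, not BOINC's actual C++ scheduler; all names and numbers here are invented:

```python
# Toy model of BOINC-style long-term debt accounting (illustrative
# only; the real client does this in C++ with far more bookkeeping).

def update_debts(projects, elapsed):
    """Credit every project its resource-share slice of the elapsed
    time, then debit it for the CPU time it actually received.
    Positive debt means the project is owed work; negative means it
    has been overworked."""
    total_share = sum(p["share"] for p in projects)
    for p in projects:
        owed = elapsed * p["share"] / total_share
        p["debt"] += owed - p["cpu_time_used"]
        p["cpu_time_used"] = 0.0

def pick_fetch_project(projects):
    """Prefer fetching from the fetchable project with the highest debt."""
    fetchable = [p for p in projects if p["fetchable"]]
    return max(fetchable, key=lambda p: p["debt"], default=None)

projects = [
    {"name": "A", "share": 100, "debt": 0.0, "cpu_time_used": 3600.0, "fetchable": True},
    {"name": "B", "share": 100, "debt": 0.0, "cpu_time_used": 0.0, "fetchable": True},
]
update_debts(projects, elapsed=3600.0)
print(pick_fetch_project(projects)["name"])  # B: it got no CPU time, so it is owed work
```

Forcing a download from A anyway just drives A's debt further negative, which is exactly why it then takes even longer before the client fetches from A on its own.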

Boinc V 7.4.36
Win7 i5 3.33G 4GB NVidia 470
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4324
United Kingdom
Message 24379 - Posted: 18 Apr 2009, 11:14:09 UTC

There's at least one bug still in work fetch which can look like the problem reported here. I've reported it as a v6.6.23 bug, but the relevant section of code hasn't changed since v6.6.20.

With <work_fetch_debug> on, it looks like this:

17/04/2009 11:21:51	Einstein@Home	[wfd] request: CPU (0.00 sec, 0) CUDA (0.00 sec, 0)
17/04/2009 11:21:51	Einstein@Home	Sending scheduler request: To fetch work.
17/04/2009 11:21:51	Einstein@Home	Requesting new tasks
17/04/2009 11:21:56	Einstein@Home	Scheduler request completed: got 0 new tasks

In that particular case, the host actually had enough CPU work (no CPU shortfall), but wanted more CUDA work. Somehow, this got translated into a CPU request for Einstein - the CPU shortfall figure was copied to the work request, and you see the result above.

I'm worried about this bug, on a number of levels.

1) Without [wfd], it's extremely confusing for users and project helpdesk volunteers. "Requested work, got nothing" sounds like a project server problem: but "Requested 0.00 seconds of work" - while, in computer geek terms, still technically a request - translates in human terms into "Didn't request work", which takes you down a different trouble-shooting path.

2) Getting no work (because you didn't ask for any) doesn't change the machine state, which means that it's going to make the same request the next time the work fetch algorithm is run. So that's an awful lot of unnecessary scheduler contacts. Einstein - I believe - still has 'resend lost results' enabled, so each scheduler RPC is going to entail a LOT of server work. Could this explain why Einstein - and seemingly a lot of other projects - have been "experiencing load problems on the database server" (front page) recently?

3) We now have a client which instigates a per-resource backoff (not visible in the message log without [wfd] - you can look on the new 'properties' page for the project) when a work request results in no tasks being issued, without considering the reason. So these "got 0 new tasks" replies will back off CPU work requests. That goes some way towards mitigating the server workload, but introduces another problem: eventually, the point will come where the project really does need new work, but by then it could be backed off by many hours. In effect, we have "the client which cries wolf" and ends up starving itself.
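The starvation scenario in (3) is easy to model. Here's a toy simulation of a client that doubles a per-resource backoff (up to a cap) every time a request comes back with zero tasks; the base and cap constants are assumptions for illustration, not the exact values the v6.6 client uses:

```python
def backoff_after(empty_replies, base=60.0, cap=86400.0):
    """Per-resource backoff after N consecutive 'got 0 new tasks'
    replies, doubling from `base` up to `cap` (assumed schedule)."""
    backoff = 0.0
    for _ in range(empty_replies):
        backoff = min(cap, max(base, backoff * 2))
    return backoff

# A dozen content-free requests and the resource is backed off a full
# day, so a host that genuinely runs dry waits that long before asking.
for n in (1, 4, 8, 12):
    print(n, backoff_after(n))
```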

And this is now the recommended download. I hope we can identify and cure the remaining bugs, and replace v6.6.20 with v6.6.24 or whatever as soon as possible.
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14441
Netherlands
Message 24380 - Posted: 18 Apr 2009, 11:52:54 UTC - in response to Message 24379.  

but "Requested 0.00 seconds of work" - while, in computer geek terms, still technically a request - translates in human terms into "Didn't request work", which takes you down a different trouble-shooting path.

This may also be the automated "call back home to update statistics/preferences" thingy, which hasn't been documented much but has been available for a long time.
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4324
United Kingdom
Message 24382 - Posted: 18 Apr 2009, 12:07:35 UTC - in response to Message 24380.  

but "Requested 0.00 seconds of work" - while, in computer geek terms, still technically a request - translates in human terms into "Didn't request work", which takes you down a different trouble-shooting path.

This may also be the automated "call back home to update statistics/preferences" thingy, which hasn't been documented much but has been available for a long time.

It could have been, but in this case it wasn't.

17/04/2009 11:21:51		[work_fetch_debug] Request work fetch: Backoff ended for Einstein@Home
17/04/2009 11:21:51	Einstein@Home	chosen: CPU minor shortfall
17/04/2009 11:21:51		[wfd] ------- start work fetch state -------
17/04/2009 11:21:51		[wfd] target work buffer: 173664.00 sec
17/04/2009 11:21:51		[wfd] CPU: shortfall 0.00 nidle 0.00 est. delay 0.00 RS fetchable 100.00 runnable 500.00
17/04/2009 11:21:51	climateprediction.net	[wfd] CPU: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	CPDN Beta	[wfd] CPU: fetch share 0.00 debt 1267472.66 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	Einstein@Home	[wfd] CPU: fetch share 1.00 debt 0.00 backoff dt 0.00 int 120.00
17/04/2009 11:21:51	lhcathome	[wfd] CPU: fetch share 0.00 debt 0.00 backoff dt 5213.54 int 86400.00
17/04/2009 11:21:51	orbit@home	[wfd] CPU: fetch share 0.00 debt 72954.31 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	SETI@home	[wfd] CPU: fetch share 0.00 debt -284.69 backoff dt 1373.18 int 15360.00 (comm deferred)
17/04/2009 11:21:51	SETI@home Beta Test	[wfd] CPU: fetch share 0.00 debt -596125.25 backoff dt 0.00 int 0.00 (no new tasks) (overworked)
17/04/2009 11:21:51		[wfd] CUDA: shortfall 21420.16 nidle 0.00 est. delay 0.00 RS fetchable 0.00 runnable 300.00
17/04/2009 11:21:51	climateprediction.net	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 86400.00 (no new tasks)
17/04/2009 11:21:51	CPDN Beta	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	Einstein@Home	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 48874.10 int 86400.00
17/04/2009 11:21:51	lhcathome	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 61413.46 int 86400.00
17/04/2009 11:21:51	orbit@home	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	SETI@home	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (comm deferred)
17/04/2009 11:21:51	SETI@home Beta Test	[wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (no new tasks)
17/04/2009 11:21:51	climateprediction.net	[wfd] overall_debt 0
17/04/2009 11:21:51	CPDN Beta	[wfd] overall_debt 1267473
17/04/2009 11:21:51	Einstein@Home	[wfd] overall_debt 0
17/04/2009 11:21:51	lhcathome	[wfd] overall_debt 0
17/04/2009 11:21:51	orbit@home	[wfd] overall_debt 72954
17/04/2009 11:21:51	SETI@home	[wfd] overall_debt -285
17/04/2009 11:21:51	SETI@home Beta Test	[wfd] overall_debt -596125
17/04/2009 11:21:51		[wfd] ------- end work fetch state -------
17/04/2009 11:21:51	Einstein@Home	[wfd] request: CPU (0.00 sec, 0) CUDA (0.00 sec, 0)
17/04/2009 11:21:51	Einstein@Home	Sending scheduler request: To fetch work.
17/04/2009 11:21:51	Einstein@Home	Requesting new tasks
17/04/2009 11:21:56	Einstein@Home	Scheduler request completed: got 0 new tasks

Note that in this case, because of a combination of backoffs, NNT and 'comm deferred' (SETI was down at the time), the Einstein/CPU resource was the only combination in 'fetchable' state. So it went ahead with the fetch, even though it was fetching nothing.

And also note that the Einstein/CPU resource has just ended a 120 second backoff, implying that it has already been round the cycle a couple of times (while I was getting [wfd] set up to see what the heck was going on).
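(An aside for anyone wading through dumps like this: a few lines of script can condense them. The sketch below assumes tab-separated fields as in the raw client log; the pattern is illustrative and fragile, so adjust it to whatever your log actually contains.)

```python
import re

# Matches lines like:
# 17/04/2009 11:21:51 <TAB> lhcathome <TAB> [wfd] CPU: fetch share 0.00
#     debt 0.00 backoff dt 5213.54 int 86400.00
WFD_CPU = re.compile(
    r"\t(?P<project>[^\t]+)\t\[wfd\] CPU: fetch share (?P<share>[\d.]+) "
    r"debt (?P<debt>-?[\d.]+) backoff dt (?P<dt>[\d.]+)"
)

def cpu_state(log_text):
    """Return {project: (fetch_share, debt, backoff_dt_seconds)} for
    every '[wfd] CPU:' line in a work_fetch_debug dump."""
    state = {}
    for m in WFD_CPU.finditer(log_text):
        state[m.group("project")] = (
            float(m.group("share")),
            float(m.group("debt")),
            float(m.group("dt")),
        )
    return state
```

Fed the dump above, it would show at a glance that only Einstein@Home has a non-zero CPU fetch share while everything else is backed off, set to no-new-tasks, or comm deferred.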
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4324
United Kingdom
Message 24383 - Posted: 18 Apr 2009, 12:48:52 UTC - in response to Message 24380.  

This may also be the "automated call-back to home to update statistics/preferences" thingy ...

Just for giggles, I did a manual update on the same v6.6.23 host, and got the expected "not requesting new tasks":

18/04/2009 13:38:43	Einstein@Home	Sending scheduler request: Requested by user.
18/04/2009 13:38:43	Einstein@Home	Reporting 1 completed tasks, not requesting new tasks
18/04/2009 13:38:48	Einstein@Home	Scheduler request completed: got 0 new tasks
18/04/2009 13:38:48	Einstein@Home	Message from server: Server can't open database

I think the call-back thingy would also show as "not requesting new tasks", and that doesn't worry me at all: it's the "Requesting new tasks" which turns out not to be a request which is confusing and - as I argued - potentially dangerous. That's where I call 'bug'.
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14441
Netherlands
Message 24384 - Posted: 18 Apr 2009, 13:41:19 UTC - in response to Message 24383.  

This may also be the "automated call-back to home to update statistics/preferences" thingy ...

Just for giggles, I did a manual update on the same v6.6.23 host, and got the expected "not requesting new tasks"

I was just flagging it because I think it does it with a request for work; it has to contact the scheduler in some way. I'll see if I can find where it does it in the source code.
Chimmy

Joined: 27 Nov 08
Posts: 3
United States
Message 24386 - Posted: 18 Apr 2009, 17:52:16 UTC

Thanks for all the replies.

I turned on work_fetch_debug and it looks like all but one of the projects are overworked. I'll let it run until it's out of work, or nearly out, and then turn the flag on again.

Is there any way to decrease the frequency that wfd displays data? Every second is a bit much.

Thanks,

Jim
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14441
Netherlands
Message 24387 - Posted: 18 Apr 2009, 17:57:45 UTC - in response to Message 24386.  
Last modified: 18 Apr 2009, 17:58:55 UTC

Is there any way to decrease the frequency that wfd displays data? Every second is a bit much.

No, there isn't. You can increase the size of the log files, though, or only run wfd for a short time, just long enough to log the inconsistencies, before turning it off.

To increase log size, add the following to cc_config.xml in your BOINC Data directory:

<cc_config>
   <options>
      <max_stdout_file_size>8388608</max_stdout_file_size>
      <max_stderr_file_size>8388608</max_stderr_file_size>
   </options>
</cc_config>

The numbers there signify 8 MB; the log size must be set in bytes (8 * 1024 * 1024 = 8388608).
When this file is saved, make sure it didn't get a .txt extension. Then open BOINC Manager->Advanced->Read config file.
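If even a short wfd run produces more output than you want to read, another option is to post-filter the saved log. A sketch (the keyword list, and the assumption that you're filtering a copy of stdoutdae.txt from the BOINC Data directory, are mine; adjust to taste):

```python
# Copy only the interesting summary lines out of a work_fetch_debug
# log, dropping the once-per-second state dumps.
KEEP = (
    "Sending scheduler request",
    "Requesting new tasks",
    "Scheduler request completed",
    "chosen:",
)

def summarize(src_path, dst_path):
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if any(key in line for key in KEEP):
                dst.write(line)
```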
Paul D. Buck

Joined: 29 Aug 05
Posts: 225
Message 24485 - Posted: 22 Apr 2009, 19:55:23 UTC - in response to Message 24379.  

There's at least one bug still in work fetch which can look like the problem reported here. I've reported it as a v6.6.23 bug, but the relevant section of code hasn't changed since v6.6.20.

I don't recall seeing this on the mailing lists. Or is it being talked about elsewhere?

Not that I consider most of the discussion responsive on the mailing lists.
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4324
United Kingdom
Message 24488 - Posted: 22 Apr 2009, 22:17:54 UTC - in response to Message 24485.  

There's at least one bug still in work fetch which can look like the problem reported here. I've reported it as a v6.6.23 bug, but the relevant section of code hasn't changed since v6.6.20.

I don't recall seeing this on the mailing lists. Or is it being talked about elsewhere?

Not that I consider most of the discussion responsive on the mailing lists.

It appeared on boinc_alpha - my name, subject 'Work fetch bug in v6.6.23', first line "I'm afraid we still have a problem with v6.6.23." My confirmation copy back from the boinc_alpha server is timed at 17 April 2009 11:51 (that's BST, or UTC +1 - or about 23 minutes before the post on this board that you quoted).

No replies on the mailing list yet, but I'm waiting for fuller particulars from server administrators - I've asked at Einstein and CPDN, two projects I know well which have had recent server overload problems, if they feel that they're suffering from what I'm calling "content free" scheduler RPC calls from clients. Thyme Lawn has guided Milo in what to look for in the server logs - Jord has access to the area where the question was put, and where any answer will appear. None yet.
Paul D. Buck

Joined: 29 Aug 05
Posts: 225
Message 24489 - Posted: 22 Apr 2009, 22:27:24 UTC - in response to Message 24488.  

No replies on the mailing list yet, but I'm waiting for fuller particulars from server administrators - I've asked at Einstein and CPDN, two projects I know well which have had recent server overload problems, if they feel that they're suffering from what I'm calling "content free" scheduler RPC calls from clients. Thyme Lawn has guided Milo in what to look for in the server logs - Jord has access to the area where the question was put, and where any answer will appear. None yet.

I am pretty sure you are correct about this being a problem. Might also want to consider trying to get the info from MW too ... my bogus GPU calls to them may be part of the problem that they are seeing also.

The mindset is that if the cost appears low, there is no reason to avoid doing stupid things over and over and over again. I have yet to hear why we want to reschedule CPUs once every 60 seconds, much less more often than that. Yet that is, at its root, what is causing the issue we have been trying to get them to recognize ...
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4324
United Kingdom
Message 24490 - Posted: 22 Apr 2009, 22:49:22 UTC - in response to Message 24489.  

Might also want to consider trying to get the info from MW too ... my bogus GPU calls to them may be part of the problem that they are seeing also.

Could you ask at MW? I'm not a member of that project, and a complicated question like that from a total stranger isn't going to get much attention - especially if they're in the middle of a crisis.

The mindset is that if the cost appears low, there is no reason to avoid doing stupid things over and over and over again. I have yet to hear why we want to reschedule CPUs once every 60 seconds, much less more often than that. Yet that is, at its root, what is causing the issue we have been trying to get them to recognize ...

As I said on boinc_alpha, I don't mind if the client spends a few milliseconds checking itself over - "Everybody OK? No problems? OK, Carry on as you were".

The problem only arises if the client takes wasteful action as an improper result of the self-check - preempting tasks and so on - and there are problems in spades if the client fools itself into sending DDoS-style 'content free' RPC calls to servers.

But it's a big call to make, to claim that the new "recommended" client is crippling the BOINC server network, and I must have confirmation of my facts before I call in anything louder than the question I have already put. I'm doing my damnedest to get that confirmation.


Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.