WCG: new systems download 100s of CPU work units, not possible to work all

Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106361 - Posted: 9 Dec 2021, 17:14:00 UTC
Last modified: 9 Dec 2021, 17:17:06 UTC

IMHO that problem with GPUGrid is going to be hard to debug. I would not expect a GPU task to be swapped out for another from the same project.

Thinking about that reminds me of a problem that showed up over at Milkyway earlier, one I tried to help with.
An N-body task (which needs 4 CPU threads) was totally idle while four CPU tasks were running (the system had only 4 cores).
My guess was that the N-body task had been swapped out but never got a time slice again because of all the smaller CPU tasks finishing at different times. All tasks were MW.
I suggested running either one or the other, but not both, from the same project.

In other news, I was able to verify that a new install of BOINC needed "WUprop" so that adding Einstein or WCG would not cause 100s of downloads.

Einstein is my fallback project with resource share = 0 and Milkyway is my 100%, as I can run 4 concurrent tasks.

I tried running two Einstein tasks concurrently. I saw a tiny improvement, but not enough to justify having to use a bigger fan to cool my rack of GPUs.

I recently joined that super-secret GPU club and have some ideas to work on. One is to try to arrange my "boinc mod" so that if GPUGrid gets suspended, the GPUs get assigned to the same slot they were using.
When running my rack of three GPUGrid tasks (P102-100, GTX 1070 and GTX 1660 Ti), all three can die when resumed from suspension, as the CUDA compiler does not know the metadata is different and tries to pick up where it left off, which causes a failure. The alternative is to run 3 instances of BOINC, but that is a PITA.
ID: 106361
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106362 - Posted: 9 Dec 2021, 18:16:51 UTC - in response to Message 106361.  

I don't think much of that will come into play. That machine has a TSI (Task Switch Interval - 'Switch between tasks every ...') of 3,000 minutes - over 2 days. Tasks could still be switched if any of them were reaching a deadline, but my shortest deadline is 1.5 days for WCG resends, and my cache request is 0.25 + 0.05 days - about 7 hours. Nothing should hit any of those triggers in normal running.

My biggest risk is fractional GPU running. As the screenshot shows, Einstein is set to use 0.5 GPUs, and so is WCG. GPUGrid is allowed to wallow in a whole GPU to itself, so won't start automatically when there's only half a GPU free. That requires a little gentle nudging (one GPUGrid task will follow another, if only the project would keep up a regular supply).

My big worry is simply the work fetch algorithm. Something has unleashed work fetch for GPUGrid when it shouldn't have, and I didn't have enough flags active in the Event Log to show what it was. I'll turn on some extra flags before I reach the equivalent stage tomorrow, and try again.
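
For reference, flags along these lines in cc_config.xml are what produce the [rr_sim], [work_fetch] and [sched_op] lines shown in the later logs; this is just an illustrative sketch, not necessarily the exact set enabled here:

<cc_config>
    <log_flags>
        <!-- trace work fetch decisions ([work_fetch] lines) -->
        <work_fetch_debug>1</work_fetch_debug>
        <!-- trace the round-robin simulation ([rr_sim] lines) -->
        <rr_simulation>1</rr_simulation>
        <!-- log scheduler RPC details ([sched_op] lines) -->
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>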
ID: 106362
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106364 - Posted: 9 Dec 2021, 19:25:44 UTC

I agree GPUGrid shouldn't have fetched another task while one was already running. But work_fetch.cpp tied into rr_simulation is such a kludge now that I can only assume things will fall through the abundant cracks in its logic.
ID: 106364
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106366 - Posted: 9 Dec 2021, 19:41:43 UTC - in response to Message 106364.  

... work_fetch.cpp tied into rr_simulation is such a kludge now ...
Yup, but logic is still logic, even if it doesn't do what David thinks it does.

I have a memory gnawing away at the back of my mind. Sometime fairly recently - say the last six months, maybe even more recent - I think I saw an issue or PR on GitHub to the effect of 'always request work when contacting a project', thus overriding the work fetch priority values, or so it seemed. Ring any bells with anyone here? The GitHub search tools aren't good enough to find it, and I don't remember the exact wording. It may even have been somewhere else, like these boards, and a request rather than an actual change.

I'll keep poking, but any assistance would be welcome.
ID: 106366
Harri Liljeroos

Joined: 25 Jul 18
Posts: 62
Finland
Message 106368 - Posted: 9 Dec 2021, 22:08:16 UTC
Last modified: 9 Dec 2021, 22:09:49 UTC

There are options in cc_config for this: <fetch_on_update>0</fetch_on_update> and <fetch_minimal_work>0</fetch_minimal_work>.
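
For placement, both options belong in the <options> section of cc_config.xml. A minimal illustrative sketch (the 0 values simply mirror the line above; the comments paraphrase the client documentation):

<cc_config>
    <options>
        <!-- 1 = also request work when updating a project, even if it is not the highest-priority project -->
        <fetch_on_update>0</fetch_on_update>
        <!-- 1 = fetch only enough jobs to occupy each device instance -->
        <fetch_minimal_work>0</fetch_minimal_work>
    </options>
</cc_config>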
ID: 106368
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106375 - Posted: 10 Dec 2021, 14:24:10 UTC - in response to Message 106368.  

Thanks for the reminder. We had a discussion about that in the projects area in October (that might be what I was remembering, though it doesn't quite match). In that discussion, I suggested I might give it a try, but I've looked - I confirm that option is not active on the machine in question.

Also, in the event log extracts I showed you yesterday, there was a work request to GPUGrid an hour earlier, and NVidia work was not requested - NVidia work was only requested when I was trying to fill the Einstein cache.

Here are the events on either side of that earlier work request:
09-Dec-2021 13:25:24 [World Community Grid] Sending scheduler request: To fetch work.
09-Dec-2021 13:25:24 [World Community Grid] Requesting new tasks for NVIDIA GPU and Intel GPU
09-Dec-2021 13:25:24 [World Community Grid] [sched_op] NVIDIA GPU work request: 21042.74 seconds; 2.00 devices
09-Dec-2021 13:25:24 [World Community Grid] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices

09-Dec-2021 13:26:30 [GPUGRID] Sending scheduler request: Requested by project.
09-Dec-2021 13:26:30 [GPUGRID] Requesting new tasks for Intel GPU
09-Dec-2021 13:26:30 [GPUGRID] [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
09-Dec-2021 13:26:30 [GPUGRID] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices
09-Dec-2021 13:26:31 [GPUGRID] Scheduler request completed: got 0 new tasks

09-Dec-2021 13:30:21 [World Community Grid] Sending scheduler request: To fetch work.
09-Dec-2021 13:30:21 [World Community Grid] Requesting new tasks for NVIDIA GPU and Intel GPU
09-Dec-2021 13:30:21 [World Community Grid] [sched_op] NVIDIA GPU work request: 21420.77 seconds; 2.00 devices
09-Dec-2021 13:30:21 [World Community Grid] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices
So the overall cache was definitely low, but the running GPU task and the exclusion of the second GPU meant that it wasn't appropriate for the client to request any from GPUGrid - as intended.
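
For context, a per-project GPU exclusion of the kind referred to above lives in cc_config.xml; the snippet below is purely hypothetical - the project URL and device number are placeholders rather than the actual settings on this machine:

<cc_config>
    <options>
        <!-- keep GPUGrid work off GPU device 1, leaving it free for other projects -->
        <exclude_gpu>
            <url>https://www.gpugrid.net/</url>
            <device_num>1</device_num>
        </exclude_gpu>
    </options>
</cc_config>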
ID: 106375
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106376 - Posted: 10 Dec 2021, 15:19:23 UTC

Well, I've got a routine log for one of the scheduled GPUGrid updates:

10/12/2021 14:59:17 | GPUGRID | [sched_op] sched RPC pending: Requested by project
10/12/2021 14:59:17 | GPUGRID | piggyback_work_request()
10/12/2021 14:59:17 |  | [rr_sim] doing sim: work fetch
10/12/2021 14:59:17 |  | [rr_sim] start: work_buf min 21600 additional 4320 total 25920 on_frac 1.000 active_frac 1.000
10/12/2021 14:59:17 | GPUGRID | [rr_sim] 82570.33: e9s627_e1s741p0f526-ADRIA_BanditGPCR_APJ_b0-0-1-RND6065_3 finishes (1.00 CPU + 1.00 NVIDIA GPU) (3715287.81G/47.27G)
10/12/2021 14:59:17 |  | [rr_sim] end
10/12/2021 14:59:17 |  | [work_fetch] ------- start work fetch state -------
10/12/2021 14:59:17 |  | [work_fetch] target work buffer: 21600.00 + 4320.00 sec
10/12/2021 14:59:17 |  | [work_fetch] --- project states ---
10/12/2021 14:59:17 | GPUGRID | [work_fetch] REC 391197.604 prio -1.010 can request work
10/12/2021 14:59:17 |  | [work_fetch] --- state for CPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 26946.30 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 blocked by project preferences
10/12/2021 14:59:17 |  | [work_fetch] --- state for NVIDIA GPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 15302.30 nidle 0.00 saturated 10522.10 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 job cache full
10/12/2021 14:59:17 |  | [work_fetch] --- state for Intel GPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 29576.94 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 project is backed off  (resource backoff: 116728.86, inc 86400.00)
10/12/2021 14:59:17 |  | [work_fetch] ------- end work fetch state -------
10/12/2021 14:59:17 | GPUGRID | piggyback: resource CPU
10/12/2021 14:59:17 | GPUGRID | piggyback: can't fetch CPU: blocked by project preferences
10/12/2021 14:59:17 | GPUGRID | piggyback: resource NVIDIA GPU
10/12/2021 14:59:17 | GPUGRID | piggyback: can't fetch NVIDIA GPU: job cache full
10/12/2021 14:59:17 | GPUGRID | piggyback: resource Intel GPU
10/12/2021 14:59:17 | GPUGRID | piggyback: don't need Intel GPU
10/12/2021 14:59:17 | GPUGRID | [rr_sim] piggyback: don't need work
10/12/2021 14:59:17 | GPUGRID | [sched_op] Starting scheduler request
10/12/2021 14:59:17 | GPUGRID | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA GPU (0.00 sec, 0.00 inst) Intel GPU (0.00 sec, 0.00 inst)
10/12/2021 14:59:17 | GPUGRID | Sending scheduler request: Requested by project.
10/12/2021 14:59:17 | GPUGRID | Not requesting tasks: don't need (CPU: ; NVIDIA GPU: ; Intel GPU: job cache full)
10/12/2021 14:59:17 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:17 | GPUGRID | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:17 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:18 | GPUGRID | Scheduler request completed
Preserving this so we can see what's different if we allow Einstein to fetch as well.
ID: 106376
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106377 - Posted: 10 Dec 2021, 16:13:56 UTC

And today, it didn't even attempt to fetch work. Stayed at

GPUGRID | [work_fetch] share 0.000 job cache full
throughout the Einstein refill. As it should. No configuration changes, apart from the log flag selection. Maybe it just doesn't like Thursdays?
ID: 106377
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106381 - Posted: 10 Dec 2021, 19:33:10 UTC
Last modified: 10 Dec 2021, 19:37:44 UTC

The option
<fetch_on_update>0</fetch_on_update>


is not working like I expected. I added it to the cc_config.xml "options" section:

<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <allow_remote_gui_rpc>1</allow_remote_gui_rpc>
        <fetch_on_update>0</fetch_on_update>
    </options>
</cc_config>


and restarted the client, waited a while, then requested an update - and got over 100 tasks:

hp3400

68	Milkyway@Home	12/10/2021 1:20:31 PM	update requested by user	
69	Milkyway@Home	12/10/2021 1:20:34 PM	Sending scheduler request: Requested by user.	
70	Milkyway@Home	12/10/2021 1:20:34 PM	Requesting new tasks for AMD/ATI GPU	
71	Milkyway@Home	12/10/2021 1:20:36 PM	Scheduler request completed: got 119 new tasks	


However, the Milkyway project has a known problem: it does not download new work units until 91 seconds after all existing work units have finished, so getting 100+ tasks was doubly unexpected!
ID: 106381
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106382 - Posted: 10 Dec 2021, 19:41:31 UTC - in response to Message 106381.  

The option
<fetch_on_update>0</fetch_on_update>

is not working like I expected. I added it to cc_config.xml "options"
I think it works the way the developers intended:

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
Setting it to 1 adds extra fetching, but 0 doesn't block normal fetches. That quote comes from the User Manual.
ID: 106382
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106383 - Posted: 10 Dec 2021, 19:53:59 UTC - in response to Message 106382.  
Last modified: 10 Dec 2021, 20:04:11 UTC

The option
<fetch_on_update>0</fetch_on_update>

is not working like I expected. I added it to cc_config.xml "options"
I think it works the way the developers intended:

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
Setting it to 1 adds extra fetching, but 0 doesn't block normal fetches. That quote comes from the User Manual.


IMHO the "Extra Fetch" was clearly added as shown quote "Sending scheduler request: Requested by user"

I set the option to >1< and restarted the client and did an update after a few minutes and got essentially the same thing
hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	


Unless I am missing something, there is no difference between the two updates I requested, other than that I did get additional tasks with the >0<.

So with or without the option, work is always requested.

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	
63	Milkyway@Home	12/10/2021 1:50:04 PM	Sending scheduler request: To fetch work.	
64	Milkyway@Home	12/10/2021 1:50:04 PM	Requesting new tasks for AMD/ATI GPU	
65	Milkyway@Home	12/10/2021 1:50:07 PM	Scheduler request completed: got 36 new tasks	
66	Milkyway@Home	12/10/2021 1:50:07 PM	Project requested delay of 91 seconds	
ID: 106383
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106385 - Posted: 10 Dec 2021, 21:44:05 UTC - in response to Message 106383.  

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

Wow!! Could it be as simple as that? What I would like to see is a reported task and requested work during the same scheduler connection being filled.
ID: 106385
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106386 - Posted: 11 Dec 2021, 1:56:39 UTC - in response to Message 106385.  

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

Wow!! Could it be as simple as that? What I would like to see is a reported task and requested work during the same scheduler connection being filled.


Sorry, just got around to reading this.

No, that option did not cause new work units to be downloaded after a "finished" upload.
The work count starts at 300 for a single board, slowly drops to 0, and then there is that 91-second plus up-to-5-minute wait, and occasionally an even longer idle period.

I think what happened was that I requested an update and it just so happened that 91 seconds had elapsed since the last request, so I actually got serviced.

On my "racks" with multiple GPUs, an MW work unit finishes on average every 15 seconds, so the 91-second gap never occurs. This test system had 1 board and all 4 tasks finish at almost exactly the same time, 2.5 minutes apart, so there is a good chance the 91 seconds have elapsed. The net effect is that I still have to use my BOINC client "mod" to avoid the long idle time.
ID: 106386
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106387 - Posted: 11 Dec 2021, 2:01:44 UTC

OK, sorry to hear a miracle "fix" hadn't occurred. Yes, either your modified BOINC client, the GPUUG client or the PowerShell script is still needed to get around the flaw in the Milkyway scheduler.
ID: 106387