WCG: new systems download 100s of CPU work units, not possible to work all

Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106361 - Posted: 9 Dec 2021, 17:14:00 UTC
Last modified: 9 Dec 2021, 17:17:06 UTC

IMHO that problem with GPUGrid is going to be hard to debug. I would not expect a GPU task to be swapped out for another from the same project.

Thinking about that reminds me of a problem that showed up over at Milkyway earlier, one I tried to help with.
An N-body task (which needs 4 CPU threads) was totally idle while four CPU tasks were running (the system had only 4 cores).
My guess was that the N-body task had been swapped out but never got a time slice again because of all the smaller CPU tasks finishing at different times. All tasks were MW.
I suggested running either one or the other, but not both, from the same project.

In other news, I was able to verify that a new install of BOINC needed "WUprop" so that adding Einstein or WCG would not cause 100s of downloads.

Einstein is my fallback project with resource share = 0 and Milkyway is my 100%, as I can run 4 concurrent tasks.

I tried running two Einstein tasks concurrently. I saw a tiny improvement, but not enough to justify having to use a bigger fan to cool my rack of GPUs.

I recently joined that super-secret GPU club and have some ideas to work on. One is to try to arrange my "boinc mod" so that if GPUGrid gets suspended, the GPUs get assigned to the same slot they were using.
When running my rack of three GPUGrid tasks (P102-100, GTX 1070 and GTX 1660 Ti), all three can die when resumed from suspension, as the CUDA compiler does not know the metadata is different and tries to pick up where it left off, which causes a failure. The alternative is to run 3 instances of BOINC, but that is a PITA.
ID: 106361
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106362 - Posted: 9 Dec 2021, 18:16:51 UTC - in response to Message 106361.  

I don't think much of that will come into play. That machine has a TSI (Task Switch Interval - 'Switch between tasks every ...') of 3,000 minutes - over 2 days. Tasks could still be switched if any of them were reaching a deadline, but my shortest deadline is 1.5 days for WCG resends, and my cache request is 0.25 + 0.05 days - about 7 hours. Nothing should hit any of those triggers in normal running.

My biggest risk is fractional GPU running. As the screenshot shows, Einstein is set to use 0.5 GPUs, and so is WCG. GPUGrid is allowed to wallow in a whole GPU to itself, so won't start automatically when there's only half a GPU free. That requires a little gentle nudging (one GPUGrid task will follow another, if only the project would keep up a regular supply).

My big worry is simply the work fetch algorithm. Something has unleashed work fetch for GPUGrid when it shouldn't have, and I didn't have enough flags active in the Event Log to show what it was. I'll turn on some extra flags before I reach the equivalent stage tomorrow, and try again.
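
For reference, flags along these lines in cc_config.xml are what produce the [rr_sim], [work_fetch] and [sched_op] lines shown in the later logs; this is just an illustrative sketch, not necessarily the exact set enabled here:

<cc_config>
    <log_flags>
        <!-- trace work fetch decisions ([work_fetch] lines) -->
        <work_fetch_debug>1</work_fetch_debug>
        <!-- trace the round-robin simulation ([rr_sim] lines) -->
        <rr_simulation>1</rr_simulation>
        <!-- log scheduler RPC details ([sched_op] lines) -->
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>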
ID: 106362
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106364 - Posted: 9 Dec 2021, 19:25:44 UTC

I agree GPUGrid shouldn't have fetched another task while one was already running. But work_fetch.cpp tied into rr_simulation is such a kludge now that I can only assume things will fall through the abundant cracks in its logic.
ID: 106364
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106366 - Posted: 9 Dec 2021, 19:41:43 UTC - in response to Message 106364.  

... work_fetch.cpp tied into rr_simulation is such a kludge now ...
Yup, but logic is still logic, even if it doesn't do what David thinks it does.

I have a memory gnawing away at the back of my mind. Sometime fairly recently - say the last six months, maybe even more recent - I think I saw an issue or PR on GitHub to the effect of 'always request work when contacting a project', thus overriding the work fetch priority values, or so it seemed. Ring any bells with anyone here? The GitHub search tools aren't good enough to find it, and I don't remember the exact wording. It may even have been somewhere else, like these boards, and a request rather than an actual change.

I'll keep poking, but any assistance would be welcome.
ID: 106366
Harri Liljeroos

Joined: 25 Jul 18
Posts: 62
Finland
Message 106368 - Posted: 9 Dec 2021, 22:08:16 UTC
Last modified: 9 Dec 2021, 22:09:49 UTC

There are options in cc_config for this: <fetch_on_update>0</fetch_on_update> and <fetch_minimal_work>0</fetch_minimal_work>.
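
For placement, both options belong in the <options> section of cc_config.xml. A minimal illustrative sketch (the 0 values simply mirror the line above; the comments paraphrase the client documentation):

<cc_config>
    <options>
        <!-- 1 = also request work when updating a project, even if it is not the highest-priority project -->
        <fetch_on_update>0</fetch_on_update>
        <!-- 1 = fetch only enough jobs to occupy each device instance -->
        <fetch_minimal_work>0</fetch_minimal_work>
    </options>
</cc_config>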
ID: 106368
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106375 - Posted: 10 Dec 2021, 14:24:10 UTC - in response to Message 106368.  

Thanks for the reminder. We had a discussion about that in the projects area in October (that might be what I was remembering, though it doesn't quite match). In that discussion, I suggested I might give it a try, but I've looked - I confirm that option is not active on the machine in question.

Also, in the event log extracts I showed you yesterday, there was a work request to GPUGrid an hour earlier, and NVidia work was not requested - NVidia work was only requested when I was trying to fill the Einstein cache.

Here are the events on either side of that earlier work request:
09-Dec-2021 13:25:24 [World Community Grid] Sending scheduler request: To fetch work.
09-Dec-2021 13:25:24 [World Community Grid] Requesting new tasks for NVIDIA GPU and Intel GPU
09-Dec-2021 13:25:24 [World Community Grid] [sched_op] NVIDIA GPU work request: 21042.74 seconds; 2.00 devices
09-Dec-2021 13:25:24 [World Community Grid] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices

09-Dec-2021 13:26:30 [GPUGRID] Sending scheduler request: Requested by project.
09-Dec-2021 13:26:30 [GPUGRID] Requesting new tasks for Intel GPU
09-Dec-2021 13:26:30 [GPUGRID] [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
09-Dec-2021 13:26:30 [GPUGRID] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices
09-Dec-2021 13:26:31 [GPUGRID] Scheduler request completed: got 0 new tasks

09-Dec-2021 13:30:21 [World Community Grid] Sending scheduler request: To fetch work.
09-Dec-2021 13:30:21 [World Community Grid] Requesting new tasks for NVIDIA GPU and Intel GPU
09-Dec-2021 13:30:21 [World Community Grid] [sched_op] NVIDIA GPU work request: 21420.77 seconds; 2.00 devices
09-Dec-2021 13:30:21 [World Community Grid] [sched_op] Intel GPU work request: 25920.00 seconds; 1.00 devices
So the overall cache was definitely low, but the running GPU task and the exclusion of the second GPU meant that it wasn't appropriate for the client to request any from GPUGrid - as intended.
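
For context, a per-project GPU exclusion of the kind referred to above lives in cc_config.xml; the snippet below is purely hypothetical - the project URL and device number are placeholders rather than the actual settings on this machine:

<cc_config>
    <options>
        <!-- keep GPUGrid work off GPU device 1, leaving it free for other projects -->
        <exclude_gpu>
            <url>https://www.gpugrid.net/</url>
            <device_num>1</device_num>
        </exclude_gpu>
    </options>
</cc_config>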
ID: 106375
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106376 - Posted: 10 Dec 2021, 15:19:23 UTC

Well, I've got a routine log for one of the scheduled GPUGrid updates:

10/12/2021 14:59:17 | GPUGRID | [sched_op] sched RPC pending: Requested by project
10/12/2021 14:59:17 | GPUGRID | piggyback_work_request()
10/12/2021 14:59:17 |  | [rr_sim] doing sim: work fetch
10/12/2021 14:59:17 |  | [rr_sim] start: work_buf min 21600 additional 4320 total 25920 on_frac 1.000 active_frac 1.000
10/12/2021 14:59:17 | GPUGRID | [rr_sim] 82570.33: e9s627_e1s741p0f526-ADRIA_BanditGPCR_APJ_b0-0-1-RND6065_3 finishes (1.00 CPU + 1.00 NVIDIA GPU) (3715287.81G/47.27G)
10/12/2021 14:59:17 |  | [rr_sim] end
10/12/2021 14:59:17 |  | [work_fetch] ------- start work fetch state -------
10/12/2021 14:59:17 |  | [work_fetch] target work buffer: 21600.00 + 4320.00 sec
10/12/2021 14:59:17 |  | [work_fetch] --- project states ---
10/12/2021 14:59:17 | GPUGRID | [work_fetch] REC 391197.604 prio -1.010 can request work
10/12/2021 14:59:17 |  | [work_fetch] --- state for CPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 26946.30 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 blocked by project preferences
10/12/2021 14:59:17 |  | [work_fetch] --- state for NVIDIA GPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 15302.30 nidle 0.00 saturated 10522.10 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 job cache full
10/12/2021 14:59:17 |  | [work_fetch] --- state for Intel GPU ---
10/12/2021 14:59:17 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 29576.94 busy 0.00
10/12/2021 14:59:17 | GPUGRID | [work_fetch] share 0.000 project is backed off  (resource backoff: 116728.86, inc 86400.00)
10/12/2021 14:59:17 |  | [work_fetch] ------- end work fetch state -------
10/12/2021 14:59:17 | GPUGRID | piggyback: resource CPU
10/12/2021 14:59:17 | GPUGRID | piggyback: can't fetch CPU: blocked by project preferences
10/12/2021 14:59:17 | GPUGRID | piggyback: resource NVIDIA GPU
10/12/2021 14:59:17 | GPUGRID | piggyback: can't fetch NVIDIA GPU: job cache full
10/12/2021 14:59:17 | GPUGRID | piggyback: resource Intel GPU
10/12/2021 14:59:17 | GPUGRID | piggyback: don't need Intel GPU
10/12/2021 14:59:17 | GPUGRID | [rr_sim] piggyback: don't need work
10/12/2021 14:59:17 | GPUGRID | [sched_op] Starting scheduler request
10/12/2021 14:59:17 | GPUGRID | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA GPU (0.00 sec, 0.00 inst) Intel GPU (0.00 sec, 0.00 inst)
10/12/2021 14:59:17 | GPUGRID | Sending scheduler request: Requested by project.
10/12/2021 14:59:17 | GPUGRID | Not requesting tasks: don't need (CPU: ; NVIDIA GPU: ; Intel GPU: job cache full)
10/12/2021 14:59:17 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:17 | GPUGRID | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:17 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
10/12/2021 14:59:18 | GPUGRID | Scheduler request completed
Preserving this so we can see what's different if we allow Einstein to fetch as well.
ID: 106376
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106377 - Posted: 10 Dec 2021, 16:13:56 UTC

And today, it didn't even attempt to fetch work. Stayed at

GPUGRID | [work_fetch] share 0.000 job cache full
throughout the Einstein refill. As it should. No configuration changes, apart from the log flag selection. Maybe it just doesn't like Thursdays?
ID: 106377
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106381 - Posted: 10 Dec 2021, 19:33:10 UTC
Last modified: 10 Dec 2021, 19:37:44 UTC

The option
<fetch_on_update>0</fetch_on_update>


is not working like I expected. I added it to the cc_config.xml "options" section:

<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <allow_remote_gui_rpc>1</allow_remote_gui_rpc>
        <fetch_on_update>0</fetch_on_update>
    </options>
</cc_config>


and restarted the client, waited a while, then requested an update - and got over 100 tasks:

hp3400

68	Milkyway@Home	12/10/2021 1:20:31 PM	update requested by user	
69	Milkyway@Home	12/10/2021 1:20:34 PM	Sending scheduler request: Requested by user.	
70	Milkyway@Home	12/10/2021 1:20:34 PM	Requesting new tasks for AMD/ATI GPU	
71	Milkyway@Home	12/10/2021 1:20:36 PM	Scheduler request completed: got 119 new tasks	


However, the Milkyway project has a known problem: it does not download new work units until 91 seconds after all existing work units have finished, so getting 100+ tasks was doubly unexpected!
ID: 106381
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 106382 - Posted: 10 Dec 2021, 19:41:31 UTC - in response to Message 106381.  

The option
<fetch_on_update>0</fetch_on_update>

is not working like I expected. I added it to cc_config.xml "options"
I think it works the way the developers intended:

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
Setting it to 1 adds extra fetching, but 0 doesn't block normal fetches. That quote comes from the User Manual.
ID: 106382
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106383 - Posted: 10 Dec 2021, 19:53:59 UTC - in response to Message 106382.  
Last modified: 10 Dec 2021, 20:04:11 UTC

The option
<fetch_on_update>0</fetch_on_update>

is not working like I expected. I added it to cc_config.xml "options"
I think it works the way the developers intended:

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
Setting it to 1 adds extra fetching, but 0 doesn't block normal fetches. That quote comes from the User Manual.


IMHO the "Extra Fetch" was clearly added as shown quote "Sending scheduler request: Requested by user"

I set the option to >1< and restarted the client and did an update after a few minutes and got essentially the same thing
hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	


Unless I am missing something, there is no difference between the two updates I requested, other than that I did get additional tasks with the >0<.

So with or without the option, work is always requested.

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	
63	Milkyway@Home	12/10/2021 1:50:04 PM	Sending scheduler request: To fetch work.	
64	Milkyway@Home	12/10/2021 1:50:04 PM	Requesting new tasks for AMD/ATI GPU	
65	Milkyway@Home	12/10/2021 1:50:07 PM	Scheduler request completed: got 36 new tasks	
66	Milkyway@Home	12/10/2021 1:50:07 PM	Project requested delay of 91 seconds	
ID: 106383
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106385 - Posted: 10 Dec 2021, 21:44:05 UTC - in response to Message 106383.  

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

Wow!! Could it be as simple as that? What I would like to see is a reported task and requested work during the same scheduler connection being filled.
ID: 106385
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106386 - Posted: 11 Dec 2021, 1:56:39 UTC - in response to Message 106385.  

[edit] I didn't wait long enough. I got additional tasks. Maybe this fixes the 91-second minimum delay problem!!! Will let it run for a while.

Wow!! Could it be as simple as that? What I would like to see is a reported task and requested work during the same scheduler connection being filled.


Sorry, just got around to reading this.

No, that option did not cause new work units to be downloaded after a "finished" upload.
The work count starts at 300 for a single board, slowly drops to 0, and then there is that 91-second plus up-to-5-minute wait, and occasionally an even longer idle period.

I think what happened was that I requested an update and it just so happened that 91 seconds had elapsed since the last request, so I actually got serviced.

On my "racks" with multiple GPUs, an MW work unit finishes on average every 15 seconds, so the 91-second gap never occurs. This test system had 1 board and all 4 tasks finish at almost exactly the same time, 2.5 minutes apart, so there is a good chance the 91 seconds have elapsed. The net effect is that I still have to use my BOINC client "mod" to avoid the long idle time.
ID: 106386
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 106387 - Posted: 11 Dec 2021, 2:01:44 UTC

OK, sorry to hear a miracle "fix" hadn't occurred. Yes, either your modified BOINC client, the GPUUG client or the PowerShell script is still needed to get around the flaw in the Milkyway scheduler.
ID: 106387