Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores
Joined: 28 Jan 21 Posts: 5

I have observed intermittent scheduling issues that result in skipped jobs due to over-commitment of CPU cores. This occasionally means a GPU task is skipped. One example log entry:

1/27/2021 4:44:53 PM | collatz | [cpu_sched_debug] skipping GPU job collatz_sieve_4404021a-de64-41b2-bbb3-0b8228e66814_0; CPU committed

I have run into this on the 64-bit 7.16.11 (Windows) and 7.16.14 (macOS) versions of the BOINC Manager. I've been able to partially work around the issue by limiting the number of cores each project may use in app_config.xml.

An example scenario I created shows the skipping of jobs, though on my machine a GPU job is skipped rather than a CPU job: https://boinc.berkeley.edu/sim_web.php?action=show_scenario&name=188

I have 3 different systems where I see this behavior from time to time, and I can create more scenarios if helpful.
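For reference, here is roughly the app_config.xml I've been using for the limiting - a sketch only, with the app name guessed from the task name in the log above (the authoritative names are in client_state.xml):

<app_config>
   <app>
      <name>collatz_sieve</name>                          <!-- example app name; check client_state.xml -->
      <max_concurrent>4</max_concurrent>                  <!-- run at most 4 tasks of this app at once -->
   </app>
   <project_max_concurrent>8</project_max_concurrent>     <!-- optional cap across the whole project -->
</app_config>

The file goes in the project's directory under the BOINC data folder, and is picked up via Options > Read config files, or on a client restart.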
Joined: 5 Oct 06 Posts: 5129

Yes, that's a thing: client/cpu_sched.cpp, line 1250. It's designed to happen, but only when CPU tasks have been given unusual precedence. We need to work out why that's happened. There are two basic cases in which CPU tasks are promoted ahead of GPU tasks:

1) When a CPU task needs to run in 'high priority' or 'Earliest Deadline First' (EDF) mode, to avoid missing its deadline.

2) When a CPU task is running multi-threaded (MT).

In those cases GPU tasks can still run, but only to the extent that all of them together are assessed to need less than one complete CPU for support purposes. A GPU task which takes the total CPU requirement beyond 1 would be blocked.

Would either of those cases apply here? In your simulation, six Rosetta tasks are running in EDF, and the first cpu_sched_debug pass finds all CPUs used (12.84 >= 12), skipping OPN1_0032301_05289_0. But I've sometimes found that the simulator doesn't exactly reproduce the decisions of the real-world client. The Einstein NVidia task is allocated a whole CPU, and the Collatz intel_gpu task is allocated 0.83 CPUs, so something has to be freed up to allow both to run.

Rosetta tasks have short deadlines, so will need EDF more often. Reducing your cache size to well below the Rosetta deadline may well help.
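If it's the CPU reservations of the GPU apps that tip the total over the edge, you can also override the budgeted values per app in app_config.xml - a sketch only, with an illustrative app name (take the real names from client_state.xml):

<app_config>
   <app>
      <name>hsgamma_FGRPB1G</name>       <!-- illustrative app name; use your own from client_state.xml -->
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>      <!-- fraction of a GPU each task occupies -->
         <cpu_usage>0.5</cpu_usage>      <!-- CPU fraction the scheduler budgets per GPU task -->
      </gpu_versions>
   </app>
</app_config>

Bear in mind this changes only what the scheduler budgets for, not what the science application actually consumes - set it too low and the GPU tasks will be starved of support.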
Joined: 28 Jan 21 Posts: 5

This fills in the picture a bit, though something very odd is going on. I aborted over 900 Rosetta tasks last night that couldn't possibly have been completed on time. I don't have a clue how those could have been scheduled. I also aborted around 100 World Community Grid tasks, and still have some that may not be started on time.

In trying to keep the GPUs working by limiting the number of concurrent jobs, I have noticed scenarios where I have available cores and GPUs, with GPU tasks that could be assigned, but aren't. For now, I'm going to let things settle down and see if I can get back to the original scheduling issue that started me on this journey.

One follow-up question: "Reducing your cache size to well below the Rosetta deadline may well help." I'm unsure what you mean by "reducing cache size". The best I can manage is to set a project's "Resource Share" and the number of days of work (the 'at least' and 'additional' values under preferences). Is that what you are referring to, or is there a configuration flag for the number of jobs? That would be quite helpful for Rosetta, which has always been frustrating due to its very short deadlines relative to other projects.
Joined: 28 Jun 10 Posts: 2704

"Reducing your cache size to well below the Rosetta deadline may well help."

If the Rosetta deadline is, say, typically 12 hours, then using the Manager, under computing preferences, set BOINC to store at least 0.1 days of work and up to an additional 0.1 days of work. If running without the Manager, there are commands to do this using boinccmd.
Joined: 5 Oct 06 Posts: 5129

OK, two issues arise from that.

1) You mention "limiting the number of concurrent jobs" - I assume by setting one or other of the possible <max_concurrent> flags in one or more app_config.xml files. I have reported a possible bug in the BOINC client - the component that lives on, and controls the work done on, your computer. The effect of the bug makes your computer ask, repeatedly, for more and more work. If you use app_config files, and especially if you use max_concurrent, keep a close eye on the Event Log and be prepared to step in if you see repeated requests for the same amount of work over a short period of time. By 'short', I mean really short: I think Rosetta would request work every twelve seconds.

2) "Cache size" - on these, and other BOINC message boards - is shorthand for the configuration parameters

Store at least 0.01 days of work
Store up to an additional --- days of work

You can set those either on your account page at a project web site, or directly through BOINC Manager on your computer. You can't use both techniques at the same time: if you use both, the ones on your machine will take precedence, and the ones from the project website will be ignored.

IIRC, Rosetta has a deadline of 3 days from the time new tasks are downloaded. Your cache (the sum of 'at least' and 'additional') should always be less than half that, and while you're working out what's going wrong, keep it even smaller. I personally use 0.25 + 0.05 days: that means 'report the work you've completed, and get a new batch, about once an hour', and the whole lot should turn over in about 6 - 8 hours. Adjust those figures if you have internet constraints, if you don't let the computer run 24 hours a day, or suchlike.
Joined: 8 Nov 19 Posts: 718

If you run more than one GPU in your system, and feed it from a dual- or quad-core CPU, you can easily run into issues like that. If you have more available CPU threads (e.g. 6 or more) but only 1 or 2 GPUs, chances are you'll easily make the deadline, even if some of the GPU tasks are using more CPU. Not so if you limit the CPU to only 2 cores with 2 GPUs to be fed: in that case 1 or 2 CPU threads can share the workload with the GPUs simultaneously, but at reduced speed.

BOINC benchmarks your CPU's flops rating at the start, and downloads tasks it thinks you'll finish if you were crunching roughly 66-75% of the time at that rating. This doesn't take into account the extra CPU processing a GPU needs to be fed, nor the reduction in turbo boost frequencies caused by overcommitting and higher temperatures. This is especially the case on CPUs with an IGP, where the IGP heats the CPU up to temperatures where the driver has to lower boost frequencies to stay within a thermal limit or a pre-set power envelope, essentially dropping the CPU's actual processing power below the flops rating the benchmark measured.

You can also run into issues if you play with the CPU utilization setting by switching it back and forth to 100%. Setting it to 100% can cause the scheduler to load more tasks, and lowering it again will leave your system unable to finish all those downloaded tasks.
Joined: 28 Jan 21 Posts: 5

"If you run more than 1 GPU in your system, and run it from a dual or quad core CPU, you could easily run into issues like that."

I specifically purchased a 16-core (32 threads with hyperthreading on) workstation that can handle 4 GPUs for this, plus a bit of gaming, so ... yes, I'll definitely run into this - and other issues. My goal was originally to use some of the "waste" electricity that I heat my condo with to do science. So now, the "waste" is scientific computing.

"If you have more available CPU threads (eg: 6 or more), but only 1 or 2 GPUs, chances are you'll easily make the deadline"

Given my purposes, I'm turning the usual choices for system components upside down. I want to minimize the cost of components (except where quality matters for safety - like a 1600W Platinum power supply) and *maximize* heat output. So far, the closest thing to a perfect component is a power-hungry older GPU that still does a lot of work. My two favorites are a GTX 970 and a Tesla K20c I just picked up this week. (It's been gobbling up MilkyWay@Home WUs in about 1/4 to 1/3 the time of my 1070 Ti, which is used for gaming.)

TL;DR: I want the GPUs running full bore all the time, though I do use overclocking software (MSI Afterburner right now) to dial things back a bit and temper the stress on components.

"This is especially the case on CPUs with IGP, where the IGP heats up the CPU to temperatures where the driver needs to lower boost frequencies, to stay within a thermal limitation, or a pre-set power envelope"

The cooling in my systems appears to keep these issues in check, though there's more to be done in that area. I've been studying up on water cooling, but that's probably a rabbit hole I won't go down. I'm fine dialing back the CPUs if needed and keeping an eye on it.

Hey... everyone needs a hobby, right? Thank you for the details. This helps a lot.
Joined: 28 Jan 21 Posts: 5

"I assume by setting one or other of the possible <max_concurrent> flags in one or more app_config.xml files"

Yes, as well as "No New Tasks" and suspending projects - all of this without really understanding how the scheduler would react.

"The effect of the bug makes your computer ask, repeatedly, for more and more work. If you use app_config files, and especially if you use max_concurrent, keep a close eye on the Event Log and be prepared to step in if you see repeated requests for the same amount of work over a short period of time"

Intuitively, I'm fairly sure this is what happened to me. I recall thinking "WTF!? You can't even finish the jobs you already have..." a couple of times - that's when I started suspending projects and such. For now I'm trying to stop fiddling with the configuration and get to a place where the scheduler mostly gets enough work to keep all of my GPUs screaming along. (See above response.)

"You can set those either on your account page at a project web site, or directly through BOINC Manager on your computer. You can't use both techniques at the same time."

Ah, good to know. I'll look into this further. I was also doing some work on limiting network traffic, but I can scale that back.

RE: Rosetta & cache - yes, that makes sense. I was working on limiting downloads to the evenings and had raised the cache limits well above the numbers you mentioned, but I have already scaled back and will do more. Thanks again for the details.
Joined: 8 Nov 19 Posts: 718

As long as your CPU is below 75°C you're good.

For water cooling, you could look into closed-loop water cooling systems. They are cheap, with no dripping-hose messes (as long as the CPU temps stay below 75°C). I know they say less than 60°C, but I've gone as high as 95°C with a single-fan radiator, where the air literally hurt my hand (like a hair dryer). Twin-fan systems (2x 140mm) do keep 105-125W cool enough; however, with PBO enabled, a Ryzen 3900X or better easily hits 150W, which is a bit on the hot side. The triple-fan systems usually stay cool to the hand, with the fans running at very low speeds (versus full speed on 2x 140mm fan systems).
Joined: 9 Apr 06 Posts: 302

Trapped by the same issue at the E@h project. For details:

https://einsteinathome.org/goto/comment/183445
https://einsteinathome.org/goto/comment/183494

Hugely overcommitted CPU cache and no GPU tasks; the GPU sits idle after app_config use. Is it possible to fix this bug?
Joined: 5 Oct 06 Posts: 5129

"Is it possible to fix this bug?"

Well, you're a programmer, so you know that it's always possible to fix a bug - but it's not always easy. There are two prerequisites:

1) That somebody carefully analyses the cause and location of the bug, so that it can be expressed in programming terms.

2) That a programmer familiar with the existing code chooses to devote enough time and effort to re-writing that code, with sufficient care to avoid introducing a new set of bugs.

I've tried to address (1) by writing up https://github.com/BOINC/boinc/issues/4117, and supplying as much information (simulation run, extensive logs, etc.) as I can. The first thing to get clear in your head is that this is a client problem - your client repetitively requests new work, in spite of receiving everything it asks for (Einstein is good at that). If you haven't already, check your local Event Log (or the history in stdoutdae.txt), and see the repeated requests/replies.

I have to confess to a total failure to move David on to prerequisite (2).
Joined: 9 Apr 06 Posts: 302

Of course, it's a client issue in the first instance, not a server one. But the server adds to the problem too.

The client constantly requests tasks (really constantly before the queue limit was hit; at each opportunity now) - both CPU and GPU. The CPU part results in a hugely overcommitted cache (300+ CPU tasks of ~20 h each, for a quad-core with cache settings of 0.1/0.1 days). And then it seems the server comes into play: my host hit the limit of daily tasks (384 for that particular host) and doesn't get anything new because of that, neither CPU nor GPU - so while the GPU queue is at zero, the GPU stays idle. This interconnection of the CPU and GPU queues is another part of the issue, separate from the overload on the CPU side.

Edit: and the end of my previous post was more a wish, and a push for movement, than a question about possibility - of course it can be fixed somehow :))
Joined: 5 Oct 06 Posts: 5129

Well, Einstein's 384 limit is implemented as a single global value, so it's a pretty blunt weapon - more of a long stop than something you should expect to ride on day by day. Ease off any app_config.xml files that contain a <max_concurrent> value.

I'm not sure how Einstein implemented the back stop: if it's saying "daily limit", you'll probably have to wait until midnight (German time) to start filling the cache with GPU tasks. "Max tasks in progress" might have been better here, but I suspect their rather old server doesn't have that option.
Joined: 28 Jun 10 Posts: 2704

Unless an answer is found for the client code, the only way around it I can see is to micromanage things: turn off CPU work on the project web site(s), request enough work to fill the coffers, then turn it back on again. Not a good way of managing things if time for doing so is short, though.
Joined: 9 Apr 06 Posts: 302

"Unless an answer is found for the client code, the only way around it I can see is to micromanage things: turn off CPU work on the project web site(s), request enough work to fill the coffers, then turn it back on again. Not a good way of managing things if time for doing so is short, though."

Thanks - at least it meets my other criterion of a playable solution. Not a set-and-forget one, but...

@Richard, I don't quite understand: is more data required to document this issue, or is the ball on the bug-hunter's side (DA, probably) now? Should I switch on the scheduler debugging flags and post the outcome?

// "Ease off any app_config.xml files that contain a <max_concurrent> value" - then I will just produce crashes in alloc.c ...

EDIT: and one single detail:

TASKS FOR COMPUTER 12862097
All (678) | In progress (621) | Pending (19) | Valid (37) | Invalid (0) | Error (1)

So, despite the 384 limit in the server reply, the host managed to acquire much, much more! (And this means that 384 is a daily limit - and that, without any action, I will fill the system partition with E@h tasks again in a few days/weeks....)

EDIT2:

2/15/2021 20:08:20 PM | Einstein@Home | update requested by user
2/15/2021 20:08:21 PM | Einstein@Home | Sending scheduler request: Requested by user.
2/15/2021 20:08:21 PM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2/15/2021 20:08:23 PM | Einstein@Home | Scheduler request completed: got 0 new tasks
2/15/2021 20:08:23 PM | Einstein@Home | No work sent
2/15/2021 20:08:23 PM | Einstein@Home | (reached daily quota of 384 tasks)
2/15/2021 20:08:23 PM | Einstein@Home | Project has no jobs available
2/15/2021 20:08:23 PM | Einstein@Home | Project requested delay of 25808 seconds

The host is on the "No CPU tasks" venue now. One more cold night and it should start to warm the room again :)
Joined: 5 Oct 06 Posts: 5129

It's probably easiest on Einstein if you work out a reasonable target for what you can work off before you start hitting deadlines (given your various resource constraints), and then abort the excess sooner rather than later. No shame in that, if it's going to happen anyway, and better for the project if it happens sooner, so the resends can go out quickly. If the website says you have more in progress than you can see locally, brace yourself for resent lost tasks - Einstein has that setting switched on permanently.
Joined: 9 Apr 06 Posts: 302

Another experiment with that host: I disabled GW CPU tasks in the project settings and allowed CPU work (after cleaning out the CPU queue). I hoped that, being disallowed from downloading work for the app mentioned in app_config.xml, BOINC would function normally and fill the CPU cache with other tasks (the binary pulsar search). Unfortunately, in this case too, BOINC completely ignored the allowed size of the CPU queue (0.1/0.1 days) and downloaded as many binary search tasks as the server allowed (384 daily). So now the CPU cache is hugely overcommitted with CPU work of another type (not mentioned in app_config.xml at all), and the GPU sits idle again because the project refuses to send more tasks of any kind. So, more complex micro-management is required until this bug is fixed.
Joined: 9 Apr 06 Posts: 302

Very sad, but I had to disallow GW CPU work on most of my hosts due to this BOINC bug. I can't micromanage all of them at the moment.
Joined: 9 Apr 06 Posts: 302

Over the weekend I had some time, so I returned one host to the "generic" venue, where all work is allowed. Project share set to 100. E@h has an app_config.xml allowing only 2 instances of the GW app (the same config that highlighted the work fetch bug before). I enabled the work_fetch_debug flag. A few communications with the project already - no new CPU tasks downloaded. Looks like the bug has disappeared for now - but why?

Here is the last communication with the server:

2/27/2021 14:36:10 PM | Einstein@Home | update requested by user
2/27/2021 14:36:10 PM | | [work_fetch] Request work fetch: project updated by user
2/27/2021 14:36:13 PM | Einstein@Home | piggyback_work_request()
2/27/2021 14:36:13 PM | | [work_fetch] ------- start work fetch state -------
2/27/2021 14:36:13 PM | | [work_fetch] target work buffer: 8640.00 + 8640.00 sec
2/27/2021 14:36:13 PM | | [work_fetch] --- project states ---
2/27/2021 14:36:13 PM | Einstein@Home | [work_fetch] REC 23842.698 prio -2.167 can request work
2/27/2021 14:36:13 PM | Milkyway@Home | [work_fetch] REC 2006.536 prio -7.976 can request work
2/27/2021 14:36:13 PM | SETI@home Beta Test | [work_fetch] REC 0.000 prio 0.000 can't request work: suspended via Manager
2/27/2021 14:36:13 PM | | [work_fetch] --- state for CPU ---
2/27/2021 14:36:13 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 210592.77 busy 0.00
2/27/2021 14:36:13 PM | Einstein@Home | [work_fetch] share 1.000
2/27/2021 14:36:13 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
2/27/2021 14:36:13 PM | SETI@home Beta Test | [work_fetch] share 0.000
2/27/2021 14:36:13 PM | | [work_fetch] --- state for NVIDIA GPU ---
2/27/2021 14:36:13 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 20223.68 busy 0.00
2/27/2021 14:36:13 PM | Einstein@Home | [work_fetch] share 0.990
2/27/2021 14:36:13 PM | Milkyway@Home | [work_fetch] share 0.010
2/27/2021 14:36:13 PM | SETI@home Beta Test | [work_fetch] share 0.000
2/27/2021 14:36:13 PM | | [work_fetch] ------- end work fetch state -------
2/27/2021 14:36:13 PM | Einstein@Home | piggyback: resource CPU
2/27/2021 14:36:13 PM | Einstein@Home | piggyback: don't need CPU
2/27/2021 14:36:13 PM | Einstein@Home | piggyback: resource NVIDIA GPU
2/27/2021 14:36:13 PM | Einstein@Home | piggyback: don't need NVIDIA GPU
2/27/2021 14:36:13 PM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA GPU (0.00 sec, 0.00 inst)
2/27/2021 14:36:13 PM | Einstein@Home | Sending scheduler request: Requested by user.
2/27/2021 14:36:13 PM | Einstein@Home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)
2/27/2021 14:36:14 PM | Einstein@Home | Scheduler request completed
2/27/2021 14:36:14 PM | Einstein@Home | Project requested delay of 60 seconds
2/27/2021 14:36:14 PM | | [work_fetch] Request work fetch: RPC complete
2/27/2021 14:36:19 PM | | choose_project(): 1614425779.941008
2/27/2021 14:36:19 PM | | [work_fetch] ------- start work fetch state -------
2/27/2021 14:36:19 PM | | [work_fetch] target work buffer: 8640.00 + 8640.00 sec
2/27/2021 14:36:19 PM | | [work_fetch] --- project states ---
2/27/2021 14:36:19 PM | Einstein@Home | [work_fetch] REC 23842.698 prio -1.338 can't request work: scheduler RPC backoff (54.94 sec)
2/27/2021 14:36:19 PM | Milkyway@Home | [work_fetch] REC 2006.536 prio -1.135 can request work
2/27/2021 14:36:19 PM | SETI@home Beta Test | [work_fetch] REC 0.000 prio 0.000 can't request work: suspended via Manager
2/27/2021 14:36:19 PM | | [work_fetch] --- state for CPU ---
2/27/2021 14:36:19 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 210583.18 busy 0.00
2/27/2021 14:36:19 PM | Einstein@Home | [work_fetch] share 0.000
2/27/2021 14:36:19 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
2/27/2021 14:36:19 PM | SETI@home Beta Test | [work_fetch] share 0.000
2/27/2021 14:36:19 PM | | [work_fetch] --- state for NVIDIA GPU ---
2/27/2021 14:36:19 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 20213.64 busy 0.00
2/27/2021 14:36:19 PM | Einstein@Home | [work_fetch] share 0.000
2/27/2021 14:36:19 PM | Milkyway@Home | [work_fetch] share 1.000
2/27/2021 14:36:19 PM | SETI@home Beta Test | [work_fetch] share 0.000
2/27/2021 14:36:19 PM | | [work_fetch] ------- end work fetch state -------
2/27/2021 14:36:19 PM | SETI@home Beta Test | choose_project: scanning
2/27/2021 14:36:19 PM | SETI@home Beta Test | skip: suspended via Manager
2/27/2021 14:36:19 PM | Milkyway@Home | choose_project: scanning
2/27/2021 14:36:19 PM | Milkyway@Home | can't fetch CPU: blocked by project preferences
2/27/2021 14:36:19 PM | Milkyway@Home | can fetch NVIDIA GPU
2/27/2021 14:36:19 PM | Einstein@Home | choose_project: scanning
2/27/2021 14:36:19 PM | Einstein@Home | skip: scheduler RPC backoff
2/27/2021 14:36:19 PM | | [work_fetch] No project chosen for work fetch

A few questions regarding the log entries:

1) What is piggyback_work_request(), and what are all the "piggyback" entries later?
2) What is the exact meaning of the REC and prio values?
3) What are the exact meanings of "shortfall 0.00 nidle 0.00 saturated 210592.77 busy 0.00"?
4) choose_project(): 1614425779.941008 - what does the number mean?

EDIT: Currently I see only one difference from the old situation, where the host downloaded CPU tasks up to the project's hard limit: there are NO GW tasks in the queue at the moment, only pulsar search CPU tasks - and app_config.xml mentions the CPU GW app, not the pulsar one. So I will wait until the CPU queue needs more CPU tasks...
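For completeness: the flag was enabled via cc_config.xml in the BOINC data directory. From memory, the relevant fragment looks like this, and boinccmd --read_cc_config applies it without a restart:

<cc_config>
   <log_flags>
      <work_fetch_debug>1</work_fetch_debug>    <!-- produces the work fetch state dumps above -->
      <cpu_sched_debug>1</cpu_sched_debug>      <!-- optional: logs scheduler decisions too -->
   </log_flags>
</cc_config>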
Joined: 9 Apr 06 Posts: 302

Gradually increased the work queue up to:

2/28/2021 18:45:53 PM | | [work_fetch] target work buffer: 38880.00 + 60480.00 sec

At this point the BOINC client decided that it did not have enough work for the CPU and asked for work. It received 1 GW task (with many support files). The GW task is estimated at 1+ days to run, so the cache was filled, and indeed the next work fetch calculation cycles ask for no work. That is good so far. But the CPU is still busy with non-GW tasks (GW tasks are the ones mentioned in app_config.xml).

To this point I have deciphered:

shortfall - the number of seconds between the work available on the host (for a particular type of device) and the total cache size. If the available work is smaller than the cache size, this field starts to increase.
saturated - the number of seconds of available work for a particular device type.
busy - this field became non-zero when BOINC switched to "missing deadline" mode: all 4 CPU cores were assigned CPU tasks while 1 should have been reserved for the GPU. Perhaps the number of seconds of work that would miss its deadline.