Posts by HeatForScience

1) Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores (Message 102760)
Posted 31 Jan 2021 by HeatForScience
Post:
" I assume by setting one or other of the possible <max_concurrent> flags in one or more app_config.xml files"

Yes, as well as "No New Tasks" and suspending projects. All of this in the context of not understanding how the scheduler would react.

"The effect of the bug makes your computer ask, repeatedly, for more and more work. If you use app_config files, and especially if you use max_concurrent, keep a close eye on the Event Log and be prepared to step in if you see repeated requests for the same amount of work over a short period of time"

Intuitively, I'm fairly sure this is what happened to me. A couple of times I recall thinking "WTF!? You can't even finish the jobs you already have...", and that's when I started suspending projects and such. For now I'm trying to back off fiddling with the configuration and get to a place where the scheduler mostly gets enough work to keep all of my GPUs screaming along. (See above response)
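
For anyone who wants to watch for that pattern without staring at the Event Log, here is a minimal Python sketch. It assumes the work requests have already been pulled out of the log as (timestamp, seconds requested) pairs; that extraction step, the function name, and the 30-minute window and 3-repeat threshold are my own placeholders, not anything BOINC defines.

```python
from datetime import datetime, timedelta

def find_repeated_requests(requests, window=timedelta(minutes=30), min_repeats=3):
    """Flag runs of identical work-request amounts made within `window`.

    `requests` is a time-sorted list of (timestamp, seconds_requested) pairs.
    Returns a list of (amount, first_timestamp, last_timestamp, count).
    """
    flagged = []
    run = []  # current run of identical requests

    def close_run():
        # Record the run only if it repeated often enough to look suspicious.
        if len(run) >= min_repeats:
            flagged.append((run[0][1], run[0][0], run[-1][0], len(run)))

    for ts, amount in requests:
        same_amount = bool(run) and amount == run[0][1]
        in_window = bool(run) and ts - run[0][0] <= window
        if same_amount and in_window:
            run.append((ts, amount))
        else:
            close_run()
            run = [(ts, amount)]
    close_run()
    return flagged

# Hypothetical data: three identical requests in ten minutes, then a different one.
sample = [
    (datetime(2021, 1, 30, 10, 0), 86400.0),
    (datetime(2021, 1, 30, 10, 5), 86400.0),
    (datetime(2021, 1, 30, 10, 10), 86400.0),
    (datetime(2021, 1, 30, 12, 0), 500.0),
]
repeats = find_repeated_requests(sample)
```

With the sample data, `repeats` holds one entry for the three 86400-second requests. A flagged run would be my cue to step in, as suggested in the quote above.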

"You can set those either on your account page at a project web site, or directly through BOINC Manager on your computer. You can't use both techniques at the same time."

Ah, good to know. I'll look into this further. There was additional work being done on limiting network traffic, but I can scale that back.

RE: Rosetta & Cache - Yes, that makes sense. I was working on limiting downloads to the evenings and raised cache limits well above the numbers you mentioned but have already scaled back and will do more.

Thanks again for the details.
2) Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores (Message 102759)
Posted 31 Jan 2021 by HeatForScience
Post:
"If you run more than 1 GPU in your system, and run it from a dual or quad core CPU, you could easily run into issues like that."

I specifically purchased a 32-core (16 w/hyperthreading on) workstation that can handle 4 GPUs for this, plus a bit of gaming, so ... yes, I'll definitely run into this, and other issues. My goal was originally to utilize some of the "waste" electricity that I heat my condo with to do science. So now, the "waste" is Scientific Computing.

"If you have more available CPU threads (eg:6 or more), but only 1 or 2 GPUs, chances are you'll easily make the deadline"

Given my purposes, I'm turning the usual choices for system components upside down. I want to minimize the cost of components (except where quality matters for safety, like a 1600W Platinum power supply) and *maximize* heat output. So far, the closest thing to a perfect component is a power-hungry older GPU that still does a lot of work. My two favorites are a GTX 970 and a Tesla K20c I just picked up this week. (It's been gobbling up MilkyWay@Home WUs in about 1/4 to 1/3 the time of my 1070 Ti, which is used for gaming.)

TL;DR: I want the GPUs running full bore all the time, though I do use overclocking software (MSI Afterburner right now) to dial things back a bit to temper stress on components.

"This is especially the case on CPUs with IGP, where the IGP heats up the CPU to temperatures where the driver needs to lower boost frequencies, to stay within a thermal limitation, or a pre-set power envelope"

The cooling in my systems appears to keep these issues in check, though there's more to be done in that area. I've been studying up on water cooling, but that's probably a rabbit hole I won't go down. I'm fine dialing back the CPUs if needed and will keep an eye on it.

Hey... Everyone needs a hobby, right?

Thank you for the details. This helps a lot.
3) Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores (Message 102743)
Posted 29 Jan 2021 by HeatForScience
Post:
This fills in the picture a bit, though something very odd is going on. I aborted over 900 Rosetta tasks last night that couldn't possibly have ever been completed on time. I don't have a clue how those could have been scheduled. I also aborted around 100 World Community Grid tasks and still have some that may not be started on time.

In trying to keep the GPUs working by limiting the number of concurrent jobs, I have noticed scenarios where I will have available cores and GPUs with GPU tasks that could be assigned, but aren't.

For now, I'm going to let things settle down and see if I can get back to the original scheduling issue that started me on this journey.

One followup question:

"Reducing your cache size to well below the Rosetta deadline may well help."

I'm unsure what you mean by "reducing cache size". The best I can manage is to set a project's "Resource Share" and the number of days of work to store (the "at least" and "additional" values under preferences). Is that what you are referring to, or is there a configuration flag for the number of jobs? That would be quite helpful for Rosetta, which has always been frustrating due to its very short deadlines relative to other projects.
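
To put a rough number on "couldn't possibly have ever been completed on time", here is a back-of-the-envelope Python sketch. The 8-hour run time and 72-hour deadline are placeholder values I picked for illustration, not figures from the project, and the function name is mine. It also assumes every task shares the same deadline and that all threads work on nothing else, so it is an optimistic upper bound.

```python
import math

def max_tasks_before_deadline(threads, hours_per_task, hours_until_deadline):
    """Upper bound on tasks that can finish before a shared deadline."""
    if hours_per_task <= 0:
        raise ValueError("hours_per_task must be positive")
    # Each thread can run this many tasks back to back before the deadline.
    per_thread = math.floor(hours_until_deadline / hours_per_task)
    return threads * per_thread

# 32 threads, assumed 8 h per task, assumed 72 h until the deadline.
capacity = max_tasks_before_deadline(32, 8, 72)  # 32 * 9 = 288
queued = 900
excess = max(0, queued - capacity)  # tasks that cannot make the deadline
```

Under those assumptions the machine tops out at 288 tasks, so a queue of 900 leaves at least 612 that would miss the deadline even in the best case.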
4) Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores (Message 102726)
Posted 28 Jan 2021 by HeatForScience
Post:
I have observed intermittent scheduling issues that result in skipped jobs due to over-commitment of CPU cores. This occasionally means a GPU task is skipped. One example log entry:

1/27/2021 4:44:53 PM | collatz | [cpu_sched_debug] skipping GPU job collatz_sieve_4404021a-de64-41b2-bbb3-0b8228e66814_0; CPU committed

I have run into this on the 64-bit 7.16.11 Windows and 7.16.14 Mac OS versions of the BOINC Manager. I've been able to partially work around the issue by limiting the number of cores for each project inside app_config.xml.
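
For reference, this is roughly the kind of app_config.xml I mean, generated here with a small Python sketch. The `<max_concurrent>` and `<project_max_concurrent>` elements are the standard app_config.xml options; the app name `collatz_sieve` is only a guess based on the task name in the log line above, and the limits are arbitrary example values. The helper name is my own.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_limits, project_limit=None):
    """Return app_config.xml text limiting concurrent tasks.

    app_limits: dict of app name -> max_concurrent for that app.
    project_limit: optional cap across all apps of the project.
    """
    root = ET.Element("app_config")
    for app_name, limit in app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_limit is not None:
        ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

# App name and limits are assumptions for illustration only.
xml_text = build_app_config({"collatz_sieve": 2}, project_limit=4)
```

The resulting text would go in a file named app_config.xml in that project's folder under the BOINC data directory, after which the client needs to re-read its config files.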

An example scenario I created shows the skipping of jobs, though on my machine a GPU job is skipped rather than a CPU job: https://boinc.berkeley.edu/sim_web.php?action=show_scenario&name=188

I have 3 different systems where I see this behavior from time to time and can create more scenarios if helpful.
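
To gauge how often this happens on each system, a small Python sketch like the following can count the skip messages per project in a saved Event Log. The pattern is modeled only on the single log line quoted above, so other client versions or locales may need a different expression; the function name is mine.

```python
import re
from collections import Counter

# Modeled on the one log line quoted in this post (an assumption about format):
# "<date time> | <project> | [cpu_sched_debug] skipping GPU job <task>; CPU committed"
SKIP_RE = re.compile(
    r"^(?P<when>[^|]+?)\s*\|\s*(?P<project>[^|]+?)\s*\|\s*"
    r"\[cpu_sched_debug\] skipping GPU job (?P<task>\S+); CPU committed\s*$"
)

def count_skipped_gpu_jobs(lines):
    """Count 'skipping GPU job ...; CPU committed' messages per project."""
    counts = Counter()
    for line in lines:
        match = SKIP_RE.match(line.strip())
        if match:
            counts[match.group("project")] += 1
    return counts

log_lines = [
    "1/27/2021 4:44:53 PM | collatz | [cpu_sched_debug] skipping GPU job "
    "collatz_sieve_4404021a-de64-41b2-bbb3-0b8228e66814_0; CPU committed",
    "1/27/2021 4:44:54 PM | collatz | some unrelated message",
]
skipped = count_skipped_gpu_jobs(log_lines)
```

Here `skipped` counts one skipped GPU job for collatz and ignores the unrelated line.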




Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.