Please make keeping GPUs fully occupied at all times a priority for the task scheduler

Tuna Ertemalp
Joined: 23 Dec 13
Posts: 45
United States
Message 85800 - Posted: 9 Apr 2018, 16:46:05 UTC

This is somewhat a continuation of https://boinc.berkeley.edu/dev/forum_thread.php?id=10746. The behavior has not changed since then. At the time I was puzzled by the behavior; not so much anymore. It seems to come down entirely to an apparent preference of the task scheduler for keeping the CPU threads busy over keeping the host's GPUs (possibly several of them) busy.

For example, I have this host, among a dozen other mostly multiple-GPU hosts, that has a 16-core AMD ThreadRipper CPU plus four amped-up 1080Ti cards (http://www.primegrid.com/show_host_detail.php?hostid=927928). And, very very VERY frequently, I see some of the GPUs sitting idle, sometimes with only 1 of the 4 being utilized by a task.

Looking at it, it is NOT because there are no GPU tasks. On the contrary, there are hundreds of GPU tasks waiting, from PrimeGrid, Milkyway, SETI@Home, etc. I am one of those who attach to almost all projects (~45 currently) on all my hosts, so there are always plenty of CPU and GPU tasks to pick from.

However, the task scheduler seems to assign tasks to the CPUs first, filling the 32 threads, and if some of those tasks happen to use a GPU, great, as long as enough CPU is left over to satisfy that GPU task's CPU needs. That means if, say, roughly 30 CPU threads are already assigned to CPU tasks across the many projects (because it is those projects' turn to run), only two GPU tasks, each requesting 1 CPU, can be scheduled, leaving two GPUs completely unused until the scheduler looks at things again and actually decides to pick a GPU task instead of yet another CPU task.

Given that the latest GPUs nowadays go for close to $1000, and are much, much faster at solving certain problems when project developers choose to write their app for the GPU, compared to $850 for the latest non-server CPU (about $850/32 ≈ $27 per thread), it is definitely a huge waste to keep the GPUs unoccupied just to keep the CPUs fully utilized.

I am pretty sure this was covered or thought about before (a few quick searches on the message board didn't yield anything obvious), but now might really be the time to pay attention to this need: GPUs should stay fully utilized at all times for the most return on the $ to the projects, and only then should the remaining CPU resources be distributed among the CPU-only tasks, per whatever priority they are assigned now. The answer is NOT and CANNOT be to carefully select a mix of projects, adjust their priorities, etc. Clearly, the field either is moving or has already moved from CPU to GPU computation, and the scheduler should acknowledge that.

Thanks for listening!
Tuna
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 85801 - Posted: 9 Apr 2018, 17:22:06 UTC

Many projects' GPU applications require some support from the CPU, and "starving" the GPU of that support will drop its utilisation quite substantially; for example, the SETI@Home "SoG" application needs almost a full CPU thread available at all times to keep the GPU more or less fully occupied. Try freeing one CPU thread for each GPU task you are running.
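For reference, that reservation can be expressed with an app_config.xml file in the project's directory under the BOINC data folder. The application name below is only a placeholder; the real name has to be taken from the project's client_state.xml:

```xml
<!-- app_config.xml, placed in the project's directory.
     <name> is a placeholder here; use the app name from client_state.xml.
     cpu_usage of 1.0 reserves a full CPU thread per GPU task, so the
     scheduler leaves that thread free instead of filling it with a CPU task. -->
<app_config>
    <app>
        <name>setiathome_v8</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```

After saving the file, "Options → Read config files" in the BOINC Manager applies it without a client restart.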
Tuna Ertemalp
Joined: 23 Dec 13
Posts: 45
United States
Message 85802 - Posted: 9 Apr 2018, 17:40:09 UTC - in response to Message 85801.  
Last modified: 9 Apr 2018, 17:48:24 UTC

EXACTLY my point. *I*, the human, shouldn't need to do anything. When you have this many projects running on a host, you cannot micro-manage your CPUs, GPUs, % usages, available counts of each, etc. That is why there is a "task scheduler". Humans shouldn't have to beat it into submission with manual workarounds just to avoid wasting the mucho $s invested.

All "GPU tasks" already declare how many CPUs and GPUs they require. So if there are 4 GPUs in the host, 32 CPU threads available when nothing is running, and hundreds of GPU tasks in the queue, then the scheduler should: grab the top GPU tasks from the queue (using whatever prioritization algorithm currently selects the next project(s) to run) until all GPUs are full; add up the declared CPU requirements of those tasks and subtract that from 32; then schedule CPU-only tasks onto the remaining threads using that same prioritization algorithm.
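As a rough illustration (my own sketch, not actual BOINC code), the policy described above is just two passes over the already-prioritized queue: a GPU-filling pass that also reserves each GPU task's declared CPU share, then a CPU-only pass over whatever threads are left:

```python
def schedule(tasks, n_gpus=4, n_cpus=32):
    """GPU-first scheduling sketch.

    tasks: list of dicts with declared "gpus" and "cpus" requirements,
    already sorted by whatever project prioritization is in effect.
    Returns the subset of tasks that would run, in priority order.
    """
    running = set()  # indices of scheduled tasks
    free_gpus, free_cpus = n_gpus, n_cpus

    # Pass 1: fill the GPUs first, reserving each task's declared CPU share.
    for i, t in enumerate(tasks):
        if 0 < t["gpus"] <= free_gpus and t["cpus"] <= free_cpus:
            running.add(i)
            free_gpus -= t["gpus"]
            free_cpus -= t["cpus"]

    # Pass 2: hand the remaining threads to CPU-only tasks.
    for i, t in enumerate(tasks):
        if i not in running and t["gpus"] == 0 and t["cpus"] <= free_cpus:
            running.add(i)
            free_cpus -= t["cpus"]

    return [tasks[i] for i in sorted(running)]
```

With 4 single-GPU tasks and 40 CPU tasks queued on a 4-GPU, 32-thread host, this fills all 4 GPUs and still runs 28 CPU tasks on the remaining threads.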

Expecting humans to do things to force an approximation to this behavior, and probably still fail at it, is futile.

Tuna
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 85803 - Posted: 9 Apr 2018, 18:01:40 UTC - in response to Message 85802.  

https://boinc.berkeley.edu/ says "Use the idle time on your computer (Windows, Mac, Linux, or Android) to cure diseases, study global warming, discover pulsars, and do many other types of scientific research." That is what the normal user will find his system doing, with the GPU only being used when his system is idle.

...if there are 4 GPUs in the host...
That's quite an extreme setup that the normal user won't come across easily. So when you do, you're an advanced user, and the advanced user will have to set up his system correctly to make it do what he wants from it. That means adding correctly set-up app_config.xml files per application, per project, possibly app_info.xml files as well, and making sure the resource share between projects is as it should be.

The work scheduler has to cater for a lot of systems out there, and it will never be good enough for everyone. Changes such as you propose would throw the scheduler into disarray for a whole lot of people who are now used to BOINC doing what it does now. Aside from that, BOINC hasn't got the manpower to program something that big and intricate, not with just three volunteer coders available. The one coder who knows all of the insides of the scheduler is too busy with his own project at the moment.

So if you want this done, the best, easiest and cheapest way is to do it yourself, or, if you cannot program, to find a programmer willing to do so. That person would then have to submit it as a pull request on the BOINC GitHub, after which it is up to the PMC to decide whether they want this big a change to BOINC; after that it can be added, tested, debugged, retested and debugged again. But then you're 6 to 12 months down the pipe, if not longer.


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.