exclude_gpu not handled properly by scheduler: gpu allowed to run dry

Author	Message
Joseph Stateson (Volunteer tester; Joined: 27 Jun 08; Posts: 641)
Message 84525 - Posted: 24 Jan 2018, 15:26:36 UTC - Last modified: 24 Jan 2018, 15:56:58 UTC

I found a second case where the BOINC scheduler fails to provide work units to a secondary GPU even though the project has work units available. The GPU is allowed to run empty unless the workload changes on the primary (or just "other") GPU.

The first case I reported here was for 7.8.3 + vbox and has no solution other than manual intervention. That case did not involve "exclude_gpu", however. This second case is for 7.8.3 (no vbox), and the problem seems to be related to "exclude_gpu" and "gpu_usage". I read here that for gpu_usage and cpu_usage, quote: "Note: there is no provision for specifying this per GPU type or per device".

Some history first: I have a GTX 1070 Ti that is set to run a single GPUGRID work unit on an older 4-core system. Tasks take about 10 hours. GPUGRID is frequently out of work, so I set up Einstein to run 3 tasks at a time and set its resource share to 0, so that when GPUGRID is out of work Einstein takes over with <gpu_usage>.333</gpu_usage> <cpu_usage>1.0</cpu_usage>. I did not use 0.25 because this system is used for other things and I wanted a free core.

All was well until I remembered I had an old GTX 770 that was not being used. I knew that GPUGRID could not finish by its deadline on that card and that Einstein took too many cores, so I attached MilkyWay, since it uses about 0.15 of a core and would easily run 3 units on the 770.
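For reference, the Einstein settings quoted above (<gpu_usage>.333</gpu_usage>, <cpu_usage>1.0</cpu_usage>) would normally go in an app_config.xml file in that project's directory. The following is only a sketch: the post does not say which Einstein application was running, so the name `your_einstein_app_name` is a placeholder, not a real application name.

```xml
<!-- app_config.xml, placed in the Einstein project directory (sketch only) -->
<app_config>
  <app>
    <!-- Placeholder: replace with the application's actual short name -->
    <name>your_einstein_app_name</name>
    <gpu_versions>
      <!-- Each task claims one third of a GPU, so up to 3 run at once -->
      <gpu_usage>.333</gpu_usage>
      <!-- Each GPU task reserves one full CPU core -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

As the quoted note says, these values apply to the application as a whole and cannot be set per GPU type or per device, which is why exclude_gpu is needed to steer each project to a specific card.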
I used the following cc_config to route the work units to the proper GPU:

    <cc_config>
      <log_flags>
      </log_flags>
      <options>
        <use_all_gpus>1</use_all_gpus>
        <exclusive_gpu_app>DVDFab.exe</exclusive_gpu_app>
        <allow_remote_gui_rpc>1</allow_remote_gui_rpc>
        <exclude_gpu>
          <url>www.gpugrid.net</url>
          <device_num>1</device_num>
        </exclude_gpu>
        <exclude_gpu>
          <url>https://milkyway.cs.rpi.edu/milkyway/</url>
          <device_num>0</device_num>
        </exclude_gpu>
        <exclude_gpu>
          <url>http://einstein.phys.uwm.edu/</url>
          <device_num>1</device_num>
        </exclude_gpu>
      </options>
    </cc_config>

All seemed well for several days. The MilkyWay queue had about 70 tasks and Einstein had about 40 or so. It even worked fine when GPUGrid had work available: the Einstein tasks stopped and were "waiting to run", as expected, and MilkyWay seemed to work perfectly UP UNTIL ITS QUEUE RAN OUT. The BOINC scheduler failed to ask the MilkyWay project for more work!

I noticed that GPUGrid was running and there were 7 Einstein tasks "waiting to run", but there were no MilkyWay tasks running on the GTX 770. On a hunch I suspended the "waiting to run" Einstein tasks and, sure enough, the BOINC scheduler then asked MilkyWay to provide tasks.

I think this is a design problem in the scheduler. I can get around it on my end by allowing a larger queue size for MilkyWay, but I think there is a limit to how many tasks they send out.

A coding hack would be to have the scheduler check whether THERE ARE ANY EMPTY GPUs and authorize a download for that GPU from the last project that used it. A better fix would be to allow usage to be specified per GPU type, which is currently not possible if I understand this correctly.

[EDIT] I will look at setting the resource share values differently among these 3 projects to see if that helps, but I suspect there is a scheduling problem.
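The "larger queue size" workaround mentioned above could be tried through a local preferences override. This is a rough sketch, assuming a global_prefs_override.xml file in the BOINC data directory; the numbers are illustrative only and are not taken from the post.

```xml
<!-- global_prefs_override.xml in the BOINC data directory (sketch only) -->
<global_preferences>
  <!-- Illustrative value: keep at least this many days of work on hand -->
  <work_buf_min_days>1.0</work_buf_min_days>
  <!-- Illustrative value: fetch up to this many additional days of work -->
  <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>
```

The client has to re-read its local preferences (or be restarted) before the change takes effect. Note that the work buffer is a client-wide setting rather than a per-project one, so it enlarges the queue for every attached project, and any per-host task limit on the project side would still apply.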

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.