GPU not receiving tasks when CPU computing disabled

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 104274 - Posted: 2 May 2021, 18:59:01 UTC Just saw reports that work was flowing more freely, bumped the cache, and got 14 new tasks for iGPU. So that rules out a limit, unless they've been changing settings while we experiment. I've turned off CPU tasks for that profile, so we'll see how it holds up, maybe with another cache boost later (when the rush has died back down again - downloads are busy ATM). ID: 104274 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104275 - Posted: 2 May 2021, 19:04:03 UTC - in response to Message 104268. Last modified: 2 May 2021, 19:06:12 UTC Having work constantly available enabled me to run tests and replicate the issue on command repeatedly. We have scripts for that now! I am guessing you mean requesting new work every X seconds? Mine is a bit more complicated than what people have posted, as it checks and only requests if the requested delay has been exceeded (as well as doing other things). Even with that, you can still fail to get work for quite a few requests in a row when there are only 2000 every 30 minutes. That makes it take longer to show this issue, as not getting work can be for normal reasons. It does, however, give the "Project has no tasks available" message when there really are no tasks available, and it does not give that message with this issue (as you can see from my last post). Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work. Cool! I was thinking about looking into how to do that, to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown with my data collection during the beta. There was a lot more slowdown with AP than with OPN. ID: 104275 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104276 - Posted: 2 May 2021, 19:27:14 UTC - in response to Message 104274. Just saw reports that work was flowing more freely, bumped the cache, and got 14 new tasks for iGPU. So that rules out a limit, unless they've been changing settings while we experiment. I've turned off CPU tasks for that profile, so we'll see how it holds up, maybe with another cache boost later (when the rush has died back down again - downloads are busy ATM). Yeah, I noticed that the downloads are busy. I got tasks for the iGPU on every request until I hit the limit I set (50). I had set it higher to test; I am going to lower it back down because that can be over 2 days of work depending on task length. ID: 104276 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 104277 - Posted: 2 May 2021, 19:28:09 UTC - in response to Message 104275. Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work. Cool! I was thinking about looking how to do that to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown with my data collection during the beta. There was a lot more slowdown with AP than with OPN. Since you mention SETI, I posted the instructions for offline testing in SETI message 2072928 ID: 104277 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104278 - Posted: 2 May 2021, 19:33:03 UTC By the way, the person (binii) I was trying to help, and whose problem I set about replicating, enabled CPU computing and started getting tasks. This was after they had tried many things, including running Linux, all without success. Thanks man! This fixed the problem. I instantly got GPU-packages, when I enabled CPU computing on the web preferences... a bit weird I must admit. Any idea what's the logic behind this? :-) Is it safe to disable CPU computing now? Also (after my response): Makes sense. My friend called me last week and told me that his laptop started using GPU after he enabled CPU computing on the web. It sounded so absurd I didn't even think to test that ID: 104278 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 104279 - Posted: 2 May 2021, 19:53:03 UTC - in response to Message 104278. Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications? That's pretty much me for the evening - the UK has a watercooler appointment with the TV in about 10 minutes. ID: 104279 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104280 - Posted: 2 May 2021, 20:10:34 UTC - in response to Message 104277. Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work. Cool! I was thinking about looking how to do that to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown with my data collection during the beta. There was a lot more slowdown with AP than with OPN. Since you mention SETI, I posted the instructions for offline testing in SETI message 2072928 Thank you! I bookmarked it. :) I have been looking through the files, especially winstringify.h. I have not done any programming with GPUs before, just a fair amount of programming in various languages. Unfortunately I do not have time to learn opencl programming right now to the level that would be necessary. ID: 104280 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104281 - Posted: 2 May 2021, 20:21:39 UTC - in response to Message 104279. Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications? That's pretty much me for the evening - the UK has a watercooler appointment with the TV in about 10 minutes. I am inclined towards WCG, whether that is modifications or using an older version of the scheduler(which may or may not be modified). Yeah, same here. It is pretty late. ID: 104281 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104286 - Posted: 4 May 2021, 7:48:55 UTC - in response to Message 104281. Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications? That's pretty much me for the evening - the UK has a watercooler appointment with the TV in about 10 minutes. I am inclined towards WCG, whether that is modifications or using an older version of the scheduler (which may or may not be modified). Yeah, same here. It is pretty late. I thought of and tried something else today. I also reproduced the issue of no work for the Intel GPU while having CPU enabled. I did this by setting project_max_concurrent=2 (I run 2 Intel GPU units) and setting the preferences to allow only 12% of the CPUs to be used (1 for this 4c/8t CPU). It then did not request CPU tasks due to saturation, but was requesting Intel GPU tasks. It never received any. I saved the sched_ files (along with some other ones) from the last request before exiting BOINC. I then ran BOINC with work_req_seconds set to the highest of the req_secs. It then got Intel GPU tasks on every request until it hit the limit from the WCG profile for it. So it may not actually be because CPU is disabled, but because the server does not send tasks unless work_req_seconds is > 0. ID: 104286 ·
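The hypothesis in the post above can be restated as a toy model. This is a sketch under stated assumptions, not the WCG or BOINC scheduler: the `Request` struct and the functions `server_would_send` and `with_workaround` are hypothetical names, and the gate on `work_req_seconds > 0` is only the behaviour that the experiment suggests.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified scheduler request (not BOINC's real structure).
struct Request {
    double work_req_seconds = 0.0;   // top-level request field
    std::vector<double> req_secs;    // per-resource requests, e.g. CPU, Intel GPU
};

// Assumed server-side gate suggested by the experiment:
// nothing is sent unless work_req_seconds > 0.
bool server_would_send(const Request& r) {
    return r.work_req_seconds > 0.0;
}

// The workaround tried in the post: copy the largest per-resource
// request into work_req_seconds before the request is sent.
Request with_workaround(Request r) {
    if (!r.req_secs.empty()) {
        r.work_req_seconds = *std::max_element(r.req_secs.begin(), r.req_secs.end());
    }
    return r;
}
```

For a request whose `work_req_seconds` is 0 but which carries a non-zero Intel GPU entry in `req_secs`, `server_would_send` returns false before the workaround and true after it, which matches what was observed in the experiment.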

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 104301 - Posted: 4 May 2021, 18:35:33 UTC Last modified: 4 May 2021, 18:50:34 UTC
At long last, a smoking gun. This looks very suggestive:
04/05/2021 17:17:33 | World Community Grid | Host location: none
...
04/05/2021 17:39:45 | World Community Grid | Sending scheduler request: To fetch work.
04/05/2021 17:39:45 | World Community Grid | Requesting new tasks for Intel GPU
04/05/2021 17:39:45 | World Community Grid | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
04/05/2021 17:39:45 | World Community Grid | [sched_op] Intel GPU work request: 7918.15 seconds; 0.00 devices
04/05/2021 17:39:46 | World Community Grid | Scheduler request completed: got 0 new tasks
...
04/05/2021 18:24:17 | World Community Grid | Sending scheduler request: To fetch work.
04/05/2021 18:24:17 | World Community Grid | Requesting new tasks for Intel GPU
04/05/2021 18:24:17 | World Community Grid | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
04/05/2021 18:24:17 | World Community Grid | [sched_op] Intel GPU work request: 11269.92 seconds; 0.00 devices
04/05/2021 18:24:18 | World Community Grid | Scheduler request completed: got 0 new tasks
...
04/05/2021 19:29:18 | World Community Grid | Sending scheduler request: To fetch work.
04/05/2021 19:29:18 | World Community Grid | Requesting new tasks for CPU and Intel GPU
04/05/2021 19:29:18 | World Community Grid | [sched_op] CPU work request: 740.35 seconds; 0.00 devices
04/05/2021 19:29:18 | World Community Grid | [sched_op] Intel GPU work request: 16509.12 seconds; 0.00 devices
04/05/2021 19:29:19 | World Community Grid | Scheduler request completed: got 8 new tasks
04/05/2021 19:29:19 | World Community Grid | [sched_op] estimated total CPU task duration: 21082 seconds
04/05/2021 19:29:19 | World Community Grid | [sched_op] estimated total Intel GPU task duration: 17858 seconds
It does look like you don't only have to enable CPU work, you have to request it as well.
Edit - ouch! That poor little Celeron is going to have its work cut out. Of those 8 tasks, 7 were for the iGPU. Estimated at 41 minutes, but actually running for nearly five hours. And they were all _2, _3, _4 resends, with a 36-hour 'hurry up' deadline. I've cut down the CPU workload. ID: 104301 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104312 - Posted: 5 May 2021, 20:19:24 UTC - in response to Message 104301. It does look like you don't only have to enable CPU work, you have to request it as well. I agree. That is the direction I was headed when I noticed I could reproduce the issue with CPU enabled. Edit - ouch! That poor little Celeron is going to have its work cut out. Of those 8 tasks, 7 were for the iGPU. Estimated at 41 minutes, but actually running for nearly five hours. And they were all _2, _3, _4 resends, with a 36-hour 'hurry up' deadline. I've cut down the CPU workload. That is a short deadline! Unfortunately I have had several server aborts, plus tasks aborted for not starting by the deadline. The cause was getting too many Intel GPU units (on the NVIDIA + Intel GPU machine) after setting the units to unlimited (so that the total could go above 64), combined with the estimates being way off. As an update to this: Oh, and by the way, my machine with 1 NV and 1 Intel GPU is up to 50 NV units and 96(!) Intel GPU units. That is stock BOINC and CPU computing disabled in preferences. It only stopped at 96 due to cache size. I had raised the cache to see how high it would go. I am tempted to raise the cache a bit more just to see if it stops at 100. However, the time estimates are way off, so it will have trouble completing them before the new 3 day deadline. So, I did increase the cache, apparently by too much. I walked away to do something and came back and there were 172(!) Intel GPU units and 38 NV GPU units for a total of 210 between the 2 GPUs. I am not sure that the Intel GPU units have the same 50 per GPU limit. I lowered the cache settings right away; it stopped requesting Intel GPU tasks and it built up to 50 NV. To be clear, this is on the machine running unmodified BOINC 7.6.11. ID: 104312 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104319 - Posted: 7 May 2021, 11:15:08 UTC Hi Richard, I was wondering what you thought about my post earlier about the possible scheduler bug that I found(unrelated to this issue)? It looks like there are two orders for calling sched_work_old and sched_work_locality, but 1 of the times the result is saved before the scheduler is actually called. ID: 104319 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 104320 - Posted: 7 May 2021, 13:26:49 UTC - in response to Message 104319. Hi Richard, I was wondering what you thought about my post earlier about the possible scheduler bug that I found(unrelated to this issue)? It looks like there are two orders for calling sched_work_old and sched_work_locality, but 1 of the times the result is saved before the scheduler is actually called.
Well, I saw it and had a quick look... Sat down to have another read-through (starting my head-spin again), and this is what I think is happening: Send_work goes through some basic checks, and then - starting at the line you link - goes through five possible configurations
1) locality_scheduler_fraction - mixed, locality first.
2) debug_locality - mixed, old scheduler first
3) locality_scheduling
4) sched_old
5) send_work_score
It's only the 'mixed' cases where two separate schedulers are called, and some flags are saved and reused. So only the 'locality' and 'old' schedulers are in play. With the 'locality' scheduler, the 'insufficient' flags are set during the scheduler run, and need to be preserved. But I can't find any sign of the flags being set in any of the other schedulers, including 'old'.
So, although in the 'debug' case, the 'save and reset' precaution is in the wrong place, (send_work_old(); should be around line 1685), in reality there aren't any volatile flags that need to be preserved.
Further, there's only one project that we know of that uses locality scheduling (Einstein) - this scheduler was written, by Bruce Allen, specifically for Einstein. Their main server is using config (1), which has the 'save and reset' in the right order. They may get a bit of a shock if they ever try 'debug' mode in the future, but probably not. Phew. ID: 104320 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104322 - Posted: 7 May 2021, 15:47:20 UTC - in response to Message 104320. Well, I saw it and had a quick look... Sat down to have another read-through (starting my head-spin again), and this is what I think is happening: Send_work goes through some basic checks, and then - starting at the line you link - goes through five possible configurations 1) locality_scheduler_fraction - mixed, locality first. 2) debug_locality - mixed, old scheduler first 3) locality_scheduling 4) sched_old 5) send_work_score It's only the 'mixed' cases where two separate schedulers are called, and some flags are saved and reused. So only the 'locality' and 'old' schedulers are in play. With the 'locality' scheduler, the 'insufficient' flags are set during the scheduler run, and need to be preserved. But I can't find any sign of the flags being set in any of the other schedulers, including 'old'.
Thank you. I should have looked deeper; I missed that send_work_old() does not set those flags. The one exception is that it appears to set g_wreq->no_allowed_apps_available on L168 in sched_array.cpp, in the quick_check() function, which is called by scan_work_array(), which is called by send_work_old().
Perhaps my C++ is rusty, but if they set the flags to false and the other 3 flags do not get set by send_work_old(), won't they be set to false even if send_work_locality() sets them to true?
Example: If send_work_locality() sets g_wreq->disk.insufficient to true:
disk_insufficient = true
g_wreq->disk.insufficient gets set back to false
send_work_old() is called, but does not set the flag.
false && true = false
g_wreq->disk.insufficient gets set to false even though send_work_locality() set it to true
// recombine the 'insufficient' flags from the two schedulers
g_wreq->disk.insufficient = g_wreq->disk.insufficient && disk_insufficient;
g_wreq->speed.insufficient = g_wreq->speed.insufficient && speed_insufficient;
g_wreq->mem.insufficient = g_wreq->mem.insufficient && mem_insufficient;
g_wreq->no_allowed_apps_available = g_wreq->no_allowed_apps_available && no_allowed_apps_available;
So, although in the 'debug' case, the 'save and reset' precaution is in the wrong place, (send_work_old(); should be around line 1685), in reality there aren't any volatile flags that need to be preserved. Further, there's only one project that we know of that uses locality scheduling (Einstein) - this scheduler was written, by Bruce Allen, specifically for Einstein. Their main server is using config (1), which has the 'save and reset' in the right order. They may get a bit of a shock if they ever try 'debug' mode in the future, but probably not. Phew.
That is interesting, I did not know that it was specifically written for Einstein. I guess there is no reason to change anything if Einstein is the only one using it and it is not causing them issues. ID: 104322 ·
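The save, reset, and recombine sequence walked through in the post above can be reproduced in isolation. This is a simplified stand-in using a single boolean, not BOINC's real types or functions; `recombined_flag` and its parameters are hypothetical names, and the OR variant is included only for comparison, not as a proposed fix.

```cpp
// Mirrors the sequence in the post: the first scheduler runs, its flag is
// saved, the flag is reset, the second scheduler runs, and then the saved
// and current values are recombined.
bool recombined_flag(bool first_sets, bool second_sets, bool use_and) {
    bool flag = first_sets;        // first scheduler may set the flag
    const bool saved = flag;       // save the first scheduler's result
    flag = false;                  // reset before the second scheduler
    if (second_sets) flag = true;  // second scheduler may set the flag
    // Recombine: AND as in the quoted snippet, or OR for comparison.
    return use_and ? (flag && saved) : (flag || saved);
}
```

With AND, `recombined_flag(true, false, true)` is false: the flag raised by the first scheduler is lost whenever the second scheduler never sets it, which is the `false && true = false` case from the post. With OR, the same inputs give true. Whether AND is the intended meaning (report 'insufficient' only when both schedulers agree) is not settled in the thread.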

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.