1)
Message boards :
Questions and problems :
GPU not receiving tasks when CPU computing disabled
(Message 104322)
Posted 7 May 2021 by goben_2003

> Well, I saw it and had a quick look...

Thank you. I should have looked deeper; I missed that send_work_old() does not set those flags. Except that it does appear to set g_wreq->no_allowed_apps_available, on L168 of sched_array.cpp in quick_check(), which is called by scan_work_array(), which is called by send_work_old().

Perhaps my C++ is rusty, but if the flags are reset to false and the other 3 flags never get set by send_work_old(), won't they end up false even if send_work_locality() set them to true? Example:

- g_wreq->disk.insufficient gets set back to false
- send_work_old() is called, but does not set the flag
- false && true = false
- g_wreq->disk.insufficient ends up false even though send_work_locality() set it to true

> // recombine the 'insufficient' flags from the two schedulers
> So, although in the 'debug' case, the 'save and reset' precaution is in the wrong place (send_work_old(); should be around line 1685), in reality there aren't any volatile flags that need to be preserved.

That is interesting; I did not know that it was specifically written for Einstein. I guess there is no reason to change anything if Einstein is the only one using it and it is not causing them issues.
2)
(Message 104319)
Posted 7 May 2021 by goben_2003

Hi Richard, I was wondering what you thought about my earlier post on the possible scheduler bug that I found (unrelated to this issue)? It looks like there are two orders for calling send_work_old() and send_work_locality(), but in one of them the result is saved before the scheduler is actually called.
3)
(Message 104312)
Posted 5 May 2021 by goben_2003

> It does look like you don't only have to enable CPU work, you have to request it as well.

I agree. That is the direction I was headed when I noticed I could reproduce the issue with CPU enabled.

> Edit - ouch! That poor little celeron is going to have its work cut out. Of those 8 tasks, 7 were for the iGPU. Estimated at 41 minutes, but actually running for nearly five hours. And they were all _2, _3, _4 resends, with a 36-hour 'hurry up' deadline. I've cut down the CPU workload.

That is a short deadline! Unfortunately I have had several server aborts, plus tasks aborted for not starting by the deadline, after getting too many Intel GPU units (on the NVIDIA + Intel GPU machine) by setting the limit to unlimited (so that the total units could be above 64), combined with the estimates being way off.

As an update to this:

> Oh, and by the way, my machine with 1 NV and 1 Intel GPU is up to 50 NV units and 96(!) Intel GPU units. That is stock BOINC with CPU computing disabled in preferences. It only stopped at 96 due to cache size. I had raised the cache to see how high it would go. I am tempted to raise the cache a bit more just to see if it stops at 100. However, the time estimates are way off, so it will have trouble completing them before the new 3-day deadline.

So, I did increase the cache, apparently by too much. I walked away to do something, and when I came back there were 172(!) Intel GPU units and 38 NV GPU units, for a total of 210 between the 2 GPUs. I am not sure the Intel GPU units have the same 50-per-GPU limit. I lowered the cache settings right away; it stopped requesting Intel GPU tasks and built back up to 50 NV. To be clear, this is on the machine running unmodified BOINC 7.6.11.
4)
(Message 104286)
Posted 4 May 2021 by goben_2003

> Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications?

I thought of and tried something else today. I also reproduced the issue of no work for the Intel GPU while having CPU enabled. I did this by setting project_max_concurrent=2 (I run 2 Intel GPU units) and setting the preferences to only allow 12% of the CPUs to be used (1 for this 4c/8t CPU). It then did not request CPU tasks, due to saturation, but was requesting Intel GPU tasks. It never received any.

I saved the sched_ files (along with some other ones) from the last request before exiting BOINC. I then ran BOINC with the modification that sets work_req_seconds to the highest of the req_secs. It then got Intel GPU tasks on every request, until it hit the limit from the WCG profile. So it may not actually be because CPU is disabled, but because the scheduler does not send tasks unless work_req_seconds is > 0.
5)
(Message 104281)
Posted 2 May 2021 by goben_2003

> Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications?

I am inclined towards WCG, whether that is modifications or using an older version of the scheduler (which may or may not be modified).

Yeah, same here. It is pretty late.
6)
(Message 104280)
Posted 2 May 2021 by goben_2003

> Since you mention SETI, I posted the instructions for offline testing in SETI message 2072928
> Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work.

Cool! I was thinking about looking into how to do that, to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown by my data collection during the beta. There was a lot more slowdown with AP than with OPN.

Thank you! I bookmarked it. :)

I have been looking through the files, especially winstringify.h. I have not done any programming with GPUs before, just a fair amount of programming in various languages. Unfortunately I do not have time right now to learn OpenCL programming to the level that would be necessary.
7)
(Message 104278)
Posted 2 May 2021 by goben_2003

By the way, the person (binii) whom I was trying to help, and whose problem I set about replicating, enabled CPU computing and started getting tasks. This was after having tried many things, including running Linux - all without success.

> Thanks man! This fixed the problem. I instantly got GPU packages when I enabled CPU computing in the web preferences... a bit weird, I must admit. Any idea what's the logic behind this? :-) Is it safe to disable CPU computing now?

Also (after my response):

> Makes sense. My friend called me last week and told me that his laptop started using the GPU after he enabled CPU computing on the web. It sounded so absurd I didn't even think to test it.
8)
(Message 104276)
Posted 2 May 2021 by goben_2003

> Just saw reports that work was flowing more freely, bumped the cache, and got 14 new tasks for iGPU. So that rules out a limit, unless they've been changing settings while we experiment. I've turned off CPU tasks for that profile, so we'll see how it holds up, maybe with another cache boost later (when the rush has died back down again - downloads are busy ATM).

Yeah, I noticed that the downloads are busy. I got tasks for the iGPU on every request until I hit the limit I set (50). I had set it higher to test; I am going to lower it back down, because 50 can be over 2 days of work depending on task length.
9)
(Message 104275)
Posted 2 May 2021 by goben_2003

> > Having work constantly available enabled me to run tests and replicate the issue on command repeatedly.
> We have scripts for that now!

I am guessing you mean requesting new work every X seconds? Mine is a bit more complicated than what people have posted, as it checks and only requests if the requested delay has been exceeded (as well as doing other things). Even with that, you can still get no work for quite a few requests in a row when there are only 2,000 tasks every 30 minutes. That makes this issue take longer to show, since not getting work can happen for normal reasons. Although the server does give "Project has no tasks available" when there genuinely are no tasks available, and it does not give that message with this issue (as you can see from my last post).

> Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work.

Cool! I was thinking about looking into how to do that, to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown by my data collection during the beta. There was a lot more slowdown with AP than with OPN.
10)
(Message 104273)
Posted 2 May 2021 by goben_2003

I restarted it into the mode where it sets the work_req to the highest req_secs (without it being anonymous platform).

Here are the last 2 requests before restarting it:

02-May-2021 21:36:47 [World Community Grid] update requested by user

Here are the first 2 from after (technically 3, though the first failed to make contact):

02-May-2021 21:42:51 [World Community Grid] update requested by user
11)
(Message 104272)
Posted 2 May 2021 by goben_2003

> > This seems to be behaving as expected - CPU and Intel GPU computing is enabled and you get both CPU and Intel GPU. Since I set my test machine back to stock boinc this morning (almost 13 hours ago) it has not gotten Intel GPU tasks.
> Are you running any sort of 'retry' automation? Otherwise, the backoffs will cut you down to very few requests.

Affirmative; otherwise the backoffs would be even worse with the extra undersea cables I have to go through.
12)
(Message 104270)
Posted 2 May 2021 by goben_2003

> Now I'm up to five iGPU tasks:

This seems to be behaving as expected - CPU and Intel GPU computing are enabled and you get both CPU and Intel GPU tasks. Since I set my test machine back to stock BOINC this morning (almost 13 hours ago), it has not gotten Intel GPU tasks.
13)
(Message 104267)
Posted 2 May 2021 by goben_2003

> > If the end of work is that soon, is it time to post about the intel_gpu issue in the WCG forums? Or did you want to do some more testing first?
> It's not the end of work, just the end of the stress test. Then back to a trickle of 2,000 every half hour, or whatever it was. I'd imagine they'd want to process the resulting server load issues first: I'd imagine it'll be better to wait until we have a constructive diagnosis to pass on.

Sorry, I meant the end of near-constant work availability. Having work constantly available enabled me to run tests and replicate the issue on command repeatedly.
14)
(Message 104264)
Posted 2 May 2021 by goben_2003

Kevin just said: Happy May Day! :)

If the end of work is that soon, is it time to post about the intel_gpu issue in the WCG forums? Or did you want to do some more testing first?
15)
(Message 104263)
Posted 2 May 2021 by goben_2003

> The 'school' venue is one I usually reserve for an Android tablet to run CPU tasks. It was set to maximum 2 tasks in WCG device profiles, but by 17:21 I'd realised that and removed the restriction. The only limit I can think of after that would be 'four per (intel) GPU', which has never been mentioned, and I think I've seen exceeded on 'big' machines.

Yes, I have gotten a lot more than four Intel GPU units. When I set the limit to 50 and did any of the 3 things I mentioned earlier to get Intel GPU tasks, it kept getting them until it reached the limit of 50 I had set in the WCG profile. (I have seen it get much higher, but I have avoided trying to figure out why, as it is harder to chase 2 potential scheduler issues at the same time.)
16)
(Message 104262)
Posted 2 May 2021 by goben_2003

> But I can't get beyond here. Computer is a quad-core plus iGPU: I wanted to run 3xCPU + iGPU, but instead I've got 2xCPU (both running) and 4xiGPU (one running).

And I was mistaken before. This can come from the config.xml limits or from the user's project preferences.

sched_types.h L492:

    bool max_jobs_exceeded() {
        if (max_jobs_on_host_exceeded) return true;
        for (int i=0; i<NPROC_TYPES; i++) {
            if (max_jobs_on_host_proc_type_exceeded[i]) return true;
        }
        return false;
    }

sched_send.cpp L783:

    // check user-specified project prefs limit on # of jobs in progress
    //
    int mj = g_wreq->project_prefs.max_jobs_in_progress;
    if (mj && config.max_jobs_in_progress.project_limits.total.njobs >= mj) {
        if (config.debug_send) {
            log_messages.printf(MSG_NORMAL,
                "[send] user project preferences job limit exceeded\n"
            );
        }
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;
    }

    <snip>

    if (!some_type_allowed) {
        if (config.debug_send) {
            log_messages.printf(MSG_NORMAL,
                "[send] config.xml max_jobs_in_progress limit exceeded\n"
            );
        }
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;
    }
17)
(Message 104261)
Posted 2 May 2021 by goben_2003

By the way, I noticed something while looking in sched_send.cpp. Take a look and see what you think. This is from sched_send.cpp L1645. I apologize for the formatting; I chose quote so I could bold where the call to send_work_{old | locality} sits in relation to the rest.

This part seems fine:

    if (drand() < config.locality_scheduler_fraction) {

However this one does not:

    else {

Notice how it says it is saving the 'insufficient' flags from the first scheduler, but it calls the first scheduler after it saves the flags. I am not saying it is affecting us in this case, as I do not know how config.locality_scheduling, config.sched_old, or config.locality_scheduler_fraction are set. Also, if we do get to this section, it appears the effect would be that the 'insufficient' flags are always false when drand() sends it to the second part, making the "No tasks are available for the applications you have selected." message not show up even when it should.
18)
(Message 104259)
Posted 2 May 2021 by goben_2003

> > I would take this as the case you mentioned before. Meaning where the scheduler happens to not have work (for the CPU or Intel GPU, with whichever WCG projects are selected) at the precise time when you requested it.
> The problem with that one is that it's so badly implemented (at all projects, not just WCG) that it chucks out every possible excuse. That last 17:21 reply, in full, was:

I could be following the code wrong, but I think that both of those can be true.

"The computer has reached a limit on tasks in progress" comes from either the per-host limit or a per-processor-type limit being exceeded:

sched_types.h:

    bool max_jobs_exceeded() {
        if (max_jobs_on_host_exceeded) return true;
        for (int i=0; i<NPROC_TYPES; i++) {
            if (max_jobs_on_host_proc_type_exceeded[i]) return true;
        }
        return false;
    }

"No tasks available" - either the shared memory was not ready, or it searched through wu_results and did not find any available:

shmem.cpp L328:

    // see if there's any work.
    // If there is, reserve it for this process
    // (if we don't do this, there's a race condition where lots
    // of servers try to get a single work item)
    //
    bool SCHED_SHMEM::no_work(int pid) {
        if (!ready) return true;
        for (int i=0; i<max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
19)
(Message 104258)
Posted 2 May 2021 by goben_2003

> But I can't get beyond here. Computer is a quad-core plus iGPU: I wanted to run 3xCPU + iGPU, but instead I've got 2xCPU (both running) and 4xiGPU (one running).

Other than it coming from Job Limits, we have observations but not an explicit declaration from WCG.
20)
(Message 104254)
Posted 2 May 2021 by goben_2003

> And we have lift-off:

Did you only get tasks when the CPU was enabled?

> Or maybe not:

I would take this as the case you mentioned before. Meaning where the scheduler happens to not have work (for the CPU or Intel GPU, with whichever WCG projects are selected) at the precise time when you requested it. I think that message comes from sched_send.cpp L1295, in the "if client asked for work and we're not sending any, explain why" section:

    if (g_wreq->no_allowed_apps_available) {
        g_reply->insert_message(
            _("No tasks are available for the applications you have selected."),
            "low"
        );
    }
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.