1)
Message boards :
Questions and problems :
GPU not receiving tasks when CPU computing disabled
(Message 104322)
Posted 7 May 2021 by goben_2003

> Well, I saw it and had a quick look...

Thank you. I should have looked deeper; I missed that send_work_old() does not set those flags. Except that it does appear to set g_wreq->no_allowed_apps_available, on L168 of sched_array.cpp in quick_check(), which is called by scan_work_array(), which is called by send_work_old().

Perhaps my C++ is rusty, but if the flags are reset to false and the other 3 flags never get set by send_work_old(), won't they end up false even if send_work_locality() set them to true? Example:

- g_wreq->disk.insufficient gets set back to false
- send_work_old() is called, but does not set the flag
- false && true = false
- g_wreq->disk.insufficient ends up false even though send_work_locality() set it to true

> // recombine the 'insufficient' flags from the two schedulers
> So, although in the 'debug' case, the 'save and reset' precaution is in the wrong place (send_work_old(); should be around line 1685), in reality there aren't any volatile flags that need to be preserved.

That is interesting; I did not know that it was specifically written for Einstein. I guess there is no reason to change anything if Einstein is the only one using it and it is not causing them issues.
2)
(Message 104319)
Posted 7 May 2021 by goben_2003

Hi Richard, I was wondering what you thought about my earlier post on the possible scheduler bug that I found (unrelated to this issue)? It looks like there are two orders for calling send_work_old() and send_work_locality(), but in one of them the result is saved before the scheduler is actually called.
3)
(Message 104312)
Posted 5 May 2021 by goben_2003

> It does look like you don't only have to enable CPU work, you have to request it as well.

I agree. That is the direction I was headed when I noticed I could reproduce the issue with CPU enabled.

> Edit - ouch! That poor little celeron is going to have its work cut out. Of those 8 tasks, 7 were for the iGPU. Estimated at 41 minutes, but actually running for nearly five hours. And they were all _2, _3, _4 resends, with a 36-hour 'hurry up' deadline. I've cut down the CPU workload.

That is a short deadline! Unfortunately I have had several server aborts, plus tasks aborted for not starting by the deadline, after getting too many Intel GPU units (on the NVIDIA + Intel GPU machine) by setting the limit to unlimited (so that the total units could be above 64), combined with the estimates being way off.

As an update to this:

> Oh, and by the way, my machine with 1 NV and 1 Intel GPU is up to 50 NV units and 96(!) Intel GPU units. That is stock BOINC with CPU computing disabled in preferences. It only stopped at 96 due to cache size. I had raised the cache to see how high it would go. I am tempted to raise the cache a bit more just to see if it stops at 100. However, the time estimates are way off, so it will have trouble completing them before the new 3-day deadline.

So, I did increase the cache, apparently by too much. I walked away to do something, and when I came back there were 172(!) Intel GPU units and 38 NV GPU units, for a total of 210 between the 2 GPUs. I am not sure the Intel GPU units have the same 50-per-GPU limit. I lowered the cache settings right away; it stopped requesting Intel GPU tasks and built back up to 50 NV. To be clear, this is on the machine running unmodified BOINC 7.6.11.
4)
(Message 104286)
Posted 4 May 2021 by goben_2003

> Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications?

I thought of and tried something else today. I also reproduced the issue of no work for the Intel GPU while having CPU enabled. I did this by setting project_max_concurrent=2 (I run 2 Intel GPU units) and setting the preferences to only allow 12% of the CPUs to be used (1 for this 4c/8t CPU). It then did not request CPU tasks, due to saturation, but was requesting Intel GPU tasks. It never received any.

I saved the sched_ files (along with some other ones) from the last request before exiting BOINC. I then ran BOINC with the modification that sets work_req_seconds to the highest of the req_secs. It then got Intel GPU tasks on every request, until it hit the limit from the WCG profile. So it may not actually be because CPU is disabled, but because the scheduler does not send tasks unless work_req_seconds is > 0.
5)
(Message 104281)
Posted 2 May 2021 by goben_2003

> Yes, I think we can pretty well conclude that the effect is real - there's definitely a causal link between 'no CPU work' and 'no Intel GPU work'. But the question we asked ourselves at the beginning was - is that WCG's fault, or BOINC's fault? (I think we can rule out the client, by now). Neither of us has found a smoking gun in the BOINC code, so are we inclining towards the WCG modifications?

I am inclined towards WCG, whether that is modifications or using an older version of the scheduler (which may or may not be modified).

Yeah, same here. It is pretty late.
6)
(Message 104280)
Posted 2 May 2021 by goben_2003

> Since you mention SETI, I posted the instructions for offline testing in SETI message 2072928
> Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work.

Cool! I was thinking about looking into how to do that, to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown by my data collection during the beta. There was a lot more slowdown with AP than with OPN.

Thank you! I bookmarked it. :)

I have been looking through the files, especially winstringify.h. I have not done any programming with GPUs before, just a fair amount of programming in various languages. Unfortunately I do not have time right now to learn OpenCL programming to the level that would be necessary.
7)
(Message 104278)
Posted 2 May 2021 by goben_2003

By the way, the person (binii) whom I was trying to help, and whose problem I set about replicating, enabled CPU computing and started getting tasks. This was after having tried many things, including running Linux - all without success.

> Thanks man! This fixed the problem. I instantly got GPU packages when I enabled CPU computing in the web preferences... a bit weird, I must admit. Any idea what's the logic behind this? :-) Is it safe to disable CPU computing now?

Also (after my response):

> Makes sense. My friend called me last week and told me that his laptop started using the GPU after he enabled CPU computing on the web. It sounded so absurd I didn't even think to test it.
8)
(Message 104276)
Posted 2 May 2021 by goben_2003

> Just saw reports that work was flowing more freely, bumped the cache, and got 14 new tasks for iGPU. So that rules out a limit, unless they've been changing settings while we experiment. I've turned off CPU tasks for that profile, so we'll see how it holds up, maybe with another cache boost later (when the rush has died back down again - downloads are busy ATM).

Yeah, I noticed that the downloads are busy. I got tasks for the iGPU on every request until I hit the limit I set (50). I had set it higher to test; I am going to lower it back down, because 50 can be over 2 days of work depending on task length.
9)
(Message 104275)
Posted 2 May 2021 by goben_2003

> > Having work constantly available enabled me to run tests and replicate the issue on command repeatedly.
> We have scripts for that now!

I am guessing you mean requesting new work every X seconds? Mine is a bit more complicated than what people have posted, as it checks and only requests if the requested delay has been exceeded (as well as doing other things). Even with that, you can still get no work for quite a few requests in a row when there are only 2,000 tasks every 30 minutes. That makes this issue take longer to show, since not getting work can happen for normal reasons. Although the server does give "Project has no tasks available" when there genuinely are no tasks available, and it does not give that message with this issue (as you can see from my last post).

> Also, I converted one of the early Betas to run offline at a command prompt, which removed the dependency on new work.

Cool! I was thinking about looking into how to do that, to run the tests in the link grumpy_swede posted about how much running CPU tasks can slow down the iGPU. I ended up just running them with the SETI AP WUs, but the slowdown was not representative of what was shown by my data collection during the beta. There was a lot more slowdown with AP than with OPN.
10)
(Message 104273)
Posted 2 May 2021 by goben_2003

I restarted it into the mode where it sets the work_req to the highest req_secs (without it being anonymous platform).

Here are the last 2 requests before restarting it:

02-May-2021 21:36:47 [World Community Grid] update requested by user

Here are the first 2 from after (technically 3, though the first failed to make contact):

02-May-2021 21:42:51 [World Community Grid] update requested by user
11)
(Message 104272)
Posted 2 May 2021 by goben_2003

> > This seems to be behaving as expected - CPU and Intel GPU computing is enabled and you get both CPU and Intel GPU. Since I set my test machine back to stock boinc this morning (almost 13 hours ago) it has not gotten Intel GPU tasks.
> Are you running any sort of 'retry' automation? Otherwise, the backoffs will cut you down to very few requests.

Affirmative; otherwise the backoffs would be even worse with the extra undersea cables I have to go through.
12)
(Message 104270)
Posted 2 May 2021 by goben_2003

> Now I'm up to five iGPU tasks:

This seems to be behaving as expected - CPU and Intel GPU computing are enabled and you get both CPU and Intel GPU tasks. Since I set my test machine back to stock BOINC this morning (almost 13 hours ago), it has not gotten Intel GPU tasks.
13)
(Message 104267)
Posted 2 May 2021 by goben_2003

> > If the end of work is that soon, is it time to post about the intel_gpu issue in the WCG forums? Or did you want to do some more testing first?
> It's not the end of work, just the end of the stress test. Then back to a trickle of 2,000 every half hour, or whatever it was. I'd imagine they'd want to process the resulting server load issues first: I'd imagine it'll be better to wait until we have a constructive diagnosis to pass on.

Sorry, I meant the end of near-constant work availability. Having work constantly available enabled me to run tests and replicate the issue on command repeatedly.
14)
(Message 104264)
Posted 2 May 2021 by goben_2003

Kevin just said: Happy May Day! :)

If the end of work is that soon, is it time to post about the intel_gpu issue in the WCG forums? Or did you want to do some more testing first?
15)
(Message 104263)
Posted 2 May 2021 by goben_2003

> The 'school' venue is one I usually reserve for an Android tablet to run CPU tasks. It was set to maximum 2 tasks in WCG device profiles, but by 17:21 I'd realised that and removed the restriction. The only limit I can think of after that would be 'four per (intel) GPU', which has never been mentioned, and I think I've seen exceeded on 'big' machines.

Yes, I have gotten a lot more than four Intel GPU units. When I set the limit to 50 and did any of the 3 things I mentioned earlier to get Intel GPU tasks, it kept getting them until it reached the limit of 50 I had set in the WCG profile. (I have seen it get much higher, but I have avoided trying to figure out why, as it is harder to chase 2 potential scheduler issues at the same time.)
16)
(Message 104262)
Posted 2 May 2021 by goben_2003

> But I can't get beyond here. Computer is a quad-core plus iGPU: I wanted to run 3xCPU + iGPU, but instead I've got 2xCPU (both running) and 4xiGPU (one running).

And I was mistaken before. This can come from the config.xml limits or from the user's project preferences.

sched_types.h L492:

    bool max_jobs_exceeded() {
        if (max_jobs_on_host_exceeded) return true;
        for (int i=0; i<NPROC_TYPES; i++) {
            if (max_jobs_on_host_proc_type_exceeded[i]) return true;
        }
        return false;
    }

sched_send.cpp L783:

    // check user-specified project prefs limit on # of jobs in progress
    //
    int mj = g_wreq->project_prefs.max_jobs_in_progress;
    if (mj && config.max_jobs_in_progress.project_limits.total.njobs >= mj) {
        if (config.debug_send) {
            log_messages.printf(MSG_NORMAL,
                "[send] user project preferences job limit exceeded\n"
            );
        }
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;
    }

    <snip>

    if (!some_type_allowed) {
        if (config.debug_send) {
            log_messages.printf(MSG_NORMAL,
                "[send] config.xml max_jobs_in_progress limit exceeded\n"
            );
        }
        g_wreq->max_jobs_on_host_exceeded = true;
        return false;
    }
17)
(Message 104261)
Posted 2 May 2021 by goben_2003

By the way, I noticed something while looking in sched_send.cpp. Take a look and see what you think. This is from sched_send.cpp L1645. I apologize for the formatting; I chose quote so I could bold where the call to send_work_{old | locality} sits in relation to the rest.

This part seems fine:

    if (drand() < config.locality_scheduler_fraction) {

However this one does not:

    else {

Notice how it says it is saving the 'insufficient' flags from the first scheduler, but it calls the first scheduler after it saves the flags. I am not saying it is affecting us in this case, as I do not know how config.locality_scheduling, config.sched_old, or config.locality_scheduler_fraction are set. Also, if we do get to this section, it appears the effect would be that the 'insufficient' flags are always false when drand() sends it to the second part, making the "No tasks are available for the applications you have selected." message not show up even when it should.
18)
(Message 104259)
Posted 2 May 2021 by goben_2003

> > I would take this as the case you mentioned before. Meaning where the scheduler happens to not have work (for the CPU or Intel GPU, with whichever WCG projects are selected) at the precise time when you requested it.
> The problem with that one is that it's so badly implemented (at all projects, not just WCG) that it chucks out every possible excuse. That last 17:21 reply, in full, was:

I could be following the code wrong, but I think that both of those can be true.

"The computer has reached a limit on tasks in progress" comes from either the per-host limit or a per-processor-type limit being exceeded:

sched_types.h:

    bool max_jobs_exceeded() {
        if (max_jobs_on_host_exceeded) return true;
        for (int i=0; i<NPROC_TYPES; i++) {
            if (max_jobs_on_host_proc_type_exceeded[i]) return true;
        }
        return false;
    }

"No tasks available" - either the shared memory was not ready, or it searched through wu_results and did not find any available:

shmem.cpp L328:

    // see if there's any work.
    // If there is, reserve it for this process
    // (if we don't do this, there's a race condition where lots
    // of servers try to get a single work item)
    //
    bool SCHED_SHMEM::no_work(int pid) {
        if (!ready) return true;
        for (int i=0; i<max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
19)
(Message 104258)
Posted 2 May 2021 by goben_2003

> But I can't get beyond here. Computer is a quad-core plus iGPU: I wanted to run 3xCPU + iGPU, but instead I've got 2xCPU (both running) and 4xiGPU (one running).

Other than it coming from Job Limits, we have observations but not an explicit declaration from WCG.
20)
(Message 104254)
Posted 2 May 2021 by goben_2003

> And we have lift-off:

Did you only get tasks when the CPU was enabled?

> Or maybe not:

I would take this as the case you mentioned before. Meaning where the scheduler happens to not have work (for the CPU or Intel GPU, with whichever WCG projects are selected) at the precise time when you requested it. I think that message comes from sched_send.cpp L1295, in the "if client asked for work and we're not sending any, explain why" section:

    if (g_wreq->no_allowed_apps_available) {
        g_reply->insert_message(
            _("No tasks are available for the applications you have selected."),
            "low"
        );
    }
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.