GPU not receiving tasks when CPU computing disabled

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 376	Message 104194 - Posted: 30 Apr 2021, 10:22:49 UTC - in response to Message 104192. I have absolutely no problems getting iGPU tasks. I have always had CPU tasks set to NO. I have one iGPU, and one discrete GTX980 in the same computer, and I have tested to set even the GTX980 (NVIDIA) to NO, and keep the iGPU to YES. No problems getting tasks for the iGPU then either. I tried the other way around too, and no problems getting tasks for NVIDIA. So, in my case, CPU NO, and NVIDIA and iGPU to YES, delivers what both GPU's want. (even if I have no AMD GPU, I have AMD set to YES also) Hi Grumpy_Swede, good to see you over here too! :) I am only able to reproduce this issue when the iGPU is the only GPU that the machine has. The same goben. I've been here on the BOINC site for a long time. Have you set all GPU settings to YES, even if you don't have any NVIDIA or AMD GPU's in that computer? Early in BETA Uplinger mentioned something about that. ID: 104194 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104195 - Posted: 30 Apr 2021, 10:28:00 UTC - in response to Message 104194. I have absolutely no problems getting iGPU tasks. I have always had CPU tasks set to NO. I have one iGPU, and one discrete GTX980 in the same computer, and I have tested to set even the GTX980 (NVIDIA) to NO, and keep the iGPU to YES. No problems getting tasks for the iGPU then either. I tried the other way around too, and no problems getting tasks for NVIDIA. So, in my case, CPU NO, and NVIDIA and iGPU to YES, delivers what both GPU's want. (even if I have no AMD GPU, I have AMD set to YES also) Hi Grumpy_Swede, good to see you over here too! :) I am only able to reproduce this issue when the iGPU is the only GPU that the machine has. The same goben. I've been here on the BOINC site for a long time. Have you set all GPU settings to YES, even if you don't have any NVIDIA or AMD GPU's in that computer? Early in BETA Uplinger mentioned something about that.
Yes, I have all 4 settings under Graphics Card Usage set to Yes. Meaning:
Graphics Card Usage
Do work on my graphics card while computer is in use? Yes
Use my AMD/ATI graphics card if possible: Yes
Use my Intel graphics card if possible: Yes
Use my NVIDIA graphics card if possible: Yes
In case it was not clear earlier in the thread: there is only one thing I change in the project preferences to switch between getting the Intel GPU tasks and not getting them: "Allow research to run on my CPU?"
When set to No -> no GPU tasks are received
When set to Yes -> Intel GPU tasks are received
ID: 104195 ·

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 376	Message 104196 - Posted: 30 Apr 2021, 10:34:46 UTC - in response to Message 104195. Last modified: 30 Apr 2021, 10:41:56 UTC Yes, I have all 4 settings under Graphics Card Usage to Yes. Meaning: Graphics Card Usage Do work on my graphics card while computer is in use? Yes Use my AMD/ATI graphics card if possible: Yes Use my Intel graphics card if possible: Yes Use my NVIDIA graphics card if possible: Yes If it was not clear earlier in the thread. There is only one thing I change in the project preferences to switch between getting the intel GPU tasks and not getting the intel GPU tasks: "Allow research to run on my CPU?" When set to No -> No gpu tasks are received When set to Yes-> intel gpu tasks are received
OK! Also, early on in the BETA, both Richard and I found it very hard to start getting any iGPU tasks at one point. What we did then was to set the BOINC settings "Store at least" and "Store up to an additional" to very low numbers: "Store at least" to 0.05 or 0.10 days, and "Store up to an additional" to 0.01 days. Then tasks for the iGPU started to flow. Once it was working, it was possible to increase those numbers again.
Edit, added: But then of course iGPU tasks are not something you get at every request, since WCG has the strange idea that an iGPU must be matched with another iGPU. There are a lot fewer iGPUs than NVIDIA and AMD GPUs, by the looks of things. The same goes for NVIDIA and AMD: they must be matched with the same device type. ID: 104196 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104197 - Posted: 30 Apr 2021, 10:50:09 UTC - in response to Message 104196. Last modified: 30 Apr 2021, 10:51:52 UTC Yes, I have all 4 settings under Graphics Card Usage to Yes. Meaning: Graphics Card Usage Do work on my graphics card while computer is in use? Yes Use my AMD/ATI graphics card if possible: Yes Use my Intel graphics card if possible: Yes Use my NVIDIA graphics card if possible: Yes If it was not clear earlier in the thread. There is only one thing I change in the project preferences to switch between getting the intel GPU tasks and not getting the intel GPU tasks: "Allow research to run on my CPU?" When set to No -> No gpu tasks are received When set to Yes-> intel gpu tasks are received OK! Also early on in the BETA, both Richard and I found it very hard to start getting any iGPU tasks at one point. What we did then, was to set the settings in BOINC for "Store at least" and "Store up to an additional" to very low numbers. "Store at least" to 0.05 or 0.10 days, and "Store up to an additional" to 0.01 days. Then tasks for the iGPU started to flow. After it started working then it was possible to increase those numbers. Edit, added: But then of course iGPU tasks is not something you get at every request, since WCG has the strange idea that iGPU must be matched with another iGPU. There's a lot fewer iGPU's than NVIDIA and AMD by the looks of things. Same for NVIDIA and AMD. They must be matched with the same device type, The thing is that I do not find it hard to get iGPU units. I always get them (up until the max # I have set) when CPU is enabled. When CPU is disabled I never get them. This is the case whether I have set 0.25+0.05 days or 0.75+0.75 days. If I have it set really low, like 0.10+0.01 days, this still occurs(other than being up until it gets to 0.11 days instead of the max # of tasks I have set for opn since the igpu takes a while to complete each igpu task.) 
Thus I take this as a server side issue. Please note that I only discovered this trying to replicate the problem that another wcg user was having getting iGPU units on his machine that only had an iGPU. This is because I only had CPU set to No on my machine with the nvidia + intel. Edit: changed "to run" to "to complete each igpu task" ID: 104197 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104198 - Posted: 30 Apr 2021, 10:54:14 UTC - in response to Message 104195. Last modified: 30 Apr 2021, 11:08:32 UTC If it was not clear earlier in the thread. There is only one thing I change in the project preferences to switch between getting the intel GPU tasks and not getting the intel GPU tasks: "Allow research to run on my CPU?" When set to No -> No gpu tasks are received When set to Yes-> intel gpu tasks are received I think that is possibly a specific WCG problem with the implementation of those controls: their way of transferring those choices from the web site to the scheduler will be unique to them (because their website is unique). But I still intend to look through the generic BOINC server code when I get a chance, but it'll be slow - I have open diagnosis calls to both Einstein and CPDN on the go at the moment. Edit - ah, I'd forgotten that WCG also has a user selection for 'max tasks in progress'. That's another complication: I was going to say 'unique to WCG', but I've recently seen it on 'traditional' BOINC projects too. It'll be a thread on this board - can anyone remind me while I finish breakfast? OK, you can stop looking: message 103817 ID: 104198 ·

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 376	Message 104199 - Posted: 30 Apr 2021, 10:56:44 UTC - in response to Message 104197. The thing is that I do not find it hard to get iGPU units. I always get them (up until the max # I have set) when CPU is enabled. When CPU is disabled I never get them. This is the case whether I have set 0.25+0.05 days or 0.75+0.75 days. If I have it set really low, like 0.10+0.01 days, this still occurs(other than being up until it gets to 0.11 days instead of the max # of tasks I have set for opn since the igpu takes a while to complete each igpu task.) Thus I take this as a server side issue. Please note that I only discovered this trying to replicate the problem that another wcg user was having getting iGPU units on his machine that only had an iGPU. This is because I only had CPU set to No on my machine with the nvidia + intel. Edit: changed "to run" to "to complete each igpu task" OK, got it. Then I'm at a loss, and I do not have any more ideas to come up with. :-( ID: 104199 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104201 - Posted: 30 Apr 2021, 11:34:45 UTC - in response to Message 104198. If it was not clear earlier in the thread. There is only one thing I change in the project preferences to switch between getting the intel GPU tasks and not getting the intel GPU tasks: "Allow research to run on my CPU?" When set to No -> No gpu tasks are received When set to Yes-> intel gpu tasks are received I think that is possibly a specific WCG problem with the implementation of those controls: their way of transferring those choices from the web site to the scheduler will be unique to them (because their website is unique). But I still intend to look through the generic BOINC server code when I get a chance, but it'll be slow - I have open diagnosis calls to both Einstein and CPDN on the go at the moment. Edit - ah, I'd forgotten that WCG also has a user selection for 'max tasks in progress'. That's another complication: I was going to say 'unique to WCG', but I've recently seen it on 'traditional' BOINC projects too. It'll be a thread on this board - can anyone remind me while I finish breakfast? OK, you can stop looking: message 103817
OK, if it may be WCG specific, I will post there outlining the issue. For the max tasks: Yes, they do have that setting. I only run into it if the cache size is set higher than what the max tasks will allow. In case it is interesting, setting "OpenPandemics - COVID-19" under "Project Limits" on the profile page results in this in the account_www.worldcommunitygrid.org.xml file:
<limited_app>
    <limited_app_id>92</limited_app_id>
    <limited_app_max_in_progress>25</limited_app_max_in_progress>
</limited_app>
<limited_app>
    <limited_app_id>94</limited_app_id>
    <limited_app_max_in_progress>25</limited_app_max_in_progress>
</limited_app>
Is there a way for me to edit this file and have BOINC use it, to test having OPN1 (I think 92) set to 1 and OPNG (I think 94) still set to 25?
Just for testing, to be able to give more info to the WCG people. That way it could have CPU enabled but only get a max of 1 task at a time. Unless something (maybe "-1"?) could be put in to mean no tasks. Or it is probably better to remove 92 from:
<apps_selected>
    <app_id>92</app_id>
    <app_id>94</app_id>
    <app_id>82</app_id>
</apps_selected>
OK, got it. Then I'm at a loss, and I do not have any more ideas to come up with. :-(
It is ok. :) I mostly am trying to pin down where the issue is so that it is more likely to get fixed. ID: 104201 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104202 - Posted: 30 Apr 2021, 12:02:39 UTC - in response to Message 104201. Is there a way for me to edit this file and have BOINC use it to test having OPN1(I think 92) set to 1 and OPNG(I think 94) still set to 25? Just for testing to be able give more info to the WCG people. That way it could have CPU enabled but only get a max of 1 task at a time. Unless something (maybe "-1"?) could be put in to mean no tasks. Or probably better to remove 92 from: <apps_selected> <app_id>92</app_id> <app_id>94</app_id> <app_id>82</app_id> </apps_selected> Yes, that should be possible - for testing, at least. But I think it might only work once: it can only have reached your machine via a scheduler request/reply, and it might be repeated in all subsequent replies. Have a look in sched_reply_[WCG].xml (root of BOINC data folder), and see what you can find in there. It might be safest to stop the client before making that change, and see what happens on restart. The value to use for 'no restriction' in BOINC is undefined. Sometimes, David has used 0, and sometimes he's used -1. There's even one file where he's used both within 15 lines of code. [https://github.com/BOINC/boinc/blob/master/lib/common_defs.h#L86]. I think he flipped somewhere around 2015, so 'recent' changes like this one (2016) probably use -1, if it's defined at all. I'll look. ID: 104202 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104203 - Posted: 30 Apr 2021, 14:03:31 UTC - in response to Message 104202. Yes, that should be possible - for testing, at least. But I think it might only work once: it can only have reached your machine via a scheduler request/reply, and it might be repeated in all subsequent replies. Have a look in sched_reply_[WCG].xml (root of BOINC data folder), and see what you can find in there. It might be safest to stop the client before making that change, and see what happens on restart. The value to use for 'no restriction' in BOINC is undefined. Sometimes, David has used 0, and sometimes he's used -1. There's even one file where he's used both within 15 lines of code. [https://github.com/BOINC/boinc/blob/master/lib/common_defs.h#L86]. I think he flipped somewhere around 2015, so 'recent' changes like this one (2016) probably use -1, if it's defined at all. I'll look. The information from the account_ file is in the sched_reply file, which gets overwritten during each schedule request. So far I have not gotten it to use the changes. It seems to get overwritten on startup when it contacts the project for prefs. ID: 104203 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104206 - Posted: 30 Apr 2021, 17:47:49 UTC Would it be correct to say that if I get it to use the locally modified account_ file and it starts getting the intel gpu tasks that it may be an issue with the client? ID: 104206 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104207 - Posted: 30 Apr 2021, 18:21:09 UTC - in response to Message 104206. Would it be correct to say that if I get it to use the locally modified account_ file and it starts getting the intel gpu tasks that it may be an issue with the client? I think it would be fair to say that if you got it to work - we don't know what the heck will happen! Your mission, should you choose to accept it, is to ... I'm not getting the feeling that this is a client problem. You're getting the right sort of figures for 'work request' in the event log, and I think we'd have noticed by now if those figures didn't also appear in sched_request (though that's something you could check). The problem seems to happen after you send the request, and before you get the reply - and the only thing between those two events is the server. ID: 104207 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104210 - Posted: 30 Apr 2021, 19:11:03 UTC - in response to Message 104207. Would it be correct to say that if I get it to use the locally modified account_ file and it starts getting the intel gpu tasks that it may be an issue with the client? I think it would be fair to say that if you got it to work - we don't know what the heck will happen! Your mission, should you choose to accept it, is to ... I'm not getting the feeling that this is a client problem. You're getting the right sort of figures for 'work request' in the event log, and I think we'd have noticed by now if those figures didn't also appear in sched_request (though that's something you could check). The problem seems to happen after you send the request, and before you get the reply - and the only thing between those two events is the server.
I did get it to work using the locally modified account_ file. What was changed:
no_cpu was changed from 0 to 1
app_id 92 (opn1) was removed from apps_selected
I will have to revert to the unmodified account_ file and check the sched_request file. ID: 104210 ·
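
The two changes described above (no_cpu from 0 to 1, and app_id 92 dropped from apps_selected) can be sketched in Python. This is only an illustration under stated assumptions: the wrapper element name and the sample fragment are hypothetical, the real account_ file contains more than this, and, as noted earlier in the thread, the client may overwrite the file on the next scheduler contact.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; the real account_ file
# has more elements and its exact layout is not shown in this thread.
SAMPLE = """<project_preferences>
    <no_cpu>0</no_cpu>
    <apps_selected>
        <app_id>92</app_id>
        <app_id>94</app_id>
        <app_id>82</app_id>
    </apps_selected>
</project_preferences>"""


def disable_cpu_app(xml_text: str, cpu_app_id: str = "92") -> str:
    """Set no_cpu to 1 and remove one app_id from apps_selected."""
    root = ET.fromstring(xml_text)

    no_cpu = root.find(".//no_cpu")
    if no_cpu is not None:
        no_cpu.text = "1"

    selected = root.find(".//apps_selected")
    if selected is not None:
        # Copy the list first so removal does not disturb iteration.
        for app in list(selected.findall("app_id")):
            if (app.text or "").strip() == cpu_app_id:
                selected.remove(app)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(disable_cpu_app(SAMPLE))
```

Stopping the client before editing, as suggested earlier in the thread, is probably the safer way to try something like this.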

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104217 - Posted: 1 May 2021, 10:39:19 UTC So I saved the work request generated with the account_ file from the server, and then the one generated with the locally modified account_ file. It looks like the only significant difference is:
non modified:
scheduler request -> work_req_seconds is 0
scheduler request -> cpu_req_seconds is 0
modified:
scheduler request -> work_req_seconds is >0
scheduler request -> cpu_req_seconds is >0
both have:
coproc_intel_gpu->req_secs >0
I think the next interesting test would be a scheduler request with work_req_seconds >0 and cpu_req_seconds = 0. If that gets iGPU work, I would take it as a client issue. If not, I would take it as a server issue. Do you have any thoughts/ideas? ID: 104217 ·
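
A comparison like the one above can be scripted rather than done by eye. The sketch below is a rough illustration: the element nesting and the numbers in the two samples are made up, and a real sched_request file may not parse cleanly with a strict XML parser, so treat the parsing step as an assumption.

```python
import xml.etree.ElementTree as ET

# Fields named in the post above; the path syntax is ElementTree's.
FIELDS = ("work_req_seconds", "cpu_req_seconds", "coproc_intel_gpu/req_secs")

# Made-up samples; only the zero / non-zero pattern follows the post.
UNMODIFIED = """<scheduler_request>
  <work_req_seconds>0.0</work_req_seconds>
  <cpu_req_seconds>0.0</cpu_req_seconds>
  <coprocs><coproc_intel_gpu><req_secs>8640.0</req_secs></coproc_intel_gpu></coprocs>
</scheduler_request>"""

MODIFIED = """<scheduler_request>
  <work_req_seconds>8640.0</work_req_seconds>
  <cpu_req_seconds>8640.0</cpu_req_seconds>
  <coprocs><coproc_intel_gpu><req_secs>8640.0</req_secs></coproc_intel_gpu></coprocs>
</scheduler_request>"""


def read_requests(xml_text):
    """Return the requested seconds for each field, or None if absent."""
    root = ET.fromstring(xml_text)
    values = {}
    for path in FIELDS:
        node = root.find(".//" + path)
        values[path] = float(node.text) if node is not None and node.text else None
    return values


def differing_fields(a, b):
    """Map each field that differs to its (a, b) pair of values."""
    va, vb = read_requests(a), read_requests(b)
    return {k: (va[k], vb[k]) for k in FIELDS if va[k] != vb[k]}


if __name__ == "__main__":
    for field, pair in differing_fields(UNMODIFIED, MODIFIED).items():
        print(field, pair)
```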

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104218 - Posted: 1 May 2021, 11:29:54 UTC - in response to Message 104217. Last modified: 1 May 2021, 11:40:30 UTC The interesting question is "why do we have separate lines for <work_req_seconds> and <cpu_req_seconds>?". For the project where I get my CPU tasks from, both are populated, and identical to the microsecond. I suspect the answer may be historical: when BOINC was first written, there was only one possible sort of work (GPUs couldn't compute), and maybe the wording was changed in newer servers? The alternative may (and probably should) have been retained for backward compatibility. I'll add that to my code reading list ('blame' is useful in these situations). The other thing I'll look at is the timing evidence from the modification datestamp on the various files - notably, request/reply and the account file. Edit - interesting post at https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43386_offset,570#657527: Just mentioned that my GPU cruncher with 1660 Super, 280X and A12-9800 APU gets up to 150 NV wu's and almost non for the ATI/AMD. That's no real problem because I can handle this with a second Boinc instance. With a 50 per GPU limit, is the limit being applied globally? Does the NVidia card get all 150, and leave no quota for the other two? That makes some sense, because NVidia was the first GPU type to be handled by BOINC, followed by AMD and finally Intel. The requests are almost certainly handled in that order in the server code, but I'll check. ID: 104218 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104219 - Posted: 1 May 2021, 12:24:10 UTC - in response to Message 104218. Last modified: 1 May 2021, 12:41:38 UTC The interesting question is "why do we have separate lines for <work_req_seconds> and <cpu_req_seconds>?". The 'historical legacy' answer seems to hold up. David Jan 10 2009 - client: work_req_seconds is CPU req, not max(CPU req, CUDA req). My 'account_[WCG]' file still has a datestamp of 28 April - so it's not re-written every time. Probably only when the setting on the server has changed. ID: 104219 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104220 - Posted: 1 May 2021, 12:41:23 UTC - in response to Message 104218. The interesting question is "why do we have separate lines for <work_req_seconds> and <cpu_req_seconds>?". For the project where I get my CPU tasks from, both are populated, and identical to the microsecond. I suspect the answer may be historical: when BOINC was first written, there was only one possible sort of work (GPUs couldn't compute), and maybe the wording was changed in newer servers? The alternative may (and probably should) have been retained for backward compatibility. I'll add that to my code reading list ('blame' is useful in these situations). The other thing I'll look at is the timing evidence from the modification datestamp on the various files - notably, request/reply and the account file. Yes, legacy as you pointed out. It appears to be related to the anonymous computing platform. It is in work_fetch.cpp If project is anonymous_platform, then it sets work_req_seconds to the highest of the req_secs. if project is anonymous platform, set the overall work req to the max of the requests of resource types for which we have versions. Otherwise projects with old schedulers won't send us work. I am guessing that WCG falls under the projects with old schedulers even though it is not anonymous_platform. Edit - interesting post at https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43386_offset,570#657527: Just mentioned that my GPU cruncher with 1660 Super, 280X and A12-9800 APU gets up to 150 NV wu's and almost non for the ATI/AMD. That's no real problem because I can handle this with a second Boinc instance. With a 50 per GPU limit, is the limit being applied globally? Does the NVidia card get all 150, and leave no quota for the other two? That makes some sense, because NVidia was the first GPU type to be handled by BOINC, followed by AMD and finally Intel. 
The requests are almost certainly handled in that order in the server code, but I'll check. IIRC, the limit is 50 per GPU with a max of 200 per machine. People with multiple GPUs have pointed that out before. ID: 104220 ·

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104221 - Posted: 1 May 2021, 12:43:33 UTC - in response to Message 104220. if project is anonymous platform, set the overall work req to the max of the requests of resource types for which we have versions. Otherwise projects with old schedulers won't send us work. I am guessing that WCG falls under the projects with old schedulers even though it is not anonymous_platform. This seems to be held up by the results of my testing. If work_req_seconds is set to the highest of req_secs, then WCG correctly sends iGPU work units. This is without using the locally modified account_ file. ID: 104221 ·
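
The behaviour described above can be modelled in a few lines. This is a toy model, not the code from work_fetch.cpp or from any scheduler: the function names are invented, and the rule that an old scheduler looks only at the overall figure is an assumption based on the comment quoted above.

```python
def overall_request(cpu_req, gpu_reqs, use_max):
    """Compute work_req_seconds for a scheduler request.

    use_max=False: the overall figure is just the CPU request.
    use_max=True:  the overall figure is the largest request of any
                   resource type (the workaround quoted for
                   anonymous platform).
    """
    if use_max:
        return max([cpu_req, *gpu_reqs.values()])
    return cpu_req


def old_scheduler_sends_work(work_req_seconds):
    """Assumption: a legacy scheduler only checks the overall figure."""
    return work_req_seconds > 0


if __name__ == "__main__":
    # CPU disabled, Intel GPU asking for work (made-up number of seconds).
    gpu_reqs = {"intel_gpu": 8640.0}
    for use_max in (False, True):
        req = overall_request(0.0, gpu_reqs, use_max)
        print(use_max, req, old_scheduler_sends_work(req))
```

With CPU disabled, the plain rule gives an overall request of 0 and the modelled scheduler sends nothing, while the max rule gives a non-zero request, which matches the pattern reported in the test above.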

goben_2003 Send message Joined: 29 Apr 21 Posts: 50	Message 104222 - Posted: 1 May 2021, 12:57:57 UTC - in response to Message 104219. The interesting question is "why do we have separate lines for <work_req_seconds> and <cpu_req_seconds>?". The 'historical legacy' answer seems to hold up. David Jan 10 2009 - client: work_req_seconds is CPU req, not max(CPU req, CUDA req). My 'account_[WCG]' file still has a datestamp of 28 April - so it's not re-written every time. Probably only when the setting on the server has changed. If I am interpreting the code properly, the settings are sent with every sched_reply. The account_[WCG] file is overwritten if "project->gui_urls != old_gui_urls || update_project_prefs". update_project_prefs is set if the venue has changed or if the sent project settings are different from the current project settings. See cs_scheduler.cpp. ID: 104222 ·
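
That reading of the condition can be written out as a small Python paraphrase. It is a sketch of the logic as described in the post, not a translation of cs_scheduler.cpp; the function and parameter names are invented for illustration.

```python
def should_rewrite_account_file(gui_urls, old_gui_urls,
                                venue, old_venue,
                                sent_prefs, current_prefs):
    """Paraphrase of the overwrite condition described in the post.

    update_project_prefs is true when the venue changed or when the
    preferences in the reply differ from what the client holds now.
    """
    update_project_prefs = (venue != old_venue) or (sent_prefs != current_prefs)
    return gui_urls != old_gui_urls or update_project_prefs


if __name__ == "__main__":
    server_prefs = "<no_cpu>0</no_cpu>"
    edited_prefs = "<no_cpu>1</no_cpu>"
    # Nothing changed anywhere: the file is left alone.
    print(should_rewrite_account_file("u", "u", "home", "home",
                                      server_prefs, server_prefs))
    # Local edit differs from what the server sends: the file is rewritten.
    print(should_rewrite_account_file("u", "u", "home", "home",
                                      server_prefs, edited_prefs))
```

Under this reading, a locally edited file would differ from the preferences in the next reply and so would be overwritten, which fits the earlier observation that local changes did not survive a scheduler request.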

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104223 - Posted: 1 May 2021, 13:24:08 UTC - in response to Message 104220. IIRC, the limit is 50 per GPU with a max of 200 per machine. People with multiple GPUs have pointed that out before. Yes, we've got that one sorted. But I was drawing attention to "gets up to 150 NV wu's" in the post I quoted. He shouldn't have space for 150 tasks for a single NV GPU if they were allocated strictly '50 for the NV, 50 for the ATI, 50 for the APU'. It seems to be '150 in total - first come, first served' - that's the effect we're chasing ("Why are there none left for tail-end Charlie?"). ID: 104223 ·
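
The difference between the two allocation schemes can be shown with a toy model. This is purely a hypothesis about the observed effect, not WCG's or BOINC's server code; the function name and the request sizes are made up, and the NVIDIA, AMD, Intel ordering is the one suggested earlier in the thread.

```python
def allocate(wanted, per_gpu_limit=50, pooled=True):
    """Grant tasks per GPU type, in the order the types are listed.

    pooled=True:  one shared pool of per_gpu_limit * number of GPUs,
                  handed out first come, first served.
    pooled=False: each GPU type is capped at per_gpu_limit on its own.
    """
    remaining = per_gpu_limit * len(wanted)
    granted = {}
    for gpu_type, want in wanted.items():
        if pooled:
            give = min(want, remaining)
            remaining -= give
        else:
            give = min(want, per_gpu_limit)
        granted[gpu_type] = give
    return granted


if __name__ == "__main__":
    # Each GPU type asks for more than its own share (made-up numbers).
    wanted = {"NVIDIA": 150, "AMD": 150, "Intel": 150}
    print("pooled:", allocate(wanted, pooled=True))
    print("strict:", allocate(wanted, pooled=False))
```

In the pooled case the first type handled takes the whole pool and the last one gets nothing, which is the "tail-end Charlie" pattern described above.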

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5081	Message 104224 - Posted: 1 May 2021, 13:29:42 UTC - in response to Message 104222. If I am interpreting the code properly, the settings are sent with every sched_reply. The account_[WCG] file is overwritten if "project->gui_urls != old_gui_urls || update_project_prefs". update_project_prefs is set if the venue has changed or if the sent project settings are different from the current project settings. See cs_scheduler.cpp The alternative way of reading that is "don't send the settings if nothing's changed". I must have changed mine on Wednesday, but not since. They probably haven't considered the case of "user modified client record, so now it's different from what the server remembered" - maybe that's the mis-match that prompts your send. ID: 104224 ·
