GPU not receiving tasks when CPU computing disabled

goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104225 - Posted: 1 May 2021, 13:48:53 UTC - in response to Message 104223.  
Last modified: 1 May 2021, 13:50:16 UTC

IIRC, the limit is 50 per GPU with a max of 200 per machine. People with multiple GPUs have pointed that out before.
Yes, we've got that one sorted. But I was drawing attention to "gets up to 150 NV wu's" in the post I quoted. He shouldn't have space for 150 tasks for a single NV GPU if they were allocated strictly '50 for the NV, 50 for the ATI, 50 for the APU'. It seems to be '150 in total - first come, first served' - that's the effect we're chasing ("Why are there none left for tail-end Charlie?").

Ah ok. Yes, I misread that. It would be interesting to know if the client is only requesting nvidia, since presumably they have their cache set high enough for 150 NV, or if the client is requesting all of them and the WCG server sends work for the NV GPU until either the 50-per-GPU limit or the req_secs is reached.

If I am interpreting the code properly, the settings are sent with every sched_reply. The account_[WCG] file is overwritten if "project->gui_urls != old_gui_urls || update_project_prefs". update_project_prefs is set if the venue has changed or if the sent project settings are different from the current project settings. See cs_scheduler.cpp.
The alternative way of reading that is "don't send the settings if nothing's changed". I must have changed mine on Wednesday, but not since. They probably haven't considered the case of "user modified client record, so now it's different from what the server remembered" - maybe that's the mismatch that prompts your send.

I said it the way that I did because the server (the WCG server, at least) seems to send the settings with every sched_reply. At least that is what I take from them always being in the sched_reply file regardless of whether the settings have changed.
So, "don't save the settings if nothing's changed" instead of "don't send the settings if nothing's changed"
ID: 104225 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104226 - Posted: 1 May 2021, 14:20:01 UTC - in response to Message 104225.  

Ah ok. Yes, I misread that. It would be interesting to know if the client is only requesting nvidia, since presumably they have their cache set high enough for 150 NV, or if the client is requesting all of them and the WCG server sends work for the NV GPU until either the 50-per-GPU limit or the req_secs is reached.
Mine is behaving a little differently. I'm working at a Windows machine, with WCG enabled for NV (x2) and Intel_gpu, but not for CPU. It's also got work for other projects, but I'm not fetching from them at the moment. I'm stable at 120 tasks: 8 WCG for intel, 12 from other projects, and thus 100 for NV - and it says "This computer has reached a limit on tasks in progress". Which is what the book says, but not what's been reported.

Meanwhile, see what you think of https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits. That's the instruction manual for server operators. It's not bang up to date, but four of the last five changes have been made by Kevin Reed - another of the WCG admins - so I guess they're paying attention. Neither that section ('Job limits'), nor the next section down ('Job limits (advanced)'), seems to allow them to set different limits for different classes of GPU - so, "All GPUs are created equal". ???

So, "don't save the settings if nothing's changed" instead of "don't send the settings if nothing's changed"
Yup, I'll go with that - looked inside the reply file this time.
ID: 104226 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104228 - Posted: 1 May 2021, 15:29:44 UTC
Last modified: 1 May 2021, 15:51:28 UTC

I'm getting twitchy about https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L739:
// return true if additional work is needed,
// and there's disk space left,
// and we haven't exceeded result per RPC limit,
// and we haven't exceeded results per day limit
That leads on to https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L796:
// check config.xml limits on in-progress jobs
That should loop through
#define PROC_TYPE_CPU 0
#define PROC_TYPE_NVIDIA_GPU 1
#define PROC_TYPE_AMD_GPU 2
#define PROC_TYPE_INTEL_GPU 3
and when the maximum is exceeded for one proc_type, zero out the work request for that type, and proceed to check the others.
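Just to check we read that the same way, here's a stand-alone toy version of that check - my paraphrase of the intended logic, not the real sched_send.cpp, with numbers invented to match the situation below (two NV cards at their cap, Intel well under it).
#include <array>
#include <cstdio>

enum { PROC_TYPE_CPU, PROC_TYPE_NVIDIA_GPU, PROC_TYPE_AMD_GPU, PROC_TYPE_INTEL_GPU, NPROC_TYPES };

int main() {
    std::array<double, NPROC_TYPES> req_secs   {0, 41076, 0, 36306};  // like the log below
    std::array<int,    NPROC_TYPES> in_progress{0, 100,   0, 3};      // tasks already on the host
    std::array<int,    NPROC_TYPES> limit      {0, 100,   50, 50};    // assumed 50 per GPU, two NV cards

    for (int pt = 0; pt < NPROC_TYPES; pt++) {
        if (req_secs[pt] <= 0) continue;                   // nothing asked for this type
        if (limit[pt] && in_progress[pt] >= limit[pt]) {
            req_secs[pt] = 0;                              // cap reached: zero this type only...
            std::printf("type %d: limit reached, request zeroed\n", pt);
            continue;                                      // ...and carry on checking the others
        }
        std::printf("type %d: still eligible for %.0f seconds of work\n", pt, req_secs[pt]);
    }
}
Run against those numbers, it zeroes the NV request and leaves the Intel one standing - which is what the book says should happen.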

But I've got a log on a different machine that appears to say different. Hold on, and I'll edit it in.

01/05/2021 16:29:47 | World Community Grid | Requesting new tasks for NVIDIA GPU and Intel GPU
01/05/2021 16:29:47 | World Community Grid | [sched_op] NVIDIA GPU work request: 41076.75 seconds; 0.00 devices
01/05/2021 16:29:47 | World Community Grid | [sched_op] Intel GPU work request: 36306.05 seconds; 0.00 devices
01/05/2021 16:29:48 | World Community Grid | Scheduler request completed: got 0 new tasks
01/05/2021 16:29:48 | World Community Grid | This computer has reached a limit on tasks in progress
That machine is at the same 100 limit for NV, which accounts for the 'limit' message, but why didn't it go on to check the intel request? I've only got three of those, so nothing to stop it.

But now I've reported a few NV tasks, it's gone into a different mode:
01/05/2021 16:38:54 | World Community Grid | Sending scheduler request: To fetch work.
01/05/2021 16:38:54 | World Community Grid | Reporting 1 completed tasks
01/05/2021 16:38:54 | World Community Grid | Requesting new tasks for NVIDIA GPU and Intel GPU
01/05/2021 16:38:54 | World Community Grid | [sched_op] NVIDIA GPU work request: 41158.52 seconds; 0.00 devices
01/05/2021 16:38:54 | World Community Grid | [sched_op] Intel GPU work request: 25191.31 seconds; 0.00 devices
01/05/2021 16:38:57 | World Community Grid | Scheduler request completed: got 3 new tasks
01/05/2021 16:38:57 | World Community Grid | [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
01/05/2021 16:38:57 | World Community Grid | [sched_op] estimated total Intel GPU task duration: 8728 seconds
Later, it started fetching a little NV, and even later - drum roll! - both types together. I'll try and extract a summary of the log.
ID: 104228 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104229 - Posted: 1 May 2021, 15:55:08 UTC - in response to Message 104226.  

Ah ok. Yes, I misread that. It would be interesting to know if the client is only requesting nvidia, since presumably they have their cache set high enough for 150 NV, or if the client is requesting all of them and the WCG server sends work for the NV GPU until either the 50-per-GPU limit or the req_secs is reached.
Mine is behaving a little differently. I'm working at a Windows machine, with WCG enabled for NV (x2) and Intel_gpu, but not for CPU. It's also got work for other projects, but I'm not fetching from them at the moment. I'm stable at 120 tasks: 8 WCG for intel, 12 from other projects, and thus 100 for NV - and it says "This computer has reached a limit on tasks in progress". Which is what the book says, but not what's been reported.

Interesting. That is different from what has been reported.
Mine is also behaving differently. In case you find it interesting, here is what I have seen on my machine with NV x1 + Intel GPU with CPU disabled:
The Intel seems to come first for it. Right now, it has 48 NV (10 currently running) + 56 Intel (1 currently running). The last scheduler request reported 1 NV and received 2 Intel.
While typing this, another scheduler request happened: 1 NV reported and 3 NV received. That brought it up to 50 NV + 56 Intel.
The one after that was 3 NV reported, 2 NV + 1 Intel received, bringing it to 49 NV + 57 Intel.
The next was 1 NV reported, 2 NV received, bringing NV back to 50 and still 57 Intel.

Meanwhile, see what you think of https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits. That's the instruction manual for server operators. It's not bang up to date, but four of the last five changes have been made by Kevin Reed - another of the WCG admins - so I guess they're paying attention. Neither that section ('Job limits'), nor the next section down ('Job limits (advanced)'), seems to allow them to set different limits for different classes of GPU - so, "All GPUs are created equal". ???

Interesting. I would guess they are using the advanced job limits. Something along the lines of:
<app>
    <app_name>opng</app_name>
    <total_limit>
        <jobs>200</jobs>
    </total_limit>
    <gpu_limit>
        <jobs>50</jobs>
        <per_proc/>
    </gpu_limit>
</app>

I agree, it does not seem to allow them to set different limits for different classes of GPU. I tend to think that part should be handled on the server side. While the user could try to set their cache so that work is distributed more fairly, that would have to be redone whenever the app changed enough to shift the ratio of estimated runtime per job between processor types. Not that the estimates are accurate: on my machine with 1 NV and 1 Intel GPU, BOINC estimates 19:26 for the NV and 15:54 for the Intel. The NV estimate is close enough now that the runtimes are longer, but the Intel one is way off. Which is one reason I think that part is something that should be handled on the server side.
ID: 104229 · Report as offensive
Grumpy Swede
Avatar

Send message
Joined: 30 Mar 20
Posts: 375
Sweden
Message 104230 - Posted: 1 May 2021, 16:30:44 UTC - in response to Message 104223.  

IIRC, the limit is 50 per GPU with a max of 200 per machine. People with multiple GPUs have pointed that out before.
Yes, we've got that one sorted. But I was drawing attention to "gets up to 150 NV wu's" in the post I quoted. He shouldn't have space for 150 tasks for a single NV GPU if they were allocated strictly '50 for the NV, 50 for the ATI, 50 for the APU'. It seems to be '150 in total - first come, first served' - that's the effect we're chasing ("Why are there none left for tail-end Charlie?").

My GTX980 only gets a max of 50. The iGPU also gets max 50. After that the usual BOINC message "This computer has reached a limit on tasks in progress", appears in the log.
So in my case the 50/GPU seems to work. (I have set unlimited in the WCG profile)
ID: 104230 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104231 - Posted: 1 May 2021, 16:34:21 UTC - in response to Message 104228.  

Summary log from the second machine:
01-May-2021 16:31:50 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:31:50 [World Community Grid] [sched_op] NVIDIA GPU work request: 41413.98 seconds; 0.00 devices
01-May-2021 16:31:50 [World Community Grid] [sched_op] Intel GPU work request: 36455.24 seconds; 0.00 devices
01-May-2021 16:31:51 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 976 seconds
01-May-2021 16:31:51 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 0 seconds
01-May-2021 16:33:58 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:33:59 [World Community Grid] [sched_op] NVIDIA GPU work request: 40652.56 seconds; 0.00 devices
01-May-2021 16:33:59 [World Community Grid] [sched_op] Intel GPU work request: 36604.44 seconds; 0.00 devices
01-May-2021 16:34:01 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
01-May-2021 16:34:01 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 5818 seconds
01-May-2021 16:36:51 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:36:51 [World Community Grid] [sched_op] NVIDIA GPU work request: 40845.92 seconds; 0.00 devices
01-May-2021 16:36:51 [World Community Grid] [sched_op] Intel GPU work request: 30935.17 seconds; 0.00 devices
01-May-2021 16:36:52 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
01-May-2021 16:36:52 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 5818 seconds
01-May-2021 16:38:54 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:38:54 [World Community Grid] [sched_op] NVIDIA GPU work request: 41158.52 seconds; 0.00 devices
01-May-2021 16:38:54 [World Community Grid] [sched_op] Intel GPU work request: 25191.31 seconds; 0.00 devices
01-May-2021 16:38:57 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
01-May-2021 16:38:57 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 8728 seconds
01-May-2021 16:41:51 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:41:52 [World Community Grid] [sched_op] NVIDIA GPU work request: 41495.04 seconds; 0.00 devices
01-May-2021 16:41:52 [World Community Grid] [sched_op] Intel GPU work request: 16539.24 seconds; 0.00 devices
01-May-2021 16:41:53 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
01-May-2021 16:41:53 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 8728 seconds
01-May-2021 16:43:56 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:43:56 [World Community Grid] [sched_op] NVIDIA GPU work request: 41752.40 seconds; 0.00 devices
01-May-2021 16:43:56 [World Community Grid] [sched_op] Intel GPU work request: 7959.73 seconds; 0.00 devices
01-May-2021 16:43:57 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 967 seconds
01-May-2021 16:43:57 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 5818 seconds
01-May-2021 16:46:52 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:46:53 [World Community Grid] [sched_op] NVIDIA GPU work request: 42095.51 seconds; 0.00 devices
01-May-2021 16:46:53 [World Community Grid] [sched_op] Intel GPU work request: 2216.89 seconds; 0.00 devices
01-May-2021 16:46:54 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 1934 seconds
01-May-2021 16:46:54 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 2909 seconds
01-May-2021 16:48:57 [World Community Grid] [sched_op] Starting scheduler request
01-May-2021 16:48:58 [World Community Grid] [sched_op] NVIDIA GPU work request: 41290.60 seconds; 0.00 devices
01-May-2021 16:48:58 [World Community Grid] [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
01-May-2021 16:48:59 [World Community Grid] [sched_op] estimated total NVIDIA GPU task duration: 1918 seconds
01-May-2021 16:48:59 [World Community Grid] [sched_op] estimated total Intel GPU task duration: 0 seconds
So, it eventually filled the Intel queue, but it took several goes, and started topping up NV along the way.

Final thought, before I take a break - my head's spinning. In pulling out that summary, I deleted all the references to "Server version 701". In theory, that dates back to 2013 or even earlier - almost as old as Einstein's! - but I suspect that in reality, they've forked off the mainline and done their own tweaks.
ID: 104231 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104233 - Posted: 1 May 2021, 17:04:34 UTC

I can empathize with the head spinning. I have been looking through the code quite a bit. In doing so, I came up with a question that I cannot check myself, since I do not have a machine with multiple GPUs of the same type. So I have a question for you about your sched_request:
Are there 2 Nvidia coprocessors listed with count set to 1 each or 1 listed with count set to 2? Or something else? Meaning 2 of:
<coproc_cuda>
   <count>1</count>

or 1 of
<coproc_cuda>
   <count>2</count>

or something else?
ID: 104233 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104234 - Posted: 1 May 2021, 17:30:56 UTC - in response to Message 104233.  
Last modified: 1 May 2021, 17:34:33 UTC

Thereby hangs a very nasty tale. This machine actually reports
<coproc_cuda>
   <count>2</count>
   <name>GeForce GTX 1660 SUPER</name>
but in practice it's got one 1660 and one 1650. Left to its own devices, the 1650 would be idle, but you can wake it up with <use_all_gpus>1</use_all_gpus> in cc_config.xml

Having two different cards, but only reporting the characteristics of the 'better' one, causes real problems. If the 'smaller' one has less memory, or can't do double precision, or is pre-OpenCL v1.2, the server will still send impossible tasks, and the client will still try to run them (unless you exclude that project on that GPU, also in cc_config.xml).

I went to the 2014 BOINC workshop in Budapest (my local low-cost airline happened to fly direct to Hungary from the airport just up the road, on the right day of the week both out and back. I could hardly pass that one up!), and I heard the 'historical summary' talk David Anderson gave that year. I heard him say that the decision not to separately identify GPUs was a mistake, in hindsight. I don't think either the text or a recording of that talk is available in the public domain, but the slides can be downloaded from https://boinc.berkeley.edu/trac/attachment/wiki/WorkShop14/workshop_14.pdf. Slide 52 was on screen as he said that: "Reflections on software: things we need to change - Coprocessor model". But nobody has.
ID: 104234 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104235 - Posted: 1 May 2021, 19:37:28 UTC - in response to Message 104231.  

Thereby hangs a very nasty tale. This machine actually reports
<coproc_cuda>
   <count>2</count>
   <name>GeForce GTX 1660 SUPER</name>
but in practice it's got one 1660 and one 1650. Left to its own devices, the 1650 would be idle, but you can wake it up with <use_all_gpus>1</use_all_gpus> in cc_config.xml

Having two different cards, but only reporting the characteristics of the 'better' one, causes real problems. If the 'smaller' one has less memory, or can't do double precision, or is pre-OpenCL v1.2, the server will still send impossible tasks, and the client will still try to run them (unless you exclude that project on that GPU, also in cc_config.xml).

I went to the 2014 BOINC workshop in Budapest (my local low-cost airline happened to fly direct to Hungary from the airport just up the road, on the right day of the week both out and back. I could hardly pass that one up!), and I heard the 'historical summary' talk David Anderson gave that year. I heard him say that the decision not to separately identify GPUs was a mistake, in hindsight. I don't think either the text or a recording of that talk is available in the public domain, but the slides can be downloaded from https://boinc.berkeley.edu/trac/attachment/wiki/WorkShop14/workshop_14.pdf. Slide 52 was on screen as he said that: "Reflections on software: things we need to change - Coprocessor model". But nobody has.

Very interesting. Also, I agree on it being hard to pass up the opportunity!

In pulling out that summary, I deleted all the references to "Server version 701". In theory, that dates back to 2013 or even earlier - almost as old as Einstein's! - but I suspect that in reality, they've forked off the mainline and done their own tweaks.

I guess that means all the time I spent looking at the current code was semi-pointless. Oh well, at least I was amused by all the messages between Bruce and David in the code. :)

Also, I guess that even if I went back through git to ~2013, it still would not be representative, since the WCG-specific tweaks are unknown. There is no "ifdef WORLD_COMMUNITY_GRID" (or _WCG) in the scheduler code like there is for Einstein.
Which I guess brings us back to whether this is a BOINC issue or a WCG-specific issue, and to getting the attention of one of the WCG admins to look into these issues. Meaning:
(1) Not sending intel gpu work if (a) it is the only GPU and (b) CPU computing is disabled
(My machine with only an intel GPU is back down to 0 GPU tasks running stock boinc and cpu computing disabled in preferences)
(2) Uneven distribution of work between coprocessor types

Unless you have any other ideas of things to look at?

Oh, and by the way, my machine with 1 NV and 1 intel GPU is up to 50 NV tasks and 96 (!) intel GPU tasks. That is stock boinc and cpu computing disabled in preferences. It only stopped at 96 due to cache size. I had raised the cache to see how high it would go. I am tempted to raise the cache a bit more just to see if it stops at 100. However, the time estimates are way off, so it will have trouble completing them before the new 3-day deadline.
ID: 104235 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104236 - Posted: 1 May 2021, 22:13:05 UTC - in response to Message 104235.  

It's not necessarily as bad as that. I found a problem with another project earlier this year, and their response when I pointed it out to them was along the lines of "we maintain our own separate version of the server code, except for the scheduler, where we've brought in the latest version as a complete replacement" (or words to that effect - I'm relying on memory).

All the issues we've discussed have been related to the scheduler, and it's entirely possible that WCG will have done something similar. It's unlikely that they will have written an entirely separate replacement: as you and I have found, it's fiendishly complicated, and I think even experienced coders like Bruce have refrained from meddling as much as possible. The BOINC code is the best guide we've got, and it's likely to be pretty close - it has to be compatible with the requests from the newest clients, at the very least.
ID: 104236 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104241 - Posted: 2 May 2021, 9:47:49 UTC

I am trying out the boinc client simulator (locally compiled). I have not used the client simulator before. Should work_fetch be accurate for it if the server was running the current code and had tasks available (as should be the case during this stress test)?
Also, should all the log flags in cc_config.xml take effect with the simulator? Specifically sched_op_debug and work_fetch_debug.
ID: 104241 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104242 - Posted: 2 May 2021, 10:18:14 UTC - in response to Message 104236.  

It's not necessarily as bad as that. I found a problem with another project earlier this year, and their response when I pointed it out to them was along the lines of "we maintain our own separate version of the server code, except for the scheduler, where we've brought in the latest version as a complete replacement" (or words to that effect - I'm relying on memory).

All the issues we've discussed have been related to the scheduler, and it's entirely possible that WCG will have done something similar. It's unlikely that they will have written an entirely separate replacement: as you and I have found, it's fiendishly complicated, and I think even experienced coders like Bruce have refrained from meddling as much as possible. The BOINC code is the best guide we've got, and it's likely to be pretty close - it has to be compatible with the requests from the newest clients, at the very least.

It is true that it was not really a waste of time. I learned a lot about the boinc client and scheduling code.
It definitely is fiendishly complicated though. I have not yet found a way (with the current code) for it to refuse the work request in my sched_request file, when the same request is granted with everything else identical apart from work_req_seconds > 0.
If I am following the code properly:
work_req_seconds is not used very much. It is used to set the value of g_request->work_req_seconds and, from that, g_wreq->seconds_to_fill. Every place I can see that uses either of those should still hand out work as long as a coprocessor was included in the sched_request file: that coprocessor should get g_wreq->rsc_spec_request set to true and make g_wreq->need_proc_type(i) return true, given req_secs > 0 for it.
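To make that concrete, here is a stripped-down, self-contained stand-in for how I read that logic. The struct is invented for illustration; only the names seconds_to_fill, rsc_spec_request, need_proc_type and req_secs come from the real source, so treat it as my paraphrase rather than the actual scheduler code.
#include <cstdio>

// Stand-in for the server-side work request state (invented shape).
struct WorkReq {
    double seconds_to_fill = 0;          // set from g_request->work_req_seconds
    bool   rsc_spec_request = false;     // true if any resource made its own request
    double req_secs[4] = {0, 0, 0, 0};   // per proc type: CPU, NV, AMD, Intel GPU

    bool need_proc_type(int i) const {
        if (rsc_spec_request) return req_secs[i] > 0;  // that type's own request decides
        return seconds_to_fill > 0;                    // otherwise fall back to the total
    }
};

int main() {
    WorkReq w;
    w.seconds_to_fill  = 0;      // CPU computing disabled -> work_req_seconds == 0
    w.rsc_spec_request = true;   // ...but the client did ask for a specific resource
    w.req_secs[3]      = 9504;   // Intel GPU: ~9500 seconds requested
    std::printf("need intel_gpu work? %d\n", w.need_proc_type(3)); // expect 1
}
So, by my reading, work_req_seconds being 0 should not be enough on its own to stop an Intel GPU request.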
ID: 104242 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104243 - Posted: 2 May 2021, 10:57:07 UTC
Last modified: 2 May 2021, 11:24:03 UTC

I've never used the simulator, except in its web-interface form. It often lags behind the master code: I had a problem involving an app_config.xml file, but I couldn't include it in the simulation run. That was added to the input page, but my next problem seemed to involve the use of two app_config.xml files. And so on. But if you can overcome those constraints by building locally, it may give you some insights. NB - it can't simulate the server response "... reached a limit on tasks in progress".

On the server side, your description is fine. But be aware it leaves out the dimension of time. For your req_seconds, the scheduler picks one candidate task, locks it, and performs a number of checks to see if it's right for your machine. If it is, it's assigned to you, added to a buffer, and the whole process starts again. Adding multiple tasks to the buffer - sequentially - takes time, and is optimised as far as possible.

The main optimisation is the use of a 'feeder'. The feeder performs all the database lookups to populate a small pool of available tasks for allocation. On small projects, that pool contains maybe 100 or 200 tasks at a time, and is held in the fastest-access memory available. I shudder to think how many may be needed for WCG in this mode! That single pool has to be used by every scheduler instance currently running (multiple requests are processed in parallel). So, while your scheduler is plodding through your request, other schedulers are fishing in the same pool. For a non-trivial request, it's entirely possible for the feeder pool to have been emptied before your scheduler has found enough compatible tasks to add up to your request quantum. You'll get a reply comprising less work than you asked for, and you ask again a couple of minutes later. By which time, the feeder pool will have been emptied and refilled many times over (I think the usual cycle time is 2 seconds).
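If it helps, here's a toy, self-contained model of that 'empty feeder' effect. It is nothing like the real feeder or scheduler code - the pool contents, request sizes and drain rate are all invented - but it shows how the resource handled last can come up short.
#include <deque>
#include <cstdio>

int main() {
    // Feeder snapshot: mostly NV tasks, a few Intel GPU ones (proc types 1 and 3).
    std::deque<int> pool(100, 1);
    for (int i = 0; i < 20; i++) pool.push_back(3);

    // One pass over the pool for one resource type, as the scheduler fills that
    // part of the request.
    auto take = [&](int type, int wanted) -> int {
        int got = 0;
        for (auto it = pool.begin(); it != pool.end() && got < wanted; )
            if (*it == type) { it = pool.erase(it); got++; } else { ++it; }
        return got;
    };

    int nv = take(1, 30);                           // NV is handled first...
    for (int i = 0; i < 85 && !pool.empty(); i++)   // ...while other hosts' scheduler
        pool.pop_front();                           // instances drain the same shared pool
    int intel = take(3, 10);                        // Intel, handled last, finds little left

    std::printf("NV assigned: %d, Intel assigned: %d, pool left: %zu\n",
                nv, intel, pool.size());            // here: NV 30, Intel 5, pool 0
}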

That may even account for the missing intel_gpu tasks on its own. As we saw, your request is processed device by device - NV, AMD, Intel (in that order). The intel_gpu is most likely to hit 'empty feeder syndrome'. Look back at my summary log (message 104231) - does that explanation fit?
ID: 104243 · Report as offensive
goben_2003

Send message
Joined: 29 Apr 21
Posts: 50
Message 104244 - Posted: 2 May 2021, 11:29:23 UTC - in response to Message 104243.  

(All the following is related to a machine with just 1 gpu: intel_gpu with no_cpu set to 1 in the preferences.)

I've never used the simulator, except in its web-interface form. It often lags behind the master code: I had a problem involving an app_config.xml file, but I couldn't include it in the simulation run. That was added to the input page, but my next problem seemed to involve the use of two app_config.xml files. And so on. But if you can overcome those constraints by building locally, it may give you some insights. NB - it can't simulate the server response "... reached a limit on tasks in progress".

I realize it can't actually simulate server-side reasons for not returning work, like the task not being in the feeder, or "... reached a limit on tasks in progress", etc. What I meant was: if the simulator shows tasks being received, does that mean the intel_gpu not getting any tasks is probably not a problem with the client or its config? That is leaving out the app_config.xml from the simulator, though.

On the server side, your description is fine. But be aware it leaves out the dimension of time. For your req_seconds, the scheduler picks one candidate task, locks it, and performs a number of checks to see if it's right for your machine. If it is, it's assigned to you, added to a buffer, and the whole process starts again. Adding multiple tasks to the buffer - sequentially - takes time, and is optimised as far as possible.

The main optimisation is the use of a 'feeder'. The feeder performs all the database lookups to populate a small pool of available tasks for allocation. On small projects, that pool contains maybe 100 or 200 tasks at a time, and is held in the fastest-access memory available. I shudder to think how many may be needed for WCG in this mode! That single pool has to be used by every scheduler instance currently running (multiple requests are processed in parallel). So, while your scheduler is plodding through your request, other schedulers are fishing in the same pool. For a non-trivial request, it's entirely possible for the feeder pool to have been emptied before your scheduler has found enough compatible tasks to add up to your request quantum. You'll get a reply comprising less work than you asked for, and you ask again a couple of minutes later. By which time, the feeder pool will have been emptied and refilled many times over (I think the usual cycle time is 2 seconds).

I agree with this for individual requests. However, it can and has gone hundreds of requests in a row without receiving any tasks; in fact, it has never received intel_gpu tasks while CPU tasks are disabled on the profile page. Then, changing just one of the following:
1. modify the profile for the machine on the WCG website to allow CPU tasks
2. modify the account_ file to have no_cpu set to 0
3. remove the if anonymous_platform check in work_fetch.cpp so that it always sets work_req_seconds to the highest req_secs of the coprocessors (see the sketch below)
Changing any one of those three results in receiving a few tasks on each request until the limit is reached (cache, or the profile's max tasks for OPN).
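For (3), here is roughly what I mean - a self-contained sketch of my reading of that part of work_fetch.cpp, not the actual client code. The function shape is invented; only work_req_seconds and req_secs are real names.
#include <algorithm>
#include <cstdio>

// anonymous_platform is kept as a parameter only to show where the stock check
// sat; the workaround ignores it.
double work_req_seconds(bool anonymous_platform, double cpu_secs,
                        const double coproc_secs[], int n) {
    double req = cpu_secs;                        // 0 when CPU computing is disabled
    (void)anonymous_platform;                     // stock: if (anonymous_platform) { ... }
    for (int i = 0; i < n; i++)                   // workaround: always fold in the
        req = std::max(req, coproc_secs[i]);      // largest coprocessor request
    return req;
}

int main() {
    const double coprocs[] = {0.0, 0.0, 9504.0};  // NV, AMD, Intel GPU request seconds
    std::printf("work_req_seconds = %.0f\n",
                work_req_seconds(false, 0.0, coprocs, 3)); // prints 9504
}
With the check removed, the overall request is non-zero whenever any coprocessor wants work - and, as above, the server then starts sending tasks.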

That may even account for the missing intel_gpu tasks on its own. As we saw, your request is processed device by device - NV, AMD, Intel (in that order). The intel_gpu is most likely to hit 'empty feeder syndrome'. Look back at my summary log (message 104231) - does that explanation fit?

It may for the machines with multiple gpus that we discussed earlier. I think that it does not fit for machines with only 1 gpu which is an intel_gpu.

All this to say that all the investigations I have done point to it probably being a WCG-specific BOINC server problem rather than a mainline BOINC client/server problem. I am open to hearing your thoughts if you disagree, though. I know you have spent a lot more time over the years with BOINC internals than I have, since I only started looking into them when this issue occurred.
ID: 104244 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104246 - Posted: 2 May 2021, 11:54:15 UTC - in response to Message 104244.  

It may for the machines with multiple gpus that we discussed earlier. I think that it does not fit for machines with only 1 gpu which is an intel_gpu.

All this to say that all the investigations I have done point to it probably being a WCG-specific BOINC server problem rather than a mainline BOINC client/server problem. I am open to hearing your thoughts if you disagree, though. I know you have spent a lot more time over the years with BOINC internals than I have, since I only started looking into them when this issue occurred.
Yup, setting one of the other devices to allow work seems to liberate the intel_gpu - allowing CPU in your case, allowing NV in mine.

And yup, I've spent a lot of time looking at BOINC - it's my retirement hobby, to keep the little grey cells supple. But I've spent more time looking at the clients, rather than the server. You're moving at least as fast as me on this one.

I do have a suitable 'CPU plus intel_gpu only' machine to explore with. It's just running a few PrimeGrid MT tasks to keep warm at the moment, but I'll flush those out (2-3 hours), and set it up on a separate WCG profile so I can fiddle with different settings.
ID: 104246 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104248 - Posted: 2 May 2021, 15:17:28 UTC

OK, now we're sucking diesel - or diesel fumes, at any rate.

02/05/2021 16:12:11 | World Community Grid | Sending scheduler request: Project initialization.
02/05/2021 16:12:11 | World Community Grid | Requesting new tasks for CPU and Intel GPU
02/05/2021 16:12:12 | World Community Grid | Scheduler request completed: got 0 new tasks
02/05/2021 16:12:12 | World Community Grid | No tasks are available for OpenPandemics - COVID-19 - GPU
02/05/2021 16:12:12 | World Community Grid | Tasks for CPU are available, but your preferences are set to not accept them
02/05/2021 16:12:12 | World Community Grid | Tasks for NVIDIA GPU are available, but your preferences are set to not accept them
02/05/2021 16:12:12 | World Community Grid | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
02/05/2021 16:12:12 | World Community Grid | Project requested delay of 121 seconds
02/05/2021 16:12:12 | World Community Grid | General prefs: from World Community Grid (last modified 02-May-2021 14:35:25)
02/05/2021 16:12:12 | World Community Grid | Computer location: home
02/05/2021 16:14:15 | World Community Grid | Sending scheduler request: To fetch work.
02/05/2021 16:14:15 | World Community Grid | Requesting new tasks for Intel GPU
02/05/2021 16:14:15 | World Community Grid | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
02/05/2021 16:14:15 | World Community Grid | [sched_op] Intel GPU work request: 9504.00 seconds; 1.00 devices
02/05/2021 16:14:16 | World Community Grid | Scheduler request completed: got 0 new tasks
So far, so bad. I'll keep trying a few different combinations.
ID: 104248 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104249 - Posted: 2 May 2021, 15:24:29 UTC - in response to Message 104248.  

First oddity: the reply after initialisation says "Computer location: home". But the WCG website says it's on 'default'. I think default is the default on my account, but I'll check...
ID: 104249 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104250 - Posted: 2 May 2021, 15:44:20 UTC

Changing to a new location and back again 'sort-of' works:

02/05/2021 16:35:35 | World Community Grid | New computer location:
02/05/2021 16:35:35 | World Community Grid | General prefs: from World Community Grid (last modified 02-May-2021 16:26:31)
02/05/2021 16:35:35 | World Community Grid | Host location: none
02/05/2021 16:35:35 | World Community Grid | General prefs: using your defaults
02/05/2021 16:37:39 | World Community Grid | Sending scheduler request: To fetch work.
02/05/2021 16:37:39 | World Community Grid | Requesting new tasks for Intel GPU
02/05/2021 16:37:39 | World Community Grid | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
02/05/2021 16:37:39 | World Community Grid | [sched_op] Intel GPU work request: 9504.00 seconds; 1.00 devices
02/05/2021 16:37:40 | World Community Grid | Scheduler request completed: got 0 new tasks
ID: 104250 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104251 - Posted: 2 May 2021, 15:52:07 UTC

And we have lift-off:

02/05/2021 16:46:08 | World Community Grid | Computer location: school
02/05/2021 16:48:10 | World Community Grid | [sched_op] Starting scheduler request
02/05/2021 16:48:12 | World Community Grid | Sending scheduler request: To fetch work.
02/05/2021 16:48:12 | World Community Grid | Requesting new tasks for CPU and Intel GPU
02/05/2021 16:48:12 | World Community Grid | [sched_op] CPU work request: 28512.00 seconds; 3.00 devices
02/05/2021 16:48:12 | World Community Grid | [sched_op] Intel GPU work request: 9504.00 seconds; 1.00 devices
02/05/2021 16:48:13 | World Community Grid | Scheduler request completed: got 4 new tasks
02/05/2021 16:48:13 | World Community Grid | [sched_op] estimated total CPU task duration: 30176 seconds
02/05/2021 16:48:13 | World Community Grid | [sched_op] estimated total Intel GPU task duration: 5119 seconds
Or maybe not:

02/05/2021 16:50:16 | World Community Grid | Sending scheduler request: To fetch work.
02/05/2021 16:50:16 | World Community Grid | Requesting new tasks for CPU and Intel GPU
02/05/2021 16:50:16 | World Community Grid | [sched_op] CPU work request: 8724.32 seconds; 0.85 devices
02/05/2021 16:50:16 | World Community Grid | [sched_op] Intel GPU work request: 4439.86 seconds; 0.00 devices
02/05/2021 16:50:17 | World Community Grid | Scheduler request completed: got 0 new tasks
02/05/2021 16:50:17 | World Community Grid | No tasks are available for the applications you have selected.
ID: 104251 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 104252 - Posted: 2 May 2021, 16:26:14 UTC

But I can't get beyond here. Computer is a quad-core plus iGPU: I wanted to run 3xCPU + iGPU, but instead I've got 2xCPU (both running) and 4xiGPU (one running). And

02/05/2021 17:21:05 | World Community Grid | Computer location: school
02/05/2021 17:21:05 | | Number of usable CPUs has changed from 3 to 4.
02/05/2021 17:21:44 | World Community Grid | [sched_op] CPU work request: 17544.75 seconds; 1.85 devices
02/05/2021 17:21:44 | World Community Grid | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
02/05/2021 17:21:45 | World Community Grid | No tasks sent
02/05/2021 17:21:45 | World Community Grid | This computer has reached a limit on tasks in progress
Do we know that limit?
ID: 104252 · Report as offensive