Problem with "max concurrent" in app config.

Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103448 - Posted: 7 Mar 2021, 18:52:22 UTC - in response to Message 103447.  

CPU work request: 38618.23 seconds; 0.00 devices
estimated total CPU task duration: 39316 seconds
That is not, of itself, a significant over-allocation of work.

If the problem is that the human eye sees no need for extra work to be requested, but the BOINC eye does, we need evidence of that.

At first glance, your comments suggest something similar to the issue I've been discussing most recently with Raistmer in GPU tasks skipped after scheduler overcommits CPU cores, and before that in Client: loses track of new work requirements.

Those are complex problems, and they need forensic teamwork to address. Who's in?
ID: 103448
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 103449 - Posted: 7 Mar 2021, 18:52:41 UTC

One thing I see that could be giving you a bit of an issue is this pair of lines:
20756 Kryptos@Home 07-03-2021 02:54 PM You are attached to this project twice. Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/


&

21070 Kryptos@Home 07-03-2021 03:55 PM You are attached to this project twice. Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/


In the (dark and distant) past, doing this gave me some strange work-fetch issues...
ID: 103449
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 103452 - Posted: 7 Mar 2021, 20:19:29 UTC

I just set work_fetch_debug for a few minutes and, in the midst of thousands of lines, I found this bit:
07/03/2021 19:52:30 | | [work_fetch] No project chosen for work fetch
07/03/2021 19:53:30 | | choose_project(): 1615146810.159882
07/03/2021 19:53:30 | | [work_fetch] ------- start work fetch state -------
07/03/2021 19:53:30 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
07/03/2021 19:53:30 | | [work_fetch] --- project states ---


My cache is set to 1 day, plus 0.01 days, and I'm running 4 cores.
One day is 86400 seconds, and thus 0.01 days is 864 seconds, with no correction for the number of cores in use.....
(Cancel one thought I had - I was expecting these figures to be multiplied by the number of cores in use.)

I've just forced a finished WCG task to report - and the first few lines are:
07/03/2021 20:02:46 | World Community Grid | update requested by user
07/03/2021 20:02:46 | | [work_fetch] Request work fetch: project updated by user
07/03/2021 20:02:50 | World Community Grid | [sched_op] sched RPC pending: Requested by user
07/03/2021 20:02:50 | World Community Grid | piggyback_work_request()
07/03/2021 20:02:50 | | [work_fetch] ------- start work fetch state -------
07/03/2021 20:02:50 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec

Then a whole load of stuff relating to suspended and 'no new tasks' projects.

Then, eventually, after still more stuff about projects that can't get tasks (as before):
07/03/2021 20:07:04 | World Community Grid | [work_fetch] share 1.000
07/03/2021 20:07:04 | | [work_fetch] --- state for NVIDIA GPU ---
07/03/2021 20:07:04 | | [work_fetch] shortfall 174528.00 nidle 2.00 saturated 0.00 busy 0.00


Which doesn't make sense to me, apart from being about double my buffer size.

It would be interesting to see what your system is saying and doing.
(But as Richard said, only leave work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)
ID: 103452
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103453 - Posted: 7 Mar 2021, 20:34:41 UTC - in response to Message 103452.  

In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
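
(As a worked check against the log in Message 103452, assuming nidle there counts idle GPU instances: shortfall = 2 * (86400 + 864) = 174528 - exactly the NVIDIA GPU figure, and the 'about double my buffer size' noticed above.)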

As I did in issue 4117, comment 738194152, the best way to pin this problem down is to use the GUI event log configurator in BOINC Manager. Check both rr_simulation and work_fetch_debug and click Apply; then uncheck both, count to five, and save. You should then have one, or at most two, cycles of each. Drop the complete segment (no edits) in a dropbucket somewhere, give me a link, and I'll start looking in the morning. It'll take some time.
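
For reference (an equivalent route, if you prefer editing files by hand): the same two flags can be set in cc_config.xml and re-read with the Manager's 'Read config files' command, along these lines:

<cc_config>
   <log_flags>
      <rr_simulation>1</rr_simulation>
      <work_fetch_debug>1</work_fetch_debug>
   </log_flags>
</cc_config>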
ID: 103453
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 103454 - Posted: 7 Mar 2021, 21:46:21 UTC - in response to Message 103453.  

In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.

Thanks Richard - I'm less confused than I was.

(PM on its way)
ID: 103454
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103460 - Posted: 8 Mar 2021, 12:16:54 UTC
Last modified: 8 Mar 2021, 12:17:07 UTC

Well, isn't this just the same max_concurrent issue discussed here: https://boinc.berkeley.edu/forum_thread.php?id=14146 ?
ID: 103460
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103461 - Posted: 8 Mar 2021, 12:27:03 UTC - in response to Message 103451.  
Last modified: 8 Mar 2021, 12:28:16 UTC


At first glance, your comments suggest something similar to the issue I've been discussing most recently with Raistmer in GPU tasks skipped after scheduler overcommits CPU cores, and before that in Client: loses track of new work requirements.

Those are complex problems, and they need forensic teamwork to address. Who's in?


Aha, the same indeed.
It's too complex a problem: it requires a re-design of BOINC's basic approach to scheduling.
Currently, BOINC treats work for the same devices (let's consider only the CPU, for example) from different projects as distinct, so it can account for different project shares.
But it doesn't treat different app classes within the same project as different work, so it inherently can't account correctly for per-app-class limits such as different numbers of available cores for different apps inside the same project. Correctly limiting the number of tasks for a particular app in a particular project is therefore impossible with the current design, whether or not a simulation is done for the restricted tasks in work fetch.

It should be possible without a big re-design for separate projects, though.
A project with max_concurrent should be simulated against max_concurrent CPUs (provided that is less than cpu_num).
That should give the correct shortfall for that project: if there are 4 cores but only 2 are available to the project, that project can't create a shortfall of 4 * wall_clock_time - otherwise its fetch requests will always ask for twice too much work.
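
To put numbers on that (a worked check, borrowing rob's 86400 + 864 = 87264-second buffer from earlier in the thread): with 4 cores and max_concurrent = 2, the current simulation can report a shortfall of up to 4 * 87264 = 349056 core-seconds for that project, while the project can only ever consume min(4, 2) * 87264 = 174528 core-seconds in the same window - the factor-of-two over-request described above.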
ID: 103461
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103462 - Posted: 8 Mar 2021, 12:31:27 UTC - in response to Message 103452.  

(But as Richard said, only leave work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)

I run with it set all the time these days - the older lines just scroll away.
ID: 103462
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103463 - Posted: 8 Mar 2021, 12:35:54 UTC - in response to Message 103425.  

On my 24-core Ryzen, I've put this in app_config.xml:

<app_config>
   <app>
      <name>kryptos-plato</name>
      <max_concurrent>4</max_concurrent>
   </app>
</app_config>

This is because they're VirtualBox programs that make the computer sluggish if I run too many at once, but that reason is irrelevant to this discussion.

I have set a buffer of 0 + 3 hours, and everything other than the above app sticks to that: it waits until it's about to run out, then downloads 3 hours. But the above app is accumulating a huge number of tasks, because BOINC doesn't seem to be accounting for the app_config setting when downloading them; I think it's assuming I'm going to run 24 at once. I've currently got a queue of 7 days 1 hour, which would take 7 hours on all 24 cores - near enough. But on only 4 cores that's 42 hours, nowhere near the 3 I asked for.


The only current solution is to micromanage.
Download as many tasks as you can tolerate (and can process before the deadline), then suspend one of them.
This will stop work-fetch requests for that particular project.
As the work runs out, re-enable that task and wait for the queue to overload with new work again; then the cycle repeats.
ID: 103463
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103464 - Posted: 8 Mar 2021, 12:40:30 UTC - in response to Message 103453.  

In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.

When does the shortfall variable increase?
ID: 103464
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103465 - Posted: 8 Mar 2021, 12:51:29 UTC - in response to Message 103464.  

In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
When does the shortfall variable increase?
In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding the estimated runtimes to reach a 'saturated' figure for core-seconds, or separately for device-seconds in the case of GPU apps.

Shortfall is the difference between target and saturated.

At the moment, tasks from projects with a max_concurrent set are sometimes excluded from saturated, so saturated is low and shortfall is high. You can see the exclusions in the output of rr_simulation.

That's a very crude over-simplification from memory. There are numerous ifs, buts, maybes, and edge-cases, but that's the principle so far as I've been able to derive it.
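
In code terms, a minimal sketch of that principle might look like the following (hypothetical names and structures, based on the over-simplification above - not the actual client code):

#include <vector>

// Hypothetical, simplified model of the rr_simulation accounting
// described above - not the real BOINC client structures.
struct CachedTask {
    double estimated_runtime;  // estimated core-seconds to completion
    bool excluded;             // dropped by the max_concurrent handling
};

double compute_shortfall(const std::vector<CachedTask>& tasks,
                         int ncpus, double target_buffer_secs) {
    // 'saturated' sums the estimated runtimes of the tasks the
    // simulation actually counts.
    double saturated = 0.0;
    for (const CachedTask& t : tasks) {
        if (!t.excluded) saturated += t.estimated_runtime;
    }
    double target = ncpus * target_buffer_secs;  // core-seconds wanted
    // Excluding max_concurrent tasks keeps 'saturated' low, so the
    // shortfall stays high and the client keeps requesting work.
    return target > saturated ? target - saturated : 0.0;
}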
ID: 103465
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103472 - Posted: 9 Mar 2021, 13:28:52 UTC - in response to Message 103465.  
Last modified: 9 Mar 2021, 13:35:34 UTC

In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
When does the shortfall variable increase?
In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding the estimated runtimes to reach a 'saturated' figure for core-seconds, or separately for device-seconds in the case of GPU apps.

Shortfall is the difference between target and saturated.

At the moment, tasks from projects with a max_concurrent set are sometimes excluded from saturated, so saturated is low and shortfall is high. You can see the exclusions in the output of rr_simulation.

That's a very crude over-simplification from memory. There are numerous ifs, buts, maybes, and edge-cases, but that's the principle so far as I've been able to derive it.


So one needs to:
1) not exclude max_concurrent-limited tasks from work fetch - actually, from BOTH work fetch and scheduling (so our current approach of separating the scheduling and work-fetch paths could be unnecessary!);
2) where max_concurrent is set, replace num_cpus with min(num_cpus, max_concurrent) everywhere core-seconds are calculated.

The BIG problem I see here: the task estimates (as you said) are just summed up, then compared against cache_size * num_cpus.
If so, there is no easy way to account for a different CPU count for unlimited versus max_concurrent tasks.
The "correct" procedure would be to sum estimated / min(cpu_num, max_concurrent) and compare that sum against cache_size.
That way, the different weights of limited and unlimited tasks are accounted for correctly.

EDIT: of course, the division should be replaced by multiplication for speed, but it still has to be applied per task type (inside the sum!), not as a single multiplication of cache_size.
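
A sketch of that "correct" weighted sum (hypothetical names, not the real client code; the division is kept for clarity rather than the multiplication mentioned in the EDIT):

#include <algorithm>
#include <vector>

// Hypothetical task descriptor - not the real BOINC structures.
struct Task {
    double estimated_runtime;  // estimated core-seconds
    int max_concurrent;        // per-app limit; 0 = unlimited
};

// Each task contributes estimated / min(num_cpus, max_concurrent)
// wall-seconds, so limited and unlimited tasks carry the right
// weights, and the sum is compared against cache_size directly.
double weighted_shortfall(const std::vector<Task>& tasks,
                          int num_cpus, double cache_size_secs) {
    double saturated = 0.0;  // wall-seconds of cached work
    for (const Task& t : tasks) {
        int effective = t.max_concurrent > 0
                      ? std::min(num_cpus, t.max_concurrent)
                      : num_cpus;
        saturated += t.estimated_runtime / effective;
    }
    return cache_size_secs > saturated ? cache_size_secs - saturated : 0.0;
}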
ID: 103472
