Problem with "max concurrent" in app config.

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5081	Message 103448 - Posted: 7 Mar 2021, 18:52:22 UTC - in response to Message 103447.
    CPU work request: 38618.23 seconds; 0.00 devices
    estimated total CPU task duration: 39316 seconds
is not, of itself, a significant over-allocation of work. If the problem is that the human eye sees no need for extra work to be requested, but the BOINC eye does, we need evidence of that.
At first glance, your comments suggest something similar to the issue I've been discussing most recently with Raistmer in "GPU tasks skipped after scheduler overcommits CPU cores", and before that in "Client: loses track of new work requirements". Those are complex problems, and they need forensic teamwork to address. Who's in?
ID: 103448

robsmith Volunteer tester Help desk expert Joined: 25 May 09 Posts: 1283	Message 103449 - Posted: 7 Mar 2021, 18:52:41 UTC
One thing I see that could be giving you a bit of an issue is these two lines:
    20756 Kryptos@Home 07-03-2021 02:54 PM You are attached to this project twice. Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/
    21070 Kryptos@Home 07-03-2021 03:55 PM You are attached to this project twice. Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/
In the past (dark and distant) doing this gave me some strange work-fetch issues...
ID: 103449

robsmith Volunteer tester Help desk expert Joined: 25 May 09 Posts: 1283	Message 103452 - Posted: 7 Mar 2021, 20:19:29 UTC
I just set Work_fetch_debug for a few minutes and, in the midst of thousands of lines, I found this bit:
    07/03/2021 19:52:30 | | [work_fetch] No project chosen for work fetch
    07/03/2021 19:53:30 | | choose_project(): 1615146810.159882
    07/03/2021 19:53:30 | | [work_fetch] ------- start work fetch state -------
    07/03/2021 19:53:30 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
    07/03/2021 19:53:30 | | [work_fetch] --- project states ---
My cache is set to 1 day, plus 0.01 days, and I'm running 4 cores. One day is 86400 seconds, and thus 0.01 days is 864 seconds, with no correction for the number of cores in use. (Cancel one thought I had - I was expecting these figures to be multiplied by the number of cores in use.)
I've just forced a finished WCG task to report, and the first few lines are:
    07/03/2021 20:02:46 | World Community Grid | update requested by user
    07/03/2021 20:02:46 | | [work_fetch] Request work fetch: project updated by user
    07/03/2021 20:02:50 | World Community Grid | [sched_op] sched RPC pending: Requested by user
    07/03/2021 20:02:50 | World Community Grid | piggyback_work_request()
    07/03/2021 20:02:50 | | [work_fetch] ------- start work fetch state -------
    07/03/2021 20:02:50 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
Then a whole load of stuff relating to suspended and no new tasks projects.
Then, eventually:
    07/03/2021 20:02:46 | World Community Grid | update requested by user
    07/03/2021 20:02:46 | | [work_fetch] Request work fetch: project updated by user
    07/03/2021 20:02:50 | World Community Grid | [sched_op] sched RPC pending: Requested by user
    07/03/2021 20:02:50 | World Community Grid | piggyback_work_request()
    07/03/2021 20:02:50 | | [work_fetch] ------- start work fetch state -------
    07/03/2021 20:02:50 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
and still more stuff about projects that can't get tasks (as before)
    07/03/2021 20:07:04 | World Community Grid | [work_fetch] share 1.000
    07/03/2021 20:07:04 | | [work_fetch] --- state for NVIDIA GPU ---
    07/03/2021 20:07:04 | | [work_fetch] shortfall 174528.00 nidle 2.00 saturated 0.00 busy 0.00
Which doesn't make sense to me, apart from being about double my buffer size.
It would be interesting to see what your system is saying and doing. (But as Richard said, only leave Work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)
ID: 103452

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5081	Message 103453 - Posted: 7 Mar 2021, 20:34:41 UTC - in response to Message 103452.
In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
As I did in issue 4117, comment 738194152, the best way to tackle this problem is to use the GUI event log configurator in BOINC Manager. Check both rr_simulation and work_fetch_debug. Click apply; then uncheck both, count to five, and save. You should then have one, or at most two, cycles of each. Drop the complete segment (no edits) in a dropbucket somewhere, give me a link, and I'll start looking in the morning. It'll take some time.
ID: 103453
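The rule of thumb above can be checked against the figures in Message 103452. What follows is only a minimal sketch of the empty-cache case, not BOINC's own code: the helper name `empty_cache_shortfall` and its parameters are invented for illustration, and it assumes that "nidle 2.00" in the log means two idle GPU instances.

```python
def empty_cache_shortfall(min_buffer_s: float, extra_buffer_s: float,
                          n_instances: float) -> float:
    """Hypothetical helper: shortfall in instance-seconds for an empty cache.

    The target work buffer is wall time; the shortfall is that wall time
    multiplied by the number of instances (cores or GPUs) to keep busy.
    """
    target_wall_s = min_buffer_s + extra_buffer_s
    return n_instances * target_wall_s


# Figures from the log: target work buffer 86400.00 + 864.00 sec, nidle 2.00
print(empty_cache_shortfall(86400.0, 864.0, 2))  # 174528.0
```

Under those assumptions the result equals the "shortfall 174528.00" reported for the NVIDIA GPU, which would explain why it looked like about double the buffer size.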

robsmith Volunteer tester Help desk expert Joined: 25 May 09 Posts: 1283	Message 103454 - Posted: 7 Mar 2021, 21:46:21 UTC - in response to Message 103453.
> In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
Thanks Richard - I'm less confused than I was. (PM on its way)
ID: 103454

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103460 - Posted: 8 Mar 2021, 12:16:54 UTC Last modified: 8 Mar 2021, 12:17:07 UTC
Well, isn't this just the same max_concurrent issue discussed here: https://boinc.berkeley.edu/forum_thread.php?id=14146 ?
ID: 103460

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103461 - Posted: 8 Mar 2021, 12:27:03 UTC - in response to Message 103451. Last modified: 8 Mar 2021, 12:28:16 UTC
> At first glance, your comments suggest something similar to the issue I've been discussing most recently with Raistmer in "GPU tasks skipped after scheduler overcommits CPU cores", and before that in "Client: loses track of new work requirements". Those are complex problems, and they need forensic teamwork to address. Who's in?
Aha, the same indeed. It's too complex a problem; it requires a redesign of BOINC's basic approach to scheduling.
Currently BOINC treats work for the same device type (let's speak only about CPU, for example) from different projects as different, so it can account for different project shares. But it doesn't treat different app_classes from the same project as different work, so it inherently can't account correctly for per-app limits such as different numbers of available cores for different apps inside the same project.
So correctly limiting the number of tasks for a particular app in a particular project is impossible with the current design, no matter whether the simulation in work fetch is done for restricted tasks or not.
It should be possible without a big redesign for separate projects. A project with max_concurrent should be simulated against max_concurrent CPUs (provided that's less than cpu_num). This should give the correct shortfall for that project: if there are 4 cores and only 2 are available to the project, that project can't create a shortfall of 4*wall_clock_time; otherwise there will always be twice too much work in fetch requests.
ID: 103461
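The 4-core, 2-available example can be put into numbers. This is a rough sketch of the per-project idea only; `project_shortfall` is a made-up name, the one-hour buffer is an arbitrary illustration value, and nothing here reflects how the BOINC client is actually structured.

```python
from typing import Optional


def project_shortfall(ncpus: int, buffer_s: float,
                      max_concurrent: Optional[int] = None) -> float:
    """Hypothetical: empty-cache shortfall (core-seconds) for one project.

    If the project is capped by max_concurrent, only that many cores can
    ever work on it, so only that many cores are counted.
    """
    usable = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
    return usable * buffer_s


uncapped = project_shortfall(4, 3600.0)                  # all 4 cores counted
capped = project_shortfall(4, 3600.0, max_concurrent=2)  # only 2 cores counted
print(uncapped / capped)  # 2.0
```

The ratio of 2 is the "twice too much work" that results when the cap is ignored in the request.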

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103462 - Posted: 8 Mar 2021, 12:31:27 UTC - in response to Message 103452.
> (But as Richard said, only leave Work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)
I always run it these days - older lines just go away.
ID: 103462

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103463 - Posted: 8 Mar 2021, 12:35:54 UTC - in response to Message 103425.
> On my 24 core Ryzen, I've put this in app config:
>     <app_config>
>       <app>
>         <name>kryptos-plato</name>
>         <max_concurrent>4</max_concurrent>
>       </app>
>     </app_config>
> This is because they're virtualbox programs that make the computer sluggish if I do too many at once, but that reason is irrelevant to this discussion. I have set a buffer of 0+3 hours, and everything other than the above app sticks to that. It waits until it's about to run out, then downloads 3 hours. But the above is accumulating a huge amount of tasks, because Boinc doesn't seem to be accounting for the app config setting when downloading them, I think it's assuming I'm going to run 24 at once. I've currently got a queue of 7 days 1 hour, which would take 7 hours on all 24 cores, so near enough. But on only 4 cores, 42 hours, nowhere near the 3 I asked for.
The only current solution is to micromanage. Download as many tasks as you can tolerate/process before deadline, then suspend one of them. This will stop work fetch requests for that particular project. When the work is exhausted, resume that task and wait for the work queue to overload again. That completes the cycle.
ID: 103463
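The arithmetic in the quoted post is easy to reproduce. The sketch below assumes that "7 days 1 hour" is the summed estimated runtime of the queued tasks (that is, core-hours) and that the tasks divide evenly across the running cores; `hours_to_drain` is an illustrative name, not a BOINC function.

```python
from typing import Optional


def hours_to_drain(queue_core_hours: float, ncpus: int,
                   max_concurrent: Optional[int] = None) -> float:
    """Hypothetical: wall-clock hours needed to work through a queue."""
    running = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
    return queue_core_hours / running


queue = 7 * 24 + 1  # 7 days 1 hour = 169 core-hours

print(round(hours_to_drain(queue, 24), 2))     # 7.04
print(hours_to_drain(queue, 24, 4))            # 42.25
```

Roughly 7 hours on 24 cores but about 42 hours on the 4 cores that max_concurrent allows, against a requested buffer of 3 hours.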

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103464 - Posted: 8 Mar 2021, 12:40:30 UTC - in response to Message 103453.
> In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
When does the shortfall variable get increased?
ID: 103464

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5081	Message 103465 - Posted: 8 Mar 2021, 12:51:29 UTC - in response to Message 103464.
> > In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
> When does the shortfall variable get increased?
In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding the estimated runtimes to reach a 'saturated' figure for core-seconds, or separately for device-seconds in the case of GPU apps. Shortfall is the difference between target and saturated. At the moment, tasks from projects with a max_concurrent set are sometimes excluded from saturated, so saturated is low, and shortfall is high. You can see the exclusions in the output of rr_simulation.
That's a very crude over-simplification from memory. There are numerous ifs, buts, maybes, and edge-cases, but that's the principle so far as I've been able to derive it.
ID: 103465
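A toy model of that simplified description shows how an exclusion inflates the request. It follows only the wording of the post above, not the client's source: the `Task` class, the `excluded` flag, and the clamp at zero are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    est_runtime_s: float
    excluded: bool = False  # hypothetical flag: task skipped by the simulation


def shortfall(target_core_s: float, tasks: List[Task]) -> float:
    """Target minus 'saturated', where saturated sums the counted runtimes."""
    saturated = sum(t.est_runtime_s for t in tasks if not t.excluded)
    return max(0.0, target_core_s - saturated)  # clamp is an assumption


target = 4 * 3600.0  # 4 cores, 1 hour buffer, in core-seconds
cached = [Task(3600.0), Task(3600.0, excluded=True), Task(3600.0, excluded=True)]

print(shortfall(target, cached))  # 10800.0 when two tasks are left out
print(shortfall(target, [Task(t.est_runtime_s) for t in cached]))  # 3600.0 when all count
```

With two of the three cached tasks left out of the sum, the apparent shortfall is three times what it would be if every task were counted.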

Raistmer Joined: 9 Apr 06 Posts: 302	Message 103472 - Posted: 9 Mar 2021, 13:28:52 UTC - in response to Message 103465. Last modified: 9 Mar 2021, 13:35:34 UTC
> > > In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.
> > When does the shortfall variable get increased?
> In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding the estimated runtimes to reach a 'saturated' figure for core-seconds, or separately for device-seconds in the case of GPU apps. Shortfall is the difference between target and saturated. At the moment, tasks from projects with a max_concurrent set are sometimes excluded from saturated, so saturated is low, and shortfall is high. You can see the exclusions in the output of rr_simulation. That's a very crude over-simplification from memory. There are numerous ifs, buts, maybes, and edge-cases, but that's the principle so far as I've been able to derive it.
So one needs:
1) not to exclude max_concurrent'ed tasks from work fetch. Actually, from BOTH work_fetch and scheduling (so our current approach of separating the scheduling and work fetch paths could be unnecessary!)
2) if (max_concurrent), to replace num_cpus with min(num_cpus, max_concurrent) everywhere core-seconds are calculated.
The BIG problem I see here: task estimates (as you said) are just summed up, then compared against cache_size*num_cpus. If so, there is no easy way to account for the different CPU counts of unlimited and max_concurrent tasks.
The "correct" procedure would be to sum estimated/min(cpu_num, max_concurrent) and compare the sum against cache_size. This way one could correctly account for the different weights of limited and unlimited tasks.
EDIT: of course the division should be replaced by multiplication for speed, but it should still be done for each type of task (inside the sum!), not as a single multiplication of cache_size.
ID: 103472
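The per-task weighting proposed above might look something like the following. It is a sketch of the proposal as worded in this post, not of anything that exists in the BOINC client; the function names and the (runtime, cap) tuple layout are invented, and the figures reuse the 24-core, max_concurrent 4, 3-hour buffer setup from earlier in the thread.

```python
from typing import Iterable, Optional, Tuple

# (estimated runtime in seconds, max_concurrent cap or None) - hypothetical layout
TaskInfo = Tuple[float, Optional[int]]


def wall_time_cached(tasks: Iterable[TaskInfo], num_cpus: int) -> float:
    """Sum estimated/min(num_cpus, max_concurrent) over all cached tasks."""
    total = 0.0
    for est_s, cap in tasks:
        usable = num_cpus if cap is None else min(num_cpus, cap)
        total += est_s / usable  # the weighting happens inside the sum
    return total


def needs_more_work(tasks: Iterable[TaskInfo], num_cpus: int,
                    cache_size_s: float) -> bool:
    """Compare the weighted sum directly against the wall-time cache size."""
    return wall_time_cached(tasks, num_cpus) < cache_size_s


capped = [(3600.0, 4)] * 12  # twelve 1-hour tasks, at most 4 at a time
cache = 3 * 3600.0           # 3-hour buffer

print(wall_time_cached(capped, 24))        # 10800.0
print(needs_more_work(capped, 24, cache))  # False
# Plain sum against cache_size*num_cpus, as described above, still asks for more:
print(sum(est for est, _ in capped) < cache * 24)  # True
```

In this example the weighted sum treats twelve capped tasks as a full 3-hour buffer, whereas the unweighted comparison would go on requesting work.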

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.