Message boards :
Questions and problems :
Problem with "max concurrent" in app config.
Joined: 5 Oct 06 · Posts: 5081
> CPU work request: 38618.23 seconds; 0.00 devices

is not, of itself, a significant over-allocation of work. If the problem is that the human eye sees no need for extra work to be requested, but the BOINC eye does, we need evidence of that. At first glance, your comments suggest something similar to the issue I've been discussing most recently with Raistmer in "GPU tasks skipped after scheduler overcommits CPU cores", and before that in "Client: loses track of new work requirements". Those are complex problems, and they need forensic teamwork to address. Who's in?
Joined: 25 May 09 · Posts: 1283
One thing I see that could be giving you a bit of an issue is these two lines:

> 20756 Kryptos@Home 07-03-2021 02:54 PM You are attached to this project twice. Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/ &

In the (dark and distant) past, doing this gave me some strange work-fetch issues.
Joined: 25 May 09 · Posts: 1283
I just set work_fetch_debug for a few minutes and, in the midst of thousands of lines, I found this bit:

> 07/03/2021 19:52:30 | | [work_fetch] No project chosen for work fetch

My cache is set to 1 day, plus 0.01 days, and I'm running 4 cores. One day is 86,400 seconds, and thus 0.01 days is 864 seconds, with no correction for the number of cores in use. (Cancel one thought I had - I was expecting these figures to be multiplied by the number of cores in use.)

I've just forced a finished WCG task to report, and the first few lines are:

> 07/03/2021 20:02:46 | World Community Grid | update requested by user

Then a whole load of stuff relating to suspended and no-new-tasks projects. Then, eventually:

> 07/03/2021 20:02:46 | World Community Grid | update requested by user

and still more stuff about projects that can't get tasks (as before), then:

> 07/03/2021 20:07:04 | World Community Grid | [work_fetch] share 1.000

Which doesn't make sense to me, apart from being about double my buffer size. It would be interesting to see what your system is saying and doing. (But as Richard said, only leave work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)
Joined: 5 Oct 06 · Posts: 5081
In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If the cache is empty, shortfall = ncpus * buffer.

As I did in issue 4117, comment 738194152, the best way to capture data for this problem is to use the GUI event log configurator in BOINC Manager. Check both rr_simulation and work_fetch_debug and click Apply; then uncheck both, count to five, and save. You should then have one, or at most two, cycles of each. Drop the complete segment (no edits) in a dropbucket somewhere, give me a link, and I'll start looking in the morning. It'll take some time.
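To put numbers on that formula, using the 4-core host with a 1-day-plus-0.01-day buffer described earlier in the thread (a worked example, not output from BOINC):

```python
SECONDS_PER_DAY = 86_400

ncpus = 4                 # cores on the host described in the thread
buffer_days = 1.0 + 0.01  # "store at least" + "store additional" settings

# Target buffer, in wall-clock seconds.
target_wall = buffer_days * SECONDS_PER_DAY   # about 87,264 s

# With an empty cache, shortfall = ncpus * buffer, in core-seconds.
shortfall = ncpus * target_wall               # about 349,056 core-seconds

print(target_wall, shortfall)
```

This is why a work request can legitimately be several times larger than the wall-clock buffer the user typed in: it is denominated in core-seconds, not wall-seconds.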
Joined: 25 May 09 · Posts: 1283
> In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.

Thanks Richard - I'm less confused than I was. (PM on its way)
Joined: 9 Apr 06 · Posts: 302
Well, isn't this just the same max_concurrent issue discussed here: https://boinc.berkeley.edu/forum_thread.php?id=14146 ?
Joined: 9 Apr 06 · Posts: 302
Aha, the same indeed. It's a problem complex enough to require a redesign of BOINC's basic approach to scheduling rather than a small fix. Currently BOINC treats work for the same devices (let's consider only the CPU, for example) from different projects as different work, so it can account for different project shares. But it doesn't treat different app classes from the same project as different work, so it inherently can't account correctly for per-app-class limitations, such as different numbers of cores being available to different apps inside the same project. So correctly limiting the number of tasks for a particular app in a particular project is impossible with the current design, no matter whether the work-fetch simulation honours the restricted tasks or not.

It should be possible without a big redesign for separate projects, though. A project with max_concurrent should be simulated against min(max_concurrent, cpu_num) CPUs. That would give a correct shortfall for that project: if there are 4 cores but only 2 are available to the project, its shortfall must not be computed as 4*wall_clock_time, or its fetch requests will always ask for twice too much work.
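The proposal above can be sketched as a toy model (my own invented names and numbers, not BOINC's actual code): scale the per-project target by the number of cores the project can actually use.

```python
def shortfall_core_seconds(buffer_secs, ncpus, queued_runtimes,
                           max_concurrent=None):
    """Toy model of a per-project work-fetch shortfall.

    A project limited by max_concurrent can only ever run
    min(ncpus, max_concurrent) tasks at once, so its target buffer
    is scaled by that effective core count.
    """
    effective = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
    target = effective * buffer_secs      # core-seconds of work wanted
    saturated = sum(queued_runtimes)      # core-seconds already queued
    return max(0.0, target - saturated)

# 4-core host, 1-day buffer, empty queue, project limited to 2 concurrent tasks:
naive = shortfall_core_seconds(86_400, 4, [])                      # twice too much
limited = shortfall_core_seconds(86_400, 4, [], max_concurrent=2)  # half of naive
```

Here `naive` is 345,600 core-seconds and `limited` is 172,800: exactly the factor-of-two over-request described above.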
Joined: 9 Apr 06 · Posts: 302
> (But as Richard said, only leave work_fetch_debug set for a few minutes - it produces a frightening number of lines in very little time.)

I run it all the time these days - older lines just scroll away.
Joined: 9 Apr 06 · Posts: 302
> On my 24 core Ryzen, I've put this in app config: [...]

The only current solution is to micromanage. Download as many tasks as you can tolerate/process before the deadline, then suspend one of them. This will stop work-fetch requests for that particular project. When the work is exhausted, resume that task and wait for the queue to overload with new work again. Then the cycle repeats.
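For reference, the kind of limit under discussion is set in a project's app_config.xml. A generic sketch (the app name `some_app` is a placeholder, not taken from the thread):

```xml
<app_config>
    <app>
        <name>some_app</name>
        <!-- run at most 2 tasks of this app at once -->
        <max_concurrent>2</max_concurrent>
    </app>
    <!-- optional: limit all apps of this project combined -->
    <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

The file lives in the project's directory under the BOINC data directory, and it is exactly these limits that the work-fetch simulation mishandles in the rest of this thread.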
Joined: 9 Apr 06 · Posts: 302
> In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.

When does the shortfall variable increase?
Joined: 5 Oct 06 · Posts: 5081
> In general, 'target work buffer' is measured in wall time, and 'shortfall' is measured in core-time. If cache is empty, shortfall=ncpus*buffer.

> When does the shortfall variable increase?

In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding up the estimated runtimes, to reach a 'saturated' figure in core-seconds - or separately in device-seconds in the case of GPU apps. Shortfall is the difference between target and saturated. At the moment, tasks from projects with a max_concurrent set are sometimes excluded from saturated, so saturated is low and shortfall is high. You can see the exclusions in the output of rr_simulation. That's a very crude over-simplification from memory; there are numerous ifs, buts, maybes, and edge cases, but that's the principle so far as I've been able to derive it.
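That description can be sketched as a toy model (my own simplification, not the client's actual code) showing how excluding max_concurrent tasks from 'saturated' inflates the shortfall:

```python
def shortfall(buffer_secs, ncpus, tasks, exclude_limited=False):
    """tasks: list of (estimated_runtime_secs, from_limited_project) pairs.

    'saturated' sums the estimated runtimes of queued tasks; the
    shortfall is how far that falls short of the target buffer in
    core-seconds. With exclude_limited=True, tasks from projects with
    max_concurrent set are skipped - the behaviour described as the bug.
    """
    target = ncpus * buffer_secs
    saturated = sum(runtime for runtime, limited in tasks
                    if not (exclude_limited and limited))
    return max(0.0, target - saturated)

# 4 cores, 1-day buffer, 300,000 core-seconds queued from a limited project:
tasks = [(100_000, True), (100_000, True), (100_000, True)]
ok  = shortfall(86_400, 4, tasks)                        # cache nearly full
bug = shortfall(86_400, 4, tasks, exclude_limited=True)  # cache looks empty
```

With the exclusion, the same queue yields a shortfall of 345,600 instead of 45,600 core-seconds, so the client keeps asking for work it already has.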
Joined: 9 Apr 06 · Posts: 302
> In theory, the current effective cache is enumerated by cycling through all known cached tasks and adding the estimated runtimes to reach a 'saturated' figure for core-seconds, or separately for device-seconds in the case of GPU apps.

So one needs to:

1) Not exclude max_concurrent'ed tasks from work fetch - actually, from BOTH work fetch and scheduling (so our current approach of separating the scheduling and work-fetch paths could be unnecessary!).

2) If max_concurrent is set, replace num_cpus with min(num_cpus, max_concurrent) everywhere core-seconds are calculated.

The BIG problem I see here: task estimates (as you said) are just summed up, then compared against cache_size * num_cpus. If so, there is no easy way to account for the different core counts of unlimited and max_concurrent tasks. The "correct" procedure would be to sum estimated / min(cpu_num, max_concurrent) and compare that sum against cache_size. This way one could correctly account for the different weights of limited and unlimited tasks.

EDIT: of course, the division should be replaced by multiplication for speed, but it still has to be done per task type (inside the sum!), not as a single multiplication of cache_size.
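The weighted-sum idea above can be sketched like this (a toy model with invented names; BOINC does not implement this):

```python
def buffered_days(tasks, ncpus):
    """tasks: list of (estimated_runtime_secs, max_concurrent or None).

    Each task contributes runtime / effective_cores to the buffer,
    so tasks from a limited project fill more wall-clock time per
    queued core-second than tasks that can use every core.
    """
    total = 0.0
    for runtime, max_concurrent in tasks:
        effective = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
        total += runtime / effective
    return total / 86_400  # wall-clock days of buffered work

# 4-core host: two 1-day tasks from a project limited to 2 concurrent
# tasks occupy their 2 allowed cores for a full day of wall time...
days = buffered_days([(86_400, 2), (86_400, 2)], 4)
```

Here `days` is 1.0, whereas the naive sum-then-divide-by-ncpus approach would report 0.5 days and fetch twice the work; the division happens per task, inside the sum, exactly as the EDIT above requires.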
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.