GPU tasks skipped after scheduler overcommits CPU cores

Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103314 - Posted: 28 Feb 2021, 20:58:00 UTC - in response to Message 103296.  
Last modified: 28 Feb 2021, 21:06:17 UTC

2/28/2021 23:34:42 PM | | [work_fetch] shortfall 18135.97 nidle 0.00 saturated 81224.03 busy 0.00
2/28/2021 23:49:32 PM | | [work_fetch] shortfall 27263.01 nidle 0.00 saturated 76364.55 busy 0.00

Over ~900 seconds, shortfall increased by 9127 while saturated decreased by only 4860, roughly half as much.

Should it be so?
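
To make the comparison concrete, the two samples above can be differenced directly. This is only a minimal sketch: the `Sample` and `Delta` structs and the `delta` helper are hypothetical names for illustration, not BOINC code.

```cpp
// One [work_fetch] sample: time of the log line plus the two logged fields.
struct Sample {
    double t;          // time of day of the log line, in seconds
    double shortfall;  // "shortfall" field
    double saturated;  // "saturated" field
};

// Change between two samples (later minus earlier).
struct Delta {
    double dt;
    double d_shortfall;
    double d_saturated;
};

inline Delta delta(const Sample& a, const Sample& b) {
    return Delta{b.t - a.t, b.shortfall - a.shortfall, b.saturated - a.saturated};
}
```

With the two lines above (23:34:42 and 23:49:32), the interval is 890 s, shortfall rises by 9127.04 and saturated falls by 4859.48, so shortfall grows almost twice as fast as saturated shrinks.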

Another observation: the estimated time to completion of CPU tasks changes when a GPU task completes. Why? They are different devices...
ID: 103314 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103318 - Posted: 28 Feb 2021, 22:15:01 UTC - in response to Message 103314.  

Another observation: the estimated time to completion of CPU tasks changes when a GPU task completes. Why? They are different devices...

Which project?

If Einstein: they boycotted Credit New. They're still using the original DCF (duration correction factor) - and it's single-valued. All tasks, of whatever types, belonging to the project are adjusted.
ID: 103318 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103320 - Posted: 28 Feb 2021, 23:26:33 UTC - in response to Message 103318.  

OK, it's E@h indeed. Well, the prediction changes are not too big, so let's move on...

Before asking for work:

3/1/2021 2:08:06 AM | | [work_fetch] shortfall 27985.80 nidle 0.00 saturated 81484.01 busy 0.00
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] share 1.000
3/1/2021 2:08:06 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 2:08:06 AM | | [work_fetch] shortfall 10109.81 nidle 0.00 saturated 89250.19 busy 0.00
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] share 0.990
3/1/2021 2:08:06 AM | | [work_fetch] ------- end work fetch state -------
3/1/2021 2:08:06 AM | Einstein@Home | piggyback: resource CPU
3/1/2021 2:08:06 AM | Einstein@Home | piggyback: SETI@home Beta Test can't fetch work
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] using MC shortfall 27985.799618 instead of shortfall 27985.799618
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] set_request() for CPU: ninst 4 nused_total 18.00 nidle_now 0.00 fetch share 1.00 req_inst 0.00 req_secs 27985.80
3/1/2021 2:08:06 AM | Einstein@Home | piggyback: resource NVIDIA GPU
3/1/2021 2:08:06 AM | Einstein@Home | piggyback: SETI@home Beta Test can't fetch work
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] using MC shortfall 10109.809262 instead of shortfall 10109.809262
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] set_request() for NVIDIA GPU: ninst 1 nused_total 10.00 nidle_now 0.00 fetch share 0.99 req_inst 0.00 req_secs 10109.81
3/1/2021 2:08:06 AM | Einstein@Home | [work_fetch] request: CPU (27985.80 sec, 0.00 inst) NVIDIA GPU (10109.81 sec, 0.00 inst)
3/1/2021 2:08:07 AM | Einstein@Home | Sending scheduler request: To report completed tasks.
3/1/2021 2:08:07 AM | Einstein@Home | Reporting 1 completed tasks
3/1/2021 2:08:07 AM | Einstein@Home | Requesting new tasks for CPU and NVIDIA GPU
3/1/2021 2:08:09 AM | Einstein@Home | Scheduler request completed: got 3 new tasks

The host received both CPU work (1 GW task) and GPU work (2 FGRP tasks).
So, right after that:

3/1/2021 2:08:14 AM | | [work_fetch] ------- start work fetch state -------
3/1/2021 2:08:14 AM | | [work_fetch] target work buffer: 38880.00 + 60480.00 sec
3/1/2021 2:08:14 AM | | [work_fetch] --- project states ---
3/1/2021 2:08:14 AM | Einstein@Home | [work_fetch] REC 24694.346 prio -0.810 can't request work: scheduler RPC backoff (54.91 sec)
3/1/2021 2:08:14 AM | | [work_fetch] --- state for CPU ---
3/1/2021 2:08:14 AM | | [work_fetch] shortfall 17888.34 nidle 0.00 saturated 81471.66 busy 0.00
3/1/2021 2:08:14 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:08:14 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 2:08:14 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 106930.65 busy 0.00
3/1/2021 2:08:14 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:08:14 AM | | [work_fetch] ------- end work fetch state -------
3/1/2021 2:08:14 AM | Einstein@Home | choose_project: scanning
3/1/2021 2:08:14 AM | Einstein@Home | skip: scheduler RPC backoff

Bingo! The bug showed itself!
The fields of interest are shortfall and saturated. For the GPU, saturated ("actual work", in my understanding) increased and shortfall dropped to zero (so the host got everything it asked for).
For the CPU, however, shortfall was reduced but not to zero, and saturated stayed essentially the same. The host got a new task, but the amount of stored work did not increase? How so??
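
When comparing many such samples, the fields can be pulled out of a `[work_fetch]` line mechanically. The following is a minimal sketch that assumes the line layout seen in the logs in this thread (`shortfall X nidle X saturated X busy X`). `RscState` and `parse_state` are hypothetical helpers, not part of BOINC.

```cpp
#include <sstream>
#include <string>

// The four numeric fields of a per-resource [work_fetch] state line.
struct RscState {
    double shortfall = 0, nidle = 0, saturated = 0, busy = 0;
};

// Parse the "shortfall ... nidle ... saturated ... busy ..." tail of a line.
// Returns false if the line does not contain all four fields in that order.
inline bool parse_state(const std::string& line, RscState& out) {
    std::size_t pos = line.find("shortfall ");
    if (pos == std::string::npos) return false;
    std::istringstream in(line.substr(pos));
    std::string k1, k2, k3, k4;
    RscState s;
    if (!(in >> k1 >> s.shortfall >> k2 >> s.nidle
             >> k3 >> s.saturated >> k4 >> s.busy)) {
        return false;
    }
    if (k1 != "shortfall" || k2 != "nidle" || k3 != "saturated" || k4 != "busy") {
        return false;
    }
    out = s;
    return true;
}
```

For the CPU line at 2:08:14 this gives shortfall 17888.34 and saturated 81471.66, which add up to 99360.00, the logged target work buffer (38880.00 + 60480.00 sec).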
ID: 103320 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103321 - Posted: 28 Feb 2021, 23:35:59 UTC - in response to Message 103320.  
Last modified: 28 Feb 2021, 23:47:24 UTC

And while the remaining CPU FGRP tasks are processed, the "saturated" field will shrink further. Because it doesn't increase when new work is downloaded, the host will keep asking for more CPU work until the project's hard limit is hit.
That's how this bug starts...

And another example:

3/1/2021 2:37:17 AM | | [work_fetch] --- state for CPU ---
3/1/2021 2:37:17 AM | | [work_fetch] shortfall 22859.50 nidle 0.00 saturated 76500.50 busy 0.00
3/1/2021 2:37:17 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:37:17 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 2:37:17 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 99510.95 busy 0.00
3/1/2021 2:37:17 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:37:17 AM | | [work_fetch] ------- end work fetch state -------
3/1/2021 2:37:17 AM | Einstein@Home | choose_project: scanning
3/1/2021 2:37:17 AM | Einstein@Home | skip: some task is suspended via Manager
3/1/2021 2:37:17 AM | | [work_fetch] No project chosen for work fetch
3/1/2021 2:37:19 AM | Einstein@Home | task h1_0950.90_O2C02Cl5In0__O2MD1S3_Spotlight_951.95Hz_1858_3 resumed by user
3/1/2021 2:38:17 AM | | choose_project(): 1614555497.609367
3/1/2021 2:38:17 AM | | [work_fetch] ------- start work fetch state -------
3/1/2021 2:38:17 AM | | [work_fetch] target work buffer: 38880.00 + 60480.00 sec
3/1/2021 2:38:17 AM | | [work_fetch] --- project states ---
3/1/2021 2:38:17 AM | Einstein@Home | [work_fetch] REC 24705.823 prio -1.569 can request work
3/1/2021 2:38:17 AM | | [work_fetch] --- state for CPU ---
3/1/2021 2:38:17 AM | | [work_fetch] shortfall 22931.45 nidle 0.00 saturated 76428.55 busy 0.00

I suspended a GW task, then resumed it: the "saturated" field continued to decrease slowly. It did not notice the resumed task.
But it did earlier, when I suspended and resumed an FGRP task.

Repeated example here:
3/1/2021 2:42:01 AM | | [work_fetch] --- state for CPU ---
3/1/2021 2:42:01 AM | | [work_fetch] shortfall 41284.22 nidle 0.00 saturated 58391.14 busy 0.00
3/1/2021 2:42:01 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:42:01 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 2:42:01 AM | | [work_fetch] shortfall 315.37 nidle 0.00 saturated 99044.63 busy 0.00
3/1/2021 2:42:01 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 2:42:01 AM | | [work_fetch] ------- end work fetch state -------
3/1/2021 2:42:01 AM | Einstein@Home | choose_project: scanning
3/1/2021 2:42:01 AM | Einstein@Home | skip: some task is suspended via Manager
3/1/2021 2:42:01 AM | | [work_fetch] No project chosen for work fetch
3/1/2021 2:42:25 AM | Einstein@Home | task LATeah1077F_88.0_3840_-4.9999999999999995e-11_1 resumed by user
3/1/2021 2:43:02 AM | | choose_project(): 1614555782.200201
3/1/2021 2:43:02 AM | | [work_fetch] ------- start work fetch state -------
3/1/2021 2:43:02 AM | | [work_fetch] target work buffer: 38880.00 + 60480.00 sec
3/1/2021 2:43:02 AM | | [work_fetch] --- project states ---
3/1/2021 2:43:02 AM | Einstein@Home | [work_fetch] REC 24707.715 prio -1.566 can request work
3/1/2021 2:43:02 AM | | [work_fetch] --- state for CPU ---
3/1/2021 2:43:02 AM | | [work_fetch] shortfall 23657.86 nidle 0.00 saturated 76116.43 busy 0.00
3/1/2021 2:43:02 AM | Einstein@Home | [work_fetch] share 1.000
3/1/2021 2:43:02 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 2:43:02 AM | | [work_fetch] shortfall 414.30 nidle 0.00 saturated 98945.70 busy 0.00

A clear, sharp increase in the amount of work after resuming a task for another app, one not listed in app_config!
ID: 103321 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103322 - Posted: 1 Mar 2021, 0:15:37 UTC - in response to Message 103321.  

And here is the resuming of an FGRP task that was downloaded while GW tasks were present in the cache:

3/1/2021 3:08:58 AM | | [work_fetch] --- state for CPU ---
3/1/2021 3:08:58 AM | | [work_fetch] shortfall 24251.31 nidle 0.00 saturated 75108.69 busy 0.00
3/1/2021 3:08:58 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 3:08:58 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 3:08:58 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 109085.20 busy 0.00
3/1/2021 3:08:58 AM | Einstein@Home | [work_fetch] share 0.000
3/1/2021 3:08:58 AM | | [work_fetch] ------- end work fetch state -------
3/1/2021 3:08:58 AM | Einstein@Home | choose_project: scanning
3/1/2021 3:08:58 AM | Einstein@Home | skip: some task is suspended via Manager
3/1/2021 3:08:58 AM | | [work_fetch] No project chosen for work fetch
3/1/2021 3:09:40 AM | Einstein@Home | task LATeah1077F_1416.0_450926_0.0_1 resumed by user
3/1/2021 3:09:59 AM | | choose_project(): 1614557399.032806
3/1/2021 3:09:59 AM | | [work_fetch] ------- start work fetch state -------
3/1/2021 3:09:59 AM | | [work_fetch] target work buffer: 38880.00 + 60480.00 sec
3/1/2021 3:09:59 AM | | [work_fetch] --- project states ---
3/1/2021 3:09:59 AM | Einstein@Home | [work_fetch] REC 24717.003 prio -1.620 can request work
3/1/2021 3:09:59 AM | | [work_fetch] --- state for CPU ---
3/1/2021 3:09:59 AM | | [work_fetch] shortfall 24370.72 nidle 0.00 saturated 74989.28 busy 0.00
3/1/2021 3:09:59 AM | Einstein@Home | [work_fetch] share 1.000
3/1/2021 3:09:59 AM | | [work_fetch] --- state for NVIDIA GPU ---
3/1/2021 3:09:59 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 109045.72 busy 0.00


So this one doesn't affect the saturated field either!
Everything downloaded after the bug is triggered (and the trigger is most probably downloading or starting a task of an app listed in app_config) will not increase the saturated field.
So there is no escape even by forbidding new GW work: new FGRP work will behave the same...
ID: 103322 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103330 - Posted: 1 Mar 2021, 12:34:51 UTC

Yes, that's the conclusion I reached in early December, and wrote up in the later stages of #4117 - specifically, https://github.com/BOINC/boinc/issues/4117#issuecomment-744042005. The patched version I mention there is still running fine.

The calculation of current cache size is done in rr_sim, which is used for two distinct purposes:
1) CPU sched - checking for missed deadlines etc.
2) Work fetch - seeing if there's a shortfall.

If a task isn't going to run at all (well, not yet) because the project is at max_concurrent, it's excluded from rr_sim. At the time that code was added, we weren't going to fetch at all from that project (see comment above linked comment). So it didn't matter.

But the code was changed again. Now, projects at max_concurrent can fetch work. But rr_sim wasn't updated. The exclusion of max_concurrent tasks from the work fetch version of rr_sim is at least in part (and I suggest the major part) responsible for the over-calculation of shortfall that you're seeing.
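
The effect of that exclusion on the shortfall figure can be illustrated with a toy model. This is a sketch under strong assumptions, not BOINC's rr_sim: queued work is spread evenly over the instances, and `Task` and `toy_shortfall` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified task record (not BOINC's RESULT class).
struct Task {
    double remaining_secs;           // estimated remaining runtime
    bool blocked_by_max_concurrent;  // would not start now because of the limit
};

// Toy shortfall estimate: queued work is spread evenly over `ninst`
// instances, and each instance is short by (buffer - its share), if positive.
// With `exclude_blocked` set, blocked tasks are ignored, mimicking the
// exclusion described above.
inline double toy_shortfall(const std::vector<Task>& queue, int ninst,
                            double buffer_secs, bool exclude_blocked) {
    double total = 0.0;
    for (const Task& t : queue) {
        if (exclude_blocked && t.blocked_by_max_concurrent) continue;
        total += t.remaining_secs;
    }
    double per_instance = total / ninst;
    return ninst * std::max(0.0, buffer_secs - per_instance);
}
```

With three 40000-second tasks on one instance and a 99360-second buffer, the model reports no shortfall when all tasks are counted, but 59360 seconds when the two blocked tasks are left out. That is the kind of over-calculation described above.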

What happens next depends on what you're trying to achieve.

If you want your personal copy of BOINC to run Einstein smoothly - reduce your cache (especially 'additional' days), and be careful (minimal) in your use of max_concurrent
If you want BOINC to work better for everyone - join BOINC's coding team, and help persuade David to solve the problem he created.

I find the C++ programming language, and David's telegraphic coding style, intensely difficult to follow - I was raised on Algol, 60 and W. And he's stopped talking to me. See if you can find the magic button to make him listen.
ID: 103330 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103331 - Posted: 1 Mar 2021, 16:05:10 UTC - in response to Message 103330.  
Last modified: 1 Mar 2021, 16:18:42 UTC

And he's stopped talking to me. See if you can find the magic button to make him listen.

Oh no! That sounds like the end of days, if even you, titled "messenger-pigeon" for so long, have failed %)

Unfortunately, "staying low" with a small cache didn't help. Maybe it would work if E@h were diluted in a swarm of other projects, but with E@h only...
It looks like diving into The Code is the only choice, but that would be a rather long road now...

EDIT: but I don't quite understand: in the title post there you say "patched version". So a patch already exists? Where is the problem then?
ID: 103331 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103333 - Posted: 1 Mar 2021, 17:50:40 UTC - in response to Message 103331.  

EDIT: but I don't quite understand: in the title post there you say "patched version". So a patch already exists? Where is the problem then?
Well, it exists on one test machine here. Did I say I hate this language? I hacked it: it's a horrible hack, and I'm not going to try to display it (much of the horribleness is because David passed a text string as a debug message: I needed to use it as a test for action choices). And it's three months ago, and I've forgotten most of what I hacked. But here goes...

The client does a round-robin simulation on the current cache, to check what's about to happen. It's called in two distinct places:

cpu_sched.cpp, #L882 ("rr_simulation("CPU sched");")
work_fetch.cpp, #L659 ("rr_simulation("work fetch");")

(end of part 1, "too many links" for the anti-spam watchdog)
ID: 103333 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103334 - Posted: 1 Mar 2021, 17:50:57 UTC - in response to Message 103333.  

Part 2:

In the "work_fetch" case (only, I think), we have to stop it ignoring max_concurrent. The current test and action are at #L340 (test) and #L299 (action, both in client/rr_sim.cpp). The action is

            } else if (p->pwf.at_max_concurrent_limit) {
                rsc_pwf.pending_iter = rsc_pwf.pending.erase(
                    rsc_pwf.pending_iter
                );
That 'erase' has to go, in the work fetch case.

The tricky bits are:
1) Which mode we're in - sched or fetch - is passed in a const char* variable called "why". Why, indeed? I'd define them as symbolic integer constants, I think.
2) There's an optimisation in rr_sim.cpp, #L662 that skips all the heavy lifting if we've done all the heavy lifting already this second. But that has to come out if we're doing it for the other reason this time.

I think the only 'functional' code that remains after all my botched hacks is the 'erase' action in L299 and the removed optimisation in L662. But that damned 'why' variable has to be passed through several subroutine calls.
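
The two changes described above (mode as a symbolic constant, and a shortcut that does not suppress a pass made for the other reason) could look roughly like this. It is a sketch only: `RrSimReason`, `drop_pending` and `RrSimCache` are hypothetical names, and it assumes the existing shortcut keys on the current timestamp. It is not the patch running on the test machine.

```cpp
// Hypothetical names; BOINC's actual code passes a const char* "why".
enum class RrSimReason { CpuSched = 0, WorkFetch = 1 };

// Should a pending job be dropped from the simulation?
// Mirrors the test quoted above, except that the max_concurrent exclusion
// is applied only for the CPU-scheduling pass.
inline bool drop_pending(bool rrsim_done, bool at_max_concurrent_limit,
                         RrSimReason reason) {
    if (rrsim_done) return true;
    return at_max_concurrent_limit && reason == RrSimReason::CpuSched;
}

// Per-reason "already simulated at this timestamp" check, so a work-fetch
// pass is not skipped just because a CPU-sched pass ran at the same time.
class RrSimCache {
public:
    // Returns true if the simulation must run for (reason, now), and
    // records that it has been run.
    bool should_run(RrSimReason reason, double now) {
        double& last = last_time_[static_cast<int>(reason)];
        if (last == now) return false;
        last = now;
        return true;
    }
private:
    double last_time_[2] = {-1.0, -1.0};
};
```

An enum keeps the mode check a plain comparison at every call site, and the per-reason cache keeps the shortcut for repeated calls with the same reason.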
ID: 103334 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103356 - Posted: 2 Mar 2021, 10:37:15 UTC - in response to Message 103334.  

And why not commit it as is to the repository, so that this patch or hack (usually they're the same thing :) ) goes into the next binary?
OK, I'll have to look in more detail.

More observations:
Once all GW tasks were finished, the BOINC client suddenly realized that it had TOO much work:

3/2/2021 13:24:21 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 4545980.39 busy 3479878.76

So the trigger is a GW task (the app listed in app_config) in the queue. No such tasks, no problem.
ID: 103356 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103357 - Posted: 2 Mar 2021, 10:50:32 UTC - in response to Message 103356.  
Last modified: 2 Mar 2021, 11:19:53 UTC


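#include <cstring>  // strcmp

void rr_simulation(const char* why) {
    static double last_time=0;
    // Compare the characters, not the pointers: why=="work fetch" tests addresses.
    bool work_fetch = (strcmp(why, "work fetch") == 0);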

And no more headaches with why char string.
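
One caveat when deriving a flag from the `why` string: with `const char*` operands, `==` compares addresses rather than characters, so a content comparison via `strcmp` is the more robust choice. A minimal sketch, where `is_work_fetch` is a hypothetical helper name:

```cpp
#include <cstring>

// True if `why` holds the text "work fetch", regardless of where the
// string is stored in memory.
inline bool is_work_fetch(const char* why) {
    return why != nullptr && std::strcmp(why, "work fetch") == 0;
}
```

This returns true even when the caller's string lives in a different buffer from the literal, which a pointer comparison is not guaranteed to do.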

if (rp->rrsim_done) {
    rsc_pwf.pending_iter = rsc_pwf.pending.erase(
        rsc_pwf.pending_iter
    );
} else if (p->pwf.at_max_concurrent_limit) {
    rsc_pwf.pending_iter = rsc_pwf.pending.erase(
        rsc_pwf.pending_iter
    );
}
I would read it as "if the simulation is done, skip the task; if the concurrent limit is active, skip it too".
So, you want to replace it with something like:

if (rp->rrsim_done) {
    rsc_pwf.pending_iter = rsc_pwf.pending.erase(
        rsc_pwf.pending_iter
    );
} else if (p->pwf.at_max_concurrent_limit && !work_fetch) {
    rsc_pwf.pending_iter = rsc_pwf.pending.erase(
        rsc_pwf.pending_iter
    );
}

?

But first of all, why should the simulation be skipped when a concurrent limit is in place?
Without that simulation a task can miss its deadline, can't it?
So deadline control is deliberately removed for tasks in "concurrent limit" mode - but why??
ID: 103357 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103358 - Posted: 2 Mar 2021, 11:07:07 UTC - in response to Message 103357.  
Last modified: 2 Mar 2021, 11:07:25 UTC

BTW, why would one limit the number of concurrent tasks per project?
That is, CPU + GPU tasks, any app.
I would say there are only cosmetic reasons, not technical ones. If such a limit causes any trouble in the simulation, I would deprecate it altogether.
As for concurrent tasks per particular app: they are "rightful members" among the other tasks, so they shouldn't miss deadlines either.

If one limits the number of concurrent app instances (I assume here that each task is one instance), one actually limits the number of computing devices available to that particular app.
So such work should be simulated, but against min(real_num_of_devices, number_of_concurrent_tasks) computing devices, because at any time only that many devices can be busy with such tasks.
In other respects they are just ordinary tasks with deadlines, requirements and so on.
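
The min(...) proposal above can be written down in a few lines. This is only a sketch of the idea, not BOINC code: `effective_instances` and `drain_time` are hypothetical names, and treating a limit of zero or less as "no limit" is an assumption of the sketch.

```cpp
#include <algorithm>

// Number of instances a limited app can occupy at once.
// Assumption for this sketch: max_concurrent <= 0 means "no limit".
inline int effective_instances(int real_instances, int max_concurrent) {
    if (max_concurrent <= 0) return real_instances;
    return std::min(real_instances, max_concurrent);
}

// Rough wall-clock time to drain `total_secs` of that app's work when it is
// spread evenly over the effective instances.
inline double drain_time(double total_secs, int real_instances,
                         int max_concurrent) {
    return total_secs / effective_instances(real_instances, max_concurrent);
}
```

For example, 36000 seconds of work for an app limited to 2 concurrent tasks on a 4-core host would be simulated as taking 18000 seconds, not 9000, so deadline checks would still see those tasks.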
ID: 103358 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103359 - Posted: 2 Mar 2021, 11:17:56 UTC - in response to Message 103357.  

But first of all, why should the simulation be skipped when a concurrent limit is in place?
Without that simulation a task can miss its deadline, can't it?
So deadline control is deliberately removed for tasks in "concurrent limit" mode - but why??
https://github.com/BOINC/boinc/commit/40f0cb44f4fcd11eb2789408dfc868de63e42242

Buried in all that mucking about is:

- work fetch: if a project is at a max concurrent limit,
    don't fetch work from it.
    The jobs we get (possibly) wouldn't be runnable.
    NOTE: we currently provide max concurrent limits
    at both project and app level.
    The problem with app level is that apps can have versions that
    use different resources.
    It would be better to have limits at the resource level instead.
We still don't have limits at the resource level: Project-level configuration

max_concurrent is available at the <app> level.
plan_class (which implies resource) is available at the <app_version> level.
Never the twain shall meet.

These days, David is operating as a coder, but he's alienated all the skilled program designers. Coding without designing first is a recipe for failure.
ID: 103359 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103360 - Posted: 2 Mar 2021, 11:39:49 UTC - in response to Message 103359.  
Last modified: 2 Mar 2021, 11:42:29 UTC

Well, all the first statements there are good ones, as it should be, indeed.
But the one you cited perhaps belongs not to "how it should be" but to "how I implemented it right now".
Completely disabling work fetch makes the option unusable in real life.

I agree that an app can list several different devices for its operation (perhaps you BOINC people call them "resources", but I prefer to distinguish computing devices from resources such as memory or HDD space), and this makes scheduling more complex.
But an app corresponds to a particular type of work the project does; it is an entity the user understands, and I would not move away from it.

The BOINC client knows which devices an app requires. Just substitute the number of available devices as I proposed: the minimum of what is really available and what the user allowed to run simultaneously.
This effective number of devices of a particular type can then be used for the simulation (of work for that particular app).
Do this for each separate device type the app uses, and you will account for apps with CPU + GPU requirements too (for example).
ID: 103360 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103361 - Posted: 2 Mar 2021, 11:45:02 UTC - in response to Message 103359.  


max_concurrent is available at the <app> level.
plan_class (which implies resource) is available at the <app_version> level.

That's OK!
One defines the resources needed per instance.
The other defines the allowed number of instances. That can be done at those levels.
ID: 103361 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103362 - Posted: 2 Mar 2021, 12:01:46 UTC

Just curious:
BOINC can request work for CPU and GPU separately. But the debug messages show only the total number of seconds.
Can the BOINC client request a number of seconds of work for a particular plan_class (for a particular app)?
ID: 103362 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 103363 - Posted: 2 Mar 2021, 12:41:01 UTC - in response to Message 103362.  
Last modified: 2 Mar 2021, 12:43:25 UTC

Just curious:
BOINC can request work for CPU and GPU separately. But the debug messages show only the total number of seconds.
Can the BOINC client request a number of seconds of work for a particular plan_class (for a particular app)?
No.

02/03/2021 12:33:51 | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
02/03/2021 12:33:51 | Einstein@Home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
02/03/2021 12:33:51 | Einstein@Home | [sched_op] Intel GPU work request: 1059.49 seconds; 0.00 devices
The client will report all app_versions it has available (read sched_request.xml!), but the server will decide what to send.

Edit: at the moment, I'm only choosing to run one of the two available intel_gpu apps, but that's a preference set on the project, not in the client.
ID: 103363 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103410 - Posted: 4 Mar 2021, 17:47:55 UTC - in response to Message 103363.  
Last modified: 4 Mar 2021, 18:10:52 UTC

Just curious:
BOINC can request work for CPU and GPU separately. But the debug messages show only the total number of seconds.
Can the BOINC client request a number of seconds of work for a particular plan_class (for a particular app)?
No.


Then how can one expect a correct implementation of max_concurrent at all?
I would say it's not a bug but a design flaw that prevents any implementation.

As I said earlier, max_concurrent essentially changes the number of computing devices for a particular type of work.
But the BOINC client can't distinguish between work for one app and work for another as long as both are for the same device type (CPU, for example).

Let's consider max_concurrent applied to tasks of type A _and_ (as the "bugfix" is supposed to ensure) those tasks correctly accounted for in the work queue.
The host asks for work, and the project gives it an arbitrary (because there is NO request for a particular type!) mix of type A and type B CPU work.
OK, the host starts to compute. Because A is limited in number of instances (and B is not), it's quite possible that the B work in the queue will be exhausted faster.
Then the host asks for work again and again receives an arbitrary mix of A and B. Even (!) if both A and B are in abundance on the server and the server itself provides a 50% share of each type, the host's queue will gradually become A-only (B is processed on more devices). And the real-life situation may be even worse: the project provides mostly A-type work (and again, the client can't regulate that!).
So after some time under such conditions the host will have a queue of A-type tasks with no B-type tasks at all.
And the queue is full.
So what should the client do? Ask for more CPU work in the hope that the server will provide B-type? It can't reject A-type, so this will overload the queue if the server provides A-type again.
Not request work at all? Then computing devices will sit idle.

So without a re-design there is no point even attempting a "bugfix": it's not a bug, it just can't work.

All scheduling should be app-type-centric: BOINC should distinguish work for different apps just as it distinguishes work for different projects.
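
The drift toward an A-only queue argued above can be reproduced with a small deterministic toy model. Everything in it is an assumption made for illustration (fixed time steps, a server that always splits each refill 50/50, work measured in seconds), and `Queue`, `step` and `a_fraction` are hypothetical names.

```cpp
#include <algorithm>

struct Queue {
    double a_secs;  // queued work of the limited app A
    double b_secs;  // queued work of the unlimited app B
};

// One step of the toy model: run for `dt` seconds, then refill the queue to
// `target_secs` with a fixed 50/50 split, because the request cannot name
// an app.
inline Queue step(Queue q, int ninst, int max_a, double dt, double target_secs) {
    int a_inst = std::min(ninst, max_a);
    double a_done = std::min(q.a_secs, a_inst * dt);
    // B may use every instance-second that A leaves free in this step.
    double b_capacity = ninst * dt - a_done;
    double b_done = std::min(q.b_secs, b_capacity);
    q.a_secs -= a_done;
    q.b_secs -= b_done;
    double request = std::max(0.0, target_secs - (q.a_secs + q.b_secs));
    q.a_secs += 0.5 * request;
    q.b_secs += 0.5 * request;
    return q;
}

// Share of the queue that belongs to app A.
inline double a_fraction(const Queue& q) {
    double total = q.a_secs + q.b_secs;
    return total > 0.0 ? q.a_secs / total : 0.0;
}
```

Starting from an even 20000/20000 queue with 4 instances, A limited to 1 instance, 1000-second steps and a 40000-second target, the A share climbs step by step and settles at about 97.5%. At that point B no longer has enough queued work to occupy the three remaining instances, even though the queue is full.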
ID: 103410 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103411 - Posted: 4 Mar 2021, 17:59:04 UTC - in response to Message 103410.  
Last modified: 4 Mar 2021, 18:07:43 UTC

A hack-type solution for this, without a proper re-design:

The client will report all app_versions it has available (read sched_request.xml!), but the server will decide what to send.
(I don't see the corresponding fields in that file currently, but I suppose they are there.)


The client knows which types are in its own queue.
If it sees that a computing device is idle (or could soon become idle if nothing but the A-type work from the previous post is provided), it EXCLUDES the A-type app plan class from its request to the server.
So the server will reply with type-B work or with no work at all, and the client will ask again without overflowing the queue.
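
The selection rule in this proposal might be sketched as follows. It is purely illustrative: as noted earlier in the thread, the client cannot currently request work per app, so `AppState` and `apps_to_request` are hypothetical, and a limit of zero or less is assumed to mean "no limit".

```cpp
#include <string>
#include <vector>

// Hypothetical per-app view of the client's queue.
struct AppState {
    std::string name;
    int running;         // tasks of this app currently running
    int max_concurrent;  // <= 0 means no limit (assumption of this sketch)
    int queued;          // tasks of this app waiting in the queue
};

// Apps worth asking for. An app that is already at its limit and still has
// queued tasks is skipped, since more of it cannot fill an idle instance.
inline std::vector<std::string> apps_to_request(const std::vector<AppState>& apps) {
    std::vector<std::string> out;
    for (const AppState& a : apps) {
        bool at_limit = a.max_concurrent > 0 && a.running >= a.max_concurrent;
        if (at_limit && a.queued > 0) continue;
        out.push_back(a.name);
    }
    return out;
}
```

With app A at its limit of one running task and five more queued, and app B unlimited with nothing queued, only B would be named in the request.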
ID: 103411 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103412 - Posted: 4 Mar 2021, 18:20:59 UTC - in response to Message 103410.  
Last modified: 4 Mar 2021, 18:26:26 UTC

And the same issue shows itself here too:

2 CPUs just sit idle because the host can't fit more GWnew tasks in memory and its queue is full of GWnew tasks.
Although FGRP-type work is present on the server, the host will not ask for it (it can't), so it cannot fill the idle CPUs with different work from the same project, because it can't distinguish types of work in a work request!

EDIT: BTW, what does nidle mean then??
3/4/2021 21:22:20 PM | | [work_fetch] --- state for CPU ---
3/4/2021 21:22:20 PM | | [work_fetch] shortfall 47990.74 nidle 0.00 saturated 224169.26 busy 0.00
3/4/2021 21:22:20 PM | Einstein@Home | [work_fetch] share 1.000
3/4/2021 21:22:20 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/4/2021 21:22:20 PM | SETI@home Beta Test | [work_fetch] share 0.000

2 CPUs are idle and BOINC doesn't see this??
ID: 103412 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.