Why project schedule priority so high?

GeneAZ

Joined: 28 Jun 14
Posts: 27
United States
Message 63033 - Posted: 14 Jul 2015, 4:45:55 UTC

System: Linux x64, BOINC 7.4.23, 4-core CPU + Nvidia GPU, 4 projects with individual resource shares as shown below:

  Project     Res. Share   Project Sched. Priority
  Seti            85            -1.08
  Einstein        10            -1.13
  Asteroids        4            -0.49
  NFS              1            -0.66

Work buffer parameters are set to 1 day + 0.5 days, and the system has been running with this configuration for over a month.

I frequently have to set No New Tasks for Asteroids, as it will otherwise overload the work buffer and starve other CPU applications by reason of "CPU not highest priority project." I then have to suspend the Asteroids project, at least momentarily, to allow CPU work fetch for the other projects.

For example, 4 hours ago the work buffer had 13 Seti (CPU) tasks at roughly 2.5 hours each, plus 44 Asteroids (CPU) tasks at roughly 2.4 hours each. A scheduler request was initiated to Asteroids and BOINC fetched 4 more(!) CPU tasks. Why?

Meanwhile, the Nvidia work flow (for Seti and Einstein) seems to allocate time on the GPU roughly in accordance with the respective resource shares.

I have limited the Asteroids project to 1 <max_concurrent> and have allowed Seti 3 <max_concurrent>. Those constraints are observed. But Asteroids runs its 1 allowed task 24/7 and never seems to get its scheduler priority below the other projects. The -0.49 value (in the table above) is the lowest I have seen recently; a value of -0.30 is more typical.

I apparently don't understand how Resource Share is "supposed" to work, especially with projects running a mix of CPU and GPU work.

Any instructive comments will be greatly appreciated.
ID: 63033
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15478
Netherlands
Message 63036 - Posted: 14 Jul 2015, 8:23:37 UTC - in response to Message 63033.  

I frequently have to set No New Tasks for Asteroids...

a) You choose to set No New Tasks on a project; nothing forces you to do so.
b) By setting NNT on a project, you impose your own form of scheduling on BOINC, breaking BOINC's ability to learn about these projects. The same goes for manually aborting tasks, etc.
c) The only way for BOINC to learn everything about the idiosyncrasies of all the projects, and the amount of work they send, is for you to let go and leave BOINC to do the scheduling without interruption, even if that takes months to work out, and even if that means some of the work will be auto-aborted by BOINC because of deadline problems. To use a weird analogy: it cannot learn without breaking some eggs.

Crudely put, until that time none of us can explain how things work, because you are not ready to hear it anyway.
ID: 63036
floyd
Help desk expert

Joined: 23 Apr 12
Posts: 77
Message 63037 - Posted: 14 Jul 2015, 9:36:49 UTC - in response to Message 63033.  

I don't know exactly what resource share influences either, but I think we can safely say it is some measure of "work done", and "work done" is not proportional to run time. It is probably based somehow on credit.

The problem in your case is that a GPU can do much more work than a CPU, and Asteroids doesn't reach its 4% share running on CPU only, even though it tries 24/7. Your latest constraints have made things even worse. If you let Asteroids run using all the resources it needs, it might eventually reach 4%, so some other CPU tasks could run; some, though not necessarily as many as you want. On the other hand, even full CPU power might be insufficient to reach 4%, or it could take a very long time.
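To put rough numbers on that, here is a minimal back-of-envelope sketch in C++. The GPU speed factors are assumptions, not measurements of this particular host; the point is only how the arithmetic works out.

// Back-of-envelope for the point above: what fraction of the host's total
// flops-based credit can a CPU-only project reach? The GPU factors are
// assumed values, not measured for this machine.
#include <cstdio>

int main() {
    const double cpu_cores = 4.0;
    const double share = 0.04;                    // Asteroids' 4% resource share
    const double gpu_factors[] = { 25.0, 100.0 }; // GPU rated as N CPU cores (assumed)

    for (double gpu : gpu_factors) {
        double total = cpu_cores + gpu;           // total credit rate, in CPU-core units
        double one_core = 1.0 / total;            // Asteroids on 1 core, 24/7
        double all_cores = cpu_cores / total;     // Asteroids allowed every core
        printf("GPU %.0fx: one core = %.1f%%, all cores = %.1f%%, share wanted = %.0f%%\n",
               gpu, 100 * one_core, 100 * all_cores, 100 * share);
    }
    // If the achievable fraction stays below the 4% share, the project is
    // permanently behind, its scheduling priority stays above the others,
    // and it keeps fetching CPU work.
    return 0;
}

With a 25x GPU, one core tops out around 3.4%; with a 100x GPU, even all four cores together stay under 4%.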

You could lower the resource share a bit to make it easier for BOINC, but the smallest possible change, from 4 to 3, is already more than a bit. I noticed you made your resource shares sum to 100. When I was in this situation, I used a sum of 1000 or even 10000 so I could make finer adjustments. But even then you won't find a setting that works permanently: a single GPU task can throw everything way out of balance, and it will take a lot of time to readjust even if there are no further disturbances in the meantime.

To put all this briefly: forget it. You can't run a GPU/CPU project together with a CPU-only project the way you want. In theory it could work; in practice it won't. The best thing you can do is not run CPU tasks on your GPU projects, so the projects only have to compete in their own class.
ID: 63037
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 63038 - Posted: 14 Jul 2015, 11:20:14 UTC - in response to Message 63037.  

The whitepaper for that is https://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen.

Proposal: credit-driven scheduling

The idea is to make resource share apply to overall credit, not to individual resource types. If two projects have the same resource share, they should have the same RAC. Scheduling decisions should give preference to projects whose share of RAC is less than their resource share.

There are problems with using project-granted credit as a basis for this approach:

There may be a long and variable delay between completing a job and getting credit for it.
Jobs may fail to get credit, e.g. because they don't validate.

Hence we will use a surrogate called estimated credit, maintained by the client. If projects grant credit fairly, and if all jobs validate, then estimated credit is roughly equal to granted credit over the long term.

And that's what the current BOINC v7 clients do.

Because REC estimates credit much more closely in alignment with the formal definition of the cobblestone than many projects do, REC (and consequently Resource Share) follow flopcounting much more closely than they follow actual granted credit.

I personally suspect that the 'flops' which are 'counted' for REC are those declared by the programmer/deployment administrator in <rsc_fpops_est>. That's my theory (partly) for why SETI's MB and AP tasks award different RAC, and likewise Albert's Arecibo and Perseus Arm surveys (IIRC those were the two apps which trended to different limit values).
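For reference, the cobblestone is defined as 1/200 of a day on a reference machine doing 1 GFLOPS (Whetstone), so a purely flops-based credit estimate looks roughly like the sketch below. The task's fpops figure is hypothetical, and whether the client really takes it from <rsc_fpops_est> or from benchmarked speed times runtime is exactly the open question above.

// Sketch: converting a flops count into cobblestones.
// One credit = 1/200 day on a 1 GFLOPS (Whetstone) reference machine,
// i.e. roughly 4.32e11 floating-point operations.
#include <cstdio>

int main() {
    const double flops_per_credit = 86400.0 / 200.0 * 1e9;   // 4.32e11

    // Hypothetical task: the project declares 3e13 fpops for it.
    const double rsc_fpops_est = 3.0e13;

    double estimated_credit = rsc_fpops_est / flops_per_credit;
    printf("flops-based credit estimate for this task: %.1f\n", estimated_credit);
    // If the declared fpops are far from what the application really does,
    // a flops-counting estimate and the project's granted credit diverge,
    // which fits the RAC-vs-REC gaps reported later in this thread.
    return 0;
}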
ID: 63038
William
Joined: 19 Feb 10
Posts: 97
Message 63042 - Posted: 14 Jul 2015, 13:35:52 UTC

client/cpu_sched.cpp#L548

x += p->rsc_pwf[j].secs_this_rec_interval * f * rsc_work_fetch[j].relative_speed;


f = gstate.host_info.p_fpops;

I haven't worked out the 'relative_speed' yet, but it is possibly something derived from device [GPU] speed.

IOW as far as I can see, REC is calculated from runtime and has nothing to do with actual credit or even actual work done by an app/task.

In yet other words, recent estimated credit is really recent time the project was running, factoring in that GPUs are faster than CPUs.

Let me illustrate. A one-CPU rig runs 3 projects at equal resource shares.
The client will allocate 1/3 of its uptime to each project. If for some reason one project doesn't have work, the others get more time. As soon as the 3rd project has work again, the client will switch to that project until it has caught up. In numbers: for 6 hours only two projects run, which makes 3 hours each; for the next 3 hours the 3rd project runs _exclusively_, so that it too has had 3 hours of crunch time.

With unequal resource shares, factor that in.

Now say 2 CPUs and 1 GPU, which BOINC thinks has 4 times the speed of the CPUs. Two projects with equal shares: one CPU-only (project A), the other running on both (project B).

Let's assume you get 1 estimated credit per CPU-hour, to make this easier...

Run for 1 hour. Project B gets the GPU and earns 4 REC (the GPU is 4x faster). Project A gets the CPUs and earns 2 REC (2 CPUs * 1 hour).
So project B will accumulate REC twice as fast. Project A has less REC, but according to the resource shares you want them even, so the CPUs will always ask project A for work first.

Project priority for A will always be higher (assuming there is always work to be had).

So, assuming Asteroids runs CPU-only, it will get most of the CPU time and SETI will get the GPU. If you want less Asteroids and more SETI, you need to set the relative resource shares so that they exceed the relative speed of the GPU.

In the example above, if you want project B to do CPU tasks as well, you need to set the resource share for B more than 4x higher than A's.
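That illustration as a small runnable sketch (the 1-credit-per-CPU-hour unit and the 4x GPU factor are just the assumed numbers from the example; the real client also decays REC over time, which is ignored here):

// Toy model of the example above: 2 CPUs plus 1 GPU rated at 4x a CPU.
// Project A is CPU-only, project B runs on both; equal resource shares.
// REC is modelled simply as device-speed-weighted runtime, with no decay.
#include <cstdio>

int main() {
    const double cpu_speed = 1.0;      // 1 "credit" per CPU-hour (assumed unit)
    const double gpu_speed = 4.0;      // GPU counted as 4 CPUs
    double rec_a = 0.0, rec_b = 0.0;

    for (int hour = 1; hour <= 6; ++hour) {
        rec_a += 2 * cpu_speed;        // A keeps both CPUs because it is "behind"
        rec_b += gpu_speed;            // B keeps the GPU
        printf("hour %d: REC A = %.0f, REC B = %.0f\n", hour, rec_a, rec_b);
    }
    // B accumulates REC twice as fast as A, so with equal shares A always
    // looks more deserving and the CPUs keep fetching from A. Raising B's
    // share shifts that balance; the rule of thumb above is to exceed the
    // GPU speed ratio.
    return 0;
}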

If you upset the balance by suspending projects, BOINC will have even more catching up to do...

HTH
ID: 63042
Elektra*
Joined: 12 Jul 14
Posts: 6
Germany
Message 63044 - Posted: 14 Jul 2015, 14:45:56 UTC - in response to Message 63038.  
Last modified: 14 Jul 2015, 14:51:13 UTC

The whitepaper for that is https://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen.

Proposal: credit-driven scheduling

The idea is to make resource share apply to overall credit, not to individual resource types. If two projects have the same resource share, they should have the same RAC. Scheduling decisions should give preference to projects whose share of RAC is less than their resource share.

There are problems with using project-granted credit as a basis for this approach:

There may be a long and variable delay between completing a job and getting credit for it.
Jobs may fail to get credit, e.g. because they don't validate.

Hence we will use a surrogate called estimated credit, maintained by the client. If projects grant credit fairly, and if all jobs validate, then estimated credit is roughly equal to granted credit over the long term.

And that's what the current BOINC v7 clients do.

Because REC estimates credit much more closely in alignment with the formal definition of the cobblestone than many projects do, REC (and consequently Resource Share) follow flopcounting much more closely than they follow actual granted credit.

I personally suspect that the 'flops' which are 'counted' for REC are those declared by the programmer/deployment administrator in <rsc_fpops_est>. That's my theory (partly) for why SETI's MB and AP tasks award different RAC, and likewise Albert's Arecibo and Perseus Arm surveys (IIRC those were the two apps which trended to different limit values).


Richard mentioned the whitepaper which describes how CreditNew should work; here is another document that shows why CreditNew doesn't work at present and won't work in the future either:

https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/EinsteinAtHome/BOINC/EvaluationOfCreditNew

The crux is the <dont_use_dcf> tag that many projects use. For the BOINC 6 clients I've reverted to, this tag is no problem, as BOINC 6 doesn't know it and the clients adjust themselves reliably with DCF (the duration correction factor). But with <dont_use_dcf> the newer clients are unable to adjust themselves. They can see that the actual runtime of a task is diverging from the predicted runtime, and they correct the estimated remaining time of the running task according to its progress, but the client will never adjust the predicted runtimes of the tasks waiting in the buffer. So our volunteers' BOINC clients are dependent on the project schedulers and their runtime estimates for the tasks they send, and the projects often badly underestimate the length of those tasks and cannot adjust to client-side conditions. There are even hints that some projects do this intentionally to grab a higher share (resource theft).
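As a rough sketch of the mechanism being described here (variable names and numbers are illustrative only, and the real client's DCF update rule differs in detail):

// Sketch of the DCF idea: the client keeps a per-project duration
// correction factor and scales the server's runtime estimates by it, so the
// estimates for queued tasks track reality over time. With <dont_use_dcf>
// the factor effectively stays at 1 and the client depends on the project's
// own estimates being accurate.
#include <cstdio>

int main() {
    double rsc_fpops_est = 3.0e13;   // fpops the project claims per task (hypothetical)
    double device_flops  = 4.0e9;    // projected speed of the device (hypothetical)
    double dcf           = 1.0;      // duration correction factor, starts at 1

    for (int task = 1; task <= 5; ++task) {
        double base      = rsc_fpops_est / device_flops;   // server-side estimate, seconds
        double estimated = base * dcf;
        double actual    = 15000.0;  // tasks really take ~4.2 h, not the ~2.1 h claimed

        // Nudge the factor toward the observed ratio (only the idea, not the
        // real update rule).
        dcf += 0.5 * (actual / base - dcf);

        printf("task %d: estimated %.0f s, actual %.0f s, new DCF %.2f\n",
               task, estimated, actual, dcf);
    }
    return 0;
}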

For me, falling back to BOINC 6 solved my problems, but that might not be an option for you, as I'm only crunching CPU tasks.
Love, Michi
ID: 63044
GeneAZ

Joined: 28 Jun 14
Posts: 27
United States
Message 63087 - Posted: 16 Jul 2015, 17:22:08 UTC

#Ageless:
I did not feel "forced" to set NNT; I just got impatient with the resource share settings I tried back in February. Asteroids, at RS=5%, filled the work buffer; Seti, at RS=90%, exhausted its supply of tasks, and 3 cores sat idle while Asteroids ran 1 core 24/7, refilling the buffer as needed and always running at a higher scheduler priority, blocking any Seti downloads. After a couple of days in this state I intervened manually, knowing very well that it would interfere with the BOINC "learning" process. My strategy was to suspend Asteroids occasionally to allow Seti work to download, but otherwise let both projects do some work and hope that BOINC would eventually reach some equilibrium and back off the Asteroids project, as I had intended with the resource share settings.

#floyd:
You could lower the resource share a bit

I have now set Asteroids to a 1% share (and given the freed 3% to Seti). In retrospect, my initial thinking of gradually reducing the Asteroids share until it got its "proper" share was flawed. I should have started at 1% and then, if appropriate, raised the share; i.e., home in on a share setting from the under-utilized side.

You may be right, that there is no way in the current Boinc manager to mix CPU and GPU work/projects and get them to play together.

#Richard & Elektra:
CreditNew doesn't work at present and even won't work in the future:

I read the two papers. The second one alludes to some control system theory that was very heavy going. I think I see the point of the argument, but the details are above my pay grade.
I have seen CreditNew threads in other project forums, of course. My own feeling is that local host run-times would have been the best basis for resource share, but I do understand the intent of the CreditNew scheme. We play the hand we're dealt! For my four active projects I see (via work_fetch debug) a wide divergence between the RAC and REC values. For example, the present Seti RAC is 5225 but the REC in work fetch is 28812. Einstein is much closer, at 4472 vs. 4660.

#William:
as far as I can see, REC is calculated from runtime and has nothing to do with actual credit

I can't reach that conclusion from my observations (see the paragraph above), but it could be true. I commend you for your efforts to dig into the actual BOINC client code. (I looked at the code months ago, seeking an explanation for an entirely different BOINC issue, and became hopelessly lost in data structures, variable names, and function scope.) The relative performance of the GPU and CPU resources should enter into the scheduling process, but it will differ for each project and likely depends on programming efficiency, etc. I have stopped running Asteroids GPU tasks because they actually run *longer* than the CPU tasks. (And Asteroids grants fixed credit for each work unit regardless of actual run time.)

If you upset the balance by suspending projects

Maybe my strategy is misguided. I only suspend a project when it has (over)filled the work buffer, thus blocking any other project from work fetch, and there are idle CPU cores as a result. Suspending for just a minute or two is sufficient for work fetch to proceed for another project; then I resume the suspended project and let all the cores crunch away as intended. I don't see the harm in this use of project "suspend."

I have scanned the work buffer to get the total estimated hours of work for all four active projects, separately for CPU and GPU where relevant. I'm willing to see how this develops over the next week (or more). At present, Asteroids has 56 hours of work pending, and it can satisfy all deadlines, but only by running pretty much 100% of the time. The question is: as the work buffer is drawn down, will it fetch more work far in excess of the 1% (intended) resource share?

Thanks to everybody for your insight from various perspectives.

GeneAZ
ID: 63087
GeneAZ

Joined: 28 Jun 14
Posts: 27
United States
Message 63108 - Posted: 18 Jul 2015, 5:19:42 UTC

As of 10 minutes ago, all Seti CPU work has been drained out of the buffer. There is CPU work for the other three projects and they are all using one core each. There is Nvidia (GPU) work for Seti and Einstein and I think they are sharing the GPU resource more or less according to the share settings.
Now I shall try to be patient and leave things alone to see what happens.

Gene;
ID: 63108
Gary Charpentier
Joined: 23 Feb 08
Posts: 2462
United States
Message 63112 - Posted: 18 Jul 2015, 14:21:55 UTC - in response to Message 63108.  

As of 10 minutes ago, all Seti CPU work has been drained out of the buffer. There is CPU work for the other three projects and they are all using one core each. There is Nvidia (GPU) work for Seti and Einstein and I think they are sharing the GPU resource more or less according to the share settings.
Now I shall try to be patient and leave things alone to see what happens.

Gene;

FYI, I popped over to the status page (http://setiathome.berkeley.edu/sah_status.html) and "results ready to send" is ZERO for AP tasks.
ID: 63112
GeneAZ

Joined: 28 Jun 14
Posts: 27
United States
Message 63284 - Posted: 28 Jul 2015, 23:46:44 UTC

Here's an update on the progress of BOINC learning how to manage the flow of CPU tasks on my system:
  project  Res.share  Buffer:tasks/hours  sched.priority
    Seti       88          0 / 0               -1.11
    Einstein   10          0 / 0               -1.39
    NFS         1         38 / 63              -1.15
    Asteroids   1         28 / 65              -1.11


The "hours" indicated is the sum of the tasks estimates. It is pretty close since hundreds of tasks have been done and a decent average is established.

This (the table above) is a snapshot of the buffer at 8 a.m. this morning. It is similar to the buffer content on each of the two preceding days. Occasionally a Seti CPU task gets downloaded, and it runs immediately since there is an idle core. (NFS and Asteroids seem to be using one core each, and one core feeds the GPU.)

I will continue to be patient with BOINC as it tries to figure out what I want done. So, no changes to resource shares until further notice, but one can see why my "micro-management" finger is getting itchy.
ID: 63284

