Slow scheduling

marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90003 - Posted: 11 Feb 2019, 7:15:50 UTC
Last modified: 11 Feb 2019, 7:59:41 UTC

Today 29 of my machines missed out on a release of work from Project 3, and I need a way to prevent this from happening again.

The cores must be occupied 24/7, so a few projects are set to 0 resource share and the cores are kept busy with a cache of about 3 - 5 WUs from each of those.

Three high-priority projects are set to a 49 share.
Project 1 checked for work about every 3 hours but found none.
Project 2 checked about every 6 hours and found no work (although the server did hand some out to other clients in that period, so I missed out on those).
Project 3 asked for work once in about 12 hours, released work in that period, and only my one laptop caught some, because it started requesting work frequently in the afternoon (I had suspended all its other projects in order to focus on one project that was nearing its deadline).

The scheduling rate would increase if I left cores idle, but it's -10°C outside and the machines heat the house.

Project 3 had been sending WUs that run 2 to 7 days, but the WUs that appeared today were 1 to 8 hours long.

If I set <rec_half_life_days>X</rec_half_life_days> to 30 days, would that improve scheduling, or would setting it to the other extreme, 1, help?

Would setting it to 0 force the scheduler to request work hourly and ignore the lengths of past work units?

Is there any other mechanism to speed scheduling?
ID: 90003
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2516
United Kingdom
Message 90006 - Posted: 11 Feb 2019, 9:00:04 UTC - in response to Message 90003.  

I suspect some of this may be down to how the individual projects behave. After a scheduler request, CPDN won't allow another one for an hour, and that hour resets if you manually try again before it is up. This means that with small batches, work can come and go within that hour. At one time it was possible to monitor the server status page and manually request work when it appeared; these days the page isn't updated often enough for that to be an option.

I don't know how other projects manage it, except that there are some others that have a backoff time.
ID: 90006
marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90032 - Posted: 12 Feb 2019, 3:54:47 UTC - in response to Message 90003.  
Last modified: 12 Feb 2019, 4:03:04 UTC

BOINC version 7.8.3 (the same on every machine), Windows 7.
6 projects: 3 at 0 resource share, 3 at 49 resource share.


If I set <rec_half_life_days>X</rec_half_life_days> to 30 days, would that improve scheduling, or would setting it to the other extreme, 1, help?

Would setting it to 0 force the scheduler to request work hourly and ignore the lengths of past work units?

Is there any other mechanism to speed scheduling?
ID: 90032
marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90090 - Posted: 13 Feb 2019, 4:17:40 UTC

I set the value to 0 on 8 machines and saw the same slow requests from Project 3, but eventually the project had a flood of WUs lasting hours and those machines got work by the evening.

Of the other 16 machines, I set 8 to 1 and left the other 8 at 30 days.

The machines set to 1 all got work in the morning.
The machines set to 30 all got work by evening.

Would anyone working the help desk like to weigh in on what I'm seeing, give me a place to read about the functionality of <rec_half_life_days>X</rec_half_life_days>, and provide a detailed example of how this setting affects scheduling requests?
ID: 90090
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1283
United Kingdom
Message 90100 - Posted: 13 Feb 2019, 7:47:15 UTC

From the BOINC documentation Wiki https://boinc.berkeley.edu/wiki/User_manual
<rec_half_life_days>X</rec_half_life_days>
    A project's scheduling priority is determined by its estimated credit in the last X days. Default is 10; set it larger if you run long high-priority jobs. 
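
For reference, this option lives in the client's cc_config.xml, inside the <options> section - a minimal sketch, with 10 standing in for whatever value you settle on:

    <cc_config>
      <options>
        <rec_half_life_days>10</rec_half_life_days>
      </options>
    </cc_config>

The client re-reads the file on restart, or via the read-config-files menu entry in the BOINC Manager.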


Regarding long-duration tasks - those taking several days to complete - one project I know of with very long tasks is CPDN, but its tasks run at low priority, so this parameter may be of no use to you.
Without knowing which projects you run, it is very difficult to give any guidance on what value to use. If your projects' tasks take minutes to hours to run, then this setting will have very little impact on scheduling.

You would do better to look at the cache sizes. There are two values. The first is the main cache size; for most people a cache of between two and five days is adequate. The second is the confusingly named 'store additional X days'; in reality it determines how often a call for work is made to each project - the lower the value, the more frequent the check to see whether you need work - and it can be a very small number (a fraction of a day; 0.1 gives roughly 2.4-hour intervals).
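
If you set your preferences locally rather than on a project website, these two values map to a global_prefs_override.xml along these lines - a sketch only, with example numbers:

    <global_preferences>
      <work_buf_min_days>2.0</work_buf_min_days>
      <work_buf_additional_days>0.1</work_buf_additional_days>
    </global_preferences>

That keeps roughly two days of work on hand and has the client checking for top-ups at roughly 2.4-hour intervals.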

Also, of course, many projects have periods when no work is available, for a whole range of reasons. This can really upset inter-project scheduling, particularly when one considers that BOINC works on a medium/long-term deficit basis to determine which projects get work and which are 'ignored'.
ID: 90100
marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90125 - Posted: 14 Feb 2019, 2:12:11 UTC - in response to Message 90100.  



If your projects' tasks take minutes to hours to run, then this setting will have very little impact on scheduling.


The project that is not scheduling frequently enough has WUs as short as 5 minutes and as long as 10 days.
So, to make sure WUs are secured from that project, the scheduler has to be set as if any incoming work were a 5-minute WU.

The second is the confusingly named 'store additional X days'; in reality it determines how often a call for work is made to each project - the lower the value, the more frequent the check to see whether you need work - and it can be a very small number (a fraction of a day; 0.1 gives roughly 2.4-hour intervals).

Thank you!

This is where my understanding of the scheduler was flawed. I currently have this set to 1.5 days, and that explains the 3-to-12-hour delays.

What is the effect of setting it to 0?
Very frequent updates, or does the algorithm treat 0 as 'use the default value'? And what is the default value? (I think a default BOINC install uses 1.5 days.)

Also, of course, many projects have periods when no work is available, for a whole range of reasons. This can really upset inter-project scheduling, particularly when one considers that BOINC works on a medium/long-term deficit basis to determine which projects get work and which are 'ignored'.


This is becoming the norm for most of my projects, as I attempt to get at least 5,000 hours (WUProps) on any project application I decide to take on. Much of the time my computers are now waiting for the infrequent work units.
Also, my intuition is that the computing power attached to BOINC projects has increased over the last decade and that work shortages are more common, but some cross-project analysis would be needed to measure the workloads.

I have separate issues with taming the 0-resource-share projects that serve as backups or still need to complete their requisite work-hour obligations.

One of the two most problematic projects is Asteroids@home. It's set to 0 resource share, yet it floods the cache with WUs that are estimated at 110 minutes and then actually take 13 hours. The issue seems to be that Period Search is the project's only application, and it runs on both CPU and GPU. cc_config is already set not to use GPUs, and there are no GPUs in the machine usable by Asteroids@home, so will <rec_half_life_days>X</rec_half_life_days> play a part in taming this project? I've given up on Asteroids@home until this is resolved.
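
(For reference, the GPU lockout I mean is the standard client option in cc_config.xml - a minimal sketch:

    <cc_config>
      <options>
        <no_gpus>1</no_gpus>
      </options>
    </cc_config>

so the client shouldn't even be requesting GPU work from Asteroids@home.)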
ID: 90125
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1283
United Kingdom
Message 90128 - Posted: 14 Feb 2019, 7:21:23 UTC

If a project badly underestimates the duration of its tasks, there is little BOINC can do to stop that project from flooding you with work. If it is happening consistently, you should take it up with the project concerned.

You still haven't said which projects you are trying to run, or whether all your computers are attached to the same set of projects.
ID: 90128
Gary Roberts
Joined: 7 Sep 05
Posts: 130
Australia
Message 90133 - Posted: 15 Feb 2019, 0:25:00 UTC - in response to Message 90125.  

What is the effect of setting it to 0?
With that setting, your BOINC client won't ask for any additional work on top of what is in the first setting.

Very frequent updates, or does the algorithm treat 0 as 'use the default value'? And what is the default value? (I think a default BOINC install uses 1.5 days.)
Here are the details of what these settings do. As an example, assume the first setting is 2.0 days and the second (extra days) setting is 1.5 days. BOINC regards the cache as 'full' when the estimates of all tasks on hand add up to more than 3.5 days. As you complete and return work, BOINC takes no action to replenish the cache until the work on hand falls below 2.0 days. So putting a value in the extra-days setting creates a hysteresis between a 'high water' mark and a 'low water' mark; in this example those marks are separated by 1.5 days. This is precisely why there are long periods in which no work is requested.

If a user really wants the best chance of a stable level of work on hand, with regular top-ups, I think the best option is to put the amount wanted in the first setting and leave the second one at zero. Please realise that BOINC can only make work-fetch decisions based on estimates. You mention tasks for a single project taking between 5 minutes and 10 days. Do these tasks come with proper estimates when you first receive them? No problem if they do, since BOINC can handle that. If they don't, you should complain bitterly in the project's forums, since that is really bad behaviour which will cause lots of problems for your client when managing the work flow for multiple projects.
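
In file terms - and assuming you set preferences locally via global_prefs_override.xml, where the two tags below are what these settings are called - a sketch of the arrangement I'm suggesting, with 3.5 days as an example target:

    <global_preferences>
      <work_buf_min_days>3.5</work_buf_min_days>
      <work_buf_additional_days>0.0</work_buf_additional_days>
    </global_preferences>

With zero extra days there is no hysteresis gap, so the client asks for a top-up as soon as the estimates on hand dip below 3.5 days.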
Cheers,
Gary.
ID: 90133
marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90138 - Posted: 15 Feb 2019, 11:23:07 UTC - in response to Message 90133.  

Here are the details of what these settings do. As an example, assume the first setting is 2.0 days and the second (extra days) setting is 1.5 days. BOINC regards the cache as 'full' when the estimates of all tasks on hand add up to more than 3.5 days. As you complete and return work, BOINC takes no action to replenish the cache until the work on hand falls below 2.0 days. So putting a value in the extra-days setting creates a hysteresis between a 'high water' mark and a 'low water' mark; in this example those marks are separated by 1.5 days. This is precisely why there are long periods in which no work is requested.

If a user really wants the best chance of a stable level of work on hand, with regular top-ups, I think the best option is to put the amount wanted in the first setting and leave the second one at zero. Please realise that BOINC can only make work-fetch decisions based on estimates.


Nice explanation - one for the FAQs, if it's not already there.


You mention tasks for a single project taking between 5 minutes and 10 days. Do these tasks come with proper estimates when you first receive them? No problem if they do, since BOINC can handle that. If they don't, you should complain bitterly in the project's forums, since that is really bad behaviour which will cause lots of problems for your client when managing the work flow for multiple projects.


They seem to be proper estimates, but my machines are downclocked/upclocked depending on the temperature in the house, which varies daily now that spring is approaching. Unlike Asteroids@home (where WU names are identical for GPU and CPU work; that project will be isolated from now on), both prime-searching projects I've worked on (PrimeGrid and SRBase) have WUs with short to extremely long computation times (30 minutes to 235 days). One math project, YAFU, has quirky work units that vary in length and give a useless time-to-completion estimate. The WUs are multi-threaded and can take from 200,000 to 3,500,000 CPU seconds with unpredictable credit, BUT the project owner (Yoyo) understands the issues, only gives out a maximum of 2 WUs per client, and allows a second, week-long deadline past the one listed in the client before invalidating.

I'll keep 'additional days' set to 0.
I'll go back to <rec_half_life_days>30</rec_half_life_days> instead of the <rec_half_life_days>1</rec_half_life_days> I've used for the last 3 days, since several WUs are running 10+ days.

Is it a good heuristic to set <rec_half_life_days>X</rec_half_life_days> to an X twice as long as your maximum WU length?
ID: 90138
marmot
Joined: 16 Sep 13
Posts: 82
United States
Message 90139 - Posted: 15 Feb 2019, 11:44:07 UTC - in response to Message 90128.  

If a project badly underestimates the duration of its tasks, there is little BOINC can do to stop that project from flooding you with work. If it is happening consistently, you should take it up with the project concerned.



1) Asteroids@home, where the WU names are identical for GPU and CPU, has been that way for years, and my complaint will likely be ignored. I'll isolate it to a single machine with the GPU it needs, or to a separate BOINC data directory (see the sketch after this list) or a VM.
2) Rosetta's lack of server-side control over whether you receive mini or large WUs has also been in place for years; I'll say something, but I have no illusions that it will get results.
3) DHEP is ignoring the work cache and (from what I gathered) using the NCI mechanism for computationally intensive work. Many complaints have already been registered, and the project has stated it is understaffed and won't be addressing them. It's a new project trying to attract massive computational power, so inflated credit and work-cache dominance seem helpful to that endeavour. Suspending one task is the best control, as the WUs (placeholders) run 19 days.
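
(For the separate-directory idea in point 1, what I have in mind is running a second client instance against its own data directory - a sketch, with a hypothetical path and port:

    boinc --allow_multiple_clients --dir C:\BOINC_asteroids --gui_rpc_port 31418

and attaching only Asteroids@home in that instance.)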

Every other issue with the projects set to 0 resource share is minor and likely corrected by your and Gary Roberts' advice.
ID: 90139
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 90140 - Posted: 15 Feb 2019, 12:32:55 UTC - in response to Message 90139.  

1) Asteroids@home, where the WU names are identical for GPU and CPU, has been that way for years, and my complaint will likely be ignored. I'll isolate it to a single machine with the GPU it needs, or to a separate BOINC data directory or a VM.
The identical name (by itself) shouldn't be a problem. The CPU and GPU tasks need different binaries to run them, and they should be distinguishable by plan class (shown in brackets in the 'Application' column in BOINC Manager's Advanced view - probably for the GPU tasks only).

Provided the separate plan class for GPU apps is in place, BOINC should track the estimated run times for the two different processes independently.

[I thought Einstein was the only GPU project still failing to do that, because it deliberately opted out of CreditNew, for understandable reasons.]
ID: 90140
