Why doesn't Boinc schedule earlier deadlines first?

Nick Name

Joined: 14 Aug 19
Posts: 55
United States
Message 94914 - Posted: 13 Jan 2020, 9:50:21 UTC

What's irksome is when things like the following happen. Right now I'm attached to:

    1) YoYo (share 100)
    2) Latin Squares (share 1)
    3) T.Brada Experiment (share 1)
    4) QuChemPedIA@home (share 1)


I'm also attached to WUProp and GoofyGrid (both with shares of 100) but those don't really affect other projects as they're NCI.

Latin Squares (LS) was previously at 100 and I wasn't crunching YoYo at all. I hit a goal on LS and moved on to YoYo within the last week. Shares were set to one and 100 respectively. Other projects were unchanged.

Right now five T. Brada jobs are running in high priority mode. They were received on the 7th and based on FIFO should have been returned before now, as the deadline is now less than 24 hours. BOINC suspended one of them and started a YoYo job, received on the 12th, which ran for three minutes. BOINC then lost its mind, suspended that job plus another YoYo job (also received on the 12th) that had run for 17 minutes, and started the suspended T. Brada job plus another one.

I have my cache set to 0.5 days and 0.1 additional days, as I've been crunching long enough to know that nothing screws BOINC up like a large cache with more than a couple projects. I started using low settings like this years ago. When I first started with BOINC and had it set to run on a schedule, I often saw this type of behavior where work started too late to complete in time.

I don't expect this to cause a real problem as this machine generally runs 24/7. However, if something unexpected should happen, or I needed to reboot for some reason, that could cause wasted work or, in a worst case scenario, failure to return these Brada tasks in time. Tasks that don't checkpoint at all, don't checkpoint often enough, or have faulty checkpoints could be a major problem in this scenario. Strictly honoring FIFO would make it less likely to be a problem. If I were using an app_config to limit the number of concurrent T. Brada tasks, which I'd actually prefer to do but haven't because of silliness like this, some of these tasks probably wouldn't get done in time.

Failing "common sense" scheduling like the average user would expect, where work really would be consistently processed in the order it's received, I'd like a setting in cc_config to force jobs to run to completion once they start.


Team USA forum
Follow us on Twitter
Help us #crunchforcures!
ID: 94914 · Report as offensive
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94915 - Posted: 13 Jan 2020, 10:33:45 UTC

As far as I'm aware there is no setting available to force completion of tasks once they start. To implement such an option would mean a considerable rework of the scheduler within the client so is probably a long way down the list of priorities.
Checkpoints are under the control of the project. Some projects have very long checkpoint intervals (CPDN comes to mind), while others are fairly short; for some the user can change them, and for others there are no checkpoints available.
I'm not familiar with the projects you are running, but one thing that might be possible is to set the checkpoint interval to just slightly longer than the typical task run time. BUT there is a downside to doing this: if you suffer any sort of outage then you will lose the work done to that point, whereas if you have a short (1-5 minutes) checkpoint interval you should only lose those few minutes of work. The actual management of how and what is stored by a checkpoint is down to the individual project's applications, so there is nothing BOINC can do apart from making sure the checkpoint is triggered.
ID: 94915 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 94916 - Posted: 13 Jan 2020, 11:43:34 UTC - in response to Message 94915.  

You can also use the 'Switch between tasks every ...' control in the global computing preferences section of any project.

Notes:

  • 'Switch between tasks' is permissive, not directive - BOINC may switch tasks after this minimum interval, but doesn't have to.
  • 'Earliest deadline first' is more important than 'Switch between tasks' - tasks will still be switched if a deadline is in danger of being missed.
  • I've found that setting 'Switch between tasks' to a bit longer than the longest expected runtime for any of your project tasks can smooth things, but like Rob, I'm not familiar with those particular projects.
  • Your Resource Share of 1 is extreme in this situation. You have set a cache of 0.6 days: at resource share 1, BOINC will assume that it is going to take 60 days to clear the cache for that project. If the deadline is less than 60 days, you have created the problem you describe.
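
As a rough illustration of the arithmetic behind that last note (a back-of-the-envelope sketch, not BOINC's actual work-fetch code): if a project only gets its resource-share fraction of the machine, the time to clear its cached work scales with the inverse of that fraction. The function name and the simple proportional model are assumptions made for this example; the share values are the ones listed in the first post.

```python
def days_to_clear_cache(cache_days, project_share, all_shares):
    """Estimate wall-clock days for one project to work through its cached
    work, assuming it only receives its resource-share fraction of the host."""
    fraction = project_share / sum(all_shares)
    return cache_days / fraction


# Shares from the first post: YoYo 100, Latin Squares 1, T.Brada 1, QuChemPedIA 1.
# The NCI projects are left out because they do not compete for CPU time.
shares = [100, 1, 1, 1]

# Cache of 0.5 days plus 0.1 additional days = 0.6 days.
estimate = days_to_clear_cache(0.6, 1, shares)
print(f"{estimate:.1f} days")  # on the order of 60 days
```

Under this simplified model, a seven-day deadline is far shorter than the estimated clearing time, which would explain why the tasks end up running in high priority mode.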

ID: 94916 · Report as offensive
Nick Name

Joined: 14 Aug 19
Posts: 55
United States
Message 94927 - Posted: 13 Jan 2020, 18:17:02 UTC
Last modified: 13 Jan 2020, 18:18:32 UTC

Rob & Richard,

I'm aware of everything you have said. I'm particularly aware of the limitations of Switch Between Tasks, although usually I make sure to set that to an absurdly high value and on this host I had forgotten. YoYo has at least one app (ECM) that does not checkpoint and I'm no longer running it, since it runs ten to twelve hours and no checkpoint for such an app is ridiculous.

Years ago, after much exasperation with BOINC's scheduling, I settled on "Low cache, few projects". This generally works well for me, other than occasionally quirky behavior like I described. I don't understand why BOINC would assume it would take 60 days to clear with low cache settings. That's not intuitive and just seems bizarre. None of these projects (*edit: or at least these particular tasks) have deadlines of 60 days. Brada has the longest deadline of what's currently in the queue, seven days, which might be why BOINC is delaying those tasks, but it should realize it doesn't have 60 days to process tasks with a deadline of seven.

In this case the work did clear and didn't cause a "real" problem. One of the more annoying problems I had with this sort of scenario years ago was BOINC stopping GPU work to free up cores to run high-priority CPU work. I solved that with multiple clients. The point is that BOINC does not honor FIFO as advertised, or at least as it would be expected to. If there happened to be a "spanner in the works", that could lead to work not getting done in time, which would be entirely preventable by honoring FIFO.

Now I have ten Brada tasks, received the 11th and due on the 18th. I suppose this process will repeat in a few days.
ID: 94927 · Report as offensive
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94929 - Posted: 13 Jan 2020, 18:46:55 UTC

Checkpoint every x seconds means exactly what it says - checkpoint every x seconds. There may be other checkpoints set in the application, such as when a particular part of a calculation has been completed. These will still occur, and the x-second counter is then reset.
However, some applications do not have checkpointing programmed in at all, or have very long default periods between checkpoints.
Realistically 30 seconds is about as low as you want to go as doing a checkpoint does take a finite amount of resource, and some may say that is far too frequent....
ID: 94929 · Report as offensive
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94935 - Posted: 13 Jan 2020, 21:33:50 UTC

Let's try again - the time interval for a timed checkpoint (as opposed to an event checkpoint) is the normal interval, with a SMALL tolerance (a second or so).
Checkpoints for most projects are written to disk. Some, like CPDN, do a "trickle-up" in addition to the checkpoints; these are sent to the servers.
If an application is paused, all the timers are stopped and nothing is happening. Depending on how the application was written, it may do a checkpoint write when paused so it can resume at that instant, or it may not, in which case it will go back to the previous checkpoint when it restarts.
When a job swaps from "running" to "waiting to run" most applications do a checkpoint write, but some don't, just relying on the last checkpoint.
Likewise for suspending an application....
ID: 94935 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2518
United Kingdom
Message 94939 - Posted: 13 Jan 2020, 21:43:02 UTC

> Some, like CPDN, do a "trickle-up" in addition to the checkpoints,


The trickle up files are what the credit is based on for CPDN. These days they are concurrent with the monthly (or other interval) zips being produced which have the information for the scientists.

The system was introduced in the days when tasks could take six months or more so that if a task through no fault of the cruncher produced an invalid climate, e.g. -ve pressure after five months causing the task to crash, the cruncher would still get the credit for work done up to the last trickle up.
ID: 94939 · Report as offensive
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 94944 - Posted: 13 Jan 2020, 21:53:47 UTC - in response to Message 94914.  

> I'd like a setting in cc_config to force jobs to run to completion once they start.

You can do that by changing the "switch between tasks every" parameter. I change from the default of 60 minutes to 360 minutes so that a GPUGrid job runs to completion and never exits or suspends. The application can't handle restarting on a different device in a mixed type gpu configuration and the switch parameter is how I get around the issue.
ID: 94944 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 94945 - Posted: 13 Jan 2020, 22:14:53 UTC - in response to Message 94938.  

> That's a terrible system which means we're all wasting huge amounts of processing power if we run more than one project, as every time a processor is swapped from project A to project B, project A is likely to lose calculations. Data should ALWAYS be written to disk when an application is paused for whatever reason (computer shut down, phone unplugged from charger, exclusive application running, another project taking the processor).
I think the algorithm is

1) Wait until Task Switch Interval has expired.
2) Then start looking for a good time to switch.
3) Wait until task has just checkpointed.
4) SWITCH
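
The four steps above can be sketched roughly as follows. This is only an illustration of the description in this post, not code from the BOINC client: the function names and the minute-by-minute simulation are made up for the example, and the "earliest deadline first" override mentioned earlier in the thread is deliberately left out.

```python
def should_switch(minutes_running, switch_interval_minutes, just_checkpointed):
    """Return True when a task may be preempted under the described rule."""
    # Step 1: do nothing until the task switch interval has expired.
    if minutes_running < switch_interval_minutes:
        return False
    # Steps 2-3: after that, wait for the moment right after a checkpoint.
    # Step 4: switch.
    return just_checkpointed


def first_switch_minute(checkpoint_minutes, switch_interval_minutes, horizon=600):
    """Walk a timeline minute by minute and report the first allowed switch."""
    for minute in range(horizon + 1):
        if should_switch(minute, switch_interval_minutes, minute in checkpoint_minutes):
            return minute
    return None  # the task never checkpointed after the interval expired


# A task that checkpoints at minutes 25, 50 and 75, with a 60 minute interval,
# is not switched at minute 60 but at the next checkpoint.
print(first_switch_minute({25, 50, 75}, 60))  # 75
```

In this sketch a task that never checkpoints is never switched, which is a simplification of what the thread describes.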

You can set debug message log flags that will show you all that happening.
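
For anyone who wants to try that, here is a minimal sketch that generates a cc_config.xml with two log flags switched on. The flag names `cpu_sched_debug` and `checkpoint_debug` are my recollection of the names in the BOINC client configuration documentation, so treat them as an assumption and check them against the documentation for your client version. The helper function is hypothetical and only builds the XML text; it does not touch any BOINC files.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build the text of a cc_config.xml that enables the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is an element whose text is 1 (on) or 0 (off).
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Assumed flag names: scheduler decisions and checkpoint events.
config_xml = build_cc_config(["cpu_sched_debug", "checkpoint_debug"])
print(config_xml)
```

The output would be saved as cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files before the extra messages appear in the event log.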
ID: 94945 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 94947 - Posted: 13 Jan 2020, 22:34:13 UTC - in response to Message 94946.  

> That algorithm won't help for computer shut down, phone unplugged from charger, exclusive application running.
No, the human factor can never be predicted, but it's pretty good for the events under BOINC's control. There's some good design in there, if you can be bothered to look.
ID: 94947 · Report as offensive
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94948 - Posted: 13 Jan 2020, 22:44:15 UTC

Richard was describing the process for JOB SWAPPING - which is obviously an operation where regular checkpoints are an advantage.
The controlled shutdown process (including phones changing to battery and exclusive applications starting) includes a checkpoint write for "sensible" projects; "non-sensible" projects don't do that checkpoint.
Uncontrolled shutdowns are just that, sudden and uncontrolled (and hopefully rare) events - one has to rely on the last checkpoint to get things going again and there is no escape from that.

Keith's comment about applications trashing tasks if they start on the "wrong" processor is only partially correct. When the two GPUs are the same, some applications do the nice thing and restart without faults while others don't; most fail when the two GPUs are different models, never mind from different families.
ID: 94948 · Report as offensive
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 94950 - Posted: 13 Jan 2020, 23:15:21 UTC - in response to Message 94941.  
Last modified: 13 Jan 2020, 23:15:45 UTC

> Plus in 6 months something bad could happen to that computer, or it might not be processing Boinc any more, or it might not meet the deadline, so the project at least gets some of the data and can send the remainder of the calculations to someone else.

No, they can't be sent to someone else part way through.
Resends start from the beginning with the new person/computer.

And BOINC was written way back before people started using smart phones to run BOINC. And BOINC has never kept up. Probably impossible to do so.
ID: 94950 · Report as offensive
ProDigit

Joined: 8 Nov 19
Posts: 718
United States
Message 94952 - Posted: 13 Jan 2020, 23:47:05 UTC - in response to Message 94915.  

> As far as I'm aware there is no setting available to force completion of tasks once they start. To implement such an option would mean a considerable rework of the scheduler within the client so is probably a long way down the list of priorities.
> Checkpoints are under the control of the project. Some projects have very long checkpoint intervals (CPDN comes to mind), while others are fairly short; for some the user can change them, and for others there are no checkpoints available.
> I'm not familiar with the projects you are running, but one thing that might be possible is to set the checkpoint interval to just slightly longer than the typical task run time. BUT there is a downside to doing this: if you suffer any sort of outage then you will lose the work done to that point, whereas if you have a short (1-5 minutes) checkpoint interval you should only lose those few minutes of work. The actual management of how and what is stored by a checkpoint is down to the individual project's applications, so there is nothing BOINC can do apart from making sure the checkpoint is triggered.


You'd kind of have to look at bittorrent clients, how they set priorities on (nearly downloaded) torrents.
Their priority rating depends on personal priority settings (high, normal, low), as well as torrent availability, network speed of the torrent, and finished percentage.
Once a torrent gets past a certain point (e.g. 75%), its priority status gets increased dramatically, to the point that torrents which are 98% finished get a 98% boost in priority.

The high/normal/low settings influence mostly torrents of similar finished percentage.
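
To make the idea concrete, here is a small sketch of such a completion-weighted priority applied to tasks. It is purely hypothetical: it is not how BOINC or any particular torrent client computes priority, and the function name, the 75% threshold default and the "boost equals finished fraction" rule are assumptions taken from the description above.

```python
def boosted_priority(base_priority, fraction_done, threshold=0.75):
    """Raise a task's priority once it is past the completion threshold.

    Below the threshold the user-set base priority is used unchanged.
    Above it, the boost equals the finished fraction (98% done -> +98%).
    """
    if fraction_done < threshold:
        return base_priority
    return base_priority * (1.0 + fraction_done)


# (name, base priority, fraction done); base 2.0 stands for "high", 1.0 for "normal".
tasks = [("A", 1.0, 0.10), ("B", 1.0, 0.98), ("C", 2.0, 0.50)]

ordered = sorted(tasks, key=lambda t: boosted_priority(t[1], t[2]), reverse=True)
order = [name for name, _, _ in ordered]
print(order)  # ['C', 'B', 'A']
```

With these numbers the nearly finished normal-priority task B jumps ahead of A but still sits just below the high-priority task C, which matches the point that the high/normal/low setting mostly separates tasks of similar finished percentage.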
ID: 94952 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.