Scheduling Issues

Message boards : BOINC client : Scheduling Issues
Augustine
Joined: 10 Mar 06
Posts: 73
Message 4690 - Posted: 10 Jun 2006, 0:31:44 UTC

I've come across a weird scheduling issue that can starve a project and potentially put a system into EDF mode too often.

I have a system attached to several projects, including CPDN, Rosetta and PrimeGrid. As we know, CPDN uses quite a bit of disk space and as it is right now, only Rosetta's and PrimeGrid's WUs are small enough to make the cut and be downloaded.

Given that there are so many projects attached, it gets only 1 s of work, as it's not necessary to queue several WUs to keep the system chewing on some project at any one time.

Assuming that CPDN is suspended and the Rosetta queue is empty, as soon as a PrimeGrid WU finishes, CPDN is resumed and its STD is reset as it's now the only project running. Typically, a new PrimeGrid WU is downloaded a few seconds later, which causes CPDN to be suspended again.

This goes on and on without CPDN having much of a chance to run. The exception is when a Rosetta WU has been downloaded: then CPDN's STD is not reset when the PrimeGrid WU finishes, and it has a chance to run if the conditions are right.

Perhaps STD should not be reset when a project becomes the sole project with an active queue, but actually be split among the projects with active queues according to their resource shares, even if the STD is negative, to "spread the pain" fairly.
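To make the "spread the pain" idea concrete, here is a rough sketch of what that split could look like (Python, purely illustrative; the function and field names are made up and this is not the actual client code):

```python
def split_std_on_queue_change(projects):
    """Hypothetical sketch: instead of resetting a sole runnable project's
    short-term debt (STD) to zero, redistribute the total STD among all
    projects that still have an active queue, weighted by resource share."""
    active = [p for p in projects if p["queued_wus"] > 0]
    if not active:
        return
    pool = sum(p["std"] for p in active)               # total debt to spread
    total_share = sum(p["resource_share"] for p in active)
    for p in active:
        # each project keeps a share-weighted slice, even if it is negative
        p["std"] = pool * p["resource_share"] / total_share
```

The point of the sketch is that the sum of the debts is preserved rather than wiped out, so a project that has been waiting does not lose its accumulated claim to CPU time.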

HTH

ID: 4690
Metod, S56RKO

Joined: 9 Sep 05
Posts: 128
Slovenia
Message 4734 - Posted: 16 Jun 2006, 9:03:09 UTC
Last modified: 16 Jun 2006, 9:05:58 UTC

Another scheduling issue, one that also happens on BOINC clients with only one project attached ... I noticed it with SAH after the recent changes (different deadlines for different WUs).

Scenario: a machine dedicated to a single project (SAH), with a fairly long queue setting (5 days). As long as all WUs had the same deadline setting (14 days), this machine was happily crunching and never entered EDF mode.

Since the start of seti_enhanced, WUs have different deadlines. Typically the machine gets a mix of short and long WUs, and obviously WUs get crunched in FIFO order - regardless of deadline. Now, if the machine has a couple of really long WUs (deadline nearly one month) and then receives a short one (deadline on the order of a couple of days), the machine enters EDF. The reason is clearly the "connect every 5 days" setting (which is longer than the short deadline anyway). But EDF can also be triggered by receiving a mid-deadline WU if there's a WU with a short deadline already in the queue.

This behaviour is not a problem if only one project is attached (other than creating an excessive number of directories under slots). But it is a bit disruptive if the computer is attached to several projects - it causes the STD to shift.

My proposal would be to slightly change the way the scheduler chooses the next WU to crunch:

  • when choosing the project to crunch for, it should behave exactly as now: consider only STD values and choose the project with the largest one
  • when choosing a new WU to crunch (within the already chosen project), choose the one with the earliest deadline


This is basically a hybrid between normal and EDF scheduling. The current EDF scheme would still kick in if needed ...
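A minimal sketch of the proposed hybrid (Python, illustrative only; the field names are assumptions, not the real client's data structures):

```python
def pick_next_wu(projects):
    """Hybrid scheme: project by largest STD, then WU by earliest deadline."""
    # Step 1: choose the project exactly as now, by largest short-term debt.
    runnable = [p for p in projects if p["wus"]]
    if not runnable:
        return None
    best = max(runnable, key=lambda p: p["std"])
    # Step 2: within that project, take the earliest deadline, not FIFO.
    return min(best["wus"], key=lambda wu: wu["deadline"])
```

Resource shares still govern which project runs (step 1 is unchanged); only the within-project pick differs from FIFO.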


Metod ...
ID: 4734
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 4735 - Posted: 16 Jun 2006, 11:04:25 UTC

The CPU scheduler has always chosen the next task within a project based on deadline, not FIFO. Since deadlines are fairly consistent in most projects this is not really noticeable though. One thing that is still a bit confusing is that it does have a tendency to choose an in-progress task rather than start a new one if there is no deadline pressure. For example, task A has a 2 week deadline and task B has a 3 week deadline, and both should be finished in 1 week. If task B is already partially processed it will continue the next time that project runs. However, if task A has a 1 week deadline it may preempt task B to run task A without even changing projects. If neither task has been started, task A will be the next task from that project in either situation.
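The behaviour described above might be sketched like this (illustrative only; `remaining_time` and the field names are assumptions, not the client's real API):

```python
def next_task_for_project(tasks, remaining_time, now=0.0):
    """Sketch: continue an already-started task unless some task is under
    deadline pressure, i.e. it must start now to finish before its deadline."""
    started = [t for t in tasks if t["fraction_done"] > 0]
    by_deadline = sorted(tasks, key=lambda t: t["deadline"])
    for t in by_deadline:
        # deadline pressure: t would miss its deadline if it waits any longer
        if now + remaining_time(t) >= t["deadline"]:
            return t
    if started:
        # no pressure: stay with a task that is already partially processed
        return min(started, key=lambda t: t["deadline"])
    return by_deadline[0]  # nothing started: earliest deadline first
```

This reproduces the example above: with 2- and 3-week deadlines and 1 week of work each, the partially processed task B continues; shrink A's deadline to 1 week and A preempts B.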

The next version of the CPU scheduler will wait for a checkpoint before switching tasks in most situations. It will also be less likely to go into a true EDF mode. Instead it will temporarily favor the specific tasks that are in deadline trouble and use LTD to prevent a project from taking over completely.
BOINC WIKI

BOINCing since 2002/12/8
ID: 4735
Metod, S56RKO

Joined: 9 Sep 05
Posts: 128
Slovenia
Message 4737 - Posted: 16 Jun 2006, 12:19:01 UTC - in response to Message 4735.  
Last modified: 16 Jun 2006, 12:20:35 UTC

The CPU scheduler has always chosen the next task within a project based on deadline, not FIFO. Since deadlines are fairly consistent in most projects this is not really noticeable though.


Read my complaint again: deadlines at SETI are not consistent anymore at all.

Anyhow, right now I've got the computer I described in my previous post in EDF again. It's a dual-P2 machine (hence WUs are done in pairs). It suspended two tasks with deadlines due 2006-07-06 and started WUs with deadlines due 2006-06-29. It has two WUs in queue with deadlines due before the ones being suspended right now. None of these 6 WUs in question has been assigned/downloaded recently (in the last 36 hours). If your statement above were true, it'd never start the WUs with deadlines due 2006-07-06 before the 4 WUs with earlier deadlines were done.


[edit] Uhmmm ... perhaps we're not talking about the same version of BOINC cc? I'm referring to 5.4.7.
Metod ...
ID: 4737
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 4744 - Posted: 17 Jun 2006, 11:05:08 UTC - in response to Message 4737.  
Last modified: 17 Jun 2006, 11:15:23 UTC

Uhmmm ... perhaps we're not talking about the same version of BOINC cc? I'm referring to 5.4.7.

This behavior should not have changed in quite some time, probably since 4.32.

I would expect what happened is that the 7/6 tasks were downloaded before any of the others currently in your queue and were started at that time. Since they were already in progress the client would continue to run them until deadline pressure forced it to something else.

To be honest I have not had more than one workunit on my hosts for some time, I am running an experiment to see what happens over the next few days. I have been watching the code changes though and should have noticed a change in this area.

[edit] Should be a decent test, I originally had a task due 6/28, first try got some due 7/2 and 7/11, second contact got some due 7/6. The only thing missing is one due like 6/20.
BOINC WIKI

BOINCing since 2002/12/8
ID: 4744
Norbert Hoffmann

Joined: 19 Dec 05
Posts: 28
Germany
Message 4746 - Posted: 17 Jun 2006, 15:01:05 UTC - in response to Message 4734.  

My proposal would be to slightly change the way the scheduler chooses the next WU to crunch:

  • when choosing the project to crunch for, it should behave exactly as now: consider only STD values and choose the project with the largest one
  • when choosing a new WU to crunch (within the already chosen project), choose the one with the earliest deadline


This has been added to the Client scheduling policies lately ("CPU scheduling policy", 6.). The client will follow this document soon.

Norbert
ID: 4746
Augustine
Joined: 10 Mar 06
Posts: 73
Message 4747 - Posted: 17 Jun 2006, 15:42:46 UTC

The problem that I described is caused by P.debt being normalized so that the mean or the minimum is zero. If the mean or minimum of zero were not enforced when there's only one runnable project plus at least one other potentially runnable project, then the scenario I described wouldn't happen. To make things easier, the enforcement should only be done when there are only runnable projects and no potentially runnable projects.
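The normalisation being described can be shown in a few lines (a simplified sketch, not the client's actual code). With the minimum forced to zero, a sole runnable project's accumulated negative STD is wiped out in one step, which is exactly the reset in the CPDN scenario above:

```python
def normalize_std_min_zero(stds):
    """Shift all short-term debts so the minimum becomes zero (sketch)."""
    m = min(stds.values())
    return {name: d - m for name, d in stds.items()}
```

With CPDN as the only runnable project, `normalize_std_min_zero({"CPDN": -500.0})` yields `{"CPDN": 0.0}`: the debt it built up while suspended vanishes, so the moment another project's WU arrives, CPDN has no stored claim to CPU time and is suspended again.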

This oscillation of running a project for a few seconds after another finishes, just to bounce back to the same project again, skews fair scheduling and can easily throw a host into EDF mode unnecessarily.

HTH

ID: 4747
Metod, S56RKO

Joined: 9 Sep 05
Posts: 128
Slovenia
Message 4751 - Posted: 17 Jun 2006, 18:26:08 UTC - in response to Message 4746.  
Last modified: 17 Jun 2006, 18:28:09 UTC

My proposal would be to slightly change the way the scheduler chooses the next WU to crunch:

  • when choosing the project to crunch for, it should behave exactly as now: consider only STD values and choose the project with the largest one
  • when choosing a new WU to crunch (within the already chosen project), choose the one with the earliest deadline


This has been added to the Client scheduling policies lately ("CPU scheduling policy", 6.). The client will follow this document soon.


If I read the above quoted document correctly, it does address the issue of too-frequent re-scheduling. But it does not address the issue of non-consistent deadlines within a project. My proposal would slightly decrease the possibility of getting into EDF mode when the queue length (connect every setting) is not just a small fraction of the deadlines. At least I can't find anything like that in the document.


[edit] Actually, I can't find anything in that document about choosing the next WU to run from the queue of WUs for the already selected project. This is actually the gist of my proposal: define a slightly more intelligent way of choosing the next WU to crunch, not just follow FIFO.
Metod ...
ID: 4751
Norbert Hoffmann

Joined: 19 Dec 05
Posts: 28
Germany
Message 4759 - Posted: 17 Jun 2006, 21:40:39 UTC - in response to Message 4751.  
Last modified: 17 Jun 2006, 21:40:56 UTC

Actually, I can't find anything in that document about choosing the next WU to run from the queue of WUs for the already selected project. This is actually the gist of my proposal: define a slightly more intelligent way of choosing the next WU to crunch, not just follow FIFO.

I think "Find the project P with the greatest anticipated debt, select one of P's runnable results (picking one that is already running, if possible, else the result with earliest deadline) and schedule that result" sounds ok. (Why did I believe that it was 6. when it actually is 7.?)

Norbert

ID: 4759
Metod, S56RKO

Joined: 9 Sep 05
Posts: 128
Slovenia
Message 4763 - Posted: 18 Jun 2006, 8:17:09 UTC - in response to Message 4759.  
Last modified: 18 Jun 2006, 8:17:29 UTC

I think "Find the project P with the greatest anticipated debt, select one of P's runnable results (picking one that is already running, if possible, else the result with earliest deadline) and schedule that result" sounds ok. (Why did I believe that it was 6. when it actually is 7.?)


Sounds OK indeed. Obviously I wasn't careful enough while reading.

Metod ...
ID: 4763
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 4806 - Posted: 21 Jun 2006, 11:06:21 UTC - in response to Message 4744.  

To be honest I have not had more than one workunit on my hosts for some time, I am running an experiment to see what happens over the next few days. I have been watching the code changes though and should have noticed a change in this area.

[edit] Should be a decent test, I originally had a task due 6/28, first try got some due 7/2 and 7/11, second contact got some due 7/6. The only thing missing is one due like 6/20.


Ok there seems to be a bug here. My test host is now processing one of the tasks due on 7/11 instead of one due on 7/2. I have reported this on the dev and alpha mailing lists. I did find a place in the code that is supposed to choose the task with the earliest deadline within the project but it is not being used properly.

Dangit now I'm going to have to run a larger queue to track this thing, that is really annoying on a 10 project box.
BOINC WIKI

BOINCing since 2002/12/8
ID: 4806
Norbert Hoffmann

Joined: 19 Dec 05
Posts: 28
Germany
Message 4807 - Posted: 21 Jun 2006, 11:17:15 UTC - in response to Message 4806.  

Dangit now I'm going to have to run a larger queue to track this thing, that is really annoying on a 10 project box.

You'll love the serverside bug (when it comes to sending, the server will forget the other 9 projects).

Norbert
ID: 4807
W-K ID 666

Joined: 30 Dec 05
Posts: 459
United Kingdom
Message 4890 - Posted: 2 Jul 2006, 8:24:58 UTC
Last modified: 2 Jul 2006, 8:26:29 UTC

We see a fair amount of comments on the msg boards from people wanting to report immediately, and the stock answer is that reporting and requests are best done in batches to ease server load. But my computer, on the default 'connect to network' setting, keeps reporting units and then shortly afterwards makes a separate request for work. Below is a cutting from one project of my results page:
	Sent			Time reported or deadline
1	[color=00FF00]2 Jul 2006 4:20:17 UTC[/color]		25 Jul 2006 9:56:38 UTC
2	1 Jul 2006 23:21:36 UTC		24 Jul 2006 10:54:35 UTC
3	[color=00FF66]1 Jul 2006 18:22:25 UTC[/color]		[color=00FF00]2 Jul 2006 4:15:32 UTC[/color]
4	1 Jul 2006 12:08:26 UTC		1 Jul 2006 22:07:04 UTC
5	[b]1 Jul 2006 7:18:43 UTC[/b]		[color=00FF66]1 Jul 2006 18:17:15 UTC[/color]
6	1 Jul 2006 5:19:30 UTC		1 Jul 2006 9:57:51 UTC
7	1 Jul 2006 4:12:05 UTC		1 Jul 2006 9:57:51 UTC
8	[b]1 Jul 2006 0:33:01 UTC[/b]		[b]1 Jul 2006 7:18:43 UTC[/b]
9	30 Jun 2006 22:31:01 UTC		1 Jul 2006 2:57:13 UTC
10	[color=00FF00]30 Jun 2006 16:18:30 UTC[/color]		[b]1 Jul 2006 0:33:01 UTC[/b]
11	[color=FF00FF]30 Jun 2006 11:16:25 UTC[/color]		30 Jun 2006 21:11:23 UTC
12	[color=666666]30 Jun 2006 4:54:30 UTC[/color]		[color=00FF00]30 Jun 2006 16:10:10 UTC[/color]
13	30 Jun 2006 0:06:04 UTC		[color=FF00FF]30 Jun 2006 11:10:14 UTC[/color]
14	[color=0000FF]29 Jun 2006 19:47:31 UTC[/color]		[color=666666]30 Jun 2006 4:51:56 UTC[/color]
15	[color=990099]29 Jun 2006 14:41:24 UTC[/color]		29 Jun 2006 22:55:03 UTC
16	[color=FF0000]29 Jun 2006 9:48:25 UTC[/color]		[color=0000FF]29 Jun 2006 19:34:05 UTC[/color]
17	29 Jun 2006 4:57:58 UTC		[color=990099]29 Jun 2006 14:40:39 UTC[/color]
18	27 Jun 2006 11:34:36 UTC		[color=FF0000]29 Jun 2006 9:47:36 UTC[/color]
19	27 Jun 2006 6:35:28 UTC		29 Jun 2006 1:16:44 UTC
20	27 Jun 2006 0:01:53 UTC		27 Jun 2006 10:01:37 UTC


If you follow the colours you can see what I mean, especially entries Sent/16 and Reported/18, where the time separation is only 49 seconds.

Would it not be better to delay reporting until the next request for more work, unless the report would be after deadline etc, which I think is already taken care of.

Andy
ID: 4890
Metod, S56RKO

Joined: 9 Sep 05
Posts: 128
Slovenia
Message 4932 - Posted: 7 Jul 2006, 17:24:51 UTC - in response to Message 4890.  
Last modified: 7 Jul 2006, 17:27:35 UTC

If you follow the colours you can see what I mean, especially entries Sent/16 and Received/18, where the time separation is only 49 seconds.

Would it not be better to delay reporting until the next request for more work, unless the report would be after deadline etc, which I think is already taken care of.


My guess is that this asynchronous reporting can happen if you have your connect every set to a fairly long period of time. The definition of long depends on the deadlines, of course.

Why do I think that?

BOINC CC tries to keep the buffer filled most of the time. So whenever the amount of work in the queue drops below the threshold, it connects to the project server to get more work. Which is fine.
When a result gets done, it is queued for reporting at a later time - preferably when the queue needs refilling. The problem arises, however, if the queue length is long and the deadline for a just-finished WU is less than 2*connect_every away (I'm not sure about the factor of 2, but it is proportional to connect every). If the deadline is due soon, BOINC CC will connect to the project server regardless of the need for a refill.
[edit] It needs to connect to report the pending WU because another connection is not guaranteed to happen before the deadline - this is what connect every is all about: a guaranteed maximum time between two consecutive connections. [/edit]

A short connect every seems to be fine, as a just-finished WU has plenty of time before its deadline. On the majority of my 'puters, where I have connect every set to 0.1 days, I mostly see a couple of WUs waiting to report - and they get reported when the buffer needs a refill.
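The rule being described could be sketched like this (the factor of 2 is kept as the guess above; the function name, fields, and structure are illustrative, not the real client code):

```python
def should_contact_server(now, queued_work, threshold, completed, connect_every):
    """Sketch: decide whether the client needs to contact the project server."""
    # Refill: queued work (in seconds) has dropped below the threshold.
    if queued_work < threshold:
        return True
    # Deadline: a finished result must be reported now, because the next
    # connection is only guaranteed within ~2 * connect_every from now.
    for result in completed:
        if result["deadline"] <= now + 2 * connect_every:
            return True
    return False
```

With "connect every 5 days" and a result due in 7 days, the deadline branch fires even though the buffer is full, which would produce exactly the report-then-fetch pairs shown in the results table above.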

Metod ...
ID: 4932
W-K ID 666

Joined: 30 Dec 05
Posts: 459
United Kingdom
Message 4958 - Posted: 10 Jul 2006, 17:24:59 UTC - in response to Message 4932.  

If you follow the colours you can see what I mean, especially entries Sent/16 and Received/18, where the time separation is only 49 seconds.

Would it not be better to delay reporting until the next request for more work, unless the report would be after deadline etc, which I think is already taken care of.


My guess is that this asynchronous reporting can happen if you have your connect every set to a fairly long period of time. The definition of long depends on the deadlines, of course.

Why do I think that?

BOINC CC tries to keep the buffer filled most of the time. So whenever the amount of work in the queue drops below the threshold, it connects to the project server to get more work. Which is fine.
When a result gets done, it is queued for reporting at a later time - preferably when the queue needs refilling. The problem arises, however, if the queue length is long and the deadline for a just-finished WU is less than 2*connect_every away (I'm not sure about the factor of 2, but it is proportional to connect every). If the deadline is due soon, BOINC CC will connect to the project server regardless of the need for a refill.
[edit] It needs to connect to report the pending WU because another connection is not guaranteed to happen before the deadline - this is what connect every is all about: a guaranteed maximum time between two consecutive connections. [/edit]

A short connect every seems to be fine, as a just-finished WU has plenty of time before its deadline. On the majority of my 'puters, where I have connect every set to 0.1 days, I mostly see a couple of WUs waiting to report - and they get reported when the buffer needs a refill.

My connect to network is at the default 0.1 days for this computer.
At most I usually have one ready to report, one crunching/preempted, and one waiting; usually it is only two units.

Andy
ID: 4958


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.