Thread 'How to finish a work unit without pausing?'

Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87634 - Posted: 14 Aug 2018, 17:30:21 UTC

The Rosetta work units are best done continuously from start to finish (24 hours in my case), but BOINC has a habit of pausing one after an hour or two in order to work on another. It leaves the first one in memory (if "leave applications in memory", LAIM, is enabled), to bad effect. And disabling LAIM might adversely affect other work units.

How can I specify for BOINC to finish a work unit without switching to another?
ID: 87634
anniet
Joined: 12 Jul 14
Posts: 656
Zambia
Message 87636 - Posted: 14 Aug 2018, 18:37:07 UTC - in response to Message 87634.  

How can I specify for BOINC to finish a work unit without switching to another?
There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to "switch between tasks every ... minutes" to something like 1600 (I think the default is 60)? Maybe?
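
For anyone who would rather set this in a file than in the Manager, here is a rough sketch in Python. It assumes the override file is `global_prefs_override.xml` in the BOINC data directory and that the tag behind "switch between tasks every ... minutes" is `cpu_scheduling_period_minutes`; both are from memory, so please check them against your own installation. The helper name `build_override` is made up for the example.

```python
import xml.etree.ElementTree as ET

def build_override(switch_minutes: float) -> str:
    """Return XML text that overrides only the task-switch interval."""
    root = ET.Element("global_preferences")
    # Assumed tag name for "switch between tasks every ... minutes".
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = f"{switch_minutes:g}"
    return ET.tostring(root, encoding="unicode")

# 1600 minutes is the value suggested above (longer than a 24-hour task).
print(build_override(1600))
```

The client would have to re-read its preferences (or be restarted) before a new value takes effect, and the setting still applies to every project, not just Rosetta.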
ID: 87636
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87637 - Posted: 14 Aug 2018, 18:47:27 UTC - in response to Message 87636.  

There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to "switch between tasks every ... minutes" to something like 1600 (I think the default is 60)? Maybe?

That might be the best way we have. I have tried something similar, but never that long. That affects all the projects (I usually run WCG also) and work units equally, and it would be nice to have something more specific. That is, something that applies only to Rosetta. Or, it could apply to any work unit, but only until it finishes, which of course depends on the length of that work unit. I am a bit surprised that no one has asked for it before. Thanks for the input.
ID: 87637
anniet
Joined: 12 Jul 14
Posts: 656
Zambia
Message 87638 - Posted: 14 Aug 2018, 20:19:17 UTC - in response to Message 87637.  

I am a bit surprised that no one has asked for it before. Thanks for the input.
There was a thread back in 2014
http://boinc.berkeley.edu/dev/forum_thread.php?id=9194 which didn't get very far.
ID: 87638
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87641 - Posted: 14 Aug 2018, 20:46:49 UTC - in response to Message 87638.  

There was a thread back in 2014
http://boinc.berkeley.edu/dev/forum_thread.php?id=9194 which didn't get very far.

The BOINC scheduler seems to be a sacred cow, maybe because nobody can figure out how it works, and so they leave it alone.
I am trying 1600 minutes to see how it goes.

Thanks.
ID: 87641
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5133
United Kingdom
Message 87643 - Posted: 14 Aug 2018, 21:10:20 UTC

Most BOINC tasks, from most projects, 'checkpoint' periodically. The default is every 60 seconds, and the restart from a checkpoint is quick and easy - so in the vast majority of cases, nobody is concerned if BOINC switches to another project once an hour.

If Rosetta is an outlier in this respect, then it needs special treatment. Delaying task switch is a good one, up to and beyond the normal run time of the task type/project in question.

Another good one is to check that 'leave tasks in memory when suspended' is active, through web preferences or BOINC Manager (depending on the preferred management style in operation). But it's possible that not all projects benefit from that setting.
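
To put rough numbers on why a task switch is usually harmless: a task that is removed from memory only has to redo the work since its last checkpoint. A minimal sketch, assuming checkpoints land at exact multiples of the interval (real applications checkpoint when it suits them, so this is only an approximation); `lost_seconds` is a made-up helper.

```python
def lost_seconds(elapsed_s: int, checkpoint_interval_s: int) -> int:
    """Run time that must be repeated after a restart from the last checkpoint."""
    return elapsed_s % checkpoint_interval_s

# A task pre-empted 3630 s into its run:
print(lost_seconds(3630, 60))    # checkpointing every 60 s
print(lost_seconds(3630, 3000))  # an outlier that checkpoints every 3000 s
```

With 60-second checkpoints the loss is at most a minute per switch; the longer the gap between checkpoints, the more each pre-emption costs if the task is not kept in memory.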
ID: 87643
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87653 - Posted: 15 Aug 2018, 1:07:23 UTC - in response to Message 87643.  

If Rosetta is an outlier in this respect, then it needs special treatment. Delaying task switch is a good one, up to and beyond the normal run time of the task type/project in question.

Another good one is to check that 'leave tasks in memory when suspended' is active, through web preferences or BOINC Manager (depending on the preferred management style in operation). But it's possible that not all projects benefit from that setting.

That is the rub. I normally enable "leave tasks in memory". But Rosetta can produce highly variable output credits when you run too many at once, and it appears to be related to memory usage. So the suggestion has been made to disable "leave tasks in memory", to free up as much memory as possible for use by the running work units. I run the 24-hour Rosetta work units, so it takes time to run through all the possible combinations of things that can go wrong (it helps to leave a few cores free in some cases, or to limit the number of Rosettas running simultaneously in other cases). It would simplify things if the work units did not suspend at all, and all the memory was available for use.
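
The "limit the number of Rosettas running simultaneously" part can be done with an app_config.xml in the project's folder. A minimal sketch, assuming the `project_max_concurrent` element is supported by your client version; the limit of 4 is an arbitrary example and `build_app_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml text capping how many tasks of one project run at once."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")

print(build_app_config(4))
```

This only caps how many run together; it does not stop BOINC from suspending one of them to switch to another task.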

It is the only project I know of that has that problem, and it is compounded by the fact that Rosetta runs for a fixed amount of time (e.g., 6, 12, 24 hours), and you only find out the credits you get at the end of the run.
ID: 87653
jglrogujgv
Joined: 6 Jul 18
Posts: 49
Barbados
Message 87671 - Posted: 16 Aug 2018, 8:11:20 UTC - in response to Message 87653.  

It is the only project I know of that has that problem

Actually LHC's VBox apps have a similar problem, not exactly the same but similar in the sense that the problem would be alleviated if BOINC had an option to run each and every task, regardless of project, from start to finish, no pre-empting by other tasks. With LHC the problem is that their VBox apps don't like to be interrupted. On suspension with LAIM off the VM snapshot file, written to the task's slot dir, can grow so large that the <rsc_disk_bound> is exceeded causing "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED".

So... a request for "no-preemption for Rosetta" is essentially a request for special treatment for a single project. I would expect the devs to frown upon that on the grounds that it's yet another branch in an already complex algorithm, and for what? Just one project. A request for a no-preemption option that would apply to all projects is a branch but it would apply to all projects which is much easier to justify from a development cost versus benefit perspective.
ID: 87671
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15573
Netherlands
Message 87673 - Posted: 16 Aug 2018, 11:39:49 UTC - in response to Message 87671.  
Last modified: 16 Aug 2018, 11:40:56 UTC

BOINC has that option: ask the project to take out the checkpointing for that application, and its tasks will run from start to finish without break.

Edit: of course, if for some unforeseen reason the run is broken, the calculations will restart from the beginning.
ID: 87673
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87674 - Posted: 16 Aug 2018, 11:43:48 UTC - in response to Message 87671.  
Last modified: 16 Aug 2018, 11:46:12 UTC

A request for a no-preemption option that would apply to all projects is a branch but it would apply to all projects which is much easier to justify from a development cost versus benefit perspective.

I like that better. In fact, my real preference is to get rid of the BOINC scheduler (almost) entirely, and just run the jobs first in/first out on a per-core basis. For example, you could devote one core to Rosetta, and another core to LHC, etc. And you could set backups, so that if one project is out of work, it would go to another. It would eliminate the problem of cores running on empty, which still happens on occasion.

There might be exceptions to allow high-priority jobs to run on a different core, but I don't think even that is necessary. Jobs don't always finish on time even with the present BOINC scheduler. I think that overall, it would improve reliability.

But if that is a bridge too far, just go with your option. It can only help insofar as I can see.
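
To make the first-in/first-out idea above concrete, here is a toy sketch in Python. It is not how the BOINC client works; the project names, task names and the `next_task` helper are all made up for illustration. Each core has an ordered list of projects (its own project first, then backups) and simply takes the oldest waiting task from the first project that has work.

```python
from collections import deque

def next_task(core_prefs, queues):
    """Pick the oldest task from the first project in core_prefs that has work."""
    for project in core_prefs:
        if queues.get(project):
            return project, queues[project].popleft()
    return None  # the core idles only if every listed project is out of work

# Rosetta has two tasks waiting; LHC is out of work.
queues = {"Rosetta": deque(["r1", "r2"]), "LHC": deque()}
# Core 0 prefers Rosetta, core 1 prefers LHC; each lists the other as backup.
core_prefs = {0: ["Rosetta", "LHC"], 1: ["LHC", "Rosetta"]}

assignments = {core: next_task(prefs, queues) for core, prefs in core_prefs.items()}
print(assignments)
```

Here core 1 falls back to Rosetta because LHC has nothing queued, which is the "backup" behaviour I have in mind. Each task would then run to completion before the core asks for the next one.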
ID: 87674
jglrogujgv
Joined: 6 Jul 18
Posts: 49
Barbados
Message 87695 - Posted: 17 Aug 2018, 23:26:39 UTC - in response to Message 87673.  

BOINC has that option: ask the project to take out the checkpointing for that application, and its tasks will run from start to finish without break.

Edit: of course, if for some unforeseen reason the run is broken, the calculations will restart from the beginning.

Create a new problem to fix a problem? I can't see many projects agreeing to that.
ID: 87695
jglrogujgv
Joined: 6 Jul 18
Posts: 49
Barbados
Message 87696 - Posted: 18 Aug 2018, 0:06:17 UTC - in response to Message 87674.  

my real preference is to get rid of the BOINC scheduler (almost) entirely, and just run the jobs first in/first out on a per-core basis.

The scheduler needs gutting. It's far too complicated. And what for.. so that users can see resource shares respected down to the hour? What does it matter if Project A gets way ahead of Project B for a day? Or even 2 days? Or even a week? If shares are respected over the longterm then all is well.

For example, you could devote one core to Rosetta, and another core to LHC, etc. And you could set backups, so that if one project is out of work, it would go to another. It would eliminate the problem of cores running on empty, which still happens on occasion.
Very complicated to program, I should think. Extremely complicated decision ladder. KISS for easier maintenance, easier documentation and less user frustration.

There might be exceptions to allow high-priority jobs to run on a different core, but I don't think even that is necessary. Jobs don't always finish on time even with the present BOINC scheduler. I think that overall, it would improve reliability.
If you want greater reliability, then cache fewer tasks.

But if that is a bridge too far, just go with your option. It can only help insofar as I can see.
I would implement it with a new element in app_config.xml:
<dont_preempt>1|0</dont_preempt>

The default would be 0. Set it to 1 for any app you want to run free of pre-emption. Should fit into existing code nicely... when scheduler is deciding which tasks to pre-empt simply don't pre-empt those set to 1 in app_config.xml.
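
A rough sketch of that filter, in Python for brevity (the client itself is not written in Python). Note that `<dont_preempt>` is only my proposal and does not exist in BOINC today; the app names, task names and both helper functions are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml using the proposed (non-existent) element.
APP_CONFIG = """
<app_config>
  <app>
    <name>rosetta</name>
    <dont_preempt>1</dont_preempt>
  </app>
  <app>
    <name>other_app</name>
  </app>
</app_config>
"""

def protected_apps(xml_text):
    """Return the names of apps flagged with <dont_preempt>1</dont_preempt>."""
    root = ET.fromstring(xml_text.strip())
    return {
        app.findtext("name")
        for app in root.findall("app")
        if app.findtext("dont_preempt", default="0").strip() == "1"
    }

def preemptable(running, protected):
    """Keep only the running tasks the scheduler would still be allowed to pre-empt."""
    return [task for task, app in running if app not in protected]

running = [("task_a", "rosetta"), ("task_b", "other_app")]
print(preemptable(running, protected_apps(APP_CONFIG)))
```

An app with no `<dont_preempt>` element falls back to the default of 0, so existing app_config.xml files would behave exactly as they do now.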

The problem with setting "switch between tasks every..." very high is that it paints all apps with the same brush and creates a programming conundrum when a task is getting close to deadline. What do you do in that situation? Override an existing rule (the switch every interval) with an "unless" clause? How much more complicated does that make the logic/flow? How many more "unless" clauses are needed just to deal with the initial "unless"? Rather keep it simple and allow simple rules with no need for over-rides.
ID: 87696
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87697 - Posted: 18 Aug 2018, 0:25:08 UTC - in response to Message 87696.  

The scheduler needs gutting. It's far too complicated. And what for.. so that users can see resource shares respected down to the hour? What does it matter if Project A gets way ahead of Project B for a day? Or even 2 days? Or even a week? If shares are respected over the longterm then all is well.

I always thought that devoting 23.7% to one project, 68.8% to another (and who knows what happens to the rest) is overdoing it a bit, and not at all necessary.

But another example just occurred to me, well-known to several people here I am sure, and that is CPDN. The work units there don't like to be pre-empted at all, and often fail when it happens too much. Just letting them run would solve half their problems.
ID: 87697
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2725
United Kingdom
Message 87699 - Posted: 18 Aug 2018, 7:54:30 UTC - in response to Message 87697.  

But another example just occurred to me, well-known to several people here I am sure, and that is CPDN. The work units there don't like to be pre-empted at all, and often fail when it happens too much. Just letting them run would solve half their problems.


Also, the scheduler doesn't play very well with the longer tasks, which can still take over a month on slower machines. (I have stopped using my ageing second-hand netbook for BOINC: tasks that take about a week or a bit more on this machine take about three months on it!)
ID: 87699
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 87726 - Posted: 20 Aug 2018, 0:52:30 UTC - in response to Message 87671.  
Last modified: 20 Aug 2018, 0:52:49 UTC

Actually LHC's VBox apps have a similar problem, not exactly the same but similar in the sense that the problem would be alleviated if BOINC had an option to run each and every task, regardless of project, from start to finish, no pre-empting by other tasks. With LHC the problem is that their VBox apps don't like to be interrupted. On suspension with LAIM off the VM snapshot file, written to the task's slot dir, can grow so large that the <rsc_disk_bound> is exceeded causing "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED".

They seem to be having that problem now on ATLAS.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4721&postid=36506#36506
ID: 87726
iranlaw
Joined: 23 Oct 18
Posts: 1
Iran
Message 88588 - Posted: 23 Oct 2018, 11:21:41 UTC - in response to Message 87726.  

There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to "switch between tasks every ... minutes" to something like 1600 (I think the default is 60)? Maybe?
ID: 88588

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.