Message boards : Questions and problems : How to finish a work unit without pausing?
Author | Message |
---|---|
Joined: 8 Nov 10 Posts: 310 |
The Rosetta work units are best done continuously from start to finish (24 hours in my case), but BOINC has the habit of pausing one after an hour or two in order to work on another one. It leaves the first one in memory (if LAIM, "leave applications in memory", is enabled) to bad effect. And disabling LAIM might adversely affect other work units. How can I specify for BOINC to finish a work unit without switching to another? |
Joined: 12 Jul 14 Posts: 656 |
"How can I specify for BOINC to finish a work unit without switching to another?" There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to "switch between tasks every ... minutes" to something like 1600 (I think the default is 60)? Maybe? |
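For anyone who prefers editing the file directly rather than using BOINC Manager, the same setting can be sketched as a `global_prefs_override.xml` in the BOINC data directory. This is a minimal sketch: 1600 is just the value suggested in the post above, and it assumes there is no existing override file (an existing one should be edited, not replaced).

```xml
<!-- global_prefs_override.xml: local preferences that override the web preferences -->
<global_preferences>
    <!-- "Switch between tasks every ... minutes"; 1600 is the value suggested in the post -->
    <cpu_scheduling_period_minutes>1600</cpu_scheduling_period_minutes>
</global_preferences>
```

The client should pick this up after `boinccmd --read_global_prefs_override` or a client restart. The setting applies to every project on the host, not only Rosetta.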
Joined: 8 Nov 10 Posts: 310 |
"There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to 'switch between tasks every ... minutes' to something like 1600 (I think the default is 60)? Maybe?" That might be the best way we have. I have tried something similar, but never that long. That affects all the projects (I usually run WCG also) and work units equally, and it would be nice to have something more specific. That is, something that applies only to Rosetta. Or, it could apply to any work unit, but only until it finishes, which of course depends on the length of that work unit. I am a bit surprised that no one has asked for it before. Thanks for the input. |
Joined: 12 Jul 14 Posts: 656 |
"I am a bit surprised that no one has asked for it before. Thanks for the input." There was a thread back in 2014, http://boinc.berkeley.edu/dev/forum_thread.php?id=9194, which didn't get very far. |
Joined: 8 Nov 10 Posts: 310 |
"There was a thread back in 2014" The BOINC scheduler seems to be a sacred cow, maybe because nobody can figure out how it works, and so they leave it alone. I am trying 1600 minutes to see how it goes. Thanks. |
Joined: 5 Oct 06 Posts: 5112 |
Most BOINC tasks, from most projects, 'checkpoint' periodically. The default is every 60 seconds, and the restart from a checkpoint is quick and easy - so in the vast majority of cases, nobody is concerned if BOINC switches to another project once an hour. If Rosetta is an outlier in this respect, then it needs special treatment. Delaying task switch is a good one, up to and beyond the normal run time of the task type/project in question. Another good one is to check that 'leave tasks in memory when suspended' is active, through web preferences or BOINC Manager (depending on the preferred management style in operation). But it's possible that not all projects benefit from that setting. |
Joined: 8 Nov 10 Posts: 310 |
"If Rosetta is an outlier in this respect, then it needs special treatment. Delaying task switch is a good one, up to and beyond the normal run time of the task type/project in question." That is the rub. I normally enable "leave tasks in memory". But Rosetta can produce highly variable output credits when you run too many at once, and it appears to be related to memory usage. So the suggestion has been made to disable "leave tasks in memory", to free up as much memory as possible for use by the running work units. I run the 24-hour Rosetta work units, so it takes time to run through all the possible combinations of things that can go wrong (it helps to leave a few cores free in some cases, or to limit the number of Rosettas running simultaneously in other cases). It would simplify things if the work units did not suspend at all, and all the memory was available for use. It is the only project I know of that has that problem, and it is compounded by the fact that Rosetta runs for a fixed amount of time (e.g., 6, 12, 24 hours), and you only find out the credits you get at the end of the run. |
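One way to limit the number of Rosetta tasks running simultaneously, as mentioned above, is an `app_config.xml` in the Rosetta project folder under the BOINC data directory. A minimal sketch follows; the limit of 4 is an arbitrary example value, not a recommendation from this thread.

```xml
<!-- app_config.xml, placed in the Rosetta project folder -->
<app_config>
    <!-- Run at most 4 tasks from this project at the same time (4 is an example value) -->
    <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

The client re-reads the file via the "Read config files" menu item in BOINC Manager or after a restart. This does not stop pre-emption; it only bounds how many Rosetta tasks compete for memory at once.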
Joined: 6 Jul 18 Posts: 49 |
"It is the only project I know of that has that problem" Actually, LHC's VBox apps have a similar problem: not exactly the same, but similar in the sense that it would be alleviated if BOINC had an option to run each and every task, regardless of project, from start to finish, with no pre-empting by other tasks. With LHC the problem is that their VBox apps don't like to be interrupted. On suspension with LAIM off, the VM snapshot file, written to the task's slot dir, can grow so large that the <rsc_disk_bound> is exceeded, causing "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED". So a request for "no-preemption for Rosetta" is essentially a request for special treatment for a single project. I would expect devs to frown upon that on the grounds that it's yet another branch in an already complex algorithm, and for what? Just one project? A request for a no-preemption option that would apply to all projects is also a branch, but it would apply to all projects, which is much easier to justify from a development cost versus benefit perspective. |
Joined: 29 Aug 05 Posts: 15533 |
BOINC has that option: ask the project to take out the checkpointing for that application, and its tasks will run from start to finish without break. Edit: of course, if for some unforeseen reason the run is broken, the calculations will restart from the beginning. |
Joined: 8 Nov 10 Posts: 310 |
"A request for a no-preemption option that would apply to all projects is a branch but it would apply to all projects which is much easier to justify from a development cost versus benefit perspective." I like that better. In fact, my real preference is to get rid of the BOINC scheduler (almost) entirely, and just run the jobs first in/first out on a per-core basis. For example, you could devote one core to Rosetta, and another core to LHC, etc. And you could set backups, so that if one project is out of work, it would go to another. It would eliminate the problem of cores running on empty, which still happens on occasion. There might be exceptions to allow high-priority jobs to run on a different core, but I don't think even that is necessary. Jobs don't always finish on time even with the present BOINC scheduler. I think that overall, it would improve reliability. But if that is a bridge too far, just go with your option. It can only help insofar as I can see. |
Joined: 6 Jul 18 Posts: 49 |
"BOINC has that option: ask the project to take out the checkpointing for that application, and its tasks will run from start to finish without break." Create a new problem to fix a problem? I can't see many projects agreeing to that. |
Joined: 6 Jul 18 Posts: 49 |
"my real preference is to get rid of the BOINC scheduler (almost) entirely, and just run the jobs first in/first out on a per-core basis." The scheduler needs gutting. It's far too complicated. And what for? So that users can see resource shares respected down to the hour? What does it matter if Project A gets way ahead of Project B for a day? Or even 2 days? Or even a week? If shares are respected over the long term then all is well. "For example, you could devote one core to Rosetta, and another core to LHC, etc. And you could set backups, so that if one project is out of work, it would go to another. It would eliminate the problem of cores running on empty, which still happens on occasion." Very complicated to program, I should think. Extremely complicated decision ladder. KISS for easier maintenance, easier documentation and less user frustration. "There might be exceptions to allow high-priority jobs to run on a different core, but I don't think even that is necessary. Jobs don't always finish on time even with the present BOINC scheduler. I think that overall, it would improve reliability." If you want greater reliability, then cache fewer tasks. "But if that is a bridge too far, just go with your option. It can only help insofar as I can see." I would implement it with a new element in app_config.xml: <dont_preempt>1|0</dont_preempt> The default would be 0. Set it to 1 for any app you want to run free of pre-emption. It should fit into the existing code nicely: when the scheduler is deciding which tasks to pre-empt, it simply doesn't pre-empt those set to 1 in app_config.xml. The problem with setting "switch between tasks every..." very high is that it paints all apps with the same brush and creates a programming conundrum when a task is getting close to deadline. What do you do in that situation? Override an existing rule (the switch-every interval) with an "unless" clause? How much more complicated does that make the logic/flow?
How many more "unless" clauses are needed just to deal with the initial "unless"? I'd rather keep it simple and allow simple rules with no need for overrides. |
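To make the proposal concrete, this is roughly how the suggested element might look in an `app_config.xml`. Note that `<dont_preempt>` is the hypothetical element proposed in this post, not an existing BOINC option, so a current client would not act on it. The app name `rosetta` is also an assumption for illustration.

```xml
<app_config>
    <app>
        <!-- App name is assumed; use the name the project actually reports -->
        <name>rosetta</name>
        <!-- HYPOTHETICAL: proposed element, not implemented in BOINC.
             1 = never pre-empt tasks of this app, 0 = default behaviour -->
        <dont_preempt>1</dont_preempt>
    </app>
</app_config>
```

Scoping the flag per app is what would let it cover Rosetta, LHC VBox tasks, or any other app individually without changing the switch interval for everything else.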
Joined: 8 Nov 10 Posts: 310 |
"The scheduler needs gutting. It's far too complicated. And what for.. so that users can see resource shares respected down to the hour? What does it matter if Project A gets way ahead of Project B for a day? Or even 2 days? Or even a week? If shares are respected over the long term then all is well." I always thought that devoting 23.7% to one project, 68.8% to another (and who knows what happens to the rest) is overdoing it a bit, and not at all necessary. But another example just occurred to me, well-known to several people here I am sure, and that is CPDN. The work units there don't like to be pre-empted at all, and often fail when it happens too much. Just letting them run would solve half their problems. |
Joined: 28 Jun 10 Posts: 2614 |
"But another example just occurred to me, well-known to several people here I am sure, and that is CPDN. The work units there don't like to be pre-empted at all, and often fail when it happens too much. Just letting them run would solve half their problems." Also, the scheduler doesn't play very well with the longer tasks, which can still take over a month on slower machines. (I have stopped using my ageing second-hand netbook for BOINC, as on that the tasks which take about a week or a bit more on this machine take about three months!) |
Joined: 8 Nov 10 Posts: 310 |
"Actually LHC's VBox apps have a similar problem, not exactly the same but similar in the sense that the problem would be alleviated if BOINC had an option to run each and every task, regardless of project, from start to finish, no pre-empting by other tasks. With LHC the problem is that their VBox apps don't like to be interrupted. On suspension with LAIM off the VM snapshot file, written to the task's slot dir, can grow so large that the <rsc_disk_bound> is exceeded causing 'Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED'." They seem to be having that problem now on ATLAS. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4721&postid=36506#36506 |
Joined: 23 Oct 18 Posts: 1 |
There's probably a better way than what I'm about to suggest, but on local preferences on the computer tab, you could try setting the time next to "switch between tasks every ... minutes" to something like 1600 (I think the default is 60)? Maybe? |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.