Message boards : BOINC client : checkpoints instead of minutes
Author | Message |
---|---|
Joined: 30 Aug 05 Posts: 25 |
Would it be possible to set boinc to run a fixed number of checkpoints instead of a fixed number of minutes? Even if this saves only 24 minutes of productive crunching per 24/7 machine it would significantly add to the overall power of the system. The new scheduler with long term debt would keep resource share in line. |
Joined: 30 Aug 05 Posts: 297 |
"Would it be possible to set boinc to run a fixed number of checkpoints instead of a fixed number of minutes?" Not sure exactly what you are asking. Projects are switched at the interval you set in the prefs. Status is written at 'checkpoint' times set by the projects, but no more often than you set in the prefs. When I first looked at the checkpoint time issue, I wanted to minimize disk writes on my laptop and set that time quite high, only to find that I was losing hours of work every time I exited BOINC for any reason. The time saved by not writing to disk is milliseconds, not minutes. The time required to switch applications, if "keep in memory" is allowed, is also milliseconds. If the application has to be reloaded, then it could be seconds over the course of a day, but still not minutes. If you want one WU to finish completely before the next starts, as happens in EDF or "panic mode", that is generally a bad idea. Too likely for later WUs to be returned past deadline if your cache is set to more than a day. Once the "estimated" times are close to the "actual" times, for all projects (V5.x required for everyone) that problem might not exist, and the developers might be able to be talked into allowing EDF as a "user option". It'd be a low priority item even then, because it's hard to see any real reason for it. |
Joined: 29 Aug 05 Posts: 304 |
I would also like to see this. JM7 (and/or Chris) and I talked about this at some point in the scheduler discussions. The idea is the client should only change projects immediately after a checkpoint is written. In this configuration the host would only lose a fraction of a second of CPU time per project switch instead of the current amount, which varies per project and with user settings. Best case currently is SETI, which will lose an average of 30 seconds per switch. Worst is lattice or BURP, where the entire workunit has to be started over. The main problem that came up was getting this to work independently on multiple-CPU systems; it always caused all CPUs to change even though only one was ready. Another problem was projects like lattice or BURP that cannot or do not checkpoint at all. BOINC WIKI BOINCing since 2002/12/8 |
Joined: 29 Aug 05 Posts: 15575 |
"Worst is lattice or BURP where the entire workunit has to be started over." Why don't you leave them in memory then, when preempted? |
Joined: 30 Aug 05 Posts: 297 |
I didn't realize you were talking about lost progress because of switching between checkpoints, I thought you meant the lost CPU time during the switch... The real problem is that it seems none of the science apps checkpoint when they should. They seem to be on the "checkpoint every x minutes as set" and "checkpoint when something significant changes" approach, which leaves out the CRITICAL one to me - "checkpoint when told to quit"! If every app checkpointed when switched out, no matter what the reason was for switching out - changing to another application, benchmarks started, BOINC Manager quit, PC shutting down... - then we would quit having these "lost crunching time" problems. This is the checkpoint that SHOULD override the prefs setting and happen every single time. I would bet that this one change would also fix many of the "one app doesn't quit so now there are two running" problems, and the "WU past 100% and still running" problems, etc. |
Joined: 30 Aug 05 Posts: 25 |
"Worst is lattice or BURP where the entire workunit has to be started over." For some of us who like to run all the projects this is going to be a problem eventually. And if the computer is turned off this does not matter. |
Joined: 29 Aug 05 Posts: 304 |
Good in theory; however, it just doesn't work in practice. The reason CPDN doesn't checkpoint more often is that it takes longer to checkpoint than to process a timestep. In that case, even if it tried, it would not finish checkpointing before Windows killed the program as unresponsive. When this happens it may produce a corrupt file. As I understand it, BURP is in the same situation: it takes longer to create/restore a checkpoint than it does to reprocess the workunit. BOINC WIKI BOINCing since 2002/12/8 |
Joined: 1 Oct 05 Posts: 2 |
depending on the wu length, i could even imagine an option to finish one wu completely before switching project (although for CPDN a trickle would be more appropriate). LTD should take care of resource shares. leaving app in memory is not a solution for machines with many projects/many processors. e.g. my xeon with 4 procs would already with 2-3 projects "waste" a big amount of memory resources. |
Joined: 7 Sep 05 Posts: 130 |
Does anyone happen to know what the actual time intervals are for those projects that do have checkpoints? Seti was mentioned as 30 secs, but what about others like EAH and LHC, for example? If a user sets "Write to disk ..." as 60 secs, does that mean Seti will skip every alternate checkpoint? I know that is what you would assume, but does it actually work that way? In thinking about this, as a policy for machines running multiple projects 24/7, where uptimes were normally in the range of weeks to months and where each machine had plenty of RAM (physical or virtual), I made the decision that the most efficient way to operate would be to set the preference to something like 120 secs (to prevent any "chatty" app from writing too frequently) and to keep apps in memory when preempted (to remove any possible checkpoint loss). Then once in a blue moon, if a machine went down and was rebooted, the average loss when restarting from a checkpoint would be only 60 secs. Any comments on possible flaws in this sort of policy? Cheers, Gary. |
Joined: 29 Aug 05 Posts: 304 |
CPDN is every 144 timesteps. Sorry, I don't know the others off the top of my head. The preference sets a flag allowing a checkpoint after that number of seconds has passed since the last checkpoint. The next time the app is ready to checkpoint after that flag is set, the checkpoint will be written. For example, if the app checkpoints every 4 minutes and write to disk is set for 5 minutes, then you will actually get checkpoints about every 9 minutes. BOINC WIKI BOINCing since 2002/12/8 |
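
The flag rule described in the last post can be tried out with a small simulation. This is only an illustrative sketch, not BOINC client code: the function name `checkpoint_times` and its parameters are hypothetical, and it assumes the app reaches a possible checkpoint on a fixed, regular grid and that a write is allowed once the preference interval has fully elapsed since the previous write.

```python
def checkpoint_times(app_interval_s, min_disk_interval_s, run_time_s):
    """Return the times (seconds) at which checkpoints are actually written.

    Assumptions (not taken from the BOINC source): the app could checkpoint
    every app_interval_s seconds, and a write is allowed only once
    min_disk_interval_s seconds have passed since the previous write.
    """
    written = []
    last_write = 0
    t = app_interval_s
    while t <= run_time_s:
        # The "flag" is set once the preference interval has elapsed.
        if t - last_write >= min_disk_interval_s:
            written.append(t)
            last_write = t
        t += app_interval_s
    return written


# App ready every 4 minutes, "write to disk" preference of 5 minutes, 1 hour run.
times = checkpoint_times(4 * 60, 5 * 60, 60 * 60)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(times)
print(gaps)
```

Under these assumptions the 4-minute/5-minute case settles at one write every 8 minutes. In this model the gap is always less than the sum of the two intervals (9 minutes here), which is in line with the "about every 9 minutes" estimate. With a 30-second app interval and a 60-second preference, the same sketch writes every second opportunity.
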
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.