Message boards : BOINC Manager : Exiting with Data Left in Memory
Author | Message |
---|---|
Joined: 20 Dec 05 Posts: 10 |
When BOINC is exited normally, i.e. File > Exit, is the state of any suspended project work that was left in memory lost? How about running projects? Are they checkpointed, or is any crunching since the last project-initiated checkpoint wasted? Regards, Mike |
Joined: 4 Dec 05 Posts: 35 |
And as a small adjunct to this: when a machine goes into standby or hibernate, are the tasks checkpointed, just in case they don't restart for whatever reason? The wiki answer seems to indicate that only the data written at the last checkpoint is saved: "If the BOINC Client Software is halted for what ever reason, processing starts with the most recent Checkpoint." So I'm guessing that the client doesn't do a full tidy-up on File > Exit (or, for safety, when Windows alerts it to a standby or hibernate condition), but I may just be missing something about those conditions, as they're more managed than a forced abort. |
Joined: 29 Aug 05 Posts: 225 |
Nope, you got it in one. If the application checkpoints on a 1 minute basis you lose, on average, 30 seconds of work. Applications with longer intervals lose proportionally more, and those with no checkpointing lose all the work done up to that point. Most projects have reasonable checkpoints, but there are gotchas, like the one where I got a "hung" Rosetta@Home work unit that "ate" 25 hours of computer time and got nowhere ... being worked on by the project ... Or the Predictor@Home work units that threw up a FORTRAN error dialog and halted the CPU until OK was pressed ... as far as I know they are not seriously looking into this problem. The BOINC Client cannot do what the Science Application does not allow it to do. So, the responsibility is back on the projects to ensure "safe" computing. Most do a pretty good job, but some checkpoints are so large that they are not practical to do very often, like CPDN's. I forget what their interval is ... (15 min?), but this is one of the reasons I run 24/7 :) |
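The averages quoted above can be sketched as a small calculation. This is only a rough illustration, assuming an interruption is equally likely at any moment within a checkpoint interval; the function name is made up for the example, and the 15 minute figure is the uncertain CPDN guess from the post, not a confirmed value.

```python
def expected_loss_seconds(checkpoint_interval_s: float) -> float:
    """Average work lost per interruption, in seconds.

    Assumes the interruption time is uniformly distributed within
    one checkpoint interval, so the mean loss is half the interval.
    """
    if checkpoint_interval_s < 0:
        raise ValueError("checkpoint interval must be non-negative")
    return checkpoint_interval_s / 2.0


# Intervals mentioned in the thread (the 15 minute one is a guess).
for label, interval_s in [("1 minute", 60), ("15 minutes", 15 * 60)]:
    loss = expected_loss_seconds(interval_s)
    print(f"checkpoint every {label}: about {loss:.0f} s lost on average")
```

An application with no checkpointing at all does not fit this model: there the loss is everything computed since the task started.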
Joined: 4 Dec 05 Posts: 35 |
"The BOINC Client cannot do what the Science Application does not allow it to do. So, the responsibility is back on the projects to ensure "safe" computing. Most do a pretty good job, but, some checkpoints are so large that they are not practical to do very often, like CPDN's, I forget what their interval is ... (15 min?), but, this is one of the reasons I run 24/7 :)" I guess 15 mins is okay, as the loss is about 7.5 mins on average ... but it's frustrating that there may be science projects that don't appreciate the wasted work! I imagine there is a balance in terms of efficiency between checkpointing too often and not often enough: the latter wastes time and effort after an interruption, while the former probably hurts throughput on a stable machine. Maybe a "smart" checkpoint system that adapts to how a machine is being used ... |
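The balance described above can be put into a toy model. This is a sketch under stated assumptions, not anything BOINC or any project is known to do: it assumes each checkpoint costs a fixed number of seconds to write, that interruptions arrive at a steady average rate, and that each one loses half an interval of work. Under those assumptions the waste is minimised near the square root of twice the checkpoint cost times the mean time between interruptions (a standard first-order result usually attributed to Young). All names and numbers below are hypothetical.

```python
import math


def wasted_fraction(interval_s: float,
                    checkpoint_cost_s: float,
                    mean_time_between_stops_s: float) -> float:
    """Approximate fraction of run time wasted for a given interval.

    overhead: time spent writing checkpoints, per second of work.
    rework:   half an interval lost per interruption, spread over
              the mean time between interruptions.
    """
    overhead = checkpoint_cost_s / interval_s
    rework = (interval_s / 2.0) / mean_time_between_stops_s
    return overhead + rework


def best_interval_s(checkpoint_cost_s: float,
                    mean_time_between_stops_s: float) -> float:
    """Interval that minimises wasted_fraction in this toy model."""
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_stops_s)


# Hypothetical machine: 2 s per checkpoint, stopped about once an hour.
cost_s, mtbs_s = 2.0, 3600.0
best = best_interval_s(cost_s, mtbs_s)
for interval in (best / 2, best, best * 2):
    waste = wasted_fraction(interval, cost_s, mtbs_s)
    print(f"interval {interval:6.0f} s -> wasted fraction {waste:.4f}")
```

A machine that runs 24/7 has a very long time between stops, which pushes the best interval up; an intermittently used machine pushes it down. That is the sense in which a "smart" system could adapt to how the machine is used.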
Joined: 30 Aug 05 Posts: 297 |
Checkpointing is easy for some projects, and much more difficult for others, just because of the nature of the work being done. For example, SETI checkpoints "very often", because there you're running a fairly small loop of activities. Rosetta checkpoints very "rarely", because each WU is made up of only 10 "blocks", that are each started with a random seed. On a fast computer, the 10 checkpoints may each be 10 minutes apart; but on a very slow system, they may be over an hour apart. Short of writing the entire contents of memory to the file, there just isn't a "quick" way to checkpoint any more often than they do. So some projects are much better suited to "intermittent use" machines than others. Rosetta _really_ does best when it runs 24/7, SETI is fine with ten minutes run-time here and there. |
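The block-by-block pattern described above can be sketched like this. It is a minimal illustration, not Rosetta's or BOINC's actual code: the work unit is modelled as a list of seeds, each block is a stand-in computation, and progress is saved only when a whole block finishes. The function names, the JSON checkpoint file, and the `stop_after` parameter (used to simulate an exit) are all assumptions made for the example.

```python
import json
import os
import random
import tempfile


def run_block(seed: int) -> float:
    """Stand-in for one independent block started from a random seed."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))


def run_work_unit(seeds, checkpoint_path, stop_after=None):
    """Run blocks in order, checkpointing after each completed block.

    Returns the list of results, or None if the run was interrupted
    (simulated by stop_after) before every block finished.
    """
    results = []
    if os.path.exists(checkpoint_path):
        # Resume: only fully completed blocks were ever saved.
        with open(checkpoint_path) as f:
            results = json.load(f)["results"]

    done_this_run = 0
    for seed in seeds[len(results):]:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated exit; a partial block would be lost
        results.append(run_block(seed))
        done_this_run += 1
        # Write to a temp file, then rename, so a crash mid-write
        # cannot corrupt the previous checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"results": results}, f)
        os.replace(tmp_path, checkpoint_path)
    return results


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    seeds = list(range(10))  # ten blocks, as in the description above
    first = run_work_unit(seeds, path, stop_after=3)   # interrupted
    second = run_work_unit(seeds, path)                # resumes at block 4
    print("first run finished:", first is not None)
    print("blocks in final result:", len(second))
```

Nothing inside `run_block` is saved, so the cost of an interruption is whatever part of the current block had been computed, which is why slow machines with long blocks lose more per stop.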
Joined: 29 Aug 05 Posts: 225 |
One more point: there are programs where the simulation *HAS* to be run from end to end and there is no possibility that it can be checkpointed. I am not sure that is the case with Rosetta@Home yet, but, for example, CPDN can be very sensitive to stopping and restarting the models. There was some work (Folding@Home?, which is in Alpha test) that did not, or does not, checkpoint at all, and the work runs for a day or so ... So, yes, it can be inattention on the project's part, or it may be just not done yet, not practical to do, etc. Like many problems in system design there may not be a simple answer, contrary to what some assume. And I have not "met" a project staffer who wants to have any more waste than is unavoidable. For all these reasons, *I* recommend 1) Don't run projects in testing 2) If the project is doing something you don't like, vote with your feet. |
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.