checkpoints instead of minutes

Stanley A Bourdon

Joined: 30 Aug 05
Posts: 25
Message 424 - Posted: 17 Sep 2005, 19:40:46 UTC

Would it be possible to set BOINC to run for a fixed number of checkpoints instead of a fixed number of minutes? Even if this saves only 24 minutes of productive crunching per 24/7 machine, it would add significantly to the overall power of the system. The new scheduler with long term debt would keep resource share in line.
Bill Michael

Joined: 30 Aug 05
Posts: 297
Message 427 - Posted: 18 Sep 2005, 2:31:46 UTC - in response to Message 424.  

Would it be possible to set boinc to run a fixed number of checkpoints instead of a fixed number of minutes?


Not sure exactly what you are asking. Projects are switched at the interval you set in the prefs. Status is written at 'checkpoint' times set by the projects, but no more often than you set in the prefs. When I first looked at the checkpoint time issue, I wanted to minimize disk writes on my laptop and set that time quite high, only to find that I was losing hours of work every time I exited BOINC for any reason. The time saved by not writing to disk is milliseconds, not minutes. The time required to switch applications, if "keep in memory" is allowed, is also milliseconds. If the application has to be reloaded, then it could be seconds over the course of a day, but still not minutes.

If you want one WU to finish completely before the next starts, as happens in EDF or "panic mode", that is generally a bad idea. Too likely for later WUs to be returned past deadline if your cache is set to more than a day. Once the "estimated" times are close to the "actual" times, for all projects (V5.x required for everyone) that problem might not exist, and the developers might be able to be talked into allowing EDF as a "user option". It'd be a low priority item even then, because it's hard to see any real reason for it.

Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 431 - Posted: 18 Sep 2005, 12:06:05 UTC

I would also like to see this. JM7 (and/or Chris) and I talked about this at some point in the scheduler discussions. The idea is the client should only change projects immediately after a checkpoint is written. In this configuration the host would only lose a fraction of a second of CPU time per project switch instead of the current amount which varies per project and with user settings. Best case currently is SETI which will lose an average of 30 seconds per switch. Worst is lattice or BURP where the entire workunit has to be started over.

The main problem that came up was getting this to work independently on multi-CPU systems: it always caused all CPUs to change even though only one was ready. Another problem was projects like lattice or BURP that cannot or do not checkpoint at all.
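
A rough sketch of the arithmetic behind this, in Python. It assumes SETI checkpoints about every 60 seconds (inferred from the 30-second average loss quoted above, not a confirmed figure) and that a switch request can land anywhere within a checkpoint interval. The function names are made up for illustration and are not part of BOINC.

```python
import math

def lost_seconds(switch_time, checkpoint_interval):
    """CPU seconds thrown away when an app is removed from memory at
    switch_time and later restarts from its most recent checkpoint."""
    return switch_time % checkpoint_interval

def deferred_switch_time(requested_time, checkpoint_interval):
    """Hypothetical policy: hold the switch until the first checkpoint
    at or after the requested time."""
    return math.ceil(requested_time / checkpoint_interval) * checkpoint_interval

CHECKPOINT_INTERVAL = 60  # seconds; an assumed value for SETI
requests = range(3600)    # one switch request at every second of an hour

# Switching whenever the timer fires: average loss is about half the interval.
average_loss_now = sum(
    lost_seconds(t, CHECKPOINT_INTERVAL) for t in requests
) / len(requests)

# Switching only right after a checkpoint: nothing is lost.
average_loss_deferred = sum(
    lost_seconds(deferred_switch_time(t, CHECKPOINT_INTERVAL), CHECKPOINT_INTERVAL)
    for t in requests
) / len(requests)
```

This only models a single CPU and an app that actually checkpoints; it says nothing about the multi-CPU problem or about apps that never checkpoint.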
BOINC WIKI

BOINCing since 2002/12/8
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 441 - Posted: 18 Sep 2005, 16:45:55 UTC - in response to Message 431.  

Worst is lattice or BURP where the entire workunit has to be started over.

Why don't you leave them in memory then, when preempted?

Bill Michael

Joined: 30 Aug 05
Posts: 297
Message 442 - Posted: 18 Sep 2005, 23:48:30 UTC

I didn't realize you were talking about lost progress because of switching between checkpoints, I thought you meant the lost CPU time during the switch...

The real problem is that it seems none of the science apps checkpoint when they should. They seem to be on the "checkpoint every x minutes as set" and "checkpoint when something significant changes" approach, which leaves out the CRITICAL one to me - "checkpoint when told to quit"!

If every app checkpointed when switched out, no matter what the reason was for switching out - changing to another application, benchmarks started, BOINC Manager quit, PC shutting down... - then we would quit having these "lost crunching time" problems.

This is the checkpoint that SHOULD override the prefs setting and happen every single time. I would bet that this one change would also fix many of the "one app doesn't quit so now there are two running" problems, and the "WU past 100% and still running" problems, etc.
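
To make the "checkpoint when told to quit" idea concrete, here is a minimal Python sketch. It is not how any real science app or the BOINC API works: `quit_requested`, `write_checkpoint`, `load_checkpoint` and `crunch` are invented names, and the quit request is modelled as a plain `threading.Event` that the client would set.

```python
import json
import os
import threading

# Hypothetical: set when the client asks the app to stop for any reason.
quit_requested = threading.Event()

def write_checkpoint(path, state):
    """Write to a temporary file first, then rename it into place, so an
    interrupted write cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Resume from the checkpoint file if there is one, else start at step 0."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def crunch(path, total_steps, stop=quit_requested):
    """Run until finished or told to quit; always checkpoint before quitting."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real work
        if stop.is_set():
            write_checkpoint(path, state)  # the checkpoint that ignores the prefs
            break
    return state["step"]
```

The sketch assumes the checkpoint is quick to write; an app whose checkpoint takes a long time would need something more careful.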

Stanley A Bourdon

Joined: 30 Aug 05
Posts: 25
Message 443 - Posted: 19 Sep 2005, 0:26:51 UTC - in response to Message 441.  

Worst is lattice or BURP where the entire workunit has to be started over.

Why don't you leave them in memory then, when preempted?


For some of us who like to run all the projects this is going to be a problem eventually. And if the computer is turned off this does not matter.
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 444 - Posted: 19 Sep 2005, 3:36:58 UTC - in response to Message 442.  


If every app checkpointed when switched out, no matter what the reason was for switching out - changing to another application, benchmarks started, BOINC Manager quit, PC shutting down... - then we would quit having these "lost crunching time" problems.

Good in theory; however, it just doesn't work in practice. The reason CPDN doesn't checkpoint more often is that it takes longer to checkpoint than to process a timestep. In that case, even if it tried, it would not finish checkpointing before Windows killed the program as unresponsive. When this happens it may produce a corrupt file. As I understand it, BURP is in the same situation: it takes longer to create/restore a checkpoint than it does to reprocess the workunit.
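
Put as a rule of thumb, a checkpoint on exit only helps if it can finish before the OS gives up on the app and if it costs less than redoing the work. A small Python sketch of that rule follows; the function name, its parameters and the sample numbers are all invented for illustration and are not measured CPDN or BURP values.

```python
def worth_checkpointing_on_quit(checkpoint_seconds, grace_seconds,
                                work_at_risk_seconds, restore_seconds=0.0):
    """Return True if writing a checkpoint on exit is both safe and useful.

    checkpoint_seconds:   time needed to write the checkpoint
    grace_seconds:        time the OS allows before killing the app
    work_at_risk_seconds: CPU time that would have to be redone without it
    restore_seconds:      time needed to load the checkpoint on restart
    """
    fits_in_grace_period = checkpoint_seconds < grace_seconds
    cheaper_than_redoing = (
        checkpoint_seconds + restore_seconds < work_at_risk_seconds
    )
    return fits_in_grace_period and cheaper_than_redoing

# A quick checkpoint that protects 30 seconds of work: worth doing.
quick_app = worth_checkpointing_on_quit(0.5, 10, 30)

# A checkpoint that takes longer than the OS will wait: risks a corrupt file.
slow_writer = worth_checkpointing_on_quit(45, 10, 600)

# Creating and restoring costs more than reprocessing: not worth it.
costly_restore = worth_checkpointing_on_quit(5, 10, 200, restore_seconds=300)
```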
BOINC WIKI

BOINCing since 2002/12/8
fzb

Joined: 1 Oct 05
Posts: 2
Message 630 - Posted: 1 Oct 2005, 19:17:43 UTC

Depending on the WU length, I could even imagine an option to finish one WU completely before switching projects (although for CPDN a trickle would be a more appropriate switching point). LTD should take care of resource shares.
Leaving apps in memory is not a solution for machines with many projects or many processors. For example, my Xeon with 4 procs would already "waste" a large amount of memory with 2-3 projects.
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 645 - Posted: 2 Oct 2005, 9:45:10 UTC

Does anyone happen to know what the actual time intervals are for those projects that do have checkpoints? Seti was mentioned as 30 secs but what about others like EAH and LHC for example?

If a user sets "Write to disk ..." as 60 secs, does that mean Seti will skip every alternate checkpoint? I know that is what you would assume but does it actually work that way?

In thinking about this, as a policy for machines running multiple projects 24/7 where uptimes were normally in the range of weeks to months, and where each machine had plenty of RAM (physical or virtual), I made the decision that the most efficient way to operate would be to set the preference to something like 120 secs (to prevent any "chatty" app from writing too frequently) and to keep apps in memory when preempted (to remove any possible checkpoint loss). Then once in a blue moon, if a machine went down and was rebooted, the average loss when restarting from a checkpoint would be only 60 secs. Any comments on possible flaws in this sort of policy?
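
For what it's worth, the 60-second figure can be checked with a few lines of Python. This assumes a restart is equally likely at any moment between two checkpoints and that checkpoints really are written every 120 seconds. The one-reboot-per-30-days figure is just a made-up example, and the function names are hypothetical.

```python
def average_loss_per_restart(checkpoint_interval):
    """Mean CPU seconds lost when a restart lands at a uniformly random
    point between two checkpoints."""
    return checkpoint_interval / 2

def lost_fraction(checkpoint_interval, uptime_seconds):
    """Share of one task's crunching time lost to one restart per uptime period."""
    return average_loss_per_restart(checkpoint_interval) / uptime_seconds

loss = average_loss_per_restart(120)            # 60.0 seconds
fraction = lost_fraction(120, 30 * 24 * 3600)   # one reboot every 30 days (assumed)
```

If the app ends up writing checkpoints less often than the preference suggests, the average loss grows in proportion.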

Cheers,
Gary.
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 660 - Posted: 3 Oct 2005, 9:11:04 UTC

CPDN is every 144 timesteps. Sorry, I don't know the others off the top of my head.

The preference sets a flag allowing a checkpoint once that many seconds have passed since the last checkpoint. The next time the app is ready to checkpoint after that flag is set, the checkpoint will be written. For example, if the app checkpoints every 4 minutes and write to disk is set for 5 minutes, then you will actually get checkpoints about every 9 minutes.
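
The flag mechanism can be played through in a short Python simulation. This is only a sketch under one assumption: that the app's chances to checkpoint fall on a fixed, evenly spaced grid. `checkpoint_times` is an invented helper, not a BOINC function. On a fixed 4-minute grid with a 5-minute preference the gap works out to 8 minutes. The 9 minutes (5 + 4) is the upper bound, which you would only approach if the app's opportunities are not evenly spaced.

```python
def checkpoint_times(app_interval, min_interval, horizon):
    """Times (in seconds) at which a checkpoint is actually written.

    app_interval: seconds between the app's checkpoint opportunities
                  (assumed to sit on a fixed grid)
    min_interval: the "write to disk at most every N seconds" preference
    horizon:      how many seconds of run time to simulate
    """
    written = []
    last_written = 0  # treat the start of the run as the last checkpoint
    t = app_interval
    while t <= horizon:
        # The flag is set once min_interval has passed since the last write;
        # the first opportunity at or after that point writes the checkpoint.
        if t - last_written >= min_interval:
            written.append(t)
            last_written = t
        t += app_interval
    return written

# App ready every 4 minutes, preference set to 5 minutes, one hour of run time.
four_and_five = checkpoint_times(240, 300, 3600)

# App ready every 30 secs, preference set to 60 secs: every alternate one is skipped.
thirty_and_sixty = checkpoint_times(30, 60, 300)
```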
BOINC WIKI

BOINCing since 2002/12/8


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.