Effect of suspending and limiting on checkpoints

[STS]LoB

Joined: 9 May 23
Posts: 12
Message 111747 - Posted: 9 May 2023, 8:43:06 UTC

While experimenting a bit, I ran into the following question: how do suspending tasks and limiting CPU usage affect checkpointing and the loss of work?

So: is the work done since the last checkpoint lost if I ...

  • Suspend the whole client?
  • Suspend single CPU cores (via the preference "Use at most N % of the CPUs")?
  • Throttle CPU time (via the preference "Use at most N % CPU time")?

ID: 111747
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 111753 - Posted: 9 May 2023, 9:22:11 UTC - in response to Message 111747.  

It depends on:

  • The quality of the project's scientific programming.
  • Other settings under your control.

Not every project can produce valid checkpoint files, or successfully read them back for a restart after suspension. Not much you can do about that.

For CPU tasks (only), leaving applications in memory when suspended should cause minimal data loss on suspension. If the application is removed from memory - as it always will be if BOINC or the computer is restarted - time spent processing since the last checkpoint will be lost and wasted. That could be up to several minutes, again depending on the project.
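
As a rough illustration of that arithmetic, here is a sketch only. It assumes a task that checkpoints at perfectly regular intervals, which real projects need not do, and `lost_seconds` and its parameters are made-up names, not part of BOINC:

```python
def lost_seconds(elapsed, checkpoint_interval, kept_in_memory):
    """Estimate seconds of work that must be redone after a suspension.

    Assumes (hypothetically) that checkpoints land exactly every
    `checkpoint_interval` seconds of processing.
    """
    if kept_in_memory:
        # The process still exists and resumes where it stopped.
        return 0
    # Otherwise the task restarts from its most recent checkpoint.
    return elapsed % checkpoint_interval


# Suspended 1000 s into a task that checkpoints every 300 s:
print(lost_seconds(1000, 300, kept_in_memory=True))   # 0
print(lost_seconds(1000, 300, kept_in_memory=False))  # 100
```

In this toy model the worst case approaches one full checkpoint interval, which matches the "up to several minutes" estimate for projects that checkpoint every few minutes.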

GPU tasks are always removed from memory, so the wastage is unavoidable.
ID: 111753
Brian Nixon

Joined: 19 Apr 23
Posts: 16
United Kingdom
Message 111754 - Posted: 9 May 2023, 9:22:34 UTC - in response to Message 111747.  
Last modified: 9 May 2023, 9:48:40 UTC

For CPU apps: whether or not work is lost when tasks are suspended depends on the setting "Leave non-GPU tasks in memory while suspended". If that is checked, no work should be lost, because the task processes continue to exist and just pick up where they left off when preferences later allow them to run again. If it is not checked, task processes will exit when they are suspended, so the outcome depends on the individual app: if it is able to capture the exit request and checkpoint immediately, no work is lost. Otherwise it will simply get killed and have to restore from its previous checkpoint when restarted.
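
The "capture the exit request and checkpoint immediately" pattern might look roughly like the toy sketch below. It is not how any particular project app is written, and it does not show how BOINC actually delivers the request to an app: `CheckpointingTask`, its methods, and the JSON checkpoint file are all hypothetical names invented for this illustration.

```python
import json
import os
import tempfile


class CheckpointingTask:
    """Toy task that counts up to `total_steps`, saving progress to a file."""

    def __init__(self, path, total_steps, checkpoint_every):
        self.path = path
        self.total_steps = total_steps
        self.checkpoint_every = checkpoint_every
        self.exit_requested = False
        self.step = self._restore()

    def _restore(self):
        # Resume from the last checkpoint if one exists, else start at zero.
        try:
            with open(self.path) as f:
                return json.load(f)["step"]
        except FileNotFoundError:
            return 0

    def _checkpoint(self):
        # Write to a temporary file and rename it, so that being killed
        # mid-write cannot leave a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp, self.path)

    def handle_exit_request(self, signum=None, frame=None):
        # Only set a flag here; the main loop does the actual saving.
        self.exit_requested = True

    def run(self, on_step=None):
        """Return True if the task finished, False if it stopped early."""
        while self.step < self.total_steps:
            self.step += 1  # one unit of "work"
            if on_step:
                on_step(self)
            if self.exit_requested:
                self._checkpoint()  # checkpoint immediately, then stop
                return False
            if self.step % self.checkpoint_every == 0:
                self._checkpoint()  # regular periodic checkpoint
        self._checkpoint()
        return True


# Demo: an exit request arrives during step 7 of 20.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

def request_exit_at_7(task):
    if task.step == 7:
        task.handle_exit_request()

first = CheckpointingTask(path, total_steps=20, checkpoint_every=5)
finished = first.run(on_step=request_exit_at_7)
resumed = CheckpointingTask(path, total_steps=20, checkpoint_every=5)
print(finished, resumed.step)  # False 7
```

Because the request is honoured, the restarted task resumes at step 7. An app that was simply killed at that point would restart from the periodic checkpoint at step 5 and redo two steps. The two unused handler arguments only mean the same method could also be registered as a POSIX signal handler, which is another assumption of this sketch.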

The effect is the same in all three cases: the only difference is in the number of tasks that are either suspended or allowed to run (for the % of CPUs case), and the frequency at which suspension and resumption occur (for the % of time case). [Edit: this is not true; see the correction below.]
ID: 111754
[STS]LoB

Joined: 9 May 23
Posts: 12
Message 111755 - Posted: 9 May 2023, 9:32:50 UTC - in response to Message 111754.  
Last modified: 9 May 2023, 9:33:24 UTC

Thank you, Richard and Brian.

@Brian:
"The effect is the same in all three cases: the only difference is in the number of tasks that are either suspended or allowed to run (for the % of CPUs case), and the frequency at which suspension and resumption occur (for the % of time case)."

I don't think that is true. When limiting the percentage of CPU time, the applications stay in memory and active ALL the time (even if "Leave non-GPU tasks in memory while suspended" is unticked!).
So they are not officially suspended; they just receive less CPU time. That's why I prefer this over suspending threads completely.
ID: 111755
Brian Nixon

Joined: 19 Apr 23
Posts: 16
United Kingdom
Message 111756 - Posted: 9 May 2023, 9:51:14 UTC - in response to Message 111755.  
Last modified: 9 May 2023, 9:57:30 UTC

You’re right: tasks are not removed from memory when they are suspended by the % of CPU time preference [source]. Thanks for the correction!
ID: 111756


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.