The old scheduling problem strikes again

Darrell
Joined: 29 Aug 05
Posts: 11
Message 78066 - Posted: 20 May 2017, 20:04:04 UTC
Last modified: 20 May 2017, 20:17:32 UTC

I was running a PrimeGrid task on the GPU that was expected to take 20 hours, but the new GPU finishes them in about 8. The task got to 99.990% complete, the estimated time left was 7 seconds, it had run for 8 hrs 13 mins 16 secs straight, and then the client decided to suspend it and go on to other projects. Arrg.

I'll have to figure out where to put an if statement in the client code, something like:

if (estimated_time_left < 60)  // keep running the task
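
Spelled out a little more, this is the rule I have in mind (just an illustrative sketch; none of these names come from the actual BOINC client source, which derives its estimates differently):

// Hypothetical sketch: don't preempt a task that is within a minute of
// finishing. All names here are made up for illustration.
struct TaskInfo {
    double fraction_done;     // 0.0 .. 1.0, as reported by the app
    double elapsed_seconds;   // run time so far
};

// Crude remaining-time estimate projected from progress so far.
double estimated_time_left(const TaskInfo& t) {
    if (t.fraction_done <= 0.0) return 1e9;               // unknown: assume a long time
    double projected_total = t.elapsed_seconds / t.fraction_done;
    return projected_total - t.elapsed_seconds;
}

// Called when the scheduler wants to switch projects.
bool ok_to_preempt(const TaskInfo& t) {
    if (estimated_time_left(t) < 60.0) return false;      // keep running the task
    return true;                                           // otherwise switch as usual
}
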
ID: 78066
Darrell
Joined: 29 Aug 05
Posts: 11
Message 79605 - Posted: 15 Jul 2017, 3:48:39 UTC - in response to Message 78378.  

Yes, and was that 99.99% a checkpoint event? I've seen 7 seconds turn into hours and hours, and since those 'estimated' TTCs are notoriously inexact **, the client just goes ahead and applies the app-switch logic if it is another project's turn.

** There's an app_config.xml tag, <fraction_done_exact/>, that at least gives a better real-time remaining-time estimate for sciences whose progress doesn't evolve linearly.
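
For example, a minimal app_config.xml using that tag could look like this (the app name below is only a guess taken from the task name in the log further down; the project's real short app name is in client_state.xml):

<app_config>
   <app>
      <name>genefer19</name>
      <fraction_done_exact/>
   </app>
</app_config>

Put it in the project's directory under the BOINC data folder and tell the client to re-read config files.
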

I would have no problem if the cpu_scheduler were actually following the 'switch between tasks every 60 minutes' setting, but with PrimeGrid GPU tasks that run for more than an hour it isn't. Here is the log from a current PrimeGrid task that had run for 1:47:42 and was 99.917% complete, with 14 seconds remaining:

7/14/2017 9:47:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:47:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:47:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:48:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:48:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:49:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:49:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:50:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:50:50 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:50:50 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:51:20 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:51:20 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:51:26 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:51:26 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:52:27 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:52:27 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:53:27 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:53:27 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:53:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:54:28 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:54:28 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:54:51 PM | PrimeGrid | [cpu_sched] Preempting genefer19_10979043_0 (removed from memory)
7/14/2017 9:54:51 PM | PrimeGrid | [task] task_state=QUIT_PENDING for genefer19_10979043_0 from request_exit()
7/14/2017 9:54:51 PM | | request_exit(): PID 7148 has 0 descendants
7/14/2017 9:54:52 PM | PrimeGrid | [task] Process for genefer19_10979043_0 exited, exit code 0, task state 8
7/14/2017 9:54:52 PM | PrimeGrid | [task] task_state=UNINITIALIZED for genefer19_10979043_0 from handle_exited_app

And why is it removing the task from memory, in violation of the 'leave suspended tasks in memory' setting? Is it to clear the GPU's memory? The problem with letting a task run for hours and then suspending it just before it completes is that it is usually hours before the task is restarted to finish those final few seconds, which could mean ending up as just the checker for a prime instead of the computer that found it.
ID: 79605
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 79609 - Posted: 15 Jul 2017, 8:52:53 UTC - in response to Message 79605.  

And why is it removing the task from memory, in violation of the 'leave suspended tasks in memory' setting? Is it to clear the GPU's memory?
Yes, GPU apps are always removed from memory when suspended. GPUs don't have the facility to swap stale memory images out to a paging file on disk, so a suspended task would always continue to occupy real, physical memory, which might be in short supply.
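
In other words, the 'leave suspended tasks in memory' preference effectively applies to CPU tasks only. A simplified sketch of the rule (illustrative only; these are not the actual BOINC client identifiers):

// Illustrative sketch of the preemption behaviour described above.
// The names are assumptions for clarity, not real BOINC client code.
enum PreemptAction { SUSPEND_IN_MEMORY, REMOVE_FROM_MEMORY };

PreemptAction preempt_action(bool uses_gpu, bool leave_apps_in_memory) {
    if (uses_gpu) {
        // GPU memory can't be paged out to disk, so the task is always quit
        // and will restart later from its most recent checkpoint.
        return REMOVE_FROM_MEMORY;
    }
    // CPU tasks honour the user's "leave suspended tasks in memory" preference.
    return leave_apps_in_memory ? SUSPEND_IN_MEMORY : REMOVE_FROM_MEMORY;
}
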
ID: 79609
Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 79616 - Posted: 15 Jul 2017, 15:14:57 UTC - in response to Message 79609.  
Last modified: 15 Jul 2017, 15:15:45 UTC

And why is it removing the task from memory, in violation of the 'leave suspended tasks in memory' setting? Is it to clear the GPU's memory?
Yes, GPU apps are always removed from memory when suspended. GPUs don't have the facility to swap stale memory images out to a paging file on disk, so a suspended task would always continue to occupy real, physical memory, which might be in short supply.

Am I right in saying the GPU task actually falls back to its last checkpoint and restarts from that point?
ID: 79616
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 79617 - Posted: 15 Jul 2017, 16:24:04 UTC - in response to Message 79616.  

Am I right in saying the GPU task actually falls back to its last checkpoint and restarts from that point?
They should do, provided the developer has implemented checkpointing correctly.
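
For reference, 'implemented correctly' means the app follows the standard BOINC API checkpoint pattern. A minimal sketch, with a plain loop counter standing in for whatever state a real app has to save:

// Standard BOINC checkpoint pattern, sketched with a loop counter as the only
// "state"; a real app saves whatever it needs to resume exactly where it stopped.
#include "boinc_api.h"
#include <cstdio>

int main() {
    boinc_init();

    long i = 0;
    FILE* f = std::fopen("checkpoint.txt", "r");          // resume from a previous run if possible
    if (f) { std::fscanf(f, "%ld", &i); std::fclose(f); }

    const long TOTAL = 1000000;
    for (; i < TOTAL; i++) {
        // ... one unit of real work for iteration i goes here ...

        if (boinc_time_to_checkpoint()) {                  // the client says now is a good time
            FILE* g = std::fopen("checkpoint.txt", "w");
            std::fprintf(g, "%ld", i + 1);                 // next iteration to run after a resume
            std::fclose(g);
            boinc_checkpoint_completed();                   // tell the client the checkpoint is done
        }
        boinc_fraction_done((double)(i + 1) / TOTAL);       // keeps the remaining-time estimate honest
    }

    boinc_finish(0);                                        // report completion
}

If the app doesn't actually write resumable state before calling boinc_checkpoint_completed(), a preempted GPU task restarts from the beginning rather than from 99.9%.
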
ID: 79617
