Any way to exempt one project from suspending tasks?

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 102400 - Posted: 4 Jan 2021, 14:21:21 UTC

I run CPDN, and I've been told their tasks don't like being suspended. I find if they get suspended several times, they crash.

To sort this, is there any way I can tell Boinc not to pause CPDN tasks? I can certainly set "switch between applications" to a very large number (it's currently 60 minutes, which I think is the default), but then other projects would have tasks returned late if, for example, a Primegrid task occupied all 24 cores for 2 weeks. Those tasks don't mind being paused, so I still want to allow that.

I've got CPDN on weight 1000000 and the others between 1 and 50, so Boinc won't pause a CPDN task to do something else unless the something else is a multi-core task.

I have "leave applications in memory" ticked. I'll cease downloading them for the games machine. The other 6 run Boinc 24/7.

The only thing I can think of doing is manually editing the deadline for CPDN tasks as they come in. If I set it to make Boinc think they're going to be late, it'll run them no matter what, even at the same time as a 24 core task from another project.
ID: 102400
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4598
United Kingdom
Message 102401 - Posted: 4 Jan 2021, 14:53:37 UTC

I think I'd constrain the Primegrid task to say 20 threads using app_config.xml, and let CPDN have continuous use of one (or more) of the remaining cores.
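
A minimal sketch of what such an app_config.xml might look like, generated from Python so the thread count is easy to change. The app name `example_mt_app` is a placeholder (the real short name has to come from the project), and the `-t` thread flag is an assumption about how the Primegrid application takes its thread count. `max_concurrent`, `avg_ncpus` and `cmdline` are standard app_config.xml elements.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, threads: int, max_concurrent: int = 1) -> str:
    """Return app_config.xml text that caps one multithreaded app."""
    root = ET.Element("app_config")

    # <app>: limit how many tasks of this app may run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: the CPU budget the client assumes per task, plus the
    # command-line option that tells the application how many threads to use.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "avg_ncpus").text = str(threads)
    ET.SubElement(version, "cmdline").text = f"-t {threads}"  # assumed flag

    # Pretty-print where available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "example_mt_app" is hypothetical; substitute the project's real app name.
    print(build_app_config("example_mt_app", threads=20))
```

The output would be saved as app_config.xml in the project's folder under the BOINC data directory, then loaded with Options > Read config files in the Manager or by restarting the client.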
ID: 102401
Peter Hucker
Message 102403 - Posted: 4 Jan 2021, 15:05:19 UTC - in response to Message 102401.  

"I think I'd constrain the Primegrid task to say 20 threads using app_config.xml, and let CPDN have continuous use of one (or more) of the remaining cores."
Sounds good, although it's not flexible for future occurrences. Say this time I got 6 CPDN tasks and constrained Primegrid to only use 18 cores: that would work until the next time I get 12 CPDN tasks, when incoming Primegrids would keep pausing 6 of them.
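
A rough sketch of that arithmetic, assuming a naive model in which the capped multi-core task always gets its threads and single-core tasks only run on whatever cores are left. The real Boinc scheduler also weighs deadlines and resource shares, so this only illustrates the counting. The function name is made up for the example.

```python
def tasks_paused(total_cores: int, capped_threads: int, single_core_tasks: int) -> int:
    """Naive count of single-core tasks that cannot run next to the capped task."""
    free_cores = total_cores - capped_threads
    return max(0, single_core_tasks - free_cores)


print(tasks_paused(24, 18, 6))   # 0: six CPDN tasks fit in the six free cores
print(tasks_paused(24, 18, 12))  # 6: twelve CPDN tasks, only six free cores
```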

I think my idea should work:

If I get a CPDN for every core, I'll do nothing. It'll only pause them once to shove other things in that are running late.

But if I get fewer CPDN tasks than the number of cores, or once some of them have completed, I'll set the CPDN deadlines to 1 day from now. Then any incoming 24 core tasks will run at the same time as the CPDNs. Everything runs more slowly, but nothing gets paused and crashes. The 24 core tasks often don't fully utilise all the cores anyway.
ID: 102403
Richard Haselgrove
Message 102405 - Posted: 4 Jan 2021, 15:24:53 UTC - in response to Message 102403.  

"The 24 core tasks often don't fully utilise all the cores anyway."
Precisely. The more cores you have working in parallel, the longer you have to wait at the synchronisation points for the laggards to catch up. I'd even suggest two PG tasks at 10 threads each, plus 4 spare cores for anything else. Keep away from multiples of 4 or 6, else PG will fill the beast and you're back to square 1.
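
To see why some thread counts fill the machine and others leave room, here is a small sketch that assumes the client keeps starting multi-threaded tasks until no more fit on 24 cores. That packing rule is a simplification for illustration, and `layout` is a made-up helper name.

```python
from typing import Tuple


def layout(total_cores: int, threads_per_task: int) -> Tuple[int, int]:
    """Return (tasks_that_fit, spare_cores) under a simple pack-until-full rule."""
    tasks = total_cores // threads_per_task
    spare = total_cores - tasks * threads_per_task
    return tasks, spare


for threads in (4, 6, 10, 20):
    tasks, spare = layout(24, threads)
    print(f"{threads:>2} threads/task: {tasks} task(s), {spare} spare core(s)")
```

With 4 or 6 threads per task the 24 cores divide evenly and nothing is left over, whereas 10 threads gives two tasks with 4 cores spare, and 20 threads gives one task with 4 spare.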
ID: 102405
Peter Hucker
Message 102406 - Posted: 4 Jan 2021, 15:46:01 UTC - in response to Message 102405.  
Last modified: 4 Jan 2021, 15:46:25 UTC

The CPDN moderators have told me it should be ok to suspend them as long as they're left in memory, so I'm not sure why they're crashing. I have all my computers set to leave applications in memory when suspended.

PG does actually use about 22 of 24 cores on average, which is good enough. And since that will disable HT sometimes, and also allow the CPU to engage a higher turbo gear, presumably I'm not even losing those 2 cores in their entirety.
ID: 102406
Dave
Joined: 28 Jun 10
Posts: 1473
United Kingdom
Message 102407 - Posted: 4 Jan 2021, 16:11:37 UTC - in response to Message 102406.  

I am assuming all your tasks are from the latest safr (South Africa region) batch, statistics for which are as follows:
Success: 203 (6%)
Fails: 4397 (126%)
Hard Fail: 711 (20%)
Running: 2576 (74%)
Unsent: 0 (0%)

Hard fail means failed on all three attempts. Of those that have crashed on one computer and succeeded on another, most seem to have succeeded on a different CPU type from the one where they crashed. AMD CPUs seem to crash slightly more often per CPU, but I have not looked closely enough to see if this is statistically significant.

Bottom line is the crashes may be nothing to do with suspending the tasks. I would even go so far as to say probably nothing to do with it.
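
On the 126% figure: the numbers above are consistent with every percentage being taken against the number of workunits in the batch (Success + Hard Fail + Running + Unsent = 3490), while Fails counts individual failed attempts, of which a workunit can have up to three. That denominator is an inference from the figures, not something CPDN has stated. A short check:

```python
counts = {"Success": 203, "Fails": 4397, "Hard Fail": 711, "Running": 2576, "Unsent": 0}

# Assumed denominator: each workunit is in exactly one of these four states.
workunits = counts["Success"] + counts["Hard Fail"] + counts["Running"] + counts["Unsent"]


def percent_of_workunits(value: int, total: int) -> int:
    """Percentage of the workunit total, rounded to the nearest whole number."""
    return round(100 * value / total)


percentages = {name: percent_of_workunits(value, workunits) for name, value in counts.items()}

print(workunits)    # 3490
print(percentages)  # {'Success': 6, 'Fails': 126, 'Hard Fail': 20, 'Running': 74, 'Unsent': 0}
```

Failed attempts can outnumber workunits because each workunit gets up to three tries, which is how Fails ends up above 100%.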
ID: 102407
Peter Hucker
Message 102408 - Posted: 4 Jan 2021, 16:33:30 UTC - in response to Message 102407.  
Last modified: 4 Jan 2021, 16:41:47 UTC

Yip, just read something over in CPDN (from you? same first name, same timestamp-ish) saying there's something up with them. I did get some failures with the November batch, but not that many:

"Just to let you know that I have stopped resends going out from this batch as the failures are very high. The experiment setup and files etc are the same as we have used for other weather@home regions so I wasn’t expecting this degree of error. I know that the South Africa region is not as stable though as other regions and I am wondering if this is what is causing the problems in this case - it would be consistent with just an initial conditions difference causing an error as Ian reported. I will check that I picked up the best restart file to use and also what the researchers want to do."

As for AMD/Intel, I'm not so sure. I have one AMD (a new Ryzen) and 6 Intels (only the 5 older ones managed to get CPDN tasks). All have been equally unreliable.
ID: 102408
Dave
Message 102413 - Posted: 4 Jan 2021, 21:29:29 UTC - in response to Message 102408.  

"As for AMD/Intel, I'm not so sure. I have one AMD (a new Ryzen) and 6 Intels (only the 5 older ones managed to get CPDN tasks). All have been equally unreliable."


I did add the caveat that what I had looked at may not be statistically significant. In the absence of anyone with greater script writing skills and more time than I have to look at the issue and pull the relevant data from the servers, it will stay not statistically significant. I only saw the email, whose contents I then posted, just after I posted here.
ID: 102413
Peter Hucker
Message 102436 - Posted: 5 Jan 2021, 18:16:50 UTC - in response to Message 102413.  

"I did add the caveat that what I had looked at may not be statistically significant. In the absence of anyone with greater script writing skills and more time than I have to look at the issue and pull the relevant data from the servers, it will stay not statistically significant. I saw the email that I posted the contents of just after I posted here."
I got one to finish!

As for the palaver above, I will be doing that anyway, to prevent multi-core tasks from other projects from stealing time from CPDN. The fastest machine plays games, but I can live with the second GPU and most of the 24 cores running Boinc while I play, so I'm just manually pausing some stuff. The game only needs the primary GPU and 3 cores, though I'm not sure what will happen with a 53" 4K 120Hz monitor and a grand's worth of Nvidia running. The game might decide to make use of all those cores, which at the moment are only benefitting Boinc!
ID: 102436

Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.