GPUgrid not always resuming tasks correctly

Message boards : Projects : GPUgrid not always resuming tasks correctly


Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 366
United States
Message 94242 - Posted: 13 Dec 2019, 20:30:47 UTC - in response to Message 94222.  

Interesting. I think you may have figured out a workaround. I have most often seen a task resumed after a suspension start up again on device 0. So I think your solution of moving the slot contents might work.

Needs an experiment to test validity. The harm is only in dumping the task which you would have done anyway if it didn't start back up on the same device.
ID: 94242 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94243 - Posted: 13 Dec 2019, 20:46:37 UTC

No slot contents got moved during the course of this conversation. It's the device that matters, not the storage location.
ID: 94243 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94244 - Posted: 13 Dec 2019, 22:27:28 UTC - in response to Message 94243.  
Last modified: 13 Dec 2019, 22:31:26 UTC

No slot contents got moved during the course of this conversation. It's the device that matters, not the storage location.


It might work if the apps are all the same executable code and can handle sm_52, sm_60, or anything else.

Consider this: the app that was running in slot 17 is brought back up in slot 17 by BOINC. However, it is given a different GPU device.

If the data files in the slot have been moved there correctly, they now match the device that is going to do the crunching.

This gets difficult with more than two boards. It could easily be tested, but it is not anything I am really interested in doing, and work units are few and far between.

Looking in the gpugrid folder, there seems to be a lot of stuff, and possibly different apps. If there is one app for each class of device, then moving the slot data will fail because the app cannot handle the data, even if the data matches the device.
ID: 94244 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94246 - Posted: 13 Dec 2019, 23:15:15 UTC

There is just one app - acemd3.exe, which is invoked by wrapper_6.1_windows_x86_64.exe

The wrapper defines how acemd3 is deployed, using job-win64.xml:

<job_desc>
    <task>
	<application>acemd3.exe</application>
	<command_line>--boinc input --device $GPU_DEVICE_NUM</command_line>

	<setenv>TMPDIR=$PWD</setenv>
	<setenv>TEMP=$PWD</setenv>
	<setenv>TMP=$PWD</setenv>
	<setenv>HOME=$PWD</setenv>
	<setenv>ACEMD_PLUGIN_DIR=$PWD</setenv>
	<setenv>SystemRoot=C:\Windows</setenv>
	<setenv>ComSpec=C:\Windows\system32\cmd.exe</setenv>

	<stdout_filename>progress.log</stdout_filename> 
	<checkpoint_filename>restart.chk</checkpoint_filename>
	<fraction_done_filename>progress</fraction_done_filename>
    </task>
</job_desc>
(that's the one which is missing the 'priority' line)

That also shows how acemd3 gets the device number from init_data.xml (via the wrapper setting $GPU_DEVICE_NUM)
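For completeness, the fragment of init_data.xml that the wrapper reads looks roughly like this (field names from memory, so treat it as a sketch rather than gospel):

```xml
<app_init_data>
    <gpu_type>NVIDIA</gpu_type>
    <gpu_device_num>1</gpu_device_num>
    <gpu_usage>1.000000</gpu_usage>
</app_init_data>
```

The wrapper picks up <gpu_device_num> and passes it along as $GPU_DEVICE_NUM, which is how acemd3 ends up with the --device argument shown above.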

That much is certain. Now for the speculation.

There are also a lot of DLLs ('Dynamic Link Libraries', under Windows). An interesting one is 'OpenMMCudaCompiler.dll': just at the moment, I don't have a GPUGrid task running, so I can't identify the source code that the compiler is going to work on. It may be downloaded with the task.

My guess is that when the task starts, OpenMMCudaCompiler.dll compiles the source code to suit the GPU initially specified by BOINC. On restart, the compiled code is found, or assumed, to be already present - but if the device has changed, the old code is no longer appropriate for the new device. We need to find a way of forcing re-compilation to suit the new device.

Which I'll try to do next time I get allocated a task during working hours.
ID: 94246 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94313 - Posted: 17 Dec 2019, 13:37:14 UTC - in response to Message 94246.  

Which I'll try to do next time I get allocated a task during working hours.
Well, I got one - just in time for SETI maintenance, which is convenient.

Having backed up all the files, taken a copy of client_state.xml, and disabled networking - let's see what we can do.

First, I can confirm that running on a GTX 750 Ti, the compiler output files start

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26218862
// Cuda compilation tools, release 10.1, V10.1.168
// Based on LLVM 3.4svn
//

.version 6.4
.target sm_50
.address_size 64
- so they are specific to the card in use. I'll run this to the first checkpoint (~15 minutes), and then see if it'll pick up on the GTX 970 if I delete those files (testing to see if they get re-generated automatically).
ID: 94313 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94314 - Posted: 17 Dec 2019, 13:42:56 UTC
Last modified: 17 Dec 2019, 14:04:29 UTC

Sadly, no go. Found an error message:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!
so I'll have to put all those backup files back in place. Ugh.

Well, it's running. File restart.chk is generated at the first checkpoint, and the only readable bits are:

25000
GeForce GTX 970
OpenMM Binary Checkpoint
- the rest is binary. (And there's a lot of it - 24,186,281 bytes. I don't think we'll be able to get much further.)
ID: 94314 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94333 - Posted: 17 Dec 2019, 20:51:30 UTC - in response to Message 94314.  
Last modified: 17 Dec 2019, 21:17:31 UTC

Sadly, no go. Found an error message:
ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!



Does this scenario describe what is happening?
Premise: at start, d0 is faster than d1, and boinc assigns the faster GPU first
d0 using long-data-0 in slot-0
d1 using short-data-1 in slot-1 and has a short deadline
d1 finished first as data-1 is simple
d1 working on short-data-2 in slot-1 and also has a short deadline

====tasks suspended and system reboots===

on startup tasks resumed are data0 and data2
data0 is in slot 0
data2 is in slot 1
so far no problem
boinc looks at priorities to decide which to run first: the short tasks have short deadlines
boinc chooses the faster GPU for the short deadline, and "d0" starts working on data-2 in slot-1, which
is a GPU mismatch, not a slot or data mismatch


Just a guess; I'm trying to figure out what has happened.
I have 3 systems set up to get gpugrid, but not a single task has shown up in days, so I am just speculating. Even if I did get some tasks in, I would have to move a gtx1060 into a system with a gtx1070 to get a mismatch, and I have had bad experiences moving boards needlessly.

[edit] This thread is so far off the original subject that Keith should request the moderator to move just about everything to a new thread "GPUgrid not always resuming tasks correctly" or something like that and put that into "projects"
ID: 94333 · Report as offensive
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 366
United States
Message 94342 - Posted: 18 Dec 2019, 3:15:01 UTC - in response to Message 94333.  

[edit] This thread is so far off the original subject that Keith should request the moderator to move just about everything to a new thread "GPUgrid not always resuming tasks correctly" or something like that and put that into "projects"

done.
ID: 94342 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94345 - Posted: 18 Dec 2019, 6:57:51 UTC

While the thread is still around, I will brag: I got 4 gpugrid tasks running on my SETI mining machine. SETI has been out of tasks for hours and I lucked out and snagged a few.
ID: 94345 · Report as offensive
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 366
United States
Message 94346 - Posted: 18 Dec 2019, 7:36:38 UTC - in response to Message 94345.  
Last modified: 18 Dec 2019, 7:38:58 UTC

While the thread is still around, I will brag: I got 4 gpugrid tasks running on my SETI mining machine. SETI has been out of tasks for hours and I lucked out and snagged a few.

Richard will hate me for saying it, but I seem to get GPUGrid work every day on all or most of my cards. It may be as little as one task, but I consistently get work.

[Edit] I've had work every day since 8 December it seems. Just lucky or maybe the schedulers reward hosts that return work every day with more work.
ID: 94346 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94347 - Posted: 18 Dec 2019, 8:23:42 UTC - in response to Message 94333.  

boinc looks at priorities to decide which to run first: the short tasks have short deadlines
boinc chooses the faster GPU for the short deadline, and
I'm not sure that is the cause

"d0" starts working on data-2 in slot-1 which
is a GPU mismatch, not a slot or data mismatch
but that's certainly the mechanism that causes the error.

I think it's more like

d0 starts working on task 1
d1 starts working on task 2
d0 finishes task, starts working on task 3
task 1 reports, and everything shifts up one: d1 is working on task 1, d0 is working on task 2
restart
d0 is allocated to the new task 1
d1 is allocated to task 2

- which is a swap from before the restart. I think it's a simple 'first come, first served' for each device, each task.
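If it helps, here's a toy sketch of that 'first come, first served' reshuffle. This is my own illustration, not actual client code; the function and task names are made up:

```python
def assign_devices(fifo_tasks, num_devices=2):
    """First come, first served: walk the task list from the top and
    hand each runnable task the next free device number."""
    return {task: dev for dev, task in enumerate(fifo_tasks[:num_devices])}

# Before the restart the real pairing was d1 -> task 1, d0 -> task 2
# (d0 had finished an earlier task, so it picked up its task later).
before_restart = {"task 1": 1, "task 2": 0}

# After the restart the client only has the list order to go on:
after_restart = assign_devices(["task 1", "task 2"])
# -> {"task 1": 0, "task 2": 1}: both tasks wake up on the other card.
```

So even though nothing in the slots changed, the pairing of task to device is the reverse of what it was before the restart.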
ID: 94347 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94363 - Posted: 18 Dec 2019, 14:25:49 UTC - in response to Message 94347.  
Last modified: 18 Dec 2019, 15:12:45 UTC


I think it's more like

d0 starts working on task 1
d1 starts working on task 2
d0 finishes task, starts working on task 3
task 1 reports, and everything shifts up one: d1 is working on task 1, d0 is working on task 2

I do not follow this. I would think that if task 1 completed, then it is gone and done with, and d1 would get tasks 3, 4, 5, etc.
However, if a "task" is considered a class (say task 1 is s_52 and task 2 is s_60), then there are only "2 tasks".
There is no task 3, just data 1, 2, 3, etc. for only the two classes.
Task 1 reports, and another s_52 arrives and starts running.
so far:
 d0 is on s_60
 d1 is on s_52
 reboot occurs; T1 asks for a device before T2 does
s_52 gets d0 as d0 is top of the list ==> fail
s_60 gets d1 as d1 is next in line ==> fail



restart
d0 is allocated to the new task 1
d1 is allocated to task 2

- which is a swap from before the restart. I think it's a simple 'first come, first served' for each device, each task.


Another possibility I was thinking of: one of Keith's special SETI CUDA jobs comes in, and after 60 minutes d0 is switched to it.
If the resource share for GPUGRID is 0 and SETI is 100 (likely for Keith), then I believe GPUGRID will run to completion and the 60-minute time slice does not apply. I run a lot of backup projects, normally Einstein, and I have never seen them give up their time to a higher priority when they are at zero. I see time slicing at 60 minutes when both projects are at 50% or thereabouts. However, I might not have noticed a 0-100 exchange, so I cannot be 100% sure.

[EDIT] The whole thread was moved, not just the part that deviated. Anyway, I am glad that I did not get a private message for each of the "moved" messages in the thread, as happened to me at SETI recently.

[EDIT-2] If indeed the data is downloaded as s_52 and s_60, then the problem could be fixed by sending only the lower class, as the better device will be able to handle any of the lower classes.

Question: How does the project know there is more than one type of GPU? If the scheduler request identifies what is available, that accounts for different classes being sent. In that case the GPU identification could be "faked" to indicate that all the GPUs were lower class, and all would get s_52 instead of a mix. That could easily be done, as the SETI people already fake the number of GPUs; all that is necessary is to force all the identities to be the weaker GPU. Just a guess, and it would only work if the checkpoint file contains science data only and not unique GPU parameters.
ID: 94363 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94365 - Posted: 18 Dec 2019, 16:33:33 UTC - in response to Message 94363.  

I think it's more like

d0 starts working on task 1
d1 starts working on task 2
d0 finishes task, starts working on task 3
task 1 reports, and everything shifts up one: d1 is working on task 1, d0 is working on task 2
I do not follow this. I would think that if task 1 completed then it is gone and done with and d1 would get task 3,4,5 etc
however, if a "task" is considered a class such as task 1 is s_52 and task 2 is s_60 then there are only "2 tasks"
There is no task 3, just data 1,2, 3 ,etc for only the two classes
Sorry, I was unclear. I was numbering the tasks as they appear in an unsorted FIFO list. We work on tasks 1, then 2, then 3. When task 1 is reported as complete, it is removed from the list, and everything below moves up one line. Task 2 becomes the new task 1, task 3 becomes the new task 2, and so on. Those new list positions are the only ones known to BOINC after the restart.
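In other words (a tiny illustration of the renumbering, not client code):

```python
fifo = ["task 1", "task 2", "task 3"]   # list positions as BOINC sees them

fifo.pop(0)   # task 1 reports as complete and is removed from the list

# Everything below moves up one line: task 2 is now the "first" task,
# task 3 the "second" - and those positions are all BOINC knows of
# after the restart.
print(fifo)   # ['task 2', 'task 3']
```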

Question: How does the project know there is more than one type of GPU? If the scheduler request identifies what is available then that accounts for different classes being sent. In that case the GPU identification could be "faked" to indicate that all the GPU were lower class and all would get s_52 instead of a mix. That could easily be done as the SETI people already fake the number of GPUs and all that is necessary is force all the identities to be the weaker GPU. just a guess and it would only work if the checkpoint file contains science data only and not unique gpu parameters.
No: the server does not know that you have GPUs of different specifications. Look in the file 'sched_request_www.gpugrid.net.xml' - that's the only way that GPUGrid (or any project) gets information about our machines. The file contains all the data you see on the website (and then some), but it goes on to say

<coproc_cuda>
   <count>2</count>
   <name>GeForce GTX 970</name>
- and that's from one of my GTX 970 + GTX 750 Ti combos. There's no reference to the lesser card at all.
ID: 94365 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94366 - Posted: 18 Dec 2019, 16:59:53 UTC - in response to Message 94365.  
Last modified: 18 Dec 2019, 17:00:22 UTC

No: the server does not know that you have GPUs of different specifications. Look in the file 'sched_request_www.gpugrid.net.xml' - that's the only way that GPUGrid (or any project) gets information about our machines. The file contains all the data you see on the website (and then some), but it goes on to say

<coproc_cuda>
   <count>2</count>
   <name>GeForce GTX 970</name>
- and that's from one of my GTX 970 + GTX 750 Ti combos. There's no reference to the lesser card at all.


Then a module in the gpugrid project folder decides to use s_52 for the lower class and s_60 for the better.

I suspect that if it simply picked s_52, the problem would go away, since either co-processor can process s_52.

This could be suggested to the project.

It would be nice to prove this was the case.
ID: 94366 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 3959
United Kingdom
Message 94367 - Posted: 18 Dec 2019, 17:18:59 UTC - in response to Message 94366.  

I think all that happens locally, not on the server.

As I see it, the BOINC client (which does know about the different GPUs installed) just picks 'first free card', and tells the project's app to get on with its task on that card. For GPUGrid, the first process is to compile the application source code into a format suitable for that card - that's the point at which the s_52, s_60 etc come into the picture.

[I haven't looked at all the files to find the raw source code yet, but I might. The compiler output is an intermediate p-code similar to assembly language, which is why we can read it: there'll be a later interpreter stage where the p-code is rendered into binary op-codes]
ID: 94367 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 94377 - Posted: 18 Dec 2019, 21:07:25 UTC
Last modified: 18 Dec 2019, 21:09:04 UTC

Found a solution

One of my boards overheated, and I needed to reboot and issue new fan-speed settings, as I had forgotten to do that when I powered it up earlier. The board that overheated was the one VNC was using, so I was unable to use VNC to run the speed settings. Speed settings require a $DISPLAY and cannot be done from PuTTY.

There was one GPUGrid task left over from the four I had, and it had about an hour left. It was on d4, so I excluded d0..d3 and d5..d8 in cc_config and rebooted.
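For anyone wanting to replicate that, the exclusions go into cc_config.xml along these lines (device numbers and URL here are just an example for my setup):

```xml
<cc_config>
  <options>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>5</device_num>
    </exclude_gpu>
    <!-- one exclude_gpu block per device to deny -->
  </options>
</cc_config>
```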

GPUGRID got d4 as all others were excluded. This gave me the idea:

If you have a pair of RTX 2070s and a single GTX 1060, then use one BOINC service for the pair of RTXs and another service for the GTX. Since each client sees all the boards, the exclude is used to deny access to the other service's boards.

It is a PITA to set up multiple clients with the existing BOINC Windows and Linux versions. However, a script can be created to simplify the procedure. I actually have a script I tested on Milkyway that worked fine. It split my 6 GPUs into a pair of 3 each, which allowed me to obtain the project max of 900 work units for each client. I have no need for 1800 work units; it was just a test, and I am back to 900 for all 6 GPUs. My script was simplified because I was able to use my special BOINC client "mod" to supply a different hostname to the client, which gets a unique project id for the new host. Without that option the script would be much more complicated. If anyone is truly interested, I can put a script together and submit it as 3rd party, but I would need the feature of setting the hostname. This was discussed in issue 3337 and marked as to-be-determined.

Different GPUs in a system are not that common among regular users, and a professional gridcoin miner would have all identical boards per system. The GPUGrid project could modify their code to "start over"; they already know the board is different, as a message is printed to that effect. This whole discussion is a storm in a teacup.
ID: 94377 · Report as offensive
Profile Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 366
United States
Message 94379 - Posted: 19 Dec 2019, 3:15:13 UTC

My solution is even simpler. I just set "switch task every xx minutes" to 360 minutes for all my hosts. The longest task so far on the slowest card has only come to about 4.75 hours, so I'm still in the clear for the tasks to start and finish on the same card.
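For reference, that preference lands in global_prefs_override.xml as something like this (tag name from memory, so check before relying on it):

```xml
<global_preferences>
    <cpu_scheduling_period_minutes>360</cpu_scheduling_period_minutes>
</global_preferences>
```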

Just have to pay attention whenever you want to stop BOINC or shut a machine down: look first and see if any GPUGrid tasks are running. The one host with identical cards is never a problem, because whatever the pre-compiler generates is compatible with all the cards.

I don't have any issues with priority because of the once-started-don't-stop setting, even though the resource share is 10:1 for Seti against my other projects.
ID: 94379 · Report as offensive
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 538
United States
Message 95359 - Posted: 20 Jan 2020, 21:18:34 UTC

Verified a solution to the problem that I had guessed at.

6 GPUs: d0 is a 1660Ti (CC 7, not supported by Asteroids at home); the rest are gtx class (CC 6.0).

d0 was an excluded GPU for Asteroids.

d1..d5 were excluded GPUs for Einstein as it was OK for Einstein to use the gtx1660Ti

Unaccountably, d0 was idle. This is a different problem for a different post.

A gpugrid task was running on d1, in slot 0.
There were 4 Asteroids running.

I did not want an idle gpu so I suspended all tasks, went to slot 0 and deleted the checkpoints. This forces gpugrid to start over.

After rebooting, I resumed gpugrid first. That got it d0, the best GPU, and it started up in slot 0, which is where the app was suspended in the first place. All the other projects resumed from suspension just fine, there are no idle GPUs, and no gpugrid "computation error", since it started from scratch rather than where it left off on a gtx1070.
ID: 95359 · Report as offensive


Copyright © 2020 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.