All GPU boinc projects return computation error?

ProDigit

Joined: 8 Nov 19
Posts: 180
United States
Message 94045 - Posted: 6 Dec 2019, 5:44:53 UTC

My PC runs Linux (Lubuntu) and has been working flawlessly for the past few weeks.
I closed BOINC while everything was working and turned off the PC. When I restarted BOINC a few days later, all my projects returned a computation error.
The Nvidia drivers are detected, and everything is exactly the same as before.

I ended up having to reinstall BOINC, which fixed the issue on only SOME projects.
Asteroids and Einstein both worked fine before. Now they just error out :(
Meanwhile, new downloads of Collatz and Milkyway seem to work fine...
What could be the cause of this?
JStateson
Volunteer tester
Help desk expert
Joined: 27 Jun 08
Posts: 516
United States
Message 94062 - Posted: 6 Dec 2019, 19:06:34 UTC - in response to Message 94045.  

ProDigit wrote:
My PC runs Linux (Lubuntu) and has been working flawlessly for the past few weeks.
I closed BOINC while everything was working and turned off the PC. When I restarted BOINC a few days later, all my projects returned a computation error.
The Nvidia drivers are detected, and everything is exactly the same as before.


It is up to the project to implement a restart mechanism. Some projects have a robust method and others, to put it nicely, do not.

Some projects (like "A") take longer to write checkpoints (recovery files) than others. If you "close the lid" on your laptop or tell the OS to shut down, there is a good chance that project "B" will not get to write its checkpoints, and when you restart, some of "A" will have errored out as well as all of "B".
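
To illustrate why an interrupted checkpoint write matters, here is a minimal sketch of one common way an application can make its checkpoint writes survive a sudden shutdown: write to a temporary file, then rename it over the old checkpoint. This is a general technique, not what any of the projects named here actually does. The function names and the JSON format are made up for the example.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Save state so a reader finds the old or the new checkpoint, never half of one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in (atomic rename on POSIX)
    except BaseException:
        # The write was interrupted: drop the partial file, keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path, default=None):
    """Return the saved state, or default if the file is missing, empty or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

An application that writes its checkpoint in place instead can be left with a truncated file if the power goes off mid-write, and that is the kind of file that makes a task error out on restart.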

GPUgrid: If you have two GPUs and they are different, there is a 50% chance that GPU0 will use GPU1's checkpoint and GPU1 will use GPU0's. This causes both work units to report compilation errors. If you have 3 different GPUs, there is far less than a 33% chance.

Depending on which system I need to power down, I do the following:
Issue the command for NO NEW TASKS
Suspend all work units that have not started
Wait for all GPU tasks to finish
I have not had a problem with CPU-bound tasks like WCG, but you may want to suspend CPU tasks if there is a problem
Exit the Gridcoin "research" program (this is a must, as it has a really terrible handler for SIGTERM or Windows shutdown)
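
The first two steps of that routine can be scripted with boinccmd. What follows is a rough sketch, not a tested tool. It assumes `boinccmd --get_tasks` prints one `key: value` line per field, including `name`, `project URL` and `active_task_state`, and that a task which is not currently started has no `active_task_state` or reports `UNINITIALIZED`. The exact output differs between client versions, so check it on your own machine first. The function names are made up for the example.

```python
import subprocess


def parse_tasks(text):
    """Split `boinccmd --get_tasks` output into one dict per task."""
    tasks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-----------"):  # e.g. "1) -----------" starts a task
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def shutdown_commands(tasks):
    """Build the boinccmd calls for 'no new tasks' and 'suspend unstarted work'."""
    tasks = [t for t in tasks if "name" in t and "project URL" in t]
    commands = []
    for url in sorted({t["project URL"] for t in tasks}):
        commands.append(["boinccmd", "--project", url, "nomorework"])
    for t in tasks:
        # Assumption: a missing or UNINITIALIZED state means the task is not started.
        if t.get("active_task_state", "UNINITIALIZED") == "UNINITIALIZED":
            commands.append(
                ["boinccmd", "--task", t["project URL"], t["name"], "suspend"]
            )
    return commands


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


def prepare_shutdown(dry_run=True):
    """Query the local client and apply the first two steps of the routine."""
    out = subprocess.run(
        ["boinccmd", "--get_tasks"], check=True, capture_output=True, text=True
    ).stdout
    run(shutdown_commands(parse_tasks(out)), dry_run=dry_run)
```

Calling `prepare_shutdown()` with the default `dry_run=True` only prints what it would do. Waiting for the GPU tasks to finish and exiting Gridcoin are still done by hand.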


ProDigit wrote:
I ended up having to reinstall BOINC, which fixed the issue on only SOME projects.
Asteroids and Einstein both worked fine before. Now they just error out :(
Meanwhile, new downloads of Collatz and Milkyway seem to work fine...
What could be the cause of this?


The only time a reinstall of BOINC is needed is if there is a disk drive problem and BOINC does not start.
On rare occasions (GPUgrid comes to mind) there is a bug in the project startup, such as a null account or a null (empty) reply file caused by the power going off while the file was being written. In that case, very likely all subsequent work units will error out. Just reset the project instead of reinstalling BOINC.
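
As a rough sketch of that check, the snippet below lists zero-byte files in the BOINC data directory and builds the boinccmd call that resets a single project. The `sched_reply_*.xml` file name pattern and the data directory you pass in (for example `/var/lib/boinc-client` on many Linux packages) are assumptions, so adjust them to your install. The function names are made up for the example.

```python
from pathlib import Path


def empty_files(data_dir, pattern="sched_reply_*.xml"):
    """Return the zero-byte files in data_dir that match pattern, sorted by name."""
    return sorted(
        p for p in Path(data_dir).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


def reset_command(project_url):
    """Build the boinccmd call that resets one project (nothing is executed here)."""
    return ["boinccmd", "--project", project_url, "reset"]
```

If `empty_files(...)` turns up a file belonging to the project that keeps erroring out, that points at the project's local state and not at the BOINC install, and the reset is the thing to try. Keep in mind that resetting a project throws away its tasks in progress.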
ProDigit

Joined: 8 Nov 19
Posts: 180
United States
Message 94063 - Posted: 7 Dec 2019, 5:20:16 UTC - in response to Message 94062.  
Last modified: 7 Dec 2019, 5:20:47 UTC

Thank you for the explanation.
I can understand the last project erroring out, but all tasks (about 20 of them) had compute errors.
I'm not sure whether newly downloaded tasks had the same error; I'll have to take a look now.

Copyright © 2020 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.