All project with CUDA capabilities are freeze and Display driver stopped responding and has recovered..

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 50968 - Posted: 22 Oct 2013, 20:29:23 UTC - in response to Message 50964.  
Last modified: 22 Oct 2013, 20:32:45 UTC

Not sure why that wasn't useful? Didn't it show all the projects I'm attached to?

It does show all the projects you're attached to, but that isn't useful for finding your computers.
E.g. these are all the David Johnsons at Einstein. None of them is you, because at the projects you don't go by David Johnson but by dajohnso.

So instead, these are more useful:
10/22/2013 3:11:37 AM | rosetta@home | URL http://boinc.bakerlab.org/rosetta/; Computer ID 1518755; resource share 100
10/22/2013 3:11:37 AM | superlinkattechnion | URL http://cbl-boinc-server2.cs.technion.ac.il/superlinkattechnion/; Computer ID 109671; resource share 0
10/22/2013 3:11:37 AM | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 4676852; resource share 100
10/22/2013 3:11:37 AM | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 6418760; resource share 300
10/22/2013 3:11:37 AM | Spinhenge@home | URL http://spin.fh-bielefeld.de/; Computer ID 223333; resource share 50

They show not only the projects you're attached to, but also the computer's host ID, which lets us go directly to that host and check up on things.
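As a rough sketch of how those event log lines map to a host page, the snippet below parses one of the lines quoted above and builds a link. The function names are made up for illustration, and the show_host_detail.php path is an assumption that holds only for projects running the stock BOINC web code; some projects use a different layout.

```python
import re

# Pattern for the startup lines quoted above:
#   <timestamp> | <project> | URL <url>; Computer ID <id>; resource share <n>
LINE_RE = re.compile(
    r"^(?P<stamp>.+?) \| (?P<project>.+?) \| "
    r"URL (?P<url>\S+); Computer ID (?P<host_id>\d+); "
    r"resource share (?P<share>\d+)$"
)

def parse_project_line(line):
    """Return project, url, host_id and share as a dict, or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "project": m.group("project"),
        "url": m.group("url"),
        "host_id": int(m.group("host_id")),
        "share": int(m.group("share")),
    }

def host_page(info):
    # ASSUMPTION: stock BOINC web code, where a host's page lives at
    # show_host_detail.php?hostid=<id>. Not every project keeps this path.
    base = info["url"].rstrip("/")
    return f"{base}/show_host_detail.php?hostid={info['host_id']}"

line = ("10/22/2013 3:11:37 AM | Einstein@Home | "
        "URL http://einstein.phys.uwm.edu/; Computer ID 4676852; resource share 100")
info = parse_project_line(line)
print(info["project"], info["host_id"], host_page(info))
```

The point is only that the Computer ID is the one piece of the line that identifies the machine; the project URL alone does not.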

E.g. your host at Einstein, where we can check how your tasks are doing and, when one has a computation error, click on that result's ID to see what was written to its stderr.txt.

And there's a lot of trouble going on there.
Some have exit code -1073741819 (0xc0000005), an access violation, which can point to driver problems or RAM trouble.
Others have "couldn't start app: Input file l1_0595.75_S6Directed missing or invalid: file missing", which could be due to file or disk corruption, or because the file was being written to during a sudden reboot.
Then others have "There is not enough space on the disk. (0x70) - exit code 112 (0x70)", which I think speaks for itself.
And as a last example, there's one I really don't know: "Cannot create a symbolic link in a registry key that already has subkeys or values. (0x3fc) - exit code 1020 (0x3fc)".
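For anyone wondering how the negative exit code and the hex value relate: the negative number is just the same 32-bit value read as a signed integer. A small sketch, with hypothetical helper names; the table holds only the three codes mentioned above, and the first description is the standard Windows name for that status rather than text from the stderr output.

```python
# Only the codes discussed in this post. The 0x70 and 0x3FC texts are the
# messages quoted from stderr; 0xC0000005 is the Windows access violation status.
KNOWN = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (access violation)",
    0x70: "There is not enough space on the disk.",
    0x3FC: "Cannot create a symbolic link in a registry key "
           "that already has subkeys or values.",
}

def as_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return exit_code & 0xFFFFFFFF

def describe(exit_code):
    """Format an exit code the way the task pages show it, plus a description."""
    value = as_unsigned32(exit_code)
    return f"{exit_code} ({value:#x}): {KNOWN.get(value, 'unknown')}"

for code in (-1073741819, 112, 1020):
    print(describe(code))
```

The description says what went wrong in the application, not why; the driver, RAM and disk checks are still needed to find the cause.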

Do a disk check.
Do a thorough RAM check with memtest86+.
It could be a problem with the motherboard (micro-fracture) that only shows up when things get warm enough.
There's nothing as impossibly difficult to diagnose as a system that throws its toys out of the pram for no good reason whatsoever.
ID: 50968
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 50972 - Posted: 22 Oct 2013, 21:02:56 UTC - in response to Message 50966.  
Last modified: 22 Oct 2013, 21:04:22 UTC

I also just installed GPU-Z, and when a CUDA task is running the GPU temp doesn't get above 55 °C, with the fan at 60% and GPU load at 41%. Maybe it's just the work unit, but why only 41% load on the GPU? I'll keep an eye on it and see if other work units use more GPU.

The load percentage GPU-Z gives is a bit misleading. It would make you believe that the GPU doesn't use all its stream processors on the one task, but it does. What this load figure really tells you is that the task doesn't use all the available memory on the video card, so you can run 2 or 3 tasks on it at the same time.

Although 'at the same time' isn't strictly correct, as the GPU then quickly switches between them. Present-day GPUs cannot put part of their processors to work on one task and the rest on another task at the same time. Perhaps in the (far) future.
ID: 50972
Gary Charpentier
Joined: 23 Feb 08
Posts: 2465
United States
Message 50975 - Posted: 23 Oct 2013, 0:49:56 UTC
Last modified: 23 Oct 2013, 0:50:19 UTC

I'm going to toss in a random one here, because I was also having display restarts on occasion. Get out the Dust-Off (canned air) and get the dirt out of the display card's heatsink. While you have the box open, do the rest of the computer. See if that doesn't help. It can't hurt.
ID: 50975
david johnson

Joined: 9 Jun 09
Posts: 14
United States
Message 50977 - Posted: 23 Oct 2013, 6:21:37 UTC - in response to Message 50968.  

Thanks for all the useful information. I am worried about some of the error messages, though, especially the out-of-disk one. Were all these errors recent?

BOINC is installed on my D: drive with 292G free; it's been at about 292G free for the last year, so there's no way I had a disk-full situation. I'll run the memory test as suggested, but the system has been stable for the last 2 days and I am back to running GPU tasks. I am currently limiting WUs to Einstein@home and Rosetta@home only.
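For reference, free space on the drive can be confirmed from a script as well. This is only a sketch: free_gib is a made-up helper name and "." is a placeholder path, to be pointed at the BOINC data directory.

```python
import shutil

def free_gib(path):
    """Free space, in GiB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 2**30

# "." is a placeholder; use the BOINC data directory (on D: in this case).
print(f"{free_gib('.'):.1f} GiB free")
```

It only reports what the operating system sees at this moment, so it cannot show whether the disk was full at the time a task failed.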

Also, the MB and memory are brand new. My last MB failed, so I recently replaced the PS, MB, and RAM.

I just upgraded to a 1250W PS tonight as well, just in case. If all goes well, I'll add a second GTX 770.
ID: 50977
david johnson

Joined: 9 Jun 09
Posts: 14
United States
Message 50990 - Posted: 24 Oct 2013, 5:29:32 UTC

FYI, as a follow-up: I have BOINC running on 3 machines now with various NVIDIA GPUs. They are all stable and working fine. I have suspended all WUs for SETI, Spinhenge, and superlinkattechnion, and everything has been stable ever since. I'll give it a few more days and then see if adding any of them back causes a repeat of the issue. I have the CPU running all the time and the GPU only when I'm AFK (the idle delay varies from 30 sec to 30 min on different machines). GPU tasks have suspended and started dozens of times without issue.
ID: 50990

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.