Not running both GPUs

Message boards : Questions and problems : Not running both GPUs
Claggy

Joined: 23 Apr 07
Posts: 1112
United Kingdom
Message 37951 - Posted: 23 May 2011, 19:26:23 UTC - in response to Message 37950.  

Boinc Startup messages would help, as would a link to that host on that project,

Claggy
ID: 37951
Claggy

Joined: 23 Apr 07
Posts: 1112
United Kingdom
Message 38037 - Posted: 28 May 2011, 22:09:43 UTC - in response to Message 38036.  
Last modified: 28 May 2011, 22:11:06 UTC

First thing I would try is the 266.58 drivers; there are downclocking problems with the 27x.xx Cuda4 drivers in conjunction with legacy Cuda apps. See this post:

This problem has recently been drawn to the attention of the SETI CUDA developers, and is in the process of being reported onwards to other CUDA-enabled BOINC projects. It seems to be related specifically to the release of nVidia drivers which will support the forthcoming v4 release of the CUDA run-time support files. The central BOINC library code isn't yet fully compatible with CUDA v4: the problem has been overcome with test programs, and should go away with the next round of application releases.

Unfortunately, v3 and earlier nVidia drivers aren't available for the very latest generation of nVidia cards, but if you have a card which can run with nVidia driver 266.58, that should avoid the problem.

Interesting. Richard, do you know if this is only affecting the 5 series cards, or is it affecting others running the 275.61's? I downgraded to the 275.51's and don't seem to be having these issues.

I'm not involved in the detailed technicalities, just a messenger. I'll try and find out - or someone who knows might post.



OK, I will. It's quite involved, but I'll give the detail first, then explain further if needed.

Certain new methods by which the Cuda4 drivers handle memory & Cuda transfers are sensitive to being abruptly terminated without warning. All Windows Boinc-Cuda app releases to date use boincApi code for their exit handling, since Boinc needs to tell applications through this channel when to snooze/resume/exit etc., as well as when the worker needs to exit normally.
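To illustrate the channel described above, here is a minimal sketch of a worker loop polling client control requests (snooze/resume/exit). The enum and function names are illustrative assumptions, not the real boincapi symbols:

```cpp
#include <atomic>

// Hypothetical sketch of the Boinc control channel: the client asks the
// science app to suspend, resume, or quit, and the app's worker loop
// polls that request between units of work. Names are illustrative only.
enum class ClientRequest { None, Suspend, Resume, Quit };

std::atomic<ClientRequest> g_request{ClientRequest::None};
std::atomic<bool> g_suspended{false};

// One pass of the worker's polling logic; returns false when the
// worker should exit (a Quit request was seen).
bool handle_client_request() {
    switch (g_request.exchange(ClientRequest::None)) {
        case ClientRequest::Suspend: g_suspended = true;  break;
        case ClientRequest::Resume:  g_suspended = false; break;
        case ClientRequest::Quit:    return false;  // time to exit
        case ClientRequest::None:    break;
    }
    return true;  // keep working (possibly while suspended)
}
```

The point of the sketch is that exit is just another message on the same channel as snooze/resume, which is why the way the exit path is handled matters to the Cuda transfers in flight.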

The symptoms of using Cuda 4 drivers with current Boinc-Cuda applications are primarily the 'sticky downclock' problem, but also other forms of unexplained errors.

There are other symptoms visible across non-Cuda (CPU) applications as well, the most visible being truncation or erasure of the stderr.txt contents, and, less visibly, possibly of checkpoint & result files too.

These sorts of symptoms, being apparently related to how 'nicely' the program treats the active buffer transfers when the application shuts down, seem to be statistically more common on lower bus/memory speed systems, probably because the transfers etc. take longer there (i.e. higher contention).

The trial solution in testing is to install exit code within boincApi that 'asks' the worker thread (which feeds the Cuda device etc.) to shut down 'nicely', so that it can quickly finish what it is doing & tidy up before being 'killed'. At present this seems effective at preventing the downclock problem, & possibly the stderr truncation symptoms as well, though we're still poking at it looking for unexpected issues. I've relayed as much information as I can to Berkeley & will leave it in their hands.
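The "ask nicely" exit pattern above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual boincApi patch: the exit path raises a flag and joins the worker, instead of killing the thread mid-transfer:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Illustrative sketch (not real boincapi code): instead of terminating the
// worker thread abruptly, the exit path raises a quit flag and waits for
// the worker to finish its in-flight work and tidy up first.
std::atomic<bool> g_quit_requested{false};
std::atomic<bool> g_worker_tidied{false};

void worker_loop() {
    while (!g_quit_requested.load(std::memory_order_acquire)) {
        // ... feed the Cuda device: kernel launches, buffer transfers ...
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // Quit flag seen: let in-flight transfers drain, free device buffers.
    g_worker_tidied.store(true, std::memory_order_release);
}

// Called from the exit path in place of an abrupt thread kill.
bool shutdown_worker_nicely(std::thread& worker) {
    g_quit_requested.store(true, std::memory_order_release);
    worker.join();  // give the worker time to tidy up before exiting
    return g_worker_tidied.load(std::memory_order_acquire);
}
```

The design choice is simply cooperative shutdown: the thread that owns the Cuda context is the one that winds it down, which is what keeps the driver from being left mid-transfer.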

If you experience the downclock problems, there are currently 2 options I'm aware of:
- Downgrade to driver 266.58, which is not as sensitive to its tasks being summarily terminated, or
- Determine whether you absolutely need the fix now. That would only be a possibility for this project (other projects don't have the fix yet & may not even be aware of the issue), and only under special circumstances, as it would involve pre-alpha testing of unproven code. We are a bit overworked at the moment with V7 & other development considerations, so please don't expect a rush release of this unproven code.

In any case, high throughput hosts are statistically less susceptible to this problem, so it is quite possible many hosts won't see the symptoms appear even with newer drivers & existing applications.

HTH, Jason


Claggy
ID: 38037


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.