Automatic fan speed control on GTX boards with CUDA

pharrg
Joined: 8 Jan 09
Posts: 24
United States
Message 22793 - Posted: 31 Jan 2009, 18:49:53 UTC

I saw a thread that mentioned fan speeds, but I've yet to figure out how to get the fan speeds to adjust according to work load automatically. I would think the board drivers would do that, but on mine they don't. I have a pair of GTX 260s in SLI. The fan speeds seem to default to 40%. When I start running a CUDA app at full saturation, the temperature of the GPUs begins climbing, but the fan speeds never increase correspondingly. Temps continue to climb until the board seizes, causing a driver error and a computation error on the work unit.

I installed the nVidia System Tools package, which allows me to manually set the fan speeds, but at times when I'm not running CUDA, such as listening to music or gaming, I don't want my machine sounding like a set of turbines next to me, nor do I want to have to manually change the fan speeds every time I do something different. I tried using the new feature the tools added to create rules for fan speed based on temp, but they appear to have no effect. The speed still stays where you manually set it regardless of temperature. I have the latest nVidia drivers. Again, I would think the drivers would do this automatically. Am I missing something?

Perhaps an option could be added to BOINC where we could set the fan behavior when CUDA is being utilized. It would probably be a simple, single command to set the speed to a higher setting when CUDA starts, and return it to default when CUDA stops, though ideally I'd prefer to see it tied to actual GPU temp. Perhaps there's a utility that will monitor temps and change fan speed automatically that someone knows about? Or nVidia could add this to the drivers? Any help appreciated, since right now it's either manually crank up the fans or don't run CUDA. Thanks.
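
To make the "tied to actual GPU temp" idea concrete, here is a minimal sketch of a temperature-to-fan-speed curve in Python. It only computes a target duty cycle. Reading the temperature and applying the speed would still need a vendor tool, which is not shown. The name `fan_percent` and every curve point are hypothetical; only the 40% idle figure comes from the post.

```python
from bisect import bisect_right

# Hypothetical curve points: (GPU temperature in C, fan duty in %).
# Only the 40% idle speed is taken from the post; the rest are illustrative.
FAN_CURVE = [(50, 40), (65, 60), (75, 80), (85, 100)]


def fan_percent(temp_c, curve=FAN_CURVE):
    """Return a fan duty cycle by linear interpolation along the curve."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return curve[0][1]   # at or below the first point: idle speed
    if temp_c >= temps[-1]:
        return curve[-1][1]  # at or above the last point: full speed
    i = bisect_right(temps, temp_c)  # index of first point hotter than temp_c
    (t0, p0), (t1, p1) = curve[i - 1], curve[i]
    return round(p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0))


if __name__ == "__main__":
    for temp in (46, 60, 70, 90):
        print(temp, "C ->", fan_percent(temp), "%")
```

A real utility would probably also add some hysteresis so the fans do not hunt up and down around a threshold.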
ID: 22793
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 22795 - Posted: 31 Jan 2009, 19:25:38 UTC - in response to Message 22793.  

I saw a thread that mentioned fan speeds, but I've yet to figure out how to get the fan speeds to adjust according to work load automatically. I would think the board drivers would do that, but on mine they don't.

The drivers, working together with the VBIOS, will take care of that.


ID: 22795
pharrg
Joined: 8 Jan 09
Posts: 24
United States
Message 22796 - Posted: 31 Jan 2009, 19:32:50 UTC

That's what I thought in my original post, that the drivers and vbios should do that automatically, but something must be wrong on my system, because every time I've tried to run some CUDA work, the card just continues to heat up until it crashes with no change in fan speed. As it stands now, I can't run CUDA, but I don't know why my cards are behaving like this. I haven't overclocked or messed with any other settings on the video cards, or the rest of my system for that matter. Frustrating...
ID: 22796
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 22797 - Posted: 31 Jan 2009, 19:54:19 UTC - in response to Message 22796.  

I can ask the developer to look at that, but he'll need information about your system setup and the cards used, OS, drivers etc. etc.

It may be a problem with your cards, though. It could also be a side effect of running them in SLI mode, as the GPUs will then be seen as one (by the system, nothing to do with BOINC).
ID: 22797
pharrg
Joined: 8 Jan 09
Posts: 24
United States
Message 22798 - Posted: 31 Jan 2009, 20:07:14 UTC

Thanks... I think I'll check with nVidia as well to see if they have an idea.

Here's what BOINC reports:

1/30/2009 6:06:15 PM||Starting BOINC client version 6.6.3 for windows_x86_64
1/30/2009 6:06:15 PM||log flags: task, file_xfer, sched_ops
1/30/2009 6:06:15 PM||Libraries: libcurl/7.19.2 OpenSSL/0.9.8i zlib/1.2.3
1/30/2009 6:06:15 PM||Data directory: C:\ProgramData\BOINC
1/30/2009 6:06:15 PM||Running under account pharrg
1/30/2009 6:06:15 PM||Processor: 8 GenuineIntel Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 4]
1/30/2009 6:06:15 PM||Processor features: fpu tsc pae nx sse sse2 pni
1/30/2009 6:06:15 PM||OS: Microsoft Windows Vista: Home Premium x64 Editon, Service Pack 1, (06.00.6001.00)
1/30/2009 6:06:15 PM||Memory: 5.99 GB physical, 12.09 GB virtual
1/30/2009 6:06:15 PM||Disk: 233.81 GB total, 148.92 GB free
1/30/2009 6:06:15 PM||Local time is UTC -7 hours
1/30/2009 6:06:15 PM||Not using a proxy
1/30/2009 6:06:16 PM||CUDA device: GeForce GTX 260 (896MB, est. 96GFLOPS)



------------------------------------------------
And here's the sysinfo log from the nVidia control panel:


Video Cards: XFX GX260NADFF GeForce GTX 260 in SLI
Driver: 181.22
CPU: Core i7 920
OS: Vista Home Premium 64bit



Here's the complete system info log:


Current Time = Sat Jan 31 12:50:18 2009 Mountain Standard Time
Computer Name = PHYSICSCORE
Number of Processors = 8
========== CPUID Info ==========
Processor 0:
Brand = Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
Family = 6
Model = 10
Stepping = 4

L1 Instr Cache Size = 0
L1 Data Cache Size = 32768
L2 Cache Size = 262144
L3 Cache Size = 0

Feature flags:

CPUID 00000001[edx]=0xbfebfbff:

fpu : Processor supported vme : Processor supported
de : Processor supported pse : Processor supported
msr : Processor supported pae : Processor supported
mce : Processor supported cx8 : Processor supported
apic : Processor supported sep : Processor supported
mtrr : Processor supported pge : Processor supported
mca : Processor supported cmov : Processor supported
pat : Processor supported pse36 : Processor supported
cflush : Processor supported ds : Processor supported
acpi : Processor supported mmx : Processor supported
fxsr : Processor supported sse1 : Processor supported
sse2 : Processor supported ss : Processor supported
htt : Processor supported tm : Processor supported
sbf : Processor supported



CPUID 00000001[ecx]=0x0098e3bd:

tm2 : Processor supported est : Processor supported
cid : No processor support sse3 : Processor supported



CPUID 80000001[edx]=0x28100800:

fpu : No processor support vme : No processor support
de : No processor support pse : No processor support
tsc : No processor support msr : No processor support
pae : No processor support mce : No processor support
cx8 : No processor support syscall : Processor supported
mtrr : No processor support pge : No processor support
mca : No processor support cmov : No processor support
pat : No processor support pse36 : No processor support
mp : No processor support nx : Processor supported
ammx : No processor support mmx : No processor support
fxsr : No processor support lm : Processor supported
x3dnow : No processor support 3dnow : No processor support

[Processors 1 through 7 report the same CPUID information as Processor 0.]
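
A note on reading the CPUID section: BOINC reports "Family 6 Model 26 Stepping 4" while this log reports "Model = 10". The two are consistent if this log prints only the base model field and BOINC combines it with the extended model field, as Intel's CPUID rules describe for family 6. The sketch below shows that arithmetic and decodes a few feature bits from the logged EDX value. The extended model value of 1 is an assumption inferred from the two numbers, not something the log shows, and the function names are made up for illustration.

```python
# Bit positions in CPUID leaf 1 EDX (per Intel's CPUID documentation).
LEAF1_EDX_BITS = {"fpu": 0, "mmx": 23, "sse": 25, "sse2": 26, "htt": 28}


def display_model(family, base_model, ext_model):
    """Combine base and extended model fields into the displayed model."""
    # The extended model field is only folded in for family 6 and 15.
    if family in (6, 15):
        return (ext_model << 4) + base_model
    return base_model


def decode_flags(reg, bits):
    """Map each flag name to True/False by testing its bit in reg."""
    return {name: bool((reg >> pos) & 1) for name, pos in bits.items()}


if __name__ == "__main__":
    # ext_model=1 is an assumption; family and base model are from the log.
    print(display_model(family=6, base_model=10, ext_model=1))
    # EDX value copied from the "CPUID 00000001[edx]" line.
    print(decode_flags(0xBFEBFBFF, LEAF1_EDX_BITS))
```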

========== Operating System Information ==========
Microsoft Windows Vista Home Premium Edition, 64-bit Service Pack 1, Build 6001
========== Motherboard Information ==========
Motherboard Vendor: Not reported
Motherboard Version: Not reported
Motherboard Model:
========== DirectX version Information ==========
DirectX 10.0 (6.0.6000.16386)
========== Driver Version Information ==========
Nforce Driver Package version 6.03
Audio Driver Not reported
GART Driver Not reported
Graphics Driver 181.22 (7.15.11.8122)
Ethernet Driver Not reported
IDE Driver Not reported
nTune 6.03.12
========== Processor, Bus, and Memory Speed Information ==========
CPU Speed = 5743.270
CPU Multiplier = 1.000
FSB Frequency = 0
Memory Frequency = 0.000
AGP Frequency = 0
========== GeForce Information ==========
GPU 1: GeForce GTX 260
GPU 2: GeForce GTX 260
========== GPU's Core and Memory Speed Information ==========

GPU Core Speed = 576
GPU Memory Speed = 999
GPU Shader Clock Speed = 1242

GPU Core Speed = 576
GPU Memory Speed = 999
GPU Shader Clock Speed = 1242
GPU SLI Mode Enabled.
========== System Voltage Information ==========

========== Processor and System's Temperature Information ==========
CPU Temperature = Not Reported
Second CPU Temperature = Not Reported
System Temperature = Not Reported
GPU 0 Temperature = 46 C
GPU 1 Temperature = 50 C
========== Memory SPD Information ==========
Memory Dimm 0 (0x50) Memory not present
Memory Dimm 1 (0x51) Memory not present
Memory Dimm 2 (0x52) Memory not present
Memory Dimm 3 (0x53) Memory not present
Memory Dimm 4 (0x54) Memory not present
Memory Dimm 5 (0x55) Memory not present
Memory Dimm 6 (0x56) Memory not present
Memory Dimm 7 (0x57) Memory not present
========== NVIDIA PCI Device Information ==========

NVIDIA Device At PCI Bus 0x2 Device 0x0 Function 0x0:
Vendor ID = 0x10de DeviceID = 0x05b1
PCI Config Space Register Value for This Device
PCI Address Reg Value
0x80020000 0x05b110de
0x80020004 0x00100507
0x80020008 0x060400a3
0x8002000c 0x00010010
0x80020010 0x00000000
0x80020014 0x00000000
0x80020018 0x00050302
0x8002001c 0x0000b1b1
0x80020020 0xf7f0f400
0x80020024 0xcff1c001
0x80020028 0x00000000
0x8002002c 0x00000000
0x80020030 0x00000000
0x80020034 0x00000040
0x80020038 0x00000000
0x8002003c 0x00030000
0x80020040 0xc8036001
0x80020044 0x00000000
0x80020048 0x00000000
0x8002004c 0x00000000
0x80020050 0x00000000
0x80020054 0x00000000
0x80020058 0x00000000
0x8002005c 0x00000000
0x80020060 0x0052a010
0x80020064 0x012c8020
0x80020068 0x00000000
0x8002006c 0x00013502
0x80020070 0x11020000
0x80020074 0x00000000
0x80020078 0x00000000
0x8002007c 0x00000000
0x80020080 0x00000000
0x80020084 0x00000000
0x80020088 0x00000000
0x8002008c 0x00000000
0x80020090 0x00000002
0x80020094 0x00000000
0x80020098 0x00000000
0x8002009c 0x00000000
0x800200a0 0x0000000d
0x800200a4 0xcb1910de
0x800200a8 0x00000000
0x800200ac 0x00000000
0x800200b0 0x00000000
0x800200b4 0x00000000
0x800200b8 0x00000000
0x800200bc 0x00000000
0x800200c0 0x00000000
0x800200c4 0x00000000
0x800200c8 0x00000000
0x800200cc 0x00000000
0x800200d0 0x00000000
0x800200d4 0x00000000
0x800200d8 0x00000000
0x800200dc 0x00000000
0x800200e0 0x00000000
0x800200e4 0x00000000
0x800200e8 0x00000000
0x800200ec 0x00000000
0x800200f0 0x00000000
0x800200f4 0x00000000
0x800200f8 0x00000000
0x800200fc 0x00000000
NVIDIA Device At PCI Bus 0x3 Device 0x0 Function 0x0:
Vendor ID = 0x10de DeviceID = 0x05b1
PCI Config Space Register Value for This Device
PCI Address Reg Value
0x80030000 0x05b110de
0x80030004 0x00100504
0x80030008 0x060400a3
0x8003000c 0x00010010
0x80030010 0x00000000
0x80030014 0x00000000
0x80030018 0x00040403
0x8003001c 0x000001f1
0x80030020 0x0000fff0
0x80030024 0x0001fff1
0x80030028 0x00000000
0x8003002c 0x00000000
0x80030030 0x00000000
0x80030034 0x00000040
0x80030038 0x00000000
0x8003003c 0x00030000
0x80030040 0xc8036001
0x80030044 0x00000000
0x80030048 0x00000000
0x8003004c 0x00000000
0x80030050 0x00000000
0x80030054 0x00000000
0x80030058 0x00000000
0x8003005c 0x00000000
0x80030060 0x01620010
0x80030064 0x00008020
0x80030068 0x00000000
0x8003006c 0x00313502
0x80030070 0x11010000
0x80030074 0x00080000
0x80030078 0x00000000
0x8003007c 0x00000000
0x80030080 0x00000000
0x80030084 0x00000000
0x80030088 0x00000000
0x8003008c 0x00000000
0x80030090 0x00000002
0x80030094 0x00000000
0x80030098 0x00000000
0x8003009c 0x00000000
0x800300a0 0x00000000
0x800300a4 0x00000000
0x800300a8 0x00000000
0x800300ac 0x00000000
0x800300b0 0x00000000
0x800300b4 0x00000000
0x800300b8 0x00000000
0x800300bc 0x00000000
0x800300c0 0x00000000
0x800300c4 0x00000000
0x800300c8 0x00000000
0x800300cc 0x00000000
0x800300d0 0x00000000
0x800300d4 0x00000000
0x800300d8 0x00000000
0x800300dc 0x00000000
0x800300e0 0x00000000
0x800300e4 0x00000000
0x800300e8 0x00000000
0x800300ec 0x00000000
0x800300f0 0x00000000
0x800300f4 0x00000000
0x800300f8 0x00000000
0x800300fc 0x00000000
NVIDIA Device At PCI Bus 0x3 Device 0x2 Function 0x0:
Vendor ID = 0x10de DeviceID = 0x05b1
PCI Config Space Register Value for This Device
PCI Address Reg Value
0x80031000 0x05b110de
0x80031004 0x00100507
0x80031008 0x060400a3
0x8003100c 0x00010010
0x80031010 0x00000000
0x80031014 0x00000000
0x80031018 0x00050503
0x8003101c 0x0000b1b1
0x80031020 0xf7f0f400
0x80031024 0xcff1c001
0x80031028 0x00000000
0x8003102c 0x00000000
0x80031030 0x00000000
0x80031034 0x00000040
0x80031038 0x00000000
0x8003103c 0x00030000
0x80031040 0xc8036001
0x80031044 0x00000000
0x80031048 0x00000000
0x8003104c 0x00000000
0x80031050 0x00000000
0x80031054 0x00000000
0x80031058 0x00000000
0x8003105c 0x00000000
0x80031060 0x01620010
0x80031064 0x00008020
0x80031068 0x00010000
0x8003106c 0x02313502
0x80031070 0x71020000
0x80031074 0x00180000
0x80031078 0x00400000
0x8003107c 0x00000000
0x80031080 0x00000000
0x80031084 0x00000000
0x80031088 0x00000000
0x8003108c 0x00000000
0x80031090 0x00010042
0x80031094 0x00000000
0x80031098 0x00000000
0x8003109c 0x00000000
0x800310a0 0x00000000
0x800310a4 0x00000000
0x800310a8 0x00000000
0x800310ac 0x00000000
0x800310b0 0x00000000
0x800310b4 0x00000000
0x800310b8 0x00000000
0x800310bc 0x00000000
0x800310c0 0x00000000
0x800310c4 0x00000000
0x800310c8 0x00000000
0x800310cc 0x00000000
0x800310d0 0x00000000
0x800310d4 0x00000000
0x800310d8 0x00000000
0x800310dc 0x00000000
0x800310e0 0x00000000
0x800310e4 0x00000000
0x800310e8 0x00000000
0x800310ec 0x00000000
0x800310f0 0x00000000
0x800310f4 0x00000000
0x800310f8 0x00000000
0x800310fc 0x00000000
NVIDIA Device At PCI Bus 0x5 Device 0x0 Function 0x0:
Vendor ID = 0x10de DeviceID = 0x05e2
PCI Config Space Register Value for This Device
PCI Address Reg Value
0x80050000 0x05e210de
0x80050004 0x00100006
0x80050008 0x030000a1
0x8005000c 0x00000010
0x80050010 0xf7000000
0x80050014 0xc000000c
0x80050018 0x00000000
0x8005001c 0xf4000004
0x80050020 0x00000000
0x80050024 0x0000bf81
0x80050028 0x00000000
0x8005002c 0x23901682
0x80050030 0x00000000
0x80050034 0x00000060
0x80050038 0x00000000
0x8005003c 0x00000123
0x80050040 0x23901682
0x80050044 0x00000000
0x80050048 0x00000000
0x8005004c 0x00000000
0x80050050 0x00000001
0x80050054 0x00000001
0x80050058 0x0023d6ce
0x8005005c 0x00000000
0x80050060 0x00036801
0x80050064 0x00000008
0x80050068 0x00807805
0x8005006c 0x00000000
0x80050070 0x00000000
0x80050074 0x00000000
0x80050078 0x00020010
0x8005007c 0x000084e0
0x80050080 0x00002910
0x80050084 0x02002d02
0x80050088 0x01020008
0x8005008c 0x00000000
0x80050090 0x00000000
0x80050094 0x00000000
0x80050098 0x00000000
0x8005009c 0x00000010
0x800500a0 0x00000000
0x800500a4 0x00000000
0x800500a8 0x00000000
0x800500ac 0x00000000
0x800500b0 0x00000000
0x800500b4 0x00000000
0x800500b8 0x00000000
0x800500bc 0x00000000
0x800500c0 0x00000000
0x800500c4 0x00000000
0x800500c8 0x00000000
0x800500cc 0x00000000
0x800500d0 0x00000000
0x800500d4 0x00000000
0x800500d8 0x00000000
0x800500dc 0x00000000
0x800500e0 0x00000000
0x800500e4 0x00000000
0x800500e8 0x00000000
0x800500ec 0x00000000
0x800500f0 0x00000000
0x800500f4 0x00000000
0x800500f8 0x00000000
0x800500fc 0x00000000
NVIDIA Device At PCI Bus 0x6 Device 0x0 Function 0x0:
Vendor ID = 0x10de DeviceID = 0x05e2
PCI Config Space Register Value for This Device
PCI Address Reg Value
0x80060000 0x05e210de
0x80060004 0x00100107
0x80060008 0x030000a1
0x8006000c 0x00000010
0x80060010 0xfa000000
0x80060014 0xd000000c
0x80060018 0x00000000
0x8006001c 0xf8000004
0x80060020 0x00000000
0x80060024 0x0000cc01
0x80060028 0x00000000
0x8006002c 0x23901682
0x80060030 0x00000000
0x80060034 0x00000060
0x80060038 0x00000000
0x8006003c 0x0000011e
0x80060040 0x23901682
0x80060044 0x00000000
0x80060048 0x00000000
0x8006004c 0x00000000
0x80060050 0x00000001
0x80060054 0x00000001
0x80060058 0x0023d6ce
0x8006005c 0x00000000
0x80060060 0x00036801
0x80060064 0x00000008
0x80060068 0x00807805
0x8006006c 0x00000000
0x80060070 0x00000000
0x80060074 0x00000000
0x80060078 0x00020010
0x8006007c 0x012c84e0
0x80060080 0x00002910
0x80060084 0x00002d02
0x80060088 0x01020008
0x8006008c 0x00000000
0x80060090 0x00000000
0x80060094 0x00000000
0x80060098 0x00000000
0x8006009c 0x00000010
0x800600a0 0x00000000
0x800600a4 0x00000000
0x800600a8 0x00000000
0x800600ac 0x00000000
0x800600b0 0x00000000
0x800600b4 0x00000000
0x800600b8 0x00000000
0x800600bc 0x00000000
0x800600c0 0x00000000
0x800600c4 0x00000000
0x800600c8 0x00000000
0x800600cc 0x00000000
0x800600d0 0x00000000
0x800600d4 0x00000000
0x800600d8 0x00000000
0x800600dc 0x00000000
0x800600e0 0x00000000
0x800600e4 0x00000000
0x800600e8 0x00000000
0x800600ec 0x00000000
0x800600f0 0x00000000
0x800600f4 0x00000000
0x800600f8 0x00000000
0x800600fc 0x00000000
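
For anyone trying to read the register dump above: the "PCI Address" values look like the legacy PCI configuration address format (enable bit 31, bus in bits 23-16, device in bits 15-11, function in bits 10-8, register offset in the low byte), and the first register of each device packs the device ID in the upper 16 bits and the vendor ID in the lower 16. That layout is an inference from the values matching the "Bus/Device/Function" headers, not something the tool states. A small Python sketch, with hypothetical function names:

```python
def decode_config_address(addr):
    """Split a legacy PCI CONFIG_ADDRESS value into its fields."""
    return {
        "enable": bool((addr >> 31) & 1),
        "bus": (addr >> 16) & 0xFF,
        "device": (addr >> 11) & 0x1F,
        "function": (addr >> 8) & 0x07,
        "register": addr & 0xFC,
    }


def decode_id_register(value):
    """Split config register 0x00 into vendor ID and device ID."""
    return {"vendor_id": value & 0xFFFF, "device_id": (value >> 16) & 0xFFFF}


if __name__ == "__main__":
    # Address and value copied from the dump for Bus 0x3 Device 0x2.
    print(decode_config_address(0x80031000))
    print({k: hex(v) for k, v in decode_id_register(0x05B110DE).items()})
```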
========== Other PCI Device Information ==========

Device At PCI Bus 0x0 Device 0x0 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3405
Device At PCI Bus 0x0 Device 0x1 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3408
Device At PCI Bus 0x0 Device 0x3 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x340a
Device At PCI Bus 0x0 Device 0x7 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x340e
Device At PCI Bus 0x0 Device 0x10 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3425
Device At PCI Bus 0x0 Device 0x10 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x3426
Device At PCI Bus 0x0 Device 0x13 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x342d
Device At PCI Bus 0x0 Device 0x14 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x342e
Device At PCI Bus 0x0 Device 0x14 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x3422
Device At PCI Bus 0x0 Device 0x14 Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x3423
Device At PCI Bus 0x0 Device 0x14 Function 0x3:
Vendor ID = 0x8086 DeviceID = 0x3438
Device At PCI Bus 0x0 Device 0x1a Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3a37
Device At PCI Bus 0x0 Device 0x1a Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x3a38
Device At PCI Bus 0x0 Device 0x1a Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x3a39
Device At PCI Bus 0x0 Device 0x1a Function 0x7:
Vendor ID = 0x8086 DeviceID = 0x3a3c
Device At PCI Bus 0x0 Device 0x1b Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3a3e
Device At PCI Bus 0x0 Device 0x1c Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3a40
Device At PCI Bus 0x0 Device 0x1c Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x3a42
Device At PCI Bus 0x0 Device 0x1c Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x3a44
Device At PCI Bus 0x0 Device 0x1d Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3a34
Device At PCI Bus 0x0 Device 0x1d Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x3a35
Device At PCI Bus 0x0 Device 0x1d Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x3a36
Device At PCI Bus 0x0 Device 0x1d Function 0x7:
Vendor ID = 0x8086 DeviceID = 0x3a3a
Device At PCI Bus 0x0 Device 0x1e Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x244e
Device At PCI Bus 0x0 Device 0x1f Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x3a16
Device At PCI Bus 0x0 Device 0x1f Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x3a22
Device At PCI Bus 0x0 Device 0x1f Function 0x3:
Vendor ID = 0x8086 DeviceID = 0x3a30
Device At PCI Bus 0x7 Device 0x0 Function 0x0:
Vendor ID = 0x10ec DeviceID = 0x8168
Device At PCI Bus 0x8 Device 0x0 Function 0x0:
Vendor ID = 0x10ec DeviceID = 0x8168
Device At PCI Bus 0xff Device 0x0 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c41
Device At PCI Bus 0xff Device 0x0 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c01
Device At PCI Bus 0xff Device 0x2 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c10
Device At PCI Bus 0xff Device 0x2 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c11
Device At PCI Bus 0xff Device 0x3 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c18
Device At PCI Bus 0xff Device 0x3 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c19
Device At PCI Bus 0xff Device 0x3 Function 0x4:
Vendor ID = 0x8086 DeviceID = 0x2c1c
Device At PCI Bus 0xff Device 0x4 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c20
Device At PCI Bus 0xff Device 0x4 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c21
Device At PCI Bus 0xff Device 0x4 Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x2c22
Device At PCI Bus 0xff Device 0x4 Function 0x3:
Vendor ID = 0x8086 DeviceID = 0x2c23
Device At PCI Bus 0xff Device 0x5 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c28
Device At PCI Bus 0xff Device 0x5 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c29
Device At PCI Bus 0xff Device 0x5 Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x2c2a
Device At PCI Bus 0xff Device 0x5 Function 0x3:
Vendor ID = 0x8086 DeviceID = 0x2c2b
Device At PCI Bus 0xff Device 0x6 Function 0x0:
Vendor ID = 0x8086 DeviceID = 0x2c30
Device At PCI Bus 0xff Device 0x6 Function 0x1:
Vendor ID = 0x8086 DeviceID = 0x2c31
Device At PCI Bus 0xff Device 0x6 Function 0x2:
Vendor ID = 0x8086 DeviceID = 0x2c32
Device At PCI Bus 0xff Device 0x6 Function 0x3:
Vendor ID = 0x8086 DeviceID = 0x2c33
ID: 22798 · Report as offensive
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 22799 - Posted: 31 Jan 2009, 20:09:54 UTC

I use a Gigabyte mobo, and the fans have 3 leads, which include a sensor lead. The cpu fan has 4 leads, as does the socket for the cpu fan, and the northbridge fan and socket has 3 leads, as does the sockets for the system fans.
So, when the chips get hotter, the sensor lead is used to speed up the fan(s).

ID: 22799 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 22800 - Posted: 31 Jan 2009, 20:30:30 UTC

Yes, but I'm talking about the fans on the video boards themselves, which are self contained inside the board case and controlled by the video board. If you look at a photo of a GTX 260 or later board, they have their own fans onboard. Those are the ones that should speed up as the GPU gets hotter. It's the GPU, not CPU, that I'm talking about.
ID: 22800 · Report as offensive
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 22802 - Posted: 31 Jan 2009, 20:50:53 UTC

Same thing, basically. My display cards have fans too, but as they're internal to the display board, it's up to the card designer/maker to ensure that a similar temp sensing occurs.
ID: 22802 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 22806 - Posted: 1 Feb 2009, 5:42:29 UTC

Well... I'm stubborn and wouldn't give up. I think I found a workaround. It's not perfect, but it works. As I said before, the fans wouldn't ramp up with the temp, so the cards would overheat and become unstable then I'd either get a message saying the driver had an error and has recovered, or worse, a BSOD saying the system has halted to prevent damage, then a reboot. I've never had any problems with these cards until they get over about 65 degrees Celsius, so that's the challenge, to keep them cool when running full tilt.

I kept trying in the nVidia Control Panel to get the fans to speed up as the temperatures climbed. I tried using both device rules in the device settings menu, and creating stored profiles and setting the profile policies to change to profiles with higher speed settings above certain temperatures. Neither worked. I have come to believe there's a bug in either the video bios or the drivers that doesn't let it respond to temperatures correctly.

So, I tried attacking the problem from a different angle, and it worked. Like I said, not perfect, but good enough for my situation for now. What I did was configure BOINC to only run when in screensaver mode, then configured the video cards to crank up the cooling fans whenever the screensaver comes on. I figure if the screensaver is not on, I'm probably playing games, listening to music, or something else where I don't want CUDA loading the machine and the fans loud anyway, so this works for me.

Here's what I did for others that may be having the problem:

1. Go to the BOINC computing preferences.
a. set 'Suspend while computer in use' to Yes
b. set 'In use means activity in last' to exactly same number of minutes as you have your screensaver set to turn on.
c. On the Project tab in BOINC manager, click update to get BOINC to load the new preferences.

Be careful, if you set the preferences in BOINC individually for each computer, make sure you do this on at least all the ones that are CUDA capable, or else clear the preferences there and set them on the web page.
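For anyone who would rather set these two preferences in a local file than through the web page or the manager, here is a minimal sketch in Python that builds the text of a `global_prefs_override.xml` file. It assumes the BOINC client reads `run_if_user_active` and `idle_time_to_run` from that file in its data directory; the function name `build_override` and the 10-minute value are only illustrations, so use whatever delay your screensaver is set to.

```python
import xml.etree.ElementTree as ET


def build_override(screensaver_minutes: float) -> str:
    """Return XML text for a local BOINC preferences override (hypothetical helper)."""
    root = ET.Element("global_preferences")
    # 0 means: do not compute while the computer is in use.
    ET.SubElement(root, "run_if_user_active").text = "0"
    # "In use" means activity in the last N minutes; match the screensaver delay.
    ET.SubElement(root, "idle_time_to_run").text = f"{screensaver_minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


# Example: screensaver set to start after 10 minutes (an assumed value).
xml_text = build_override(10)
print(xml_text)
```

The printed text would then be saved as `global_prefs_override.xml` and the client told to re-read its preferences. Check the result against your own BOINC version before relying on it.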

Then, here's how to set up the video card action:

1. Of course, make sure you have the latest drivers installed.

2. Open the nVidia Control Panel.

3. Click 'Device Settings' on the right side. You should now have the 'Create Profile' tab showing.

4. In the 'Cooling' section, move the slider for GeForce GPU to 90%. (I experimented, and I really have to crank the fan that high to keep the GPU below 60c when running CUDA. Don't worry, when we're done, it will only run at that speed when doing CUDA.)

5. Repeat this for each of your GPUs if you have more than one card.

6. Click save. It will prompt for a name. Make sure to give it a new name such as 'CUDA.nsu' or whatever. Just don't overwrite the existing profiles.

7. Now click 'Profile Policies' on the left side.

8. Create two new rules. First, create one that says 'Load profile CUDA.nsu' (or whatever you named it) when the screensaver is launched. Second, create another rule that says 'Load profile osbootpf.nsu' when the screensaver is stopped. Make sure both are checked. You should now have 3 rules, since there would have been the default rule to load osbootpf.nsu when Windows starts.

That's it. Now, whenever your machine's screensaver comes on and presumably CUDA starts to crank, your fans will crank up as well to keep the cards cool. Then, when you wake the machine up to play a game or something, CUDA will stop and the fans will slow back down to normal again. I still think the fan speed should be automatic based on temps, but at least this does the job.
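To show what "automatic based on temps" could look like, here is a rough sketch of a fan curve in Python. It only computes a target fan percentage from a temperature; it does not talk to the card, since no driver API for that is described here. The 40% and 90% fan settings and the 60c target come from the steps above, while the 45c starting point and the function name `fan_percent` are assumptions for illustration.

```python
def fan_percent(temp_c: float,
                idle_temp: float = 45.0, idle_fan: int = 40,
                hot_temp: float = 60.0, hot_fan: int = 90) -> int:
    """Map a GPU temperature to a fan speed by linear interpolation.

    Below idle_temp the fan stays at idle_fan; at or above hot_temp it
    runs at hot_fan; in between it ramps up in a straight line.
    """
    if temp_c <= idle_temp:
        return idle_fan
    if temp_c >= hot_temp:
        return hot_fan
    fraction = (temp_c - idle_temp) / (hot_temp - idle_temp)
    return round(idle_fan + fraction * (hot_fan - idle_fan))


for temp in (30, 50, 60, 80):
    print(temp, fan_percent(temp))
```

A tool that polls the temperature every few seconds and applies the result would do what the screensaver rules only approximate.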

Two other things you may want to consider. On the preferences page for BOINC, I would change the 'Leave applications in memory while suspended' to NO since CUDA apps can really load up your video ram and impact performance on things like games. Also, I personally tweaked my osbootpf.nsu profile from the default 40% to 45% for each card. If you do other things that heat up the cards, you may need to set that profile even higher, but very few programs or games run the video cards as hot as all out CUDA processing does, so you probably don't need the fans cranked all the time. That's probably why nVidia sets a default of 40% thinking it's the best balance between what most people need for cooling and noise. Anyway, going to 45% I can hardly hear the difference, but it keeps the cards a few degrees cooler even at idle.

Hope this helps people get rid of driver errors and BSODs when running CUDA. Since I made these changes, I've not had a single error at all. I think heat was the cause of all my stability problems with CUDA. Now, if nVidia would fix the temp responses and BOINC would add preferences for CUDA GPU loads and such, things would be perfect.
ID: 22806 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 22809 - Posted: 1 Feb 2009, 6:32:31 UTC

Update: now I've run into another bug that others have reported. When BOINC says it's suspended, at least 1 or more of the CUDA tasks keep running. I've even tried manually selecting the task and clicking suspend, and it does show a message saying suspended next to it, yet it keeps ticking up progress and time, and since my fans have slowed down to non-screensaver speed, my card starts heating up again. Plus, even if I left my fans running at full, I don't want CUDA running while I'm trying to play a game.

I give up. I'll have to pass on CUDA projects for now until either BOINC fixes the suspend issue with CUDA processes and adds a way to throttle how hard it works the GPUs, or nVidia improves the cooling issues in their vbios or drivers. I think the issue where you have to turn SLI off to use all your video cards for CUDA needs to be fixed as well.

Oh well... back to 'regular' BOINC.
ID: 22809 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 22821 - Posted: 1 Feb 2009, 15:46:54 UTC - in response to Message 22809.  

...until either BOINC fixes the suspend issue with CUDA processes

BOINC 6.6.3 has the modification to its code that should work on the GPU as well, where when suspended, it unloads the task from the video card's memory. That was the problem with the task continuing when BOINC was suspended.
ID: 22821 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 22841 - Posted: 2 Feb 2009, 17:41:38 UTC

pharrg, did you ever see what the maximum temperature of the GPUs was? If not, can you check that and post the outcome here, please? Perhaps check it in SLI mode and without SLI.
ID: 22841 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 22843 - Posted: 2 Feb 2009, 19:34:50 UTC
Last modified: 2 Feb 2009, 19:37:19 UTC

I think when my card would crash, the temps were above 80c and climbing rapidly. Also, I take my cards out of SLI mode when running BOINC because currently BOINC and CUDA will only use a single card and ignore the others if you have SLI enabled, but will use all of them if you disable SLI. I know, seems backwards to me, but the way the software is now, that's how they work. It's a pain because I have to remember to re-enable SLI before starting my games.

I'm running:

2 GTX 260 cards
Driver: 181.22
Latest DirectX 10 updates
Vista 64bit SP1 with all current updates
ASUS P6T6 WS Revolution system board
6 GB DDR3 Triple Channel RAM
BOINC 6.6.3

As for the cooling issue, I think it's systemic to all the new GTX 200 series cards. You can google and find gobs of posts in forums, including gamers that have nothing to do with BOINC or CUDA, that fight issues with BSOD's or messages from the OS saying a driver error has occurred and recovered. I've even seen a couple websites where the dreaded 'nvlddmkm.sys' error, for which I see a thousand supposed solutions on the web, was reproduced repeatedly by simply overheating the video card, even non-GTX cards.

The GTX 200 series of boards have by far the fastest, hottest GPU's nVidia has released. The boards are already two slots wide, and I think nVidia has been desperate to avoid going to a triple wide board. Unfortunately, that severely limits the available space for fans and heatsinks. These chips under load run hotter than the CPU, yet are forced to make do with tiny fans and heatsinks most modders would laugh at for a CPU. Even the stock Intel fan and heatsink for Core i7 CPU's are 3 times the size of what the GTX 200 series GPUs have to work with. In fact, I've already seen aftermarket coolers being released for those who have room in their case (and between the boards) to replace the stock fan and heatsink with a much larger one. Also, there are versions of the GTX line being released with water cooling attachments already on them.

I think we are seeing some of the same issues as extreme gamers see because CUDA is capable of putting a maximum load, indeed much more intense than the typical pc user, on these cards. nVidia is simply reaching the point where they will need to improve the cooling solutions on these boards. This likely means they will need to work with motherboard manufacturers for things like PCI slot spacing, to ensure they have room for multiple GPUs for those that want to do SLI. Even now many motherboards don't have room to do 3 or 4 boards in SLI, and even if they have room, most won't do full x16 speeds on all those slots. You have to choose the system board carefully.

People balked when they did longer cards, then again when they first started making cards 2 slots thick. Now I think they'll need to do 3 slots so they can put much beefier heatsinks and fans on them. The engineer in me thinks even the current fans and heatsinks could be vastly improved. Again, look how large many of the heatsink/fan combos for a single CPU are. It's not surprising that a GTX 295 with two GPUs on a single board is among the ones people have the most trouble keeping cool with the dinky fan and heatsink they put on them. That's part of why I went with two 260's instead of a single 295, but still I have to crank the fan to keep it cool under load. That's life when pushing the envelope I guess.

I think since nVidia is really starting to push CUDA, and as more games that are even more graphics intense than say Crysis come along, they'll have no choice but to rethink the cooling of these monsters.

I'm thinking at this point, I'd like to see someone come out with a small compressor type unit that would simply feed chilled air into a case. If all the fans and heatsinks in my case were sucking in air that was cooler to begin with, it would help everything. Right now, all the fans and heatsinks in the world are still working with room temperature air. I'd prefer that over liquid cooling. Just pump cold air into the case. It wouldn't need to be a very large compressor at all to cool a single pc case. Entrepreneurs? As many modders buy all sorts of extreme cooling things, if the price was right, I think this would sell. I would think the server market would be big as well. I'd buy one.
ID: 22843 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 22845 - Posted: 2 Feb 2009, 20:40:05 UTC

Eh... just had a thought... if you chill all the air in the chassis very much below room temp, you'll get water condensation... oh well... I still think they could improve cooling by simple things like using different metals that conduct heat better, improved fan and heatsink fin designs, and designing cooler running chips to begin with.
ID: 22845 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 22860 - Posted: 3 Feb 2009, 13:16:15 UTC

pharrg, a question from the nVidia CUDA developer, could you please get the latest GPU-Z, set it up so it logs the sensors to a log file and run Seti with that, then send me the log? I'll forward it to the developer.

I'll give you my email address through PM.
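Once the sensor log exists, a quick way to pull out the maximum GPU temperature is a short Python sketch like the one below. It assumes the log is comma-separated text with a header row and a column whose name contains "GPU Temperature"; the exact column names can differ between GPU-Z versions, so adjust `column_hint` if needed. The sample rows are made up purely to show the shape of the input, they are not real measurements.

```python
import csv
import io


def max_temperature(log_text: str, column_hint: str = "GPU Temperature") -> float:
    """Return the highest value in the first column whose header contains column_hint."""
    reader = csv.reader(io.StringIO(log_text))
    header = [name.strip() for name in next(reader)]
    index = next((i for i, name in enumerate(header) if column_hint in name), None)
    if index is None:
        raise ValueError(f"no column containing {column_hint!r}")
    temps = []
    for row in reader:
        if len(row) <= index:
            continue  # blank or truncated line
        try:
            temps.append(float(row[index].strip()))
        except ValueError:
            continue  # skip cells that are not numbers
    if not temps:
        raise ValueError("no temperature samples found")
    return max(temps)


# Made-up sample in the assumed layout (not real data).
sample = """Date , GPU Temperature [C] , Fan Speed [%] ,
2009-02-03 13:20:01 , 58.0 , 40 ,
2009-02-03 13:20:02 , 76.0 , 40 ,
2009-02-03 13:20:03 , 71.0 , 40 ,
"""
print(max_temperature(sample))
```

For a real log, read the file into a string first and pass that in instead of `sample`.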
ID: 22860 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 23102 - Posted: 14 Feb 2009, 7:46:07 UTC

Ageless... as requested, I emailed the logs and some other info to you. Let me know if you don't receive them.

Dagorath... the GTX 200 series of video cards are two slots thick, and have a case built around them. The fan and heatsinks are inside of that. They pull fresh air into them, over the heatsinks, then immediately vent the hot air out through vents built into the outside end of the cards to outside the case. The heated air never enters the case itself. Also, my cpu has a Cooler Master V8 cooling system that keeps my cpu cooler at load than the stock heatsink and fan did at idle, and my case is set up to pull that heated air out of the case immediately as well. I have a couple temp sensors that monitor ambient air temp inside my case and it rarely gets over about 48c even with everything working full tilt and is usually lower.

As for the video cards, if I manually crank up the fan speeds to something like 95%, they stay reasonably cool even while running CUDA. I just don't want to have to open the nVidia control panel, change the fan speed for each individual card, then apply the changes every time I start a CUDA app, then repeat the process again every time I stop CUDA to do something else. I still think the vbios or the driver should automatically change the fan speeds in response to temps, just like motherboards do with the cpu fans. The problem is really with nVidia's drivers or vbios, not BOINC, although BOINC could do some things to deal with this issue even though it's not their fault.

Oh well... even so, I'm impressed that each of my GTX260's can process a CUDA SETI task in 19 minutes when my Core i7 CPU takes 8 hours to do one. I'm kinda wishing I'd gone with the GTX295's, and had loaded my machine with 3 of them. That'd be a lot of science we could do!
ID: 23102 · Report as offensive
pharrg

Joined: 8 Jan 09
Posts: 24
United States
Message 23131 - Posted: 15 Feb 2009, 19:33:51 UTC

Hi... just noticed one catch to the cooling process I gave above. If you run a program like media player that is set to disable the screensaver, that will prevent the fans from kicking in since the screensaver will then not come on. However, since BOINC's setting is simply based on time since last keyboard or mouse activity, it will start running anyway. So, when I started to play some music and do other non-computer things, or watch a movie, media player prevented the screensaver, and thus the fan speedup, yet BOINC resumed after a few minutes of idle mouse and it wasn't long before the cards overheated and crashed. So, either turn off the option on apps that prevent the screensaver, or snooze BOINC. Also, remember that the snooze is time limited and it will resume again if you don't watch out. That's another thing I wish they'd change. If I snooze BOINC, I want it to remain suspended until I tell it otherwise. Or, you could just exit BOINC and the science apps altogether until you're done with your music, movie, or whatever, then relaunch it afterwards.
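The catch comes down to two independent triggers: BOINC resumes on idle time alone, while the fan profile loads only if the screensaver actually starts. A small Python sketch of that logic, with hypothetical function and parameter names, makes the unsafe cases easy to see:

```python
from typing import Optional


def fan_boost_start(screensaver_minutes: float,
                    screensaver_blocked: bool) -> Optional[float]:
    """Idle minutes after which the high-fan profile loads, or None if it never does."""
    return None if screensaver_blocked else screensaver_minutes


def computes_without_boost(boinc_idle_minutes: float,
                           screensaver_minutes: float,
                           screensaver_blocked: bool) -> bool:
    """True if BOINC can start crunching while the fans are still at the default speed."""
    boost = fan_boost_start(screensaver_minutes, screensaver_blocked)
    return boost is None or boinc_idle_minutes < boost


print(computes_without_boost(10, 10, False))  # settings match, screensaver allowed
print(computes_without_boost(10, 10, True))   # media player blocks the screensaver
print(computes_without_boost(5, 10, False))   # BOINC idle time shorter than screensaver delay
```

Only the first case is safe, which is why the two timers have to match and nothing may block the screensaver.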

Perhaps the best solution right now is to do as Tank_Master suggested to me and use Riva Tuner to manage fan speeds based on temps. But, then you have to create profiles, launchers, etc.

I just don't understand why nVidia took automatic fan speeds out of their drivers? I understand that there are a few overclockers and such out there that want to control them differently. Fine, add a checkbox in the control panel to disable the built in automatic fan controls in the driver. But, then, for the other 99% of the pc users on the planet that simply want to put a card in and have it work automatically on its own, it'd be fine. Most pc users don't monitor temps, don't tweak things, don't download 3rd party hardware tweakers, and all this other hassle. All they know is they try to run a game like Crysis on a card that should be able to run it full settings, and suddenly they get 'nvlddmkm.sys' errors or the errors saying the driver encountered an error and recovered, or simply get BSOD's and they don't realize that what's causing it is their cards are overheating and becoming unstable. nVidia needs to fix this because people are increasingly giving them the reputation of cards that are error prone, especially on the GTX 200 series that have a hard time staying cool anyway. It's a simple fix. Why screw 99% of users to make a few overclockers happy when they could use the checkbox as I've said and make everyone happy? All of these ideas we're working on are simply trying to deal with an intentional 'feature' of nVidia's crappy drivers. I wish someone with more clout than me would bring this up to nVidia. Perhaps since nVidia is so big on pushing CUDA, the CUDA developers could point these issues out.

I'd be willing to bet 90% of the errors and problems people see when running CUDA (and extreme graphics games) are simple cooling issues affecting the stability of the higher end cards. If nVidia would fix the drivers, many headaches for users and developers alike would probably disappear immediately.

Frustrating...
ID: 23131 · Report as offensive
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 23134 - Posted: 15 Feb 2009, 20:21:31 UTC - in response to Message 23131.  

If I snooze BOINC, I want it to remain suspended until I tell it otherwise.

This is called Suspend BOINC, and is in the menu of the BOINC manager under Activity.

ID: 23134 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 23136 - Posted: 15 Feb 2009, 22:54:10 UTC - in response to Message 23131.  

I just don't understand why nVidia took automatic fan speeds out of their drivers?

They didn't.

I understand that there are a few overclockers and such out there that want to control them differently. Fine, add a checkbox in the control panel to disable the built in automatic fan controls in the driver.

Overclockers use other means to OC their card(s). There are plenty of (3rd party) applications out there that will do it for them, which will override the stock drivers. Then there are those who forgo the stock drivers and build their own, to match their own perspective on things.

But, then, for the other 99% of the pc users on the planet that simply want to put a card in and have it work automatically on its own, it'd be fine. Most pc users don't monitor temps, don't tweak things, don't download 3rd party hardware tweakers, and all this other hassle.

You do know what 3rd party means, right? It means applications not made by Nvidia, nor given the thumbs up by the manufacturer.

All they know is they try to run a game like Crysis on a card that should be able to run it full settings, and suddenly they get 'nvlddmkm.sys' errors or the errors saying the driver encountered an error and recovered, or simply get BSOD's and they don't realize that what's causing it is their cards are overheating and becoming unstable.

The GTX 200 series is also notorious for overheating. The rear plate of the cooler acts as a heatsink for half of the memory, and needs a good amount of airflow.

nVidia needs to fix this because people are increasingly giving them the reputation of cards that are error prone, especially on the GTX 200 series that have a hard time staying cool anyway. It's a simple fix. Why screw 99% of users to make a few overclockers happy when they could use the checkbox as I've said and make everyone happy?

May I remind you that you are on the BOINC development forums, not on the Nvidia forums? So in essence, you're talking to the wrong people here.

All of these ideas we're working on are simply trying to deal with an intentional 'feature' of nVidia's crappy drivers. I wish someone with more clout than me would bring this up to nVidia. Perhaps since nVidia is so big on pushing CUDA, the CUDA developers could point these issues out.

I am in contact with the Seti and Nvidia developers over Seti CUDA. Since these people do not write the gaming drivers, nor the Control Panel, you will really have to bring it up to the Nvidia people yourself. Again, I am pointing you to the Nvidia forums.

I'd be willing to bet 90% of the errors and problems people see when running CUDA (and extreme graphics games) are simple cooling issues affecting the stability of the higher end cards. If nVidia would fix the drivers, many headaches for users and developers alike would probably disappear immediately.

It's not that easy. Sure, saturating the GPU and RAM with a Seti task will heat up a person's video card (and computer). But none of us can look inside the computer of any person out there. We don't know how they cool their machine, if it's very dusty, if it will fall over when the sun is just shining on it..

Then there's the problem with the actual calculations. They also count for a large part of the past crashes, as neither BOINC, nor the science application was very stable. BOINC is still not stable, the science application is getting there, but still needs a lot of work. Just porting over a stable science application to a whole new platform will not work out of the box, it will need large scale testing.

The other day I pointed out to the BOINC developers that BOINC will detect a CUDA capable GPU and allow it to be used for Seti work even when its driver is older than the absolute minimum (177.35). The people who got work with those older drivers were trashing all of it, all without a warning. So now the developers are busy building in detection of the driver version, with a warning and a no work routine if your driver doesn't meet the minimum that is needed.
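The kind of check being described can be sketched in a few lines of Python. This is not BOINC's actual code, only an illustration under the assumption that driver versions are written as "major.minor" like 181.22; the names `MIN_DRIVER`, `parse_driver_version` and `driver_ok` are made up for the example. Comparing as integer pairs avoids the rounding surprises of comparing the versions as decimal numbers.

```python
from typing import Tuple

# Minimum driver mentioned above for Seti CUDA work.
MIN_DRIVER: Tuple[int, int] = (177, 35)


def parse_driver_version(text: str) -> Tuple[int, int]:
    """Turn a string such as '181.22' into the pair (181, 22)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def driver_ok(text: str) -> bool:
    """True if the driver version is at least the required minimum."""
    return parse_driver_version(text) >= MIN_DRIVER


for version in ("181.22", "177.35", "175.19"):
    status = "OK" if driver_ok(version) else "too old, no CUDA work"
    print(version, status)
```

With a check like this in place, a host with an old driver would get a warning instead of silently trashing work.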
ID: 23136 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 23215 - Posted: 20 Feb 2009, 17:27:59 UTC

OK, I got a reply back from my source at Nvidia. He forwarded the logs you sent me to the driver guys, who have looked through them, but they see nothing wrong.

76 degrees Celsius is not considered overheating, so if you get that temperature within a few minutes of operation, it's normal. Depending on the VBIOS that's on your card, the fans won't ramp up until a temperature threshold of at least 75C is reached; only then does the fan speed change.

So the crashing comes from something else. It can still be heat-related, but then the trouble is with getting enough airflow around the back of your cards. It might be that the heated air just stays in your computer case, eventually overheating RAM and CPU, making the system crash.
ID: 23215 · Report as offensive


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.