BSOD

ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31318 - Posted: 3 Mar 2010, 2:46:04 UTC

Hi,

I have been using BOINC for well over 5 years now. I have had nothing but BSODs with every version I have tried on Win 7 Ultimate x64, from 6.10.18 through 6.10.34.

Machine does not BSOD with any other program loaded or running. Plenty of power and cooling. Have tested memory and CPU and no issues there.

Have tried with cc_config set to No CPU usage and without the cc_config.

BOINC begins to process, and after about 5 minutes the machine BSODs and reboots.

Did not happen with Win 7 RTM x64 on same hardware and does not happen on Vista x32 machine in same network.

Any thoughts?
ID: 31318
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31321 - Posted: 3 Mar 2010, 6:49:03 UTC

Here is a copy of the debug dump. As stated before, I have had the hardware components tested and have the latest drivers installed for all components.


Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump\030210-21964-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is:
Windows 7 Kernel Version 7600 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7600.16385.amd64fre.win7_rtm.090713-1255
Machine Name:
Kernel base = 0xfffff800`02c4d000 PsLoadedModuleList = 0xfffff800`02e8ae50
Debug session time: Tue Mar 2 15:46:21.756 2010 (UTC - 8:00)
System Uptime: 0 days 0:19:59.755
Loading Kernel Symbols
...............................................................
................................................................
......................................
Loading User Symbols
Loading unloaded module list
......
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 124, {0, fffffa800dcc3028, fe000000, 40015a}

Probably caused by : hardware

Followup: MachineOwner
---------

5: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800dcc3028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000fe000000, High order 32-bits of the MCi_STATUS value.
Arg4: 000000000040015a, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


BUGCHECK_STR: 0x124_GenuineIntel

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

PROCESS_NAME: metropolis_3.1

CURRENT_IRQL: f

STACK_TEXT:
fffff880`03290b58 fffff800`02c16903 : 00000000`00000124 00000000`00000000 fffffa80`0dcc3028 00000000`fe000000 : nt!KeBugCheckEx
fffff880`03290b60 fffff800`02dd3513 : 00000000`00000001 fffffa80`0a2e9650 00000000`00000000 fffffa80`0a2e96a0 : hal!HalBugCheckSystem+0x1e3
fffff880`03290ba0 fffff800`02c165c8 : 00000000`00000728 fffffa80`0a2e9650 fffff880`03290f30 fffff880`03290f00 : nt!WheaReportHwError+0x263
fffff880`03290c00 fffff800`02c15f1a : fffffa80`0a2e9650 fffff880`03290f30 fffffa80`0a2e9650 00000000`00000000 : hal!HalpMcaReportError+0x4c
fffff880`03290d50 fffff800`02c15dd5 : 00000000`00000008 00000000`00000001 fffff880`03290fb0 00000000`00000000 : hal!HalpMceHandler+0x9e
fffff880`03290d90 fffff800`02c09e88 : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpMceHandlerWithRendezvous+0x55
fffff880`03290dc0 fffff800`02cbd7ac : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalHandleMcheck+0x40
fffff880`03290df0 fffff800`02cbd613 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxMcheckAbort+0x6c
fffff880`03290f30 00000000`0040b32d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiMcheckAbort+0x153
00000000`028dfebc 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x40b32d


STACK_COMMAND: kb

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: hardware

IMAGE_NAME: hardware

DEBUG_FLR_IMAGE_TIMESTAMP: 0

FAILURE_BUCKET_ID: X64_0x124_GenuineIntel_PROCESSOR_CACHE

BUCKET_ID: X64_0x124_GenuineIntel_PROCESSOR_CACHE

Followup: MachineOwner
---------

ID: 31321
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31322 - Posted: 3 Mar 2010, 8:40:49 UTC - in response to Message 31321.  

Could you please post what the BSOD says? Is it WHEA_UNCORRECTABLE_ERROR?
Could you also please post your system specifications?
Could you also please state which projects you are attached to and with which you see the error happen mostly? (I don't know which project runs Metropolis).

So far, though, it points in the direction of a hardware error: the CPU/FPU, or an incorrect driver for some hardware. That is specifically because of what the debug output says:

Probably caused by : hardware

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

FAILURE_BUCKET_ID: X64_0x124_GenuineIntel_PROCESSOR_CACHE

Are you running with an Nvidia CUDA GPU? If so, with what driver set? Have you tried a previous driver set? Are you running with Windows 7 drivers only?
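For reference, the two MCi_STATUS halves in the dump (Arg3 and Arg4) can be unpacked by hand. Below is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the compound "cache hierarchy" error-code pattern (000F 0001 RRRR TTLL) as described in Intel's Software Developer's Manual. The function and table names are made up for this example, and bits 31:16 are model-specific, so they are only reported raw.

```python
# Sketch: decode the MCi_STATUS value reported by bugcheck 0x124.
# Assumption: architectural IA32_MCi_STATUS layout; names here are illustrative.

STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds an address
    57: "PCC",    # processor context may be corrupted
}

# Field encodings for the cache hierarchy error pattern 000F 0001 RRRR TTLL.
REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPES = {0: "instruction", 1: "data", 2: "generic"}
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mci_status(high32, low32):
    """Combine the two 32-bit halves and pick out the architectural fields."""
    status = (high32 << 32) | low32
    mca_code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in STATUS_FLAGS.items()
                  if (status >> bit) & 1],
        "mca_code": mca_code,
        "bits_31_16": (status >> 16) & 0xFFFF,  # model-specific, left raw
        "cache_error": None,
    }
    # Ignore the filter bit (bit 12) when matching the cache hierarchy pattern.
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "request": REQUESTS.get((mca_code >> 4) & 0xF, "unknown"),
            "type": TYPES.get((mca_code >> 2) & 0x3, "unknown"),
            "level": LEVELS[mca_code & 0x3],
        }
    return result


# Arg3 and Arg4 from the dump above.
print(decode_mci_status(0xFE000000, 0x0040015A))
```

Under those assumptions the value decodes as an uncorrected cache hierarchy error, which is consistent with the PROCESSOR_CACHE bucket name, but it does not by itself say whether the CPU, the board, or the settings are at fault.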
ID: 31322
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31329 - Posted: 3 Mar 2010, 17:19:51 UTC - in response to Message 31322.  
Last modified: 3 Mar 2010, 18:11:17 UTC

Hi,

Thanks for the reply.

I have tried all drivers from the WDM/Windows native drivers, to everything from the 191 series to the latest 196.75 drivers. All with the same results.

Since I have tried with no GPUs used through the cc_config.xml as well as with the GPUs enabled, with the same results, I am wondering why the video driver would be of concern. Is the app still trying to use some portion of the video even when told not to?
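For anyone following along, the cc_config.xml setting in question is presumably the `<no_gpus>` option. That is an assumption, since the actual file is not shown here. A minimal sketch that builds such a file's contents with Python's standard library (the helper name is made up for this example):

```python
# Sketch: build a cc_config.xml body that tells the BOINC client not to use GPUs.
# Assumption: the <no_gpus> option is the one meant in the post above.
import xml.etree.ElementTree as ET


def build_cc_config(no_gpus=True):
    """Return the XML text for a cc_config.xml with only the no_gpus option set."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "no_gpus").text = "1" if no_gpus else "0"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

The resulting text would go into cc_config.xml in the BOINC data directory, and the client has to re-read its config file or be restarted before the setting applies.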

I agree the debug points to hardware, but having had all the hardware tested, I disagree.

It's weird: the only time I have had any issue with this machine is when I try to run BOINC and its subsequent science apps.

The BSOD says exactly what the text in the previous post shows. That is a printout of the .dmp. I am attaching to Rosetta, World Community Grid, Spinhenge, Climate Prediction, and SETI.

System Specs:

ASUS P67T WS Motherboard
Intel i7 975 Extreme Processor (No overclocking)
12 GB G. Skill 16000CL9T DDR3 Memory (Running at 2000 per XMP)
3x XFX 280GTX XXX video cards in SLI
4x Seagate 500GB ST3500630NS in RAID 0+1 from ICH10R on MB
PC Power&Cooling 1200W PSU (Single Rail 100A)
Lian-Li (Lancool) PC-K62 Case with tons of fans

Intel .inf install 9.1.1.1025
Intel Rapid Store Driver 9.5.0.1037
NVIDIA 196.75
Soundmax Driver 6.10.2.6585
Realtek NIC driver 7.12.1218.2009
Realtek NIC Teaming Driver 6.8.1024.2008

Thanks.

Z
ID: 31329
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31330 - Posted: 3 Mar 2010, 18:15:36 UTC - in response to Message 31329.  

Since you also get the error when you do not use the GPU, it's probably your CPU that's the problem. Any of the projects you mention will stress the CPU to the limit, so any (minor) damage it has will turn into a problem.

Can you please test with Prime95 to see if it reacts the same way? If it does, it's your CPU.

You can also do a test of your memory, with Memtest86+ to see if that's the culprit.
ID: 31330
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31332 - Posted: 3 Mar 2010, 20:06:50 UTC - in response to Message 31330.  
Last modified: 3 Mar 2010, 20:07:09 UTC

Will try Prime95; I have already run extensive memory tests without issue. Also, I have the client set to use no more than 75% of the processor time, so I am not peaking it out, but I will try Prime95 and see what happens.

Z
ID: 31332
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31333 - Posted: 3 Mar 2010, 20:20:47 UTC - in response to Message 31332.  
Last modified: 3 Mar 2010, 20:21:04 UTC

Well, that was quick. Prime95 caused the exact same error in less than 1 minute. Am contacting Intel now for RMA on CPU.

Thanks for your help, I appreciate it.

Z
ID: 31333
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31334 - Posted: 3 Mar 2010, 21:10:38 UTC - in response to Message 31333.  

You're welcome, and I'm sorry it is the CPU. But at least that's what BOINC is used for as well, stress testing. If there are shortcomings in the CPU then BOINC will find them.
ID: 31334
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31336 - Posted: 4 Mar 2010, 0:44:07 UTC - in response to Message 31334.  

Ok, spent the better part of the afternoon on the phone with Intel Tech Support. They claim Prime95's failure in no way indicates a failed CPU and refuse to RMA it on just those grounds. Any other ideas?

Thank you.

Z
ID: 31336
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31337 - Posted: 4 Mar 2010, 6:22:03 UTC - in response to Message 31336.  
Last modified: 4 Mar 2010, 6:24:32 UTC

What do they claim is causing it then, cosmic rays? You did tell them that BOINC projects also cause this BSOD? What else should you test it with then, according to them?

Anyway, here's another CPU stress testing kit: http://downloads.guru3d.com/IntelBurnTest-v1.6-download-2047.html

See what that one does.
ID: 31337
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31347 - Posted: 4 Mar 2010, 16:49:20 UTC - in response to Message 31337.  

Thank you. I will download and test with this also. Not sure what they wanted besides a "qualified shop" to have tested and drawn the same conclusion. If I get the same results with this package I will recontact them and be more insistent on resolution.

Thank you again for the help with this.

Z
ID: 31347
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31349 - Posted: 4 Mar 2010, 17:02:40 UTC - in response to Message 31347.  

I discussed this with one of the BOINC developers. He says it's very unlikely, but worth checking anyway, that it may be a rootkit.

Have you thoroughly scanned your system with a good AV package, after you booted off a CD?
ID: 31349
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31357 - Posted: 5 Mar 2010, 5:21:08 UTC - in response to Message 31349.  
Last modified: 5 Mar 2010, 5:22:59 UTC

Hi Jord,

Thank you for all of the advice. I am running Kaspersky IS 2010 and it is in date and up to date. No nasties detected.

For grins, I borrowed my buddy's processor, same exact model, from his system to test. We have identical setups except that he runs CrossfireX and I run SLI, and he runs an LSI SATA III RAID card in RAID 0 with three 6Gb/s drives while I run MB-based Intel RAID in RAID 10 with four 3Gb/s drives.

Exact same BSOD with his proc in my box, and no BSOD when we run BOINC on his box.

On a side note, after the second test software and repeating the BSOD Intel agreed to RMA the processor, but it looks as if it was unnecessary.

Still trying to figure it out.

Thanks again for all the help.

Z
ID: 31357
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31358 - Posted: 5 Mar 2010, 6:22:53 UTC - in response to Message 31357.  

Ah... that moves the problem from CPU to motherboard. I'd forgotten about the motherboard, to be honest. Unless you found a production flaw in that specific model of the Intel CPU, but that's even more unlikely than it being a rootkit. ;-)
ID: 31358
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31364 - Posted: 5 Mar 2010, 17:16:51 UTC - in response to Message 31358.  

Time to contact the next vendor. Thanks for your patience and direction. I am looking forward to nailing this down.

Z
ID: 31364
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31375 - Posted: 5 Mar 2010, 23:41:57 UTC - in response to Message 31364.  

So, before I unpacked the entire box to work on an RMA for the motherboard, I fiddled with the voltage settings. To this point everything has been running at stock. I found that if I increase the CPU and QPI voltages I can get Prime95 to finish without a BSOD, but the system will BSOD some random time after that with a different hardware error. So I am leaning towards ASUS running things on the lean side, power-wise, for that much processing. I am going to fiddle with this a bit more before I take it all apart.

Z
ID: 31375
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31384 - Posted: 6 Mar 2010, 8:59:27 UTC - in response to Message 31375.  

Just out of curiosity, what did the other BSOD say?
ID: 31384
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31431 - Posted: 8 Mar 2010, 19:26:54 UTC - in response to Message 31384.  

I'll run it today and get the error.
ID: 31431
ffzeus
Joined: 14 Dec 09
Posts: 12
United States
Message 31460 - Posted: 9 Mar 2010, 2:19:17 UTC - in response to Message 31431.  

Cannot get it to throw the same error again. The issue is directly related to the XMP profile in the memory. If I turn off XMP in the BIOS, no issues. If I adjust the memory settings to what the manufacturer claims works, BSOD. I have run Memtest86 and MS Memory Tester, and taken the sticks to a local shop to be tested on a memory testing machine they have, and they all check out.

Guess it is stock, wimpy memory timings and speed for the time being.
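
For context on what the XMP profile is asking for, the "16000" in the module name and the "2000" speed from the spec list are two views of the same rating. A small sketch of the usual arithmetic, assuming a standard 64-bit (8 bytes per transfer) DDR3 module; the function name is made up for this example:

```python
# Sketch: relate a DDR3 data rate (MT/s) and CAS latency to bandwidth and latency.
# Assumption: standard 64-bit module, so 8 bytes move per transfer.

def ddr3_figures(data_rate_mts, cas_latency):
    """Return (peak bandwidth in MB/s, CAS latency in nanoseconds)."""
    peak_bandwidth_mb_s = data_rate_mts * 8
    # The memory clock is half the data rate, so one cycle lasts 2000 / rate ns.
    cas_ns = cas_latency * 2000 / data_rate_mts
    return peak_bandwidth_mb_s, cas_ns


# The XMP settings from the spec list: 2000 MT/s at CL9.
print(ddr3_figures(2000, 9))
```

Any lower fallback speed and timing can be plugged in the same way to see how much is given up by leaving XMP off.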

Thank you for all of the help with this.

Z
ID: 31460
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 31465 - Posted: 9 Mar 2010, 7:36:40 UTC - in response to Message 31460.  
Last modified: 9 Mar 2010, 8:08:25 UTC

Guess it is stock, wimpy memory timings and speed for the time being.

Which is why I don't mind throwing an extra buck at it and getting something more expensive from Kingston, Crucial or A-Data.

But the previous error should still be stored in your Windows event logs (Start->Control Panel->Administrative Tools->Event Viewer), in either the System or the Application log.
ID: 31465

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.