BOINC projects possible cause for BSOD

Tomcat
Joined: 10 Jul 10
Posts: 4
Slovenia
Message 33723 - Posted: 10 Jul 2010, 8:39:42 UTC

Hi, I have been running BOINC in unprotected mode from 8 PM to 8 AM for several months now and have had 18 BSODs since then, caused by various programs and even system drivers, as debugging the memory dumps showed. The MS debugger suggested memory corruption, so I have run memtest several times: all OK. All crashes happened while BOINC projects were active. I uninstalled BOINC, installed the latest version, 6.10.56 (for Win7 x64), in protected mode, resumed all projects, and there it was: a BSOD again! Then I suspended all projects for a week, and there was no crash during that time. Now I have reinstalled the latest version again in unprotected mode and have suspended all projects so that only one at a time is allowed to run (hoping to isolate the project responsible for the crashes). My question: is there a way to trace what BOINC is doing in the background?
ID: 33723
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 33724 - Posted: 10 Jul 2010, 9:06:13 UTC - in response to Message 33723.  

What does the BSOD say? Does the message vary?
Which projects are you attached to?

Have you checked all your hardware? Looked for bulging capacitors, for instance?
ID: 33724
Tomcat
Joined: 10 Jul 10
Posts: 4
Slovenia
Message 33725 - Posted: 10 Jul 2010, 10:41:51 UTC - in response to Message 33724.  

MilkyWay, CPDN, Einstein, Cosmology, SETI, WCG, Rosetta, Lhcathome (mostly idle, can be excluded).

The crash is different almost every time (I got 3x Google Chrome crashes when it was left open during the night, 2x Winamp, 2x screen saver, 1x sound driver, 1x LCD control app, etc.). This is a sample of the debug output (abridged) from my last crash dump, caused by the LCD control app:

1: kd> !analyze -v
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
...
FAULTING_IP:
nt!ObReferenceObjectByHandleWithTag+10c
...
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
...
PROCESS_NAME: Lcdctrl.exe
...
STACK_TEXT:
nt!KeBugCheckEx
nt! ?? ::FNODOBFM::`string'+0x40ecb
nt!KiPageFault+0x16e
nt!ObReferenceObjectByHandleWithTag+0x10c
nt!NtWaitForSingleObject+0x69
nt!KiSystemServiceCopyEnd+0x13
...
IMAGE_NAME: ntkrnlmp.exe

I ran memtest and left it running overnight; no errors were found. Note also that this machine is only 1 year old with fairly good hardware, and cooling was not neglected either (the memory is OCZ Reaper Series with heatsinks, the motherboard is a DFI LanParty JR series, the CPU cooler is a massive Scythe Zipang with a 140 mm fan; in addition, there is 1x 120 mm fan pulling air in and 2x 80 mm fans blowing hot air out at the rear, and the GPU has its own cooling circuit). I am, however, running the Core2Duo overclocked from 3 GHz to 3.6 GHz and the memory at 480 MHz (DDR2), but with no voltage increase whatsoever. I ran the OCCT stress test for 24 hours a few months ago, and the next day the machine was normally responsive; chipset and CPU temperatures did not go past 63°C at the end of the test. I didn't physically check the capacitors, though, as I can't imagine that after only a year I could get bulging capacitors on this mobo; DFI is known for being very robust even when heavily overclocked (which is not the case here). In addition, I regularly vacuum-clean the interior and use air filters so that there is a minimum of dust inside the machine. The crash occurs only during the BOINC work hours, usually between 11 PM and 6 AM. Anyway, the projects are now active again with the latest version; I have terminated all other background programs during the night and will wait a couple of weeks to see what happens. What I would really need now is some kind of tracing within BOINC, at least to see what BOINC was doing at the time of the crash, as there is no other activity except the BOINC projects.
ID: 33725
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 33726 - Posted: 10 Jul 2010, 11:05:43 UTC - in response to Message 33725.  

If BOINC crashes, it outputs its error to stderrdae.txt in your BOINC Data directory.
If Windows crashes due to something running under BOINC, it's stored in Event Viewer.
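
As a rough illustration of checking that file after a crash, here is a minimal Python sketch that prints the last lines of stderrdae.txt. The path you pass in is an assumption: point it at wherever your own BOINC Data directory lives. The helper name `tail` is hypothetical and not part of BOINC.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file, or [] if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    # deque with maxlen keeps only the final n lines while streaming the file
    with p.open("r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

For example, `tail(Path(data_dir) / "stderrdae.txt", 30)` would return the last 30 lines, where `data_dir` is your BOINC Data directory.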

As for bulging capacitors, a couple of years back a Taiwanese company released a massive amount of bad capacitors which would break within a year. All motherboard and graphics card manufacturers got these capacitors and used them.

Just as an example, I had two brand new MSI boards break down within 6 months of each other due to bad capacitors. My last motherboard (Asus) broke a memory slot, which would only manifest itself when BOINC was running. It wouldn't show up on normal use, or in memtest 86+. I only figured it out by removing one stick of memory.

Have you tried running with one stick of memory, to see if that fixes things?
(yes, I know, that will disable dual memory channel, but it'll still run).

It's also possible you're infected by a worm or rootkit virus.
See at least this entry in the MSKB for the worm infection. For the root kit infection, you can either use any updated AV package and enable its rootkit scanning ability (may require a reboot into safe mode), or use one of the special programs.

It never hurts to check.
ID: 33726
Tomcat
Joined: 10 Jul 10
Posts: 4
Slovenia
Message 33727 - Posted: 10 Jul 2010, 12:17:53 UTC - in response to Message 33726.  

Crap! Apparently you can't rely on anyone these days... I'll physically check the motherboard as you suggested, along with the rootkit check. After that I will also do some testing with a single memory module. Thanks for the advice for now!
ID: 33727
Professor Ray
Joined: 31 Mar 08
Posts: 59
United States
Message 33763 - Posted: 14 Jul 2010, 4:02:03 UTC
Last modified: 14 Jul 2010, 4:04:05 UTC

Let's not forget this nugget:

http://boinc.berkeley.edu/dev/forum_thread.php?id=5843

While it's possible, I'm not leaning towards hardware issues - your box is the fashizle fo shizle - and you're right: your mobo manufacturer isn't known for the sort of prollems described. Mobo support forums should be good indicators on that.

You have a prollem though, and it's a big one too. You said:

'IMAGE_NAME: ntkrnlmp.exe'

Typically those problems can't be fixed; it's in the O/S itself. Unfortunately, the only solution I'm aware of is: reformat & reinstall.

One of the things that alarms me concerning all this is that you state your system isn't over spec V (despite being O/C'd). If you're overclocking the CPU and you haven't given it extra juice, there's your prollem. If you're O/C'g the memory bus and there's no mem V control jumper on the mobo or in the BIOS for that, you're overdriving your memory. Perhaps the memory timings are out of spec for the O/C'd undervolt condition. Are all the mem sticks in the system the same type & spec? You'd think that memtest would uncover that sort of issue...

You didn't say what O/S you had. Let me guess: Vista.

How you doing with tweakin' memory management?

I've seen these sort of issues before - Lattice Project (GARLI 5.13) - and people with your sort of boxes can't get the job done. I've presently 262 wall-clock hours (240.5 CPU) elapsed w/89.687% of WU GARLI 5.13 complete; I expect this WU to be complete around 23 Jul (and that's running 24/7). Without doubt my machine is a lame neutered wimp compared to yours. Maybe that's the prollem: you have way too much power for BOINC; you and I need to swap machines, that's what I think.

You get those errors called 'PAGE_FAULT_IN_NONPAGED_AREA', you absolutely must do a chkdsk /x. Then you absolutely MUST blow away your swap file and reboot. You must absolutely set a minimum sized swap file on some other HDD (or partition) available, and then defrag using a 3rd-party defraggler of any repute so as to consolidate free space in the middle of the HDD with errors. Then you can re-create a swap file of sufficient size for your needs (see link).

In short: your swap file is screwed up, your NT kernel is screwed up; you can fix the former, the latter requires reinstall (I'm sorry). You must find the source of the prollem - it's NOT BOINC - or it WILL happen again.
ID: 33763
Tomcat
Joined: 10 Jul 10
Posts: 4
Slovenia
Message 34233 - Posted: 15 Aug 2010, 15:33:54 UTC - in response to Message 33763.  

Hello, I'm back from vacation with some updates on this issue... :)

I would say that the mobo is 100% OK (I had it checked by a friend who is an electrician), and there was no rootkit or other malware of any kind either.

Next, I added the following options to cc_config.xml:
<cc_config>
  <log_flags>
    <app_msg_send>1</app_msg_send>
    <app_msg_receive>1</app_msg_receive>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>

so that I could see what my projects were doing at the time of a crash. Then I got several more crashes. Here is where the log ended for two of them:

16-Jul-2010 03:32:59 [climateprediction.net] [app_msg_send] sent heartbeat to hadsm3dhet2_k60s_006617454_7
16-Jul-2010 03:32:59 [SETI@home] [app_msg_send] sent heartbeat to 02mr07ai.12737.3344.8.10.141_3
16-Jul-2010 03:32:59 [climateprediction.net] [app_msg_receive] got msg from slot 1: <current_cpu_time>5.910898e+005</current_cpu_time>
<checkpoint_cpu_time>5.909999e+005</checkpoint_cpu_time>
<fraction_done>9.917608e-001</fraction_done>

Debug session time: Fri Jul 16 03:33:01.203 2010 (GMT+2)

08-Aug-2010 04:59:03 [climateprediction.net] [cpu_sched] Resuming hadsm3dhet2_k3n7_006614373_6
08-Aug-2010 04:59:03 [Einstein@Home] [cpu_sched] Resuming h1_0828.30_S5R4__241_S5GC1a_2
08-Aug-2010 04:59:03 [climateprediction.net] [app_msg_send] sent heartbeat to hadsm3dhet2_k3n7_006614373_6
08-Aug-2010 04:59:03 [Einstein@Home] [app_msg_send] sent heartbeat to h1_0828.30_S5R4__241_S5GC1a_2

Debug session time: Sun Aug 8 04:59:03.731 2010 (GMT+2)

The "Debug session time" is the time of the memory dump, as reported by WinDbg. The logs from the times of the other crashes were similar: the other projects varied, but CPDN activity was always logged just a fraction of a second before the crash. I suspended CPDN on Aug 8, and there have been no more crashes since then. I have even left some other programs doing their job alongside the BOINC activity, and the machine was still running in the morning. CPDN still runs without problems on my other machine, Debian on an old Pentium 4.
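
The comparison described above (which projects were logged just before each dump time) could be automated roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the line format is inferred from the log excerpts quoted in this post, and the log timestamps are assumed to be in the same local time zone as WinDbg's "Debug session time".

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 16-Jul-2010 03:32:59 [climateprediction.net] [app_msg_send] sent heartbeat to ...
LOG_LINE = re.compile(
    r"^(\d{2})-([A-Za-z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2}) \[([^\]]+)\] (.*)$"
)
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_line(line):
    """Return (timestamp, project, message), or None for non-matching lines."""
    m = LOG_LINE.match(line.strip())
    if m is None or m.group(2) not in MONTHS:
        return None  # e.g. continuation lines like <fraction_done>...</fraction_done>
    day, mon, year, hh, mm, ss = (m.group(1), m.group(2), m.group(3),
                                  m.group(4), m.group(5), m.group(6))
    ts = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return ts, m.group(7), m.group(8)


def projects_before(lines, crash_time, window_seconds=5):
    """Count log entries per project in the window ending at crash_time."""
    start = crash_time - timedelta(seconds=window_seconds)
    counts = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, project, _ = parsed
        if start <= ts <= crash_time:
            counts[project] = counts.get(project, 0) + 1
    return counts
```

Running `projects_before` once per crash, with the dump time from WinDbg as `crash_time`, and then looking for the project that appears in every result is the same manual comparison described above. It shows correlation only, not the cause.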

@Professor Ray
I have read your opinion and I would say that you are most likely correct. As I hand-built this machine myself, I took special care with all the components, so the RAM is an OCZ 4GB Kit DDR2-1066 Reaper HPC (the OS is Win7 Ultimate). And you are correct: when I did the OC, I didn't go very deep into it, so the memory timings are very likely not suitable. Now I should probably either go back to the mobo defaults, leave CPDN suspended on this machine, or spend more time adjusting the memory timings in the BIOS as well (which will surely take me quite some time...). As you said that maybe there is too much power in this machine, I'll most likely just leave CPDN running on my old P4, where no such problems were noticed. I also did the page file clean-up and chkdsk, as you suggested, but as the machine is now stable, I don't plan to reinstall the entire OS.
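
For reference, the overclock figures mentioned in this thread work out as in the small sketch below. It assumes that the 480 MHz quoted earlier is the memory I/O clock, so that DDR2 transfers twice per clock; the function names are purely illustrative.

```python
def overclock_percent(stock_ghz, actual_ghz):
    """Relative increase of the actual clock over the stock clock, in percent."""
    return (actual_ghz - stock_ghz) / stock_ghz * 100.0


def ddr2_data_rate(io_clock_mhz):
    """DDR2 transfers data on both clock edges: data rate = 2 x I/O clock."""
    return 2 * io_clock_mhz


cpu_oc = overclock_percent(3.0, 3.6)   # about 20 percent over stock
mem_rate = ddr2_data_rate(480)         # 960 MT/s, i.e. "DDR2-960"
rated = 1066                           # the kit is sold as DDR2-1066

print("CPU overclock: about %.0f%%" % cpu_oc)
print("Memory: DDR2-%d against a rated DDR2-%d" % (mem_rate, rated))
```

Under that assumption the memory runs below its rated data rate, which would point at timings or voltage rather than frequency. This is only arithmetic on the numbers quoted in the thread, not a diagnosis.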

Thank you both for your advice and assistance!
ID: 34233

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.