BOINC freezes computer

rudabega
Joined: 12 Feb 18
Posts: 5
United States
Message 84769 - Posted: 13 Feb 2018, 23:18:11 UTC

Good afternoon,

So I started having an issue last Friday where my computer would freeze anywhere from a few minutes to maybe an hour after starting up. In each instance, the Num Lock light would turn off, which became the telltale sign that my computer had frozen. Each freeze required a hard reboot, as nothing would respond, even though I let it sit many a time for an hour or longer. So as the weekend progressed I tried troubleshooting this and that.

I checked the Event Viewer and nothing really pops out. After rebooting, I would see a message labeled "Error" saying "The previous system shutdown at ??? on ??? was unexpected." Shortly after this event, another event is listed as "Critical", saying "The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly." Duh! However, nothing indicated a specific problem after this.

So I continued researching what to do next. I eventually got to the point where I disabled all programs at start-up, and this worked. My system seemed fine, with no freezing.
After no freezing for about 3 hours, I started BOINC and within about 5 minutes the system froze. Each and every time I start BOINC, the system freezes after a few minutes. I am not sure what is going on or how to diagnose it. I cannot be sure whether I upgraded to the latest version of BOINC recently. If there is a way to tell, I can check; it might have been last week, but I cannot remember. Without BOINC, I have had the system up and running without issue for 2+ days (since Sunday). I just started BOINC while responding to another post about a GPU and it froze on me. Thus I have concluded it is BOINC that is doing something and causing my system to freeze.

Following is a small snapshot of my Event Viewer from right after my last reboot due to the freeze (today, 02/13/2018):

Information 2/13/2018 3:09:04 PM Kernel-General 16 None
Information 2/13/2018 3:09:04 PM Kernel-General 16 None
Information 2/13/2018 3:09:04 PM Ntfs (Microsoft-Windows-Ntfs) 98 None
Information 2/13/2018 3:09:04 PM Ntfs (Microsoft-Windows-Ntfs) 98 None
Information 2/13/2018 3:09:03 PM Kernel-Power 172 (203)
Critical 2/13/2018 3:09:03 PM Kernel-Power 41 (63)
Information 2/13/2018 3:09:03 PM FilterManager 6 None
Information 2/13/2018 3:09:03 PM FilterManager 6 None
Information 2/13/2018 3:09:03 PM FilterManager 6 None
Information 2/13/2018 3:09:03 PM FilterManager 6 None
Information 2/13/2018 3:09:03 PM Ntfs (Microsoft-Windows-Ntfs) 98 None
Information 2/13/2018 3:09:02 PM FilterManager 6 None
Information 2/13/2018 3:09:02 PM FilterManager 6 None
Information 2/13/2018 3:09:02 PM FilterManager 6 None
Information 2/13/2018 3:09:52 PM EventLog 6013 None
Information 2/13/2018 3:09:52 PM EventLog 6005 None
Information 2/13/2018 3:09:52 PM EventLog 6009 None
Error 2/13/2018 3:09:52 PM EventLog 6008 None
Information 2/13/2018 3:09:02 PM Kernel-Boot 32 (58)
Information 2/13/2018 3:09:02 PM Kernel-Boot 18 (57)
Information 2/13/2018 3:09:02 PM Kernel-Boot 25 (32)
Information 2/13/2018 3:09:02 PM Kernel-Boot 27 (33)
Information 2/13/2018 3:09:02 PM Kernel-Boot 20 (31)
Information 2/13/2018 3:09:02 PM Kernel-Boot 153 (62)
Information 2/13/2018 3:09:02 PM Kernel-General 12 (1)
Warning 2/13/2018 2:53:28 PM WHEA-Logger 19 None
Error 2/13/2018 2:52:30 PM DistributedCOM 10016 None
Error 2/13/2018 2:50:18 PM DistributedCOM 10016 None
Information 2/13/2018 1:22:50 PM Kernel-General 1 (5)


According to the "Error" at 3:09:52, the system shut down at 2:42:32. However, you can see two error messages and a warning logged after this time. The DistributedCOM messages seem to be routine, as they also occur at other times, so I am not sure if they are related. However, the "Warning" event is new from what I can tell. Not sure if this is helpful, but the message box on the General tab reads:

A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Cache Hierarchy Error
Processor APIC ID: 0

The details view of this entry contains further information.
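
The observation above (events still being logged after the reported shutdown time) can be checked mechanically. The following is a minimal sketch, assuming the Event Viewer rows have been pasted as plain text in the layout shown in this post. The names `parse_rows` and `events_between` are illustrative helpers, not part of any Windows tool, and the two boundary times are simply the ones quoted in this post.

```python
import re
from datetime import datetime

# A few rows copied from the Event Viewer excerpt in this post.
SAMPLE = """\
Critical 2/13/2018 3:09:03 PM Kernel-Power 41 (63)
Error 2/13/2018 3:09:52 PM EventLog 6008 None
Information 2/13/2018 3:09:02 PM Kernel-General 12 (1)
Warning 2/13/2018 2:53:28 PM WHEA-Logger 19 None
Error 2/13/2018 2:52:30 PM DistributedCOM 10016 None
Error 2/13/2018 2:50:18 PM DistributedCOM 10016 None
Information 2/13/2018 1:22:50 PM Kernel-General 1 (5)
"""

# Columns: level, date and time, source, event ID, task category.
ROW = re.compile(
    r"^(?P<level>\w+) "
    r"(?P<when>\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) "
    r"(?P<source>.+) (?P<event_id>\d+) (?P<category>\S+)$"
)
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_rows(text):
    """Turn pasted Event Viewer rows into dicts with a datetime in 'when'."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            event = match.groupdict()
            event["when"] = datetime.strptime(event["when"], TIME_FORMAT)
            events.append(event)
    return events


def events_between(events, start, end):
    """Return events strictly between start and end, oldest first."""
    hits = [e for e in events if start < e["when"] < end]
    return sorted(hits, key=lambda e: e["when"])


# Times taken from this post: shutdown time reported by event 6008,
# and the first event logged by the following boot.
reported_shutdown = datetime(2018, 2, 13, 14, 42, 32)
first_boot_event = datetime(2018, 2, 13, 15, 9, 2)

late = events_between(parse_rows(SAMPLE), reported_shutdown, first_boot_event)
for e in late:
    print(e["when"].time(), e["level"], e["source"], e["event_id"])
```

On the sample rows this lists the two DistributedCOM errors and the WHEA-Logger warning, which are the entries logged between the reported shutdown and the reboot.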


I was not having any issues with BOINC prior to last week. This is the reason that makes me think that I may have upgraded BOINC. This is also because, I recently had my computer off for a couple of months due to a remodel project on my house, so I was not able to use this system. More than likely, the most current BOINC version available around September 2017 was the one that was being used prior to the past couple of weeks.

Ultimately, I am at a loss to explain the events. I am at the end of my knowledge of computers and diagnosis and am going to need someone with more understanding. If anyone has any insight or ideas, please share. At this time, my next step is to uninstall BOINC and go back a version or two to see if that corrects the issue. However, I will wait for any potential ideas before I do.

Thanks,

Jason
ID: 84769

Richie
Joined: 2 Jul 14
Posts: 186
Finland
Message 84779 - Posted: 14 Feb 2018, 4:45:53 UTC - in response to Message 84769.  

Hi Jason!

I was not having any issues with BOINC prior to last week. This is the reason that makes me think that I may have upgraded BOINC.


You can find out which version you were using by visiting the website of a project you've been crunching for. View your tasks page there and look for completed tasks. Find a completed task from the time period that you're looking for. Click "task details" or something like that on that task. The 'Stderr output' section will show which BOINC version your computer had at the time.

Although, I have a feeling the Boinc version is not the problem here.

This is also because, I recently had my computer off for a couple of months due to a remodel project on my house, so I was not able to use this system.


Could you give some additional information about the specifications of your computer, please? Is it a desktop? How old is it, especially the PSU and motherboard? Running Windows or Linux, what version?

It seems your system is not stable under the above-normal stress that running BOINC can cause. How much stress does your BOINC setup put on the hardware? Are you running GPU or CPU tasks, and how many? What projects? Is the total load light or more like 100%?

There can be various reasons why the system isn't stable. You mentioned the computer had been without power for a few months, if I understood correctly. Sometimes, if the hardware is old and nearing the end of its lifecycle, there may be problems when powering up after a long pause. Even an old, poor-quality PSU from years back may run fine under a heavy load as long as it's constantly doing that. But after it gets shut down and remains unpowered for some time, it might not recover from that cold period anymore. An old motherboard could suffer from the same.

Have you checked whether the computer has gathered dust inside the chassis, on the parts and fans/cooler? There could be something causing the computer to heat up just a tiny bit more. So even if it previously was seemingly stable, a small addition in heat might now cross the critical line and cause stability problems.
ID: 84779

rudabega
Joined: 12 Feb 18
Posts: 5
United States
Message 84786 - Posted: 14 Feb 2018, 17:09:52 UTC - in response to Message 84779.  

Richie,

My apologies. I forgot to include the first few lines of the event log from BOINC from when I first started it, before it froze. I know that the log has much of the information you are looking for. I was typing my original message over the course of a couple of hours because I kept getting interrupted, and I completely overlooked it. Here is that information:

2/13/2018 2:47:23 PM | | cc_config.xml not found - using defaults
2/13/2018 2:47:23 PM | | Starting BOINC client version 7.8.3 for windows_x86_64
2/13/2018 2:47:23 PM | | log flags: file_xfer, sched_ops, task
2/13/2018 2:47:23 PM | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
2/13/2018 2:47:23 PM | | Data directory: C:\ProgramData\BOINC
2/13/2018 2:47:23 PM | | Running under account Jason
2/13/2018 2:47:24 PM | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7800 Series (driver version 2442.9, device version OpenCL 1.2 AMD-APP (2442.9), 2048MB, 2048MB available, 2637 GFLOPS peak)
2/13/2018 2:47:25 PM | | Host name: Powerglide
2/13/2018 2:47:25 PM | | Processor: 8 AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor [Family 21 Model 2 Stepping 0]
2/13/2018 2:47:25 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1
2/13/2018 2:47:25 PM | | OS: Microsoft Windows 10: Core x64 Edition, (10.00.16299.00)
2/13/2018 2:47:25 PM | | Memory: 15.95 GB physical, 31.95 GB virtual
2/13/2018 2:47:25 PM | | Disk: 110.42 GB total, 4.90 GB free
2/13/2018 2:47:25 PM | | Local time is UTC -7 hours
2/13/2018 2:47:25 PM | | VirtualBox version: 5.1.26
2/13/2018 2:47:25 PM | Amicable Numbers | URL https://sech.me/boinc/Amicable/; Computer ID 29685; resource share 50
2/13/2018 2:47:25 PM | Cosmology@Home | URL http://www.cosmologyathome.org/; Computer ID 345818; resource share 50
2/13/2018 2:47:25 PM | duchamp | URL https://sourcefinder.theskynet.org/duchamp/; Computer ID 7377; resource share 100
2/13/2018 2:47:25 PM | Einstein@Home | URL https://einsteinathome.org/; Computer ID 12624303; resource share 50
2/13/2018 2:47:25 PM | Milkyway@Home | URL https://milkyway.cs.rpi.edu/milkyway/; Computer ID 764641; resource share 50
2/13/2018 2:47:25 PM | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 8457605; resource share 50
2/13/2018 2:47:25 PM | | General prefs: from https://www.grcpool.com/ (last modified ---)
2/13/2018 2:47:25 PM | | Computer location: home
2/13/2018 2:47:25 PM | | General prefs: no separate prefs for home; using your defaults
2/13/2018 2:47:25 PM | | Reading preferences override file
2/13/2018 2:47:25 PM | | Preferences:
2/13/2018 2:47:25 PM | | max memory usage when active: 16330.43 MB
2/13/2018 2:47:25 PM | | max memory usage when idle: 16330.43 MB
2/13/2018 2:47:25 PM | | max disk usage: 8.59 GB
2/13/2018 2:47:25 PM | | max CPUs used: 7
2/13/2018 2:47:25 PM | | don't compute while active
2/13/2018 2:47:25 PM | | don't use GPU while active
2/13/2018 2:47:25 PM | | suspend work if non-BOINC CPU load exceeds 25%
2/13/2018 2:47:25 PM | | (to change preferences, visit a project web site or select Preferences in the Manager)
2/13/2018 2:47:25 PM | Cosmology@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Einstein@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Einstein@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Einstein@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Amicable Numbers | Task amicable_10_20_26580_1517630401.698072_559_3 is 1.61 days overdue; you may not get credit for it. Consider aborting it.
2/13/2018 2:47:25 PM | | Using account manager grcpool.com
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_26580_1517630401.698072_541_3; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_26580_1517630401.698072_560_3; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_23205_1518119702.482858_289_2; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_2267_1518147902.154508_756_0; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_2267_1518147902.154508_823_1; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_11239_1518102602.415400_14_2; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_23205_1518119702.482858_118_2; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_26580_1517630401.698072_461_4; not started and deadline has passed
2/13/2018 2:47:25 PM | Amicable Numbers | Aborting task amicable_10_20_2267_1518147902.154508_750_0; not started and deadline has passed
2/13/2018 2:47:25 PM | | Contacting account manager at https://www.grcpool.com/
2/13/2018 2:47:29 PM | Amicable Numbers | Sending scheduler request: To report completed tasks.
2/13/2018 2:47:29 PM | Amicable Numbers | Reporting 9 completed tasks
2/13/2018 2:47:29 PM | Amicable Numbers | Not requesting tasks: don't need (CPU: not highest priority project; AMD/ATI GPU: job cache full)
2/13/2018 2:47:31 PM | | Account manager contact succeeded
2/13/2018 2:47:31 PM | | General prefs: from https://www.grcpool.com/ (last modified ---)
2/13/2018 2:47:31 PM | | Computer location: home
2/13/2018 2:47:31 PM | | General prefs: no separate prefs for home; using your defaults
2/13/2018 2:47:31 PM | | Reading preferences override file
2/13/2018 2:47:31 PM | | Preferences:
2/13/2018 2:47:31 PM | | max memory usage when active: 16330.43 MB
2/13/2018 2:47:31 PM | | max memory usage when idle: 16330.43 MB
2/13/2018 2:47:31 PM | | max disk usage: 8.59 GB
2/13/2018 2:47:31 PM | | max CPUs used: 7
2/13/2018 2:47:31 PM | | don't compute while active
2/13/2018 2:47:31 PM | | don't use GPU while active
2/13/2018 2:47:31 PM | | suspend work if non-BOINC CPU load exceeds 25%
2/13/2018 2:47:31 PM | | (to change preferences, visit a project web site or select Preferences in the Manager)
2/13/2018 2:47:32 PM | Amicable Numbers | Scheduler request completed
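
The client version asked about earlier in the thread appears in the startup lines of this log. As a minimal sketch, assuming the log is available as plain text in the `timestamp | project | message` layout shown above, the version and any `[error]` lines could be pulled out as follows. `split_line` and `summarize` are hypothetical helper names, not BOINC functions.

```python
import re
from collections import Counter

# A few lines copied from the BOINC event log in this post.
SAMPLE = """\
2/13/2018 2:47:23 PM | | Starting BOINC client version 7.8.3 for windows_x86_64
2/13/2018 2:47:25 PM | | max CPUs used: 7
2/13/2018 2:47:25 PM | Cosmology@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Einstein@Home | [error] no project URL in task state file
2/13/2018 2:47:25 PM | Einstein@Home | [error] no project URL in task state file
"""

VERSION = re.compile(r"Starting BOINC client version (\S+) for (\S+)")


def split_line(line):
    """Split one log line into (timestamp, project, message), or None."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return tuple(parts)


def summarize(text):
    """Return (version, platform, {project: error count}) for a pasted log."""
    version = platform = None
    errors = Counter()
    for line in text.splitlines():
        fields = split_line(line)
        if fields is None:
            continue
        _, project, message = fields
        found = VERSION.search(message)
        if found:
            version, platform = found.groups()
        if message.startswith("[error]"):
            errors[project] += 1
    return version, platform, dict(errors)


print(summarize(SAMPLE))
```

For the sample lines this reports client version 7.8.3 for windows_x86_64, with one `[error]` line for Cosmology@Home and two for Einstein@Home.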


In addition, this is my main desktop computer. I would say it is approximately 5 to 6 years old, but I honestly cannot remember. This is a computer that I built, and I have upgraded components over time, so not everything is original. I just replaced the power supply in January because when I went to turn it on, nothing happened. No power anywhere. EVGA replaced the PSU under warranty (the original PSU was an EVGA 750 GQ, 80+ GOLD 750W, Semi Modular), but I did not want to wait the 2 to 3 weeks for the replacement, so I bought a new PSU (EVGA 750W SUPERNOVA G3 MODULAR). The motherboard is an ASUS M5A99FX PRO R2.0 ATX AM3+. In searching for the MB, I discovered that I purchased it about two years ago, in 2016. So the case might be 5 to 6 years old, but most components are probably 3 years old or less.

Prior to about a week ago, I did not have any issues with running BOINC. After I replaced the PSU with a new one, everything turned on and was working fine. This happened around the first week of January. I was running BOINC projects without any issues (as far as I knew). The system seemed stable. The event log shows all the projects I am currently running. However, I only recently began crunching for Amicable Numbers (01/12/2018) and OLDK1 (01/27/2018). All the other projects I have been crunching for 8 or more years. However, not all may have been active on this machine. Most likely it would have been SETI and Milkyway. But I would have to go back into each project to know for sure. With regard to tasks, I am running both CPU and GPU tasks. I have the CPU set to 88% to leave some room for the GPU. Prior to the freezing issue, the resource monitor would show most cores in the 90% range or so. I read somewhere to limit the CPU usage when you are using the GPU, so as not to be throttling the GPU because all CPU cores were busy. In setting the limit to 88%, this seems to give a good response to all cores and not max out any of them.
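
As a quick arithmetic check on the 88% setting: with 8 processors, 88% works out to 7.04, which matches the "max CPUs used: 7" line in the log above if the client rounds down. The sketch below assumes that rounding behaviour (round down, with a minimum of one CPU). The function name `usable_cpus` is made up for illustration and is not a BOINC API.

```python
import math


def usable_cpus(ncpus, percent):
    """Number of CPUs available at a given percentage.

    Assumption: the result is rounded down, with at least one CPU kept.
    """
    return max(1, math.floor(ncpus * percent / 100))


# 8 processors at 88% gives 7.04, which rounds down to 7.
print(usable_cpus(8, 88))

# Every whole-number percentage that leaves exactly one of 8 cores free.
print([p for p in range(1, 101) if usable_cpus(8, p) == 7])
```

Under that assumption, any whole-number setting from 88% to 99% gives the same 7 CPUs on this machine, so 88% is the lowest such setting that still leaves exactly one core free.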

Yes, it always makes me nervous to turn a system off and leave it off for any extended period of time. However, as I noted above, most of the components are fairly new and the PSU is brand new. As you note, this could still be the issue, and I am not ruling it out. I am just wondering if there are ways to narrow down potential problems. If it is a hardware issue, I suspect the GPU. Unfortunately, I do not have the knowledge to begin narrowing down components, or to know whether that is even possible. It might be that I have to replace the GPU and fire up BOINC to see if it is stable. However, my main question is: if it is a hardware issue, why do I only see the problem when I start BOINC?

Regarding the heating issue, I do not think that is the problem. The CPU is liquid cooled. Heat is a big concern for me, so I regularly take the computer outside and blow out all the dust with my air compressor. I did this just before I fired it back up as well. Now, that is not to say that it isn't the issue. But I would like to think I mitigate this on a regular basis.

If I have not provided any information you were looking for, please do not hesitate to ask.

Thanks for taking the time to respond. It is greatly appreciated.

Jason
ID: 84786

mmonnin
Joined: 1 Jul 16
Posts: 146
United States
Message 84788 - Posted: 14 Feb 2018, 19:36:50 UTC

I've had a brand new PSU that would crash the system as soon as BOINC would put any kind of load on the system. BOINC was just the trigger but any kind of load would have done it. It was an EVGA as well. If that is the case then they have great service.

Before that though I would check on temps. All fans are running? Voltages are ok? You could try memtest as well.

Amicable numbers can use a lot of GPU memory. And memory requirements go up as the project continues I think. BOINC log shows 2gb of GDDR. What happens if you start BOINC with just the CPU running? Then just the GPU? Then try both to see if there are crashes after any of those.
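
The three runs suggested above (CPU only, GPU only, then both) can be written down as a small lookup so the results are easy to read. This is only a sketch of one way to interpret them. The function `interpret` and its wording are illustrative assumptions rather than an established diagnostic rule, and a GPU task still uses some CPU time, so the "GPU only" run is not perfectly isolated.

```python
def interpret(cpu_only_freezes, gpu_only_freezes, both_freezes):
    """Suggest where to look next, given which test runs froze the system."""
    if cpu_only_freezes and gpu_only_freezes:
        return "either load alone triggers it: look at shared parts (PSU, RAM, motherboard)"
    if cpu_only_freezes:
        return "CPU load triggers it: look at CPU cooling, RAM and motherboard power delivery"
    if gpu_only_freezes:
        return "GPU load triggers it: look at the graphics card, its driver and its power cables"
    if both_freezes:
        return "only the combined load triggers it: look at total power draw and case temperature"
    return "not reproduced: run each test longer before ruling anything out"


# Example: CPU-only and GPU-only runs were stable, but running both froze the system.
print(interpret(False, False, True))
```

Recording each run as a simple yes or no, along with how long it ran before freezing, keeps the results comparable between attempts.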
ID: 84788

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 867
United States
Message 84790 - Posted: 14 Feb 2018, 19:43:04 UTC - in response to Message 84786.  

The blowout with the air compressor may have slightly moved or dislodged the CPU, memory or add-in cards. I would remove the memory sticks and reinsert them first. The WHEA error is usually memory based. Next, remove the CPU cold plate and the chip, reseat the chip in the socket, and repaste. Remove the graphics cards and any other boards in the PCIe slots and reinsert them.

Also while doing this, inspect the mainboard for damage caused by your air compressor cleaning. It is possible you damaged or hit motherboard components with the air nozzle. I have the ASUS M5A99FX Pro 2.0 board and it is a solid board and normally does not give any issues or problems.
ID: 84790

rudabega
Joined: 12 Feb 18
Posts: 5
United States
Message 84801 - Posted: 16 Feb 2018, 0:19:39 UTC - in response to Message 84790.  

mmonnin/Keith,

Just to clarify, you are saying that you had a brand new PSU (that also happened to be an EVGA unit) that when loaded it crashed your system? If this is the case, how did you determine it was the PSU? One of the options I may do is to remove the new PSU I bought about a month ago and replace it with the warranty replaced version that EVGA sent me. The model of the original PSU was noted in my previous post. I got it about a week ago. But I will not do this until I have performed most of the steps I noted below.

I did check my temps and fans using SpeedFan last night. All fans seem to be working. When I unplugged the radiator fan, the CPU temp began to increase and one of the fan readings went to 0 RPM. When I unplugged the liquid cooling pump on the processor, the CPU temp began to increase fairly quickly and the other fan reading went to 0 RPM. My liquid cooling setup is a prepackaged unit with pump and radiator together as a sealed unit. I cannot remember the exact model, but it is similar to a Zalman LQ315.

I then decided to start BOINC. And sure enough the CPU temp, core temp and GPU temp began to rise. When the CPU temp reached 60C, I paused BOINC. I did this a few times and let the temp climb a degree or two higher each time before pausing. I then let it go to see where the temps would top out. The CPU held steady at about 66 to 67C, the core at about 48 to 50C and the GPU at about 55 to 57C. After searching the internet for my processor and GPU, these seemed reasonable temperatures. Thus I let BOINC continue running and I continued watching. It ran for probably 45 minutes to an hour and I thought maybe I had solved the problem. But alas, I spoke too soon. At 1:31 AM, the Numlock light went out and the computer locked up. I did not get to see if BOINC would run with just the CPU, as I was too tired at this point. I will try this tonight. Am I able to have only the GPU process WUs? This probably entails suspending all projects and tasks that do not use the GPU. I will try this tonight as well.

When you say voltages, to what voltages are you referring, fans, memory, something else, all of them? I am not sure how to check any specific one. Is there a program that will monitor these?
Also, will memtest only load the memory and not the CPU? I did not check this, but this would be a way to determine if the memory is the issue.

If none of this works, I will follow Keith's suggestions and remove and reseat the memory modules and graphics cards. If this does not work, I will remove and reseat the cold plate on the CPU. I will also inspect the MB. I try not to put the nozzle into the case at all and just blow air from the perimeter, to minimize the possibility of impacting any MB components. I know some of them can be sensitive. But I will double check as I am reseating other components. What may have sparked this issue is the installation of the new PSU. This was not an easy thing to do. Also, I have to apologize, as I gave the wrong MB model. While I thought I had the M5A99FX Pro 2.0 in this machine, it is actually in another machine I have. This machine has an ASUS M5A97 (V1.02 printed on the MB, so not the Rev 2.0 version). Not sure if this matters, but it has been a solid board. And this means that it probably is about 5 years old.

In the meantime, if there are any other suggestions or recommendations, please share. I really would like to get this issue solved sooner rather than later.

Thanks,

Jason
ID: 84801

Richie
Joined: 2 Jul 14
Posts: 186
Finland
Message 84802 - Posted: 16 Feb 2018, 4:13:17 UTC
Last modified: 16 Feb 2018, 4:14:10 UTC

HWMonitor would show some motherboard voltages (3.3V, 5V, 12V lines). You can check whether those stay approximately at their normal level while the computer is under stress. Oops, I realized... SpeedFan probably was displaying them also. Well, here it is anyway.
https://www.cpuid.com/softwares/hwmonitor.html
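
To make "approximately at their normal level" concrete, the sketch below compares readings against the nominal rail voltages with a 5% allowance, which is the tolerance the ATX specification gives for the +3.3V, +5V and +12V rails. The sample readings are hypothetical numbers, not values from this thread, and `out_of_tolerance` is an illustrative helper. Software sensor readings can themselves be inaccurate, so a flagged rail is a reason to measure again rather than proof of a fault.

```python
# Nominal voltages of the three main positive rails.
NOMINAL = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}


def out_of_tolerance(readings, tolerance=0.05):
    """Return the rails whose reading deviates from nominal by more than tolerance."""
    flagged = []
    for rail, nominal in NOMINAL.items():
        measured = readings.get(rail)
        if measured is None:
            continue  # no reading supplied for this rail
        if abs(measured - nominal) > tolerance * nominal:
            flagged.append(rail)
    return flagged


# Hypothetical readings noted down while the system is under load.
sample = {"+3.3V": 3.31, "+5V": 5.02, "+12V": 11.2}
print(out_of_tolerance(sample))
```

With these made-up readings only the +12V rail is flagged, since 11.2V is below the 11.4V lower bound.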

Memtest86+ will concentrate on testing the memory. It will of course use some CPU cycles to run its tests, but it will tell you if the RAM configuration is not stable (malfunctioning... or not stable with the current timings or memory/bus voltage).
http://www.memtest.org/#downiso

Maybe you could run the Valley benchmark software to see how the GPU handles stress (outside BOINC). "Extreme performance and stability test for PC hardware: video card, power supply, cooling system." If that can crash the system, maybe some new ideas will arise.
https://benchmark.unigine.com/valley

There is also, for example, PassMark BurnInTest, which could be used to stress the CPU at various loads. It can also test the GPU, memory, etc.
https://www.passmark.com/products/bit.htm
ID: 84802

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 84807 - Posted: 16 Feb 2018, 15:00:15 UTC
Last modified: 16 Feb 2018, 15:14:12 UTC

I read this thread because I occasionally have systems lock up when running BOINC. It is invariably due to overheating: usually the GPU, occasionally the CPU, and sometimes the disk drive goes bad, also due to overheating. I run 24/7 and also have older systems like you. I used to have Semprons, Athlons and Opterons but ended up with all Intel systems due to price drops and motherboards getting too old and popping too many capacitors too often.

Hard Disk Sentinel has a free version that can check your disk drive for problems.
eFMer's TThrottle can monitor your system, provide CPU and GPU throttling, and even alert you to a problem by email or text message.
I have a pair of HD 7850s and have rarely had a problem with them. They are in adjacent slots, but there is a large gap and air cooling works fine.

I suspect the problem is "fx 8350 thermal throttling". Google that and look at the problems and suggestions, especially in the overclocker forums, where they discuss stability issues with the 8350 and BIOS fixes for problems with ASUS motherboards and other MB manufacturers.

I used to turn off Cool'n'Quiet because BOINC was being throttled to death on my Tyan Opteron motherboards. I had to put in really huge heat sinks and eventually got rid of all of them due to old capacitors going bad. Your system is much newer. I read there is a BIOS feature called "line conditioning" or something like that. The overclockers use that to help stabilize their systems and have even reported associated BIOS bugs to ASUS.

I have a number of ATI graphics boards. They are owned by AMD now and they all run a lot cooler than similar nVidia boards of the same generation. I suspect the problem is the load on the cpu.
ID: 84807

mmonnin
Joined: 1 Jul 16
Posts: 146
United States
Message 84879 - Posted: 21 Feb 2018, 20:50:32 UTC - in response to Message 84801.  

I moved the PSU to another system and it also shut down shortly after being put under load. That second system didn't come close to the maximum power output of the PSU and had been running for months prior to the PSU swap (and ran for many more months straight after I put its original PSU back). At first I didn't think it was the PSU, but after RMAing it to EVGA the same computer has been running fine since.

For voltages, the best way really is a voltmeter if you have one. Otherwise HWMonitor, as mentioned, will suffice. Some of the other diagnostic options that were mentioned are easier to try than replacing components and could be done first.
ID: 84879

Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 84885 - Posted: 22 Feb 2018, 19:41:00 UTC - in response to Message 84769.  
Last modified: 22 Feb 2018, 19:43:31 UTC

Guessing here. Work done on house, possible dust and when all cores maxed out you get overheating problems? Sorry, my browser was playing up and it looks like any ideas I had have been covered.
ID: 84885

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.