Hardware problems running BOINC finally debugged

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 104050 - Posted: 21 Apr 2021, 18:04:34 UTC

I have a system with a 6-core Xeon (12 threads), 24 GB RAM and a pair of GTX 1060s running WCG Covid apps that was consistently rebooting. I did not have this problem before with Einstein, Milkyway and WCG. I had swapped out a single RX 570 for the pair of 1060s for testing purposes.

I found the problem was due to the EVGA motherboard not handling transients and/or poor power supply regulation.

I had 6 CPU tasks suspended. When I resumed all 6 using a single command from BoincTasks, the system rebooted instantly. I connected the system to a wattmeter and powered it back up. As I resumed each CPU task, one at a time, the wattage jumped by 13 watts then settled down to plus 5 watts. I am guessing that surge was not handled properly by the X5675 power regulator.
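
A rough sketch of the arithmetic behind that guess, using only the wattmeter readings above. It assumes the per-task surges simply add up when all tasks resume at the same instant, which is an assumption and not something that was measured; a wall wattmeter also averages its readings, so the real transient could be larger.

```python
# Back-of-envelope estimate of the load step when resuming CPU tasks.
# Per-task figures are the wattmeter readings quoted in the post;
# adding them linearly for a simultaneous resume is an assumption.

SURGE_WATTS_PER_TASK = 13   # initial jump seen when one task resumes
STEADY_WATTS_PER_TASK = 5   # what each task settles to afterwards
TASKS = 6                   # tasks resumed with a single command


def step_load(tasks: int, watts_per_task: int) -> int:
    """Total added load if every task contributes the same wattage."""
    return tasks * watts_per_task


peak = step_load(TASKS, SURGE_WATTS_PER_TASK)
steady = step_load(TASKS, STEADY_WATTS_PER_TASK)

print(f"Estimated surge when resuming all {TASKS} at once: {peak} W")
print(f"Estimated steady-state increase: {steady} W")
```

Under that assumption the all-at-once resume is a step of roughly 78 W that settles to about 30 W, compared with 13 W steps when resuming one task at a time.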

I then looked at the pair of GTX 1060s using TechPowerUp's GPU-Z. One of the GPUs (on the left side) went immediately into the PerfCap Reason warning: the blue color. The other GPU was OK until the Memory Controller Load went to 55%, then that warning kicked in. The warning is that the performance of the GPU is "Limited by Operating Voltage". I looked at another system I was running, and the slot voltage and the 6-pin voltage were consistently 12.1. This system was in the mid to low 11 volts.

Hope this helps someone.

ID: 104050
Ian&Steve C.
Joined: 24 Dec 19
Posts: 228
United States
Message 104052 - Posted: 21 Apr 2021, 18:32:16 UTC - in response to Message 104050.  
Last modified: 21 Apr 2021, 18:35:10 UTC

ATX 12V spec range is 11.40 V to 12.60 V, but in my experience things generally aren't happy, or exhibit problems, below 11.7V. a component needs X amount of *power*. it can get that through a combination of voltage and current. if one is low, it'll pull more from the other. so with low voltage, it's pulling more current, and it wouldn't surprise me to see a higher chance of hitting the PSU's OCP in this case if you were already close to the limit.
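
to put rough numbers on that, here is a minimal sketch of the I = P / V relationship. it treats the component as a constant-power load, and the 150 W figure is a made-up example, not a measurement from this system.

```python
# Current drawn by a constant-power load at different 12 V rail voltages.
# LOAD_WATTS is a hypothetical figure chosen only for illustration.

def current_for_power(power_watts: float, rail_volts: float) -> float:
    """Return the current in amps a constant-power load pulls (I = P / V)."""
    if rail_volts <= 0:
        raise ValueError("rail voltage must be positive")
    return power_watts / rail_volts


LOAD_WATTS = 150.0  # hypothetical load, not from the thread

# 12.1 V is the healthy reading mentioned above, 11.7 V the rule-of-thumb
# floor, and 11.4 V the bottom of the spec range.
for volts in (12.1, 11.7, 11.4):
    amps = current_for_power(LOAD_WATTS, volts)
    print(f"{volts:5.1f} V -> {amps:5.2f} A")
```

the lower the rail sags, the more current the same load pulls, which is what moves you closer to the over-current trip point.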

check out the 3.3V and 5V numbers too. if they are also very low under load, it's most likely that the PSU is wearing out. replace with a new one and overspec to give you more headroom. I prefer to stick to Platinum or better rated PSUs, as along with better power efficiency they are also generally just better built than the cheaper lower efficiency models.
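
if you log the readings under load, a check like the sketch below can flag a sagging rail. it assumes the same +/-5% band behind the 11.40 to 12.60 V range also applies to the 3.3V and 5V rails, and the sample readings are hypothetical.

```python
# Check measured rail voltages against a +/-5% tolerance band.
# The 5% figure matches the 11.40-12.60 V range quoted for the 12 V rail;
# applying it to 3.3 V and 5 V as well is an assumption of this sketch.

TOLERANCE = 0.05


def rail_limits(nominal: float) -> tuple[float, float]:
    """Return the (low, high) acceptable voltages for a nominal rail."""
    return nominal * (1 - TOLERANCE), nominal * (1 + TOLERANCE)


def check_rail(nominal: float, measured: float) -> str:
    """Classify a measured voltage as LOW, HIGH or OK for its rail."""
    low, high = rail_limits(nominal)
    if measured < low:
        return "LOW"
    if measured > high:
        return "HIGH"
    return "OK"


# Hypothetical readings taken under load: nominal rail -> measured volts.
readings = {12.0: 11.3, 5.0: 4.95, 3.3: 3.28}

for nominal, measured in readings.items():
    low, high = rail_limits(nominal)
    status = check_rail(nominal, measured)
    print(f"{nominal:4.1f} V rail: {measured:5.2f} V "
          f"(allowed {low:.2f} to {high:.2f}) {status}")
```

a rail can still be inside this band and cause trouble, per the 11.7V rule of thumb above, so treat "OK" as "within spec" and not as "healthy".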

but your perfcap reason of VRel has nothing to do with the PSU supply voltage. that is an internal mechanism of the GPU itself not allowing more voltage into the GPU core. it's a function of the voltage curve built into the GPU. you can probably play around with voltage if you wanted, but you will always run into some kind of performance cap, Voltage, power, thermals, etc.
ID: 104052
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 104076 - Posted: 23 Apr 2021, 14:13:45 UTC - in response to Message 104052.  
Last modified: 23 Apr 2021, 14:28:00 UTC

I was unable to check the 3.3 volt rail but I may look at it later. I am not sure where to test that voltage on the motherboard. I replaced the Seasonic Focus 650 Bronze (non-modular) with a Seasonic Focus 850 Platinum and that appears to solve the problem.

I found that when I pulled the pair of GTX 1060s and put in a GTX 1070 Ti the system was unstable even with no CPU tasks running. It crashed within seconds of starting BOINC. This system, even with all CPUs working at 100% and the pair of GTX 1060s, never pulled over 400 watts and was usually good for several hours before rebooting.

That old Bronze power supply must have a problem even though it is rated at 650. I checked for ripple using an AC voltmeter on the 12 and 5 but there was nothing obvious. Maybe the 3.3 volt had the problem?

I am currently running the 1070 Ti with 8 or so CPU tasks and all seems OK. The voltage shown by CPU-ID is the same value as shown when using that older power supply.

ID: 104076
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 104078 - Posted: 23 Apr 2021, 17:31:05 UTC

3.3V is on the 24 pin connector and also on any SATA connector.
ID: 104078
Ian&Steve C.
Joined: 24 Dec 19
Posts: 228
United States
Message 104079 - Posted: 23 Apr 2021, 20:43:02 UTC - in response to Message 104078.  

Quoting Keith Myers: "3.3V is on the 24 pin connector and also on any SATA connector."

to add to this, if you're gonna measure it, make sure it's under system load. it could report totally normal voltage with no load. but I guess I'm spoiled. most of my systems have the ability to look at this real-time via IPMI. you might be able to pick it up in software using HWiNFO if your motherboard has that functionality. https://www.hwinfo.com/

but it sounds like Joseph already replaced the PSU anyway.

and you'd need an oscilloscope to really measure the ripple. the quick and dirty method of using an OTS multimeter on the AC setting isn't nearly sensitive enough. if you can pick up ripple from a computer PSU this way it's WAYYY too much.
ID: 104079
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 104080 - Posted: 23 Apr 2021, 21:36:06 UTC - in response to Message 104079.  

Yes, looking for ripple with a normal multimeter is pointless. Use a scope or a scopemeter type multimeter to see distortion or ripple on a DC waveform.
ID: 104080
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 104086 - Posted: 24 Apr 2021, 15:45:04 UTC - in response to Message 104078.  
Last modified: 24 Apr 2021, 15:48:05 UTC

Quoting Keith Myers: "3.3V is on the 24 pin connector and also on any SATA connector."

I reconnected the old power supply to run some tests and the system is working fine as if there was no problem to start with.

I am guessing the 4+4 motherboard connector was not making good contact. The radiator of the CPU cooler presses hard against the wiring, as that connector is directly underneath the radiator. Since this was not a modular power supply, any contact problem has got to be on the mobo or video board. Alternatively, stress on the cables can open a solder joint at the connector. Usually smoke shows up when that happens. Right now the cables are all unstressed as the power supply is hanging above the system. The system is fully loaded and working fine. I am going to poke the cables around, use a magnifier to examine the contacts for any discoloration, and put it back together.

When I tried to measure "ripple" I got 0.025 volts AC. I also got the same 0.025 with the voltmeter leads hanging loose in the air.
ID: 104086

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.