tracking down a Problem

NaRoon

Joined: 17 Feb 15
Posts: 2
United States
Message 72572 - Posted: 21 Sep 2016, 3:14:07 UTC

Hello everyone,
I've been having a strange problem and I've tracked it down to BOINC and/or one of the projects I run. I have 5 PCs and one laptop running BOINC and the following projects: Asteroids@home, BURP, CAS@HOME, Climate Prediction, Collatz Conjecture, Cosmology@Home, Einstein@Home, Malaria Control, MilkyWay@home, POEM@HOME, PrimeGrid, Rosetta@Home, SETI@Home, theSkyNet POGS, Universe@Home, Volpex, yoyo@home, and World Community Grid. The problem only affects the two i7 systems (a desktop i7 920 and a laptop i7 2670QM). The other systems, 2 AMD and 1 Core 2 Duo, are not having any issues. As far as I've been able to track it down, the i7 systems are crashing with several different types of errors. Sometimes the graphics card halts (ATI 390X and Intel HD 3000), other times I get a virtual machine error, and at yet other times all I get in the event log is a statement that the system rebooted unexpectedly. When I stop BOINC the systems run fine. I have stressed both systems with Intel burn-in and Prime95. I've also stressed the video cards, and everything tested fine.
I need some help with ideas on how to shorten the list of possible culprits.

Thx in advance
Na'Roon
Juha
Volunteer developer
Volunteer tester
Help desk expert
Joined: 20 Nov 12
Posts: 801
Finland
Message 72586 - Posted: 22 Sep 2016, 18:06:51 UTC - in response to Message 72572.  

For starters, you could suspend all but one project and see if that project creates problems. Run that one project for a while (days, a week; you decide what's long enough), then move on to the next project. It's possible that more than one project is causing the problems.

If you find a problem project and it runs more than one "sub-project", or it has more than one application (CPU, GPU, etc.), you can apply the same process of elimination within that project.
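The one-project-at-a-time elimination described above can be scripted rather than clicked through in the Manager. The following is only a minimal sketch: it builds the `boinccmd --project URL suspend` / `resume` command lines for each round without running them. The project URLs are hypothetical placeholders (use the URLs your own client lists), and the names `isolation_commands` and `isolation_schedule` are made up for this illustration.

```python
import shlex


def isolation_commands(project_urls, keep_running):
    """Build boinccmd command lines that leave only one project running.

    Every project except `keep_running` is suspended; `keep_running`
    is resumed in case an earlier round suspended it.
    """
    if keep_running not in project_urls:
        raise ValueError(f"{keep_running!r} is not in the project list")
    commands = []
    for url in project_urls:
        action = "resume" if url == keep_running else "suspend"
        commands.append(["boinccmd", "--project", url, action])
    return commands


def isolation_schedule(project_urls):
    """Return one (project, commands) round per attached project."""
    return [(url, isolation_commands(project_urls, url)) for url in project_urls]


if __name__ == "__main__":
    # Hypothetical placeholder URLs, not real project addresses.
    projects = [
        "https://project-a.example/",
        "https://project-b.example/",
        "https://project-c.example/",
    ]
    for kept, cmds in isolation_schedule(projects):
        print(f"# Round: only {kept} is allowed to run")
        for cmd in cmds:
            print(" ".join(shlex.quote(part) for part in cmd))
```

Run the printed commands for one round, leave the machine alone for as long as you consider a fair test, note whether it crashed, then move on to the next round.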
Coleslaw
Joined: 23 Feb 12
Posts: 198
United States
Message 72589 - Posted: 22 Sep 2016, 19:11:01 UTC

The second thing you should do is give us a rundown of your BOINC settings and the hardware you have in these two systems. The OS is important too.

You see, there are several factors in play. If you are running all threads plus the GPU, you could be stressing the system in ways that the stress tests you ran do not fully cover. For example, having all your threads at 100% with the GPU running as well is a lot more stress than just the CPU at 100%. Now throw in the fact that BOINC has hard disk resource needs: the more work units are running, the more I/O is sent to the hard disk. You mention nothing about RAM. If you are short on RAM, your system will use more virtual memory, which adds to that I/O. Since you have a project that uses VirtualBox, it is going to want some additional resources just for the VM. How many work units are you running on the GPU? Just one, or more? You may be having driver issues because you are overloading the system. Since it isn't just one error or issue, I would say you need to plan better how you use the hardware.

Reserve a full thread/core just to feed the GPU. I would save another thread/core just for the system to use to help manage things, especially if BOINC is going to pick up VirtualBox work units. A laptop will have a harder time with heat. What temps are you showing while running?
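To put a number on that: BOINC's computing preferences include a "use at most N% of the CPUs" setting, so reserving threads means picking the right percentage. Here is a rough sketch of the arithmetic. The function name `cpu_percent_for` is made up for this illustration, and it assumes the client truncates threads × N / 100 to a whole number of usable CPUs, so check the result against what your client actually runs. Both the i7 920 and the i7 2670QM expose 8 threads with Hyper-Threading enabled.

```python
def cpu_percent_for(total_threads, reserved_threads):
    """Smallest whole percentage that leaves `reserved_threads` free.

    Assumption: usable CPUs = floor(total_threads * percent / 100),
    so the percentage is rounded up to survive that truncation.
    """
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    if not 0 <= reserved_threads < total_threads:
        raise ValueError("reserved_threads must be in [0, total_threads)")
    usable = total_threads - reserved_threads
    # Integer ceiling of usable * 100 / total_threads.
    return (usable * 100 + total_threads - 1) // total_threads


if __name__ == "__main__":
    threads = 8    # i7 920 and i7 2670QM: 4 cores, 8 threads
    reserved = 2   # one thread to feed the GPU, one for the system
    percent = cpu_percent_for(threads, reserved)
    print(f"Use at most {percent}% of the CPUs "
          f"({threads - reserved} of {threads} threads for BOINC tasks)")
```

If the laptop still runs hot with two threads reserved, reserving more is the same calculation with a larger `reserved` value.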

Seriously... we need a lot more information on EACH setup and possibly even which projects are actually processing work at the time of the problem. Event logs can sometimes help.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.