BOINC Core client looping on one CPU/core

Udo
Joined: 17 Nov 07
Posts: 5
Message 13921 - Posted: 17 Nov 2007, 9:52:34 UTC

Starting with BOINC Version 5.8.x I noticed that the BOINC Client is sometimes looping and no longer processing the application.

I have several computers running Windows 2003, Windows 2000 Pro and Win XP Pro.
I always install BOINC as a service so that it runs directly after the computer starts; some of the machines also run unattended as servers.

Only on my Intel Win 2003 computers (at E@H these are hosts 470358, 589361, 589472, 1034187, 1034197) did the BOINC client get 'stuck' after upgrading to 5.8.16 (and also after installing 5.10.20).
It was consuming 100% of the CPU/core and the BOINC applications didn't get any CPU.
Starting the BOINC GUI caused the GUI to 'freeze': no response from the window, no reaction.

The only possibility was to stop the service.
After stopping the service I could start the GUI without problems and the tasks / WUs were running without any problems.
Starting the service again returned the same problem: client using 100% of CPU/core, WU getting no CPU, BOINC GUI 'freezing'.

On my unattended hosts/servers I couldn't let the GUI run all the time.
The only possibility was to downgrade to 5.4.x (5.4.11) which doesn't show this behaviour.

Unfortunately the same problem still exists in the 5.10.x version.

I have one AMD Win2003 box (single core), which is even more unattended (60 km away), and I get there only from time to time.
I once had the situation that this server didn't return WUs for several days (so a WU passed its deadline), but after that it started again.
I don't know if the server was rebooted and that fixed the hang, or if it was a totally different problem.

If it was the same problem on the AMD box, then the problem exists on all of my Win2003 servers; if it was a different error, then the problem affects only my Win2003 server boxes with multiple CPUs/cores.

Udo

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 13929 - Posted: 17 Nov 2007, 16:38:20 UTC - in response to Message 13921.  

Was boinc.exe using all the CPU, or boincmgr.exe? What project(s) are you running? Do you notice the hang after any particular event, like after a workunit finishes or after getting more work, or anything like that? Anything interesting in the stdoutdae.txt log?

BOINC Manager hanging when the client hangs is a bug, but it needs quite a redesign of the manager-client communication code to solve...

Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 13941 - Posted: 19 Nov 2007, 0:52:04 UTC

Just a very vague idea because I have noticed something similar twice lately :

Does BOINC behave like this just after a benchmark ran, i.e. with results that had been paused in order to run the benchmark?


The problem I noticed isn't exactly comparable, because I'm using an older BOINC client, and on the box where it happened the core client did not eat CPU time at all. Instead one CPU was idle: one result was running normally, the other was idle.

The core client showed both as running; BOINCview notices when a process that is set active doesn't get any CPU time.

As I'm not on a current client, it makes no sense to open a ticket, it might be fixed in later clients.

Udo
Joined: 17 Nov 07
Posts: 5
Message 13946 - Posted: 19 Nov 2007, 8:56:18 UTC - in response to Message 13929.  

Was boinc.exe using all the CPU, or boincmgr.exe? What project(s) are you running? Do you notice the hang after any particular event, like after a workunit finishes or after getting more work, or anything like that? Anything interesting in the stdoutdae.txt log?

BOINC Manager hanging when the client hangs is a bug, but it needs quite a redesign of the manager-client communication code to solve...


boincmgr.exe wasn't running at all... it's a server where normally no one is logged on.
Only boinc.exe was running.

On this host BOINC is only connected to Einstein@Home.

Unfortunately I didn't notice anything (see above: normally no one is logged on; I only noticed that the host wasn't communicating with the project at its usual interval). But as you can see, the benchmark started AFTER restarting BOINC, because the benchmark interval had been exceeded.
There is no further message between '2007-11-09 05:16:23' and '2007-11-12 08:40:12'.
I don't know at which time boinc started to loop.

STDOUTDAE.TXT:
2007-11-09 02:45:32 [Einstein@Home] Sending scheduler request: To fetch work
2007-11-09 02:45:32 [Einstein@Home] Requesting 48 seconds of new work, and reporting 1 completed tasks
2007-11-09 05:16:20 [Einstein@Home] Computation for task h1_0416.00_S5R2__63_S5R3a_0 finished
2007-11-09 05:16:23 [Einstein@Home] [file_xfer] Started upload of file h1_0416.00_S5R2__63_S5R3a_0_0
2007-11-12 08:40:12 [---] Running CPU benchmarks
2007-11-12 08:40:12 [---] Suspending computation - running CPU benchmarks
2007-11-12 08:40:15 [Einstein@Home] [file_xfer] Finished upload of file h1_0416.00_S5R2__63_S5R3a_0_0
2007-11-12 08:40:15 [Einstein@Home] [file_xfer] Throughput 87019 bytes/sec
2007-11-12 08:40:15 [Einstein@Home] Scheduler RPC succeeded [server version 601]
2007-11-12 08:40:15 [Einstein@Home] Got server request to delete file h1_0416.00_S5R2
...
2007-11-12 08:40:15 [Einstein@Home] Got server request to delete file l1_0416.25_S5R2
2007-11-12 08:40:15 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-11-12 08:40:15 [Einstein@Home] Reason: requested by project
2007-11-12 08:40:18 [Einstein@Home] [file_xfer] Started download of file skygrid_0160Hz_S5R3.dat
2007-11-12 08:40:18 [Einstein@Home] [file_xfer] Started download of file h1_0150.25_S5R2
2007-11-12 08:40:20 [Einstein@Home] [file_xfer] Finished download of file skygrid_0160Hz_S5R3.dat
2007-11-12 08:40:20 [Einstein@Home] [file_xfer] Throughput 33198 bytes/sec
...
2007-11-12 08:40:41 [Einstein@Home] [file_xfer] Finished download of file l1_0150.45_S5R2
2007-11-12 08:40:41 [Einstein@Home] [file_xfer] Throughput 958525 bytes/sec
2007-11-12 08:40:44 [---] Benchmark results:
2007-11-12 08:40:44 [---] Number of CPUs: 1
2007-11-12 08:40:44 [---] 1834 floating point MIPS (Whetstone) per CPU
2007-11-12 08:40:44 [---] 3371 integer MIPS (Dhrystone) per CPU
2007-11-12 08:40:46 [---] Resuming computation
2007-11-12 08:40:46 [Einstein@Home] Starting h1_0150.25_S5R2__18_S5R3a_1
2007-11-12 08:40:46 [Einstein@Home] Starting task h1_0150.25_S5R2__18_S5R3a_1 using einstein_S5R3 version 415

it seems to be in the middle of data transfer...
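
The silent stretch in a log like the one above can be located mechanically. The following is only a minimal sketch, assuming every message line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt; the function name `largest_gap` and the `sample` list are illustrative and not part of BOINC. The sample lines are copied from the excerpt.

```python
from datetime import datetime

def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silence between
    consecutive timestamped lines, or None if fewer than two lines parse."""
    stamped = []
    for line in lines:
        try:
            # The first 19 characters are assumed to be "YYYY-MM-DD HH:MM:SS".
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip blank lines, "..." elisions and other non-log text
        stamped.append((ts, line.rstrip()))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

sample = [
    "2007-11-09 02:45:32 [Einstein@Home] Sending scheduler request: To fetch work",
    "2007-11-09 05:16:20 [Einstein@Home] Computation for task h1_0416.00_S5R2__63_S5R3a_0 finished",
    "2007-11-09 05:16:23 [Einstein@Home] [file_xfer] Started upload of file h1_0416.00_S5R2__63_S5R3a_0_0",
    "2007-11-12 08:40:12 [---] Running CPU benchmarks",
    "...",
]

gap, before, after = largest_gap(sample)
print(gap)     # 3 days, 3:23:49
print(before)  # last message before the silence
print(after)   # first message after it
```

To run it against a real log, the list can be replaced by the lines of the file, for example `largest_gap(open(path))`, where `path` points at the client's log file. The last message before the gap shows what the client was doing when it stopped logging.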

Additional info: I'm connecting to the internet via a proxy server.

Udo

Thyme Lawn
Joined: 2 Sep 05
Posts: 103
United Kingdom
Message 13952 - Posted: 19 Nov 2007, 12:40:22 UTC

It looks like something caused BOINC networking to stop working. For some reason the start of the periodic benchmark at 08:40:12 caused networking to kick back into life (your scheduler request at 02:45:32 and upload at 05:16:20 didn't get any response until 08:40:15).

Udo
Joined: 17 Nov 07
Posts: 5
Message 13957 - Posted: 19 Nov 2007, 14:41:47 UTC - in response to Message 13952.  

It looks like something caused BOINC networking to stop working. For some reason the start of the periodic benchmark at 08:40:12 caused networking to kick back into life (your scheduler request at 02:45:32 and upload at 05:16:20 didn't get any response until 08:40:15).


The problem did NOT resolve itself!
I had to stop the boinc service!

Running boincgui without the service worked well, but starting the service caused boinc.exe to loop.

The upload happened while BOINC was running for some hours with boincgui.

Udo

Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 13960 - Posted: 19 Nov 2007, 21:23:00 UTC - in response to Message 13952.  

It looks like something caused BOINC networking to stop working. For some reason the start of the periodic benchmark at 08:40:12 caused networking to kick back into life ....


It used to be just the other way around: the benchmark stops networking, and networking returns to its previous state after the benchmark is done. As file transfers sometimes eat CPU time, that has not really been a bad solution.

There is a global setting for "Use network only between the hours of ...", could it be that?

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 13962 - Posted: 19 Nov 2007, 21:41:54 UTC - in response to Message 13960.  

It looks like something caused BOINC networking to stop working. For some reason the start of the periodic benchmark at 08:40:12 caused networking to kick back into life ....


It used to be just the other way around: the benchmark stops networking, and networking returns to its previous state after the benchmark is done. As file transfers sometimes eat CPU time, that has not really been a bad solution.

There is a global setting for "Use network only between the hours of ...", could it be that?

The upload was in progress and the client hung. He had to kill the process. On restart, it ran benchmarks.

Now, I find it quite strange that it didn't output something like "Starting BOINC client version 5.10.whatever for windows_intelx86" on startup o_O


Udo
Joined: 17 Nov 07
Posts: 5
Message 14026 - Posted: 22 Nov 2007, 12:49:30 UTC - in response to Message 13962.  


...
The upload was in progress and the client hung. He had to kill the process. On restart, it ran benchmarks.

Now, I find it quite strange that it didn't output something like "Starting BOINC client version 5.10.whatever for windows_intelx86" on startup o_O


...well, what else can I do to get more debugging information?
One host is still running 5.10.20.

Udo

Udo
Joined: 17 Nov 07
Posts: 5
Message 14133 - Posted: 27 Nov 2007, 8:21:49 UTC
Last modified: 27 Nov 2007, 8:22:28 UTC

It does indeed seem to be a problem on Win2003 Server!

On the host 60 km away that I mentioned, a WU got stuck again.
It's an AMD Sempron, so the problem is related neither to Intel nor to the dual-core / hyperthreading capabilities of the CPU.

This evening I will go to the remote location and investigate further.

Udo

Edit: corrected a typo...

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.