BOINC "hangs" when network unavailable

Raistmer

Joined: 9 Apr 06
Posts: 302
Message 19939 - Posted: 3 Sep 2008, 18:45:01 UTC

When a host runs with a broken network connection (for example, with the network cable unplugged), boinc.exe (running as a service) begins to consume CPU (~2 h of CPU time over ~10 hours without network).
In this situation the BOINC manager can't connect to the boinc service and hangs (it doesn't just show a message like "Can't connect to core client"; it really hangs).

ID: 19939
Gundolf Jahn
Joined: 20 Dec 07
Posts: 1069
Germany
Message 19943 - Posted: 3 Sep 2008, 20:09:21 UTC - in response to Message 19942.  

What BOINC version and OS?

All versions I can remember. I'm sure for 5.8.16 and 5.10.45.
I've also read posts on the fora somewhere. Can't search now, Grey's Anatomy continues :-)

Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
ID: 19943
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19949 - Posted: 4 Sep 2008, 3:15:33 UTC - in response to Message 19942.  
Last modified: 4 Sep 2008, 3:20:46 UTC

What BOINC version and OS?

In my case it's BOINC 5.10.45.
The OSes are Vista Business edition, Windows Server 2003 x64 and Win98 (I've seen this behavior a few times already, and when it hit even my new quad with Vista I realised that this bug should be reported).

ADDON: The science app (I observed this with the Einstein@home project app) restarted continually because it was not receiving heartbeats from the core client.
ID: 19949
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19953 - Posted: 4 Sep 2008, 5:23:58 UTC - in response to Message 19950.  


In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang?

10 h is just an approximate value; don't rely on it.
The network cable was unplugged in the second half of the day, and the next morning I noticed the situation described (in the latest case, with Vista on the quad).
In the cases of the Core2 Duo under Win2003 x64 and the P-II under Win98, the network cable had possibly been unplugged for a few days.
So this probably occurs after a long network outage.
Of course, if the network setting was set to "Network activity suspended" before the cable was unplugged, everything works just fine.

ID: 19953
Gundolf Jahn
Joined: 20 Dec 07
Posts: 1069
Germany
Message 19956 - Posted: 4 Sep 2008, 8:16:40 UTC - in response to Message 19950.  

In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. One of the hosts is WinXP and BOINC 5.10.45. Two are Linux and BOINC 6.2.15. The other is Linux and BOINC 5.10.45. None are service install.

The hang occurs as soon as BOINC tries a network connection without a cable plugged in.

Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
ID: 19956
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 19961 - Posted: 4 Sep 2008, 10:08:02 UTC - in response to Message 19950.  

In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10.

Might it be possible that these 'about 10 hours' correlate with the time when the machine's IP lease expires? (I've occasionally had this problem too, starting some 2 years ago (and reported it a few times), but not in the last few months.)

Peter
ID: 19961
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19967 - Posted: 4 Sep 2008, 13:08:19 UTC

Well, maybe, although the Win98 host has a manually assigned IP, as I recall.
The other hosts have the default IP expiry time; I don't know how long that is for these OSes.

ID: 19967
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 19968 - Posted: 4 Sep 2008, 13:15:54 UTC - in response to Message 19967.  

Other hosts have default IP expire time, don't know how long is it for these OSes.

The expiration delay is defined by the DHCP server, not OS-side.

Peter
ID: 19968
Gundolf Jahn
Joined: 20 Dec 07
Posts: 1069
Germany
Message 19972 - Posted: 4 Sep 2008, 18:38:32 UTC - in response to Message 19964.  

@Gundolf,

Two of the Linux machines attempted a result upload while the cable was unplugged but the manager didn't hang. Maybe it doesn't occur on Linux? The Win machine in my test did not try to use the network while cable was unplugged.

Quite possible. I only have windows machines. The one I'm referring to runs NT4 :-)

I'm quite sure, though, that it's not the IP lease. I do have issues with that too, but never concurrently with BOINC manager hangs (as far as I remember :-)

Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
ID: 19972
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19976 - Posted: 4 Sep 2008, 21:11:45 UTC - in response to Message 19968.  

Other hosts have default IP expire time, don't know how long is it for these OSes.

The expiration delay is defined by the DHCP server, not OS-side.

Peter

Do you really think the DHCP service built into Windows is not part of the OS?
Maybe after that I should believe that Win is a true microkernel and modular OS? Even an RTOS, maybe? LoL (offtopic, of course, but... sorry :) )
ID: 19976
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19988 - Posted: 5 Sep 2008, 7:39:29 UTC - in response to Message 19982.  

Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control.


I didn't receive that popup.
When I kill the BOINC manager process and restart it, it hangs again.
The solution in that case was to stop the boinc service, then start BOINC manager, then disable network access, then restart the boinc service and the manager.
So the "hang" occurs in the boinc service IMHO, not in BOINC manager itself.
This is supported by the einstein@home app's behavior: it restarts perpetually with a message like "no heartbeat for 30 sec". Apparently the science app can't communicate with the boinc service either in that situation.
And don't forget the increased CPU consumption by the boinc.exe process. It looks like the service retries its communication attempts too often to be able to do anything else.
I will try to reproduce this situation in a more controlled environment.
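
The restart loop described in this post is what a heartbeat watchdog does when its peer goes quiet. A minimal sketch of the idea, assuming the 30-second timeout from the quoted message; the `HeartbeatMonitor` class and its methods are hypothetical names for illustration and are not BOINC's actual implementation:

```python
import time


class HeartbeatMonitor:
    """Remember when the last heartbeat arrived and report a quiet peer."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # seconds without a heartbeat before giving up
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat was just received."""
        self.last_beat = self.clock()

    def expired(self):
        """True once more than `timeout` seconds passed since the last beat."""
        return self.clock() - self.last_beat > self.timeout
```

An application built this way would call `beat()` whenever the client signals it and exit or restart once `expired()` returns true, which matches the symptom reported here if the client stops sending heartbeats.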
ID: 19988
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 19989 - Posted: 5 Sep 2008, 9:04:18 UTC - in response to Message 19988.  

I will try to reproduce this situation in more controlled environment.

Try to reproduce it with 6.2.18
I say this because 5.10 is no longer in development, no new(er) versions of 5.10 will be released.
ID: 19989
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 19990 - Posted: 5 Sep 2008, 9:08:50 UTC - in response to Message 19976.  

Other hosts have default IP expire time, don't know how long is it for these OSes.

The expiration delay is defined by the DHCP server, not OS-side.

Do you really think DHCP service built in Windows is not part of OS?

No, I do not, but please note that I wrote about the IP lease from the DHCP server (the server machine defines for which period of time the client machine (Windows in our case) is allowed to use this IP), not about the client machine's OS's DHCP service (acting as the DHCP client).

Maybe I should believe after that Win is true microkernel and modular OS? Even RTOS maybe? LoL (offtopic, of course, but... sorry :) )

I've no exact idea, what to compare Windows to, but sure no RTOS :-) (joke taken).

Peter
ID: 19990
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19991 - Posted: 5 Sep 2008, 10:06:48 UTC - in response to Message 19989.  

I will try to reproduce this situation in more controlled environment.

Try to reproduce it with 6.2.18
I say this because 5.10 is no longer in development, no new(er) versions of 5.10 will be released.

Sorry, I still don't like the idea of using 6.x versions, at least on production hosts.
But I guess that a good share of the 5.x codebase was reused in 6.x, so this bug could easily have carried over into the new version too.

ID: 19991
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 19992 - Posted: 5 Sep 2008, 10:15:48 UTC - in response to Message 19990.  
Last modified: 5 Sep 2008, 10:16:25 UTC

IP lease from DHCP server (the server machine defines, for which period of time is the client machine (Windows in our case) allowed to use this IP), not about the client machine's OS' DHCP service (acting as the DHCP client).

I mean that the host is on a local network, so it has a local IP, and this local IP was assigned by a server... The "server" role is played by Win2003 x64 in one case and WinXP x86 in the other. Both the client (with the DHCP client service) and the server (with the DHCP server service embedded in the system) are Windows :) That's what I mean :)

So all affected hosts are connected to the Internet via Windows' embedded NAT services and have no external IP. I have no idea whether that sheds any light on the issue under investigation, but ...

Maybe I should believe after that Win is true microkernel and modular OS? Even RTOS maybe? LoL (offtopic, of course, but... sorry :) )

I've no exact idea, what to compare Windows to, but sure no RTOS :-) (joke taken).

:)
ID: 19992
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 19993 - Posted: 5 Sep 2008, 10:31:20 UTC - in response to Message 19992.  

I mean that host in local network so it has local IP and this local IP was assigned by server...in "server" role plays Win2003 x64 in one case and WinXP x86 in another case. Both client (with DHCP client service) and server (with embedded into system DHCP server service) are windows :) That's what I mean :)
So all affected hosts are connected to Internet via Windows embedded NAT services and have no external IP.

OK, this way you can at least rule out the machines acting as DHCP server (which is what the NAT service is) being unavailable. Thus this is probably not a lost-IP problem (which I did have with BOINC in the past).

Peter
ID: 19993
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 20003 - Posted: 5 Sep 2008, 16:45:47 UTC - in response to Message 19999.  

I agree. It's the client (what you call the service) that hangs on your machine.

Are your hosts connected with WiFi or cables?

I had to stop my attempt to reproduce the problem before 10 hours but I will try again.

P.S. After more thought... the popup is generated by the manager. Since you don't get the popup, the client is obviously hanging before it can tell the manager there is no Internet connection.

I call the client a service because the BOINC core client runs as a Windows service on those hosts (except Win98 of course; there boinc.exe runs just as a separate console app).

The hosts are connected via cable. Unplugging the cable leads to that situation (after a long unplug; how long needs more investigation).

I usually don't run BOINC manager; only boinc.exe runs, as a service.
So, after boinc.exe got into trouble, BOINC manager (which was launched later) met a wounded boinc service. Why BOINC manager doesn't use some timeout for communication with the service (core client) in this case, I don't know. If the service is stopped, the manager shows a popup like "can't connect to client". But while the service is running, the manager hangs.
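
The difference between a stopped service (immediate error) and a running but unresponsive one (indefinite wait) can be illustrated with a socket timeout. This is a minimal sketch and not the manager's actual code; `probe` is a hypothetical helper, and `payload` stands in for whatever request the peer would normally answer:

```python
import socket


def probe(host, port, payload=b"", timeout=2.0):
    """Classify a TCP peer as 'ok', 'closed', 'timeout', 'refused' or 'error'."""
    try:
        # The timeout applies to the connect and to every later send/recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if payload:
                sock.sendall(payload)
            data = sock.recv(1)          # wait for the first byte of a reply
            return "ok" if data else "closed"
    except socket.timeout:
        return "timeout"                 # peer accepted but never answered
    except ConnectionRefusedError:
        return "refused"                 # nothing is listening on that port
    except OSError:
        return "error"                   # any other network failure
```

Assuming the client listens on BOINC's default GUI RPC port, a call would look like `probe("127.0.0.1", 31416, payload=...)`. Without a valid request as payload, a healthy peer that simply waits for input also yields "timeout", so the result is only meaningful with a request the peer is expected to answer.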

ID: 20003
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 20041 - Posted: 8 Sep 2008, 15:49:31 UTC - in response to Message 20029.  

BOINC has been running on my Win XP BOINC 5.10.45 system with the network cable unplugged from the hardware router for 25 hours. No hangs, no problems.

Raistmer, have you tried turning on any of the debug options described here? Maybe try these just to see what info turns up:

<file_xfer>
<file_xfer_debug>
<http_debug>
<http_xfer_debug>
<network_status_debug>


There was some delay: AP has to finish before further experiments.
I will try with these options enabled.
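
The flags listed in the quote are log flags that the client reads from a `cc_config.xml` file in the BOINC data directory. A minimal sketch that generates such content, assuming the usual `<cc_config><log_flags>` layout; the `build_cc_config` helper is hypothetical, and where the file goes and how the client re-reads it depend on the installation:

```python
import xml.etree.ElementTree as ET

# Flag names taken from the quoted suggestion above.
FLAGS = [
    "file_xfer",
    "file_xfer_debug",
    "http_debug",
    "http_xfer_debug",
    "network_status_debug",
]


def build_cc_config(flags):
    """Return cc_config.xml content with each given log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config(FLAGS))
```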

ID: 20041
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 20064 - Posted: 9 Sep 2008, 22:58:01 UTC
Last modified: 9 Sep 2008, 22:59:41 UTC

Well, after 28 h with the cable unplugged, BOINC manager can still connect to the service.
On the first try (after ~3 h) it showed a popup box; on the second and third attempts it just opened the main window.
A hard-to-reproduce (or non-reproducible) bug, it seems.

But over 28 h of running without network, boinc.exe took 1 h 14 min of CPU.
That is, ~4% of CPU goes to the boinc service itself.
Almost every second BOINC tries to reconnect for one of the results.
There are ~100 results ready to upload already (SETI produces very short tasks now). These retries take too much CPU time, it seems.
Maybe it would be worth checking whether the project is available once per minute (or less often), not every second?
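
The ~4% figure follows from the numbers in this post, and the suggestion to check less often is commonly implemented as a capped exponential backoff. A sketch of both; the function names and the 1 s / 60 s parameters are illustrative assumptions, not BOINC's retry policy:

```python
def cpu_share(cpu_seconds, wall_seconds):
    """Fraction of one CPU used over the given wall-clock interval."""
    return cpu_seconds / wall_seconds


def backoff_delays(first=1.0, factor=2.0, cap=60.0, attempts=8):
    """Delays between retries: grow by `factor` each time, never above `cap`."""
    delay = first
    delays = []
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


if __name__ == "__main__":
    # 1 h 14 min of CPU over 28 h of wall-clock time.
    share = cpu_share(1 * 3600 + 14 * 60, 28 * 3600)
    print(f"CPU share: {share:.1%}")
    print(backoff_delays())
```

With these defaults the delays grow from one second to the one-minute ceiling within seven attempts, so a long outage costs roughly one attempt per minute instead of one per second.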
ID: 20064
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 20066 - Posted: 10 Sep 2008, 0:24:48 UTC

You get the once-a-second retry if you have your reminder option set to zero.
Only in 6.3 does a reminder of zero really mean off.
ID: 20066

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.