|
Message boards : Questions and problems : BOINC "hangs" when network unavailable
| Author | Message |
|---|---|
|
When a host runs with a broken network connection (for example, with the network cable unplugged), boinc.exe (running as a service) begins to consume CPU (~2 h of CPU time over ~10 hours without network). | |
| ID: 19939 | | |
|
What BOINC version and OS? | |
| ID: 19942 | | |
What BOINC version and OS? All versions I can remember. I'm sure about 5.8.16 and 5.10.45. I've also read posts about it on the forums somewhere. Can't search now, Grey's Anatomy continues :-) Regards, Gundolf ____________ Computers aren't everything in life. (Just kidding) | |
| ID: 19943 | | |
What BOINC version and OS? In my case it's BOINC 5.10.45. The OSes are Vista Business edition, Windows Server 2003 x64 and Win98 (I have seen this behavior a few times already, and when it hit even my new quad with Vista I realised that this bug should be reported). ADDON: The science app (I observed this with the Einstein@home project app) restarted continually because it was not receiving heartbeats from the core client. ____________ | |
| ID: 19949 | | |
What BOINC version and OS? In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. One of the hosts is WinXP and BOINC 5.10.45. Two are Linux and BOINC 6.2.15. The other is Linux and BOINC 5.10.45. None are service install. | |
| ID: 19950 | | |
10 h is just an approximate value; don't rely on it. The network cable was unplugged in the second half of the day, and the next morning I noticed the situation described (in the latest case, with Vista on the quad). In the case of the Core2 Duo under Win2003 x64 and the P-II under Win98, the network cable had possibly been unplugged for a few days. So this probably occurs after a long network outage. Of course, if the network setting was "Network activity suspended" before the cable was unplugged, everything works just fine. ____________ | |
| ID: 19953 | | |
In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. One of the hosts is WinXP and BOINC 5.10.45. Two are Linux and BOINC 6.2.15. The other is Linux and BOINC 5.10.45. None are service install. The hang occurs as soon as BOINC tries a network connection without a cable plugged in. Regards, Gundolf ____________ Computers aren't everything in life. (Just kidding) | |
| ID: 19956 | | |
In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. Might it be possible that these 'about 10 hours' correlate with the time when the machine's IP lease expires? (I've occasionally had this problem too over the past 2 years or so (and reported it a few times), but not in recent months.) Peter | |
| ID: 19961 | | |
In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. @Gundolf, Two of the Linux machines attempted a result upload while the cable was unplugged but the manager didn't hang. Maybe it doesn't occur on Linux? The Win machine in my test did not try to use the network while cable was unplugged. @Peter, Yes, it might coincide with IP lease expiry. I have expiry set to 1 week at the moment. Later today I'll set expiry to the minimum, increase work cache and test again. | |
| ID: 19964 | | |
|
Well, maybe, although the Win98 host has a manually assigned IP, as I recall. | |
| ID: 19967 | | |
Other hosts have default IP expire time, don't know how long is it for these OSes. The expiration delay is defined by the DHCP server, not OS-side. Peter | |
| ID: 19968 | | |
@Gundolf, Quite possible. I only have Windows machines. The one I'm referring to runs NT4 :-) I'm quite sure, though, that it's not the IP lease. I do have issues with that too, but never concurrently with BOINC manager hangs (as far as I remember :-) Regards, Gundolf ____________ Computers aren't everything in life. (Just kidding) | |
| ID: 19972 | | |
Other hosts have default IP expire time, don't know how long is it for these OSes. Do you really think the DHCP service built into Windows is not part of the OS? Maybe after that I should believe that Win is a true microkernel and modular OS? Even an RTOS maybe? LoL (off-topic, of course, but... sorry :) ) ____________ | |
| ID: 19976 | | |
|
Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control. | |
| ID: 19982 | | |
Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control. I didn't receive that popup. When I kill the BOINC manager process and restart it, it hangs again. The solution in that case was to stop the boinc service, then start BOINC manager, then disable network access, then restart the boinc service and manager. So the "hang" occurs in the boinc service IMHO, not in BOINC manager itself. This is supported by the einstein@home app's behavior: it perpetually restarts with a message like no heartbeat for 30 sec. Apparently the science app can't communicate with the boinc service either in that situation. And don't forget the increased CPU consumption by the boinc.exe process. It looks like the service retries its communication attempts too often to be able to do anything else. I will try to reproduce this situation in a more controlled environment. ____________ | |
| ID: 19988 | | |
I will try to reproduce this situation in more controlled environment. Try to reproduce it with 6.2.18. I say this because 5.10 is no longer in development; no new(er) versions of 5.10 will be released. ____________ Jord -The BOINC FAQ Service -CUDA/Stream FAQ Courtesy starts with your first post of the thread. | |
| ID: 19989 | | |
Other hosts have default IP expire time, don't know how long is it for these OSes. No, I do not, but please note that I wrote about the IP lease from the DHCP server (the server machine defines for how long the client machine (Windows in our case) is allowed to use this IP), not about the client machine's OS DHCP service (acting as the DHCP client). Maybe I should believe after that Win is true microkernel and modular OS? Even RTOS maybe? LoL (offtopic, of course, but... sorry :) ) I have no exact idea what to compare Windows to, but surely not an RTOS :-) (joke taken). Peter | |
| ID: 19990 | | |
I will try to reproduce this situation in more controlled environment. Sorry, I still don't like the idea of using 6.x versions, at least on production hosts. But I guess that a good share of the 5.x codebase was reused in the 6.x version, so this bug could easily have carried over to the new version too. ____________ | |
| ID: 19991 | | |
IP lease from DHCP server (the server machine defines, for which period of time is the client machine (Windows in our case) allowed to use this IP), not about the client machine's OS' DHCP service (acting as the DHCP client). I mean that the host is in a local network, so it has a local IP, and this local IP was assigned by a server. The "server" role is played by Win2003 x64 in one case and WinXP x86 in another. Both the client (with the DHCP client service) and the server (with the DHCP server service embedded in the system) are Windows :) That's what I mean :) So all affected hosts are connected to the Internet via Windows' embedded NAT services and have no external IP. I have no idea whether this can shed any light on the issue under investigation, but ... Maybe I should believe after that Win is true microkernel and modular OS? Even RTOS maybe? LoL (offtopic, of course, but... sorry :) ) I've no exact idea, what to compare Windows to, but sure no RTOS :-) (joke taken). :) ____________ | |
| ID: 19992 | | |
I mean that host in local network so it has local IP and this local IP was assigned by server...in "server" role plays Win2003 x64 in one case and WinXP x86 in another case. Both client (with DHCP client service) and server (with embedded into system DHCP server service) are windows :) That's what I mean :) OK, this way you can at least rule out the machine acting as DHCP server (which is what the NAT service is) being unavailable. So this is probably not a lost-IP problem (I mention it because I did have such BOINC issues in the past). Peter | |
| ID: 19993 | | |
Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control. Interesting. You should have received that popup. Not getting the popup is somehow related to the problem. When I kill BOINC manager process and restart it it hangs again. I agree. It's the client (what you call the service) that hangs on your machine. Are your hosts connected with WiFi or cables? I had to stop my attempt to reproduce the problem before 10 hours but I will try again. P.S. After more thought... the popup is generated by the manager. Since you don't get the popup, the client is obviously hanging before it can tell the manager there is no Internet connection. | |
| ID: 19999 | | |
I agree. It's the client (what you call the service) that hangs on your machine. I call the client a service because the BOINC core client runs as a Windows service on those hosts (except Win98 of course, where boinc.exe runs as a separate console app). The hosts are connected via cable. Unplugging the cable leads to this situation (a long unplug; how long needs more investigation). I usually don't run BOINC manager; only boinc.exe runs, as a service. So after boinc.exe ran into trouble, BOINC manager (which was launched later) met a wounded boinc service. Why BOINC manager doesn't use some timeout for communication with the service (core client) in this case, I don't know. If the service is stopped, the manager shows a popup like "can't connect to client". But while the service runs, the manager hangs. ____________ | |
| ID: 20003 | | |
|
BOINC has been running on my Win XP BOINC 5.10.45 system with the network cable unplugged from the hardware router for 25 hours. No hangs, no problems. | |
| ID: 20029 | | |
BOINC has been running on my Win XP BOINC 5.10.45 system with the network cable unplugged from the hardware router for 25 hours. No hangs, no problems. There was some delay: AP had to finish before further experiments. I will try with these options enabled. ____________ | |
| ID: 20041 | | |
|
Well, after 28 h with the cable unplugged, BOINC manager can connect to the service. | |
| ID: 20064 | | |
Well, after 28h with unplugged cable BOINC manager can connect to service. In the test I did it retried once per minute, not once per second. We are both running 5.10.45 but maybe you have a different build? It might be worth trying the latest version because even if we pinpoint a bug in the source they are not going to release another 5.10.x. Your options will be to compile a fixed 5.10.45 yourself or update to 6.2.18. | |
| ID: 20065 | | |
|
You get the once a second retry if you have your reminder option set to zero. | |
| ID: 20066 | | |
BOINC has >100 results in the upload queue. It retries sending each result once per minute. But it retries 2 results at once, then slightly later the next pair, and so on. So every second some result is being sent. And it is pretty pointless to retry the next pair right after the previous pair fails. It could check whether the project site is accessible (as it does with the reference site) and, if the project is down, stop all retries until the project is reachable again. This applies not only when the whole network is down (as with an unplugged cable), but also when a project is down for maintenance. For example, the SETI project is down for a few hours each week. ADDON: Why are such retries evil? Because boinc.exe eats a lot of CPU in this situation. Now, after ~40 h of network outage, it has consumed 2 h 19 min of CPU time (on a Q9450 system (!)). 4% of CPU time on a quad system is too much for a management tool. @Ageless: How can I lower the CPU consumption of BOINC 5.10.45 in this situation? I didn't change any "reminder" options by hand; should I increase it, and where? ____________ | |
| ID: 20070 | | |
When a host runs with a broken network connection (for example, with the network cable unplugged), boinc.exe (running as a service) begins to consume CPU (~2 h of CPU time over ~10 hours without network). There is a bug that causes a hang if network is unavailable; unrelated to the high CPU usage. ____________ Please use the "Reply" button on posts, instead of "reply to this thread". Keep the "X is a reply to Y". | |
| ID: 20073 | | |
When a host runs with a broken network connection (for example, with the network cable unplugged), boinc.exe (running as a service) begins to consume CPU (~2 h of CPU time over ~10 hours without network). Maybe. If so, there are 2 separate problems instead of one :) I still can't reproduce the "hang", but the high CPU consumption is easily reproducible. ____________ | |
| ID: 20079 | | |
|
It happens to me too... I have 4 cores, and they run at night... and in the morning the screensaver says BOINC is loading, and then a popup says no network available, and the manager hangs and closes... I will try crunching with 3 cores to see if it works.. | |
| ID: 20174 | | |
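
Editor's sketch for the suggestion in post 20070 (probe the project once and hold back all per-result upload retries while it is unreachable). This is a minimal illustration in Python under stated assumptions, not BOINC's actual code: `UploadScheduler`, `is_reachable`, `send`, and the delay values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UploadScheduler:
    """Hypothetical sketch: gate per-result upload retries on one reachability probe."""

    is_reachable: Callable[[], bool]   # assumed cheap probe of the project site
    send: Callable[[str], bool]        # assumed uploader; returns True on success
    pending: List[str] = field(default_factory=list)
    base_delay: float = 60.0           # seconds; illustrative value
    max_delay: float = 3600.0          # cap for the backoff; illustrative value
    failures: int = 0                  # consecutive failed probes
    next_attempt: float = 0.0          # earliest time of the next probe

    def tick(self, now: float) -> int:
        """Run one scheduling pass; return the number of upload attempts made."""
        if not self.pending or now < self.next_attempt:
            return 0
        if not self.is_reachable():
            # One failed probe stands in for the whole queue:
            # no per-result retries, and the wait doubles each time.
            self.failures += 1
            delay = min(self.base_delay * 2 ** (self.failures - 1), self.max_delay)
            self.next_attempt = now + delay
            return 0
        # Project is reachable again: reset the backoff and drain the queue.
        self.failures = 0
        attempts = 0
        still_pending = []
        for name in self.pending:
            attempts += 1
            if not self.send(name):
                still_pending.append(name)
        self.pending = still_pending
        return attempts
```

With a queue of 100 results and the network down, each pass costs one probe instead of 100 upload attempts, and the interval between probes grows from 60 s up to the cap. Whether this would remove the CPU load reported above is untested; it only illustrates the idea from the post.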
Copyright © 2009 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.