BOINC "hangs" when network unavailable

Raistmer
Joined: Apr 9 06
Posts: 113
Message 19939 - Posted 3 Sep 2008 18:45:01 UTC

    When a host runs with a broken network connection (for example, with the network cable unplugged), boinc.exe (running as a service) begins to consume CPU (~2 h of CPU time over ~10 hours without network).
    In this situation the BOINC Manager can't connect to the BOINC service and hangs (it doesn't just show a message like "Can't connect to core client"; it really hangs).

    ____________

    Sekerob
    Joined: Aug 25 06
    Posts: 1079
    Message 19942 - Posted 3 Sep 2008 19:24:23 UTC - in response to Message 19939.

      What BOINC version and OS?
      ____________
      Coelum Non Animum Mutant, Qui Trans Mare Currunt

      Gundolf Jahn
      Joined: Dec 20 07
      Posts: 473
      Message 19943 - Posted 3 Sep 2008 20:09:21 UTC - in response to Message 19942.

        What BOINC version and OS?

        All versions I can remember. I'm sure for 5.8.16 and 5.10.45.
        I've also read posts on the fora somewhere. Can't search now, Grey's Anatomy continues :-)

        Regards,
        Gundolf
        ____________
        Computers aren't everything in life. (Just kidding)

        Raistmer
        Joined: Apr 9 06
        Posts: 113
        Message 19949 - Posted 4 Sep 2008 3:15:33 UTC - in response to Message 19942.

          Last modified: 4 Sep 2008 3:20:46 UTC

          What BOINC version and OS?

          In my case it's BOINC 5.10.45.
          The OSes are Vista Business edition, Windows Server 2003 x64 and Win98 (I have seen this behavior a few times already, and when it hit even my new quad-core with Vista I realised that this bug should be reported).

          ADDENDUM: The science app (I observed this with the Einstein@Home project app) kept restarting because it was not receiving heartbeats from the core client.
          ____________

          Dagorath
          Joined: Jun 13 07
          Posts: 638
          Message 19950 - Posted 4 Sep 2008 4:19:02 UTC - in response to Message 19949.

            What BOINC version and OS?

            In my case it's BOINC 5.10.45.
            The OSes are Vista Business edition, Windows Server 2003 x64 and Win98 (I have seen this behavior a few times already, and when it hit even my new quad-core with Vista I realised that this bug should be reported).

            ADDENDUM: The science app (I observed this with the Einstein@Home project app) kept restarting because it was not receiving heartbeats from the core client.


            In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. One of the hosts is WinXP and BOINC 5.10.45. Two are Linux and BOINC 6.2.15. The other is Linux and BOINC 5.10.45. None are service install.



            Raistmer
            Joined: Apr 9 06
            Posts: 113
            Message 19953 - Posted 4 Sep 2008 5:23:58 UTC - in response to Message 19950.


              In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang?

              10 h is just an approximate value; don't rely on it.
              The network cable was unplugged in the afternoon, and the next morning I noticed the situation described (in the latest case, with Vista on the quad-core).
              In the cases of the Core2 Duo under Win2003 x64 and the P-II under Win98, the network cable had possibly been unplugged for a few days.
              So this probably occurs after a long network outage.
              Of course, if the network setting was "Network activity suspended" before the cable was unplugged, everything works just fine.

              ____________

              Gundolf Jahn
              Joined: Dec 20 07
              Posts: 473
              Message 19956 - Posted 4 Sep 2008 8:16:40 UTC - in response to Message 19950.

                In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10. One of the hosts is WinXP and BOINC 5.10.45. Two are Linux and BOINC 6.2.15. The other is Linux and BOINC 5.10.45. None are service install.

                The hang occurs as soon as BOINC tries a network connection without a cable plugged in.

                Regards,
                Gundolf
                ____________
                Computers aren't everything in life. (Just kidding)

                Pepo
                Joined: Apr 3 06
                Posts: 403
                Message 19961 - Posted 4 Sep 2008 10:08:02 UTC - in response to Message 19950.

                  In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10.

                  Might it be possible that these 'about 10 hours' correlate with the time when the machine's IP lease expires? (I've occasionally had this problem too over the last 2 years or so (and reported it a few times), but not in recent months.)

                  Peter

                  Dagorath
                  Joined: Jun 13 07
                  Posts: 638
                  Message 19964 - Posted 4 Sep 2008 12:36:41 UTC - in response to Message 19961.

                    In your first post you mention 10 hours. Are you saying the network must be broken for about 10 hours to cause the hang? I am wondering because I unplugged the cable between the host and router on 4 of my hosts and saw no problems. But I left them unplugged for only 1 to 2 hours, not 10.

                    Might it be possible that these 'about 10 hours' correlate with the time when the machine's IP lease expires? (I've occasionally had this problem too over the last 2 years or so (and reported it a few times), but not in recent months.)

                    Peter


                    @Gundolf,

                    Two of the Linux machines attempted a result upload while the cable was unplugged but the manager didn't hang. Maybe it doesn't occur on Linux? The Win machine in my test did not try to use the network while cable was unplugged.

                    @Peter,

                    Yes, it might coincide with IP lease expiry. I have expiry set to 1 week at the moment. Later today I'll set expiry to the minimum, increase work cache and test again.

                    Raistmer
                    Joined: Apr 9 06
                    Posts: 113
                    Message 19967 - Posted 4 Sep 2008 13:08:19 UTC

                      Well, maybe, although the Win98 host has a manually assigned IP, as I recall.
                      The other hosts have the default IP lease time; I don't know how long that is for these OSes.

                      ____________

                      Pepo
                      Joined: Apr 3 06
                      Posts: 403
                      Message 19968 - Posted 4 Sep 2008 13:15:54 UTC - in response to Message 19967.

                        The other hosts have the default IP lease time; I don't know how long that is for these OSes.

                        The expiration delay is defined by the DHCP server, not OS-side.

                        Peter

                        Gundolf Jahn
                        Joined: Dec 20 07
                        Posts: 473
                        Message 19972 - Posted 4 Sep 2008 18:38:32 UTC - in response to Message 19964.

                          @Gundolf,

                          Two of the Linux machines attempted a result upload while the cable was unplugged but the manager didn't hang. Maybe it doesn't occur on Linux? The Win machine in my test did not try to use the network while cable was unplugged.

                          Quite possible. I only have windows machines. The one I'm referring to runs NT4 :-)

                          I'm quite sure, though, that it's not the IP lease. I do have issues with that too, but never concurrently with BOINC manager hangs (as far as I remember :-)

                          Regards,
                          Gundolf
                          ____________
                          Computers aren't everything in life. (Just kidding)

                          Raistmer
                          Joined: Apr 9 06
                          Posts: 113
                          Message 19976 - Posted 4 Sep 2008 21:11:45 UTC - in response to Message 19968.

                            The other hosts have the default IP lease time; I don't know how long that is for these OSes.

                            The expiration delay is defined by the DHCP server, not OS-side.

                            Peter

                            Do you really think the DHCP service built into Windows is not part of the OS?
                            Maybe after that I should believe that Windows is a true microkernel, modular OS? Maybe even an RTOS? LOL (off-topic, of course, but... sorry :) )
                            ____________

                            Dagorath
                            Joined: Jun 13 07
                            Posts: 638
                            Message 19982 - Posted 5 Sep 2008 2:39:30 UTC

                              Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control.

                              Raistmer
                              Joined: Apr 9 06
                              Posts: 113
                              Message 19988 - Posted 5 Sep 2008 7:39:29 UTC - in response to Message 19982.

                                Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control.


                                I didn't receive that popup.
                                When I kill the BOINC Manager process and restart it, it hangs again.
                                The solution in that case was to stop the BOINC service, then start BOINC Manager, then disable network access, then restart the BOINC service and the Manager.
                                So, IMHO, the "hang" occurs in the BOINC service, not in BOINC Manager itself.
                                This is supported by the behavior of the Einstein@Home app: it perpetually restarts with a message like "no heartbeat for 30 sec". Apparently the science app can't communicate with the BOINC service in that situation either.
                                And don't forget the increased CPU consumption of the boinc.exe process. It looks like the service retries its communication attempts too often to be able to do anything else.
                                I will try to reproduce this situation in a more controlled environment.
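
The "no heartbeat for 30 sec" restarts mentioned in this post can be illustrated with a minimal sketch. This is not BOINC's actual code: the class name `HeartbeatWatchdog` and its methods are hypothetical, and only the 30-second limit is taken from the message quoted above. It shows why a client that is too busy to service its apps looks, from the app's side, just like a client that has died.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical app-side watchdog; not BOINC's real API."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # 30 s, as in the message quoted above
        self.clock = clock           # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        """Record a heartbeat received from the client."""
        self.last_beat = self.clock()

    def client_lost(self):
        """True once no heartbeat has arrived for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout_s
```

An app built this way would call `beat()` whenever the client contacts it and give up once `client_lost()` returns True, which would match the restart loop described here if the client stops sending heartbeats while it is stuck.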
                                ____________

                                Ageless
                                Forum moderator
                                Project administrator
                                Joined: Aug 29 05
                                Posts: 4384
                                Message 19989 - Posted 5 Sep 2008 9:04:18 UTC - in response to Message 19988.

                                  I will try to reproduce this situation in a more controlled environment.

                                  Try to reproduce it with 6.2.18
                                  I say this because 5.10 is no longer in development, no new(er) versions of 5.10 will be released.
                                  ____________
                                  Jord

                                  -The BOINC FAQ Service

                                  -CUDA/Stream FAQ

                                  Courtesy starts with your first post of the thread.

                                  Pepo
                                  Joined: Apr 3 06
                                  Posts: 403
                                  Message 19990 - Posted 5 Sep 2008 9:08:50 UTC - in response to Message 19976.

                                    The other hosts have the default IP lease time; I don't know how long that is for these OSes.

                                    The expiration delay is defined by the DHCP server, not OS-side.

                                    Do you really think the DHCP service built into Windows is not part of the OS?

                                    No, I do not, but please note that I wrote about the IP lease from the DHCP server (the server machine defines for how long the client machine (Windows in our case) is allowed to use this IP), not about the DHCP service of the client machine's OS (acting as the DHCP client).

                                    Maybe after that I should believe that Windows is a true microkernel, modular OS? Maybe even an RTOS? LOL (off-topic, of course, but... sorry :) )

                                    I've no exact idea what to compare Windows to, but surely not an RTOS :-) (joke taken).

                                    Peter

                                    Raistmer
                                    Joined: Apr 9 06
                                    Posts: 113
                                    Message 19991 - Posted 5 Sep 2008 10:06:48 UTC - in response to Message 19989.

                                      I will try to reproduce this situation in a more controlled environment.

                                      Try to reproduce it with 6.2.18
                                      I say this because 5.10 is no longer in development, no new(er) versions of 5.10 will be released.

                                      Sorry, I still don't like the idea of using 6.x versions, at least on production hosts.
                                      But I guess that a good share of the 5.x codebase was reused in 6.x, so this bug could easily have carried over to the new version too.

                                      ____________

                                      Raistmer
                                      Joined: Apr 9 06
                                      Posts: 113
                                      Message 19992 - Posted 5 Sep 2008 10:15:48 UTC - in response to Message 19990.

                                        Last modified: 5 Sep 2008 10:16:25 UTC

                                        IP lease from the DHCP server (the server machine defines for how long the client machine (Windows in our case) is allowed to use this IP), not about the DHCP service of the client machine's OS (acting as the DHCP client).

                                        I mean that the host is on a local network, so it has a local IP, and this local IP was assigned by a server. The "server" role is played by Win2003 x64 in one case and WinXP x86 in the other. Both the client (with the DHCP client service) and the server (with the DHCP server service built into the system) are Windows :) That's what I mean :)

                                        So all affected hosts are connected to the Internet via Windows' built-in NAT services and have no external IP. I have no idea whether this sheds any light on the issue under investigation, but ...

                                        Maybe after that I should believe that Windows is a true microkernel, modular OS? Maybe even an RTOS? LOL (off-topic, of course, but... sorry :) )

                                        I've no exact idea what to compare Windows to, but surely not an RTOS :-) (joke taken).

                                        :)
                                        ____________

                                        Pepo
                                        Joined: Apr 3 06
                                        Posts: 403
                                        Message 19993 - Posted 5 Sep 2008 10:31:20 UTC - in response to Message 19992.

                                          I mean that the host is on a local network, so it has a local IP, and this local IP was assigned by a server. The "server" role is played by Win2003 x64 in one case and WinXP x86 in the other. Both the client (with the DHCP client service) and the server (with the DHCP server service built into the system) are Windows :) That's what I mean :)
                                          So all affected hosts are connected to the Internet via Windows' built-in NAT services and have no external IP.

                                          OK, this way you can at least rule out that the machines acting as DHCP server (which is what the NAT service is) were unavailable. Thus this is probably not a lost-IP problem (I asked because I did have such BOINC issues in the past).

                                          Peter

                                          Dagorath
                                          Joined: Jun 13 07
                                          Posts: 638
                                          Message 19999 - Posted 5 Sep 2008 12:00:20 UTC - in response to Message 19988.

                                            Last modified: 5 Sep 2008 12:05:42 UTC

                                            Raistmer, are you sure the manager is truly hanging? I unplugged the cable on a Win XP host running BOINC 6.2.18 and waited until it attempted to upload a result. It tries the upload 3 or 4 times then gives an application modal popup saying "BOINC couldn't do Internet communication, and no default connection is selected." The manager appears to hang because clicking on it does nothing. However, if I close the popup then the manager regains control.


                                            I didn't receive that popup.


                                            Interesting. You should have received that popup. Not getting the popup is somehow related to the problem.

                                            When I kill the BOINC Manager process and restart it, it hangs again.
                                            The solution in that case was to stop the BOINC service, then start BOINC Manager, then disable network access, then restart the BOINC service and the Manager.
                                            So, IMHO, the "hang" occurs in the BOINC service, not in BOINC Manager itself. This is supported by the behavior of the Einstein@Home app: it perpetually restarts with a message like "no heartbeat for 30 sec". Apparently the science app can't communicate with the BOINC service in that situation either. And don't forget the increased CPU consumption of the boinc.exe process. It looks like the service retries its communication attempts too often to be able to do anything else.
                                            I will try to reproduce this situation in a more controlled environment.


                                            I agree. It's the client (what you call the service) that hangs on your machine.

                                            Are your hosts connected with WiFi or cables?

                                            I had to stop my attempt to reproduce the problem before 10 hours but I will try again.

                                            P.S. After more thought... the popup is generated by the manager. Since you don't get the popup, the client is obviously hanging before it can tell the manager there is no Internet connection.

                                            Raistmer
                                            Joined: Apr 9 06
                                            Posts: 113
                                            Message 20003 - Posted 5 Sep 2008 16:45:47 UTC - in response to Message 19999.

                                              I agree. It's the client (what you call the service) that hangs on your machine.

                                              Are your hosts connected with WiFi or cables?

                                              I had to stop my attempt to reproduce the problem before 10 hours but I will try again.

                                              P.S. After more thought... the popup is generated by the manager. Since you don't get the popup, the client is obviously hanging before it can tell the manager there is no Internet connection.

                                              I call the client a service because the BOINC core client runs as a Windows service on those hosts (except Win98, of course, where boinc.exe runs as a separate console app).

                                              The hosts are connected via cable. Unplugging the cable leads to this situation (after a long unplug; how long needs more investigation).

                                              I usually don't run BOINC Manager; only boinc.exe runs, as a service.
                                              So, after boinc.exe ran into trouble, BOINC Manager (which was launched later) met a wounded BOINC service. Why BOINC Manager doesn't use a timeout when communicating with the service (core client) in this case, I don't know. If the service is stopped, the Manager shows a popup like "can't connect to client". But while the service is running, the Manager hangs.
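
The timeout asked about in this post can be sketched in general terms. This illustrates the technique only and says nothing about how BOINC Manager is actually implemented: the helper name `request_with_timeout`, the payload, and the timeout value are all assumptions, and no real BOINC RPC message is shown. The point is that a peer which accepts the connection but never answers should cost the caller a bounded wait, not an indefinite hang.

```python
import socket


def request_with_timeout(host, port, payload, timeout_s=5.0):
    """Send payload and return the first reply chunk, or None on failure.

    Hypothetical helper: the timeout passed to create_connection also
    applies to the later send and recv calls on the same socket, so
    every blocking step is bounded.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(payload)
            reply = sock.recv(4096)
            return reply or None    # an empty read means the peer closed
    except OSError:                 # covers timeouts and refused connections
        return None
```

A caller that gets `None` back can show a "can't connect to client" style message instead of blocking, whether the peer is stopped or merely unresponsive.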

                                              ____________

                                              Dagorath
                                              Joined: Jun 13 07
                                              Posts: 638
                                              Message 20029 - Posted 7 Sep 2008 17:32:02 UTC

                                                BOINC has been running on my Win XP BOINC 5.10.45 system with the network cable unplugged from the hardware router for 25 hours. No hangs, no problems.

                                                Raistmer, have you tried turning on any of the debug options described here? Maybe try these just to see what info turns up:

                                                <file_xfer>
                                                <file_xfer_debug>
                                                <http_debug>
                                                <http_xfer_debug>
                                                <network_status_debug>
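
For readers who want to try the flags listed above: they are normally switched on in the client's `cc_config.xml` file, inside a `log_flags` element. The sketch below only builds that XML text; the function name `build_cc_config` is hypothetical, the layout is assumed from the usual client configuration format, and it should be checked against the documentation for the BOINC version in use before replacing any existing file.

```python
import xml.etree.ElementTree as ET

# The five flags named in the post above.
FLAGS = ["file_xfer", "file_xfer_debug", "http_debug",
         "http_xfer_debug", "network_status_debug"]


def build_cc_config(flags):
    """Return cc_config.xml text with each listed log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(FLAGS))
```

The output is deliberately returned as text and not written to disk, so an existing configuration cannot be overwritten by accident.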

                                                Raistmer
                                                Joined: Apr 9 06
                                                Posts: 113
                                                Message 20041 - Posted 8 Sep 2008 15:49:31 UTC - in response to Message 20029.

                                                  BOINC has been running on my Win XP BOINC 5.10.45 system with the network cable unplugged from the hardware router for 25 hours. No hangs, no problems.

                                                  Raistmer, have you tried turning on any of the debug options described here? Maybe try these just to see what info turns up:

                                                  <file_xfer>
                                                  <file_xfer_debug>
                                                  <http_debug>
                                                  <http_xfer_debug>
                                                  <network_status_debug>


                                                  There was some delay; AP had to finish before further experiments.
                                                  I will try with these options enabled.

                                                  ____________

                                                  Raistmer
                                                  Joined: Apr 9 06
                                                  Posts: 113
                                                  Message 20064 - Posted 9 Sep 2008 22:58:01 UTC

                                                    Last modified: 9 Sep 2008 22:59:41 UTC

                                                    Well, after 28 h with the cable unplugged, BOINC Manager can still connect to the service.
                                                    On the first try (after ~3 h) it showed the popup box; on the second and third attempts it just opened the main window.
                                                    It seems to be a hard-to-reproduce (or non-reproducible) bug.

                                                    But over 28 h of running without network, boinc.exe used 1 h 14 min of CPU.
                                                    That is, ~4% of CPU goes to the BOINC service itself.
                                                    Almost every second BOINC tries to reconnect for one of the results.
                                                    There are already ~100 results ready to upload (SETI produces very short tasks now). These retries seem to take too much CPU time.
                                                    Maybe it would be worth checking whether the project is available once per minute (or less often), not every second?
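
The ~4% figure can be checked from the two numbers given in this post. The helper name `cpu_share` is made up for the illustration, and the result is a share of one core's time over the wall-clock period, which is an assumption about how the figure was meant.

```python
def cpu_share(cpu_minutes, wall_hours):
    """Fraction of one core's time used over the wall-clock period."""
    return cpu_minutes / (wall_hours * 60)


share = cpu_share(cpu_minutes=74, wall_hours=28)   # 1 h 14 min over 28 h
print(f"{share:.1%}")                              # prints 4.4%
```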
                                                    ____________

                                                    Dagorath
                                                    Joined: Jun 13 07
                                                    Posts: 638
                                                    Message 20065 - Posted 10 Sep 2008 0:05:46 UTC - in response to Message 20064.

                                                      Well, after 28 h with the cable unplugged, BOINC Manager can still connect to the service.
                                                      On the first try (after ~3 h) it showed the popup box; on the second and third attempts it just opened the main window.
                                                      It seems to be a hard-to-reproduce (or non-reproducible) bug.

                                                      But over 28 h of running without network, boinc.exe used 1 h 14 min of CPU.
                                                      That is, ~4% of CPU goes to the BOINC service itself.
                                                      Almost every second BOINC tries to reconnect for one of the results.
                                                      There are already ~100 results ready to upload (SETI produces very short tasks now). These retries seem to take too much CPU time.
                                                      Maybe it would be worth checking whether the project is available once per minute (or less often), not every second?


                                                      In the test I did it retried once per minute, not once per second. We are both running 5.10.45 but maybe you have a different build? It might be worth trying the latest version because even if we pinpoint a bug in the source they are not going to release another 5.10.x. Your options will be to compile a fixed 5.10.45 yourself or update to 6.2.18.

                                                      Ageless
                                                      Forum moderator
                                                      Project administrator
                                                      Joined: Aug 29 05
                                                      Posts: 4384
                                                      Message 20066 - Posted 10 Sep 2008 0:24:48 UTC

                                                        You get the once-a-second retry if you have your reminder option set to zero.
                                                        Only in 6.3 does setting the reminder to zero really turn it off.
                                                        ____________
                                                        Jord

                                                        -The BOINC FAQ Service

                                                        -CUDA/Stream FAQ

                                                        Courtesy starts with your first post of the thread.

                                                        Raistmer
                                                        Joined: Apr 9 06
                                                        Posts: 113
                                                        Message 20070 - Posted 10 Sep 2008 11:32:23 UTC - in response to Message 20065.

                                                          Last modified: 10 Sep 2008 11:46:55 UTC


                                                          In the test I did it retried once per minute, not once per second. We are both running 5.10.45 but maybe you have a different build? It might be worth trying the latest version because even if we pinpoint a bug in the source they are not going to release another 5.10.x. Your options will be to compile a fixed 5.10.45 yourself or update to 6.2.18.


                                                          BOINC has >100 results in the upload queue. It retries sending each result once per minute, but it retries 2 results at a time, then slightly later the next pair, and so on.
                                                          So every second some result is being sent.

                                                          And it is pretty pointless to retry the next pair right after the previous pair fails. It could check whether the project site is accessible (as it does with the reference site) and, if the project is down, stop all retries until the project is reachable again.
                                                          This applies not only when the whole network is down (as with an unplugged cable), but also when a project is down for maintenance. For example, the SETI project is down for a few hours each week.
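
The suggestion above, one reachability check per project in place of independent retries per result, can be sketched as follows. Everything here is hypothetical: `ProjectGate` and `attempts_while_down` are illustration names, not BOINC code, and the doubling of the pause after each failure (exponential backoff) is an addition of this sketch that the post does not ask for. The counter function compares how many attempts each policy makes during an outage.

```python
class ProjectGate:
    """Hypothetical per-project gate: one failure pauses every queued upload."""

    def __init__(self, base_delay_s=60.0, max_delay_s=3600.0):
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.failures = 0
        self.next_try = 0.0          # earliest time any upload may be tried

    def may_try(self, now):
        return now >= self.next_try

    def record_failure(self, now):
        # Double the pause on each consecutive failure, up to the cap.
        delay = min(self.base_delay_s * 2 ** self.failures, self.max_delay_s)
        self.failures += 1
        self.next_try = now + delay

    def record_success(self):
        self.failures = 0
        self.next_try = 0.0


def attempts_while_down(n_results, outage_s, gate=None, per_result_s=60.0):
    """Count upload attempts during an outage, with or without a gate."""
    if gate is None:
        # Every result is retried independently once per per_result_s.
        return n_results * int(outage_s // per_result_s)
    attempts, now = 0, 0.0
    while now < outage_s:
        if gate.may_try(now):
            attempts += 1            # one probe stands in for the whole queue
            gate.record_failure(now)
        now = gate.next_try
    return attempts
```

With 100 queued results and a one-hour outage, `attempts_while_down(100, 3600)` counts 6000 attempts under the once-per-minute-per-result policy described in this post, while `attempts_while_down(100, 3600, ProjectGate())` counts 6.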

                                                          ADDON: Why such retries are evil? - Because boinc.exe eats much of CPU in this situation. Now, after ~40h of network outage, it consumed 2h19min of CPU time (on Q9450 system (!) ). 4% of CPU time for quad system - it's too much for management tool.

                                                          @Ageless: How can I lower the CPU consumption of BOINC 5.10.45 in this situation? I didn't change any "reminder" options by hand; should I increase them, and where?
                                                          ____________

                                                          Nicolas
                                                          Joined: Jan 19 07
                                                          Posts: 1124
                                                          Message 20073 - Posted 10 Sep 2008 15:58:55 UTC - in response to Message 19939.

                                                            When the host runs with a broken network (for example, network cable unplugged), boinc.exe (run as a service) begins to consume CPU (~2h of CPU time for ~10 hours without network).
                                                            In this situation the BOINC client manager can't connect to the boinc service and hangs (it does not just show a message like "Can't connect to core client"; it actually hangs).

                                                            There is a bug that causes a hang if the network is unavailable; it is unrelated to the high CPU usage.


                                                            ____________
                                                            Please use the "Reply" button on posts, instead of "reply to this thread". Keep the "X is a reply to Y".

                                                            Raistmer
                                                            Joined: Apr 9 06
                                                            Posts: 113
                                                            Message 20079 - Posted 10 Sep 2008 22:27:47 UTC - in response to Message 20073.

                                                              When the host runs with a broken network (for example, network cable unplugged), boinc.exe (run as a service) begins to consume CPU (~2h of CPU time for ~10 hours without network).
                                                              In this situation the BOINC client manager can't connect to the boinc service and hangs (it does not just show a message like "Can't connect to core client"; it actually hangs).

                                                              There is a bug that causes a hang if the network is unavailable; it is unrelated to the high CPU usage.


                                                              Maybe. If so, there are 2 separate problems instead of one :)
                                                              I still can't reproduce the "hang", but the high CPU consumption is easily reproducible.
                                                              ____________

                                                              Vaki
                                                              Joined: Sep 11 08
                                                              Posts: 5
                                                              Message 20174 - Posted 12 Sep 2008 8:37:22 UTC - in response to Message 20079.

                                                                It happens to me too. I have 4 cores, and they run at night. In the morning, the screensaver says BOINC is loading, then a pop-up says no network is available, and the manager hangs and closes. I will try crunching with 3 cores to see if that works.


                                                                Copyright © 2009 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.