Network problems = Unrecoverable error...

-ShEm-

Joined: 14 Feb 08
Posts: 28
Message 45007 - Posted: 21 Jul 2012, 10:37:33 UTC
Last modified: 21 Jul 2012, 10:38:11 UTC

I very frequently get network errors (connection lost, connection reset, etc.) because of the very bad internet where I currently live. But why can't boinc-client handle that while trying to connect? Every time, I either get lots of "exited with zero status but no 'finished' file" or simply get "Unrecoverable error"! WTF? Why are computation and network linked together so tightly? Shouldn't they be separate processes that don't interfere with each other? Considering how many hours of work I'm losing (today about 12 hours in 4 WUs), I'd rather not run BOINC then...

This has happened on various Windows and Linux OSes, in various BOINC versions from 5.x.x to 7.x.x, and on different computers. I wasn't bothered by it so much before, because I lived in a 1st world country with OK or better internet, so it happened maybe once a month. Where I live now it happens daily, even several times a day if I'm not micro-managing, but we're told all the time that we're not supposed to micro-manage BOINC (and I don't want to). The way it is now I have to do just that: manually disable computation before allowing network activity, wait for uploads / downloads / server contact to finish, then disable network activity and enable computation. :(

Example from today (I forgot to disable computation):
2012-07-21T15:41:50 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_0
2012-07-21T15:41:55 BDT | Einstein@Home | Finished upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_0
2012-07-21T15:44:02 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:45:06 BDT |  | Project communication failed: attempting access to reference site
2012-07-21T15:45:06 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error
2012-07-21T15:45:06 BDT | Einstein@Home | Backing off 5 min 37 sec on upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:46:43 BDT |  | Internet access OK - project servers may be temporarily down.
2012-07-21T15:50:44 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Deferring communication for 1 min 50 sec
2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_42_2 (process got signal 11)
2012-07-21T15:51:37 BDT | Asteroids@home | Computation for task ps_120622b_632_42_2 finished
2012-07-21T15:51:37 BDT | Asteroids@home | Starting task ps_120622b_621_226_2 using period_search version 10000 in slot 3
2012-07-21T15:51:38 BDT | Asteroids@home | [sched_op] Deferring communication for 3 min 4 sec
2012-07-21T15:51:38 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_43_3 (process got signal 11)
2012-07-21T15:51:38 BDT | Asteroids@home | Computation for task ps_120622b_632_43_3 finished
2012-07-21T15:51:38 BDT | Asteroids@home | Starting task ps_120622b_621_50_2 using period_search version 10000 in slot 0
2012-07-21T15:51:39 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file
2012-07-21T15:51:39 BDT | Einstein@Home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:51:39 BDT | Asteroids@home | Started upload of ps_120622b_632_42_2_0
2012-07-21T15:51:39 BDT | Einstein@Home | Restarting task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 using einsteinbinary_BRP4 version 122 (BRP4SSE) in slot 1
2012-07-21T15:52:00 BDT | SETI@home | Task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 exited with zero status but no 'finished' file
2012-07-21T15:52:00 BDT | SETI@home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:52:00 BDT | SETI@home | Restarting task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 using astropulse_v6 version 601 in slot 2
2012-07-21T15:52:29 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error
2012-07-21T15:52:29 BDT | Einstein@Home | Backing off 15 min 20 sec on upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:52:30 BDT |  | Project communication failed: attempting access to reference site
2012-07-21T15:52:48 BDT |  | Internet access OK - project servers may be temporarily down.
2012-07-21T15:54:31 BDT |  | Project communication failed: attempting access to reference site
2012-07-21T15:54:31 BDT | Asteroids@home | Temporarily failed upload of ps_120622b_632_42_2_0: transient HTTP error
2012-07-21T15:54:31 BDT | Asteroids@home | Backing off 2 min 22 sec on upload of ps_120622b_632_42_2_0
2012-07-21T15:54:31 BDT | Asteroids@home | Started upload of ps_120622b_632_43_3_0
2012-07-21T15:55:11 BDT | Asteroids@home | [sched_op] Deferring communication for 7 min 58 sec
2012-07-21T15:55:11 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_621_226_2 (process got signal 11)
2012-07-21T15:55:11 BDT | Asteroids@home | Computation for task ps_120622b_621_226_2 finished
2012-07-21T15:55:11 BDT | Asteroids@home | Starting task ps_120622b_632_96_2 using period_search version 10000 in slot 3
2012-07-21T15:55:12 BDT | Asteroids@home | [sched_op] Deferring communication for 15 min 55 sec
2012-07-21T15:55:12 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_621_50_2 (process got signal 11)
2012-07-21T15:55:12 BDT | Asteroids@home | Computation for task ps_120622b_621_50_2 finished
2012-07-21T15:55:12 BDT | Asteroids@home | Starting task ps_120622b_272_153_2 using period_search version 10000 in slot 0
2012-07-21T15:55:13 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file
2012-07-21T15:55:13 BDT | Einstein@Home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:55:13 BDT | Einstein@Home | Restarting task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 using einsteinbinary_BRP4 version 122 (BRP4SSE) in slot 1
2012-07-21T15:55:14 BDT | SETI@home | Task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 exited with zero status but no 'finished' file
2012-07-21T15:55:14 BDT | SETI@home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:55:14 BDT | SETI@home | Restarting task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 using astropulse_v6 version 601 in slot 2
2012-07-21T15:55:45 BDT |  | Internet access OK - project servers may be temporarily down.
2012-07-21T15:55:59 BDT | Asteroids@home | Temporarily failed upload of ps_120622b_632_43_3_0: transient HTTP error
2012-07-21T15:55:59 BDT | Asteroids@home | Backing off 2 min 7 sec on upload of ps_120622b_632_43_3_0
2012-07-21T15:55:59 BDT | Asteroids@home | Started upload of ps_120622b_621_226_2_0
2012-07-21T15:56:02 BDT | Einstein@Home | [sched_op] Starting scheduler request
2012-07-21T15:56:02 BDT | Einstein@Home | Sending scheduler request: To fetch work.
2012-07-21T15:56:02 BDT | Einstein@Home | Reporting 1 completed tasks, requesting new tasks for CPU
2012-07-21T15:56:02 BDT | Einstein@Home | [sched_op] CPU work request: 494462.67 seconds; 0.00 devices
2012-07-21T15:56:30 BDT | Asteroids@home | Finished upload of ps_120622b_621_226_2_0
2012-07-21T15:56:30 BDT | Asteroids@home | Started upload of ps_120622b_621_50_2_0
2012-07-21T15:56:43 BDT | Asteroids@home | Finished upload of ps_120622b_621_50_2_0
2012-07-21T15:56:52 BDT | Einstein@Home | Scheduler request completed: got 10 new tasks
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Server version 611
2012-07-21T15:56:52 BDT | Einstein@Home | Project requested delay of 60 seconds
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] estimated total CPU task duration: 600882 seconds
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] handle_scheduler_reply(): got ack for task LATeah2011G_288.0_40200_0.0_0
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Reason: requested by project
2012-07-21T15:56:54 BDT | Asteroids@home | Started upload of ps_120622b_632_42_2_0
2012-07-21T15:56:54 BDT | Einstein@Home | Started download of skygrid_LATeah2011G_0928.0.dat
2012-07-21T15:57:57 BDT | Asteroids@home | Finished upload of ps_120622b_632_42_2_0
2012-07-21T15:58:07 BDT | Asteroids@home | Started upload of ps_120622b_632_43_3_0
2012-07-21T15:59:05 BDT | Asteroids@home | Finished upload of ps_120622b_632_43_3_0

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 45008 - Posted: 21 Jul 2012, 11:57:09 UTC - in response to Message 45007.  

I'm going out on a limb here, but feel that they're coincidence. Perhaps related, but not in the way that you think they are.

Let me try to explain:

2012-07-21T15:45:06 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error

This error is due to your bad internet connection: it happens when the connection is dropped during communication with the server. So yup, your internet connection is at play here.

2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_42_2 (process got signal 11)

This one though, is a computation error, and has to do with either the application being not so stable, or trouble with your memory, your virtual memory (page file) or a bad batch of tasks (it happens).

2012-07-21T15:51:39 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file
2012-07-21T15:51:39 BDT | Einstein@Home | If this happens repeatedly you may need to reset the project.

This one happens when something outside of BOINC is interfering with the running of BOINC, like an anti-virus, anti-spyware or other anti-malware program actively scanning the BOINC Data directory.

That you see all three at the same time in your log can be due to extra stresses that BOINC brings along on a normal day. Running all kinds of calculations through BOINC stresses a computer out already, but when there are network problems, the computer will go into extra stress. With the network card these days being standard integrated in the motherboard, it's (part of) the CPU that will have to cater for the network connection. And when that CPU is busy doing intricate calculations...

I see that here as well on an otherwise stable system. Throw a slow network transfer in the bunch and my computer struggles. But then when I use a separate PCI 1000Mbit add-on card, the whole system flies, no matter what. And it ain't an old & slow one either. ;-)

So, first checks first:
1. Do you have an anti-virus or other anti-malware program scanning actively in the background?
1a. Is your BOINC Data directory excluded from being scanned?
2. What kind of system is it?
3. The network card, is it integrated into the motherboard, or a separate add-on card?
3a. If integrated, do you have the option to try an add-on card? (No, I didn't say you have to go out and buy one... :-))

Now, there is a problem with BOINC that I reported recently: when BOINC comes out of hibernation or sleep, the network card hasn't reinitialized yet, and BOINC has downloads waiting, it will try to do those before there is an internet connection, which results in corrupted files. However, this is a difficult one to track and reproduce. (Not all projects have download problems ;-))

mo.v
Joined: 13 Aug 06
Posts: 778
United Kingdom
Message 45009 - Posted: 21 Jul 2012, 12:06:14 UTC

Hi ShEm

To my knowledge the BOINC network access, i.e. the internet connection with the server, is completely separate from the running of the tasks. I don't think I've ever had an internet problem or server access problem affect how my tasks run. You shouldn't need to suspend the running of your tasks while your files upload.

I suggest you look at the possible causes of your task crashes. Every crashed task should generate an stderr file (you need to go to your account on the project and then find your task list to see these files). These stderr files are sometimes rather cryptic and you need to get used to what some of the strange phrases mean. Often you see far more details than what's in the Event Log.

For example, from your messages, the Signal 11 error. Here's what Jorden tells us about it in the BOINC FAQs:

http://boincfaq.mundayweb.com/index.php?language=1&view=459&sessionID=e14091fb5d550bc72c9f5619af34ca92

Here's what he says about Zero status but no 'finished' file:

http://boincfaq.mundayweb.com/index.php?language=1&view=116&sessionID=e14091fb5d550bc72c9f5619af34ca92

Here's the home page of the FAQs:

http://boincfaq.mundayweb.com/index.php

-ShEm-
Joined: 14 Feb 08
Posts: 28
Message 45010 - Posted: 21 Jul 2012, 13:05:38 UTC - in response to Message 45008.  

Thanks to both of you, Ageless and mo.v, for the quick replies :) I'm almost 100% sure it's not a coincidence, because it happens _every_ time I have network problems (if I don't suspend computation), but not exclusively then (a busy system sounds plausible, like the old "no heartbeat from client"?). Perhaps the Asteroids application can't handle that situation, so its tasks fail. Sadly I can't investigate further now as I'm preparing for a longer trip, so it'll be some weeks before I get back to this. In the meantime I'll micro-manage ;) I'd still like to know which log flags might help investigate this issue...

BobCat13
Joined: 6 Dec 06
Posts: 118
United States
Message 45011 - Posted: 21 Jul 2012, 14:07:20 UTC

It is not a coincidence. During the internet problem times, DNS resolution can be affected. If boinc tries connecting to a project server and can't resolve the DNS quickly, it causes the no heartbeat error. Some science applications error out with signal 11 when receiving the no heartbeat. My Linux (Lubuntu 11.10, ver. 7.0.27) has recently errored on 9 Asteroids tasks when DNS resolution was having problems.

Editing the hosts file to include the IP address of projects and running a local DNS cache has helped, but still has not completely eliminated the problem. So it may be more than DNS and possibly just general internet connection problems that can cause the no heartbeat error.

Are the boinc client communications with the internet and the client communications with the science applications linked together?

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 45012 - Posted: 21 Jul 2012, 14:29:17 UTC - in response to Message 45011.  
Last modified: 21 Jul 2012, 14:36:20 UTC

> It is not a coincidence. During the internet problem times, DNS resolution can be affected. If boinc tries connecting to a project server and can't resolve the DNS quickly, it causes the no heartbeat error. Some science applications error out with signal 11 when receiving the no heartbeat. My Linux (Lubuntu 11.10, ver. 7.0.27) has recently errored on 9 Asteroids tasks when DNS resolution was having problems.
>
> Editing the hosts file to include the IP address of projects and running a local DNS cache has helped, but still has not completely eliminated the problem. So it may be more than DNS and possibly just general internet connection problems that can cause the no heartbeat error.
>
> Are the boinc client communications with the internet and the client communications with the science applications linked together?

I agree that the task errors and the internet problems are linked, and I also agree that DNS name resolution on the flaky internet connection is likely to be implicated in that linkage.

My suspicion is that when the BOINC client asks the libcurl sub-component to connect (by name) to a project server, everything is put on hold until, at least, the resolved IP address comes back from DNS. If that involves a wait of more than 30 seconds and a timeout (which, in non-corporate environments, is plausible, because the DNS server is likely to live with your ISP at the other end of the local loop), then the heartbeat mechanism may be stalled and the errors follow.
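
To picture the mechanism suspected above, here is a toy timeline model in Python. It is only a sketch under stated assumptions, not BOINC's or libcurl's actual code: it assumes a single main loop that emits one heartbeat per iteration, and it borrows the 30-second figure from the paragraph above as the point where science applications give up. The names heartbeat_gaps and apps_would_time_out are made up for this illustration.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; assumed limit, taken from the post above


def heartbeat_gaps(loop_iterations, tick, lookup_at, lookup_seconds, blocking):
    """Simulate a client main loop that emits one heartbeat per iteration.

    Each iteration costs `tick` seconds. On iteration `lookup_at` the loop
    starts a DNS lookup that takes `lookup_seconds`. If `blocking` is True the
    loop waits for the answer (synchronous resolve); otherwise the lookup is
    assumed to run elsewhere (worker thread or async resolver) and the loop
    carries on. Returns the gaps, in seconds, between consecutive heartbeats.
    """
    now = 0.0
    beats = []
    for i in range(loop_iterations):
        beats.append(now)          # heartbeat written at the top of the iteration
        if i == lookup_at and blocking:
            now += lookup_seconds  # the whole loop is stuck inside the resolver
        now += tick
    return [later - earlier for earlier, later in zip(beats, beats[1:])]


def apps_would_time_out(gaps, timeout=HEARTBEAT_TIMEOUT):
    """True if any gap between heartbeats exceeds the assumed timeout."""
    return any(gap > timeout for gap in gaps)


if __name__ == "__main__":
    for blocking in (True, False):
        gaps = heartbeat_gaps(6, 1.0, 3, 45.0, blocking)
        print("blocking" if blocking else "non-blocking",
              "longest gap:", max(gaps),
              "timeout:", apps_would_time_out(gaps))
```

In this model a 45-second lookup inside the loop opens a heartbeat gap longer than the timeout, while the same lookup done off the main loop leaves the heartbeats evenly spaced. Whether the real client behaves this way is exactly the open question in this thread.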

An added complication is that libcurl handles all TCP/IP communications for the client, and - as well as project internet comms - that includes localhost loopback messages between the client and BOINC Manager, and any remote RPC calls that might be issued by a local aggregator like BoincTasks or BoincView.

Comms are tricky things, and failures anywhere can cause delays and problems. I recently lost a host which was listed by name in my remote_hosts.cfg file: I noticed the other machines on my LAN stuttering as they generated the "Can't resolve hostname in remote_hosts.cfg: xxx" message and notice, far more often than I would have thought was necessary.

Edit - communications between the client and the science applications are handled by files written into a shared memory area - a virtual solid-state disk. They should be exempt from the TCP/IP problems.

-ShEm-
Joined: 14 Feb 08
Posts: 28
Message 45018 - Posted: 22 Jul 2012, 4:28:39 UTC - in response to Message 45012.  

So basically I have to micro-manage for connection until ticket 113 is fixed (not likely) to avoid errors. Was looking for that one yesterday before posting, but couldn't find it so thought it was fixed...

BilBg
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 45019 - Posted: 22 Jul 2012, 8:44:03 UTC - in response to Message 45018.  
Last modified: 22 Jul 2012, 9:21:34 UTC


Which DNS Servers do you use?
I use these:

8.8.4.4
8.8.8.8
129.250.35.250
192.168.1.1

(so the Router/ISP's DNS Server comes last)


You can also automate these steps with boinccmd:
http://boinc.berkeley.edu/wiki/Boinccmd_tool

Something like this:

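BOINC_DoTransfers.bat:
rem Pause CPU and GPU computation for up to an hour and let the network run
boinccmd --set_run_mode never 3600
boinccmd --set_gpu_mode never 3600
boinccmd --set_network_mode auto

BOINC_Compute.bat:
rem Block the network for 111000 s. The 5 s "never" replaces the one-hour pause,
rem so computation should revert to its normal mode after about 5 seconds.
boinccmd --set_network_mode never 111000
boinccmd --set_run_mode never 5
boinccmd --set_gpu_mode never 5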
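
For anyone who wants the same two phases on Linux as well as Windows, here is a rough Python equivalent of the two batch files above. It is a sketch, not a tested tool: it assumes boinccmd is on the PATH and can reach the local client, and the helper names (mode_command, transfer_phase, compute_phase, run_phase) are invented for this example. Only the boinccmd options already used above are issued.

```python
import subprocess

KINDS = ("run", "gpu", "network")
MODES = ("always", "auto", "never")


def mode_command(kind, mode, duration=None, boinccmd="boinccmd"):
    """Build the argument list for one `boinccmd --set_<kind>_mode` call."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = [boinccmd, f"--set_{kind}_mode", mode]
    if duration is not None:
        argv.append(str(duration))  # temporary mode, in seconds
    return argv


def transfer_phase():
    """Same commands as the transfers batch file: pause computing, allow network."""
    return [
        mode_command("run", "never", 3600),
        mode_command("gpu", "never", 3600),
        mode_command("network", "auto"),
    ]


def compute_phase():
    """Same commands as BOINC_Compute.bat: block network, let computing resume."""
    return [
        mode_command("network", "never", 111000),
        mode_command("run", "never", 5),
        mode_command("gpu", "never", 5),
    ]


def run_phase(commands, runner=subprocess.run):
    """Execute each command in order; stop at the first one that fails."""
    for argv in commands:
        runner(argv, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_phase(...) to actually send them.
    for argv in transfer_phase() + compute_phase():
        print(" ".join(argv))
```

Calling run_phase(transfer_phase()) before a connection window and run_phase(compute_phase()) afterwards would replace the manual clicking, under the assumptions stated above.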





- ALF - "Find out what you don't do well ..... then don't do it!" :)

-ShEm-
Joined: 14 Feb 08
Posts: 28
Message 45021 - Posted: 22 Jul 2012, 10:03:03 UTC - in response to Message 45019.  

> Which DNS Servers do you use?

208.67.222.222 (OpenDNS)
208.67.220.220 (OpenDNS)
8.8.8.8 (Google)
8.8.4.4 (Google)

I realize now that the WUs that error out are actually the application's fault, not BOINC's directly, but still it bothers me... I only crunch on 3 out of 8 cores on my main system (to make less heat), so there should be (and there is) plenty of room left for other things. Still this happens, not only if the network gets problematic, but also if the (mechanical) hard drive gets busy: try creating a big non-dynamic virtual hard drive for use in VirtualBox, say 100GB, and watch BOINC Manager get unresponsive and WUs error out or report "no finished file" after creation is finished.

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 45022 - Posted: 22 Jul 2012, 10:38:35 UTC - in response to Message 45019.  

> Which DNS Servers do you use?
> I use these:
>
> 8.8.4.4
> 8.8.8.8
> 129.250.35.250
> 192.168.1.1
>
> (so the Router/ISP's DNS Server comes last)

The trouble is, all you're doing is to assume that the ISP's DNS is the cause of the problem, and bypassing it.

For that solution to work, the ISP's routers and connectivity (both upstream and downstream) have to be fully present and correct.

If the problem is BOINC's use of synchronous DNS resolving (as the very interesting quote from Nicolas in that trac ticket suggests), then a better solution would be the installation of a local caching DNS server. But then you have some tricky management decisions to make regarding caching: SETI's download server url, for example, deliberately has a TTL of 5 seconds, and according to the rules shouldn't be cached. There may be others - it depends which projects are running.
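
The caching trade-off described above can be sketched in a few lines of Python. This is an illustration only, not a replacement for a real caching DNS server: the resolve callable is a hypothetical stand-in that returns an (address, ttl_seconds) pair, and the class name TtlCache is invented here. The point it shows is that a cache which obeys the TTL helps little for a name published with a 5-second TTL.

```python
import time


class TtlCache:
    """Tiny in-memory name cache that never keeps an answer past its TTL."""

    def __init__(self, resolve, clock=time.monotonic):
        # `resolve` is a hypothetical callable: name -> (address, ttl_seconds).
        self._resolve = resolve
        self._clock = clock
        self._entries = {}  # name -> (address, expiry time)
        self.upstream_queries = 0

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no network round trip needed
        address, ttl = self._resolve(name)
        self.upstream_queries += 1
        if ttl > 0:
            self._entries[name] = (address, now + ttl)
        else:
            self._entries.pop(name, None)  # TTL 0 means "do not cache"
        return address


if __name__ == "__main__":
    # Made-up records using documentation-only names and addresses.
    records = {
        "short-ttl.example": ("192.0.2.1", 5),
        "long-ttl.example": ("192.0.2.2", 3600),
    }
    fake_now = [0.0]
    cache = TtlCache(lambda name: records[name], clock=lambda: fake_now[0])
    for name in records:
        cache.lookup(name)
    fake_now[0] = 10.0  # ten seconds later
    for name in records:
        cache.lookup(name)
    print("upstream queries:", cache.upstream_queries)
```

After ten simulated seconds only the long-TTL name is still served from the cache; the short-TTL name has to go upstream again, which is where a flaky link would bite.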

In Windows, the command

ipconfig /displaydns

is a useful tool for getting an idea of what caching is allowed on the sites you visit regularly - I've just discovered that SETI have slowed down their round-robin DNS with a TTL of at least 50 seconds (ipconfig shows the remaining TTL since the last lookup, not the full value).

BilBg
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 45026 - Posted: 23 Jul 2012, 11:55:37 UTC - in response to Message 45021.  
Last modified: 23 Jul 2012, 12:15:53 UTC


What happens if you put the Google DNS Servers at the top?

Since OpenDNS 'proudly presents' ;) "Web filtering, malware and botnet protection" it may wrongly filter some project sites.
http://www.opendns.com/
http://en.wikipedia.org/wiki/OpenDNS

Also Anycast (used by OpenDNS) can have some influence:
"With TCP anycast, there are cases where the receiver selected for any given source may change from time to time as optimal routes change, silently breaking any conversations that may be in progress at the time"
http://en.wikipedia.org/wiki/Anycast

You can also check whether a VPN such as Hotspot Shield improves the connection:
http://www.hotspotshield.com/





- ALF - "Find out what you don't do well ..... then don't do it!" :)

BobCat13
Joined: 6 Dec 06
Posts: 118
United States
Message 45095 - Posted: 28 Jul 2012, 19:42:36 UTC - in response to Message 45022.  

Here is another example of the no heartbeat error causing tasks to error out with signal 11 on Linux. In this case it was not an internet connection problem, but a misbehaving project (DNA@home) that was holding up the client from communicating with the science applications. The first time, 4 Asteroids tasks and 1 WUProp task errored; the second time, Correlizer exited but managed to recover; and the third time, 4 WCG HCMD2 tasks and 1 WUProp task errored. DNA is now set to NNT (No New Tasks), so the client is not contacting the project anymore.

DNA@Home 7-28-2012 8:13:31 AM Sending scheduler request: To fetch work.
DNA@Home 7-28-2012 8:13:31 AM Requesting new tasks for CPU
Asteroids@home 7-28-2012 8:14:21 AM Computation for task ps_120622b_45259_3_3 finished
Asteroids@home 7-28-2012 8:14:21 AM Resuming task ps_120622b_87028_1_3 using period_search version 10000 in slot 1
Asteroids@home 7-28-2012 8:14:22 AM Computation for task ps_120622b_1333_68_3 finished
correlizer 7-28-2012 8:14:22 AM Starting task rc_5974237_0 using correlizer_rc version 109 in slot 0
Asteroids@home 7-28-2012 8:14:23 AM Started upload of ps_120622b_45259_3_3_0
Asteroids@home 7-28-2012 8:14:23 AM Computation for task ps_120622b_5179_4_3 finished
correlizer 7-28-2012 8:14:23 AM Starting task rc_5977379_0 using correlizer_rc version 109 in slot 3
Asteroids@home 7-28-2012 8:14:25 AM Computation for task ps_120622b_87028_1_3 finished
correlizer 7-28-2012 8:14:25 AM Starting task rc_5961961_1 using correlizer_rc version 109 in slot 1
WUProp@Home 7-28-2012 8:14:26 AM Computation for task wu_v3_1340211602_1228320_0 finished
correlizer 7-28-2012 8:14:27 AM Task rc_5974449_0 exited with zero status but no 'finished' file
correlizer 7-28-2012 8:14:27 AM If this happens repeatedly you may need to reset the project.
correlizer 7-28-2012 8:14:27 AM Restarting task rc_5974449_0 using correlizer_rc version 109 in slot 5
WUProp@Home 7-28-2012 8:14:28 AM Started upload of wu_v3_1340211602_1228320_0_0
WUProp@Home 7-28-2012 8:14:31 AM Finished upload of wu_v3_1340211602_1228320_0_0
7-28-2012 8:15:32 AM Project communication failed: attempting access to reference site
DNA@Home 7-28-2012 8:15:32 AM Scheduler request failed: Timeout was reached
7-28-2012 8:15:35 AM Internet access OK - project servers may be temporarily down.

DNA@Home 7-28-2012 8:16:44 AM Fetching scheduler list
correlizer 7-28-2012 8:17:33 AM Task rc_5974237_0 exited with zero status but no 'finished' file
correlizer 7-28-2012 8:17:33 AM If this happens repeatedly you may need to reset the project.
correlizer 7-28-2012 8:17:33 AM Restarting task rc_5974237_0 using correlizer_rc version 109 in slot 0
correlizer 7-28-2012 8:17:34 AM Task rc_5977379_0 exited with zero status but no 'finished' file
correlizer 7-28-2012 8:17:34 AM If this happens repeatedly you may need to reset the project.
World Community Grid 7-28-2012 8:17:34 AM Finished download of b6783c592df095adaf9983c6a703fc54.dat.gzb
World Community Grid 7-28-2012 8:17:34 AM Started download of 78f88560db6f55288ca0cc71caa330d2.dat.gzb
correlizer 7-28-2012 8:17:34 AM Restarting task rc_5977379_0 using correlizer_rc version 109 in slot 3
correlizer 7-28-2012 8:17:35 AM Task rc_5961961_1 exited with zero status but no 'finished' file
correlizer 7-28-2012 8:17:35 AM If this happens repeatedly you may need to reset the project.
correlizer 7-28-2012 8:17:35 AM Restarting task rc_5961961_1 using correlizer_rc version 109 in slot 1
correlizer 7-28-2012 8:17:37 AM Task rc_5974449_0 exited with zero status but no 'finished' file
correlizer 7-28-2012 8:17:37 AM If this happens repeatedly you may need to reset the project.
correlizer 7-28-2012 8:17:37 AM Restarting task rc_5974449_0 using correlizer_rc version 109 in slot 5
7-28-2012 8:18:44 AM Project communication failed: attempting access to reference site
7-28-2012 8:18:46 AM Internet access OK - project servers may be temporarily down.

DNA@Home 7-28-2012 9:52:52 AM Sending scheduler request: To fetch work.
DNA@Home 7-28-2012 9:52:52 AM Requesting new tasks for CPU
WUProp@Home 7-28-2012 9:53:27 AM Computation for task wu_v3_1340211602_1231743_0 finished
World Community Grid 7-28-2012 9:53:28 AM Computation for task CMD2_2514-NALDLA.clustersOccur-UGPA2A.clustersOccur_84_196518_197284_0 finished
World Community Grid 7-28-2012 9:53:28 AM Starting task CMD2_2514-NALDLA.clustersOccur-UGPA2A.clustersOccur_97_225982_226719_0 using hcmd2 version 640 in slot 0
World Community Grid 7-28-2012 9:53:29 AM Computation for task CMD2_2514-MLRS.clustersOccur-TPM1A.clustersOccur_1_3194_3579_0 finished
World Community Grid 7-28-2012 9:53:29 AM Starting task CMD2_2514-MLRS.clustersOccur-TPM1A.clustersOccur_1_2808_3193_0 using hcmd2 version 640 in slot 2
World Community Grid 7-28-2012 9:53:30 AM Computation for task CMD2_2514-NALDLA.clustersOccur-TELTA.clustersOccur_80_102833_103193_0 finished
World Community Grid 7-28-2012 9:53:30 AM Starting task CMD2_2514-DDX3XA.clustersOccur-SERA.clustersOccur_13_113040_117025_0 using hcmd2 version 640 in slot 1
World Community Grid 7-28-2012 9:53:31 AM Computation for task CMD2_2514-DDX3XA.clustersOccur-SERA.clustersOccur_15_130960_132351_0 finished
World Community Grid 7-28-2012 9:53:31 AM Starting task CMD2_2514-NALDLA.clustersOccur-TELTA.clustersOccur_123_157350_157975_0 using hcmd2 version 640 in slot 3
7-28-2012 9:54:53 AM Project communication failed: attempting access to reference site
DNA@Home 7-28-2012 9:54:53 AM Scheduler request failed: Timeout was reached
7-28-2012 9:54:56 AM Internet access OK - project servers may be temporarily down.

-ShEm-
Joined: 14 Feb 08
Posts: 28
Message 45647 - Posted: 14 Sep 2012, 10:27:38 UTC

Sorry for resurrecting this one, but during my current trip to a country with decent internet-connections, I found I can very easily replicate this behavior of boincclient (linux / windows, doesn't matter) locking up when trying to connect to internet:

Connect to a router (dhcp / manual, wire / wireless, ISP-dns / router-dns / manual-dns, doesn't matter) so your computer registers that it's connected to a network. Disconnect the cable from the router to "the internet" ;) and the next time boincclient wants to connect, it starts locking up. You can't use boincmanager to suspend network activity (I didn't try with boinccmd this time, but I seem to remember that also didn't work).

To get a response from boincclient again, first stop the boinc-service (I don't have any non-service boinc installation, so didn't check). Then edit "client_state.xml". Locate "<user_network_request>2</user_network_request>" near the end, change the value to "3" and save. Restart the boinc-service and all is well again.
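
The hand edit described above could be scripted roughly as follows. Treat it as a sketch with assumptions: it expects exactly one <user_network_request> element in the file, it must only be run while the BOINC service is stopped, and the function names are made up for this example. As far as I know, 2 and 3 correspond to the "auto" and "never" network modes, but that mapping is an assumption here, not something confirmed in this thread.

```python
import re
import shutil

# Matches the element quoted in the post, tolerating whitespace around the number.
_ELEMENT = re.compile(r"(<user_network_request>)\s*\d+\s*(</user_network_request>)")


def set_user_network_request(xml_text, value):
    """Return xml_text with the user_network_request value replaced by `value`."""
    new_text, count = _ELEMENT.subn(rf"\g<1>{int(value)}\g<2>", xml_text)
    if count != 1:
        raise ValueError(f"expected exactly one element, found {count}")
    return new_text


def patch_client_state(path, value=3):
    """Rewrite the file in place, keeping a .bak copy of the original."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched = set_user_network_request(original, value)
    shutil.copyfile(path, path + ".bak")  # backup before touching anything
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(patched)


if __name__ == "__main__":
    sample = "<client_state>\n<user_network_request>2</user_network_request>\n</client_state>\n"
    print(set_user_network_request(sample, 3))
```

The function refuses to write anything if the element is missing or appears more than once, so a file with an unexpected layout is left untouched.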

My main complaint is that boincclient completely locks up while waiting for connection. Since it obviously has no problem handling multiple processes, why does network-connection seemingly change it into a single-process-only program?

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 45648 - Posted: 14 Sep 2012, 10:59:02 UTC - in response to Message 45647.  

Forwarded to development.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.