Message boards : Questions and problems : Network problems = Unrecoverable error...
| Author | Message |
|---|---|
|
I very frequently get network errors (connection lost, connection reset, etc.) because of the very bad internet where I currently live. But why can't boinc-client handle that while trying to connect? Every time, I either get lots of "exited with zero status but no 'finished' file" or simply "Unrecoverable error"! WTF? Why are computation and networking linked so closely? Shouldn't they be separate processes that don't interfere with each other? Considering how many hours of work I'm losing (today about 12 hours in 4 WUs), I'd rather not run BOINC then...

2012-07-21T15:41:50 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_0
2012-07-21T15:41:55 BDT | Einstein@Home | Finished upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_0
2012-07-21T15:44:02 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:45:06 BDT | | Project communication failed: attempting access to reference site
2012-07-21T15:45:06 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error
2012-07-21T15:45:06 BDT | Einstein@Home | Backing off 5 min 37 sec on upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:46:43 BDT | | Internet access OK - project servers may be temporarily down.
2012-07-21T15:50:44 BDT | Einstein@Home | Started upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Deferring communication for 1 min 50 sec
2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_42_2 (process got signal 11)
2012-07-21T15:51:37 BDT | Asteroids@home | Computation for task ps_120622b_632_42_2 finished
2012-07-21T15:51:37 BDT | Asteroids@home | Starting task ps_120622b_621_226_2 using period_search version 10000 in slot 3
2012-07-21T15:51:38 BDT | Asteroids@home | [sched_op] Deferring communication for 3 min 4 sec
2012-07-21T15:51:38 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_43_3 (process got signal 11)
2012-07-21T15:51:38 BDT | Asteroids@home | Computation for task ps_120622b_632_43_3 finished
2012-07-21T15:51:38 BDT | Asteroids@home | Starting task ps_120622b_621_50_2 using period_search version 10000 in slot 0
2012-07-21T15:51:39 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file
2012-07-21T15:51:39 BDT | Einstein@Home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:51:39 BDT | Asteroids@home | Started upload of ps_120622b_632_42_2_0
2012-07-21T15:51:39 BDT | Einstein@Home | Restarting task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 using einsteinbinary_BRP4 version 122 (BRP4SSE) in slot 1
2012-07-21T15:52:00 BDT | SETI@home | Task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 exited with zero status but no 'finished' file
2012-07-21T15:52:00 BDT | SETI@home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:52:00 BDT | SETI@home | Restarting task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 using astropulse_v6 version 601 in slot 2
2012-07-21T15:52:29 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error
2012-07-21T15:52:29 BDT | Einstein@Home | Backing off 15 min 20 sec on upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3
2012-07-21T15:52:30 BDT | | Project communication failed: attempting access to reference site
2012-07-21T15:52:48 BDT | | Internet access OK - project servers may be temporarily down.
2012-07-21T15:54:31 BDT | | Project communication failed: attempting access to reference site
2012-07-21T15:54:31 BDT | Asteroids@home | Temporarily failed upload of ps_120622b_632_42_2_0: transient HTTP error
2012-07-21T15:54:31 BDT | Asteroids@home | Backing off 2 min 22 sec on upload of ps_120622b_632_42_2_0
2012-07-21T15:54:31 BDT | Asteroids@home | Started upload of ps_120622b_632_43_3_0
2012-07-21T15:55:11 BDT | Asteroids@home | [sched_op] Deferring communication for 7 min 58 sec
2012-07-21T15:55:11 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_621_226_2 (process got signal 11)
2012-07-21T15:55:11 BDT | Asteroids@home | Computation for task ps_120622b_621_226_2 finished
2012-07-21T15:55:11 BDT | Asteroids@home | Starting task ps_120622b_632_96_2 using period_search version 10000 in slot 3
2012-07-21T15:55:12 BDT | Asteroids@home | [sched_op] Deferring communication for 15 min 55 sec
2012-07-21T15:55:12 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_621_50_2 (process got signal 11)
2012-07-21T15:55:12 BDT | Asteroids@home | Computation for task ps_120622b_621_50_2 finished
2012-07-21T15:55:12 BDT | Asteroids@home | Starting task ps_120622b_272_153_2 using period_search version 10000 in slot 0
2012-07-21T15:55:13 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file
2012-07-21T15:55:13 BDT | Einstein@Home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:55:13 BDT | Einstein@Home | Restarting task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 using einsteinbinary_BRP4 version 122 (BRP4SSE) in slot 1
2012-07-21T15:55:14 BDT | SETI@home | Task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 exited with zero status but no 'finished' file
2012-07-21T15:55:14 BDT | SETI@home | If this happens repeatedly you may need to reset the project.
2012-07-21T15:55:14 BDT | SETI@home | Restarting task ap_12mr10ad_B0_P0_00350_20120721_03193.wu_0 using astropulse_v6 version 601 in slot 2
2012-07-21T15:55:45 BDT | | Internet access OK - project servers may be temporarily down.
2012-07-21T15:55:59 BDT | Asteroids@home | Temporarily failed upload of ps_120622b_632_43_3_0: transient HTTP error
2012-07-21T15:55:59 BDT | Asteroids@home | Backing off 2 min 7 sec on upload of ps_120622b_632_43_3_0
2012-07-21T15:55:59 BDT | Asteroids@home | Started upload of ps_120622b_621_226_2_0
2012-07-21T15:56:02 BDT | Einstein@Home | [sched_op] Starting scheduler request
2012-07-21T15:56:02 BDT | Einstein@Home | Sending scheduler request: To fetch work.
2012-07-21T15:56:02 BDT | Einstein@Home | Reporting 1 completed tasks, requesting new tasks for CPU
2012-07-21T15:56:02 BDT | Einstein@Home | [sched_op] CPU work request: 494462.67 seconds; 0.00 devices
2012-07-21T15:56:30 BDT | Asteroids@home | Finished upload of ps_120622b_621_226_2_0
2012-07-21T15:56:30 BDT | Asteroids@home | Started upload of ps_120622b_621_50_2_0
2012-07-21T15:56:43 BDT | Asteroids@home | Finished upload of ps_120622b_621_50_2_0
2012-07-21T15:56:52 BDT | Einstein@Home | Scheduler request completed: got 10 new tasks
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Server version 611
2012-07-21T15:56:52 BDT | Einstein@Home | Project requested delay of 60 seconds
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] estimated total CPU task duration: 600882 seconds
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] handle_scheduler_reply(): got ack for task LATeah2011G_288.0_40200_0.0_0
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
2012-07-21T15:56:52 BDT | Einstein@Home | [sched_op] Reason: requested by project
2012-07-21T15:56:54 BDT | Asteroids@home | Started upload of ps_120622b_632_42_2_0
2012-07-21T15:56:54 BDT | Einstein@Home | Started download of skygrid_LATeah2011G_0928.0.dat
2012-07-21T15:57:57 BDT | Asteroids@home | Finished upload of ps_120622b_632_42_2_0
2012-07-21T15:58:07 BDT | Asteroids@home | Started upload of ps_120622b_632_43_3_0
2012-07-21T15:59:05 BDT | Asteroids@home | Finished upload of ps_120622b_632_43_3_0 | |
| ID: 45007 · | |
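One way to check whether the task errors in a log like the one above cluster around network failures is to parse the event log and look for task errors that fall within a time window of a failed connection. The following is only a rough sketch: the function names, the marker strings chosen, and the 10-minute window are assumptions made for illustration, and co-occurrence in time does not by itself prove that one causes the other.

```python
from datetime import datetime, timedelta

# Substrings taken from the log above; which ones count as "network" or
# "task" events is an assumption made for this sketch.
NETWORK_MARKERS = ("Project communication failed", "transient HTTP error")
TASK_MARKERS = ("Unrecoverable error", "exited with zero status")


def parse_line(line):
    """Split 'timestamp TZ | project | message' into (datetime, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Drop the zone abbreviation (e.g. 'BDT'); only relative times matter here.
    when = datetime.strptime(stamp.split()[0], "%Y-%m-%dT%H:%M:%S")
    return when, project, message


def task_errors_near_network_failures(lines, window=timedelta(minutes=10)):
    """Return task-error events that occur within `window` of any network failure."""
    events = [parse_line(line) for line in lines if line.strip()]
    net_times = [t for t, _, msg in events
                 if any(marker in msg for marker in NETWORK_MARKERS)]
    hits = []
    for when, project, message in events:
        if any(marker in message for marker in TASK_MARKERS):
            if any(abs(when - n) <= window for n in net_times):
                hits.append((when, project, message))
    return hits
```

Feeding it the saved event log one line per entry gives a list of suspicious task errors; shrinking the window shows how tight the clustering really is.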
|
I'm going out on a limb here, but I feel that they're a coincidence. Perhaps related, but not in the way that you think they are.

2012-07-21T15:45:06 BDT | Einstein@Home | Temporarily failed upload of b2030.20110423.G58.47+01.12.N.b1s0g0.00000_2168_1_3: transient HTTP error

This error is due to your bad internet connection: it happens when the connection is dropped during communication with the server. So yup, your internet connection is at play here.

2012-07-21T15:51:37 BDT | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_632_42_2 (process got signal 11)

This one, though, is a computation error, and has to do with either the application being not so stable, trouble with your memory or your virtual memory (page file), or a bad batch of tasks (it happens).

2012-07-21T15:51:39 BDT | Einstein@Home | Task b2030.20110422.G42.42-01.58.S.b3s0g0.00000_80_1 exited with zero status but no 'finished' file

This one happens when something outside of BOINC is interfering with the running of BOINC, like an anti-virus, anti-spyware or other anti-malware program actively scanning the BOINC Data directory.

That you see all three at the same time in your log can be due to the extra stress that BOINC puts on a system even on a normal day. Running all kinds of calculations through BOINC already stresses a computer, and network problems add to that load. With the network card these days usually integrated into the motherboard, it's (part of) the CPU that has to cater for the network connection. And when that CPU is busy doing intricate calculations... I see that here as well on an otherwise stable system. Throw a slow network transfer into the mix and my computer struggles. But when I use a separate PCI 1000 Mbit add-on card, the whole system flies, no matter what. And it ain't an old & slow one either. ;-)

So, first checks first:
1. Do you have an anti-virus or other anti-malware program scanning actively in the background?
1a. Is your BOINC Data directory excluded from being scanned?
2. What kind of system is it?
3. The network card, is it integrated into the motherboard, or a separate add-on card?
3a. If integrated, do you have the option to try an add-on card? (No, I didn't say you have to go out and buy one... :-))

Now, there is a problem with BOINC that I reported recently: when BOINC comes out of hibernation or sleep, the network card hasn't reinitialized yet, and BOINC has downloads waiting, it will try to do those before there is an internet connection, which results in corrupted files. However, this is a difficult one to track and reproduce. (Not all projects have download problems ;-))
____________
Jord
-BOINC FAQ Service
-BOINC 7.0 FAQ
Go, seize the day, wake up and say: This is an Extraordinary life! -- Asia, An Extraordinary Life | |
| ID: 45008 · | |
|
Hi ShEm | |
| ID: 45009 · | |
|
Thanks to both of you, Ageless and mo.v, for the quick replies :) I'm almost 100% sure it's not a coincidence, because it happens _every_ time I have network problems (unless I suspend computation), though not exclusively then (a busy system sounds plausible, like the old "no heartbeat from client"?). Perhaps the Asteroids application can't handle that situation, so its tasks fail. Sadly I can't investigate further now as I'm preparing for a longer trip, so it'll be some weeks before I get back to this. In the meantime I'll micro-manage ;) I'd still like to know which log flags could help investigate this issue... | |
| ID: 45010 · | |
|
It is not a coincidence. During the internet problem times, DNS resolution can be affected. If boinc tries connecting to a project server and can't resolve the DNS quickly, it causes the no heartbeat error. Some science applications error out with signal 11 when receiving the no heartbeat. My Linux (Lubuntu 11.10, ver. 7.0.27) has recently errored on 9 Asteroids tasks when DNS resolution was having problems. | |
| ID: 45011 · | |
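To make the "no heartbeat" idea concrete, here is a minimal sketch of a heartbeat watchdog as a science application might see it. It is not BOINC's actual code: the class name, the 30-second timeout and the injectable clock are all assumptions for illustration. The point is only that if the client's main loop is stuck (for example in a slow DNS lookup) and stops calling `beat()`, the application concludes that the client has gone away.

```python
import time


class HeartbeatMonitor:
    """Hypothetical watchdog: the client must call beat() at least every `timeout` seconds."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # assumed value, not taken from BOINC source
        self.clock = clock          # injectable so the logic can be tested without waiting
        self.last_beat = clock()

    def beat(self):
        """Record that the client is still alive."""
        self.last_beat = self.clock()

    def client_alive(self):
        """True while the last heartbeat is no older than the timeout."""
        return (self.clock() - self.last_beat) <= self.timeout


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = HeartbeatMonitor(timeout=30.0, clock=lambda: fake_now[0])
    fake_now[0] = 45.0  # the client was blocked for 45 s and sent no heartbeat
    print("client alive?", monitor.client_alive())
```

In this sketch a 45-second stall with a 30-second timeout makes `client_alive()` return False, which is the situation in which an application would give up on the client.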
"It is not a coincidence. During the internet problem times, DNS resolution can be affected. If boinc tries connecting to a project server and can't resolve the DNS quickly, it causes the no heartbeat error. Some science applications error out with signal 11 when receiving the no heartbeat. My Linux (Lubuntu 11.10, ver. 7.0.27) has recently errored on 9 Asteroids tasks when DNS resolution was having problems."

I agree that the task errors and the internet problems are linked, and I also agree that DNS name resolution on the flaky internet connection is likely to be implicated in that linkage.

My suspicion is that when the BOINC client asks the libcurl sub-component to connect (by name) to a project server, everything is put on hold until, at least, the resolved IP address comes back from DNS. If that involves a wait of more than 30 seconds and a timeout (which, in non-corporate environments, is plausible, because the DNS server is likely to live with your ISP at the other end of the local loop), then the heartbeat mechanism may be stalled and the errors follow.

An added complication is that libcurl handles all TCP/IP communications for the client, and, as well as project internet comms, that includes localhost loopback messages between the client and BOINC Manager, and any remote RPC calls that might be issued by a local aggregator like BoincTasks or BoincView. Comms are tricky things, and failures anywhere can cause delays and problems. I recently lost a host which was listed by name in my remote_hosts.cfg file: I noticed the other machines on my LAN stuttering as they generated the "Can't resolve hostname in remote_hosts.cfg: xxx" message and notice, far more often than I would have thought was necessary.

Edit - communications between the client and the science applications are handled by files written into a shared memory area - a virtual solid-state disk. They should be exempt from the TCP/IP problems. | |
| ID: 45012 · | |
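The suspicion above, a blocking name lookup holding up everything else, can be illustrated with a small sketch. This is not how BOINC or libcurl is implemented; `resolve_with_deadline`, the pool size and the 2-second deadline are hypothetical. It only shows the general technique of pushing a blocking `socket.getaddrinfo` call onto a worker thread so that the main loop can give up waiting after a deadline and carry on with other duties such as heartbeats.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small shared pool; a stuck lookup occupies one worker instead of the main loop.
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_deadline(host, port=80, deadline=2.0, resolver=socket.getaddrinfo):
    """Resolve `host` on a worker thread; return None if no answer within `deadline` seconds.

    `resolver` is injectable (an assumption of this sketch) so the behaviour
    can be exercised without a real, slow DNS server.
    """
    future = _pool.submit(resolver, host, port)
    try:
        return future.result(timeout=deadline)
    except FutureTimeout:
        # The lookup keeps running in the background; the caller is free again.
        return None
    except OSError:
        # socket.gaierror (name not found, no network) is a subclass of OSError.
        return None
```

A caller would treat None as "try again later" and apply its normal backoff, rather than sitting inside the resolver for as long as the operating system chooses to wait.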
|
So basically I have to micro-manage for connection until ticket 113 is fixed (not likely) to avoid errors. Was looking for that one yesterday before posting, but couldn't find it so thought it was fixed... | |
| ID: 45018 · | |
|
| |
| ID: 45019 · | |
"Which DNS Servers do you use?"

208.67.222.222 (OpenDNS)
208.67.220.220 (OpenDNS)
8.8.8.8 (Google)
8.8.4.4 (Google)

I realize now that the WUs erroring out is actually the application's fault, not BOINC's directly, but still it bothers me... I only crunch on 3 out of 8 cores on my main system (to generate less heat), so there should be (and there is) plenty of room left for other things. Still this happens, not only when the network gets problematic, but also when the (mechanical) hard drive gets busy: try creating a big non-dynamic virtual hard drive for use in VirtualBox, say 100 GB, and watch BOINC Manager become unresponsive and WUs error out or report "no finished file" after the creation is finished. | |
| ID: 45021 · | |
"Which DNS Servers do you use?"

The trouble is, all you're doing is assuming that the ISP's DNS is the cause of the problem, and bypassing it. For that solution to work, the ISP's routers and connectivity (both upstream and downstream) have to be fully present and correct.

If the problem is BOINC's use of synchronous DNS resolving (as the very interesting quote from Nicolas in that trac ticket suggests), then a better solution would be the installation of a local caching DNS server. But then you have some tricky management decisions to make regarding caching: SETI's download server URL, for example, deliberately has a TTL of 5 seconds, and according to the rules shouldn't be cached. There may be others - it depends which projects are running.

In Windows, the command ipconfig /displaydns is a useful tool for getting an idea of what caching is allowed on the sites you visit regularly - I've just discovered that SETI have slowed down their round-robin DNS with a TTL of at least 50 seconds (ipconfig shows the remaining TTL since the last lookup, not the full value). | |
| ID: 45022 · | |
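The caching trade-off described above can be sketched as a tiny TTL-respecting cache. This is an illustration only, not a real DNS server: the class name, the example hostname and the address are made up, and the clock is injectable purely so the expiry logic can be checked without waiting. An entry stored with a TTL of 5 seconds is served for at most 5 seconds, and `remaining_ttl` mirrors the "time left since the last lookup" figure that a cache listing reports.

```python
import time


class DnsTtlCache:
    """Hypothetical name -> address cache that never serves an entry past its TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._entries = {}  # name -> (address, absolute expiry time)

    def put(self, name, address, ttl):
        """Store an answer together with the moment it stops being valid."""
        self._entries[name] = (address, self.clock() + ttl)

    def get(self, name):
        """Return the cached address, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires = entry
        if self.clock() >= expires:
            del self._entries[name]  # expired: force a fresh lookup next time
            return None
        return address

    def remaining_ttl(self, name):
        """Seconds left before the entry expires (0.0 if absent or expired)."""
        entry = self._entries.get(name)
        if entry is None:
            return 0.0
        return max(0.0, entry[1] - self.clock())
```

With a 5-second TTL such a cache helps very little during an outage longer than a few seconds, which is exactly the management decision mentioned above: honouring short TTLs keeps round-robin load balancing working, while overriding them trades correctness for resilience.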
|
| |
| ID: 45026 · | |
|
Here is another example of the no heartbeat error causing tasks to error out with signal 11 on Linux. In this case it was not an internet connection problem, but a misbehaving project (DNA@home) that was holding up the client from communicating with the science applications. First time, 4 Asteroids and 1 WUProp tasks errored, second time Correlizer exited but managed to recover, and the third time 4 WCG HCMD2 and 1 WUProp tasks errored. DNA is now on NNT, so it is not contacting the project anymore. | |
| ID: 45095 · | |
|
Sorry for resurrecting this one, but during my current trip to a country with decent internet-connections, I found I can very easily replicate this behavior of boincclient (linux / windows, doesn't matter) locking up when trying to connect to internet: | |
| ID: 45647 · | |
|
Forwarded to development. | |
| ID: 45648 · | |
Copyright © 2013 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.