Bad things happen on loss of network connectivity

Author	Message
GeneAZ Joined: 28 Jun 14 Posts: 27	Message 54634 - Posted: 28 Jun 2014, 19:30:52 UTC

boinc/boincmgr version 7.2.42 (wxWidgets 2.8.12)
Linux 3.2.46 (x86_64)
AMD FX-4300, 4-core + Nvidia GTX-650
Only Setiathome project active; both CPU and GPU tasks enabled, configured for 3 concurrent CPU and 1 GPU (with CPU allocated to feed it).

On June 25, my internet router failed, un-noticed for several hours. The end result was 64 tasks running for a few seconds and ending with computation error status. In Linux it was a signal 11 (segmentation violation). The event log was showing continuous cycles of starting upload -> upload failed -> backoff 1:xx .

To try to recreate this failure mode today I "pulled the plug" on the system ethernet cable to the network router. The same boinc failure developed and here is the sequence of symptoms:

(1) 08:25:00 network manually disconnected
(2) 08:28:35 GPU task #1 finished
(3) 08:28:35 GPU task #2 started
(4) 08:28:37 start upload task #1
(5) 08:28:58 ..upload failed, backoff 03:51
(6) 08:29:18 Event log shows internet access lost
(7) 08:32:49 start upload task #1
(8) 08:33:10 ..upload failed, backoff 02:34
(9) 08:35:44 start upload task #1
(10) 08:36:05 ..upload failed, backoff 03:28
     Not shown in Event log, but Project backoff approx. 9 minutes invoked.
(11) 08:49:18 start upload task #1
(12) 08:49:38 ..upload failed, backoff 02:18
     Not shown in Event log, but Project backoff approx. 29 minutes invoked.
(13) 08:59:27 GPU task #2 finished
(14) 08:59:27 GPU task #3 started
(15) 08:59:29 start upload task #2
     * even though Project backoff is still active *
(16) 08:59:50 ..upload failed, backoff 1:09
     Not shown in Event log, but Project backoff went to ~1 hr 40 min.
(17) 09:03:20 start upload task #2
(18) 09:03:40 ..upload failed, backoff 1:28
     Not shown in Event log, but Project backoff went to > 8 hours
(19) 09:05:09 start upload task #2
(20) 09:05:30 ..upload failed, backoff 1:15
     Project backoff probably changed but not recorded.
(21) 09:06:47 start upload task #2
(22) 09:07:07 ..upload failed, backoff 1:58
     Not shown in Event log, but Project backoff went to ~4 hr 57 min.
>>clip<< more upload failures
(23) 09:14:08 CPU task #3 finished
(24) 09:14:08 CPU task #4 started
(25) 09:14:10 upload task #3
(26) 09:14:31 ..upload failed, backoff 1:34
(27) 09:14:31 upload task #2
(28) 09:14:51 ..upload failed, backoff 1:01

* At this time it is noticed that the upload of task #1 remains "pending", presumably because of the long Project Backoff interval. HOWEVER, the subsequent tasks #2 & #3 do NOT seem to honor the Project backoff and continue to attempt uploads on their own (short) backoff intervals. ALSO, it is noted that the 1-second updates of the Tasks and Transfers displays, from boincmgr, freeze for the duration of the 20-second timeout during upload attempts.

For the rest of the Event log data there are MANY failed upload attempts, with backoffs of approx. 1 minute, and with 3 tasks cycling through their 20-second timeouts the boincmgr is nearly frozen. I am leaving out the upload messages.

(29) 09:30:04 GPU task #5 started
(30) 09:43:46 GPU task #5 "finished", exited with zero status
     HOWEVER, task Status still = Running
(31) 09:50:40 CPU task #? finished
(32) 09:50:40 CPU task #6 started
(33) 09:50:42 upload task (CPU#?)
(34) 09:51:03 ..upload failed, backoff 1:00
(35) 09:52:45 GPU task #5 "finished", exited with zero status
     HOWEVER, task Status still = Running
(36) 09:58:29 GPU task #5 finished, exited with zero status
     ...still Running...
(37) 10:05:12 GPU task #5 finished
(38) 10:05:12 GPU task #6 started
(39) 10:05:52 GPU task #6 finished (computation error)
(40) 10:05:52 GPU task #7 started
(41) 10:06:33 GPU task #7 finished (computation error)

* At this time the test was terminated by suspending GPU activity and reconnecting the network cable.

My observation is that the second, and all subsequent, tasks in the upload queue do not honor the Project Backoff interval. As additional tasks finish their computation they flood the inoperative upload path and (possibly) block some essential boincmgr, or operating system, functions, eventually leading to fatal task terminations.

Since this failure can be reproduced (relatively) easily I can re-run the test with appropriate cc_config.xml options which may be helpful. It just means I would need to babysit the process for about an hour to intervene when things start to go haywire.

Somebody tell me if this is a known bug. Or, if it may be specific to the Seti applications. It just looks to me like a boinc manager issue.

Thanks.

Gary Charpentier Joined: 23 Feb 08 Posts: 2465	Message 54636 - Posted: 28 Jun 2014, 21:06:58 UTC - in response to Message 54634.

> boinc/boincmgr version 7.2.42 (wxWidgets 2.8.12)
> Linux 3.2.46 (x86_64)
> AMD FX-4300, 4-core + Nvidia GTX-650
> Only Setiathome project active; both CPU and GPU tasks enabled, configured for 3 concurrent CPU and 1 GPU (with CPU allocated to feed it).
> [snip]
> ***At this time the test was terminated by suspending GPU activity and reconnecting the network cable.
> My observation is that the second, and all subsequent, tasks in the upload queue do not honor the Project Backoff interval. As additional tasks finish their computation they flood the inoperative upload path and (possibly) block some essential boincmgr, or operating system, functions. Eventually leading to fatal task terminations.
> Since this failure can be reproduced (relatively) easily I can re-run the test with appropriate cc_config.xml options which may be helpful. It just means I would need to babysit the process for about an hour to intervene when things start to go haywire.
> Somebody tell me if this is a known bug. Or, if it may be specific to the Seti applications. It just looks to me like a boinc manager issue.
> Thanks.

Might be a Linux feature (bug described by marketing). I suspect when the network interface goes down the loopback also goes down. As BOINC uses the loopback to signal between processes all heck breaks out. You can check for this with some Linux tools.
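
One way to test the loopback suspicion above, besides the usual Linux tools, is a small probe run before and after pulling the cable. The sketch below is only an illustration: the function name `loopback_ok` is made up here, and it checks nothing more than whether a TCP connection over 127.0.0.1 can still be opened. It says nothing about how BOINC itself uses the loopback interface.

```python
import socket


def loopback_ok(timeout=2.0):
    """Return True if a TCP connection over 127.0.0.1 can be established.

    Hypothetical helper for this thread, not part of BOINC.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            # Port 0 asks the kernel for any free port on the loopback address.
            server.bind(("127.0.0.1", 0))
            server.listen(1)
            port = server.getsockname()[1]
            # The kernel completes the handshake against the listen backlog,
            # so no accept() call is needed for this check.
            with socket.create_connection(("127.0.0.1", port), timeout=timeout):
                return True
    except OSError:
        return False


if __name__ == "__main__":
    print("loopback usable:", loopback_ok())
```

If this still prints True with the ethernet cable unplugged, the loopback interface is probably not what is failing.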

ChertseyAl Joined: 17 Jul 09 Posts: 107	Message 54644 - Posted: 29 Jun 2014, 16:55:44 UTC - in response to Message 54636.

It's a very old problem - I have it on an Ubuntu 8.04 machine running BOINC 5.10.45 - As soon as I power down the router it shreds all of the WUs in the cache. Bit of a nuisance really as that machine loses its network connection nearly every day :(

Cheers, Al.

GeneAZ Joined: 28 Jun 14 Posts: 27	Message 54743 - Posted: 5 Jul 2014, 5:02:26 UTC

More testing on this problem: it boils down to this: when network connectivity is lost, the 2nd and subsequent tasks that finish DO NOT honor the project backoff interval. What is the point of having a project backoff if the uploader does not use it?

A snipped/edited stdout log follows, in which one can see a result upload failing (because of network disconnect) and a project backoff of many minutes being set. But then the result upload just keeps on trying at its own 1 or 2 minute backoff. The "05ja..." task was the SECOND task to finish after network loss; the first task (04se...) made its three upload attempts and then does not try anymore, consistent with the project backoff invoked.

10:39:32 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:39:53 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:39:53 [SETI@home] [file_xfer] project-wide xfer delay for 945.028502 sec
10:39:53 [SETI@home] Backing off 00:01:00 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:40:54 [SETI@home] [fxd] starting upload, upload_offset -1
10:40:54 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:41:14 [SETI@home] [file_xfer] file transfer status -113 (can't resolve hostname)
10:41:14 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:41:14 [SETI@home] [file_xfer] project-wide xfer delay for 1618.068131 sec
10:41:14 [SETI@home] Backing off 00:01:58 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:43:12 [SETI@home] [fxd] starting upload, upload_offset -1
10:43:12 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:43:12 [SETI@home] [file_xfer] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
10:43:33 [SETI@home] [file_xfer] file transfer status -113 (can't resolve hostname)
10:43:33 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:43:33 [SETI@home] [file_xfer] project-wide xfer delay for 3642.048836 sec
10:43:33 [SETI@home] Backing off 00:01:00 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0

I see the Trac bug (ticket #113) of 5 years ago regarding the "computation error" and other side effects of a network outage. But that issue seems to me to be an effect of, and not the cause of, uploads being attempted during a project backoff interval.

The above event log excerpt was "snipped" out of the stdoutdae.txt log created with the cc_config logfile options: file_xfer_debug, file_xfer, http_debug, and http_xfer_debug. The 27 kbyte file, for the 19 minutes of interest, is saved should anyone want to see it.
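
For anyone wanting to check a longer stdoutdae.txt for the same pattern, a rough sketch of a log scan follows. It is not a BOINC tool: the function name `uploads_during_backoff` and the regular expressions are assumptions based only on the message formats visible in the excerpt above, and it compares time-of-day stamps only, so a log that crosses midnight would need extra handling.

```python
import re
from datetime import datetime, timedelta

# A few lines copied from the excerpt above, used as sample input.
LOG = """\
10:39:32 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:39:53 [SETI@home] [file_xfer] project-wide xfer delay for 945.028502 sec
10:40:54 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:41:14 [SETI@home] [file_xfer] project-wide xfer delay for 1618.068131 sec
10:43:12 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:43:33 [SETI@home] [file_xfer] project-wide xfer delay for 3642.048836 sec
"""

# Patterns assumed from the excerpt; other client versions may word these differently.
STARTED = re.compile(r"^(\d\d:\d\d:\d\d) \[.*?\] Started upload of (\S+)$")
DELAY = re.compile(
    r"^(\d\d:\d\d:\d\d) \[.*?\] \[file_xfer\] project-wide xfer delay for ([\d.]+) sec$"
)


def uploads_during_backoff(log_text):
    """List (time, file, seconds_left) for uploads started inside a project-wide delay."""
    backoff_until = None
    hits = []
    for line in log_text.splitlines():
        started = STARTED.match(line)
        if started:
            when = datetime.strptime(started.group(1), "%H:%M:%S")
            if backoff_until is not None and when < backoff_until:
                left = int((backoff_until - when).total_seconds())
                hits.append((started.group(1), started.group(2), left))
            continue
        delay = DELAY.match(line)
        if delay:
            when = datetime.strptime(delay.group(1), "%H:%M:%S")
            backoff_until = when + timedelta(seconds=float(delay.group(2)))
    return hits


if __name__ == "__main__":
    for when, name, left in uploads_during_backoff(LOG):
        print(f"{when} upload of {name} started with {left} s of project backoff left")
```

On the sample lines this flags the attempts at 10:40:54 and 10:43:12, both started while many minutes of project-wide delay were still outstanding.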

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5082	Message 54744 - Posted: 5 Jul 2014, 7:28:28 UTC - in response to Message 54743. Last modified: 5 Jul 2014, 7:29:03 UTC

The policy is currently "try every new upload, just once, even during a project backoff".

That was introduced to cover the case where a project has multiple application types, uploading results to different servers, and just one of the servers is down. This allows data to be uploaded to the working servers, even though attempts to upload files to the dead server will result in individual file upload backoffs - which still exist, in addition to the project-wide backoff.

I'd like to see the 'try once' policy kept, please, although I'd obviously like to see the bug which causes tasks to fail on network loss fixed as well.
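
To make the stated policy concrete, here is a minimal model of "try every new upload, just once, even during a project backoff". It is a sketch under assumptions, not the BOINC client's code: the `Upload` class, its fields and the `may_start` function are hypothetical names invented for illustration, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    """Hypothetical record for one pending result upload (not a BOINC type)."""
    name: str
    attempts: int = 0                 # how many times this upload has been tried
    file_backoff_until: float = 0.0   # per-file retry time, in seconds


def may_start(upload, now, project_backoff_until):
    """Decide whether an upload may be attempted at time `now`.

    Models the policy as described in the post above: the per-file backoff
    always applies, and during a project-wide backoff only an upload that
    has never been tried gets its single attempt.
    """
    if now < upload.file_backoff_until:
        return False
    if now < project_backoff_until:
        return upload.attempts == 0
    return True


if __name__ == "__main__":
    second_task = Upload("task #2")
    print(may_start(second_task, now=100, project_backoff_until=1000))  # first try

    # After that attempt fails, a short per-file backoff is set.
    second_task.attempts = 1
    second_task.file_backoff_until = 160
    print(may_start(second_task, now=200, project_backoff_until=1000))  # held back
```

Under this model a retry of the same file is held until the project-wide backoff expires, even though its own short per-file backoff has already run out.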

GeneAZ Joined: 28 Jun 14 Posts: 27	Message 54774 - Posted: 5 Jul 2014, 23:15:13 UTC

Try "just once"? I beg to differ. boinc is not doing that! Looking at the event log (see my previously posted message), the upload of the SAME result is _attempted_ at 10:39:32 and 10:40:54 and 10:43:12, even though a previous result has hit the "upload failure limit" and has already set the project backoff to at least 15 minutes. And if left to continue running, this iteration of: [try upload; fail; set retry between 1 minute and 2 minutes bounds; re-calculate project backoff - but apparently ignore it] continues forever.

I can see the rationale for a "try just once" but boinc isn't doing that. I seem to remember some previous boinc version holding new uploads in a "pending" state if a project backoff is in effect. I like that strategy better.

SuperSluether Joined: 6 Jul 14 Posts: 94	Message 54819 - Posted: 9 Jul 2014, 21:41:05 UTC - in response to Message 54644.

> It's a very old problem - I have it on an Ubuntu 8.04 machine running BOINC 5.10.45 - As soon as I power down the router it shreds all of the WUs in the cache. Bit of a nuisance really as that machine loses its network connection nearly every day :(
> Cheers, Al.

I have problems running tasks on Linux Ubuntu 14 even with a solid internet connection. Maybe it's just a bug in the Linux version...

Grandpa Joined: 20 Jun 14 Posts: 11	Message 54841 - Posted: 11 Jul 2014, 17:43:12 UTC - in response to Message 54819.

I am guessing this is the same problem I am seeing and have reported here: http://aerospaceresearch.net/constellation/forum_thread.php?id=301

It appears that this only affects certain projects; so far I have found that it affects Constellation and Primebotica. I also have Rioja Science, MindMolding, SZTAKI Desktop Grid and EDGeS@Home running at the same time on these machines. The others do not appear to be affected by network loss at this time, but that could also just be luck: they may not have been trying to send during the periods of lost internet connectivity.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.