(All) tasks fail when internet connection is down

dentaku

Joined: 14 Dec 06
Posts: 74
Germany
Message 13565 - Posted: 4 Nov 2007, 17:39:30 UTC
Last modified: 4 Nov 2007, 17:43:54 UTC

When I leave my computer on for a long time (my internet provider disconnects automatically every 24 hours), I notice that some or all of BOINC's tasks have failed.

I have now noticed this several times, and I guess it is a bug in BOINC. If an internet connection is required, BOINC should retry at regular intervals, indefinitely, or suspend with some notification. It should not let tasks fail!
BOINC 7.2.42 (x86_64) on Linux Ubuntu 16.04 (64 Bit), AMD APU 7850K 3.7 GHz, 32 GB RAM.
ID: 13565
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 13567 - Posted: 4 Nov 2007, 17:56:23 UTC - in response to Message 13565.  

That's half a dozen chained problems. I'll see if I can find my original complaint thread, or explain the whole thing again (later).

ID: 13567
dentaku

Joined: 14 Dec 06
Posts: 74
Germany
Message 13578 - Posted: 4 Nov 2007, 22:20:05 UTC

>program BOINC prefs so that it will only seek contact during uptime

And how do I do this?

My config: BOINC 5.10.28 Linux/64
ID: 13578
MikeMarsUK

Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 13581 - Posted: 5 Nov 2007, 10:38:05 UTC
Last modified: 5 Nov 2007, 10:47:49 UTC

Some time ago I lost a buffer-full of workunits myself due to this issue. There are a bunch of inter-related tickets on this problem, ranging from complaints about the message up to the BOINC Manager freezing solid and the workunits all crashing.

http://boinc.berkeley.edu/trac/ticket/171

http://boinc.berkeley.edu/trac/ticket/113

http://boinc.berkeley.edu/trac/ticket/282

(part of 286 is related to this as well)
http://boinc.berkeley.edu/trac/ticket/286

If you read through them and decide which is the most appropriate one for you, it would be good if you could add as many details as possible to the relevant ticket.

It helps to run with 'network activity disabled', set a work buffer of several days, and only re-enable network activity when you are online. Projects with short deadlines don't like that, however.
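
A rough sketch of automating that workaround in Python. It assumes the `boinccmd` command-line tool that ships with later BOINC clients is on the PATH and accepts `--set_network_mode` with `auto` or `never`; older 5.x clients may name or handle this differently. The probe host, port, and all function names are made up for illustration.

```python
import socket
import subprocess

def is_online(host="boinc.berkeley.edu", port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    Note: the timeout covers the connect step; the DNS lookup done
    beforehand may block for longer on a dead link.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def network_mode_command(online):
    """Build the boinccmd call: allow network when online, block it otherwise."""
    mode = "auto" if online else "never"
    return ["boinccmd", "--set_network_mode", mode]

def apply_network_mode(online):
    """Run boinccmd (assumed to be installed) with the chosen mode."""
    subprocess.run(network_mode_command(online), check=True)
```

Calling `apply_network_mode(is_online())` from a periodic job (cron, for instance) would keep the client from attempting connections while the line is down. The short-deadline caveat above still applies.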


-- Edit:

PS I like the new forum layout, in particular the spacing line before the text which makes it much easier to read. I've been manually adding a line for ages...
ID: 13581
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 13604 - Posted: 5 Nov 2007, 17:09:42 UTC - in response to Message 13567.  

>That's half a dozen chained problems. I'll see if I can find my original complaint thread, or explain the whole thing again (later).

OK, here we go...

Defect: BOINC Manager uses blocking I/O to talk to the core client. It sends a request ("get list of workunits") and then it waits for the reply. While it's waiting, the user interface is unresponsive. This usually isn't noticed, unless the core client hangs too (see below for reasons), or you're connecting remotely and have high latency.
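
To illustrate the difference, here is a minimal Python sketch (BOINC itself is written in C++, so this is not the Manager's code). `slow_rpc` is a hypothetical stand-in for a GUI RPC such as "get list of workunits"; running it on a worker thread lets the pretend UI loop keep ticking instead of freezing until the reply arrives.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_rpc(delay):
    """Hypothetical stand-in for a GUI RPC that takes `delay` seconds to answer."""
    time.sleep(delay)
    return ["wu_1", "wu_2"]

def poll_ui_while_waiting(delay, tick=0.01):
    """Issue the RPC on a worker thread and keep the 'UI loop' running meanwhile."""
    ui_ticks = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_rpc, delay)
        while not future.done():
            ui_ticks += 1      # a real manager would repaint and handle events here
            time.sleep(tick)
        return future.result(), ui_ticks

result, ticks = poll_ui_while_waiting(0.2)
print(result, "UI ticks while waiting:", ticks)
```

With a blocking call, the tick counter would stay at zero for the whole wait, which is what the user experiences as a frozen window.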

Recent change: The core client uses synchronous DNS resolving. This was done in an attempt to fix the bug where the client never noticed a project server changing its IP. The side effect is that the core client hangs during DNS resolving. It can't reply to GUI RPCs while it's waiting for a DNS reply, for example. If your Internet connection is down, the client will hang for 30 seconds on every connection attempt!

Defect: The core client doesn't use a separate thread to calculate disk space in use by projects. If a project has lots of files, this could cause noticeable client hangs too. This is in part fixed now: new versions have a faster method of calculating disk space.

Feature: If a science app doesn't get a "heartbeat" from the client often enough, it will quit.

Feature: If a science app quits too many times, the client will abort the workunit.

So what's the problem here... If the client hangs for whatever reason (waiting for DNS or calculating disk space), it can't reply to GUI RPCs (making the manager hang too) nor send application heartbeats (making the apps quit). When the client un-hangs, it will notice the application quit (wonder why!) and restart it; but if this happens many times, the client will abort the workunit.
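
The chain described above can be played through with a toy simulation in Python. The threshold values are invented for illustration and are not BOINC's real settings; only the logic (long hang, missed heartbeat, app quits, too many quits, abort) follows the description.

```python
def simulate(client_hangs, heartbeat_timeout=10, max_restarts=3):
    """Replay a list of client hang durations (seconds) against one task.

    A hang longer than heartbeat_timeout makes the science app quit.
    The client restarts it afterwards, but aborts the task once the app
    has quit more than max_restarts times.
    Both thresholds are illustrative assumptions.
    """
    quits = 0
    for hang in client_hangs:
        if hang > heartbeat_timeout:
            quits += 1
            if quits > max_restarts:
                return "aborted", quits
    return "running", quits

# Four 30-second DNS hangs in a row are enough to lose the task here.
print(simulate([30, 30, 30, 30]))
# Short hangs never trip the heartbeat, so the task survives.
print(simulate([1, 2, 3]))
```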

Solving only one of the above problems can be considered a hack. For example, if the client is made not to hang while waiting for DNS, it doesn't solve the problem of the manager being sluggish remotely, and whenever a new reason for a client hang appears, the work-aborting and hung-manager problems happen again. If the manager is made to use a separate thread for GUI RPCs, a hung client will still cause workunits to be aborted. Both problems need to be solved.

ID: 13604
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 13605 - Posted: 5 Nov 2007, 17:40:05 UTC - in response to Message 13604.  
Last modified: 5 Nov 2007, 17:40:37 UTC

>That's half a dozen chained problems. I'll see if I can find my original complaint thread, or explain the whole thing again (later).

>OK, here we go...


This is incomplete, since just now I remembered another reason for the core client lagging or hanging: too many workunits or file transfers in queue.
ID: 13605
Mr. Kevvy
Joined: 6 Nov 07
Posts: 37
Canada
Message 13612 - Posted: 6 Nov 2007, 1:25:57 UTC

BOINC 5.10.28 (latest) on XP Home Edition SP2.

Project is SHA-1 Collision Search Graz. The project was down so I suspended network connection under Activity -> Network activity suspended.

When I reconnected, the project was still down, and BOINC started eating all the queued completed workunits. Example:

05/11/2007 1:31:47 PM|SHA-1 Collision Search Graz|Started upload of wu_sha1collisionsearchgraz_v53_1194094206_729_0_0
05/11/2007 1:31:48 PM|SHA-1 Collision Search Graz|Giving up on upload of wu_sha1collisionsearchgraz_v53_1194094206_729_0_0: file not found


It ate over a hundred of them like this. I checked in the BOINC\projects\boinc.iaik.tugraz.at_sha1_coll_search folder and there are completed workunits there. I quit BOINC and rebooted (power outage took care of that part for me) and watched the BOINC client start crunching queued SHA-1 workunits that had not yet been touched, and complete them and eat them the same way.

05/11/2007 2:03:55 PM|SHA-1 Collision Search Graz|Starting wu_sha1collisionsearchgraz_v53_1194099750_174_0
05/11/2007 2:32:39 PM|SHA-1 Collision Search Graz|Started upload of wu_sha1collisionsearchgraz_v53_1194099750_174_0_0
05/11/2007 2:32:40 PM|SHA-1 Collision Search Graz|Giving up on upload of wu_sha1collisionsearchgraz_v53_1194099750_174_0_0: file not found
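
For anyone wanting to count how many results were lost this way, a small Python sketch that scans message-log text in the `timestamp|project|message` layout shown above. The function name is made up, and the sample is just the first two log lines from this post.

```python
from collections import Counter

LOG = """\
05/11/2007 1:31:47 PM|SHA-1 Collision Search Graz|Started upload of wu_sha1collisionsearchgraz_v53_1194094206_729_0_0
05/11/2007 1:31:48 PM|SHA-1 Collision Search Graz|Giving up on upload of wu_sha1collisionsearchgraz_v53_1194094206_729_0_0: file not found
"""

def abandoned_uploads(log_text):
    """Count 'Giving up on upload' messages per project."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue               # skip lines that are not in the expected layout
        _timestamp, project, message = parts
        if message.startswith("Giving up on upload of "):
            counts[project] += 1
    return counts

print(abandoned_uploads(LOG))
```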


I've been running BOINC on several machines since the day it was released and I've never seen anything like this. Current system is top-of-the-line (QC6600 quad core, 2GB DDR2, WD SATA Raptor, recent install of Windows, no malware) so I doubt it's anything but BOINC freaking out.
ID: 13612
Keck_Komputers
Joined: 29 Aug 05
Posts: 304
United States
Message 13615 - Posted: 6 Nov 2007, 6:25:54 UTC

I have seen a case where, if one of the application files does not download properly (incomplete or corrupt), all tasks error out. A project reset to start a fresh download will usually fix it. I thought that might be what is happening here; however, the log snippets show a problem on upload, which makes me think it may be a server problem.
BOINC WIKI

BOINCing since 2002/12/8
ID: 13615
Mr. Kevvy
Joined: 6 Nov 07
Posts: 37
Canada
Message 13619 - Posted: 6 Nov 2007, 14:08:58 UTC - in response to Message 13615.  

>I have seen a case where when one of the application files does not download properly (incomplete/corrupt) all tasks would error out. A project reset to start a fresh download will usually fix it. I thought that may be what is happening here, however the log snips show a problem on upload which makes me think it may be a server problem.


Thanks for the reply. There hasn't been an application update for two months on this project, and the existing version has crunched a few thousand workunits. I tried the remaining queued workunits without touching the "Network activity suspended" option, and they now remain as queued uploads (and keep retrying) instead of being deleted. This seems like a bug with this option. Another project (SIMAP) is also working fine without any problems like this.
ID: 13619
MikeMarsUK

Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 13665 - Posted: 8 Nov 2007, 9:20:48 UTC


Sounds like a different problem to the ones mentioned earlier in the thread.

There is a two-week timeout for uploads to the server. It sounds as if the BOINC client thought that this limit was up for some reason?
ID: 13665

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.