Message boards : BOINC client : 5.8.16 project ended, but no finish file.
Author | Message |
---|---|
Send message Joined: 11 Aug 06 Posts: 154 |
This has happened several times, and the messages suggest resetting the project if it happens continuously. I have switched back to 5.8.15 to see whether it still happens, and will let you know. It happened on BBC Climate Change several times. The project is not finished according to the percentage completed. When it does happen the project continues to work. It never switches to another project. |
Send message Joined: 16 Apr 06 Posts: 386 |
This message can be caused by many things, and it's usually misleading (don't reset the project unless you have a looping 5.08). The most common cause is that whenever the system clock is automatically synchronised with internet time, if it goes back by even a few milliseconds, the core Boinc client gets confused and shuts everything down for the fraction of a second it takes to catch up with the new time. It can also be caused by network glitches confusing Boinc (as an aside, a current development proposal for 5.9/6.0 is to abort work units which receive too many of these since the last checkpoint. This potentially means that if you have a network glitch, a future version of Boinc will kill all your work units, even if you've been working on them for months). On my own PCs, I have changed the frequency of the time sync so it only happens once per week (otherwise Boinc can be wasting an awful lot of CPU time). http://bbc.cpdn.org/forum_thread.php?id=1573&nowrap=true#12452 http://www.climateprediction.net/board/viewtopic.php?p=50303#50303 http://boinc-wiki.ath.cx/index.php?title=Result_%27%28result%29%27_exited_with_zero_status_but_no_%27finished%27_file (Ageless, is this in the FAQ? I don't see it) |
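The clock-step problem described in the post above can be illustrated with a small sketch. This is not BOINC's code; `ElapsedTimer` and `make_fake_clock` are hypothetical names used only for the demonstration. It shows why an interval measured with the wall clock can come out negative after a time sync, while one measured with a monotonic clock (Python's `time.monotonic`) cannot.

```python
import time


class ElapsedTimer:
    """Measures elapsed time using whatever clock function it is given."""

    def __init__(self, clock=time.monotonic):
        # `clock` is injectable so the effect can be shown with a fake clock.
        self._clock = clock
        self._start = clock()

    def elapsed(self):
        return self._clock() - self._start


def make_fake_clock(readings):
    """Return a clock function that yields the given readings in order."""
    it = iter(readings)
    return lambda: next(it)


# Wall clock stepped back by half a second between the two readings:
wall = ElapsedTimer(clock=make_fake_clock([100.0, 99.5]))
wall_elapsed = wall.elapsed()   # negative interval

# A monotonic clock never decreases, so the interval stays positive:
mono = ElapsedTimer(clock=make_fake_clock([100.0, 100.2]))
mono_elapsed = mono.elapsed()
```

Code that treats a negative interval as "something went badly wrong" will misfire in the first case, which matches the behaviour reported here when the clock goes back by even a few milliseconds.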
Send message Joined: 19 Jan 07 Posts: 1179 |
It can also be caused by network glitches confusing Boinc (as an aside, a current development proposal for 5.9/6.0 is to abort work units which receive too many of these since the last checkpoint. This potentially means that if you have a network glitch, a future version of Boinc will kill all your work units, even if you've been working on them for months). Your "complaint" between brackets isn't quite correct. If the app is reset multiple times without any checkpoint, you're wasting CPU time computing the same piece of data again and again. So aborting workunits that reset continuously without checkpointing is actually a good idea. Could you tell me what "network glitch" can cause this problem? |
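The policy being debated here (abort a task that keeps restarting without ever reaching a checkpoint) could be sketched roughly as follows. The thread does not give the details of the proposal, so the class name, method names and the threshold of 3 are all assumptions made for illustration.

```python
class TaskRestartTracker:
    """Counts restarts since the last checkpoint (hypothetical policy sketch)."""

    def __init__(self, max_restarts_without_checkpoint=3):
        # Assumed threshold; the actual proposal's value is not stated in the thread.
        self.max_restarts = max_restarts_without_checkpoint
        self.restarts_since_checkpoint = 0

    def on_checkpoint(self):
        # Progress was saved, so earlier restarts no longer count.
        self.restarts_since_checkpoint = 0

    def on_zero_exit(self):
        """Record a restart; return True if the task should be aborted."""
        self.restarts_since_checkpoint += 1
        return self.restarts_since_checkpoint > self.max_restarts
```

Note that the counter cannot tell why the task exited. A task that restarts because of a clock step or a network problem is counted exactly like one that is genuinely looping, which is the concern raised in the replies.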
Send message Joined: 16 Apr 06 Posts: 386 |
The Wiki mentions that problems with DNS will cause zero exits. AstroWX (I ???think??? - can't find the post now) observed a similar thing recently (a large number of zero-exits within minutes while he was having router problems, which stopped happening once the router was fixed). As far as I can understand it, it's the boinc core client, not the science app, which is causing the zero-exits in the case of the CMOS clock being reset backwards? (If there are 5 suspended + active tasks from different projects, then every one of them, including suspended ones, is shut down with a zero exit when the time goes back.) To me that's the Boinc framework, rather than a problem with the science app? Overall, where it's possible to pin down where particular zero-exits have been triggered, more often than not it seems to be the Boinc framework, albeit sometimes it's definitely the science app or CPU starvation. ...So aborting workunits that reset continuously without checkpointing is actually a good idea. This is true, but only if it's the science app which is causing the zero exits, and not the Boinc framework. In the case of CPDN, zero-exits caused by the science app rather than Boinc appear to be rare. -- Edit: It was Arnaud25 - I hope he doesn't mind me quoting from his post because it was in the moderators' hidden forum. "Arnaud25" wrote:
|
Send message Joined: 16 Apr 06 Posts: 386 |
I should have added : ...only if it's the science app which is causing the zero exits, and not the Boinc framework or transient environmental issues. Science apps which loop and consume CPU are potentially a problem, but there is already a mechanism to deal with this (the CPU time limit). Aborting long-duration tasks by mistake would be a bigger problem IMHO. |
Send message Joined: 29 Aug 05 Posts: 15573 |
(Ageless, is this in the FAQ? I don't see it) It's on the TODO list. All CPDN ones are in the capable hands of mo.v when her new computer is in. :-) |
Send message Joined: 15 Mar 07 Posts: 20 |
As Mike stated, some of this has to do with clock operations. Since Windows by default has NNTP (a time protocol that allows your clock to be adjusted automatically to the server it talks to), your clock can change forward or backward depending on the interval of this. Plus some processes "pause" the clock for a period of time, because of the intense processing power it needs (virus scanning, disk defragmenting, etc). Since BOINC writes things every so often (and I think actually changing this to a higher number on projects might slow these errors down), it sometimes will write a file at an interval that it does not think it understands. This really went nuts on a couple of versions, because of some bad code, but it also happened to many on the DST change. It's mostly a warning. I am sure they are working on adjusting things to fix these even more. I personally have not seen this error on my machines, but I do not look hard to see if I am getting it. It may happen, but because I know it's not hurting anything, I just leave it be. |
Send message Joined: 16 Apr 06 Posts: 386 |
...Since BOINC writes things every so often (and I think actually changing this to a higher number on projects might slow these errors down), it sometimes will write a file at an interval that it does not think it understands. This really went nuts on a couple of versions, because of some bad code, but it also happened to many on the DST change. Is this the "Write to disk at most every: 60 seconds" setting in General Preferences? Does it affect client_state.xml? I've been worried about hard disk wear due to the number of writes to that file (I might just be being paranoid of course, I'm good at that). |
Send message Joined: 30 Oct 05 Posts: 1239 |
I thought that was just for checkpointing. But I could be wrong (I'm good at that). Kathryn :o) |
Send message Joined: 19 Jan 07 Posts: 1179 |
Since Windows by default has NNTP (a time protocol that allows your clock to be adjusted automatically to the server it talks to), your clock can change forward or backward depending on the interval of this. A little correction: it's NTP :) NTP: Network Time Protocol NNTP: Network News Transfer Protocol |
Send message Joined: 19 Jan 07 Posts: 1179 |
I thought that was just for checkpointing. But I could be wrong (I'm good at that). I also think it's only for checkpointing, and some projects don't even follow it correctly (like TMRL DRTG; I have seen their code, and they don't really have a way to make it follow the setting). I'll have to look at the code for client_state file saving to see if that is affected by the setting too. |
Send message Joined: 16 Apr 06 Posts: 386 |
I'm sure that none of the climate projects check that setting for checkpoints either (the checkpoints can only happen once per model day at most). If someone could resolve that time sync problem it would save a huge amount of CPU time in the long run :-) |
Send message Joined: 19 Jan 07 Posts: 1179 |
The setting is how much time *at most*. Here's the logic science apps should use: whenever the app is at a good moment to checkpoint: if (boinc client says it's OK to checkpoint now) { checkpoint. tell the boinc client the checkpoint was done. } If checkpoint time is set to 60 seconds, most probably every time the climate model "asks boinc if it's time to checkpoint", BOINC would reply yes (as CPDN takes more than 60 seconds between checkpoints). So you usually don't notice it unless time between checkpoints is set to 30 minutes, but the project probably still follows the setting ;) |
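The checkpoint logic described in the post above can be written out as a small runnable simulation. In the real BOINC API the two calls are, as far as I know, `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()`; the `FakeClient` class below is a hypothetical stand-in for the core client, and the 900-second spacing between safe points is an invented figure.

```python
class FakeClient:
    """Hypothetical stand-in for the core client's checkpoint policy."""

    def __init__(self, min_interval):
        # The "write to disk at most every N seconds" preference.
        self.min_interval = min_interval
        self.last_checkpoint = 0.0

    def time_to_checkpoint(self, now):
        return now - self.last_checkpoint >= self.min_interval

    def checkpoint_completed(self, now):
        self.last_checkpoint = now


def run_model(client, safe_points):
    """Return the times at which a checkpoint was actually written."""
    written = []
    for now in safe_points:              # moments when the app could checkpoint
        if client.time_to_checkpoint(now):
            written.append(now)          # the app would write its state here
            client.checkpoint_completed(now)
    return written


# An app that reaches a safe point every 900 s (assumed figure):
safe_points = [900.0 * i for i in range(1, 9)]
every_time = run_model(FakeClient(60), safe_points)      # all 8 safe points
half_hourly = run_model(FakeClient(1800), safe_points)   # every second one
```

With the preference at 60 seconds the client says yes at every safe point, so the setting appears to have no effect. With 30 minutes, only every second safe point results in a checkpoint.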
Send message Joined: 19 Jan 07 Posts: 1179 |
There is a network-related problem that can cause this. BOINC recently switched to using synchronous DNS resolving, in an attempt to work around a DNS cache bug. That means the core client can't do anything while it's waiting for the DNS to respond; it's essentially hung until it gets a reply. If the DNS server is not replying, for example because your internet connection has problems, it takes a relatively long time (say 30 seconds) for it to finally give up. During this period, the science app can't communicate with the core client (as the core client is hung, it can't reply). It may quit with the error "No heartbeat from core client for 30 seconds, exiting". When the core client finally gets either a reply from the DNS server or a timeout, and is able to do other things again, it notices that the science applications have suddenly disappeared. So it gives the error "Task [name] exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." That's the part where the clueless user follows instructions, resets the project, and makes the project lose a climate model, all because of a slow or non-working Internet connection! Another problem this DNS thing causes is an unresponsive manager. BOINC Manager has always used blocking I/O for GUI RPCs. That means the BOINC Manager can't do anything while it's waiting for the core client to respond; it's essentially hung until it gets a reply. If the core client is hung waiting for DNS, it can't respond to the manager, so the manager can't respond to mouse clicks. It all ends in a completely unresponsive GUI, all because of a slow or non-working Internet connection! Summary: a chain of nasty events. To solve everything I point out in this message, a lot of fixes would be needed. |
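The chain of events in the post above can be reproduced with a toy simulation. It is a sketch only: the one-heartbeat-per-second rate, the function name and the event labels are assumptions, and the 30-second timeout is taken from the error message quoted in the post.

```python
HEARTBEAT_TIMEOUT = 30  # seconds, from the quoted "No heartbeat" message


def simulate(dns_block_start, dns_block_length, duration):
    """Step through `duration` seconds and return (second, event) tuples.

    The client delivers one heartbeat per second unless it is blocked in a
    synchronous DNS lookup. The app exits after HEARTBEAT_TIMEOUT seconds
    without a heartbeat; the client notices only once it is unblocked.
    """
    events = []
    last_heartbeat = 0
    app_running = True
    reported = False
    block_end = dns_block_start + dns_block_length
    for t in range(1, duration + 1):
        client_blocked = dns_block_start <= t < block_end
        if not client_blocked:
            if app_running:
                last_heartbeat = t                  # heartbeat delivered
            elif not reported:
                events.append((t, "client_reports_zero_exit"))
                reported = True
        if app_running and t - last_heartbeat >= HEARTBEAT_TIMEOUT:
            events.append((t, "app_exits_no_heartbeat"))
            app_running = False
    return events


long_block = simulate(dns_block_start=10, dns_block_length=35, duration=60)
short_block = simulate(dns_block_start=10, dns_block_length=20, duration=60)
```

A 35-second DNS stall makes the app give up at second 39 and the client report the zero exit at second 45, once it is responsive again. A 20-second stall produces no events at all, which is why the problem only shows up when the lookup takes a long time to fail.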
Send message Joined: 29 Aug 05 Posts: 15573 |
Client_state.xml is written to multiple times a second, regardless of what your write-to-disk preference is set to. Don't worry about disk writes though. If your disk can't take many disk writes, it's not a good disk. And it won't reach the number of writes a database disk gets per second. |
Send message Joined: 11 Aug 06 Posts: 154 |
I found that the same thing happens under 5.8.15, so this leads me to think that it is not a Boinc problem. After reading all of the posts here, I tried a couple of things. First, I ran Boinc at the same time that a Norton Antivirus scan was running. This caused it to happen again. Secondly, I suspended everything except CPDN and climateprediction. This also caused it to happen. Thirdly, I ran defrag on my hard drive and rebooted. Presently I am running full bore with all projects and not having a problem. Time will tell. Thanks to everyone for your help. |
Send message Joined: 16 Apr 06 Posts: 386 |
... A good and detailed analysis, thanks :-) Hopefully some of these issues can be dealt with in future versions of the Boinc manager. It would be nice to use async calls, but experience shows that async calls are a source of bugs in their own right due to the more complex code required. |
Send message Joined: 16 Apr 06 Posts: 386 |
... Not a problem for either of my PCs (supposedly 24/7 raid quality drives), but we worry about these things over at CPDN particularly in respect of laptops (which tend to be designed for intermittent use rather than continual use). The other aspect of laptop drives is that they're designed to spin down after a period to conserve electricity, but the client_state.xml keeps them continually active. The current version of the climate model does continually read and write huge amounts to disk, but Carl is working on a reduced I/O version which drops this by 95% at the cost of a higher memory footprint. The main remaining thing which keeps the disk busy is the client_state.xml. |
Send message Joined: 16 Apr 06 Posts: 386 |
... Hope everything continues to work OK :-) We have some 'generic' advice over at CPDN/BBC CCE which may help avoid future problems: * Add the Boinc directory to Norton AV's two 'exclusion' lists, so that it gets skipped during the scan. Sometimes it locks files which Boinc / CPDN needs, and causes it to crash (rare, but worth avoiding). * Before running defrag, exit out of Boinc (defrag will only defrag files if they're not in use, and the climate model's files get very highly fragmented quite quickly). * Take a backup (simply a copy of the entire Boinc directory) at intervals when Boinc is shut down. I do this roughly weekly after doing the defrag. In the event of a crash this can be used to continue running the climate model. * Before running games, doing video encoding, or anything else which uses the graphics intensively, 'suspend' Boinc so that if the game mangles the graphics card drivers it doesn't cause Boinc to crash. There's been talk about Boinc moving the graphics processing to its own thread in the future; this would be very useful, particularly in avoiding graphics-related crashes. |
Send message Joined: 19 Jan 07 Posts: 1179 |
... Just now I'm dealing with a BOINC Mgr on a computer with a broken internet connection. I think it hangs on each attempt to connect to a project, but I can't even make it stop trying to connect, because I can't select "Disable network activity". As soon as it starts responding again, it does another RPC and hangs again! Also, the CPU WAS IDLE, probably because the apps kept exiting while the hung core client was waiting for DNS. Who knows for how many hours it has been like that. |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.