5.8.16 project ended, but no finish file.

idahofisherman
Joined: 11 Aug 06
Posts: 154
United States
Message 8776 - Posted: 15 Mar 2007, 22:46:23 UTC

This has happened several times, and the message suggests resetting the project if it happens repeatedly. I have switched back to 5.8.15 to see whether it still happens and will let you know. It has happened several times on BBC Climate Change. The project is not finished according to the percentage completed. When it does happen, the project continues to work; it never switches to another project.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8780 - Posted: 16 Mar 2007, 1:42:30 UTC
Last modified: 16 Mar 2007, 1:48:26 UTC

This message can be caused by many things, and it's usually misleading (don't reset the project unless you have a looping 5.08). The most common cause is that whenever the system clock is automatically synchronised with internet time, if it goes back by even a few milliseconds, the core Boinc client gets confused and shuts everything down for the fraction of a second it takes to catch up with the new time.
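
As a rough illustration of the clock issue described above (this is not BOINC's actual code), a program can notice that the wall clock was stepped by comparing it with a monotonic clock, which is not affected by time synchronisation. The function name and the sample values are made up for this sketch.

```python
import time

def wall_clock_step(prev_wall, prev_mono, now_wall, now_mono):
    """Return how far the wall clock was stepped (seconds) between two samples.

    A negative result means the wall clock was set backwards. The monotonic
    clock is the reference because time synchronisation does not change it.
    """
    wall_elapsed = now_wall - prev_wall
    mono_elapsed = now_mono - prev_mono
    return wall_elapsed - mono_elapsed

# Simulated samples: 10 s really passed, but the wall clock advanced only 9.5 s,
# i.e. a time sync moved it back by half a second in between.
step = wall_clock_step(prev_wall=1000.0, prev_mono=50.0,
                       now_wall=1009.5, now_mono=60.0)
print(step)

# Real clocks would be sampled as a pair like this:
sample = (time.time(), time.monotonic())
```

Code that measures intervals with the monotonic clock does not see the step at all, which is one way to avoid being confused by a sync.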

It can also be caused by network glitches confusing Boinc (as an aside, a current development proposal for 5.9/6.0 is to abort work units which receive too many of these since the last checkpoint. This potentially means that if you have a network glitch, a future version of Boinc will kill all your work units, even if you've been working on them for months).

On my own PCs, I have changed the frequency of the time sync so it only happens once per week (otherwise Boinc can be wasting an awful lot of CPU time).

http://bbc.cpdn.org/forum_thread.php?id=1573&nowrap=true#12452

http://www.climateprediction.net/board/viewtopic.php?p=50303#50303

http://boinc-wiki.ath.cx/index.php?title=Result_%27%28result%29%27_exited_with_zero_status_but_no_%27finished%27_file

(Ageless, is this in the FAQ? I don't see it)

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8781 - Posted: 16 Mar 2007, 2:20:28 UTC - in response to Message 8780.  

"MikeMarsUK" wrote:

It can also be caused by network glitches confusing Boinc (as an aside, a current development proposal for 5.9/6.0 is to abort work units which receive too many of these since the last checkpoint. This potentially means that if you have a network glitch, a future version of Boinc will kill all your work units, even if you've been working on them for months).

Your "complaint" between brackets isn't quite correct. If the app is reset multiple times without any checkpoint, you're wasting CPU time computing the same piece of data again and again. So aborting workunits that reset continuously without checkpointing is actually a good idea.
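
To make the idea concrete, here is a minimal sketch of counting exits since the last checkpoint. It is not taken from the BOINC source; the class name, method names and the default threshold of 10 are assumptions for illustration only.

```python
class ZeroExitTracker:
    """Counts exits without a 'finished' file since the last checkpoint."""

    def __init__(self, max_exits_without_checkpoint=10):
        # Hypothetical threshold; the real value is a policy decision.
        self.max_exits = max_exits_without_checkpoint
        self.exits_since_checkpoint = 0

    def on_checkpoint(self):
        # Progress was saved, so earlier restarts did not waste all their work.
        self.exits_since_checkpoint = 0

    def on_zero_exit(self):
        """Record one exit; return True if the task should be aborted."""
        self.exits_since_checkpoint += 1
        return self.exits_since_checkpoint >= self.max_exits

tracker = ZeroExitTracker(max_exits_without_checkpoint=3)
results = [tracker.on_zero_exit(), tracker.on_zero_exit()]
tracker.on_checkpoint()          # a checkpoint resets the count
results.append(tracker.on_zero_exit())
results.append(tracker.on_zero_exit())
results.append(tracker.on_zero_exit())
print(results)
```

Note that a counter like this cannot tell who caused the exit; it only sees that the app went away without checkpointing.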

Could you tell me what "network glitch" can cause this problem?

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8784 - Posted: 16 Mar 2007, 8:46:41 UTC
Last modified: 16 Mar 2007, 8:59:11 UTC

The Wiki mentions that problems with DNS will cause zero exits. AstroWX (I ???think??? - can't find the post now) observed a similar thing recently (a large number of zero-exits within minutes while he was having router problems, stopped happening once the router was fixed).

As far as I can understand it, it's the Boinc core client, not the science app, which is causing the zero-exits in the case of the CMOS clock being reset backwards? (If there are 5 suspended + active tasks from different projects, then every one of them, including the suspended ones, is shut down with a zero exit when the time goes back.) To me that's the Boinc framework, rather than a problem with the science app?

Overall, where it's possible to pin down where particular zero-exits have been triggered, more often than not it seems to be the Boinc framework, albeit sometimes it's definitely the science app or CPU starvation.

"Nicolas" wrote:

...So aborting workunits that reset continuously without checkpointing is actually a good idea.


This is true, but only if it's the science app which is causing the zero exits, and not the Boinc framework. In the case of CPDN, zero-exits caused by the science app rather than Boinc appear to be rare.

-- Edit:

It was Arnaud25 - I hope he doesn't mind me quoting from his post, because it was in the moderators' hidden forum.

"Arnaud25" wrote:

"MikeMarsUK" wrote:

Carl, the dev mailing list mentions 10 'zero exits' before aborting the model - I've seen multiple zero exits caused by a game taking 100% of CPU time (albeit some time ago, 5.4.x and 5.08). If this still happens then it could cause models to be shut down? Any way to persuade them to detect 100% higher-priority CPU usage from external programmes and use this to stop resuming the WUs until the % drops?

Are you talking about the "Task xxx exited with zero status but no 'finished' file" message?
I've had 83 messages like that due to Internet instability a few days ago and my model is still running. It would be a really bad idea to abort the model after only 10 messages, IMHO.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8787 - Posted: 16 Mar 2007, 11:01:01 UTC


I should have added :

...only if it's the science app which is causing the zero exits, and not the Boinc framework or transient environmental issues.

Science apps which loop and consume CPU are potentially a problem, but there is already a mechanism to deal with this (the CPU time limit). Aborting long-duration tasks by mistake would be a bigger problem IMHO.


Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 8788 - Posted: 16 Mar 2007, 11:09:53 UTC - in response to Message 8780.  
Last modified: 16 Mar 2007, 12:41:18 UTC

"MikeMarsUK" wrote:

(Ageless, is this in the FAQ? I don't see it)

It's on the TODO list. All CPDN ones are in the capable hands of mo.v when her new computer is in. :-)

Pooh Bear 27
Joined: 15 Mar 07
Posts: 20
United States
Message 8798 - Posted: 16 Mar 2007, 16:54:53 UTC

As Mike stated, some of this has to do with clock operations.

Since Windows by default has NNTP (a time protocol that allows your clock to be adjusted automatically to the server it talks to), your clock can change forward or backward depending on the interval of this. Plus some processes "pause" the clock for a period of time, because of the intense processing power they need (virus scanning, disk defragmenting, etc.).

Since BOINC writes things every so often (and I think actually changing this to a higher number on projects might slow these errors down), it sometimes will write a file at an interval that it does not think it understands. This really went nuts on a couple of versions, because of some bad code, but it also happened to many on the DST change.

It's mostly a warning. I am sure they are working on adjusting things to fix these even more. I personally have not seen this error on my machines, but I do not look hard to see if I am getting it. It may happen, but because I know it's not hurting anything, I just leave it be.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8799 - Posted: 16 Mar 2007, 18:16:05 UTC - in response to Message 8798.  
Last modified: 16 Mar 2007, 18:16:49 UTC

"Pooh Bear 27" wrote:

...Since BOINC writes things every so often (and I think actually changing this to a higher number on projects might slow these errors down), it sometimes will write a file at an interval that it does not think it understands. This really went nuts on a couple of versions, because of some bad code, but it also happened to many on the DST change.
...


Is this the "Write to disk at most every: 60 seconds" setting in General Preferences?

Does it affect client_state.xml? I've been worried about hard disk wear due to the number of writes to that file (I might just be being paranoid of course, I'm good at that).

KSMarksPsych
Joined: 30 Oct 05
Posts: 1239
United States
Message 8804 - Posted: 16 Mar 2007, 19:48:30 UTC

I thought that was just for checkpointing. But I could be wrong (I'm good at that).
Kathryn :o)

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8808 - Posted: 16 Mar 2007, 20:48:07 UTC - in response to Message 8798.  

"Pooh Bear 27" wrote:

Since Windows by default has NNTP (a time protocol that allows your clock to be adjusted automatically to the server it talks to), your clock can change forward or backward depending on the interval of this.

A little correction: it's NTP :)

NTP: Network Time Protocol
NNTP: Network News Transfer Protocol

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8810 - Posted: 16 Mar 2007, 20:51:11 UTC - in response to Message 8804.  

"KSMarksPsych" wrote:

I thought that was just for checkpointing. But I could be wrong (I'm good at that).

I also think it's only for checkpointing, and even some projects don't follow it correctly (like TMRL DRTG, I have seen their code; they don't really have a way to make it follow the setting though). I'll have to look at the code for client_state file saving to see if that is affected by the setting too.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8816 - Posted: 16 Mar 2007, 22:18:35 UTC


I'm sure that none of the climate projects check that setting for checkpoints either (the checkpoints can only happen once per model day at most).

If someone could resolve that time sync problem it would save a huge amount of CPU time in the long run :-)

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8818 - Posted: 17 Mar 2007, 2:36:44 UTC
Last modified: 17 Mar 2007, 2:40:39 UTC

The setting is an upper limit on frequency: checkpoint *at most* once per interval. Here's the logic science apps should use:

whenever the app is in a good moment to checkpoint:

if (boinc client says it's OK to checkpoint now) {
    checkpoint.
    tell the boinc client the checkpoint was done.
}


If checkpoint time is set to 60 seconds, most probably every time the climate model "asks boinc if it's time to checkpoint", BOINC would reply yes (as CPDN takes more than 60 seconds between checkpoints). So you usually don't notice it unless time between checkpoints is set to 30 minutes, but the project probably still follows the setting ;)
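
The effect can be shown with a small simulation. This is only a sketch in Python, not the BOINC API itself (as far as I know the real calls are `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()`); the `FakeClient` class, the `run` function and the 600-second app interval are made up for the example.

```python
class FakeClient:
    """Stand-in for the BOINC client's checkpoint policy (hypothetical class)."""

    def __init__(self, min_interval):
        self.min_interval = min_interval   # the "write to disk at most every" value
        self.last_checkpoint = 0

    def time_to_checkpoint(self, now):
        return now - self.last_checkpoint >= self.min_interval

    def checkpoint_completed(self, now):
        self.last_checkpoint = now


def run(app_interval, min_interval, total_time):
    """Return the times at which the app actually writes a checkpoint."""
    client = FakeClient(min_interval)
    written = []
    now = app_interval
    while now <= total_time:
        # The app has reached a good moment to checkpoint.
        if client.time_to_checkpoint(now):
            written.append(now)            # the checkpoint would be written here
            client.checkpoint_completed(now)
        now += app_interval
    return written

# The app is able to checkpoint every 600 s of run time.
every_minute = run(app_interval=600, min_interval=60, total_time=3600)
every_half_hour = run(app_interval=600, min_interval=1800, total_time=3600)
print(every_minute)
print(every_half_hour)
```

With the 60-second setting every opportunity is taken; with 30 minutes only every third one is, which is when the setting becomes visible.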

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8819 - Posted: 17 Mar 2007, 3:02:33 UTC
Last modified: 17 Mar 2007, 3:09:59 UTC

There is a network-related problem that can cause this.

BOINC recently switched to using synchronous DNS resolving, in an attempt to work around a DNS cache bug. That means the core client can't do anything while it's waiting for the DNS server to respond; it's essentially hung until it gets a reply. If the DNS server is not replying (for example, if your internet connection has problems), it takes a relatively long time (say 30 seconds) for the lookup to finally give up. During this period, the science app can't communicate with the core client (as the core client is hung, it can't reply). The app may quit with the error "No heartbeat from core client for 30 seconds, exiting".

When the core client finally gets either a reply from the DNS server or a timeout, and starts being able to do other things, it notices the science applications have suddenly disappeared. So it gives the error "Task [name] exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." That's the part where the clueless user follows instructions, resets the project, and makes the project lose a climate model, all because of a slow or non-working Internet connection!
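
The timing can be sketched like this. It is a simplified model, not the real BOINC code: the function name is invented, and the one-heartbeat-per-second rate is an assumption; only the 30-second limit comes from the error message.

```python
def app_exit_time(heartbeat_times, timeout=30.0, start=0.0, end=120.0):
    """Return the time at which the app gives up, or None if it never does.

    heartbeat_times: sorted times (seconds) at which the client sent a heartbeat.
    The app exits once `timeout` seconds pass without receiving one.
    """
    last_seen = start
    for t in heartbeat_times:
        if t - last_seen > timeout:
            return last_seen + timeout
        last_seen = t
    if end - last_seen > timeout:
        return last_seen + timeout
    return None

# The client sends a heartbeat every second, then blocks on DNS from t=10 to t=45.
long_stall = [float(t) for t in range(1, 11)] + [float(t) for t in range(45, 121)]
print(app_exit_time(long_stall))

# A short 5-second stall is survived.
short_stall = [float(t) for t in range(1, 11)] + [float(t) for t in range(15, 121)]
print(app_exit_time(short_stall))
```

In the first case the app is already gone by the time the client wakes up at t=45, which is when the "no 'finished' file" message would appear.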

Another problem this DNS thing causes is an unresponsive manager. BOINC Manager has always used blocking I/O for GUI RPCs. That means the BOINC Manager can't do anything while it's waiting for the core client to respond; it's essentially hung until it gets a reply. If the core client is hung waiting for DNS, it can't respond to the manager, so the manager can't respond to mouse clicks. It all ends in a completely unresponsive GUI, all because of a slow or non-working Internet connection!
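
For comparison, here is a minimal sketch of the non-blocking alternative: the slow call runs in a worker thread while the main loop keeps servicing the interface. It is written in Python and has nothing to do with the actual BOINC Manager code; `slow_lookup` is a stand-in that just sleeps, and all names and timings are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_lookup(hostname, delay):
    """Stand-in for a DNS lookup that takes `delay` seconds to answer."""
    time.sleep(delay)
    return "192.0.2.1"   # documentation-only address, not a real result

def poll_lookup(future, handle_ui_events, poll_interval=0.05):
    """Wait for `future` while still servicing the user interface."""
    ticks = 0
    while True:
        try:
            return future.result(timeout=poll_interval), ticks
        except TimeoutError:
            handle_ui_events()   # the GUI stays responsive between polls
            ticks += 1

ui_events = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_lookup, "example.org", 0.3)
    address, ticks = poll_lookup(future, lambda: ui_events.append("tick"))
print(address, ticks > 0)
```

The price is the extra complexity of handing results back to the main loop, which a blocking call avoids.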

Summary: A chain of nasty events. To solve everything I point out on this message, a big lot of fixes would be needed.

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 8820 - Posted: 17 Mar 2007, 3:04:56 UTC

Client_state.xml is written to multiple times a second, regardless of what your write to disk is set to.

Don't worry about disk writes though. If your disk can't take many disk writes, it's not a good disk. And it won't reach the amount of writes a database disk gets a second.

idahofisherman
Joined: 11 Aug 06
Posts: 154
United States
Message 8831 - Posted: 17 Mar 2007, 14:27:10 UTC

I found that the same thing happens under 5.8.15, so this leads me to think that it is not a Boinc problem. After reading all of the threads here, I tried a couple of things.

First I ran BOINC at the same time that a Norton Antivirus scan was running. This caused it to happen again.

Secondly, I suspended everything except CPDN and climateprediction. This also caused it to happen.

Thirdly, I ran defrag on my hard drive and rebooted. Presently I am running full bore with all projects and not having a problem. Time will tell.

Thanks to everyone for your help.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8839 - Posted: 17 Mar 2007, 21:38:06 UTC - in response to Message 8819.  

"Nicolas" wrote:

...
Summary: A chain of nasty events. To solve everything I point out on this message, a big lot of fixes would be needed.


A good and detailed analysis, thanks :-)

Hopefully some of these issues can be dealt with in future versions of the Boinc manager. It would be nice to use async calls, but experience shows that async calls are a source of bugs in their own right due to the more complex code required.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8840 - Posted: 17 Mar 2007, 21:50:18 UTC - in response to Message 8820.  

"Jord" wrote:

...

Don't worry about disk writes though. If your disk can't take many disk writes, it's not a good disk. And it won't reach the amount of writes a database disk gets a second.


Not a problem for either of my PCs (supposedly 24/7 raid quality drives), but we worry about these things over at CPDN particularly in respect of laptops (which tend to be designed for intermittent use rather than continual use). The other aspect of laptop drives is that they're designed to spin down after a period to conserve electricity, but the client_state.xml keeps them continually active.

The current version of the climate model does continually read and write huge amounts to disk, but Carl is working on a reduced I/O version which drops this by 95% at the cost of a higher memory footprint. The main remaining thing which keeps the disk busy is the client_state.xml.

MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 8841 - Posted: 17 Mar 2007, 21:57:58 UTC - in response to Message 8831.  

"idahofisherman" wrote:

...
First I ran BOINC at the same time that a Norton Antivirus scan was running. This caused it to happen again.

Secondly, I suspended everything except CPDN and climateprediction. This also caused it to happen.

...


Hope everything continues to work OK :-)

We have some 'generic' advice over at CPDN/BBC CCE which may help avoid future problems:
* Add the Boinc directory to Norton AV's two 'exclusion' lists, so that it gets skipped during the scan. Sometimes the scanner locks files which Boinc / CPDN needs and causes it to crash (rare, but worth avoiding).
* Before running defrag, exit out of Boinc (defrag will only defrag files if they're not in use, and the climate model's files get very highly fragmented quite quickly).
* Take a backup (simply a copy of the entire Boinc directory) at intervals when Boinc is shut down. I do this roughly weekly after doing the defrag. In the event of a crash this can be used to continue running the climate model.
* Before running games, doing video encoding, or anything else which uses the graphics intensively, 'suspend' Boinc so that if the game mangles the graphics card drivers it doesn't cause Boinc to crash.
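
The backup step in the list above can be as simple as the following sketch. The function name and the timestamped folder naming are my own choices, and the demonstration uses a throwaway directory rather than a real Boinc install; run it only while Boinc is shut down.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_boinc_dir(boinc_dir, backup_root):
    """Copy the whole Boinc directory into a new timestamped folder.

    Run this only while Boinc is shut down, so no files are half-written.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    shutil.copytree(boinc_dir, target)
    return target

# Demonstration on a throwaway directory instead of a real Boinc install.
with tempfile.TemporaryDirectory() as tmp:
    fake_boinc = Path(tmp) / "BOINC"
    (fake_boinc / "projects").mkdir(parents=True)
    (fake_boinc / "client_state.xml").write_text("<client_state/>")
    copy = backup_boinc_dir(fake_boinc, Path(tmp) / "backups")
    copied_ok = (copy / "client_state.xml").read_text() == "<client_state/>"
    has_projects = (copy / "projects").is_dir()
print(copied_ok, has_projects)
```

To restore, the copied directory would be put back in place of the damaged one before Boinc is started again.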

There's been talk about Boinc moving the graphics processing to its own thread in the future; this would be very useful, particularly in avoiding graphics-related crashes.

Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8889 - Posted: 19 Mar 2007, 16:44:27 UTC - in response to Message 8839.  

"MikeMarsUK" wrote:

"Nicolas" wrote:

...
Summary: A chain of nasty events. To solve everything I point out on this message, a big lot of fixes would be needed.

A good and detailed analysis, thanks :-)

Hopefully some of these issues can be dealt with in future versions of the Boinc manager. It would be nice to use async calls, but experience shows that async calls are a source of bugs in their own right due to the more complex code required.

Just now I'm dealing with a BOINC Mgr on a computer with broken internet connection. I think it hangs on each attempt to connect to a project, but I can't even make it stop trying to connect, because I can't select "Disable network activity". As soon as it starts responding again, it does another RPC and hangs again!

Also, the CPU WAS IDLE; probably the apps kept exiting because the hung core client was waiting for DNS. Who knows for how many hours it has been like that.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.