Task ____ exited with zero status but no 'finished' file

Jazzop

Joined: 19 Dec 06
Posts: 90
United States
Message 58896 - Posted: 23 Dec 2014, 6:40:26 UTC

I am getting this message from multiple projects, which is why I am asking here rather than on an individual project forum. It has happened frequently over the past few days, after I opened up my machines to other projects now that SIMAP is dead.

The projects so far include:

23-Dec-14 00:27:58 | climateprediction.net | Task hadam3p_eu_h5ur_2013_1_008861412_1 exited with zero status but no 'finished' file

23-Dec-14 00:14:29 | WUProp@Home | Task wu_v4_1419226725_17366_0 exited with zero status but no 'finished' file

22-Dec-14 22:32:46 | Einstein@Home | Task LATeah0045E_80.0_420_-2.97e-10_2 exited with zero status but no 'finished' file

22-Dec-14 01:38:17 | EDGeS@Home | Task 7bead5f8-1cfa-468f-9b62-e7137cf7fddd_612b9788-2781-4010-8e81-fa19752ad9e7_32d9ba32-48ce-4218-bee7-548f26db5174_0 exited with zero status but no 'finished' file

21-Dec-14 22:32:18 | rosetta@home | Task tj_12_16_frag2012_v2_X_64_h17_BBGB_19_DDD_wD_abinitio_SAVE_ALL_OUT_232271_609_0 exited with zero status but no 'finished' file


I am using BOINC v7.4.27 on all machines. OSes include Win7x32, Win7x64, Win2008R2, Win2012R2.
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 58897 - Posted: 23 Dec 2014, 7:03:38 UTC - in response to Message 58896.  

Make sure that you exclude the BOINC Data directory from active scanning by things such as anti-virus scanners, anti-malware scanners, and indexing. You get this message when BOINC tries to write its progress to disk but finds that some other process is holding the affected files in memory, which prevents BOINC from writing to them.

Also only scan and index the Data directory when BOINC isn't running.
Jazzop

Joined: 19 Dec 06
Posts: 90
United States
Message 58971 - Posted: 24 Dec 2014, 20:40:23 UTC - in response to Message 58897.  

I do not have indexing turned on (i.e., Windows Search is disabled).

I do not use any antivirus software other than Windows Defender, and it is not set to run any automatic scans. I only run a manual scan a couple of times a year.

It appears that this problem is much more significant on my Server 2012R2 machine. It is happening with about 10% of all tasks, across all projects.

Is there a potential connection to the fact that these errors spiked a couple of days ago after I reconfigured my preferences on BOINCstats, attached/detached from several projects, and cleared all the local computing preferences on all my hosts?
Gary Charpentier

Joined: 23 Feb 08
Posts: 2465
United States
Message 58980 - Posted: 25 Dec 2014, 16:47:58 UTC - in response to Message 58971.  

The technical thing going on is a heartbeat. The science applications are required to update a time stamp on a periodic basis. If they don't run, they can't update it. When they don't, BOINC assumes they stopped because they are finished and looks for a finished file. No file, and you get an error message. The 99%+ reason that the heartbeat is missed is that other, higher-priority stuff is running on your machine. Frequent causes are the Windows indexer and anti-virus scans. You know what else your machine does. Perhaps when you changed your computing preferences, you let BOINC use too much of the machine, leaving too little for the other tasks it has to do?
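To make the "no finished file" check concrete, here is a minimal sketch of the decision described above. It is not BOINC's actual code: the function name `classify_exit`, the returned labels, and the marker file name `boinc_finish_called` are assumptions for illustration (the thread only calls it the 'finished' file).

```python
import os
import tempfile

# Assumed marker file name; the thread only refers to a 'finished' file.
FINISH_MARKER = "boinc_finish_called"


def classify_exit(exit_status: int, slot_dir: str) -> str:
    """Label how a task process ended (hypothetical helper, not BOINC code)."""
    finished = os.path.exists(os.path.join(slot_dir, FINISH_MARKER))
    if exit_status != 0:
        # Non-zero exit: the app reported a failure.
        return "failed"
    if finished:
        # Zero exit and the marker is present: the task really finished.
        return "completed"
    # Zero exit without the marker: the app stopped without finishing.
    return "exited with zero status but no 'finished' file"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        print(classify_exit(0, slot))  # no marker file written yet
        open(os.path.join(slot, FINISH_MARKER), "w").close()
        print(classify_exit(0, slot))  # marker file now present
```

The point of the sketch is only that the message describes a task that stopped cleanly without signalling completion; it does not by itself say why the task stopped.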
Jazzop

Joined: 19 Dec 06
Posts: 90
United States
Message 59040 - Posted: 28 Dec 2014, 9:31:01 UTC - in response to Message 58980.  

The machine with the most errors is used as a media fileserver along with some general web browsing and document management. I usually snooze or exit BOINC when I play media because it is too jittery otherwise.

Is the failure of the time stamp function likely to be associated with bottlenecking of disk access? If so, would adjusting the checkpoint interval in my BOINC preferences help? I currently have it set to 20s.
Claggy

Joined: 23 Apr 07
Posts: 1112
United Kingdom
Message 59043 - Posted: 28 Dec 2014, 11:38:52 UTC - in response to Message 58980.  
Last modified: 28 Dec 2014, 11:57:16 UTC

BOINC 7.0.37/.38 introduced a new heartbeat mechanism that is supposed to fix this issue.

The question after that is which API version those projects are compiling their apps with, and whether they are using the new or the old heartbeat mechanism:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=4e8b4ddab581839929d4fa666d81161d590a0f8e
- client and API: improve the way an app checks for the death of the client
     Old: heartbeat mechanism
     Problem: if the client is blocked for > 30 secs
         (e.g. because it takes a long time to write the state file,
         or because it's stopped in a debugger)
         then apps exit.
         This is bad if the app doesn't checkpoint and has been
         running for a long time.
     New: the client passes its PID to the app.
         The app periodically (10 sec) checks that the process still exists.
     Notes:
     - For backward compatibility (e.g. new API w/ old client,
         or vice versa) the client still sends heartbeats,
         and the API checks heartbeats if the client doesn't pass a PID.
     - The new mechanism works only if the client's PID isn't assigned
         to a new process within 10 secs of the client exiting.
         Windows 2000 reuses PIDs immediately, so check for Win2K
         and don't use this mechanism if so.

 TODO: For Unix multithread apps,
     critical sections aren't currently being enforced.
     Need to fix this by masking signals.

 svn path=/trunk/boinc/; revision=26147
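
For anyone wondering how the two mechanisms in that commit message differ, here is a rough sketch of the app-side check. It is not the BOINC API: the names `heartbeat_expired`, `process_exists`, and `client_is_alive` are made up for illustration, and the 30-second and 10-second values are taken from the commit text above. The PID probe uses `os.kill(pid, 0)`, which is POSIX-only; on Windows that call would terminate the process, so the sketch refuses to run it there.

```python
import os

HEARTBEAT_TIMEOUT = 30.0  # seconds; "blocked for > 30 secs" in the commit message
PID_CHECK_PERIOD = 10.0   # seconds; how often the app would repeat the PID check


def heartbeat_expired(last_heartbeat: float, now: float,
                      timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """Old mechanism: the client looks dead if no heartbeat arrived in time."""
    return (now - last_heartbeat) > timeout


def process_exists(pid: int) -> bool:
    """New mechanism: probe whether a process with this PID exists (POSIX only)."""
    if os.name != "posix":
        # os.kill(pid, 0) is NOT a harmless probe on Windows.
        raise NotImplementedError("this sketch only supports POSIX systems")
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def client_is_alive(client_pid, last_heartbeat: float, now: float) -> bool:
    """Use the PID check when the client passed a PID, else fall back to heartbeats."""
    if client_pid is not None:
        return process_exists(client_pid)
    return not heartbeat_expired(last_heartbeat, now)
```

With the old path, a client that is merely slow (for example, busy writing its state file for more than 30 seconds) looks the same as a dead one; with the PID path, a slow but still-running client is treated as alive.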


Claggy
Jazzop

Joined: 19 Dec 06
Posts: 90
United States
Message 59167 - Posted: 31 Dec 2014, 9:12:52 UTC - in response to Message 59043.  

I believe the problem with my most offending machine had to do with the wimpy onboard VGA adapter (a crummy ASPEED 2-D chip on this server mobo) being overwhelmed and stealing other system resources. I just installed a GTX 750Ti card and the problems went away. I should have done this years ago, but the last time I tried a video card it conflicted with my RAID adapter and the computer wouldn't boot; so I had been a little gun-shy for too long.

This still doesn't explain why the other machines had a surge in the same error, but they have all settled down as well. The only recent change was a major overhaul/resync of my BAM settings and changing email addresses & passwords on all my project accounts.

Oh well...

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.