Task ____ exited with zero status but no 'finished' file
Author | Message |
---|---|
Joined: 19 Dec 06 Posts: 90 |
I am getting this message from multiple projects, that's why I am asking it here and not on an individual project forum. It has happened a lot in the past few days after I opened up my machines to other projects now that SIMAP is dead. The projects so far include: 23-Dec-14 00:27:58 | climateprediction.net | Task hadam3p_eu_h5ur_2013_1_008861412_1 exited with zero status but no 'finished' file 23-Dec-14 00:14:29 | WUProp@Home | Task wu_v4_1419226725_17366_0 exited with zero status but no 'finished' file 22-Dec-14 22:32:46 | Einstein@Home | Task LATeah0045E_80.0_420_-2.97e-10_2 exited with zero status but no 'finished' file 22-Dec-14 01:38:17 | EDGeS@Home | Task 7bead5f8-1cfa-468f-9b62-e7137cf7fddd_612b9788-2781-4010-8e81-fa19752ad9e7_32d9ba32-48ce-4218-bee7-548f26db5174_0 exited with zero status but no 'finished' file 21-Dec-14 22:32:18 | rosetta@home | Task tj_12_16_frag2012_v2_X_64_h17_BBGB_19_DDD_wD_abinitio_SAVE_ALL_OUT_232271_609_0 exited with zero status but no 'finished' file I am using BOINC v7.4.27 on all machines. OSes include Win7x32, Win7x64, Win2008R2, Win2012R2. |
Joined: 29 Aug 05 Posts: 15569 |
Make sure that you exclude the BOINC Data directory from being actively scanned by things such as anti-virus scanners, anti-malware scanners and file indexers. You get this message when BOINC tries to write its progress to disk but finds that some other process is holding the affected files in memory, which prevents BOINC from writing to them. Also, only scan and index the Data directory when BOINC isn't running. |
Joined: 19 Dec 06 Posts: 90 |
I do not have indexing turned on (i.e., Windows Search is disabled). I do not use any antivirus software other than Windows Defender, and it is not set to run any automatic scans. I only run a manual scan a couple of times a year. It appears that this problem is much more significant on my Server 2012R2 machine. It is happening with about 10% of all tasks, across all projects. Is there a potential connection to the fact that these errors spiked a couple of days ago after I reconfigured my preferences on BOINCstats, attached/detached from several projects, and cleared all the local computing preferences on all my hosts? |
Joined: 23 Feb 08 Posts: 2495 |
I do not have indexing turned on (i.e., Windows Search is disabled). The technical thing going on is a heartbeat. The science applications are required to update a time stamp on a periodic basis. If they don't run, they can't update it. When they don't, BOINC assumes they stopped because they are finished and looks for a finished file. No file and you get an error message. In 99%+ of cases the heartbeat is missed because other, higher-priority stuff is running on your machine. Frequent causes are the Windows indexer and anti-virus scans. You know what else your machine does. Perhaps when you changed your computing preferences you let BOINC use too much of the machine for the other tasks it has to do? |
Joined: 19 Dec 06 Posts: 90 |
The machine with the most errors is used as a media fileserver along with some general web browsing and document management. I usually snooze or exit BOINC when I play media because it is too jittery otherwise. Is the failure of the time stamp function likely to be associated with bottlenecking of disk access? If so, would adjusting the checkpoint interval in my BOINC preferences help? I currently have it set to 20s. |
Joined: 23 Apr 07 Posts: 1112 |
BOINC 7.0.37/.38 introduced a new heartbeat mechanism that is supposed to fix this issue. The question after that is which API version those projects compile their apps with, and whether they use the new or the old heartbeat mechanism: http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=4e8b4ddab581839929d4fa666d81161d590a0f8e - client and API: improve the way an app checks for the death of the client Old: heartbeat mechanism Problem: if the client is blocked for > 30 secs (e.g. because it takes a long time to write the state file, or because it's stopped in a debugger) then apps exit. This is bad if the app doesn't checkpoint and has been running for a long time. New: the client passes its PID to the app. The app periodically (10 sec) checks that the process still exists. Notes: - For backward compatibility (e.g. new API w/ old client, or vice versa) the client still sends heartbeats, and the API checks heartbeats if the client doesn't pass a PID. - The new mechanism works only if the client's PID isn't assigned to a new process within 10 secs of the client exiting. Windows 2000 reuses PIDs immediately, so check for Win2K and don't use this mechanism if so. TODO: For Unix multithread apps, critical sections aren't currently being enforced. Need to fix this by masking signals. svn path=/trunk/boinc/; revision=26147 Claggy |
Joined: 19 Dec 06 Posts: 90 |
I believe the problem with my most offending machine had to do with the wimpy onboard VGA adapter (a crummy ASPEED 2-D chip on this server mobo) being overwhelmed and stealing other system resources. I just installed a GTX 750Ti card and the problems went away. I should have done this years ago, but the last time I tried a video card it conflicted with my RAID adapter and the computer wouldn't boot; so I had been a little gun-shy for too long. This still doesn't explain why the other machines had a surge in the same error, but they have all settled down as well. The only recent change was a major overhaul/resync of my BAM settings and changing email addresses & passwords on all my project accounts. Oh well... |
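
For anyone trying to picture the two mechanisms discussed in this thread (the old heartbeat timeout and the newer "does the client's PID still exist" check from the quoted commit message), here is a rough Python sketch. It is not BOINC's actual code, which lives in the C++ API library. The function names (`heartbeat_expired`, `pid_alive`, `client_seems_dead`) are made up for illustration, the 30-second figure is simply the one mentioned in the commit message, and the PID check as written only covers POSIX systems, not the Windows hosts in this thread.

```python
import os
from typing import Callable, Optional

# Figure taken from the commit message quoted in the thread; the real
# client/API may use different constants.
HEARTBEAT_TIMEOUT_S = 30.0


def heartbeat_expired(last_heartbeat: float, now: float,
                      timeout: float = HEARTBEAT_TIMEOUT_S) -> bool:
    """Old scheme: True once more than `timeout` seconds have passed
    since the last heartbeat was seen."""
    return (now - last_heartbeat) > timeout


def pid_alive(pid: int) -> bool:
    """New scheme (POSIX-only sketch): does a process with this PID exist?

    Signal 0 runs the existence/permission check without delivering a
    signal. Do not use this on Windows, where os.kill() behaves differently.
    """
    if os.name != "posix":
        raise NotImplementedError("this sketch only covers POSIX systems")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def client_seems_dead(client_pid: Optional[int], last_heartbeat: float,
                      now: float,
                      is_alive: Callable[[int], bool] = pid_alive) -> bool:
    """Hypothetical helper: prefer the PID check when a PID was passed,
    otherwise fall back to the heartbeat timeout (the backward-compatible
    path described in the commit notes)."""
    if client_pid is not None:
        return not is_alive(client_pid)
    return heartbeat_expired(last_heartbeat, now)
```

Under these assumptions, a client that stalls for 45 seconds without passing a PID would be treated as gone (`client_seems_dead(None, 0.0, 45.0)` is true), whereas with a PID the same stall is tolerated as long as the client process still exists. That difference is why a busy machine can trigger the error with apps built against the older API.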
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.