Thread 'Task ____ exited with zero status but no 'finished' file'

Jazzop (Joined: 19 Dec 06, Posts: 90)
Message 58896 - Posted: 23 Dec 2014, 6:40:26 UTC

I am getting this message from multiple projects, which is why I am asking here and not on an individual project forum. It has happened a lot in the past few days, after I opened up my machines to other projects now that SIMAP is dead. The projects so far include:

23-Dec-14 00:27:58 | climateprediction.net | Task hadam3p_eu_h5ur_2013_1_008861412_1 exited with zero status but no 'finished' file
23-Dec-14 00:14:29 | WUProp@Home | Task wu_v4_1419226725_17366_0 exited with zero status but no 'finished' file
22-Dec-14 22:32:46 | Einstein@Home | Task LATeah0045E_80.0_420_-2.97e-10_2 exited with zero status but no 'finished' file
22-Dec-14 01:38:17 | EDGeS@Home | Task 7bead5f8-1cfa-468f-9b62-e7137cf7fddd_612b9788-2781-4010-8e81-fa19752ad9e7_32d9ba32-48ce-4218-bee7-548f26db5174_0 exited with zero status but no 'finished' file
21-Dec-14 22:32:18 | rosetta@home | Task tj_12_16_frag2012_v2_X_64_h17_BBGB_19_DDD_wD_abinitio_SAVE_ALL_OUT_232271_609_0 exited with zero status but no 'finished' file

I am using BOINC v7.4.27 on all machines. OSes include Win7x32, Win7x64, Win2008R2, Win2012R2.

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15573)
Message 58897 - Posted: 23 Dec 2014, 7:03:38 UTC - in response to Message 58896.

Make sure that you exclude the BOINC Data directory from being actively scanned by things such as anti-virus scanners, anti-malware scanners, indexing, etc. You get this message when BOINC tries to write its progress to disk but finds that some other process is holding the affected files in memory, which prevents BOINC from writing to them. Also, only scan and index the Data directory when BOINC isn't running.

Jazzop (Joined: 19 Dec 06, Posts: 90)
Message 58971 - Posted: 24 Dec 2014, 20:40:23 UTC - in response to Message 58897.

I do not have indexing turned on (i.e., Windows Search is disabled). I do not use any antivirus software other than Windows Defender, and it is not set to run any automatic scans. I only run a manual scan a couple of times a year.

It appears that this problem is much more significant on my Server 2012R2 machine. It is happening with about 10% of all tasks, across all projects.

Is there a potential connection to the fact that these errors spiked a couple of days ago after I reconfigured my preferences on BOINCstats, attached/detached from several projects, and cleared all the local computing preferences on all my hosts?

Gary Charpentier (Joined: 23 Feb 08, Posts: 2497)
Message 58980 - Posted: 25 Dec 2014, 16:47:58 UTC - in response to Message 58971.

The technical thing going on is a heartbeat. The science applications are required to update a time stamp on a periodic basis. If they don't run, they can't update it. When they don't, BOINC assumes they stopped because they are finished and looks for a finished file. No file, and you get an error message.

In 99%+ of cases, the heartbeat is missed because other, higher-priority stuff is running on your machine. Frequent causes are the Windows indexer and anti-virus scans. You know what else your machine does. Perhaps when you changed your computing preferences, you let BOINC use too much of the machine, leaving too little for the other tasks it has to do?
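
The exit handling Gary describes can be sketched roughly as follows. This is a minimal illustration in Python, not BOINC's actual code: the function name `classify_exit`, the marker name `finished`, and the idea of passing a slot directory are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical marker name, chosen for illustration only; the real client
# may use a different file name.
FINISHED_MARKER = "finished"


def classify_exit(exit_status: int, slot_dir: str) -> str:
    """Classify a task exit along the lines described in the post.

    A zero exit status only counts as a normal completion if the
    application also left its 'finished' marker behind.
    """
    if exit_status != 0:
        return "failed"
    if os.path.exists(os.path.join(slot_dir, FINISHED_MARKER)):
        return "finished"
    # The app stopped cleanly but never declared itself done.
    return "exited with zero status but no 'finished' file"


def demo() -> list:
    """Show both zero-status outcomes using a throwaway directory."""
    with tempfile.TemporaryDirectory() as slot:
        before = classify_exit(0, slot)
        open(os.path.join(slot, FINISHED_MARKER), "w").close()
        after = classify_exit(0, slot)
    return [before, after]
```

Under these assumptions, a task that stops because it missed its heartbeat looks, from the outside, like a clean exit without the marker, which is why the message shows up across unrelated projects.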

Jazzop (Joined: 19 Dec 06, Posts: 90)
Message 59040 - Posted: 28 Dec 2014, 9:31:01 UTC - in response to Message 58980.

The machine with the most errors is used as a media fileserver along with some general web browsing and document management. I usually snooze or exit BOINC when I play media because it is too jittery otherwise.

Is the failure of the time stamp function likely to be associated with bottlenecking of disk access? If so, would adjusting the checkpoint interval in my BOINC preferences help? I currently have it set to 20s.

Claggy (Joined: 23 Apr 07, Posts: 1112)
Message 59043 - Posted: 28 Dec 2014, 11:38:52 UTC - in response to Message 58980. Last modified: 28 Dec 2014, 11:57:16 UTC

BOINC 7.0.37/.38 introduced a new heartbeat mechanism that is supposed to fix this issue. The question after that is which API version those projects are compiling their apps with, and whether they are using the new or old heartbeat mechanism:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=4e8b4ddab581839929d4fa666d81161d590a0f8e

- client and API: improve the way an app checks for the death of the client

Old: heartbeat mechanism
Problem: if the client is blocked for > 30 secs (e.g. because it takes a long time to write the state file, or because it's stopped in a debugger) then apps exit. This is bad if the app doesn't checkpoint and has been running for a long time.

New: the client passes its PID to the app. The app periodically (10 sec) checks that the process still exists.

Notes:
- For backward compatibility (e.g. new API w/ old client, or vice versa) the client still sends heartbeats, and the API checks heartbeats if the client doesn't pass a PID.
- The new mechanism works only if the client's PID isn't assigned to a new process within 10 secs of the client exiting. Windows 2000 reuses PIDs immediately, so check for Win2K and don't use this mechanism if so.

TODO: For Unix multithread apps, critical sections aren't currently being enforced. Need to fix this by masking signals.

svn path=/trunk/boinc/; revision=26147

Claggy
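
The old and new checks described in the commit message can be sketched like this. It is a rough Python illustration under stated assumptions, not the BOINC API: the names `client_alive`, `pid_exists`, and both constants are hypothetical, with the 30-second and 10-second values taken from the commit text. The `pid_exists` probe is POSIX-only; on Windows `os.kill` behaves differently and must not be used this way.

```python
import os

HEARTBEAT_TIMEOUT = 30.0  # seconds; "blocked for > 30 secs" in the commit text
PID_CHECK_PERIOD = 10.0   # seconds; how often the app would run the check


def pid_exists(pid: int) -> bool:
    """POSIX-only probe: signal 0 tests for existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def client_alive(client_pid, last_heartbeat, now, probe=pid_exists) -> bool:
    """Decide whether the app should keep running.

    client_pid     -- PID passed by the client, or None for an old client
    last_heartbeat -- time (seconds) the last heartbeat was seen
    now            -- current time (seconds)
    probe          -- liveness check, injectable for testing
    """
    if client_pid is not None:
        # New mechanism: a slow or paused client is still a live process,
        # so a long state-file write no longer makes the app exit.
        return probe(client_pid)
    # Old mechanism, kept for backward compatibility: give up once the
    # heartbeat is older than the timeout.
    return (now - last_heartbeat) <= HEARTBEAT_TIMEOUT
```

In this sketch the app would call `client_alive` every `PID_CHECK_PERIOD` seconds. The difference between the two branches is that the heartbeat path treats a client blocked for more than 30 seconds as dead, while the PID path does not.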

Jazzop (Joined: 19 Dec 06, Posts: 90)
Message 59167 - Posted: 31 Dec 2014, 9:12:52 UTC - in response to Message 59043.

I believe the problem with my most offending machine had to do with the wimpy onboard VGA adapter (a crummy ASPEED 2-D chip on this server mobo) being overwhelmed and stealing other system resources. I just installed a GTX 750Ti card and the problems went away. I should have done this years ago, but the last time I tried a video card it conflicted with my RAID adapter and the computer wouldn't boot, so I had been a little gun-shy for too long.

This still doesn't explain why the other machines had a surge in the same error, but they have all settled down as well. The only recent change was a major overhaul/resync of my BAM settings and changing email addresses & passwords on all my project accounts. Oh well...

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.