Stop file deletion on reporting of tasks that error out.

Profile Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 107922 - Posted: 21 Apr 2022, 19:55:10 UTC

Is there some hidden way to do this? Sometimes when a task crashes, the task is badly behaved and the model folder is not deleted. Is there a way to automatically stop the task data from being deleted if the task has crashed? Earlier today, a testing task crashed and the information on the task page in stderr is incomplete and does not show what the problem was. Examination of the files might be helpful. A bonus would be the ability to set a flag for this to only apply to one project at a time or even only one task at a time!

I know what the likely answer is but on the off chance....
ID: 107922 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 107925 - Posted: 21 Apr 2022, 21:44:21 UTC - in response to Message 107922.  
Last modified: 21 Apr 2022, 21:46:38 UTC

You can set <exit_before_start> in cc_config.xml after a task has started. Then the client will exit at the end of that task (no matter the outcome) and before a new task starts. How this works on a multicore system is anyone's guess though.

Maybe if you start the client with the --fetch_minimal_work option, it won't completely load up all cores.
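
As a rough sketch of the cc_config.xml setting mentioned above: the snippet writes a minimal file with <exit_before_start> placed inside the <options> section. CONFIG_DIR is a placeholder introduced here and defaults to a scratch directory, so nothing in a real BOINC data directory is overwritten. An existing cc_config.xml should be edited by hand rather than replaced.

```shell
# Sketch only: write a minimal cc_config.xml with exit_before_start enabled.
# CONFIG_DIR is a placeholder (not a BOINC variable); it defaults to a scratch
# directory so that a real configuration is never overwritten by accident.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"

cat > "$CONFIG_DIR/cc_config.xml" <<'EOF'
<cc_config>
  <options>
    <exit_before_start>1</exit_before_start>
  </options>
</cc_config>
EOF

echo "Wrote $CONFIG_DIR/cc_config.xml"
# After merging the option into the real file, a running client can be asked
# to re-read it with:  boinccmd --read_cc_config
```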
ID: 107925 · Report as offensive
Profile Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 107927 - Posted: 22 Apr 2022, 5:11:58 UTC

What I think I need to do is stop the task from reporting until all the uploads have completed. I assume there are ways to do this when writing the scientific tasks and packaging them up for BOINC, but that doesn't solve my problem with the current tasks.

What I think is happening is that the task completes with a few GB of data still uploading. The task reports and the files get deleted, resulting in an error and a number of files not being uploaded. I will try later today to catch a task and pause it for long enough to let the files upload before resuming computation. It is only a problem on my 5-6MB/second or slower bored band; with a faster connection there would be no problem.
ID: 107927 · Report as offensive
computezrmle

Joined: 2 Feb 22
Posts: 81
Germany
Message 107928 - Posted: 22 Apr 2022, 5:50:43 UTC - in response to Message 107922.  

A simple workaround (on Linux) might be a command like this:
tail -F /path/to/logfile/joblog.log >/path/to/logfilecopy/joblog.log

Run it in a separate monitoring console just before you start a task that you expect to fail.
It keeps a copy of a single well known logfile even if the main process (e.g. BOINC) quickly removes the original logfile after a crash.
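
The tail approach above keeps one log file. If several files in a directory matter, a similar idea is to mirror the whole directory at intervals so that a copy survives the clean-up. The following is a minimal sketch, assuming GNU cp (for the -u option); the function name mirror_dir and both paths in the usage comment are placeholders, not anything BOINC provides.

```shell
# mirror_dir SRC DEST: copy everything in SRC into DEST, keeping attributes (-a)
# and skipping files whose copy in DEST is already up to date (-u, GNU cp).
mirror_dir() {
    mkdir -p "$2" && cp -au "$1/." "$2/"
}

# Hypothetical usage; both paths are placeholders:
# while [ -d /path/to/slots/0 ]; do
#     mirror_dir /path/to/slots/0 /path/to/slotcopy
#     sleep 60
# done
```

Files that vanish while a copy is in progress will make cp report errors, and with several GB of model output each pass costs disk space and I/O, so the interval is a trade-off.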
ID: 107928 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 107930 - Posted: 22 Apr 2022, 8:17:14 UTC
Last modified: 22 Apr 2022, 8:18:24 UTC

I am trying to get my head round the sequence of events here. Are you saying that the task crashes because of some programming/data error, and immediately goes into "ready to report" status: but some intermediate upload files, created earlier in the run, are still plodding along and worth saving?

Or are you saying that the task finishes normally, with a "success" status, completes its normal end-of-run housekeeping (including uploading final exit files), but transitions to "ready to report" too quickly? The BOINC client should monitor those end-of-run files, and delay changing status and reporting the task until uploads are complete. Anything else would be a bug, and I think we would have heard about it by now.

I suspect that the real problem may be that CPDN makes extensive use of intermediate upload files not associated with the final exit from the task. These uploads are associated with, but distinct from, 'trickle' reports of "progress so far". I suspect that these intermediate files are uploaded asynchronously (*) while the tasks are running, and not monitored for status when considering whether the task is "ready to report".

If that's the case, I can't think of a way of preventing a report once "ready to report" is reached - it'll be reported an hour later, even if suspended. But if you can suspend the task after the final intermediate ('trickle') file has been created, but before the final wrap-up, that might work. Otherwise, it'll need a bug-fix in the client.
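
For anyone who prefers to suspend and resume from a terminal rather than the Manager, boinccmd accepts a task operation in the form --task URL TASK_NAME OP. The wrapper below is only a sketch: task_op and DRY_RUN are names made up for this example, and the URL and task name must be the ones the client itself shows for the task. By default it prints the command instead of running it.

```shell
# task_op URL TASK_NAME OP: suspend or resume one task through boinccmd.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
task_op() {
    url=$1
    name=$2
    op=$3
    case $op in
        suspend|resume) ;;
        *) echo "unsupported operation: $op" >&2; return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo boinccmd --task "$url" "$name" "$op"
    else
        boinccmd --task "$url" "$name" "$op"
    fi
}

# Placeholder project URL and task name, printed only:
task_op "http://example.invalid/" "some_task_name_0" suspend
```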

* The sequence of events for a current main-project task is
21/04/2022 23:17:18 | climateprediction.net | Sending scheduler request: To send trickle-up message.
21/04/2022 23:17:19 | climateprediction.net | Scheduler request completed
21/04/2022 23:17:22 | climateprediction.net | Started upload of hadam4_a1f0_200010_13_929_012137174_2_r1038383722_7.zip
So the upload is never complete when reported as a trickle. The project might look at that.
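
To see how large the upload backlog is at any moment, a saved copy of the Event Log can be scanned for uploads that started but never finished. This is a sketch under one assumption: that completion is logged as "Finished upload of <file>", mirroring the "Started upload of" lines quoted above. The function name pending_uploads is invented for this example.

```shell
# pending_uploads LOGFILE: print files with a "Started upload of" line
# but no later "Finished upload of" line (the latter wording is assumed).
pending_uploads() {
    awk '
        / Started upload of /  { sub(/.* Started upload of /, "");  pending[$0] = 1 }
        / Finished upload of / { sub(/.* Finished upload of /, ""); delete pending[$0] }
        END { for (f in pending) print f }
    ' "$1"
}

# Usage (the path is a placeholder for a saved Event Log):
# pending_uploads /path/to/eventlog.txt
```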
ID: 107930 · Report as offensive
Profile Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 107931 - Posted: 22 Apr 2022, 9:57:04 UTC - in response to Message 107930.  

I suspect that the real problem may be that CPDN makes extensive use of intermediate upload files not associated with the final exit from the task. These uploads are associated with, but distinct from, 'trickle' reports of "progress so far". I suspect that these intermediate files are uploaded asynchronously (*) while the tasks are running, and not monitored for status when considering whether the task is "ready to report".


Yes, I think it is a problem with the intermediate files, which a back-of-the-envelope calculation suggests run to around 15GB in total. With luck, I can be around today to monitor what is happening and suspend the task shortly before it finishes, both to allow the files to upload and to determine whether the task really is crashing near the end or not. This is complicated by the fact that some tasks have crashed without having a backlog of these files to upload.

Glenn, who is running these tests, has told me which files in the slot directory contain the data he wants to monitor using tail, so I will run that and see if it gives any clues.
ID: 107931 · Report as offensive
Bryn Mawr
Help desk expert

Joined: 31 Dec 18
Posts: 284
United Kingdom
Message 107932 - Posted: 22 Apr 2022, 10:44:09 UTC

Isn’t there a problem that affects CPDN, due to their large final files or where multiple tasks finish at the same time, where BOINC reports that the files have not been processed within the time limit and kills the process?
ID: 107932 · Report as offensive
Profile Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 107933 - Posted: 22 Apr 2022, 10:53:05 UTC - in response to Message 107932.  

Isn’t there a problem that affects CPDN, due to their large final files or where multiple tasks finish at the same time, where BOINC reports that the files have not been processed within the time limit and kills the process?


This may be what is happening with my tasks. I am not sure yet what the final upload file size is on these, though the cause could just as easily be the number of intermediate files, which on the first task I ran were backed up by over 5GB when the task finished/crashed.
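
For a sense of scale, here is the arithmetic for clearing a 5GB backlog at the "5-6MB/second" quoted earlier in the thread. The thread does not say whether that figure means megabytes or megabits per second, so this sketch works out both readings. It takes 5GB as 5000 MB, and upload_seconds is a name made up for this example.

```shell
# upload_seconds SIZE_MB RATE_MBIT: whole seconds needed to send SIZE_MB
# megabytes at RATE_MBIT megabits per second (1 byte = 8 bits).
upload_seconds() {
    echo $(( $1 * 8 / $2 ))
}

upload_seconds 5000 5    # read as 5 megabits per second: 8000 s, about 2 h 13 min
upload_seconds 5000 40   # read as 5 megabytes per second: 1000 s, about 17 min
```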
ID: 107933 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 107934 - Posted: 22 Apr 2022, 11:24:00 UTC - in response to Message 107932.  

Isn’t there a problem that affects CPDN, due to their large final files or where multiple tasks finish at the same time, where BOINC reports that the files have not been processed within the time limit and kills the process?
That certainly has been a problem for CPDN in the past (many years ago), but that was on an even slower scale than Dave's bored band. The life expectancy of an upload file was originally 14 days, but that was too short to replace a failed upload server: the limit was extended (IIRC) to 90 days.
ID: 107934 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 107935 - Posted: 22 Apr 2022, 11:33:04 UTC

I wonder if Dave needs this fix: https://github.com/BOINC/boinc/pull/4575
ID: 107935 · Report as offensive
Profile Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 107939 - Posted: 22 Apr 2022, 21:03:14 UTC
Last modified: 22 Apr 2022, 21:17:56 UTC

I have used the tail command to capture the file and sent it to Glenn to see what he can make of it. Hopefully, he can tell me whether or not the problem is my connection being so slow. This time only the final upload was missing, as I stopped computation several times to allow the data to go through before continuing.

Edit: I read the wrong line on my tasks page; the task actually completed. There is still no proof, though, that pausing the task to let uploads clear before allowing it to finish is what made the difference.
ID: 107939 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.