Stop file deletion on reporting of tasks that error out.
Author | Message |
---|---|
Send message Joined: 28 Jun 10 Posts: 2675 |
Is there some hidden way to do this? Sometimes when a task crashes, the task is badly behaved and the model folder is not deleted. Is there a way to automatically stop the task data from being deleted if the task has crashed? Earlier today, a testing task crashed and the information on the task page in stderr is incomplete and does not show what the problem was. Examination of the files might be helpful. A bonus would be the ability to set a flag for this to only apply to one project at a time or even only one task at a time! I know what the likely answer is but on the off chance.... |
Send message Joined: 29 Aug 05 Posts: 15551 |
You can set <exit_before_start> in cc_config.xml after a task has started. The client will then exit at the end of that task (whatever the outcome) and before a new task starts. How this works on a multicore system is anyone's guess, though. Maybe if you start the client with the --fetch_minimal_work option, it won't completely load up all cores. |
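As a rough sketch of the setting mentioned above, a cc_config.xml fragment might look like the following. The enclosing <cc_config> and <options> elements and the value 1 are assumptions based on the usual layout of that file, so check the client configuration documentation for your BOINC version before relying on it.

```xml
<cc_config>
  <options>
    <!-- Assumed placement and value: ask the client to exit
         just before it would start another task -->
    <exit_before_start>1</exit_before_start>
  </options>
</cc_config>
```

The client normally has to re-read its config files (or be restarted) before a change like this takes effect.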
Send message Joined: 28 Jun 10 Posts: 2675 |
What I think I need to do is stop the task from being reported until all the uploads have completed. I assume there are ways to do this when writing the scientific tasks and packaging them up for BOINC, but that doesn't solve my problem with the current tasks. What I think is happening is that the task completes with a few GB of data still uploading. The task reports and the files get deleted, resulting in an error and a number of files not being uploaded. I will try later today to catch a task and pause it for long enough to let the files upload before resuming computation. It is only a problem on my 5-6MB/second or slower bored band; with a faster connection there is no problem. |
Send message Joined: 2 Feb 22 Posts: 81 |
A simple workaround (on Linux) might be a command like this: tail -F /path/to/logfile/joblog.log >/path/to/logfilecopy/joblog.log Run it in a separate monitoring console just before you start a task that you expect to fail. It keeps a copy of a single, well-known logfile even if the main process (e.g. BOINC) quickly removes the original logfile after a crash. |
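The tail workaround above could be wrapped in a small helper so that it runs in the background and can be stopped afterwards. This is only a sketch: mirror_log and MIRROR_PID are hypothetical names, the paths are placeholders, and the added -n +1 (copy what is already in the file before following it) is a choice made for this example rather than part of the original command.

```shell
#!/bin/sh
# Hypothetical helper: mirror_log SRC DST
# Starts "tail -F" in the background so that DST keeps a copy of SRC
# even if SRC is deleted later. The PID of tail is left in MIRROR_PID.
mirror_log() {
    src=$1
    dst=$2
    mkdir -p "$(dirname "$dst")"
    # -n +1 copies existing content first; -F keeps following the file by name.
    # tail's own messages (e.g. when SRC disappears) are discarded.
    tail -n +1 -F "$src" >"$dst" 2>/dev/null &
    MIRROR_PID=$!
}
```

Typical use would be mirror_log /path/to/logfile/joblog.log /path/to/logfilecopy/joblog.log before starting the task, then kill "$MIRROR_PID" once the task has finished or crashed.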
Send message Joined: 5 Oct 06 Posts: 5124 |
I am trying to get my head round the sequence of events here. Are you saying that the task crashes because of some programming/data error, and immediately goes into "ready to report" status: but some intermediate upload files, created earlier in the run, are still plodding along and worth saving? Or are you saying that the task finishes normally, with a "success" status, completes its normal end-of-run housekeeping (including uploading final exit files), but transitions to "ready to report" too quickly? The BOINC client should monitor those end-of-run files, and delay changing status and reporting the task until uploads are complete. Anything else would be a bug, and I think we would have heard about it by now. I suspect that the real problem may be that CPDN makes extensive use of intermediate upload files not associated with the final exit from the task. These uploads are associated with, but distinct from, 'trickle' reports of "progress so far". I suspect that these intermediate files are uploaded asynchronously (*) while the tasks are running, and not monitored for status when considering whether the task is "ready to report". If that's the case, I can't think of a way of preventing a report once "ready to report" is reached - it'll be reported an hour later, even if suspended. But if you can suspend the task after the final intermediate ('trickle') file has been created, but before the final wrap-up, that might work. Otherwise, it'll need a bug-fix in the client. * The sequence of events for a current main-project task is: 21/04/2022 23:17:18 | climateprediction.net | Sending scheduler request: To send trickle-up message. 21/04/2022 23:17:19 | climateprediction.net | Scheduler request completed 21/04/2022 23:17:22 | climateprediction.net | Started upload of hadam4_a1f0_200010_13_929_012137174_2_r1038383722_7.zip So the upload is never complete when reported as a trickle. The project might look at that. |
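The "suspend before the final wrap-up and let the uploads drain" idea could be scripted roughly as below. It is a sketch under several assumptions: pending_uploads and pause_until_uploaded are hypothetical names, the project URL and task name are supplied by the user, and the script assumes that boinccmd --get_file_transfers prints a "direction: upload" line for each pending upload, which may differ between client versions. It does nothing for a task that has already reached "ready to report".

```shell
#!/bin/sh
# Command used to talk to the client; can be overridden for testing.
BOINCCMD=${BOINCCMD:-boinccmd}

# Count pending uploads. ASSUMPTION: each upload shows up as a
# "direction: upload" line in the --get_file_transfers output.
pending_uploads() {
    "$BOINCCMD" --get_file_transfers | grep -c 'direction: upload'
}

# Hypothetical helper: pause_until_uploaded PROJECT_URL TASK_NAME
# Suspends the task, waits until no uploads are pending, then resumes it.
pause_until_uploaded() {
    url=$1
    task=$2
    "$BOINCCMD" --task "$url" "$task" suspend || return 1
    while [ "$(pending_uploads)" -gt 0 ]; do
        sleep "${POLL_SECONDS:-60}"
    done
    "$BOINCCMD" --task "$url" "$task" resume
}
```

The polling interval defaults to 60 seconds and can be changed through POLL_SECONDS; on a slow connection a longer interval is probably fine.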
Send message Joined: 28 Jun 10 Posts: 2675 |
"I suspect that the real problem may be that CPDN makes extensive use of intermediate upload files not associated with the final exit from the task. These uploads are associated with, but distinct from, 'trickle' reports of "progress so far". I suspect that these intermediate files are uploaded asynchronously (*) while the tasks are running, and not monitored for status when considering whether the task is "ready to report"." Yes, I think it is a problem with the intermediate files, which a back-of-the-envelope calculation suggests run to around 15GB in total. With luck, today I can be around to monitor what is happening and suspend the task shortly before it finishes, to allow the files to upload and to determine whether the task really is crashing near the end or not. This is complicated by the fact that some have crashed without a backlog of these files to upload. Glenn, who is running these tests, has told me which files in the slot directory will have data he wants to monitor using tail, so I will run that and see if it gives any clues. |
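For a sense of scale, the figures quoted in this thread can be put through a quick calculation. The sketch below takes the "5-6MB/second" connection speed at face value as megabytes per second and treats 15GB as 15000 MB; both are assumptions about units, and upload_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: upload_seconds SIZE_MB RATE_MB_PER_S
# Prints the whole number of seconds needed to upload SIZE_MB at the given rate.
upload_seconds() {
    echo $(( $1 / $2 ))
}

# 15000 MB of intermediate files at 5 MB/s
upload_seconds 15000 5
```

Under those assumptions the total is 3000 seconds, or about 50 minutes of uploading spread over the run. If the speed was meant as megabits per second, the time would be roughly eight times longer.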
Send message Joined: 31 Dec 18 Posts: 293 |
Isn’t there a problem that affects CPDN due to their large final files or where multiple tasks finish at the same time where BOINC reports that the files have not been processed within the time limit and kills the process? |
Send message Joined: 28 Jun 10 Posts: 2675 |
"Isn’t there a problem that affects CPDN due to their large final files or where multiple tasks finish at the same time where BOINC reports that the files have not been processed within the time limit and kills the process?" This may be what is happening with my tasks. Not sure what final upload file size is on these yet though it could just as easily be the number of intermediate files which on the first task I ran were backed up by over 5GB when the task finished/crashed. |
Send message Joined: 5 Oct 06 Posts: 5124 |
"Isn’t there a problem that affects CPDN due to their large final files or where multiple tasks finish at the same time where BOINC reports that the files have not been processed within the time limit and kills the process?" That certainly has been a problem for CPDN in the past (many years ago), but that was on an even slower scale than Dave's bored band. The life expectancy of an upload file was originally 14 days, but that was too short to replace a failed upload server: the limit was extended (IIRC) to 90 days. |
Send message Joined: 29 Aug 05 Posts: 15551 |
I wonder if Dave needs this fix: https://github.com/BOINC/boinc/pull/4575 |
Send message Joined: 28 Jun 10 Posts: 2675 |
Have used tail command to send file to Glenn to see what he can make of it. Hopefully, he can tell me whether it is the problem of my connection being so slow or not though there was only the final upload that was missing as I stopped computation several times to allow the data to go through before continuing. Edit: read the wrong line on my tasks page and the task completed. Still no proof that it was my pausing the task to let uploads clear before allowing the task to finish that made the difference though. |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.