Occasionally 100% done project never "reports"

Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21670 - Posted: 7 Dec 2008, 20:58:38 UTC
Last modified: 7 Dec 2008, 21:03:54 UTC

I have seen this occasionally and am not sure what is causing it, and I don't know what to do except abort. The project (for example, Rosetta http://boinc.bakerlab.org/rosetta/workunit.php?wuid=193442517 ) supposedly completes but never reports or uploads. Progress is 100%, the completion date shows "-", and the CPU time never increments. BoincView highlights it in yellow and shows 0.0 CPU efficiency. I can't seem to force it to report, and blocking all other tasks and issuing suspend and resume does not affect the task. I end up aborting it.

The project is online, because other work units get reported. This problem just occurred on Rosetta 5.10.45 and a dual Opteron, but I have seen it on other projects and different platforms. Can switching to 6.x fix this? It occurs only rarely, but I have a farm of computers and it is a pain to upgrade them all to 6.x.

Some time ago I had a fast system that dropped in WU production, and when I went back to see what had happened I found a "usually 5 hour" work unit with an error where the diagnostic (stderr) showed a timeout by BOINC. As I remember, the WU ran for almost 3 weeks before the timeout occurred, even though the typical time was about 5 hours. Could this be a symptom of the same problem? Maybe the WU thinks it reported that it was done, but the report was never sent and there is no way to make it report a second time. I am just guessing.
ID: 21670
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 21671 - Posted: 7 Dec 2008, 21:14:29 UTC - in response to Message 21670.  

What is your setting for "Computer is connected to the Internet about every" in your preferences? That setting determines when tasks are reported.

For BOINC 5.8 and above: Completed work is reported at the first of:

1) 24 hours before deadline
2) Connect Every X before deadline.
3) Immediately if it is later than either 1 or 2 upon completion of the task.
4) 24 hours after completion.
5) On a trickle up message (CPDN only, I believe).
6) On a trickle down request.
7) On a server scheduled connection. Used, but I am not certain by which project.
8) On a request for new work.
9) When the user pushes the update button.
10) When using the <report_results_immediately> option in cc_config.xml it will happen 1 minute after uploading. (requires BOINC 5.10.45 or later)
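
As an illustration of option 10, here is a minimal sketch that generates such a cc_config.xml with Python's standard library. The `<cc_config>`/`<options>` nesting follows the usual layout of that file as far as I know it; the helper name `build_cc_config` is made up for this example, and the resulting text still has to be saved as cc_config.xml in the BOINC data directory by hand.

```python
import xml.etree.ElementTree as ET


def build_cc_config(report_immediately: bool = True) -> str:
    """Return cc_config.xml text with report_results_immediately set.

    build_cc_config is a hypothetical helper, not part of BOINC.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "report_results_immediately")
    flag.text = "1" if report_immediately else "0"
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config()
print(config_text)
```

After placing the file, the client has to re-read its configuration (or be restarted) before the option takes effect.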
ID: 21671
Profile Gundolf Jahn

Joined: 20 Dec 07
Posts: 1069
Germany
Message 21674 - Posted: 7 Dec 2008, 23:58:20 UTC - in response to Message 21670.  

I have seen this occasionally and am not sure what is causing it, and I don't know what to do except abort. The project (for example, Rosetta http://boinc.bakerlab.org/rosetta/workunit.php?wuid=193442517 ) supposedly completes but never reports or uploads. Progress is 100%, the completion date shows "-", and the CPU time never increments. BoincView highlights it in yellow and shows 0.0 CPU efficiency.

I assume that the task (not project) is still showing as Running in BOINC Manager's Tasks tab. If so, Ageless's answer wouldn't apply.

I can't seem to force it to report, and blocking all other tasks and issuing suspend and resume does not affect the task. I end up aborting it.

Did you try completely stopping BOINC (Advanced/Shut down connected client, File/Exit) and restarting, rather than suspending? I have read in several threads that this helped sometimes.

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
ID: 21674
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21675 - Posted: 8 Dec 2008, 2:42:12 UTC - in response to Message 21674.  

The message I see is "Waiting to run". Clicking Update does not clear it. I have tried a number of things, such as stopping all new work, suspending all other projects and tasks, and then a couple of suspend/resume cycles to try to kick it into finishing. I only did this after about 3-4 days of observing the task stuck at 100% with the report deadline approaching.

Next time this happens I will reboot; I probably should have tried that. I am pretty sure I waited at least 2 days this last time before aborting. This event is very rare, but I have 12 systems, totaling 36 cores and 1 CUDA device. I see it about once a month, and only because BoincView highlights the "0.0" CPU efficiency; otherwise I would not notice.
ID: 21675
Aurora Borealis
Joined: 8 Jan 06
Posts: 448
Canada
Message 26145 - Posted: 20 Jul 2009, 15:52:56 UTC
Last modified: 20 Jul 2009, 15:59:51 UTC

This happens on a lot of projects. The percentage may be at 100%, but that doesn't mean the work is finished. Many projects have cleanup work that needs to be done after 100% is reached. If the WU happens to be switched out during this time, it just has to wait its turn like any other incomplete job. Some projects need 10-15 min. to finish off. I've seen projects quickly get to 100% and then spend an hour to complete. The devs don't want to keep the WU actively running when they can't know how much extra time may be needed.

Case in point: WCG's 'Help Cure Muscular Dystrophy' needs a lot of extra time to finish. Worse, in this case, if you restart BOINC it will often reprocess the entire WU from the beginning and go to 200% before finishing. (This is a project software bug.)

Boinc V 7.4.36
Win7 i5 3.33G 4GB NVidia 470
ID: 26145
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 26148 - Posted: 20 Jul 2009, 16:36:53 UTC - in response to Message 26146.  

I've never seen you report 100% with a 'waiting to run' on WCG... no one has to my recollection.

Be the first then, as this is an application problem, not a BOINC problem.
Or what do you want the developers to do? Add a check for 100% and upload the task then anyway, even if the project application isn't completely done with it?

If the application reports it is at 100% but is still not done, there's nothing BOINC can do but wait for the application to report that it is really finished. As for why BOINC will stop running it then, the reason is the same as why some tasks get stopped at 99.99%: BOINC can't possibly know how long the remaining 0.01% will take. This could be seconds, or it could be hours or days.

So it's waiting for a sort of end-of-file message from the application. "I'm done now, you can go do what you do with this result. I'm out of here."
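
To make that distinction concrete, here is a toy model (not BOINC's actual code) in which the progress fraction and the "I'm done" signal are tracked separately. The `Task` class, its field names and the "Ready to upload" label are invented for this sketch; only "Running" and "Waiting to run" are status texts mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical, simplified view of a task as the client sees it."""
    fraction_done: float     # progress reported by the application, 0.0 to 1.0
    finished: bool = False   # True only after the application says it is done
    scheduled: bool = True   # False while the client is running other tasks


def status(task: Task) -> str:
    # The progress fraction is deliberately ignored here:
    # only the explicit "finished" signal ends the task.
    if task.finished:
        return "Ready to upload"
    if not task.scheduled:
        return "Waiting to run"
    return "Running"


stuck = Task(fraction_done=1.0, finished=False, scheduled=False)
print(status(stuck))  # Waiting to run
```

In this model a task at 100% that was switched out before signalling completion simply stays in "Waiting to run" until it is scheduled again.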
ID: 26148
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 26150 - Posted: 20 Jul 2009, 16:57:11 UTC - in response to Message 26149.  
Last modified: 20 Jul 2009, 17:00:40 UTC

Having now seen this on a third project, I am rather taken aback that Aurora Borealis writes it happens at many projects... sort of BOINC-wide. Calls for a solution, I'd think.

Probably because most of the projects use the same sort of code/compiler for their applications. Tell it to the projects.

(e.g. if it's something in Autodock that does this, you'll find this behaviour on WCG, Rosetta, Hydrogen, DrugDiscovery, etc. since they all use the same base Autodock code).

Sort of curious: I just tried your 6.6.76 test build and this Docking job jumped into High Priority mode. Why would that be, when it is due on July 31?

Because there's a fix in the estimated run time code. You'll probably see this on one task for every project, just the same as you would see it if you had shut down the client for a week and then continued.
ID: 26150
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 26152 - Posted: 20 Jul 2009, 17:13:27 UTC - in response to Message 26151.  

Wonder what happened to the BOINC framework and wrapper concept to make things easier? This ping pong of it's the projects, no it's BOINC, is rather tiring.

You expect a solution to the problem in BOINC, I explained it isn't a BOINC problem. You still expect a solution in BOINC.

I don't mind explaining to you what happens, as long as you stop pointing at BOINC being the culprit. All BOINC does is start the application, then it waits for the application to finish, before uploading the result file to the project.

The application will keep BOINC apprised of its progress, be it directly or through the wrapper file (which by the way doesn't show the real percentage, but only an approximation of it). Every 30 seconds BOINC and the application will exchange a heartbeat signal, just to check if the application is still alive.

When the application thinks it is done with all of the file and it has cleaned up after itself, it will tell BOINC it has done so. Only at that point will BOINC try to upload the result file.

Now I will grant you that there is a slight possibility of a glitch between BOINC Manager (the GUI) and what the application is reporting. Percentages are eventually rounded up, so if the application says it is at 99.9995% (four decimal places), it may show as 100% in the GUI. But this doesn't automatically mean that the application is completely finished.
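
The rounding effect is easy to reproduce. The following is only an illustration of how a display with limited precision can show 100% for an unfinished task; `displayed_progress` and its two-decimal default are assumptions for the example, not BOINC Manager's real formatting code.

```python
def displayed_progress(fraction_done: float, decimals: int = 2) -> str:
    """Format a 0..1 progress fraction as a percentage string.

    Hypothetical helper; the number of decimals is an assumption.
    """
    return f"{fraction_done * 100:.{decimals}f}%"


print(displayed_progress(0.999995))  # 100.00%, although the task is not done
print(displayed_progress(0.9994))    # 99.94%
```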

Simple, ey? :-)
ID: 26152
Aurora Borealis
Joined: 8 Jan 06
Posts: 448
Canada
Message 26160 - Posted: 20 Jul 2009, 18:00:24 UTC
Last modified: 20 Jul 2009, 18:08:09 UTC

I didn't make a note of the WCG WU # when it happened, which is part of the reason why I didn't report it. I don't visit that site much and find navigating it somewhat cumbersome. I post mostly on the SETI site.

I'm guessing it was the WU that shows an error.
Result Log

Result Name: CMD2_ 0017-MYH1.clustersOccur-MYH2A.clustersOccur_ 866_ 838043_ 838215_ 2--

<core_client_version>6.6.31</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish
called boinc_finish

</stderr_txt>
]]>

The result time is 7.77, where other WUs are in the 3.xx range. I do know it was at 100% and waiting to run before I restarted BOINC. Out of personal curiosity I let it run to completion, and it indicated 200% when it finished. I had previously aborted a WU that also showed the odd behavior of not wanting to complete.

I usually like to document things better (I didn't have the time) if I am to report a bug in a project's software. In my post I was only using it as an example of an unfinished WU sitting at 100% and waiting to run.

Boinc V 7.4.36
Win7 i5 3.33G 4GB NVidia 470
ID: 26160
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 26162 - Posted: 20 Jul 2009, 18:17:06 UTC - in response to Message 26158.  

.. is not what we like to have hanging around

What you don't want hanging around is irrelevant. BOINC will eventually return to the task and finish it, won't it? There's nothing you need to do about that, just let BOINC do its job.

So all that the BOINC developers now need to do is reprogram the user base and add patience + 10. Or perhaps better patience * 10. As it eventually all boils down to the user not having the patience to see that task sitting there. And so you fuss about it.

Something similar happens at Seti, with all those people continuously clicking Update, Retry upload and Do network communication, just to get their work uploaded through an already overloaded connection onto an already overloaded Seti server. If they just went about their business and did not check every 5 seconds whether things were going the way they wanted, there would be fewer threads on the same subject, less fuss and fewer heated discussions over what Seti should do ASAP or "I am leaving."

So take on the advice, just close BM to the system tray and continue with whatever else you were doing. Waaaaaaay better for your mental health. :-)
ID: 26162
crystalsys

Joined: 29 Jul 09
Posts: 6
United States
Message 26328 - Posted: 29 Jul 2009, 13:10:35 UTC
Last modified: 29 Jul 2009, 13:12:40 UTC

This one is less serious from a getting-the-work-done standpoint, but seems kind of odd. The task completes, appears from the messages tab to have been uploaded. But it persists in the task list as 'Ready to report' until the project is manually updated. Then the message tab shows that the task was reported.

This is happening for all projects, not just the one shown below.

I don't think this was happening prior to the update to 36, but I'm not 100% sure.

7/29/2009 6:02:13 AM Einstein@Home Computation for task h1_0811.20_S5R4__124_S5R5a_1 finished
7/29/2009 6:02:16 AM Einstein@Home Started upload of h1_0811.20_S5R4__124_S5R5a_1_0
7/29/2009 6:02:23 AM Einstein@Home Finished upload of h1_0811.20_S5R4__124_S5R5a_1_0


7/29/2009 8:44:01 AM Einstein@Home update requested by user
7/29/2009 8:44:20 AM Einstein@Home Sending scheduler request: Requested by user.
7/29/2009 8:44:20 AM Einstein@Home Reporting 1 completed tasks, not requesting new tasks
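
For anyone who wants to measure that gap rather than eyeball it, here is a small sketch that parses the timestamps from two of the log lines above. The timestamp format string is inferred from these lines only and assumes an English locale for AM/PM; `parse_log_time` is a made-up helper name.

```python
from datetime import datetime

# Format inferred from the log lines in this post (assumption)
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_log_time(line: str) -> datetime:
    """Parse the leading 'M/D/YYYY H:MM:SS AM' timestamp of a log line."""
    date_part, time_part, meridiem = line.split()[:3]
    return datetime.strptime(f"{date_part} {time_part} {meridiem}", LOG_TIME_FORMAT)


uploaded = "7/29/2009 6:02:23 AM Einstein@Home Finished upload of h1_0811.20_S5R4__124_S5R5a_1_0"
reported = "7/29/2009 8:44:20 AM Einstein@Home Reporting 1 completed tasks, not requesting new tasks"

gap = parse_log_time(reported) - parse_log_time(uploaded)
print(gap)  # 2:41:57
```

So in this case the task sat as 'Ready to report' for a bit under three hours before the manual update.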

I'm not surprised it isn't requesting a new task, it already has one.

Any ideas?
ID: 26328
Profile Gundolf Jahn

Joined: 20 Dec 07
Posts: 1069
Germany
Message 26332 - Posted: 29 Jul 2009, 13:51:25 UTC - in response to Message 26328.  

I don't think this was happening prior to the update to 36, but I'm not 100% sure.

I am 100% sure that it has happened before the update, since it is how BOINC always works. I have versions 5.8.16 and 5.10.45 running with that behaviour and I've seen it since switching from classic SETI to BOINC.

Regards,
Gundolf
ID: 26332
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 26333 - Posted: 29 Jul 2009, 13:53:26 UTC - in response to Message 26328.  
Last modified: 29 Jul 2009, 13:58:55 UTC

Uploading and reporting are two different processes.

Uploading happens immediately at the end of a task and is a straight copy of a data file from your computer's hard drive to the server's hard drive.

Reporting uses the database and tells it which task you finished and with what outcome. Reporting multiple tasks at once takes less overhead than reporting one task at a time. This is why BOINC will usually wait 24 hours after the first task is ready to report before it tries to report it, so that it can report more than one task at a time.

Completed work is reported at the first of:

1) 24 hours before deadline
2) Connect Every X before deadline.
3) Immediately if it is later than either 1 or 2 upon completion of the task.
4) 24 hours after completion.
5) On a trickle up message (CPDN only, I believe).
6) On a trickle down request.
7) On a server scheduled connection. Used, but I am not certain by which project.
8) On a request for new work.
9) When the user pushes the update button.
10) On an account manager request. (BAM!, GridRepublic)
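
The time-based rules (1 to 4) in the list above can be sketched as a small calculation. This is only a reading of the list, not the client's real scheduler code; the function name and the example deadline are hypothetical, and the event-driven triggers (5 to 10) are not modelled, so the real report can happen earlier than the time computed here.

```python
from datetime import datetime, timedelta


def latest_automatic_report(completed: datetime, deadline: datetime,
                            connect_every: timedelta) -> datetime:
    """Latest time a finished task is reported under rules 1-4 (sketch)."""
    candidates = [
        deadline - timedelta(hours=24),   # rule 1: 24 hours before deadline
        deadline - connect_every,         # rule 2: "connect every X" before deadline
        completed + timedelta(hours=24),  # rule 4: 24 hours after completion
    ]
    # Rule 3: if the earliest trigger had already passed when the task
    # completed, the report goes out immediately, i.e. at completion time.
    return max(completed, min(candidates))


done = datetime(2009, 7, 29, 6, 2, 23)
due = datetime(2009, 8, 12, 6, 0, 0)   # made-up deadline for the example
print(latest_automatic_report(done, due, timedelta(days=0.1)))
# 2009-07-30 06:02:23
```

With a deadline two weeks away, rule 4 wins, which matches a task sitting as 'Ready to report' for up to a day unless something else triggers a scheduler contact first.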

So, now you have that answer twice. I already saw the same thing earlier in your Seti thread.

BOINC works on a debt basis. CPU time spent on one project is paid back equally to the other projects you are attached to. So it can happen that the debt for one project becomes so large that BOINC will not download new work for that project until its debt has been equalized with that of the other projects you are attached to.
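
A heavily simplified sketch of that bookkeeping, assuming each project has a resource share and that the balance is "CPU time entitled minus CPU time used". This is not the client's real algorithm; the function names, the sign convention (a project that got more than its share ends up negative) and the numbers are all assumptions for illustration.

```python
def update_debts(debts: dict, shares: dict, cpu_used: dict) -> dict:
    """Return new per-project balances after one accounting period (sketch)."""
    total_used = sum(cpu_used.values())
    total_share = sum(shares.values())
    new_debts = {}
    for project, share in shares.items():
        entitled = total_used * share / total_share
        used = cpu_used.get(project, 0.0)
        # Positive: the project is owed CPU time. Negative: it got too much.
        new_debts[project] = debts.get(project, 0.0) + entitled - used
    return new_debts


def projects_allowed_new_work(debts: dict) -> list:
    """Projects that have not used more than their share (sketch)."""
    return sorted(p for p, balance in debts.items() if balance >= 0)


shares = {"Einstein@Home": 100, "SETI@home": 100}
cpu_used = {"Einstein@Home": 7200.0, "SETI@home": 3600.0}  # seconds, made up
debts = update_debts({}, shares, cpu_used)
print(debts)                             # Einstein@Home: -1800.0, SETI@home: 1800.0
print(projects_allowed_new_work(debts))  # ['SETI@home']
```

In this toy example Einstein@Home has had an hour more than SETI@home at equal shares, so it is held back until the other project has caught up.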
ID: 26333
crystalsys

Joined: 29 Jul 09
Posts: 6
United States
Message 26343 - Posted: 29 Jul 2009, 18:42:19 UTC - in response to Message 26333.  
Last modified: 29 Jul 2009, 18:44:25 UTC

Sorry - when I went back looking for an answer to the Seti post, I couldn't find it.
ID: 26343

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.