Task exited with zero status but no 'finished' file... (almost) all of them

Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47332 - Posted: 17 Jan 2013, 4:39:58 UTC - in response to Message 47324.  

WTF?


+1


Where's the repository for the stderr of an ongoing task? I want to inspect one to see if this 30s discrepancy between the time BOINC calls a task dead and the time a task calls BOINC dead exists on other tasks.

Anyway I'm filing this in the alpha mailing list if nobody here finds this normal.
ID: 47332 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47334 - Posted: 17 Jan 2013, 5:05:34 UTC

And thanks for the detailed explanation Ageless.
ID: 47334 · Report as offensive
BilBg
Avatar

Send message
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 47337 - Posted: 17 Jan 2013, 6:22:47 UTC - in response to Message 47332.  
Last modified: 17 Jan 2013, 6:31:30 UTC


Can you monitor your computer clock?

E.g. by some script that logs the current time every 10 s?
The problem is that the interval (10 s) has to be measured by something other than the current time.
So if you use some tool to pause for 10 s, check what happens when you manually adjust the computer clock by a few minutes in either direction (is the pause really 10 s, or does it end immediately when the clock changes?)
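
For anyone who wants something cross-platform, here is a rough Python sketch of that kind of monitor. It compares how far the wall clock (time.time) moved with how far the monotonic clock (time.monotonic) moved over each interval. The names clock_jump and watch, the 10 s interval and the 1 s tolerance are assumptions made up for this illustration, nothing BOINC-specific.

```python
import time


def clock_jump(prev_wall, prev_mono, now_wall, now_mono, tolerance=1.0):
    """Return the wall-clock jump in seconds, or 0.0 if it is within tolerance.

    Positive: the wall clock moved forward more than the monotonic clock.
    Negative: the wall clock was moved back.
    """
    drift = (now_wall - prev_wall) - (now_mono - prev_mono)
    return drift if abs(drift) > tolerance else 0.0


def watch(interval=10.0, samples=3):
    """Print a timestamp every `interval` seconds and flag any clock jump."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        jump = clock_jump(prev_wall, prev_mono, now_wall, now_mono)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now_wall))
        note = f"  <-- clock jumped {jump:+.1f} s" if jump else ""
        print(stamp + note)
        prev_wall, prev_mono = now_wall, now_mono
```

Calling watch(samples=1000) would leave it running for a few hours; redirect the output to a file to keep a log.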


If the computer clock goes backwards by more than 30 s, the apps and BOINC think there is no 'heartbeat'





- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 47337 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47339 - Posted: 17 Jan 2013, 6:31:09 UTC
Last modified: 17 Jan 2013, 6:40:36 UTC

I thought it's the clock going forward that does that?

I'd be happy to run such a script if anybody can supply one :S

Anyway, if there are methods to track time independently of the system clock (and of course there are--the High Precision Event Timer is just the latest in a whole series of kernel timers, old and new, but all with millisecond precision or much better) wouldn't it make more sense to track time using one of these instead of the system clock for something as critical as the continued survival of a science app?
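
To make the question concrete, here is a minimal sketch of a heartbeat timeout that reads a monotonic clock instead of the system clock. It is written in Python only because time.monotonic is documented as unaffected by system clock updates; the class name HeartbeatWatchdog and its methods are hypothetical, and this says nothing about how BOINC itself is implemented.

```python
import time


class HeartbeatWatchdog:
    """Track the time since the last heartbeat using an injectable clock."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self._clock = clock          # monotonic by default, so clock changes are ignored
        self._timeout = timeout
        self._last = clock()

    def beat(self):
        """Record that a heartbeat has just arrived."""
        self._last = self._clock()

    def expired(self):
        """Return True if no heartbeat arrived within the timeout."""
        return self._clock() - self._last > self._timeout
```

Passing the clock in as a parameter also makes the timeout easy to test with a fake clock.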

And if these exit with zero errors occur at the drop of a hat, shouldn't there be a way to ignore them completely, or at least remove the 100 error limit for long-running apps?
ID: 47339 · Report as offensive
BilBg
Avatar

Send message
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 47340 - Posted: 17 Jan 2013, 6:48:56 UTC - in response to Message 47339.  


Copy this to Notepad, save, rename to .bat

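@ECHO OFF
REM Logs the current date and time roughly every 10 seconds
Echo *** Use Ctrl+C to stop the script ***
Echo === New log === >> _LogCurrentTime.txt

:LOOP
 REM Show the timestamp on screen and append it to the log file
 Echo  %date%  - %time%  
 Echo  %date%  - %time% >> _LogCurrentTime.txt
 REM Wait about 10 s: ping an unreachable address once with a 10000 ms timeout
 Ping 0.0.0.1 -n 1 -w 10000 >NUL
GOTO LOOP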


See what happens if you change the clock manually (exit BOINC first)





- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 47340 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47342 - Posted: 17 Jan 2013, 7:14:10 UTC - in response to Message 47340.  

Thanks. It's not exactly running with millisecond precision, but it correctly logged the time skip when I moved the time forward by 1 minute. I'll leave this running and see what it says the next time one of the exit with zeroes happens.
ID: 47342 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47343 - Posted: 17 Jan 2013, 8:50:04 UTC - in response to Message 47339.  

I thought it's the clock going forward that does that?


I think when BilBG said "if the clock goes backwards by more than 30 seconds" he meant "if the clock falls behind by more than 30 seconds"? Dunno, I don't want to put words in his mouth :-)

Anyway, if there are methods to track time independently of the system clock (and of course there are--the High Precision Event Timer is just the latest in a whole series of kernel timers, old and new, but all with millisecond precision or much better) wouldn't it make more sense to track time using one of these instead of the system clock for something as critical as the continued survival of a science app?


That would work on Windows but not Linux or OSX so they won't do that. Remember, it has to be cross-platform compatible.

And if these exit with zero errors occur at the drop of a hat, shouldn't there be a way to ignore them completely, or at least remove the 100 error limit for long-running apps?


Why? It works perfectly for everybody except you. If you fix the cause of the problem, you won't need the exception you're asking for.
ID: 47343 · Report as offensive
BilBg
Avatar

Send message
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 47348 - Posted: 17 Jan 2013, 13:38:21 UTC - in response to Message 47343.  

I think when BilBG said "if the clock goes backwards by more than 30 seconds" he meant "if the clock falls behind by more than 30 seconds"? Dunno, I don't want to put words in his mouth :-)

I mean e.g.:
Set the clock manually 2 minutes in the past (backwards) (e.g. from 11:22:00 to 11:20:00)

Now look in BOINC Manager - the 'Elapsed' time of all Running tasks will stall (the same unchanging value will be displayed for 2 minutes)
Look in Windows Task Manager or Process Explorer - the apps/tasks will still run

Wait 30 seconds - the apps (processes) will exit (disappear from processes)
Look again in BOINC Manager - the tasks that actually exited will still display 'Running'

Now wait for the 2 minutes to pass
At that moment BOINC will restart the apps/tasks
Look in messages and stderr.txt - they will say 'no heartbeat for 30 s'
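
A toy model of how a sequence like this could arise, written in Python. It is purely a guess for illustration and is not taken from the BOINC source: it assumes the sender schedules its next heartbeat as a wall-clock deadline, while the app counts real seconds of silence. The function name simulate and the sample numbers are made up.

```python
def simulate(wall_readings, timeout=30):
    """Model one heartbeat sender and one app, one step per real second.

    wall_readings[i] is the wall-clock value seen at real second i.
    Returns (real second the app gives up, real second heartbeats resume).
    """
    next_send = wall_readings[0]   # wall-clock deadline for the next heartbeat
    silent = 0                     # real seconds since the app last saw a heartbeat
    app_exit = None
    resumed = None
    for real_second, wall in enumerate(wall_readings):
        if wall >= next_send:
            next_send = wall + 1   # schedule the next heartbeat one second later
            silent = 0
            if app_exit is not None and resumed is None:
                resumed = real_second
        else:
            silent += 1
            if silent >= timeout and app_exit is None:
                app_exit = real_second
    return app_exit, resumed


# Clock runs normally for 10 s, then is set back by 120 s at real second 10.
readings = [1000 + s for s in range(10)] + [890 + s for s in range(140)]
print(simulate(readings))   # (39, 130)
```

In this model the last heartbeat goes out at real second 9, the app gives up 30 s later at second 39, and heartbeats only resume at second 130, once the wall clock has caught up with the old deadline 120 s after the change.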


I found and posted this bug ~2007 (on SETI forums) and I think it still exists (not fixed, maybe it's a 'feature' ;) )
(I'm using BOINC 6.10.58; can you confirm this for 6.12.34 and 7.0.28?)


P.S.
@kdsjsdj:
Are you Dagorath familiar from other sites?:
http://asteroidsathome.net/boinc/show_user.php?userid=2247





- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 47348 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47349 - Posted: 17 Jan 2013, 14:01:39 UTC - in response to Message 47348.  

@kdsjsdj:
Are you Dagorath familiar from other sites?:
http://asteroidsathome.net/boinc/show_user.php?userid=2247


Noooo.

ID: 47349 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 47352 - Posted: 17 Jan 2013, 15:52:53 UTC - in response to Message 47348.  

Be careful messing around with the system clock - far worse things than an 'exit 0' can happen.

Have a look at trac ticket #588 - from 2008 and BOINC v5.10.45, but still open (although I think they put in fixes for some of the grosser errors since then).
ID: 47352 · Report as offensive
BilBg
Avatar

Send message
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 47354 - Posted: 17 Jan 2013, 15:58:16 UTC - in response to Message 47349.  

@kdsjsdj:
Are you Dagorath familiar from other sites?:
http://asteroidsathome.net/boinc/show_user.php?userid=2247


Noooo.


Are you sure ;)

Evidence:
http://boinc.berkeley.edu/dev/forum_thread.php?id=8102&postid=47088#47088

kdsjsdj (Posted: 7 Jan 2013, 1:14:35 UTC):
"And I removed the phrase in the wiki suggesting ncpus = 0 can be used to configure for GPU only."

Dagorath edit (Revision as of 00:50, 7 January 2013 - removal of "to e.g. only allow GPU computing")
http://boinc.berkeley.edu/w/?title=Client_configuration&action=historysubmit&diff=3742&oldid=3728

:)





- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 47354 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47355 - Posted: 17 Jan 2013, 16:23:40 UTC

Just witnessed another of these events happening right in front of my eyes.

Bil's script didn't log any system clock misbehaviour.

What I saw was that the "time elapsed" for running apps froze up, although BOINC Manager itself continued to function.

My hunch is that BOINC.EXE, but not boincmgr.exe, was being starved of CPU cycles for some reason.

Although I said I'd raise boinc.exe to realtime priority yesterday, I didn't, because I wanted to see whether the other changes I made made the difference first.

So now there are two things I think I can try: one, set boinc.exe to realtime priority; two, set processor scheduling to "background services" instead of "programs".

One other significant thing I haven't tried yet is the suggestion to pause BOINC depending on "non-BOINC" processor load. Awhile back I found BOINC see-sawing between running and pausing every 10 seconds or so with a 25% non-BOINC load threshold for suspending, and setting it to 0 (no restriction) didn't seem to impact system performance at the time, so that's where I've set it. Am I way off base here?

Will report back with findings...
ID: 47355 · Report as offensive
BilBg
Avatar

Send message
Joined: 18 Jun 10
Posts: 73
Bulgaria
Message 47356 - Posted: 17 Jan 2013, 17:01:06 UTC - in response to Message 47355.  
Last modified: 17 Jan 2013, 17:06:33 UTC

set boinc.exe to realtime priority

'realtime' is too much (and dangerous, may hang the entire system)
Try 'high' or 'above normal' instead if you want to do this experiment.
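
If you would rather script the change than use Task Manager, one option is wmic's setpriority call, on Windows versions that still ship wmic. The Python sketch below only builds the command line to paste into a command prompt; the helper name priority_command is made up, the numbers are the standard Windows priority class values, and 'realtime' is refused on purpose.

```python
# Standard Windows priority class values accepted by "wmic process call setpriority"
PRIORITY_CLASSES = {
    "idle": 64,
    "below normal": 16384,
    "normal": 32,
    "above normal": 32768,
    "high": 128,
    "realtime": 256,
}


def priority_command(image_name, level):
    """Return a wmic command line that sets the priority of a running process."""
    if level == "realtime":
        # Realtime can starve the rest of the system, so this sketch rejects it
        raise ValueError("refusing to build a realtime priority command")
    value = PRIORITY_CLASSES[level]
    return f'wmic process where name="{image_name}" call setpriority {value}'


print(priority_command("boinc.exe", "above normal"))
```

The change made this way only lasts until the process exits, so it would have to be repeated after each restart of boinc.exe.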

Did you see high HDD load when the problem happened?
What antivirus and firewall do you use?





- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 47356 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47361 - Posted: 18 Jan 2013, 1:41:29 UTC - in response to Message 47356.  

HDD load: I don't think so, but then I wasn't paying attention to that.
Antivirus: Avast! (with exceptions for boinc program and data directory set for the realtime scan)
Firewall: windows' bundled (with exceptions set for the wrong directory until this morning :oops: )
ID: 47361 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47364 - Posted: 18 Jan 2013, 3:16:31 UTC - in response to Message 47354.  

@kdsjsdj:
Are you Dagorath familiar from other sites?:
http://asteroidsathome.net/boinc/show_user.php?userid=2247


Noooo.


Are you sure ;)


Aha! Now we know who actually reads what I write. Or at least isn't afraid to speak what is on his/her mind. Yes, I made the edit but I am not Dagorath. I know because I see her every day and I am quite certain that I am not her. She lives/works down the hall. I asked for her wiki password and she gave it to me, muttered something about not needing it anymore. We both live/work at BAT.

ID: 47364 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47365 - Posted: 18 Jan 2013, 3:38:53 UTC - in response to Message 47355.  
Last modified: 18 Jan 2013, 3:39:51 UTC

What I saw was that the "time elapsed" for running apps froze up, although BOINC Manager itself continued to function.

My hunch is that BOINC.EXE, but not boincmgr.exe, was being starved of CPU cycles for some reason.


You are probably correct. The manager does not track elapsed time directly, that is the client's (boinc.exe's) job. The manager simply gets the numbers from the client and displays them in the gui.

What happens when you let BOINC run without your music player running (sorry, I forget the name)? Does the problem happen then too?

One other significant thing I haven't tried yet is the suggestion to pause BOINC depending on "non-BOINC" processor load. Awhile back I found BOINC see-sawing between running and pausing every 10 seconds or so with a 25% non-BOINC load threshold for suspending, and setting it to 0 (no restriction) didn't seem to impact system performance at the time, so that's where I've set it. Am I way off base here?


Why do you torture yourself? YOU have it running there, not US. Try different settings, (10, 20, 30...99). Just do it and never mind asking us for permission or recommendation or whatever it is you seem to think you need. It won't blow up and destroy your house, kill your children or anything else bad. At worst it will not crunch, not a big deal. Too much thinking/talking beating around the bush; too little doing, testing and verifying.
ID: 47365 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47366 - Posted: 18 Jan 2013, 4:21:44 UTC - in response to Message 47365.  
Last modified: 18 Jan 2013, 4:50:53 UTC


Why do you torture yourself? YOU have it running there, not US. Try different settings, (10, 20, 30...99). Just do it and never mind asking us for permission or recommendation or whatever it is you seem to think you need. It won't blow up and destroy your house, kill your children or anything else bad. At worst it will not crunch, not a big deal. Too much thinking/talking beating around the bush; too little doing, testing and verifying.


I am trying things... one at a time. Last things I tried were Bil's clock tracking script and the processor scheduling thing I mentioned. Neither made any difference. If I put every change proposed in the stew at the same time, then supposing it actually worked, I'd still not know the root cause in my case right?

It seems the sky's the limit as far as the number of potential things to try to remedy these mysterious zero errors so I do want some expert input on prioritizing which Chinese herbal treatments to try next, yeah. ;) (just kidding)

Setting the processor load threshold back to 25% for now. If these errors keep up (but they don't seem to be impacting production what with the mix of tasks I have running atm, mostly 10-20min WUs which seem to continue running just fine after one of these errors--which is why I'm taking the one-at-a-time troubleshooting approach for now) I'll try exiting my audio app.
ID: 47366 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47368 - Posted: 18 Jan 2013, 8:55:11 UTC - in response to Message 47366.  

so I do want some expert input on prioritizing which Chinese herbal treatments to try next, yeah. ;) (just kidding)


Oh, don't offend the Chinese! Eastern style meditation helps :-)
It's your rig, you're the boss, sorry for getting on your case but I would try the simple things first, that's all.

Setting the processor load threshold back to 25% for now.


Why not 35%? Or 50%? Or 80%? The 25 and 0 aren't the only allowable numbers, 25 just happens to be the default and the 0 is the ignore flag. Try other values and see what works for YOU. Also consider the "exclusive app" mechanism... it works like a hot damn for people in your situation.
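
To see why the value matters, here is a deliberately simplified Python model of a suspend threshold. It assumes the decision is just "suspend whenever the sampled non-BOINC load is above the threshold", with 0 meaning no restriction; the real client's logic may well differ, and the function names and load samples are invented for the example.

```python
def suspend_states(non_boinc_load, threshold):
    """Return True (suspended) or False (running) for each load sample in percent."""
    if threshold == 0:
        # 0 is treated as "no restriction": never suspend
        return [False] * len(non_boinc_load)
    return [load > threshold for load in non_boinc_load]


def flips(states):
    """Count how often the state changes between consecutive samples."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


samples = [22, 27, 24, 28, 23, 26]   # non-BOINC load hovering around 25%
for threshold in (25, 50, 0):
    print(threshold, flips(suspend_states(samples, threshold)))
```

With these samples a threshold of 25 flips between running and suspended on every sample, while 50 or 0 never suspends, which is the see-sawing described earlier in the thread.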

If you think there's a bug in the code then dig out the code, pore through it and probe and tickle with what you learn from the code in mind. It seems there is a bug, maybe, but banging away at it without knowing what's in the code is pointless, IMHO. But it's your rig, your life, do with it as you will.

If these errors keep up (but they don't seem to be impacting production what with the mix of tasks I have running atm


Been there done that, it's a familiar story. Tomorrow you'll have a different mix of tasks and the problems will return. We've all been through that one and we don't want to deprive you of the pleasure so carry on carrying on :-)

ID: 47368 · Report as offensive
Joe Bloggs

Send message
Joined: 6 Jan 13
Posts: 40
Hong Kong
Message 47371 - Posted: 18 Jan 2013, 10:13:02 UTC

What I meant was that I am still getting these errors (more of them today in fact) but these short apps all make their way to completion regardless. And if I could understand the code I would be coding boinc instead of just running it ;)
ID: 47371 · Report as offensive
kdsjsdj

Send message
Joined: 5 Jan 13
Posts: 81
Message 47376 - Posted: 18 Jan 2013, 11:19:51 UTC - in response to Message 47371.  

Lol! Well don't let that stop you, it doesn't stop the BOINC devs ;-)

ID: 47376 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.