Invalid tasks, how best to raise an alert.

Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 64644 - Posted: 3 Oct 2015, 4:09:29 UTC
Last modified: 3 Oct 2015, 4:10:26 UTC

I am looking for an elegant way to generate or log an alert (such as by email) when a task becomes invalid, on the host which generated the task.

The local log files do not record when such an event occurs, and unless you look regularly on the project site these can slip under the radar.

Using Ubuntu, I was thinking of something like running wget daily against the "invalid" web page for each project, but wondered if there was a better way (an API, some utility, etc.).
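
A daily check of this kind needs to remember which invalid tasks it has already reported, so that only new ones raise an alert. The following is a minimal sketch in Python, not a tested tool. It assumes some other step has already produced the list of task names currently shown as invalid; the function names, the JSON state file and the addresses are all hypothetical.

```python
import json
from email.message import EmailMessage
from pathlib import Path


def new_invalid_tasks(current, state_file):
    """Return task names not seen on an earlier run, then update the saved state."""
    path = Path(state_file)
    # The state file is a plain JSON list of task names already reported.
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = sorted(set(current) - seen)
    path.write_text(json.dumps(sorted(seen | set(current))))
    return fresh


def build_alert(host, tasks, sender, recipient):
    """Compose (but do not send) an email listing the newly invalid tasks."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(tasks)} new invalid task(s) on {host}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(tasks))
    return msg
```

The message returned by `build_alert` could then be passed to `smtplib.SMTP(...).send_message(msg)` from a cron job; which mail server to use depends on how the host is set up.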

Thank you...
ID: 64644

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 64666 - Posted: 4 Oct 2015, 13:41:31 UTC - in response to Message 64644.  

But wouldn't that only be of use when a host is invalidating all work thrown at it? The one or two erroneous tasks that any host can return at any time are something that the quorum and "max # of error/total/success tasks" take care of.

The project should keep an eye on the hosts working for it, and notify the owner of a host when it runs rampant. But then the only project I know of that does this is Climateprediction.net. They'll make sure this host won't get work again, and then the administrator will email the user to say that host XandY is returning only garbage and that it's excluded from getting work until its owner fixes the problem and mails back.

Apparently not all project administrators are that much involved in their own project, or interested enough to try and fix this.

Other than the above method, I don't know of any.
ID: 64666

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 64693 - Posted: 5 Oct 2015, 9:46:22 UTC - in response to Message 64666.  

WCG has sequential valid / invalid host tracking. Eventually the trickle reduces to a single task per 24 hours for the failing project or projects, i.e. if the quota is down to 1 and that task fails, it takes 24 hours before the host gets another. A valid result doubles the number, up to a daily maximum cap; I think it's 35 per core ATM. I presume these limits are set using standard scheduler features, i.e. any project can use these functions and need not be forced into a server-side panic: a pile-up of repair jobs that are queued solely for known reliable hosts while everybody else runs dry.
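
As a rough illustration of the behaviour described above, the quota rule could be written as follows. The doubling on a valid result, the floor of 1 and the cap of 35 come from the description; the step down by one per failed result is an assumption, since nothing above says how quickly the quota shrinks.

```python
def update_quota(quota, valid, cap=35):
    """Return the daily quota after one reported result.

    Assumption: a failed result lowers the quota by one (never below 1).
    A valid result doubles it, up to `cap`.
    """
    if valid:
        return min(cap, quota * 2)
    return max(1, quota - 1)


def trace(results, start=35, cap=35):
    """Apply update_quota to a sequence of True/False results and record each step."""
    quota = start
    history = []
    for valid in results:
        quota = update_quota(quota, valid, cap)
        history.append(quota)
    return history


# Three failures followed by two valid results, starting from a quota of 3.
print(trace([False, False, False, True, True], start=3))  # [2, 1, 1, 2, 4]
```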

A good tracker is BOINCTasks. History will show bad results [those failing on host with an immediate error], listed in red. Those that the project validator rejects and reports as invalid do not feed a signal back to the client, just the log saying that the daily budget is being torqued off, and this possibly generating a Notice, but cannot recollect a single post of this ever being seen at any project. Maybe if this generates a specific code [You have received maximum for day of 1], BOINC could be programmed in a generalized form to pop up a 'needs attention' notice. But how does that work when the GUI is not loaded / or running headless?
Coelum Non Animum Mutant, Qui Trans Mare Currunt
ID: 64693

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 64696 - Posted: 5 Oct 2015, 10:33:05 UTC - in response to Message 64693.  

Automated email perhaps? But that requires a verified email address, a little thingy none of the projects require (or does WCG?).

But even a rogue host at 1 task a day can run rampant, especially when it runs tasks of various lengths, some of which do end correctly. For each correctly ended task the quota doubles, which means it can download more work to go wrong on. This wastes project bandwidth and time, because all that work has to be resent to another computer.

And what if that other host is also on a downwards streak?
ID: 64696

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 64698 - Posted: 5 Oct 2015, 11:17:38 UTC - in response to Message 64696.  
Last modified: 5 Oct 2015, 11:19:33 UTC

The cut off / tracking is at app level i.e. having one doing good does not increment the allowed for another. There's no designed result size variation under one app at WCG.

WCG seems to test emails, occasionally see messages such as 'cannot be verified' when a perfectly fine address is used. Hotmail is one of those, possibly because they see it as a form of spamming or attacking from one IP.

AND, there's options at WCG for notices of the various categories which includes an option '...and contribution issues.'. But never seen one or heard of members getting an actual warning... only when their contribution stops altogether for longer than 3 weeks.

[ot] BTW, it used to be when replying and not yet logged in, you would be redirected to the sign in screen. Since a number of months, you get a blank screen with a big print http error, and that's it.[/ot]
Coelum Non Animum Mutant, Qui Trans Mare Currunt
ID: 64698

Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 64711 - Posted: 5 Oct 2015, 22:50:34 UTC - in response to Message 64698.  
Last modified: 5 Oct 2015, 22:58:30 UTC

Thanks Jord and SekeRob2

As you both have said, invalids are a very significant drain on resources, especially big tasks which run to completion and get uploaded, after which the process needs a third host to get involved, etc.

In the case in question, I suspect I may be just starting to hit a GPU stress limit, but I have no way of "looking back" for invalids. For example: were they occurring at a hot time of day, on a particular GPU, app, host, etc.?

I was hoping for something like

http://<project_url>/results.php?hostid=<host_id>&offset=0&show_names=1&state=4&appid=0&format=xml

The &format=xml parameter is ignored by results.php but works for some other PHP scripts.

I haven't managed to get wget past the login screen, but that is probably a matter of feeding wget the right cookie.
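
For what it's worth, the same request could be sketched in Python instead of wget. This is only an outline under assumptions: the query parameters are copied from the URL above (without format=xml, since results.php ignores it), the function names are made up, and the cookie string is assumed to be copied from a browser session that is already logged in to the project. The cookie's name and value are project-specific and not shown here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def results_url(project_url, host_id, state=4):
    """Build the results.php URL for one host, using the parameters from the post."""
    query = urlencode({
        "hostid": host_id,
        "offset": 0,
        "show_names": 1,
        "state": state,
        "appid": 0,
    })
    return f"{project_url.rstrip('/')}/results.php?{query}"


def fetch_results_page(url, cookie_header):
    """Fetch the page while presenting a session cookie from a logged-in browser."""
    request = Request(url, headers={"Cookie": cookie_header})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned HTML would still have to be parsed for task names, which is the fragile part of this approach and the reason an XML or RPC interface would be preferable.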


Thanks again. If I get it working later in the week, I'll post back.
ID: 64711

Oliver Bock
Joined: 26 Jun 13
Posts: 8
Germany
Message 64909 - Posted: 16 Oct 2015, 7:39:48 UTC - in response to Message 64666.  
Last modified: 16 Oct 2015, 7:40:03 UTC


The project should keep an eye on the hosts working for it, and notify the owner of a host when it runs rampage. But then the only project I know who does this is Climateprediction.net. They'll make sure this host won't get work again and then its administrator will email the user telling that host XandY is returning only garbage and that it's exempt from getting work until its owner fixes the problem and mails back.

Apparently not all project administrators are that much involved in their own project, or interested enough to try and fix this,



Erm, all BOINC projects do the first part automatically. If a host has issues its daily quota is being reduced automatically by BOINC, preventing it from becoming a task black hole. But yes, we don't email each host owner.

This is not a question of being sufficiently involved with one's project but a matter of (usually public sector) resources, in particular for projects with a large number of volunteers (hosts).


JM2C,
Oliver
Einstein@Home Project
ID: 64909

Oliver Bock
Joined: 26 Jun 13
Posts: 8
Germany
Message 64910 - Posted: 16 Oct 2015, 7:47:07 UTC - in response to Message 64644.  
Last modified: 16 Oct 2015, 7:59:23 UTC


I was hoping for something like

http://<project_url>/results.php?hostid=<host_id>&offset=0&show_names=1&state=4&appid=0&format=xml

the &format=xml - is ignored (for results.php) but works for some php scripts.


What you describe sounds like a general use case which means you shouldn't design a rather convoluted workaround. What you should do instead is to contact the devs (boinc_dev mailing list) and ask for such a feature to be added to the BOINC web code or RPCs. I'm willing to support that and chime in on the discussion...

Oliver
Einstein@Home Project
ID: 64910

Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 64913 - Posted: 16 Oct 2015, 19:10:00 UTC - in response to Message 64910.  

Thanks Oliver

I will look at the RPCs, and then craft a proposal.
ID: 64913

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 64914 - Posted: 16 Oct 2015, 19:21:41 UTC - in response to Message 64909.  

If a host has issues its daily quota is being reduced automatically by BOINC, preventing it from becoming a task black hole.

Which is fine on a one core host, or a host that runs with only one GPU and one application sort. But not a multicore, multi-GPU host capable of running multiple different applications.

Because on such a host the one app that it has troubles with is counteracted by the apps that it has no trouble with. So its quota will stay high, but it runs through the work of the one application as if there's no tomorrow, only returning errors.

This can be solved by having a quota per application (something Einstein e.g. doesn't have yet), so the host stops asking for work for that application it has trouble with, because there the Max tasks per day is 1. For all the applications it doesn't have trouble with the Max is high.
ID: 64914

Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 64916 - Posted: 17 Oct 2015, 1:13:31 UTC - in response to Message 64913.  

I will look at the RPCs, and then craft a proposal.


Which I have now done.
ID: 64916

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 64925 - Posted: 17 Oct 2015, 15:04:03 UTC - in response to Message 64914.  

If a host has issues its daily quota is being reduced automatically by BOINC, preventing it from becoming a task black hole.

Which is fine on a one core host, or a host that runs with only one GPU and one application sort. But not a multicore, multi-GPU host capable of running multiple different applications.

Because on such a host the one app that it has troubles with is counteracted by the apps that it has no trouble with. So its quota will stay high, but it runs through the work of the one application as if there's no tomorrow, only returning errors.

This can be solved by having a quota per application (something Einstein e.g. doesn't have yet), so the host stops asking for work for that application it has trouble with, because there the Max tasks per day is 1. For all the applications it doesn't have trouble with the Max is high.

As I wrote earlier in the thread, this quota reduction is at app level! Fine working apps will -not- increase the quota of work-units for the failing apps.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
ID: 64925

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 64926 - Posted: 17 Oct 2015, 15:31:26 UTC - in response to Message 64925.  

This can be solved by having a quota per application (something Einstein e.g. doesn't have yet), so the host stops asking for work for that application it has trouble with, because there the Max tasks per day is 1. For all the applications it doesn't have trouble with the Max is high.

As I wrote earlier in the thread, this quota reduction is at app level! Fine working apps will -not- increase the quota of work-units for the failing apps.

AgentB is primarily an Einstein volunteer, so the bracketed exception applies to him.

There is also scepticism at projects running the newer forms of the server code as to how effective the 'per application' quota reduction code really is in practice.
ID: 64926

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 64927 - Posted: 17 Oct 2015, 16:12:48 UTC - in response to Message 64925.  

As I wrote earlier in the thread, this quota reduction is at app level! Fine working apps will -not- increase the quota of work-units for the failing apps.

As said, there are projects out there that use the older server back end where all applications use one quota together.
Einstein@Home has 4 applications (Binary Radio Pulsar Search (Arecibo), Binary Radio Pulsar Search (Arecibo, GPU), Binary Radio Pulsar Search (Parkes PMPS XT), Gamma-ray pulsar search #4) and only one quota (Maximum daily WU quota per CPU 32/day).

Let's assume a host running all four of these applications. When the host returns only errors for the "Gamma-ray pulsar search #4" application, the quota goes down, but because the "Binary Radio Pulsar Search (Arecibo)" application returns good work, each of its tasks doubles the quota again. The "Binary Radio Pulsar Search (Arecibo, GPU)" and "Binary Radio Pulsar Search (Parkes PMPS XT)" applications also return good work, so these three applications together can keep the one quota high enough for the one application returning only garbage to continue unhindered.

That continues until either the system's user finds that the failing application has been giving trouble for the past several days, weeks or months, or another user PMs him to tell him about it. The latter, of course, only works if the user hasn't hidden his computers and thus isn't running as an anonymous user.
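
The difference between one shared quota and a quota per application can be shown with a small simulation. It is only a sketch: the cap of 32 is the Einstein@Home figure quoted above, the doubling on success follows the description, and the decrease by one per error is an assumption. The short application labels are illustrative.

```python
def simulate(results, per_app, cap=32, floor=1):
    """Replay (app, valid) results and return the final quota per key.

    With per_app=False every application reads and writes one shared quota.
    Assumption: an error lowers the quota by one; a valid result doubles it.
    """
    quotas = {}
    for app, valid in results:
        key = app if per_app else "shared"
        quota = quotas.get(key, cap)
        quotas[key] = min(cap, quota * 2) if valid else max(floor, quota - 1)
    return quotas


# One failing application interleaved with three healthy ones, 40 rounds.
day = [("gamma", False), ("arecibo", True),
       ("arecibo_gpu", True), ("parkes", True)] * 40

print(simulate(day, per_app=False))  # {'shared': 32}
print(simulate(day, per_app=True)["gamma"])  # 1
```

With the shared quota, each error is immediately undone by the next valid result, so the failing application is never throttled. With per-application quotas it drops to 1 while the others stay at the cap.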
ID: 64927

Agentb
Joined: 30 May 15
Posts: 265
United Kingdom
Message 65063 - Posted: 23 Oct 2015, 21:22:07 UTC - in response to Message 64916.  

I will look at the RPCs, and then craft a proposal.


Which i have now done.

And an issue has been raised on the BOINC GitHub.

Thanks Oliver and David for helping get it there!
ID: 65063


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.