Why doesn't Boinc schedule earlier deadlines first?

Message boards : Questions and problems : Why doesn't Boinc schedule earlier deadlines first?
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 94953 - Posted: 13 Jan 2020, 23:49:26 UTC - in response to Message 94951.  

First point:
The data needed to start the next section doesn't get created until the end.
Climate models are complex.

Second point:
Too many people wanting their own ideas implemented, and too few developers for phones?
ID: 94953
Nick Name

Joined: 14 Aug 19
Posts: 55
United States
Message 94959 - Posted: 14 Jan 2020, 0:20:18 UTC - in response to Message 94944.  

I'd like a setting in cc_config to force jobs to run to completion once they start.

You can do that by changing the "switch between tasks every" parameter. I change from the default of 60 minutes to 360 minutes so that a GPUGrid job runs to completion and never exits or suspends. The application can't handle restarting on a different device in a mixed type gpu configuration and the switch parameter is how I get around the issue.

Thanks, I'm aware of that setting, and while it sometimes works that way, it's not quite the same. As Richard said, "'Switch between tasks' is permissive, not directive [emphasis mine] - BOINC may switch tasks after this minimum interval, but doesn't have to." I would say it's a guide, not a rule, and that's why I'd like the cc_config option. I've seen this setting ignored many times over the years, even with the parameter set to values far above application run times. When I checked, it was set to 240; I'm not sure whether that's the default or whether I changed it and forgot. Brada run times are usually around three hours at most on this host, so 240 minutes is more than enough to keep them running to completion. If this setting worked as expected, that job should never have been suspended, especially under deadline pressure, but it was. I increased it to 720 to roughly match my cache setting, but in my experience setting values this high hasn't had the desired effect.
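
To make the "permissive, not directive" distinction concrete, here is a minimal toy sketch. It is not BOINC's code: both function names and the `urgent_task_waiting` flag are hypothetical, and the idea that an urgent task can override the interval is an assumption used only to illustrate why a large interval need not guarantee run-to-completion.

```python
def permissive_may_switch(minutes_run: float, switch_interval: float,
                          urgent_task_waiting: bool) -> bool:
    """Toy 'permissive' rule: the interval only says when switching becomes
    allowed in the normal case. Assumption: something more urgent can
    still cause a switch before the interval has elapsed."""
    if urgent_task_waiting:
        return True
    return minutes_run >= switch_interval


def directive_may_switch(task_finished: bool) -> bool:
    """Toy 'directive' rule (the wished-for option): once a task has
    started, nothing may displace it until it has finished."""
    return task_finished


# A three-hour job, 30 minutes in, with the interval set to 240 minutes.
print(permissive_may_switch(30, 240, urgent_task_waiting=False))  # False
print(permissive_may_switch(30, 240, urgent_task_waiting=True))   # True
print(directive_may_switch(task_finished=False))                  # False
```

Under the permissive rule the interval is one input among several; under the directive rule it would not be consulted at all.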
ID: 94959
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2516
United Kingdom
Message 94961 - Posted: 14 Jan 2020, 8:01:11 UTC - in response to Message 94941.  

Plus in 6 months something bad could happen to that computer, or it might not be processing Boinc any more, or it might not meet the deadline, so the project at least gets some of the data and can send the remainder of the calculations to someone else.


It would be nice if they had a mechanism to send the remainder of the calculations to someone else, but sadly if the task gets reissued it starts from scratch. The problem is that minor differences between computers mean that two computers running the same task will not necessarily return exactly the same results. This is one of the reasons each task type only goes out on one OS these days (WAH2 on Windows; HADAM3CS and HADAM4 on Linux, etc.). And the current long deadlines of over a year mean that if a computer just stops communicating with the project, the task is left in purgatory for ever.
ID: 94961
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 94963 - Posted: 14 Jan 2020, 9:00:35 UTC - in response to Message 94949.  

Keith's comment about applications trashing tasks if they start on the "wrong" processor ...
How common are computers with differing GPUs? If there's quite a few, perhaps Boinc should try to make them run on the same one again?
Keith's comment probably related to one specific application on one specific project that he and I both run. That's one specific failure by a project developer that should have known better, not a general rule. In fact, I think the project in question may have lost the ability to write their own applications (staff turnover), and bought in a replacement app not originally designed to run under BOINC (they're now using the wrapper).

Using multiple GPUs is quite common amongst enthusiasts, and they can make it work well. This project's problem is unusual - it's the first time I can remember it happening in 10 years, and the only part which is BOINC's fault is the failure to predict bad programming.
ID: 94963
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94964 - Posted: 14 Jan 2020, 9:41:17 UTC - in response to Message 94952.  

Running a torrent download is VERY, VERY, VERY, VERY much simpler than performing the sort of calculations that most projects do.
A torrent download is simply a bit string split into small segments, each with a unique identifier that includes a sequence number. Get all the bits, bolt them together in the right order, and the job's done. Most torrent tools set up an array for all the bits and populate it in sequence order as they go along, keeping a "vacant slots" array alongside so they know which segments are missing. As the download nears completion this array gets faster to search, which apparently raises the priority of the task.
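
The bookkeeping described above can be sketched in a few lines. This is a simplified illustration, not any real torrent client's code: `assemble` is a hypothetical name, and a set stands in for the "vacant slots" array.

```python
def assemble(arrivals, total_pieces):
    """Place pieces into numbered slots as they arrive, in any order,
    then join them once nothing is missing."""
    slots = [None] * total_pieces
    missing = set(range(total_pieces))  # stands in for the "vacant slots" array

    for index, data in arrivals:
        if index in missing:            # ignore duplicates of pieces we already hold
            slots[index] = data
            missing.discard(index)

    if missing:
        raise ValueError(f"still missing pieces: {sorted(missing)}")
    return b"".join(slots)


# Pieces arrive out of order, with one duplicate.
print(assemble([(2, b"c"), (0, b"a"), (2, b"c"), (1, b"b")], total_pieces=3))
```

Each piece is independent of the others, which is exactly what makes this so much simpler than a calculation where every stage depends on the one before.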
Now think about a science application: there are some pretty simple calculations, like an FFT, then some pattern matching and maybe some 2D pattern recognition, 3D shape matching, a bit of matrix rotation and inversion, result collation, etc. All of that is done on apparently random incoming data and has to return a "sensible" result. It all has to be done in the correct sequence, and the input to one stage is the output from a previous stage (or several previous stages). Changing the "priority" during such a run is a very good way of corrupting the data through loss of synchronisation between processes. Even a humble single-core application may actually be running several threads which must be correctly synchronised to get the correct answer, and one thing changing priority can do is upset the exact timing and sequencing.
ID: 94964
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94965 - Posted: 14 Jan 2020, 9:46:41 UTC - in response to Message 94957.  

First point:
The data needed to start the next section doesn't get created until the end.
Climate models are complex.


But surely if I half complete a task, someone else can get my saved data and start where I left off, just as my client could do so if I switched off my computer.



One of the ideas behind BOINC is that the project can break down the problem into small chunks of work each of which can be sent out to different people. HOWEVER there are problems with some sorts of analysis that make this very difficult, and CPDN is one of them. It is not uncommon for the intermediate data to be in a form that is not readily transposable between different computers, so CPDN have chosen the safe if long run-time solution of using one computer for one model in its entirety.
ID: 94965
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94966 - Posted: 14 Jan 2020, 10:03:03 UTC - in response to Message 94957.  

Second point:
Too many people wanting the own ideas implemented, and too few developers for phones?


I guess it depends on how much work is done by phones, so if it's worth spending time on them.


Actually there are two problems in that.
First, there are probably as many ideas about how BOINC should work as there are users of BOINC. The team currently maintaining and developing BOINC are largely volunteers; there are few of them, and their priority has to be fixing real bugs, not perceived bugs (I've a great list of the latter, and have been ticking them off for a long time).
Second, that is probably quite true. But the phone manufacturers don't help with this either. I do some work for a company that was developing a bespoke application for a client to run on one particular rugged handset that the client had a lot of. The application was in the final stages of testing before being rolled out. The client changed their main IT supplier, who replaced all the handsets with a newer, and coincidentally less rugged, model from the same manufacturer. The application doesn't work; a small but fundamental change in the operating system has stopped it working.
It's pretty much the same when one looks at the various generations of Android (the phone/tablet I'm most familiar with) - each generation has changed bits so that applications that used to work don't, and not every new version is backwards compatible :-(

Given the overall performance of phones I would be very sceptical about using them for any heavyweight computation - they are OK for short-burst activity (word processing and the like), video streaming (so long as the battery holds out!), even, dare I say it, making phone calls (so long as you have a signal - I love my engineering SIM!)
ID: 94966
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 94969 - Posted: 14 Jan 2020, 12:03:13 UTC - in response to Message 94957.  

Peter Hucker said:

But surely if I half complete a task, someone else can get my saved data and start where I left off, just as my client could do so if I switched off my computer.

When the process controlling the start and end dates of a model gets to the end date, a new process is started which collects all of the data needed to start the next section.
And THAT zip file is usually far larger than the normal data files. So much so, that the newest model produces a Restart zip hundreds of Megs in size. As a consequence, the project people have added extra code so that they can easily control whether or not to produce a Restart file.
The current models do not; they produce just a very small dummy file.

All of which is way off topic for this thread about the BOINC scheduler.
ID: 94969
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 95021 - Posted: 14 Jan 2020, 18:32:24 UTC - in response to Message 95013.  

This interests me. Surely there is only one correct answer to a mathematical calculation!! How on earth could different results be produced?
Talk to Eric Mcintosh at CERN: IEEE 754 as intended, especially 'ETH Paper on Bit-Reproducible Portable HPC Applications'.
ID: 95021
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 95024 - Posted: 14 Jan 2020, 18:42:17 UTC

Most torrent tools do not change the priority of the download on the CPU while running; they change the internal priority of one download relative to another, so one that is nearly finished gets a bit more attention than one that is just starting. This may be either a real internal change or an apparent one.

BOINC does some prioritisation, especially when it sees that a task near the back of the list won't finish in time unless it starts NOW... While called priority it is a form of scheduling.
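
A toy sketch of that idea, and of why it differs from always running the earliest deadline first. This is not BOINC's scheduler code: `Task`, `projected_misses` and `schedule` are hypothetical names, and the model assumes one CPU, tasks run in queue order, and accurately known remaining run times.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining_h: float   # estimated hours of work left
    deadline_h: float    # hours from now until the deadline


def projected_misses(queue):
    """Pretend to run the queue in its current order and collect every
    task that would finish after its deadline."""
    clock = 0.0
    late = []
    for task in queue:
        clock += task.remaining_h
        if clock > task.deadline_h:
            late.append(task)
    return late


def schedule(queue):
    """Tasks projected to miss jump the queue, earliest deadline first.
    Everything else keeps its original order."""
    late = projected_misses(queue)
    late_ids = {id(task) for task in late}
    urgent = sorted(late, key=lambda task: task.deadline_h)
    rest = [task for task in queue if id(task) not in late_ids]
    return urgent + rest


queue = [Task("A", 10, 100), Task("B", 10, 100), Task("C", 5, 12)]
print([task.name for task in schedule(queue)])  # ['C', 'A', 'B']
```

In this model a task with an early deadline is only promoted when the projection says it would otherwise be late; with no deadline pressure the queue order is left alone.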

If your new phone has an ARM processor it is a RISC processor - RISC stands for "Reduced Instruction Set Computer", and there are very many different instruction sets that are RISC sets.

GPUs are actually HSMP devices (Highly Symmetrical Multi Processors), and most are RISC processors, lots of them, all working in parallel.
ID: 95024
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 95026 - Posted: 14 Jan 2020, 18:45:31 UTC

This interests me. Surely there is only one correct answer to a mathematical calculation!! How on earth could different results be produced?
While this is true for deterministic calculations, some of the projects use non-deterministic calculations to save a lot of time - I think CPDN is one of them, and I would expect those working with protein shapes do as well. With non-deterministic calculations there is no single "correct" answer, but a range of "most probable" answers.
ID: 95026
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 95047 - Posted: 14 Jan 2020, 20:02:08 UTC

It was also researched about 2004/5 by one of the original Oxford people, who found that just the different way that different processors handled the maths, caused differences in the end results.
It was said that different computers, starting from the exact same data, produced what was effectively a different climate model. Not a massive difference, but noticeable.
I think that research paper is still in the Publications area. Somewhere.
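
The effect is easy to reproduce in miniature. The sketch below is not a climate model; it uses the logistic map, a standard textbook chaotic system, as a stand-in, and the 1e-15 offset is an assumption standing in for a rounding difference between two processors. `logistic_run` is a hypothetical name.

```python
def logistic_run(x0, steps, r=3.9):
    """Iterate x -> r * x * (1 - x), a simple chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


machine_a = logistic_run(0.3, 200)
machine_b = logistic_run(0.3 + 1e-15, 200)  # differs only around the 15th decimal place

start_gap = abs(machine_a[0] - machine_b[0])
largest_gap = max(abs(a - b) for a, b in zip(machine_a, machine_b))

print(f"gap at the start: {start_gap:.1e}")
print(f"largest gap seen: {largest_gap:.3f}")
```

Both runs are perfectly valid trajectories of the same system; they simply stop agreeing with each other step by step after a few dozen iterations.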
ID: 95047
Gary Charpentier

Joined: 23 Feb 08
Posts: 2462
United States
Message 95052 - Posted: 14 Jan 2020, 20:33:03 UTC - in response to Message 95047.  

It was also researched about 2004/5 by one of the original Oxford people, who found that just the different way that different processors handled the maths, caused differences in the end results.
It was said that different computers, starting from the exact same data, produced what was effectively a different climate model. Not a massive difference, but noticeable.
I think that research paper is still in the Publications area. Somewhere.

The class is called "Numerical Analysis" and is an undergraduate course on computer science and science paths. Ignore (skip) it at your own peril. The IEEE 754 standard attempted to put some sanity into things. Unfortunately the people who build chips do not make their chips to the standard. For instance, many 64-bit chips may have an 80-bit internal register. More accurate? Perhaps. A different result than a 64-bit register - absolutely.

Most high level language compilers have an option to force IEEE 754 floating point math. The problem is it is much slower than using the floating point on the chip. It is all the rounding and bounds checking tests that chew up processor cycles that aren't used in the chip math. But it is repeatable.
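
Two small demonstrations of the point, as a sketch. Python has no portable 80-bit type, so the second part compares 64-bit intermediates with 32-bit ones instead; that substitution is an assumption made for illustration, and `to_float32` and `sum_tenths` are hypothetical helper names.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 1. Same numbers, same precision, different order of operations.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)   # 0.6000000000000001 0.6 False


# 2. Same arithmetic, but intermediates held at different widths.
def sum_tenths(round_step):
    total = 0.0
    for _ in range(10):
        total = round_step(total + 0.1)
    return total


wide = sum_tenths(lambda x: x)      # intermediates kept at 64 bits
narrow = sum_tenths(to_float32)     # intermediates rounded to 32 bits
print(wide, narrow, wide == narrow)
```

Neither result is "wrong"; each is the correctly rounded outcome of a slightly different sequence of operations.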

This is really basic computer science and not BOINC related.
ID: 95052
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 95055 - Posted: 14 Jan 2020, 21:06:43 UTC

And as such has nothing to do with BOINC, but might have a lot to do with each and every project that uses computers to do calculations of any sort.
Remember BOINC does none of the science work, it only provides an environment in which some science may be done.
ID: 95055

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.