Inaccurate "time left"

robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94705 - Posted: 5 Jan 2020, 16:24:46 UTC

This is really a question for the folks at Einstein, as they provide the data that BOINC uses to "guess" at the run-time for a given model.
ID: 94705
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 94706 - Posted: 5 Jan 2020, 16:28:38 UTC - in response to Message 94704.  

This is a well-known consequence of a decision taken by the Einstein project in 2010 (yes, ten years ago).

Before 2010, the central BOINC code (and hence every project) kept track of task speed estimates locally, through a single per-project variable called the Duration Correction Factor (DCF). But after GPU computing was introduced in 2008, it quickly became clear that the single-value approach couldn't cope with multiple applications and multiple device speeds - just as you are now reporting.

Code for handling multiple DCF values was developed by a BOINC volunteer, but it was rejected centrally and replaced instead by server-based code incorporating a new system for calculating credit and for runtime estimation.

Einstein decided not to adopt the new credit system, and in the process rejected the runtime estimation system too. They never bothered to develop an alternative.
ID: 94706
Gary Roberts

Joined: 7 Sep 05
Posts: 130
Australia
Message 94733 - Posted: 7 Jan 2020, 10:37:42 UTC - in response to Message 94712.  

.... I had cynically assumed this was to encourage people to flock to Einstein to get their global stats up :-)
You obviously didn't think the situation through at all :-).

Richard has succinctly summarised the situation very well, but he has left out one important detail - Locality Scheduling (LS) :-). This is the current description. As I remember it, there used to be a lot more about it. Notice that it's very brief these days and suggests there are two versions, Limited and Standard. Whilst there's some description about Limited, what exists for Standard is just a "highly project-specific version ... used by Einstein@home."

My understanding is that LS was developed by Bruce Allen and David Anderson in the very early days (circa 2004/5 - maybe even earlier) as a means of making the Einstein project viable in the first place. Bruce is no slouch as a programmer: he is the author of smartmontools, the package used for monitoring the health of (and predicting potential failures in) hard disks via their SMART data. These days he's just the Director of the AEI, which runs the E@H project :-).

E@H kicked off in early 2005, and LS allowed volunteers to cope with downloading only a very limited subset of the data, split up into lots of discrete frequency bins, so that individual hosts could be sent many consecutive 'discrete frequency' tasks that were all based on the same small subset of large data files. In other words, the bandwidth needed by both the project and the volunteers could be effectively managed and minimised. The problem is that no other BOINC project seems to need this (I don't really know of any), so it's not really surprising that a more general version has never seen the light of day.
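
As a toy illustration of the idea (my own sketch in Python, not Einstein's actual server code, and the task and file names are invented), the scheduler simply prefers tasks whose large data file a host already holds, and only forces a fresh download when nothing matching is left:

# Toy sketch of locality scheduling; hypothetical names, not E@H server code.
# Each task needs one large data file. Preferring tasks that match files the
# host already holds keeps big downloads - and hence bandwidth - to a minimum.

def pick_task(unsent_tasks, host_files):
    """unsent_tasks: list of (task_id, data_file); host_files: set of names."""
    for task_id, data_file in unsent_tasks:
        if data_file in host_files:
            return task_id, data_file          # no download needed
    # Nothing matches: fall back to any task and accept the big download.
    return unsent_tasks[0] if unsent_tasks else None

host = {"h1_0450.20"}                          # data file already on this host
queue = [("t1", "h1_0462.85"), ("t2", "h1_0450.20"), ("t3", "h1_0450.20")]
print(pick_task(queue, host))                  # -> ('t2', 'h1_0450.20')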

It's also not surprising that E@H Devs have put a lot of time and effort into getting their specialised version to work the way they need it to. They did a lot of development and tuning work in the years between 2005 and 2010. They were then faced with the option of changing to new server code and porting all their modifications to that new code base or staying with the "devil they knew". They would have considered the amount of work needed to keep porting that special code every time a new version of the server code came out. I remember seeing comments from Bernd at the time to the effect that porting to the new code on a continuing basis was simply untenable and they needed to stay with what they had.

I'm not a programmer, so I make no comment about that. DCF has basically worked OK for many years since that time. GPU apps started appearing around 2011, and there really hasn't been much of a problem with the swings generated by DCF changes, because the estimates built into the workunits used to keep the DCF relatively close to 1. Until fairly recently, that is. The estimate for the gamma-ray pulsar (GRP) GPU work was always a bit too high, so the DCF would always settle below 1. When paired with CPU work of any description, it didn't seem to matter much - as long as the work cache size wasn't too high. There would be a few more CPU tasks than could be crunched in the configured time, but still within the deadline for even a 3 or 4 day work cache size. The much faster-finishing GPU tasks would quickly counter any slow CPU tasks that pushed the DCF higher, towards or even above 1.

Issues started with the GW GPU app. The estimate is quite wrong, but the real problem is that it's wrong in completely the opposite direction to the GRP app's. There is effectively something like an order of magnitude difference between a stable DCF for GRP tasks and the one required for GW tasks. I've seen DCF values for GW tasks in the 4 to 6 range. The Devs must surely be seeing this. The really puzzling bit is that there seems to be no attempt being made to reduce this mismatch. Even if both estimates were a bit wrong in the same general direction, that would be a significant improvement on what currently exists. Maybe it's just a matter of too many irons in the fire and too few people to tend to them. Lots of users complaining might get some attention.
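
To see why a single shared DCF can never settle under those conditions, here's a toy simulation in Python (the numbers are invented, and I'm assuming the classic client rule of jumping up to the full error immediately and easing down by 10% of the error per task):

# Toy simulation: one shared DCF serving two apps whose built-in estimates
# err in opposite directions. All figures invented for illustration.

def update_dcf(dcf, estimate, actual):
    ratio = actual / estimate            # >1 means the task overran its estimate
    if ratio > dcf:
        return ratio                     # jump up to the full error at once
    return dcf + 0.1 * (ratio - dcf)     # ease down by 10% of the error

dcf = 1.0
history = [("GW", 1000, 5000), ("GW", 1000, 5000),    # GW needs a DCF near 5
           ("GRP", 1000, 500), ("GRP", 1000, 500),    # GRP needs a DCF near 0.5
           ("GW", 1000, 5000)]
for app, est, act in history:
    dcf = update_dcf(dcf, est, act)
    print(f"after a {app} task: DCF = {dcf:.2f}")
# Every GW task snaps the DCF back up to ~5; GRP tasks then slowly drag it
# down again, so neither app's estimate is ever right for long.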

So back to the relevance of the original quote to which this message is responding ;-). If the cynical view were correct, i.e., E@H is buying extra participation with credits, then they are doing it in a rather stupid fashion. If you look at the split of tasks you get with both GPU searches enabled, I'm sure you would see more of the lower-credit GW tasks than the higher-credit GRP tasks. The reason for that bias (my opinion) is that the project wants to process the GW data as quickly as possible, since the whole point of the project (for many long years) has been to detect continuous GW. They do that by getting the scheduler to preferentially send GW tasks when preferences allow it to. If the project were trying to 'buy' participation, the GW tasks would be worth much more credit, and there would be no need to tweak the scheduler to prefer GW tasks.

And finally, shouldn't you be making the complaint from your original message at Einstein and perhaps trying to get other voices to join the chorus? The Einstein Devs aren't likely to notice it here.
Cheers,
Gary.
ID: 94733
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 94734 - Posted: 7 Jan 2020, 12:15:58 UTC

And one extra snafu is that under WINE, BOINC somehow gets a much lower number for the crunching capacity of your CPU, so estimates on all projects will be a lot higher than reality.
ID: 94734
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 94735 - Posted: 7 Jan 2020, 12:39:05 UTC - in response to Message 94734.  

And one extra snafu is that under WINE, BOINC somehow gets a much lower number for the crunching capacity of your CPU, so estimates on all projects will be a lot higher than reality.
Under DCF, that wouldn't be a problem. DCF adjusts, relatively quickly, to the actual measured runtime of your real tasks - since CPDN isn't planning to use GPUs any time soon, you might ask them to turn off the '<dont_use_dcf/>' flag in their scheduler replies. (*)

That cannot work on GPU projects, because there is no fixed ratio of the speeds of the CPU and the GPU in any given host. Mass-market manufacturers use relatively decent CPUs, but pair them with cheap and slow bottom-end GPUs. Enthusiast home builders use a basic CPU, but throw all their cash into the best GPU money can buy.

* They'll have to hack the server code. It should be possible to set that via a Project configuration option, but it appears to be missing.
ID: 94735
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 94736 - Posted: 7 Jan 2020, 13:16:22 UTC - in response to Message 94735.  

Under DCF, that wouldn't be a problem. DCF adjusts, relatively quickly, to the actual measured runtime of your real tasks...

It does adjust. I think the problem is that it gets screwed up by task types that are nominally the same, and so appear identical to the client, but which cover different areas, have different resolutions and so on; they mess each other up, making the adjustment take a lot longer than it should.
ID: 94736
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 94737 - Posted: 7 Jan 2020, 14:05:09 UTC - in response to Message 94736.  

DCF adjusts downwards by 10% of the error per task, and upwards by the full amount immediately. The modern server-based runtime estimation requires of the order of 100 completed tasks to normalise, in either direction.
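
In code terms, the two behaviours look something like this (a sketch of the rules as just described, not the actual client or server source; the per-task averaging weight of 0.01 is my assumption to match the ~100-task figure):

# Sketch of the two correction schemes; illustrative only.

def dcf_update(dcf, ratio):
    """Old client-side DCF: up by the full error at once, down by 10% per task."""
    return ratio if ratio > dcf else dcf + 0.1 * (ratio - dcf)

def server_update(avg, ratio, weight=0.01):
    """New server-side estimation: a slow running average, needing on the
    order of 100 completed tasks to normalise in either direction."""
    return avg + weight * (ratio - avg)

dcf = avg = 1.0
for _ in range(100):        # suppose every task runs at twice its estimate
    dcf = dcf_update(dcf, 2.0)
    avg = server_update(avg, 2.0)
print(f"DCF was correct (2.0) after a single task; server average after 100: {avg:.2f}")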
ID: 94737
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 94749 - Posted: 7 Jan 2020, 19:27:49 UTC - in response to Message 94746.  

Before I pester the obviously busy Einstein folk, can I get this straight? There is only one DCF (for Einstein as they use the older coding) which my client adjusts itself every time it runs a task. But it's based on an estimate of runtime given by the server. Therefore they could just change that estimate for one of the applications to make it closer? Or does it depend a lot on the GPU in use - somebody else might have completely different timings for the two applications?
The project server doesn't actually calculate the runtime estimate you see in the BOINC Manager.

The project admins say how much work needs to be done to complete the task:
<rsc_fpops_est>17500000000000.000000</rsc_fpops_est> for BRP4 tasks
<rsc_fpops_est>525000000000000.000000</rsc_fpops_est> for FGRPB1G tasks
Those figures are intrinsic to the job itself, and will be the same for every computer.

The server also keeps track of your computer's speed for each type of application:
<flops>16474211923.891272</flops> for BRP4
<flops>134145015228.871060</flops> for FGRPB1G

Dividing one by the other - in your client - gives (Floating Point Operations) divided by (Floating Point Operations per second), or seconds: about 1,062 seconds per task for BRP4, and 3,913 seconds per task for FGRPB1G on this machine. To which a DCF of 4.5 is applied.

All of which is completely fubar'd. That DCF is derived from FGRPB1G tasks, and their estimate is about right (just under 5 hours). But BRP4 tasks are estimated at 80 minutes, when they actually finish in nearer 10 minutes.
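
Put as arithmetic (a few lines of Python reproducing the figures above):

# Estimate shown by the client: rsc_fpops_est / flops, multiplied by the DCF.
tasks = {
    "BRP4":    (17.5e12,  16474211923.891272),
    "FGRPB1G": (525.0e12, 134145015228.871060),
}
dcf = 4.5
for app, (fpops_est, flops) in tasks.items():
    raw = fpops_est / flops                  # server's speed-based estimate, seconds
    print(f"{app}: {raw:,.0f} s raw, {raw * dcf / 3600:.1f} h shown with DCF {dcf}")
# BRP4:    ~1,062 s raw -> ~1.3 h shown, but the tasks really finish in ~10 minutes.
# FGRPB1G: ~3,914 s raw -> ~4.9 h shown, which is about right.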

I'm not sure exactly where Einstein gets the different speed estimates for the two applications from: both are running on the same hardware. You may be able to see some of the workings next time I request new work, at https://einsteinathome.org/host/1001562/log.
ID: 94749
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 94770 - Posted: 8 Jan 2020, 12:44:06 UTC - in response to Message 94768.  

Perhaps most people leave Boinc on defaults.


Pretty sure that is the case based on the CPDN boards and some of the questions asked there.
ID: 94770
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 867
United States
Message 94772 - Posted: 8 Jan 2020, 19:16:09 UTC - in response to Message 94771.  

The Gamma Ray GPU tasks run for about the same time as the Gravity Wave GPU tasks on the same hardware. Maybe GRP is a few tens of seconds faster on average. Both applications take 10-15 minutes on average with GTX 1070 Ti cards.
FGRPB1G application https://einsteinathome.org/host/12600970/tasks/4/40?sort=desc&order=Sent
O2MDFG2 application https://einsteinathome.org/host/12600970/tasks/4/54?sort=desc&order=Sent
ID: 94772
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 94780 - Posted: 8 Jan 2020, 22:17:48 UTC - in response to Message 94778.  

I notice your CPU time is about the same as the task time. Mine is a lot lower. Have you checked your CPU isn't throttling the GPU?
Peter has mentioned an RX 560 GPU (AMD?); Keith a GTX 1070 Ti (NV). The two manufacturers have supplied very different programming and runtime support environments. CPU usage is one of the big differences.
ID: 94780

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.