Scheduler Concerns

Bill
Joined: 13 Jun 17
Posts: 91
United States
Message 93277 - Posted: 23 Oct 2019, 1:17:21 UTC

Basically, I'm curious how the Boinc scheduler is set up. My one computer crunches 24/7. It normally crunches Seti exclusively, but when I installed version 7.16.3 for beta testing, I thought I would crunch a few other projects that I have run in the past (Milkyway and Einstein). I have encountered two symptoms that have caused a significant number of tasks to be cancelled.

1. Milkyway: The N-body tasks (operating on one core) typically take significantly longer to crunch than their ETA suggests. Because of this, tasks run longer, and with a typical two-week deadline, a lot of tasks end up being cancelled when the deadline passes.

2. Einstein: I'm not sure what happened here, but I just had 300+ tasks error out due to the deadline. I know some of this was because Milkyway was hogging time, but I have been essentially crunching E@H tasks 24/7 for more than a week. I do recall that there were a good amount of E@H tasks that were cancelled a few weeks ago due to the M@W deadline (they had similar deadlines), and these were re-downloaded even though I had NNT set for E@H.

I have not had the time to sit down and document all of my steps to fully understand what is happening here. I feel that I have run Boinc during this time period as "set it and forget it". I was not suspending projects (at least, not until I saw that deadlines might not be met, and then only to help focus on the tasks that needed to be done first). I was running the computer 24/7, and I am pretty sure I have had the storage setting at 2 days or less. So, with task deadlines being 10-14 days out from when the task was downloaded, I am confused why a plethora of tasks have been abandoned for two projects.

I have brought this up on the Milkyway boards, but the only response I have gotten was to reduce the number of tasks stored. Although that may be a way to deal with the symptom, I feel this does not cure the disease. Ultimately, I am concerned about the casual Boinc user experience. CPUs are starting to have more and more cores. I suspect that casual Boinc users could focus on one project, but I am willing to bet there are users who assign several projects to one computer. If their computer is cancelling and abandoning tasks because they expire, I could see this being a deterrent to the casual user.

Is there some further investigation that needs to be done? Perhaps there has been more detailed discussion on GitHub, but I don't have the time to browse through all the issues there. So, apologies if I'm asking something that has been brought up and debated several times before.
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 93285 - Posted: 23 Oct 2019, 10:15:23 UTC - in response to Message 93277.  

It's true that there are scheduler changes in the new v7.16.3 client, but they are mainly to do with deciding which task(s) to run, not directly to do with work fetch - although they do have side effects on work fetch.

I'd make two comments on the issues you describe.

1) Milkyway - over-running initial runtime estimates, so there's no time left for other tasks to run.
There are two causes for this. One is bad initial estimation of the computational load of the task. That's down to the project, and I have to say that my impression is that Milkyway in particular (but other projects as well) pays too little attention to getting these details right. The second cause is that BOINC is particularly bad at estimation in the initial stages of running a new project and/or a new application version. The estimates will eventually become a lot more accurate, but it takes a long time - too long, in my view. Again, the adjustments are made on the server, not under the control of the client.

2) Einstein - re-sending lost tasks, even though NNT is set.
This is a known and deliberately intended policy, but it should only apply to genuinely "lost" tasks. These are tasks unknown in your current client cache, but still marked as 'in progress' on the project web site. If you had tasks cancelled because of over-running deadlines, they should have been marked as 'aborted' on the website and not been eligible for re-sending. If you can document any sample case of a task being cancelled and reported to the project, but still being resent as a 'lost' task, please report it here or directly to Einstein.

Having said that, I concur with the advice you've received: when joining a new project, it's always wise to reduce your cache size initially until you get a feel for task performance and deadline. Also, be aware that initially resource share considerations will mean that your old projects have had more than their fair share of computer resources, and the new projects will appear to hog the machine until they've caught up. That's normal for BOINC.
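
To put rough numbers on the cache-size advice, here is a small sketch. It is a simplified model, not BOINC's actual work-fetch or scheduling code: it assumes the cache is filled according to the estimated runtime per task, that all tasks share the same deadline, and that the project gets a fixed share of the cores. The function name and every figure are hypothetical.

```python
def tasks_missing_deadline(cache_days, est_hours, actual_hours,
                           deadline_days, cores=1, share=1.0):
    """Return (tasks fetched, tasks finishing late) in a simplified model.

    cache_days    -- work buffer setting, in days of *estimated* work per core
    est_hours     -- runtime estimate per task, in hours
    actual_hours  -- real runtime per task, in hours
    deadline_days -- days between download and deadline
    cores         -- cores available to the project
    share         -- fraction of those cores the project really gets (0..1]
    """
    # The cache is filled according to the estimate, not the real runtime.
    fetched = int(cache_days * 24 * cores / est_hours)
    # Real throughput in tasks per day, given actual runtime and share.
    per_day = cores * share * 24 / actual_hours
    finished_in_time = int(per_day * deadline_days)
    return fetched, max(0, fetched - finished_in_time)


# Hypothetical case: 2-day cache, 1 h estimate, 10 h real runtime,
# 14-day deadline, and the project only gets half of one core.
fetched, late = tasks_missing_deadline(2, 1, 10, 14, cores=1, share=0.5)
print(f"fetched {fetched} tasks, {late} would miss the deadline")
```

In this made-up case the model fetches 48 tasks and 32 of them finish late, even though the cache was "only" 2 days. With an accurate estimate (1 h estimated, 1 h actual) none are late, which is why a small cache is the safer choice until the estimates settle down.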
Bill
Joined: 13 Jun 17
Posts: 91
United States
Message 93291 - Posted: 23 Oct 2019, 22:28:13 UTC - in response to Message 93285.  

I forgot that detail: the tasks for Einstein were lost because I renamed my boinc directory for 7.14.2 to test 7.16.3. They were casualties that were picked up later, so that may have something to do with it.

I see what you are saying about the new projects catching up with the old one. So, if I have a computer crunching one project exclusively for nearly a year, does that mean the new projects have to crunch a lot before more tasks of the old project are downloaded?
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 93292 - Posted: 23 Oct 2019, 22:43:32 UTC - in response to Message 93285.  

Richard, you should take a look at this thread at Einstein: https://einsteinathome.org/content/scheduler-bug-work-requests-o2md1-work-staff-please-read

And read the posts from Zalster about his problems with the scheduler sending "lost tasks".

On another topic mentioned . . . debt between projects . . . doesn't changing the <rec_half_life_days>10.000000</rec_half_life_days> in cc_config.xml to <rec_half_life_days>1.000000</rec_half_life_days> somewhat ameliorate the issue of balancing the credit debt between projects?

And especially with projects like Einstein that are still using very old server software with outdated application processing rate algorithms, setting a very low cache level of only a couple of tenths of a day mostly fixes that project's problem of sending way too much work that can't be finished before the normal 14-day deadline for tasks.
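
A rough sketch of why a shorter half-life would help. It assumes the weight the client gives to a project's past work behaves like a plain exponential decay with the half-life set by rec_half_life_days. That is a simplified model with made-up time spans, not the client's actual code, and the function name is hypothetical.

```python
def remaining_fraction(days_elapsed, half_life_days):
    """Fraction of a project's accumulated weight left after days_elapsed,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (days_elapsed / half_life_days)


# Compare the default (10 days) with the shortened setting (1 day).
for half_life in (10.0, 1.0):
    for days in (1, 5, 10, 30):
        frac = remaining_fraction(days, half_life)
        print(f"half-life {half_life:>4} d, after {days:>2} d: "
              f"{frac:.4%} of the old weight remains")
```

Under this model the history built up by the old project loses half its weight every 10 days at the default setting, but every single day with the value set to 1, so a newly added project would stop "catching up" much sooner.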
Bill
Joined: 13 Jun 17
Posts: 91
United States
Message 93294 - Posted: 24 Oct 2019, 2:07:26 UTC - in response to Message 93292.  

And especially with projects like Einstein that are still using very old server software with outdated application processing rate algorithms, then setting a very low cache level to only a couple of tenths of a day mostly fixes that projects problem of sending way too much work that can't be finished before the normal 14 day deadlines for tasks.
So this kind of gets to my point. Einstein is running in a way that negatively affects either itself (cancelling tasks because it sends too many at once) or other projects (downloading so many tasks that no other project can send any). A casual user is not aware of the rec_half_life_days option in cc_config.xml. If they are unaware of this switch and see that only one or two projects are completing tasks when they have signed up for five, what motivates them to stay signed up for so many projects? I mean no disrespect to Einstein or any other projects, but if the projects can't police themselves, doesn't Boinc have this responsibility?
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 93295 - Posted: 24 Oct 2019, 6:25:58 UTC - in response to Message 93294.  

but if the projects can't police themselves, doesn't Boinc have this responsibility?

I'm sorry, but it doesn't. All that BOINC is is an infrastructure on which to build whatever project you want to run. It is up to the project managers to figure out how to implement their science applications and run them on whatever server hardware they can scrounge up. If they don't want to use the latest server software, then that is their choice. BOINC does not have any policing authority, nor does it want any. If you have a problem with a project, all you can do is inform the project managers that you have a problem. It is up to them whether they respond to your issue and fix it. If the problem lies with the underlying BOINC server software, then the BOINC developers are pretty darn good at fixing that pretty quickly, especially for volunteer developers.

But the task completion estimation issue and the oversupply of work by Einstein are solely because the project has made the decision to stick with older software and not to implement the current application processing rate algorithm that most projects use in the current server software. The project manager is aware of the issue and has decided not to do anything about it. So the typical Einstein project volunteer has to adapt to the vagaries of the project. Most of the people I know who run Einstein run it by itself on individual computers. They know that it does not play nicely with other projects and make allowances for that fact.
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 93297 - Posted: 24 Oct 2019, 7:01:16 UTC

I would also add to Keith's comments about why BOINC can't do anything about the actions of projects - BOINC is Open Source, which means you, Keith, myself, or anyone else can modify the source to suit our needs, desires or wishes and, as a project, BOINC has no control over those modifications.
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2518
United Kingdom
Message 93306 - Posted: 24 Oct 2019, 16:04:14 UTC

The other thing that I suspect affects choices in this is the amount of work involved in upgrading to the latest server code and the skill set of those carrying out the work. I know it took Andy at CPDN quite a while to do the change over and there were a lot of other pressures on his time. I suspect that long term it would pay the projects still using outdated code to do the swap but there would likely be a lot of complaints in their fora before everything was sorted out.

I am not making a judgement call one way or the other on those projects that are still on old code. CPDN was for a long time and there are still a couple of niggles with some of the changes. Just saying that there will be reasons for not swapping even if we disagree with them.
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 93307 - Posted: 24 Oct 2019, 17:59:39 UTC - in response to Message 93306.  

Well, Dr. Bruce Allen is the main scientist at Einstein and is also one of the main principals of BOINC. He certainly has the capability to upgrade his server software if he chooses to do so, and is more than capable of configuring it. The main reason he refuses to upgrade is that he does not want to abandon the project's ability to set its own credit award for work completed. He would lose that ability, since the current BOINC server software uses the CreditNew award mechanism, which awards significantly less credit for the work done and does not in any way obey the original definition of a cobblestone.
https://en.wikipedia.org/wiki/BOINC_Credit_System

Many projects award credit that is disproportionate to the actual FLOPS required to process a task. But that high credit award draws volunteers to the high-paying projects who are credit mongers and don't have any love per se for the actual science they are producing. They just want the high credits to be at the top of the TOP RAC lists. I can appreciate the logic in their decision if the project is one that doesn't have any innate built-in popularity, like most of the math projects. I can't work up any enthusiasm for most of them. I really don't need to discover the umpteenth prime number or factor or whatever. I am drawn to the physical sciences like physics, astronomy and biomedicine. So I crunch for those types of projects. If I solely wanted to be a credit monger, I would be crunching PrimeGrid or whatever.
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15478
Netherlands
Message 93309 - Posted: 24 Oct 2019, 18:23:03 UTC - in response to Message 93306.  

I suspect that long term it would pay the projects still using outdated code to do the swap but there would likely be a lot of complaints in their fora before everything was sorted out.
It's not just outdated software, it's also running on outdated hardware. Yes, you can upgrade to the latest BOINC server, but that doesn't necessarily mean you can run it and its required Linux version on your server. Not all projects have the resources (money and manpower) to change all that.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.