Surprised to find an overabundance of "High Priority" tasks

Author	Message
mitrichr Joined: 21 May 07 Posts: 349	Message 47069 - Posted: 6 Jan 2013, 4:16:25 UTC I just checked my stats on several machines. I was surprised to find most projects flat-lined since mid to late December. So, I looked at the tasks. I found an over-abundance of "High Priority" tasks. Some go for over 100 hours. This is on a number of different projects. Now, this is on three of my least capable machines: two laptops, one a hyper-threaded dual core running three threads and one a hyper-threaded quad running six threads, plus a hyper-threaded quad desktop running six threads. My supposition is that I inherited these tasks from people who shut down for the holiday period, which I did not do. I cannot see any way that all of these tasks can be successfully finished by their deadlines. Is there not something in the BOINC process which assigns work only to machines capable of doing that work? If my machines fail at some of these tasks, will that not affect the "reliability" measures on those machines for those projects? http://sciencesprings.wordpress.com http://facebook.com/sciencesprings ID: 47069 ·

kdsjsdj Joined: 5 Jan 13 Posts: 81	Message 47071 - Posted: 6 Jan 2013, 6:05:48 UTC - in response to Message 47069. If you're sure the tasks aren't going to meet deadline then the lesser of the evils is to just abort them. Don't worry about the reliability issue. It's not what you think it is and your hosts will recover any lost reliability in short order anyway. It sounds to me like the 3 hosts you mention are downclocking and that is why they are taking so long to complete tasks and that is why they're in high priority and about to miss deadline. The most likely cause for downclocking is overheating and the two most common causes of overheating are dust and failed or failing cooling fans. What to do? Fire up the diagnostic software and check temperatures and fan speeds while BOINC is running. Listen to the fans to see if you can detect a howling bearing. That's one fan failure mode: the bearings wear out, then the shaft doesn't spin freely and the fan won't rev up to a high enough speed to do its job. It makes a squealing/howling noise. Make a simple stethoscope from a piece of hose and listen to each fan. If you hold a bad fan just right, rotate the blades slowly with your fingers and concentrate, you can sometimes feel the roughness in the bearing if it's in very bad shape. The other failure mode is that the magnets in the motor become weak over time so the motor loses power. There are no audible clues to that mode; you can only use the fan tachometer in the diagnostic software. If you're not cleaning the dust out at least twice a year you're asking for trouble. I replace fans once per year whether they need it or not. I reduce the number of fans I use to a minimum through shrewd thinking and I buy top quality fans at low cost because I know where to buy them. I install 'em myself because it's easy and because I've learned how to not injure myself with a screwdriver. 
It works out to be very inexpensive and my hosts never go down just over something as silly as overheating or being at the geek shop for two days accumulating a $200 bill for $7.97 worth of parts and 10 minutes labor. Preventive maintenance: fixing stuff before it breaks and before you lose production. Maintenance: fixing stuff after it breaks and after you lose production. ID: 47071 ·

Claggy Joined: 23 Apr 07 Posts: 1112	Message 47074 - Posted: 6 Jan 2013, 12:21:11 UTC - in response to Message 47069. Last modified: 6 Jan 2013, 12:22:03 UTC [quote]I found an over-abundance of "High Priority" tasks.[/quote] What are your cache settings? (both of them) Claggy ID: 47074 ·

mitrichr Joined: 21 May 07 Posts: 349	Message 47076 - Posted: 6 Jan 2013, 15:40:21 UTC - in response to Message 47071. [quote]It sounds to me like the 3 hosts you mention are downclocking and that is why they are taking so long to complete tasks and that is why they're in high priority and about to miss deadline. The most likely cause for downclocking is overheating and the two most common causes of overheating are dust and failed or failing cooling fans. [...][/quote] Thanks very much. While heat has been a problem in the past, especially on the laptops, I have taken the necessary steps to minimize it: I vacuum the fan(s), I run TThrottle, and I use exterior fans to exhaust heat. My temps on the two laptops are in the 50s C, the desktop in the 40s C. But, the really telling thing for me, when I "Show Active Tasks", I have many tasks from my projects in the 1-5 hour range, and then, waiting to start, I see one of over 100 hours. So, I do not believe that it is my machines, I believe rather, it is something in the assignment of the tasks. I know BOINC recognizes the CPU, and both laptop CPU's designations end in the letter "M", so BOINC knows they are laptops. Thanks for the suggestions. http://sciencesprings.wordpress.com http://facebook.com/sciencesprings ID: 47076 ·

mitrichr Joined: 21 May 07 Posts: 349	Message 47077 - Posted: 6 Jan 2013, 15:41:53 UTC - in response to Message 47074. [quote]What are your cache settings? (both of them) Claggy[/quote] Sorry, what caches and where? Regrets, I am not technically proficient. Thanks http://sciencesprings.wordpress.com http://facebook.com/sciencesprings ID: 47077 ·

Claggy Joined: 23 Apr 07 Posts: 1112	Message 47078 - Posted: 6 Jan 2013, 16:22:00 UTC - in response to Message 47077. Last modified: 6 Jan 2013, 16:23:36 UTC [quote][quote]What are your cache settings? (both of them) Claggy[/quote] Sorry, what caches and where? Regrets, I am not technically proficient. Thanks[/quote] Your cache setting is how many days of work you want to keep on hand. There are two settings, and you set them either in the computing preferences at one of your projects or locally in BOINC Manager (the BOINC Manager local preferences override the web preferences). Say you put in 10 + 10 days and you have a project where the deadline for all work is three days: all the work for that project will have to run in high priority or it'll be late (otherwise it'll be 20 days before it even starts). For Setiathome the Computing Preferences are here: http://setiathome.berkeley.edu/prefs.php?subset=global (Note this is a Global Preference: you only need to set it at one project, and once BOINC contacts that project it'll then use the latest preferences.) The following should be suitable for a BOINC 7 host running lots of projects: Network usage: Maintain enough tasks to keep busy for at least 1 days (max 10 days) ... and up to an additional 0.01 days. Claggy ID: 47078 ·
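Claggy's 10 + 10 example is simple arithmetic, and a rough sketch may make it easier to see. The Python below is only an illustration, not BOINC's actual scheduling code: it assumes a single CPU, a completely full cache, and tasks that would otherwise run strictly in the order they were fetched. The function names and the 0.2-day runtime are made up for the example.

```python
def days_until_start(min_days: float, additional_days: float) -> float:
    """Worst-case wait before a newly fetched task starts, assuming the
    cache is full and tasks run in the order they arrived (an assumption)."""
    return min_days + additional_days


def needs_high_priority(deadline_days: float, runtime_days: float,
                        min_days: float, additional_days: float) -> bool:
    """True if the task would finish after its deadline when it simply
    waits its turn, so it has to jump the queue to be on time."""
    finish = days_until_start(min_days, additional_days) + runtime_days
    return finish > deadline_days


# Claggy's example: a 10 + 10 day cache and a project with 3-day deadlines.
print(needs_high_priority(3, 0.2, min_days=10, additional_days=10))   # True
# The suggested 1 + 0.01 day cache leaves plenty of room.
print(needs_high_priority(3, 0.2, min_days=1, additional_days=0.01))  # False
```

With the big cache the task would not even start for 20 days, so a 3-day deadline can only be met by running it ahead of everything else. With the small cache it starts after about a day and finishes well inside the deadline.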

kdsjsdj Joined: 5 Jan 13 Posts: 81	Message 47082 - Posted: 6 Jan 2013, 19:20:37 UTC - in response to Message 47076. [quote]But, the really telling thing for me, when I "Show Active Tasks", I have many tasks from my projects in the 1-5 hour range, and then, waiting to start, I see one of over 100 hours. So, I do not believe that it is my machines, I believe rather, it is something in the assignment of the tasks. I know BOINC recognizes the CPU, and both laptop CPU's designations end in the letter "M", so BOINC knows they are laptops.[/quote] BOINC neither knows nor cares that the CPU model numbers end with an M. It's simply regurgitating info it received from the OS. It has no idea it's running on a laptop, nor should it need to know or care. The deadline problem and high priority status are most assuredly due to the 100 hour task. BOINC attempts to cache tasks, but obviously if it caches too many or it caches tasks that take too long then your computer won't be able to meet deadlines. That's easy to understand. In order to decide how many tasks to cache, the portion of BOINC known as "the scheduler" uses info on how fast your computer can crunch and info on how long tasks take to complete. If either piece of information is inaccurate then the scheduler will make bad scheduling decisions and either cache more work than you can complete on time or else cache too little. Too little is not harmful; too much is harmful because then you miss deadlines. The scheduler uses other information too, such as how many hours per day your host runs and how many hours per day BOINC is allowed to run. All of that information is either an estimate subject to huge inaccuracies or a very volatile operating parameter. It would be nice if scheduling were not subject to questionable info, but the reality is that's the best that can be done if you want to cache tasks. That's the theory part of it. What happens in practice is that the scheduler gets fed a lot of BS. 
Projects sometimes send tasks they estimate to require 5 hours but they end up taking 100. There is little the scheduler can do after it has downloaded a task like that except give the task high priority status and hope it meets the deadline. Trouble is that deprives other tasks of the time they were scheduled to receive, so their deadlines are jeopardized too. You will hear many volunteers scream blue murder about the scheduler and high priority and missed deadlines, but scheduling work when you're fed BS is somewhere between very difficult and impossible. Given all that, the question boils down to this... What can you do to mitigate all the chaos and BS inherent in scheduling tasks and decrease the chances of tasks missing deadlines? The answer to that is simple.... Don't cache tasks. Use the KISS principle and don't download the next task until the task you're crunching now is finished. It's called "crunch one get one". I call it COGO. The downside to COGO is that if all the projects you crunch for are unavailable then your host can sit there for hours with nothing to do. Why would ALL of your projects be unavailable? Well, could be you crunch only one project and its server is down. Could be you have lost connection to the net. On the other hand, if you crunch for several projects and your connection is very reliable and not restricted then COGO is all you need. And maybe the few times your hosts sit with nothing to do for a few hours is not a big problem to you anyway. If you need to cache and you really understand all the theory I explained above then you can see that if you cache lots of work and something goes wrong then your chances of missing deadlines are high unless you constantly monitor operations and abort a few tasks when the scheduler overbooks. You can also see that if you don't want to watch your hosts all the time then the prudent thing to do is keep a very small cache. 
A small cache allows more time for the scheduler to cope with tasks that arrive in your cache with severely underestimated duration estimates. You can try to avoid projects that underestimate task duration but then you eliminate assisting good research. You can try to convince the admins of those projects to wise up and estimate duration more accurately but frequently it is impossible to make accurate estimates. Indeed some projects do but that doesn't mean they all can. Some algorithms are non-deterministic which means they may take 5 hours or they may take 500 and it's impossible to predict. The only thing you can do is understand the difficulty in scheduling work in a chaotic environment, appreciate the limitations inherent in the system, be familiar with what your projects are sending you, know how the limitations they face can cause big trouble on your hosts. Personally, before I add a new project I scour their forum for any indications that their duration estimates are not accurate. When I'm ready I'll add the project to one host and only one host. I'll run it for weeks and see what it does and even more important what it does not do before I consider adding the project to my other hosts. Claggy recommends a 1 and 0 cache. That's bigger than what I recommend but if it gives you enough time to cope with underestimated tasks then it works for you. I prefer even more time so I use 0.1 and 0. Whatever works for you. Some people use 2 and 3 and get away with it because the projects they crunch have very accurate and very reliable duration estimates and they have taken measures to ensure that none of the other factors that can go wrong (I didn't cover all of them for sake of brevity) ever go wrong for them. Or else they are on constant watch and are prepared to jump in and manually correct when things go awry. 
More recent versions of BOINC client have functionality that will sometimes abort tasks if they are in deadline trouble but you can never rely on that to happen. My feelings are that functionality should be made more reliable and less hesitant but there are valid arguments to the contrary. Things are complicated in the BOINC universe, rarely as simple as they first seem to be. ID: 47082 ·
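The overbooking problem described in this post can be shown with a toy simulation. This is a sketch under loud assumptions, not how the BOINC client really works: one CPU, tasks fetched until their estimated hours fill the cache, and all of the numbers (5-hour estimates, a 100-hour actual runtime, 72- and 120-hour deadlines) invented for illustration. `Task`, `fill_cache` and `missed_deadlines` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimated_hours: float  # what the project claimed
    actual_hours: float     # what the task really takes
    deadline_hours: float   # hours from "now" until the deadline


def fill_cache(candidates, cache_days):
    """Fetch tasks until the ESTIMATED work reaches the cache size.
    At least one task is always fetched (the "crunch one get one" case)."""
    cached, booked_hours = [], 0.0
    for task in candidates:
        if cached and booked_hours >= cache_days * 24:
            break
        cached.append(task)
        booked_hours += task.estimated_hours
    return cached


def missed_deadlines(cached, earliest_deadline_first=False):
    """Run tasks one at a time on one CPU and list those that finish late."""
    order = cached
    if earliest_deadline_first:
        order = sorted(cached, key=lambda t: t.deadline_hours)
    clock, late = 0.0, []
    for task in order:
        clock += task.actual_hours  # the real runtime, not the estimate
        if clock > task.deadline_hours:
            late.append(task.name)
    return late


# One task estimated at 5 hours that really takes 100, then honest 5-hour tasks.
candidates = [Task("bad", 5, 100, 120)] + [Task(f"ok{i}", 5, 5, 72) for i in range(20)]

big = fill_cache(candidates, cache_days=2)      # 2-day cache: 10 tasks booked
small = fill_cache(candidates, cache_days=0.1)  # tiny cache: just the one task

print(missed_deadlines(big))                                # the nine ok tasks
print(missed_deadlines(big, earliest_deadline_first=True))  # ['bad']
print(missed_deadlines(small))                              # []
```

With the 2-day cache the host has booked 145 hours of real work that it believed was 50, so something is late no matter which order the tasks run in. With the tiny cache only the long task is on the host, and it finishes inside its own deadline.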

mitrichr Joined: 21 May 07 Posts: 349	Message 47083 - Posted: 6 Jan 2013, 20:49:37 UTC - in response to Message 47078. Last modified: 6 Jan 2013, 20:51:00 UTC [quote]What are your cache settings? (both of them) Claggy[/quote] O.K., too many terms for my pea sized brain. I leave "Connect every" at the default on installation. Additional work buffer is mostly 1.25, in case of ISP problems. I have been using these settings for a long time with no difficulties. You know, I have been crunching for a long time without these difficulties. I do not think anything here has changed. I think that I got "High Priority" WU's inherited from people who turned off their equipment over the holidays. That is what it looks like in my flat-lined stats in BOINC Manager. It's just too much to be coincidence. I am now not going to touch anything, and I will see if it straightens out as this stuff moves through the systems. http://sciencesprings.wordpress.com http://facebook.com/sciencesprings ID: 47083 ·

kdsjsdj Joined: 5 Jan 13 Posts: 81	Message 47085 - Posted: 6 Jan 2013, 23:22:27 UTC - in response to Message 47083. Last modified: 6 Jan 2013, 23:25:41 UTC [quote]O.K., too many terms for my pea sized brain.[/quote] Pea sized? Then you're one of the fortunate ones. Lots of us are limping along with brains the size of a grain of rice. Short grained rice! [quote]You know, I have been crunching for a long time without these difficulties. I do not think anything here has changed.[/quote] Things you cannot see change, so that is never a solid argument. [quote]I think that I got "High Priority" WU's inherited from people who turned off their equipment over the holidays. That is what it looks like in my flat-lined stats in BOINC Manager. It's just too much to be coincidence.[/quote] I don't follow the logic you are using there. Probably because you're operating on a misconception and I think I know what that misconception is. You have heard that re-sent tasks are sent at high priority, right? Now you think you've received one of those tasks and it's stamped with "high priority" so your host is giving it high priority? Right? If that's what you think then you're wrong. Some projects give their resends "high priority" but that doesn't mean what you think it means. It means: 1) the task is sent to a host that has a short turnaround time and a low rate of failed tasks; 2) in addition to 1), the server can shorten the deadline of the task, for example if the normal deadline is 10 days the server might shorten that to 6 days. When such a task arrives on your host there is no tag or stamp or anything to designate it as high priority. It looks and smells like a normal task. Before I proceed I will first mention that the scheduler attempts to schedule tasks on your host so that they will finish at least 24 hours before the deadline. If it appears that a task will not finish 24 hrs. 
ahead of deadline the scheduler will elevate it to high priority and give the task more than its normal share of CPU time in an attempt to complete it 24 hrs ahead of deadline. It is possible the tasks running at high priority on your host are resends and it is possible the server reduced the deadline so drastically that your host has to run it at high priority. If that is what the server did then the scheduler should have realised the task would have to be run at high priority and that should have prevented the scheduler from sending that task to your host. Review that. It's perhaps complicated but that is the way the scheduler is designed to operate and it does so for the most part and you need to understand the mechanics I've outlined above if you hope to understand what's going on in this case. If it's not making sense at this point then don't bother going any further because it will make even less sense until you understand the mechanisms above. OK, that's how the scheduler is designed to operate. Unfortunately it gets tricked and fooled into making bad scheduling decisions. Initially the scheduler thought the task could be completed 24 hours ahead of deadline then something happened to change the scheduler's assessment of the situation and make it decide to elevate the task to high priority. What happened to change the scheduler's assessment? I dunno. There are a number of possible causes and I'm not going to go into those now and I will not until I see an indication from you that you understand the basics of how high priority works and show you understand what I have explained so far. If you don't understand that then any further explanations would be a waste of my time. [quote]I am now not going to touch anything, and I will see if it straightens out as this stuff moves through the systems.[/quote] It may very well straighten out on its own. 
The numbers you gave for buffer size are reasonably low so maybe the scheduler has enough uncommitted CPU time on your host to get everything done ahead of deadline. You see high priority doesn't equate to disaster and an impending explosion. It simply means the scheduling didn't go quite the way the scheduler thought it would so an adjustment has to be made. That's all. It happens here all the time in spite of my ultra-conservative cache (0.1 and 0) and I haven't missed a deadline in years. And I haven't had to abort tasks either. So leave things as they are but don't let your host start a task if it has a chance of missing deadline, abort it instead. ID: 47085 ·
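The high priority rule explained in this post (elevate a task when it no longer looks like finishing 24 hours ahead of its deadline) can be written down in a few lines. Treat this as a sketch of the rule as the post describes it, not as the BOINC client's real logic: the 24-hour margin and the 10-day versus 6-day deadlines come from the post, while the function name, the 100 hours of remaining work and the 50% CPU share are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# The 24-hour safety margin described in the post.
SAFETY_MARGIN = timedelta(hours=24)


def runs_high_priority(now, remaining_cpu, cpu_share, deadline):
    """True if the task is not projected to finish 24 hours before deadline.

    remaining_cpu: estimated CPU time still needed (a timedelta)
    cpu_share:     fraction of one CPU the task normally gets (0 < share <= 1)
    """
    projected_finish = now + remaining_cpu / cpu_share
    return projected_finish > deadline - SAFETY_MARGIN


now = datetime(2013, 1, 6)
remaining = timedelta(hours=100)

normal_deadline = now + timedelta(days=10)  # normal deadline from the post
resend_deadline = now + timedelta(days=6)   # shortened deadline for a resend

print(runs_high_priority(now, remaining, 0.5, normal_deadline))  # False
print(runs_high_priority(now, remaining, 0.5, resend_deadline))  # True
```

At a 50% share the 100-hour task needs 200 wall-clock hours, a little over 8 days. That fits inside a 10-day deadline with the margin to spare, but not inside a 6-day one, so the same task on the same host flips to high priority only because the deadline got shorter.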

mitrichr Joined: 21 May 07 Posts: 349	Message 47087 - Posted: 6 Jan 2013, 23:59:39 UTC kdsjsdj, thanks. For a new guy, you are very articulate. http://sciencesprings.wordpress.com http://facebook.com/sciencesprings ID: 47087 ·

kdsjsdj Joined: 5 Jan 13 Posts: 81	Message 47089 - Posted: 7 Jan 2013, 2:07:24 UTC - in response to Message 47087. Thank you :-) ID: 47089 ·

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.