Message boards : BOINC client : BOINC 5.2.13 client still does not schedule correctly.
Author | Message |
---|---|
Send message Joined: 19 Dec 05 Posts: 93 |
I am running Red Hat Enterprise Linux 3 on a machine with two hyperthreaded 3.06GHz Xeon Processors (the ones with a 1 MByte L3 Cache), so my Linux kernel treats this as a 4-processor machine. Consequently, I have my preferences set to run up to 4 processors. There are 4 GBytes of RAM on the machine. Now ever since the 5.2* series of the BOINC client came out, the scheduler in it has been acting more reasonably, and is running 4 instances of the climateprediction.net application. Three of these (the sulfur cycle ones) require A LOT of cpu time, and have short (4 months or so) deadlines. The old (4.3.* series) BOINC client did not schedule well, and these sulfur cycle applications progressed too slowly because the client insisted on getting other work from other applications and these had shorter deadlines and caused the sulfur cycle applications to be deferred. It seemed as though the scheduler looked only at the deadlines and ignored estimated time to completion. Now the new 5.2 series does a little better with this. When I got the first one, it immediately entered overcommitted mode, let the near-term deadline applications finish, then refused to download any new work until today, running the four climateprediction.net applications. I had some hope that the three sulfur cycle applications might finish. But today, it finished the regular climateprediction application (hadsm3um, I think it is called), and proceeded to download some rosetta and some predictor@home work. That might have been OK, if it decided to run those on the now-spare processor. But it did not do that. Since the new work has a shorter deadline, it stopped all execution of the sulfur cycle stuff and put all 4 processors to work on the new stuff. My guess is that it should have downloaded less work, just enough that the single free processor could do it in the time allowed, and let the other three processors continue with the sulfur cycle work. But it did not. THIS REALLY NEEDS TO BE FIXED. 
I am going to miss the deadlines for the sulfur cycle stuff that are due in mid January 2006. These all have predicted remaining times of around 1400 hours, 2240 hours, and 3850 hours. Even if they get all three processors all the time, they may not finish by the deadline. It is true that those three processes run faster than the estimates, but not so much faster that I can be confident they will finish by the deadline. But with these other processes taking all four processors instead of just the spare one, I am going to be in big trouble, even if the climateprediction.net project accepts late results (I got the impression that they would). |
Send message Joined: 30 Aug 05 Posts: 297 |
"it should have downloaded less work" - YOU control this, through the cache size setting. I really doubt in your case it would make any difference, however. I am having trouble believing 3850 hours for a CPDN Sulphur WU. My AMD 3700+ is not 3 times faster than your Xeon, and my time for one is only 872 hours. Also realize that CPDN does not enforce deadlines, so even if you miss the deadline, you will still be getting credit as long as trickles keep being sent. You have two physical CPUs, but four "logical" CPUs. You really should never allow more than two CPDN WUs on your system at the same time, unless you intend to do _only_ CPDN. Having three of them WILL cause exactly what you are seeing. CPDN+HT don't always get along. Unfortunately, the only fix for this is to complete your current CPDN work, even if it means limiting BOINC to using the 2 physical CPUs. In fact, this is what I would strongly urge you to do, because that will enable you to complete the CPDN work _much_ faster. (Note: once you are within a couple of weeks of the deadline, if it hasn't finished one of the WUs by then, you will have to suspend one of the running CPDN results periodically, say once a week, just long enough for the "other" one that's not running to send a trickle. Then resume the one BOINC wants to work on first.) And "No new work" CPDN so it doesn't get more before you have finished all of what you have... then always remember to set your system "down" to 2 CPUs before allowing it to get work from CPDN, then back up to 4 if you want to run work from other projects. If someone with a Dual-HT system and CPDN experience has other advice, please let us know! |
Send message Joined: 19 Dec 05 Posts: 93 |
"it should have downloaded less work" - YOU control this, through the cache size setting. I really doubt in your case it would make any difference, however. What is the "cache size setting"? If you mean the amount of disk space, I have a private partition for BOINC and it is 8 GBytes, and I tell the BOINC client it may use 95% of it. As far as processor speed is concerned, I have not yet completed a SulfurCycle work unit yet. I am about 1/2 way through phase 5 of the one closest to finishing and the time spent so far is 2261 hours according to boincmgr, and it is at 90.40% of the way to completion, claiming, as of now, 1382 hours to go. Now I know the figures for time to completion from boincmgr are pessimistic, and I have heard, as you say, that climateprediction is not enforcing the deadlines. (I just got a sulfur cycle wu for my other machine (dual 550 MHz Pentium IIIs) and it estimates 7611 hours to do that on that machine.), but what if some project did enforce its deadlines? As far as allowing more than 2 climateprediction wus on the machine, it was the BOINC client that decided to allow that. IIRC, this was with the 4.43 client when it ran out of work and the other projects were not downloading any work, so it got two more from climateprediction. Maybe my present 5.2.13 (?) one would not be so stupid? I do not really want to limit the BOINC client to running only two applications at a time because when the processors go idle, the processor voltage runs right up to the upper limit (about 1.47 volts) allowed for them instead of the 1.400 volts that they ask for. Even when running 4 climate prediction work units at once, it does send up the trickles whenever they are ready. That has not been a problem. I should not have to do any diddling of the number of processors. When the cp work unit finished this morning, the BOINC client should have asked for only enough work for one processor to complete by the deadlines of the new applications. 
And even if it got more, it should not have asked for so much as to push it into overcommitted state so that the shortest deadline applications ran to the exclusion of the rest. |
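The gap between the 1382 hours reported by boincmgr and what the progress figures imply can be checked with a little arithmetic. The sketch below is only an illustration: it assumes progress is linear in CPU time, which the poster does not claim, and the function name `linear_remaining_hours` is made up for this example.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate remaining CPU hours, assuming progress is linear in CPU time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time if the remaining part runs at the same rate.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures quoted in the post: 2261 hours spent, 90.40% complete.
remaining = linear_remaining_hours(2261.0, 0.9040)
print(f"{remaining:.0f} hours")  # prints "240 hours"
```

Under that linear assumption the work unit would need roughly 240 more hours rather than 1382, which is consistent with the poster's remark that the boincmgr estimates are pessimistic.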
Send message Joined: 30 Aug 05 Posts: 297 |
What is the "cache size setting"? If you mean the amount of disk space, I have a private partition for BOINC and it is 8 GBytes, and I tell the BOINC client it may use 95% of it. No, I mean the "Connect to network about every (determines size of work cache; maximum 10 days)" setting in the preferences. If you haven't changed this, the default is 0.1 day (2.4 hours) so it should have only gotten 1 or 2 results for each of the other projects besides CPDN. You don't say how many it got, but from what you've said, I assume it was more than a couple, so I suspect you have set this higher. As far as allowing more than 2 climateprediction wus on the machine, it was the BOINC client that decided to allow that. IIRC, this was with the 4.43 client when it ran out of work and the other projects were not downloading any work, so it got two more from climateprediction. Maybe my present 5.2.13 (?) one would not be so stupid? No, if you have it set to use 4 CPUs, and CPDN is the only place it can get work, it will get 4 of them. That may not be "smart", but BOINC has no way to do anything else. That is why I have CPDN set (always!) to "no new work", except when I specifically want to get one. The problem here is the "rule" that if there is an idle CPU, BOINC _will_ get more work. It doesn't know about HT, it sees "4 CPUs" and assumes 4 physical CPUs. It has no way of knowing that 2 of them are very inefficient because of what you're running on them. I do not really want to limit the BOINC client to running only two applications at a time because when the processors go idle, the processor voltage runs right up to the upper limit (about 1.47 volts) allowed for them instead of the 1.400 volts that they ask for. Even when running 4 climate prediction work units at once, it does send up the trickles whenever they are ready. That has not been a problem. The problem is that you do not have 4 CPUs. 
Running two copies of the _same_ application on an HT processor causes severe contention for the shared resources, giving you almost no gain at all from having HT in the first place. (I believe the figure is something like 4%, where with _different_ processes, it can be 25%.) I agree fully that your goal should be to use all 4 logical CPUs, I'm just stating that the reason for all of your problems is the fact that you had 4 _CPDN_ processes running. Two CPDNs and two "something else" would be fine. However, you WILL NOT get to that point until you can convince your computer it is not overcommitted, and to do that you must either finish or abort at least one CPDN result. And at this point in the deadline, I suspect all three, not just one. I should not have to do any diddling of the number of processors. When the cp work unit finished this morning, the BOINC client should have asked for only enough work for one processor to complete by the deadlines of the new applications. And even if it got more, it should not have asked for so much as to push it into overcommitted state so that the shortest deadline applications ran to the exclusion of the rest. Agreed, to a point. CPU affinity has been asked for, and there is a 3rd-party compile of BOINC that has it, but I'm not sure to what extent. That would allow you to "assign" a CPU to a project. Then you could assign 1 CPDN result to each physical processor, and BOINC _would_ correctly deal with the other two logical processors. This would still break at any point you allow more than two CPDN results to get on your machine however. The _current_ design of BOINC is very simple on when it is overcommitted. If a simulation shows that any of the results on your machine might not make deadline, based on the resource shares you have set, then it will go into EDF mode. Where we disagree is on the fact that it got "too much" work because YOU TOLD IT TO, with the cache size setting. 
If you have that at "4 days", then it got 4 days worth of work from the other projects. I also should have asked early what your resource shares for the various projects are - it really won't matter at this point with deadline danger on multiple CPDNs though. The "deadline risk" is going to be the CPDN, so it's going to do the other work first in order to get it out of the way, so it can get back to CPDN, but then it'll run out of work on one CPU, and this same cycle will repeat, over and over, and you will never finish the CPDN work. You either have to limit the number of CPUs or the number of results. If you MUST run all 4 logical CPUs, then you can limit the number of results that will be loaded from other projects to 1, through the use of the cache. Set it down to 0.001 - NO MORE than that. And "no new work" CPDN. When the non-CPDN work you have now is finished, it should only get work for one CPU from other projects, and will leave CPDN running on the other "3". This won't solve the problem of CPDN running way longer than it should, but it'll do most of what you're asking for. |
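The overcommitment rule described above ("if a simulation shows that any of the results might not make deadline, based on the resource shares") can be sketched roughly as follows. This is not BOINC's actual code: the `Result` class, the `is_overcommitted` function, the even split of a project's CPU share among its results, and the 720 hours to deadline are all assumptions made for illustration. Only the resource shares (70/10/10/10) and the remaining-time estimates come from the thread.

```python
from dataclasses import dataclass


@dataclass
class Result:
    project: str
    remaining_cpu_hours: float
    hours_to_deadline: float  # wall-clock hours left; hypothetical in the example below


def is_overcommitted(results, shares, ncpus):
    """Return True if any result would miss its deadline at its share-based CPU rate."""
    total_share = sum(shares.values())
    # Count results per project so a project's CPU share can be split among them.
    per_project = {}
    for r in results:
        per_project[r.project] = per_project.get(r.project, 0) + 1
    for r in results:
        project_cpus = ncpus * shares[r.project] / total_share
        # A single result can never use more than one CPU.
        rate = min(1.0, project_cpus / per_project[r.project])
        if r.remaining_cpu_hours / rate > r.hours_to_deadline:
            return True
    return False


shares = {"cpdn": 70, "predictor": 10, "seti": 10, "rosetta": 10}
sulphur = [Result("cpdn", h, 720.0) for h in (1400.0, 2240.0, 3850.0)]
print(is_overcommitted(sulphur, shares, ncpus=4))
```

With these numbers every sulphur result needs more wall-clock time than remains, so the check reports the host as overcommitted no matter how little other work is fetched.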
Send message Joined: 8 Sep 05 Posts: 168 |
Are you sure you only have 4 months? I am running 6 Sulphur on 2 840ee HT and my deadline is one year, and yes, BOINC runs the same way on both of my computers... but I do not feel it is broken. BOINC Wiki |
Send message Joined: 19 Dec 05 Posts: 93 |
What is the "cache size setting"? If you mean the amount of disk space, I have a private partition for BOINC and it is 8 GBytes, and I tell the BOINC client it may use 95% of it. I have it set to 1.5 days, I believe, because I so frequently get no work at all, even though I am signed up for four projects. If I set it much less than this, the machine goes idle, or at least, some of the processors go idle, for too many days. As far as allowing more than 2 climateprediction wus on the machine, it was the BOINC client that decided to allow that. IIRC, this was with the 4.43 client when it ran out of work and the other projects were not downloading any work, so it got two more from climateprediction. Maybe my present 5.2.13 (?) one would not be so stupid? Hyperthreading has nothing to do with this problem. Hyperthreading may not be as good as more real processors, but it is better than nothing. While I have not tested it with anything but setiathome, it is clear that I can run one or two of those simultaneously and they do not interfere with each other. When I ran three, I got more overall throughput, but not 50% more (of course), and when I ran four, I got more than with three, but not 33% more. So it pays to run all four processors, but not as much as a naive person would expect. The problem is with the scheduling. When there is a free processor, it should get more work, if there is any, and it should ask for an amount of work such that the scheduler does not go into nearest-deadline-first mode. As it is, there are three sulphur cycle processes running right now, and it just downloaded 6 proteinfolding (predictor) work units. I would have no problem with that, but it preempted the three sulfur cycle ones (that are at risk) in order to run the predictor ones that are not at risk. It should not do that. 
I do not really want to limit the BOINC client to running only two applications at a time because when the processors go idle, the processor voltage runs right up to the upper limit (about 1.47 volts) allowed for them instead of the 1.400 volts that they ask for. Even when running 4 climate prediction work units at once, it does send up the trickles whenever they are ready. That has not been a problem. I disagree. Running two copies of the same application on an HT processor causes some contention on shared resources (especially the caches on the processor chip), but if anything, running the same application there can cause _less_ contention than running different processes, and it surely does not drop the gain to anything near 4%. IIRC, it drops things to, perhaps, 70% (depends on the ratio of fixed-point to floating-point calculations, since IIRC there is almost enough fixed-point hardware on those chips for two real processors, but there is definitely not enough floating-point stuff). OTOH, even a heavily floating-point computation requires a huge amount of fixed-point computation; e.g., subscript calculations. I should not have to do any diddling of the number of processors. When the cp work unit finished this morning, the BOINC client should have asked for only enough work for one processor to complete by the deadlines of the new applications. And even if it got more, it should not have asked for so much as to push it into overcommitted state so that the shortest deadline applications ran to the exclusion of the rest. I am not interested in that kind of CPU affinity. The Linux kernel already does its best to run processes on the same processor that they ran on previously to keep cache hit ratio high. Now my chips have roughly 16K instruction and 16K data cache at L1 level, 512K L2 cache, and 1024K L3 cache. Admittedly, the L3 cache is too small, but that is what it is. 
I do not want to direct CPU affinity because it violates a fundamental rule: "The more you try to outsmart an operating system, the more it will outsmart you." I also should have asked early what your resource shares for the various projects are - it really won't matter at this point with deadline danger on multiple CPDNs though. The "deadline risk" is going to be the CPDN, so it's going to do the other work first in order to get it out of the way, so it can get back to CPDN, but then it'll run out of work on one CPU, and this same cycle will repeat, over and over, and you will never finish the CPDN work. You either have to limit the number of CPUs or the number of results. 70 for climate prediction, 10 for predictor, 10 for setiathome, and 10 for rosetta. You seem to be correct as to what the scheduler in the BOINC client will do. It is clearly doing that. And that is a design error. It should not do that. If you MUST run all 4 logical CPUs, then you can limit the number of results that will be loaded from other projects to 1, through the use of the cache. Set it down to 0.001 - NO MORE than that. And "no new work" CPDN. When the non-CPDN work you have now is finished, it should only get work for one CPU from other projects, and will leave CPDN running on the other "3". This won't solve the problem of CPDN running way longer than it should, but it'll do most of what you're asking for. That is just the kind of monkeying around I object to. Might as well not have a scheduler in the BOINC client at all if I must do that. If I go on vacation for two or three weeks, who is going to do all that stuff? I hope I got the quote stuff in there right. |
Send message Joined: 19 Dec 05 Posts: 93 |
Are you sure you only have 4 months? I am running 6 Sulphur on 2 840ee HT and my deadline is one year, and yes, BOINC runs the same way on both of my computers... but I do not feel it is broken. I am quite sure. Note that there are (at least) two versions of sulphur cycle work units. One is called sulfur cycle 4.21, which gets about a 4 month deadline, and one is called sulfur cycle 4.22, which gets almost a year deadline. |
Send message Joined: 25 Nov 05 Posts: 1654 |
The earlier sulphur units had a tight deadline because the project people needed a few hundred in a hurry. They now have them, and have generated another 50,000 with the usual 1 year deadline, so that they can build up a database of sulphur data. All short deadline sulphur models still being processed can now be assumed to have the usual long deadline. Which isn't enforced, as long as the models trickle every few days to let the server know that they are still alive, and the end result isn't returned in 10 years' time, after the project has finished. |
Send message Joined: 30 Aug 05 Posts: 297 |
"Doctor, it hurts when I do this!" "Well, don't DO that!" You're convinced that BOINC should do things your way, and because it won't, it isn't scheduling correctly. If you can write up the problem such that the developers will agree, I'm sure it will be fixed. However, everything you are describing is exactly how it is supposed to work until/unless BOINC has "per-CPU" scheduling. That has been brought up over and over again, and may well make it into the "to be done" list at some point, but at least up to now, it's been a real low priority. As for CPDN, Sulphur 4.22 only came out the first of this month, so you can't have work due in January for it, whether the deadline is a year or 4 months. 4.19 is what I have, with a year deadline; couldn't find any reference to a 4.21 Sulphur on the website. And whether the overall gain is 4% or 20% or 33%, HT makes an individual result take considerably longer, which is exactly the problem you have with those right now. There _are_ ways to do what you want to do, but not without changing anything on your end. And please note that I said nothing about "temporarily" changing your cache, that needs to be a permanent change, so vacations don't matter. The only "monkeying around" thing involved is doing whatever you need to do to limit the number of results you download from CPDN so you don't have more than two on your system. |
Send message Joined: 19 Dec 05 Posts: 93 |
"Doctor, it hurts when I do this!" Yes, I am convinced that the scheduler is wrong. Not necessarily wrongly implemented (though that is possible, but I do not know the specifications), but wrongly designed. The design almost guarantees that multiprocessor machines will always be biased against programs with far-off deadlines, even though those deadlines may have been correctly designed. And the scheduler ignores the time to completion that it, itself, calculates. It should surely take that into consideration when scheduling processes, and it does not except when it is too late; i.e., when climateprediction is closer to its deadline than anything else in there. And some applications have deadlines of a week or less. What the scheduler might well do is subtract the time to completion estimate it already makes from the deadline, and use _that_ to schedule processes when the system is overcommitted. Then distant deadline processes would get a better chance at the processors. N.B.: I have not fully thought through this "band aid", so it may not be the best way to schedule long-running processes, but the current way is guaranteed to fail. Sulfur 4.21 is certainly something that exists, because I have three of them on this machine. I also know that 4.22 exists because I have one of those on my other machine. 
The 4.21 work units came out like this:
33 Aug 24 10:06 gfx.sh
7271941 Aug 24 11:37 sulphur_gfx_4.21_i686-pc-linux-gnu
1573376 Aug 24 11:38 globe.rgb
5970568 Aug 24 12:24 sulphur_se_4.21_i686-pc-linux-gnu
13083990 Aug 24 12:34 sulphur_um_4.21_i686-pc-linux-gnu
4259364 Aug 26 17:39 sulphur_4.21_i686-pc-linux-gnu
8281400 Aug 26 18:00 sulphur_4.21_i686-pc-linux-gnu.so
4563697 Aug 26 18:30 sulphur_um_4.21_i686-pc-linux-gnu.zip
5639582 Aug 26 19:11 sulphur_se_4.21_i686-pc-linux-gnu.zip
24047629 Aug 26 19:42 sulphur_data_4.21_i686-pc-linux-gnu.zip
17989 Aug 27 21:43 48vf_200298395.zip
17992 Aug 28 08:43 460n_100294695.zip
18031 Sep 17 11:10 46st_c00295709.zip
These have deadlines of January 24, January 25, and February 14, 2006. IIRC (machine running Windows XP at the moment), the 4.22 work unit I got today is due December 1, 2006, or something like that. I have 4.13 work units that are hadsm3um. No 4.19 on my machines. |
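The "band aid" proposed in this post (subtract the estimated time to completion from the deadline and schedule on that) can be compared with plain earliest-deadline-first ordering in a few lines. This is a sketch of the idea only, not of anything in the BOINC client; the `Task` type, both function names, and the sample numbers are hypothetical. Ordering by deadline minus remaining time resembles what is usually called least-slack-time ordering.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    hours_to_deadline: float
    remaining_cpu_hours: float


def edf_order(tasks):
    """Earliest deadline first: only the deadline matters."""
    return sorted(tasks, key=lambda t: t.hours_to_deadline)


def latest_start_order(tasks):
    """Order by deadline minus estimated remaining time (the latest possible start)."""
    return sorted(tasks, key=lambda t: t.hours_to_deadline - t.remaining_cpu_hours)


# Hypothetical figures: a long sulphur model and a short predictor work unit.
tasks = [
    Task("sulphur (hypothetical)", 720.0, 1400.0),
    Task("predictor (hypothetical)", 168.0, 5.0),
]
print([t.name for t in edf_order(tasks)])
print([t.name for t in latest_start_order(tasks)])
```

With these figures EDF runs the predictor work first, while the slack-based ordering runs the sulphur model first because its latest start time has already passed. As the poster notes, the idea is not fully thought through; for example, a task that can no longer meet its deadline would always sort first.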
Send message Joined: 29 Aug 05 Posts: 304 |
There does seem to be a problem here, I hope JM7 is watching this thread and working on a fix. The problem as I see it is that if a long duration workunit causes EDF mode on a multi-processor machine, one CPU goes idle and fetches work. This new work preempts the workunit that is causing the problem and the cycle repeats. The long duration workunit that is causing the problem will rarely get CPU time because the work downloaded for the idle CPU will have an earlier deadline and get processed first. BOINC WIKI BOINCing since 2002/12/8 |
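The cycle described in this post can be reproduced with a toy simulation. It is a deliberate simplification and not the BOINC scheduler: it always runs in earliest-deadline-first order, fetches a fixed batch of short work units whenever a CPU would otherwise be idle, and uses invented figures (8-hour work units with a 168-hour deadline, long models due at hour 720). The function name `simulate` and all its parameters are assumptions.

```python
def simulate(batch, ncpus=4, hours=240, short_hours=8, short_deadline=168):
    """Return the CPU hours the three long models receive over the simulated period."""
    # Each task is [absolute_deadline, remaining_cpu_hours, is_long].
    tasks = [[720, 1400, True], [720, 2240, True], [720, 3850, True]]
    long_cpu_hours = 0
    for now in range(hours):
        # A CPU would be idle, so fetch a batch of short-deadline work.
        if len(tasks) < ncpus:
            tasks += [[now + short_deadline, short_hours, False] for _ in range(batch)]
        # Earliest deadline first; the sort is stable, so ties keep their order.
        tasks.sort(key=lambda t: t[0])
        for task in tasks[:ncpus]:
            task[1] -= 1
            if task[2]:
                long_cpu_hours += 1
        tasks = [t for t in tasks if t[1] > 0]
    return long_cpu_hours


print(simulate(batch=1))  # one short work unit fetched at a time
print(simulate(batch=6))  # six fetched at once, as in the predictor download
```

In this toy model, fetching one work unit at a time leaves three CPUs on the long models throughout (720 CPU hours over 240 hours), while fetching six at a time cuts that to 240 CPU hours, because each batch occupies all four CPUs until most of it is done.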
Send message Joined: 29 Aug 05 Posts: 225 |
There does seem to be a problem here, I hope JM7 is watching this thread and working on a fix. I am not sure that this is entirely correct. For the simple reason that I have a Sulfur work unit on one of my machines that is putting that machine into EDF. Yet, I don't see run-away downloading. And, the Sulfur work unit is getting time ... not sure that it will make it yet ... :) And more interesting, on this machine I also have a slab model though it does not seem to be causing EDF. And, both of the CPDN work units get time on occasion. The machine I have is a dual (HT) P4 ... |
Send message Joined: 30 Aug 05 Posts: 297 |
I am not sure that this is entirely correct. For the simple reason that I have a Sulfur work unit on one of my machines that is putting that machine into EDF. Yet, I don't see run-away downloading. But Paul, what is your cache? Small enough that you only get enough work at one time that it will "fit" in the idle CPU(s)? Or > 4 results that will take up all 4 to get done first? I'm not arguing that this _isn't_ a design mistake in BOINC. We all want better multi-CPU handling... |
Send message Joined: 19 Dec 05 Posts: 93 |
There does seem to be a problem here, I hope JM7 is watching this thread and working on a fix. I see it the same way, and furthermore, I have not ONE, but THREE such long-duration work units on a 4-processor (well, two hyperthreaded processors) machine, so the fourth processor keeps getting short duration work units and they keep all three long-duration work units from getting any of the processors except for the short intervals between which my machine dials-up for more work. |
Send message Joined: 29 Aug 05 Posts: 225 |
Bill, I use 0.5 days right now. And, all projects usually have 2-10 work units on hand with the exception of CPDN which I keep "throttled" with NNWork. Xeon-64a: SETI@Home: 10 {40 min}; Rosetta@Home: 4 {4 hours}; PrimeGrid: 11 (puzzling as I have this at 1% resource share) {3:33 hours}; Einstein@Home: 4 {11 hours}; CPDN: 1 {62 days}. Another side note: I did start this computer out with 4 CPDN work units. It gradually worked them down to only one ... On another computer I DID get two work units due to the 4.45 (?) bug where it saw some WU as 0 length in time and would get another ... I have not had the courage to "open" up CPDN with 5.2.13 yet, I suppose I should. I don't understand what is happening with Jean-David because I have not seen this behavior with any of my machines. My only question is what are the resource shares? I re-looked and I did not see if Jean-David did post them. Are they "unbalanced"? Mine are 15% CPDN (had been 20% or more before) with few of the projects at a widely varying percentage. It has only been recently that I have set 3 projects to 1% (PPAH, SDG, and PG) on most of my machines. |
Send message Joined: 24 Nov 05 Posts: 129 |
Paul, FWIW, Jean-David's shares are: 70 for climate prediction, 10 for predictor, 10 for setiathome, and 10 for rosetta. Hope that helps. "The arc of history is long, but it bends toward Justice" |
Send message Joined: 29 Aug 05 Posts: 225 |
Ok, I am not sure that this may not be another mis-understanding. EVEN WITH a 70% share, that does not mean that BOINC will allocate one or more CPUs to the largest share. I see this all the time. The CPDN work shifts in and out. For example, if I set CPDN to 25% it would be expected that BOINC would allocate one CPU to the project at all times. But, it does not. So the Xeons may have no CPDN running for some time as it runs other work. For example, right now this instant I only have 2 CPDN in progress of the 8-9 I have on hand (to make me a liar each of the Xeons is running CPDN right now ... arrrgggghhhhh!). So, I am not sure that this may not be another case of mis-expectations ... and we are talking past each other ... As long as BOINC is running CPDN MOST OF THE TIME, as I would expect, then there should be little to be concerned about ... |
Send message Joined: 19 Dec 05 Posts: 93 |
Ok, I am not sure that this may not be another mis-understanding. I agree, but for me it is not running CPDN most of the time. It fits it in only when there is nothing else to run, but since there is currently a free processor, it arranges that there is almost always something else to run. Now I know from previous experience when there were 4 CPDNs running I had no trouble. The trouble started when one of them finished and now all it does is gets new stuff for the fourth processor. I know how to "fix" it, but I do not want to do it. What I should do is suspend the other three projects. Then BOINC client will download another CPDN (at least, I suppose it would, since there is an idle processor). Then I could resume the other projects and since the machine will be overcommitted for a long while, it will download nothing else, and these CPDNs will progress normally. Until the next CPDN finishes and I will be back where I started yesterday morning. |
Send message Joined: 29 Aug 05 Posts: 304 |
Now I know from previous experience when there were 4 CPDNs running I had no trouble. The trouble started when one of them finished and now all it does is gets new stuff for the fourth processor. I know how to "fix" it, but I do not want to do it. What I should do is suspend the other three projects. Then BOINC client will download another CPDN (at least, I suppose it would, since there is an idle processor). Then I could resume the other projects and since the machine will be overcommitted for a long while, it will download nothing else, and these CPDNs will progress normally. Until the next CPDN finishes and I will be back where I started yesterday morning. Yeah, it looks like this is one temporary fix. Another would be to set your queue length small enough to only get one workunit at a time from any project. But either way this should be addressed in the BOINC code, since either fix is only temporary or viable under certain circumstances. BOINC WIKI BOINCing since 2002/12/8 |
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.