BOINC 7.2.42 scheduling issue

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52910 - Posted: 2 Mar 2014, 15:30:27 UTC

I'm using the current BOINC v7.2.42; not sure whether earlier versions have the same issue, as I just started running OWS (Odd Weird Search).

Running Yoyo (Muon, ECM, Evolution, OGR & OWS) as well as LHC on the CPUs.
Running GPUGrid on an NVIDIA GPU and Einstein on an AMD GPU.
Work cache is set to ~0.5 days.

BOINC downloads various WUs and seems to schedule them more or less appropriately (although the earliest deadlines do not seem to be honored very well) UNTIL OWS WUs are downloaded. Since OWS deadlines are very short, I would expect them to be scheduled quickly. They aren't. Instead, BOINC schedules and runs various ECM and LHC WUs that have MUCH longer deadlines, waits until the OWS deadlines are too close for comfort, and then suspends the other WUs to run OWS. This would be OK except that it also suspends the GPU WUs so that more CPU slots are available. NOT GOOD.

Also posted this at Yoyo but it looks like a BOINC issue to me.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52911 - Posted: 2 Mar 2014, 16:00:09 UTC

As a workaround it's possible to use an app_config.xml such as the one below to limit the number of instances running, but this shouldn't be necessary. Something in the scheduling isn't right.

<app_config>
  <app>
    <name>oddWeiredSearch</name>        <!-- Yoyo's short name for the OWS app -->
    <max_concurrent>3</max_concurrent>  <!-- run at most 3 OWS tasks at once -->
  </app>
</app_config>
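
For anyone who wants to try this: app_config.xml goes in the project's directory under the BOINC data folder (for Yoyo it should be something like projects/www.rechenkraft.net_yoyo/ - check the exact folder name on your machine) and gets picked up via Advanced->Read config files or a client restart.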

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 52913 - Posted: 2 Mar 2014, 16:44:37 UTC
Last modified: 2 Mar 2014, 16:44:58 UTC

If you feel this is wrong, add <sched_op_debug>1</sched_op_debug> and <cpu_sched_debug>1</cpu_sched_debug> to the <log_flags/> section of your cc_config.xml file, let BOINC re-read the config file, and send the output, plus your complaint, to the BOINC Alpha or BOINC Development email list.
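
For reference, here's a minimal cc_config.xml sketch with just those two flags set (the file lives in the BOINC data directory; keep any options you already have in yours):

<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>   <!-- log scheduler request/reply details -->
    <cpu_sched_debug>1</cpu_sched_debug> <!-- log CPU scheduling decisions -->
  </log_flags>
</cc_config>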

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52915 - Posted: 2 Mar 2014, 18:01:20 UTC
Last modified: 2 Mar 2014, 18:01:58 UTC

I think it's wrong to ever have CPU apps causing GPU apps to suspend. It's doubly wrong if there was plenty of time to run those CPU WUs but BOINC for some reason decided not to (as in this case). Unfortunately I'm now running a long GPU WU of a type that will abort if I exit BOINC. Not sure what will happen if the config is reloaded. Maybe I shouldn't have posted this now since I'm under extreme time constraints for the next 2 weeks; too many irons in the fire. I'll try to come back to it when things loosen up a bit. I can't be the only one seeing this issue...

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 52917 - Posted: 2 Mar 2014, 18:51:36 UTC - in response to Message 52915.  

Not sure what will happen if the config is reloaded.

If done through Advanced->Read config files, it will just instruct the client to use the debug flags from that point forward, until instructed otherwise. There's no need to restart BOINC for this, as log flags aren't one of the options (like hardware detection) that are only read at client startup.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52918 - Posted: 2 Mar 2014, 22:00:35 UTC
Last modified: 2 Mar 2014, 22:01:25 UTC

Thanks Jord,

Now wouldn't you know it: I've already burned through all the OWS WUs manually, by suspending everything but them and letting them run 3 at a time. It seems they're a bit hard to get, as I've only seen them show up on 2 of my machines so far.

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52919 - Posted: 2 Mar 2014, 22:30:31 UTC
Last modified: 2 Mar 2014, 22:39:29 UTC

It might be a good idea to re-read the scheduling policies.
Here are two documents; both are old, but they give an idea of what's happening.

Very old:
http://boinc.berkeley.edu/trac/wiki/ClientSched
Slightly newer:
http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen

I understand that you believe GPUs should never be pre-empted by CPU jobs, but that simply is not how it was designed. If a project releases CPU jobs with deadlines such that they would run high-priority (earliest-deadline-first) on your machine, then yes, your GPU jobs will be pre-empted, to make it happen. By design.

The real problem you are having is: why does it think it can't make the deadline on these jobs? Or, better yet, why does it run other CPU jobs first? That may boil down to things like "were the running CPU jobs in the middle of a non-checkpointed timeslice?", "were the running CPU jobs running for a project that had a higher resource share setting?", "were the running CPU jobs running for a project that had a lower Recent Estimated Credit (REC) than what was needed to meet its resource share?", and "perhaps the completion estimates of the already-running tasks are way off, and by the time it gets closer to the OWS deadlines, they are forced into EDF". I know the documentation doesn't really specify that, but it's true.

In fact, there was a recent check-in for 7.2.38 that gave more priority to MT (multi-threaded) jobs compared to ST (single-threaded) jobs. The check-in notes actually give an outline of how tasks get scheduled (comparing Job 1, J1, to Job 2, J2):
------------------
client: job scheduler tweaks to avoid idle CPUs
-> Allow overcommitment by > 1 CPU. E.g., if there are two 6-CPU jobs on an 8-CPU machine, run them both.
-> Prefer MT jobs to ST jobs in general. When reordering the run list (i.e. converting the "preliminary" list to the "final" list), prefer job J1 to J2 if:
1) J1 is EDF and J2 isn't.
2) J1 uses GPUs and J2 doesn't.
3) J1 is in the middle of a timeslice and J2 isn't.
4) J1 uses more CPUs than J2.
5) J1's project has higher scheduling priority than J2's ... in that order.

4) is new; it replaces the function promote_multi_thread_jobs(), which did something similar but didn't work in some cases.
------------------
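
Read as code, that ordering boils down to a comparison function over pairs of jobs. Here's a rough sketch of the idea in C++ (the struct and function names are mine for illustration, not the actual client's):

// Returns true if job j1 should run ahead of job j2 when the
// preliminary run list is converted into the final one.
struct Job {
    bool   edf;              // flagged earliest-deadline-first (deadline pressure)
    bool   uses_gpu;         // task uses a GPU
    bool   mid_timeslice;    // started a timeslice it hasn't checkpointed/finished yet
    int    ncpus;            // CPUs used (MT jobs use > 1)
    double project_priority; // scheduling priority of the job's project
};

bool runs_before(const Job& j1, const Job& j2) {
    if (j1.edf != j2.edf) return j1.edf;                                // 1) EDF jobs first
    if (j1.uses_gpu != j2.uses_gpu) return j1.uses_gpu;                 // 2) then GPU jobs
    if (j1.mid_timeslice != j2.mid_timeslice) return j1.mid_timeslice;  // 3) then unfinished timeslices
    if (j1.ncpus != j2.ncpus) return j1.ncpus > j2.ncpus;               // 4) then MT before ST
    return j1.project_priority > j2.project_priority;                   // 5) then project priority
}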

Those are just some of my "off the top of my head" thoughts. I actually have had a similar problem with MindModeling tasks - they release them with ridiculously short deadlines (2.5 days), pre-empting my GPUs. I talked with the admin, and they relaxed them a bit, but they have been changing them back and forth over time. I now use an app_config to limit max_concurrent on them, to ensure my GPUs don't get pre-empted. It's their problem if their tasks don't get done, not mine.
http://mindmodeling.org/forum_thread.php?id=667

Regarding your particular scenario, you'd have to turn on the task-scheduling debug flags to figure out what happened, or maybe consider creating a Client Emulator scenario to debug this further.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52920 - Posted: 2 Mar 2014, 22:39:15 UTC - in response to Message 52919.  
Last modified: 2 Mar 2014, 22:39:50 UTC

The real problem you are having is: why does it think it can't make the deadline on these jobs? Or, better yet, why does it run other CPU jobs first? That may boil down to things like "were the running CPU jobs in the middle of a non-checkpointed timeslice?", "were the running CPU jobs running for a project that had a higher resource share setting?", and "were the running CPU jobs running for a project that had a lower Recent Estimated Credit (REC) than what was needed to meet its resource share?" I know the documentation doesn't really specify that, but it's true.

Hi Jacob!

The priority should be the same since they're all Yoyo subprojects, unless they're set inside Yoyo. At least I have no control over those priorities.

There should, IMO, at least be a switch or config-file setting that lets us specify that GPU jobs won't be preempted. The last thing most of us want is for our high-performing GPUs to be knocked offline indiscriminately.

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52921 - Posted: 2 Mar 2014, 22:45:23 UTC
Last modified: 2 Mar 2014, 22:47:18 UTC

Beyond,

The GPU is treated like a resource. If we can use it, great. But if we can't, we can't. Tasks are treated as tasks. We strive to get them done by deadline, using the resources that we have. If getting certain CPU jobs done by deadline means that we have to allocate all CPUs to CPU jobs, then we do it.

I've talked with David Anderson about this very issue, a few times, because I thought it was a bug. I even remember mentioning "I'd want my CPU jobs to miss deadlines before having my GPUs get pre-empted." But he convinced me it is not a bug.

I imagine that, in the future, we may have other usable resources. Currently we treat memory, hard disk space, and bandwidth as constraints, but it's not hard to imagine BOINC (and projects) treating them as resources. I'm digressing.

My point is that, when there are no EDF tasks, GPU tasks do get priority. But if there are EDF tasks, they beat GPU scheduling. And there's no switch.

The investigation should focus on why they went EDF.

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52922 - Posted: 2 Mar 2014, 22:51:30 UTC
Last modified: 2 Mar 2014, 22:53:33 UTC

PS:

It's entirely possible that the completion estimates on YoYo tasks get way off, due to the project running various subprojects. I'm not sure, as I don't run that project, but you might try to investigate that.

Edit: It looks like the subprojects are split up into their own applications.
http://www.rechenkraft.net/yoyo/apps.php

So this leads me to believe that the completion estimates may have gotten screwed up on either the application that went EDF or one of the other applications that you had tasks for.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52924 - Posted: 3 Mar 2014, 2:37:40 UTC - in response to Message 52922.  

So this leads me to believe that the completion estimates may have gotten screwed up on either the application that went EDF or one of the other applications that you had tasks for.

One of the first things I checked. The estimates were fine and there was plenty of time to finish the OWS tasks before their deadlines. I did them manually as described above, but I shouldn't have to babysit BOINC to get it to schedule properly. The whole issue would have been avoided if BOINC hadn't decided to run a long (17 hour) Yoyo OGR task that was downloaded AFTER the Yoyo OWS tasks and still had over 500 hours until its deadline. It also started running a 10 hour Yoyo Evolution task that still had over 650 hours until deadline and some 2 hour Yoyo ECM tasks that had 88 hours until deadline. There should have been no problem completing all the tasks with time to spare and no panic mode if BOINC had taken them in a halfway logical order. Why is the scheduler running jobs from the same project that have over 650 hours until deadline when there are jobs with less than 15 hours to deadline that need to be run?

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52925 - Posted: 3 Mar 2014, 2:47:58 UTC

Well... The best thing I can recommend is to (in the future?) prove that the tasks that did the pre-empting were downloaded after the OWS tasks. Also, the task-scheduling debug flags might be able to help figure out why the scheduler does what it does, so if you see something odd (pun intended) happening, you can turn the flags on without restarting the client and then copy/paste the results.

I can't remember offhand whether "within-project" tasks are run "earliest deadline first" or "earliest downloaded first". But I do know that, once a task starts running, if it is in the middle of a timeslice (i.e. not yet at a checkpoint), it takes priority over other tasks for that project.

Hope this somehow helps. Sorry it doesn't have the full answer. You might have to do some more digging with the log flags.

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52926 - Posted: 3 Mar 2014, 3:51:53 UTC
Last modified: 3 Mar 2014, 3:56:25 UTC

I did find a little information in here:
http://boinc.berkeley.edu/trac/browser/boinc-v2/client/cpu_sched.cpp

Basically, I found that assign_results_to_projects() is called to determine which tasks to consider running for a project. The comment there says the preference order is:
// The preference order:
// 1. results with active tasks that are running
// 2. results with active tasks that are preempted (but have a process)
// 3. results with active tasks that have no process
// 4. results with no active task

So... Is it possible that your non-OWS tasks already had processes created (i.e., were they running)? If not, then I guess I still haven't figured out how it chooses :)

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52927 - Posted: 3 Mar 2014, 5:18:29 UTC - in response to Message 52926.  

Is it possible that your non-OWS tasks already had processes created (i.e., were they running)? If not, then I guess I still haven't figured out how it chooses :)

Nope, none. In fact, some of the WUs that BOINC started had not even been downloaded until well after the OWS WUs.

Jacob Klein
Volunteer tester
Help desk expert
Joined: 9 Nov 10
Posts: 63
United States
Message 52928 - Posted: 3 Mar 2014, 6:18:45 UTC
Last modified: 3 Mar 2014, 6:21:15 UTC

Well, then I think Jord's advice (about using the debug flags) is your best bet to help us isolate the issue. And maybe ask David via the BOINC Alpha list.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52933 - Posted: 3 Mar 2014, 14:38:13 UTC
Last modified: 3 Mar 2014, 14:39:49 UTC

Interesting. Last night I received some Yoyo OWS WUs on a second client, except these now have a deadline of 240 hours instead of 15. At the same time BOINC downloaded Yoyo Evolution WUs (748 hour deadline) and Yoyo crunch WUs (480 hour deadline). This time the Yoyo OWS WUs are running first, which is what I would hope (and expect). Since I also posted this behavior on the Yoyo forum, I would guess that yoyo responded by increasing the deadline for OWS WUs. So this particular problem seems to have been solved by yoyo. What is still troublesome is that BOINC 7.2.42 (though I don't believe this is a new issue) behaved so strangely when confronted with short-deadline WUs. As I think I mentioned above, I noticed similar behavior previously with other BOINC versions. Anyway, testing this will be difficult or impossible given the new longer OWS deadlines.

Thanks Jacob and Jord for helping to troubleshoot this issue. I believe there is a BOINC problem in this regard, but I suppose we'll have to wait for a similar scenario in order to root it out (unless there's a way to replicate it with the BOINC client emulator).

An interesting sideline (perhaps related?) that you may want to test yourself: create or wait for a scenario where BOINC goes into panic mode, then lower your minimum work buffer. When BOINC went into panic mode I at first had a buffer of 0.5 days set. Lowering the buffer to 0.2 days decreased the number of WUs running at high priority. Decreasing the buffer again to 0.1 days lowered the number of WUs at high priority still further. The size of the additional work buffer doesn't seem to matter. I've had this happen on all my machines and have, one by one, set all the minimum work buffers to very small values and increased the additional work buffer to compensate. This BOINC behavior has been consistent for a long time and I find it a bit mystifying. Try it yourself.

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 52934 - Posted: 3 Mar 2014, 14:55:41 UTC - in response to Message 52933.  

If you want to analyse the High Priority (aka EDF or 'panic mode') decision-making process, add the <rr_simulation> log flag to the list Jord gave you earlier. But be warned: it generates a lot of output and is quite hard to interpret.
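
For reference, the <log_flags> section would then look something like this (a sketch assuming the same cc_config.xml from earlier in the thread):

<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
    <cpu_sched_debug>1</cpu_sched_debug>
    <rr_simulation>1</rr_simulation>  <!-- log the deadline-miss simulation results -->
  </log_flags>
</cc_config>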

Juha
Volunteer developer
Volunteer tester
Help desk expert
Joined: 20 Nov 12
Posts: 801
Finland
Message 52941 - Posted: 3 Mar 2014, 17:14:39 UTC - in response to Message 52933.  

Create or wait for a scenario where BOINC goes into panic mode, then lower your minimum work buffer. When BOINC is in panic mode at first I had a buffer of .5 day set. Lowering the buffer to .2 day decreased the number of WUs that were running at high priority. Decreasing the buffer again to .1 day again lowered the number of WUs at high priority. The size of the additional work buffer doesn't seem to matter.

If I'm not mistaken, the "minimum work buffer" still doubles as "connect every X days". BOINC tries to have every task completed that many days before its deadline.

Using your numbers: if you set the minimum work buffer to 0.5 days, or 12 hours, BOINC will try to have the task completed 12 hours before its deadline. If the deadline is 15 hours away, that leaves 3 hours to complete the task. If you have multiple tasks near their deadline, then multiple tasks need to be rushed.

If you set the minimum buffer to 0.2 days, you are giving BOINC more time to complete the tasks.
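
To put that arithmetic in code form (a rough sketch with made-up variable names and illustrative numbers, not anything from the client source):

#include <cstdio>

int main() {
    // All times in hours, using the numbers from the example above.
    double deadline        = 15.0;        // OWS deadline, measured from when the tasks arrived
    double min_work_buffer = 0.5 * 24.0;  // "minimum work buffer" of 0.5 days
    double est_remaining   = 6.0;         // illustrative: three 2-hour OWS tasks queued on one core

    // If BOINC aims to finish work "buffer" days before its deadline,
    // the usable window shrinks by the buffer size.
    double usable_window = deadline - min_work_buffer;  // 15 - 12 = 3 hours
    // With a 0.2-day buffer the window would be 15 - 4.8 = 10.2 hours instead.

    if (est_remaining > usable_window) {
        printf("Work gets rushed: earliest-deadline-first / panic mode\n");
    } else {
        printf("Work fits comfortably: normal scheduling\n");
    }
    return 0;
}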

Regarding your issue with the weird scheduling decisions: in what order were the tasks assigned to you? In the same or in different scheduler requests? And in what order did the tasks have all of their input files downloaded?

If you want to study BOINC's behavior more, it's easy: you can convince BOINC that a task's deadline is whatever you like simply by editing client_state.xml. (If you break anything you own all the parts. :)
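
As a rough sketch of what that looks like (the task name below is made up, and do this with the client shut down, since the client rewrites client_state.xml while it runs): each task has a <result> block whose <report_deadline> is, if I recall correctly, a Unix timestamp in seconds:

<result>
    <name>some_yoyo_task_123_0</name>  <!-- made-up task name -->
    <report_deadline>1394000000.000000</report_deadline>  <!-- Unix epoch seconds -->
    <!-- other fields omitted -->
</result>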

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52943 - Posted: 3 Mar 2014, 17:34:43 UTC - in response to Message 52941.  

Regarding your issue with the weird scheduling decisions: in what order were the tasks assigned to you? In the same or in different scheduler requests? And in what order did the tasks have all of their input files downloaded?

As outlined above: the OWS tasks with short deadlines were downloaded before some of the tasks with long deadlines, yet the longer-deadline WUs ran first. Check the messages above for more details. I'll be away from the computer for a day or two so won't be able to reply for a while.

Beyond
Joined: 16 Aug 12
Posts: 39
United States
Message 52967 - Posted: 5 Mar 2014, 14:20:09 UTC

Another example of the lovely scheduling in 7.2.42:

BOINC is continuing to run LHC WUs with over 150 hours left to deadline while 14 Yoyo ECM WUs with under 13 hours to deadline sit idle. This will continue until, all of a sudden, BOINC goes into panic mode and starts bumping GPU WUs. IMO there's just no reason for this kind of behavior. Maybe we should give it a timeout and make it sit in the corner.

On a second machine the same thing is happening, except there are 19 Yoyo ECM WUs sitting, also with less than 13 hours to deadline.

Jacob, I e-mailed you the screenshots since I have no way to host them at the moment.
