Rosetta sends too much work

xaminmo (Joined: 2 Feb 18, Posts: 4)
Message 84623 - Posted: 2 Feb 2018, 1:28:59 UTC - Last modified: 2 Feb 2018, 1:38:39 UTC

Rosetta seems to send more work than BOINC requests, and BOINC ends up preferring Rosetta well past its resource share setting. The main system I notice this on is here: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3290081

Even if I limit Rosetta to a third of my CPUs, BOINC just leaves my CPUs idle rather than downloading work from other projects. If I suspend Rosetta, within 5 seconds, my queue will have work from other projects.

This only affects CPU projects. No GPU projects, and none of my other projects seem to preempt the scheduler.

This was brought up in the past, but all the user got was a sarcastic response from a forum mod. This is basically the same as https://boinc.berkeley.edu/dev/forum_thread.php?id=8903

I consider this a defect in the operation of BOINC, since BOINC is the workload manager and job scheduler. I'm hoping for a workaround, or a dev commitment to help improve this.

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 84626 - Posted: 2 Feb 2018, 6:15:50 UTC - in response to Message 84623 - Last modified: 2 Feb 2018, 6:17:33 UTC

Quote: "This was brought up in the past, but all the user got was a sarcastic response from a forum mod."

Funny how you can read what my mood was on 7 Feb 2014 at 5:40 in the afternoon. I'd appreciate it if you refrained from posting stuff like that again. I reread my answer there, and it's far from sarcastic. But if you don't like the answer I gave there, that's something different, and all about your own state of mind. No need to rub that off onto me.

On to an answer: the resource share in BOINC 7 is one that works over the long term. Short-term changes have no effect. Short-term changes in preferences have no effect either.

If you feel Rosetta sends more work than BOINC asks for, you'll have to take that up with that project; we have no say over them. At least they have updated their server software to something more recent. It used to be ancient (beginning of BOINC time).

As for BOINC development, if after today you still feel this needs to be changed, post an Issue at https://github.com/BOINC/boinc/issues and hope someone takes you up on it. Do know we only have 3 developers, all of them volunteers. Not all of them will know in depth how the scheduler works, and the one that does is already swamped with other work, among other things the new way of doing science, soon to be revealed. He may have time after that though.

Before posting your issue, you may want to search through the back-issues to check that none of them already asks what you want to ask, such as "More cores used than user limit" and "BOINC may not use all CPUs in some cases".

Jim1348 (Joined: 8 Nov 10, Posts: 310)
Message 84631 - Posted: 2 Feb 2018, 17:30:28 UTC

I provided an answer over on the Rosetta forum too, assuming that I diagnosed it correctly. http://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893

xaminmo (Joined: 2 Feb 18, Posts: 4)
Message 84703 - Posted: 7 Feb 2018, 5:21:37 UTC - in response to Message 84626 - Last modified: 7 Feb 2018, 5:55:36 UTC

Quote: "Funny how you can read what my mood was on 7 Feb 2014 at 5:40 in the afternoon...."

The key point intended was to convey that the issue is not new, and had not been clearly answered in the prior posts about it. I didn't mean to bust your chops over it per se. It was a small part.

I'll agree that my own frustration could be dialed back in the bottom part of my post, but I stand by the statement that it's not constructive to tell users things like "what else do you expect" and "funny how you can read my mood".

I deleted the rest of my reply. It was asking for clarification of long-term, explaining details of how long I'd been running vs stuck, and details of the searches I made prior to posting, etc.

Jim1348 clarified the definition of "what does long term" mean (10 day REC half-life is default), and suggested a specific client option that can help tune this. I think that was exactly what I needed.
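Editor's illustration (not part of the thread): a half-life of 10 days means that credit estimated 10 days ago counts half as much as credit estimated now, and credit from 20 days ago counts a quarter. The C++ sketch below shows only that arithmetic. The names decay_weight and update_average are hypothetical, and the update rule is a generic exponentially decayed average assumed for illustration; it is not copied from the BOINC client source.

```cpp
#include <cassert>
#include <cmath>

// Fraction of an old contribution that still counts after `elapsed_days`,
// given a half-life in days. With a 10-day half-life, 10 days -> 0.5.
double decay_weight(double elapsed_days, double half_life_days) {
    return std::exp2(-elapsed_days / half_life_days);
}

// Hypothetical update rule (an assumption, not BOINC's code):
// blend the previous average with the rate observed over the last
// `dt_days`, letting the previous average decay by its half-life.
double update_average(double previous_avg, double observed_rate,
                      double dt_days, double half_life_days) {
    const double w = decay_weight(dt_days, half_life_days);
    return previous_avg * w + observed_rate * (1.0 - w);
}
```

Under these assumptions, a project that stops running entirely still keeps half of its old average 10 days later, which is one way to see why short-term changes appear to have no effect.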

xaminmo (Joined: 2 Feb 18, Posts: 4)
Message 84704 - Posted: 7 Feb 2018, 5:36:15 UTC - in response to Message 84627 - Last modified: 7 Feb 2018, 6:25:53 UTC

I did not point it out to bust your chops over phraseology. That is a filtering/semantic issue on my part. The intent was to point out that the issue is not new, and the existing replies for a similar issue did not provide the needed info.

I deleted the rest of my reply because it is superseded. Jim1348 pointed me in the right direction, but I appreciate that you spent time replying to me.

xaminmo (Joined: 2 Feb 18, Posts: 4)
Message 84706 - Posted: 7 Feb 2018, 5:47:21 UTC - in response to Message 84631 - Last modified: 7 Feb 2018, 6:26:19 UTC

Quote: "I provided an answer over on the Rosetta forum too, assuming that I diagnosed it correctly. http://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893"

Thanks! That answers a major question (what does long-term mean). Based on that info, I expect your answer is probably exactly what I needed.

I see you're not gridcoined, but if you have any sort of kudos or imaginary points you collect anywhere, PM me and I'll send something.

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 84708 - Posted: 7 Feb 2018, 8:11:36 UTC - in response to Message 84703 - Last modified: 7 Feb 2018, 8:11:56 UTC

Quote: "Jim1348 clarified the definition of "what does long term" mean (10 day REC half-life is default)"

REC isn't recalculated over a precise 10 days either. From cpu_sched.cpp you can see how REC (Recent Estimated Credit) is defined:

// CPU scheduling logic.
//
// - create an ordered "run list" (make_run_list()).
//      The ordering is roughly as follows:
//      - GPU jobs first, then CPU jobs
//      - for a given resource, jobs in deadline danger first
//      - jobs from projects with lower recent est. credit first
//      In principle, the run list could include all runnable jobs.
//      For efficiency, we stop adding:
//      - GPU jobs: when all GPU instances used
//      - CPU jobs: when the # of CPUs allocated to single-thread jobs,
//          OR the # allocated to multi-thread jobs, exceeds # CPUs
//          (ensure we have enough single-thread jobs
//          in case we can't run the multi-thread jobs)
//      NOTE: RAM usage is not taken into consideration
//          in the process of building this list.
//          It's possible that we include a bunch of jobs that can't run
//          because of memory limits,
//          even though there are other jobs that could run.
// - add running jobs to the list
//      (in case they haven't finished time slice or checkpointed)
// - sort the list according to "more_important()"
// - shuffle the list to avoid starving multi-thread jobs
//
// - scan through the resulting list, running the jobs and preempting
//      other jobs (enforce_run_list).
//      Don't run a job if
//      - its GPUs can't be assigned (possible if need >1 GPU)
//      - it's a multi-thread job, and CPU usage would be #CPUs+1 or more
//      - it's a single-thread job, don't oversaturate CPU
//          (details depend on whether a MT job is running)
//      - its memory usage would exceed RAM limits
//          If there's a running job using a given app version,
//          unstarted jobs using that app version
//          are assumed to have the same working set size.

From line 86 onward you can see how it's calculated, and from line 565 onward how it's updated over time.

But one caveat: REC mainly determines when a project's tasks are run, not when BOINC is due to ask for work, or how much work it will ask for.

A project should always send roughly the amount of work that's being asked for; it can be less, it can be more. The exception is the first work request, which is for 1 second. Then it'll get at least one task for each hardware resource you designated as needing work.

The work that's still in the cache is accounted for in the calculation of REC, but the work that's yet to be asked for isn't (of course, as we don't know what it'll be or how long it's going to run for).

So while you could experiment with the REC half-life value, here as well, don't expect BOINC to do what you want immediately, within the time you give it.

Note: there is one, maybe one and a half persons here at BOINC who totally know how all of this fine-balancing of scheduling to and fro works, and I ain't one of them.
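Editor's illustration (not part of the thread): the first three ordering rules in the quoted comment (GPU jobs first, then jobs in deadline danger, then jobs from projects with lower recent estimated credit) can be read as a three-level sort key. The C++ sketch below is a simplified reading of that comment only. The Job struct, runs_before and order_run_list are hypothetical names, and the cutoffs for CPU and GPU counts, the more_important() sort, the shuffle step and the memory checks are all left out.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified job record; the real client tracks far more state.
struct Job {
    std::string name;
    bool uses_gpu;         // GPU jobs are listed before CPU jobs
    bool deadline_danger;  // jobs at risk of missing their deadline go first
    double project_rec;    // recent estimated credit of the job's project
};

// Returns true if `a` should appear before `b` in the run list.
bool runs_before(const Job& a, const Job& b) {
    if (a.uses_gpu != b.uses_gpu) return a.uses_gpu;
    if (a.deadline_danger != b.deadline_danger) return a.deadline_danger;
    return a.project_rec < b.project_rec;  // lower REC runs earlier
}

// Returns a copy of `jobs` ordered by the three rules above.
// stable_sort keeps the incoming order for jobs that tie on all three.
std::vector<Job> order_run_list(std::vector<Job> jobs) {
    std::stable_sort(jobs.begin(), jobs.end(), runs_before);
    return jobs;
}
```

In this simplified view, a project whose REC is high relative to the others drops toward the back of the CPU queue. That matches the caveat above: REC influences when tasks run, and the sketch says nothing about how much work is requested.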

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.