Work cache amount calculation bug for multi-cpu from 5.8 on

Uioped1 (Joined: 2 Mar 06, Posts: 12) Message 12611 - Posted: 20 Sep 2007, 1:09:02 UTC

It appears that the algorithm used to estimate how long a given workload will take to process is incorrect for multi-CPU systems. I first noticed this after an upgrade to 5.8.x, and I continue to experience it on 5.10.20. It does not appear to affect single-CPU systems, and it may be attributable to long-running workunits such as those from CPDN.

The client scheduler appears to overestimate the time it will take to complete and report a given set of workunits, so the amount of work fetched at any given time is well below the specified "connect about every" and "additional work buffer" settings. I frequently observe my multi-core system processing all but one workunit before requesting additional work.

I have read both the previous post explaining the expected scheduling calculations and the posts stating that locally set preferences are unreliable. So I hand-calculated what the scheduler should think my cache looks like, and verified that it should be requesting work and is not. (That applies to the cases where it has more work than processors; in the case where it runs out of work, it is obviously not behaving correctly.)

There are possible complications in my case. First, I connect to CPDN. Counting total hours alone, my cache is therefore almost always full (I set it to connect once a day and keep a day's extra work around). That count is misleading, however, because the CPDN work occupies only one CPU. Second, I also connect to LHC, which almost never has work and has likely accumulated a very large debt. Finally, I connect to 5 projects in total.
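To illustrate the point about one long CPDN workunit filling the cache "on a pure hourly basis" while occupying only one CPU, here is a minimal sketch comparing two ways of counting the shortfall. It is not the BOINC client's actual code: the function names, the greedy task-to-CPU assignment, and the example numbers are all assumptions made for illustration.

```python
import heapq

def naive_shortfall(remaining, ncpus, window):
    """Pool all remaining task time together, ignoring which CPU runs it.

    remaining: estimated seconds left for each queued task (hypothetical input)
    window:    desired buffer per CPU, in seconds
    """
    return max(0.0, ncpus * window - sum(remaining))

def per_cpu_shortfall(remaining, ncpus, window):
    """Assign each task to the least-loaded CPU, then sum the idle time
    each CPU would have inside the buffer window.

    The greedy assignment is an illustrative assumption, not the
    scheduler's real simulation.
    """
    loads = [0.0] * ncpus
    heapq.heapify(loads)
    for t in sorted(remaining, reverse=True):
        least = heapq.heappop(loads)      # CPU with the least queued work
        heapq.heappush(loads, least + t)  # give it the next-longest task
    return sum(max(0.0, window - load) for load in loads)

# Hypothetical host: 2 CPUs, a 2-day buffer, one 400-hour task, one 1-hour task.
tasks = [400 * 3600.0, 3600.0]
window = 2 * 86400.0
print(naive_shortfall(tasks, 2, window))    # pooled count sees a full cache
print(per_cpu_shortfall(tasks, 2, window))  # per-CPU count sees one CPU nearly dry
```

With these made-up numbers the pooled count reports no shortfall at all, while the per-CPU count reports that the second CPU has almost the whole two-day window unfilled.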

Keck_Komputers (Joined: 29 Aug 05, Posts: 304) Message 12629 - Posted: 20 Sep 2007, 8:44:07 UTC

It is probably due to CPDN. If the client thinks it will miss its deadline, that may block work fetch until there is a dry CPU.

Another thing to check is your time stats. If your host is not on all the time, or you only allow work while your computer is idle, these numbers will decrease and your work fetch will be lower. The formula is along the lines of:

work request = shortfall * on_frac * active_frac * cpu_efficiency

BOINC WIKI
BOINCing since 2002/12/8
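As a rough sketch of the formula quoted above (which the post itself only describes as "along the lines of" the real one), the scaling can be written as follows. The function name and the range check are assumptions added for illustration; the example values are made up.

```python
def work_request_seconds(shortfall, on_frac, active_frac, cpu_efficiency):
    """Scale the raw shortfall (in seconds) by the host's time stats.

    on_frac, active_frac and cpu_efficiency are treated here as
    fractions between 0 and 1; that range check is an assumption.
    """
    for name, value in (("on_frac", on_frac),
                        ("active_frac", active_frac),
                        ("cpu_efficiency", cpu_efficiency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return shortfall * on_frac * active_frac * cpu_efficiency

# A host that is always on and always allowed to compute requests the full shortfall.
print(work_request_seconds(86400.0, 1.0, 1.0, 1.0))

# If on_frac and active_frac are each 0.5, the request shrinks to a quarter.
print(work_request_seconds(86400.0, 0.5, 0.5, 1.0))
```

Because the three factors multiply, an unexpectedly low value in any one of them is enough to cut the work request sharply.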

Uioped1 (Joined: 2 Mar 06, Posts: 12) Message 12671 - Posted: 21 Sep 2007, 0:51:43 UTC - in response to Message 12640.

I don't think that it is either of the issues mentioned, and here's why. I don't think the system believes it will miss a deadline, because it doesn't go into EDF mode. The system is nearly always on and rarely heavily used, so the multiplier should be very close to one.

Also, while I have seen the flop count estimate fluctuation problem before, that does not seem to be the case here, because the flop count is part of the estimated time remaining, which is what I used in my calculations of what should have happened.

If there is any measurement I can take to help pinpoint where the presumed bug is, let me know.

Uioped1 (Joined: 2 Mar 06, Posts: 12) Message 12738 - Posted: 25 Sep 2007, 17:01:28 UTC - in response to Message 12671.

I guess I was completely wrong :) I didn't realize that the EDF messages had been removed from the log and now appear as a suffix to the individual workunit's status ("Running, high priority").

I dug up how to get some more information on the scheduler's decisions (a very nice feature!) and figured out that the problem lay elsewhere. I now believe the problem to be with the active_frac key in the client_state file, so I have created a new thread: http://boinc.berkeley.edu/dev/forum_thread.php?id=2154

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.