BOINC Allows WUs to Go Past Deadline

The Gas Giant (Joined: 30 Aug 05, Posts: 65)
Message 32855 - Posted: 18 May 2010, 5:52:52 UTC - Last modified: 18 May 2010, 5:54:32 UTC

I've seen this a couple of times but didn't worry too much, as the WUs in question were giving 6 cs, but the principle is getting to me now.

On my work machine (Dell E6850, dual core, 2 GB RAM, BOINC 6.10.45, now .56), running Aqua, PrimeGrid, LHC, MalariaControl and FreeHAL, I've seen BOINC ignore a WU heading past its deadline. In fact, this morning there was one that was past its deadline. This appears to be due to the interaction between the multi-thread/core Aqua WUs, the single-core 'other' projects and the resource share allocation, resulting in a single-core project WU going past its deadline.

Resource Shares are:
Aqua 16.55%
DNA 22.37% (suspended)
FreeHAL 1.12%
LHC 42.73%
MalariaControl 15.55%
PrimeGrid 0.67%

Currently only Aqua and PrimeGrid are supplying WUs and, due to the resource share allocated to Aqua, only Aqua is being crunched unless PrimeGrid, which has 7 WUs cached, gets into deadline problems - which it will, due to its low resource share. Effectively it is trying to operate as a backup project.

I got into work this morning, installed .56 and noticed the red messages regarding a PrimeGrid WU past its deadline from yesterday, even though I had just been letting BOINC "do its thing". This PG WU was the only one with a deadline of the 17th; the 6 others have deadlines of the 19th. Aqua WUs have deadlines of the 28th with 45 min completion times, with 4 cached (limited by Aqua).

It appears that with Aqua running, if BOINC doesn't see 2 or more WUs in deadline trouble, it will ignore the single WU and continue to crunch the multi-thread/core project. I have also seen, on my home machine, a WU get into deadline trouble and not be crunched when BOINC worked out that a core would be left idle if it worked on the WU in deadline trouble.

[edit] I'll soon see if the same thing happens with the 6 cached PG WUs due tomorrow.
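To put a number on the resource-share effect described in the opening post, here is a minimal sketch. It assumes, as a simplification and not as a description of BOINC's actual scheduler, that CPU time is divided only among the projects that currently have work, in proportion to their resource shares. The function name `effective_shares` is hypothetical.

```python
def effective_shares(shares, supplying):
    """Renormalise resource shares over the projects that currently have work.

    Simplifying assumption: projects without work drop out of the split.
    """
    active = {name: shares[name] for name in supplying}
    total = sum(active.values())
    return {name: 100.0 * value / total for name, value in active.items()}


# Resource shares (percent) as quoted in the opening post.
shares = {
    "Aqua": 16.55,
    "DNA": 22.37,  # suspended
    "FreeHAL": 1.12,
    "LHC": 42.73,
    "MalariaControl": 15.55,
    "PrimeGrid": 0.67,
}

# Only Aqua and PrimeGrid are supplying work at the moment.
result = effective_shares(shares, ["Aqua", "PrimeGrid"])
for name, pct in result.items():
    print(f"{name}: {pct:.1f}%")
```

Under that assumption Aqua ends up with roughly 96% of the machine and PrimeGrid with roughly 4%, which fits the observation that PrimeGrid only gets a look-in when a deadline is close.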

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 32856 - Posted: 18 May 2010, 6:31:14 UTC - in response to Message 32855.

Run with <rr_simulation>, <cpu_sched_debug>, <std_debug> and perhaps an option to make the stdoutdae.txt file a lot bigger than its normal 2MB. Run for the duration of your choice, then compress the output file and email it to me. You know where. I'll get it to the developers.

In case you feel adventurous, you can run BOINC ClientSim to see if it does the same.

Example of cc_config.xml with the above choices and a 20MB stdoutdae.txt file:

<cc_config>
  <log_flags>
    <cpu_sched_debug>1</cpu_sched_debug>
    <rr_simulation>1</rr_simulation>
    <std_debug>1</std_debug>
  </log_flags>
  <options>
    <max_stdout_file_size>20971520</max_stdout_file_size>
  </options>
</cc_config>

Richard Haselgrove (Volunteer tester, Help desk expert; Joined: 5 Oct 06, Posts: 5081)
Message 32858 - Posted: 18 May 2010, 11:03:58 UTC

I think TGG is right to finger the handling of multi-threaded (i.e. AQUA) tasks for this one. We went through several stages of MT scheduling in development testing, including the currently recommended v6.10.18, which has a tendency to leave single-threaded tasks unstarted (or waiting to run) if there is an AQUA task in the mix but not active.

I have a dual-core attached to AQUA and QuantumFIRE (you could call it my quantum computer...). At the moment, the AQUA admins are pressing harder for results, so I have NNT set for QF. Each time I do that, I have one orphaned QF task left over, which is never scheduled to run because there's nothing to pair it with - even though debt is on the limit, with +/- 86,400 seconds for the two projects.

Up to now, I haven't let anything reach deadline (I've manually allowed the orphan a playmate when deadlines approach, and then they get scheduled together until one completes). But if we still need logs when the time comes (early hours of 24 May for the current one), I can supply them. On the other hand, I think David will recognise that this is a consequence of the current design.

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 32859 - Posted: 18 May 2010, 13:19:25 UTC - in response to Message 32856. Last modified: 18 May 2010, 13:19:41 UTC

OK, according to JM7, run with the following flags:

<cc_config>
  <log_flags>
    <task>1</task>
    <cpu_sched_debug>1</cpu_sched_debug>
    <rr_simulation>1</rr_simulation>
    <cpu_sched>1</cpu_sched>
  </log_flags>
  <options>
    <max_stdout_file_size>20971520</max_stdout_file_size>
  </options>
</cc_config>
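Since a single mismatched tag is enough to make a hand-typed cc_config.xml invalid XML, one option is to generate the file programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `build_cc_config` is hypothetical, and the flag names are simply the ones listed in the post above. The script only guarantees well-formed XML; it does not check whether the BOINC client recognises each flag.

```python
import xml.etree.ElementTree as ET


def build_cc_config(log_flags, max_stdout_bytes):
    """Return a cc_config document as a string, with each log flag set to 1."""
    root = ET.Element("cc_config")
    flags_el = ET.SubElement(root, "log_flags")
    for name in log_flags:
        ET.SubElement(flags_el, name).text = "1"
    options_el = ET.SubElement(root, "options")
    ET.SubElement(options_el, "max_stdout_file_size").text = str(max_stdout_bytes)
    return ET.tostring(root, encoding="unicode")


# Flags taken from the post above; 20 MiB = 20 * 1024 * 1024 bytes.
flags = ["task", "cpu_sched_debug", "rr_simulation", "cpu_sched"]
xml_text = build_cc_config(flags, 20 * 1024 * 1024)

# Re-parsing would raise ET.ParseError if the text were not well-formed XML.
parsed = ET.fromstring(xml_text)
print(xml_text)
```

The size argument works out to 20971520, the same value used in the hand-written example.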

Richard Haselgrove (Volunteer tester, Help desk expert; Joined: 5 Oct 06, Posts: 5081)
Message 32861 - Posted: 18 May 2010, 14:46:45 UTC

OK, I've updated my Quantum computer to v6.10.56, brought forward the QF deadline to 22:30 this evening (about 30 minutes longer than BOINC estimates it would need), and set the log flags.

I have just three tasks on the machine at the moment - one QF and two AQUA. I suspended all three tasks while I did the fiddling around; then I first resumed the QF. BOINC reported it running High Priority. Then I resumed an AQUA: BOINC preempted the QF, and ran AQUA instead (task duration ~70 minutes, deadline 10 days away).

I'll send John (and David?) edited highlights of the log as it approaches and passes the artificial deadline.

Fiddling around with Unix time converters, I find that my phone number converts to next Saturday evening. I don't think that proves very much, but it was diverting....
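The behaviour reported so far in the thread can be illustrated with a toy model. This is not BOINC's scheduler code; it only assumes that jobs are taken from an ordered list and started if enough CPUs are still free. All names are hypothetical. On a dual-core machine, a 2-CPU task placed first leaves no room for the 1-CPU task, while putting the 1-CPU task first leaves a core idle because the 2-CPU task no longer fits.

```python
def assign_cpus(jobs, ncpus):
    """Greedy pass over an ordered job list.

    jobs: list of (name, cpus_needed) tuples, highest priority first.
    Returns (running, skipped, idle_cpus).
    """
    free = ncpus
    running, skipped = [], []
    for name, cpus_needed in jobs:
        if cpus_needed <= free:
            running.append(name)
            free -= cpus_needed
        else:
            skipped.append(name)
    return running, skipped, free


AQUA = ("AQUA (multi-threaded)", 2)
QF = ("QuantumFIRE (deadline at risk)", 1)

# Multi-threaded task ordered first: the deadline task is skipped.
mt_first = assign_cpus([AQUA, QF], ncpus=2)

# Deadline task ordered first: it runs, but one core sits idle.
deadline_first = assign_cpus([QF, AQUA], ncpus=2)

print(mt_first)
print(deadline_first)
```

In this simplified picture, either ordering has a cost, which may be why a lone single-threaded task is awkward to place next to multi-threaded work.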

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 32862 - Posted: 18 May 2010, 18:12:04 UTC

Never mind the log. David put in a fix at Trac changeset 21563.

However, since we're at the end of 6.10, it's not going to be back-ported, but instead will show up in the next client range.

Richard Haselgrove (Volunteer tester, Help desk expert; Joined: 5 Oct 06, Posts: 5081)
Message 32863 - Posted: 18 May 2010, 19:05:05 UTC - in response to Message 32862. Last modified: 18 May 2010, 19:09:57 UTC

You mean all those megabytes of

18-May-2010 19:52:58 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13
18-May-2010 19:52:58 [AQUA@home] [rr_sim] 29apr10-qm-8-100-148-480_1_72_0 dur: 4510.60 = 0.940*4511.99 + 0.060*4489.11
18-May-2010 19:53:01 [AQUA@home] [cpu_sched_debug] Request enforce CPU schedule: 29apr10-qm-8-100-148-480_1_72_0 checkpointed
18-May-2010 19:53:01 [---] [cpu_sched_debug] enforce_schedule(): start
18-May-2010 19:53:01 [---] [cpu_sched_debug] preliminary job list:
18-May-2010 19:53:01 [QuantumFIRE alpha] [cpu_sched_debug] 0: casino_p2-hno_04_parasweep.1000084_0 (MD: yes; UTS: no)
18-May-2010 19:53:01 [AQUA@home] [cpu_sched_debug] 1: 29apr10-qm-8-100-148-480_1_72_0 (MD: no; UTS: yes)
18-May-2010 19:53:01 [---] [cpu_sched_debug] final job list:
18-May-2010 19:53:01 [AQUA@home] [cpu_sched_debug] 0: 29apr10-qm-8-100-148-480_1_72_0 (MD: no; UTS: yes)
18-May-2010 19:53:01 [QuantumFIRE alpha] [cpu_sched_debug] 1: casino_p2-hno_04_parasweep.1000084_0 (MD: yes; UTS: no)
18-May-2010 19:53:01 [AQUA@home] [cpu_sched_debug] scheduling 29apr10-qm-8-100-148-480_1_72_0
18-May-2010 19:53:01 [QuantumFIRE alpha] [cpu_sched_debug] all CPUs used, skipping casino_p2-hno_04_parasweep.1000084_0
18-May-2010 19:53:01 [QuantumFIRE alpha] [cpu_sched_debug] casino_p2-hno_04_parasweep.1000084_0 sched state 1 next 1 task state 9
18-May-2010 19:53:01 [AQUA@home] [cpu_sched_debug] 29apr10-qm-8-100-148-480_1_72_0 sched state 2 next 2 task state 1
18-May-2010 19:53:01 [---] [cpu_sched_debug] enforce_schedule: end
18-May-2010 19:53:11 [AQUA@home] [rr_sim] 29apr10-qm-8-100-148-480_1_72_0 dur: 4495.95 = 0.942*4496.38 + 0.058*4489.11
18-May-2010 19:53:11 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13
18-May-2010 19:53:20 [---] [wfd]: work fetch start
18-May-2010 19:53:20 [---] [rr_sim] rr_sim start: work_buf_total 177120.00 on_frac 0.995 active_frac 1.000
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] 0.00: starting casino_p2-hno_04_parasweep.1000084_0 (1.00 CPU)
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 29apr10-qm-8-100-148-480_1_72_0 dur: 4509.79 = 0.942*4511.07 + 0.058*4489.11
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 0.00: starting 29apr10-qm-8-100-148-480_1_72_0 (2.00 CPU)
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 0.00: 29apr10-qm-8-100-148-480_1_72_0 finishes after 1755.27 (7657.16G/4.36G)
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 1755.27: starting 29apr10-qm-8-100-148-480_1_101_0 (2.00 CPU)
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 1755.27: 29apr10-qm-8-100-148-480_1_101_0 finishes after 4510.98 (19678.60G/4.36G)
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] 6266.26: casino_p2-hno_04_parasweep.1000084_0 finishes after 17043.16 (38310.81G/2.25G)
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 misses deadline by 21829.71
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13
18-May-2010 19:53:20 [QuantumFIRE alpha] [rr_sim] casino_p2-hno_04_parasweep.1000084_0 dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13
18-May-2010 19:53:20 [AQUA@home] [rr_sim] 29apr10-qm-8-100-148-480_1_72_0 dur: 4509.79 = 0.942*4511.07 + 0.058*4489.11
18-May-2010 19:53:20 [---] [wfd] ------- start work fetch state -------
18-May-2010 19:53:20 [---] [wfd] target work buffer: 4320.00 + 172800.00 sec
18-May-2010 19:53:20 [---] [wfd] CPU: shortfall 324664.33 nidle 0.00 saturated 6266.26 busy 0.00 RS fetchable 100.00 runnable 200.00
18-May-2010 19:53:20 [AQUA@home] [wfd] CPU: fetch share 1.00 LTD -228595.50 backoff dt 0.00 int 0.00 (overworked)
18-May-2010 19:53:20 [QuantumFIRE alpha] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 0.00 (no new tasks)
18-May-2010 19:53:20 [CPDN Beta] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 0.00 (no new tasks)
18-May-2010 19:53:20 [Einstein@Home] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 0.00 (no new tasks)
18-May-2010 19:53:20 [AQUA@home] [wfd] overall LTD -234831.37
18-May-2010 19:53:20 [QuantumFIRE alpha] [wfd] overall LTD -11598.19
18-May-2010 19:53:20 [CPDN Beta] [wfd] overall LTD 0.00
18-May-2010 19:53:20 [Einstein@Home] [wfd] overall LTD 0.00
18-May-2010 19:53:20 [---] [wfd] ------- end work fetch state -------
18-May-2010 19:53:20 [---] [wfd] No project chosen for work fetch

are never going to be read? ;-)

Just for the record, there's a "misses deadline by 21829.71" in there, and it has fetched new work for AQUA whilst in deadline trouble.

I'm comfortable with leaving this in trunk and not holding up v6.10.56 - though if it gets called back yet again, and we have to go through another round of v6.10 testing, I'd suggest including this in the next re-release.
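For anyone wading through logs like this, the [rr_sim] "dur" entries can be sanity-checked with a few lines of Python. The sketch assumes each entry has the form total = w1*est1 + w2*est2, i.e. two duration estimates blended by weights that sum to 1. That reading of the log format is an assumption here, not something confirmed in the thread, and the names `DUR_RE` and `check_dur` are hypothetical. Because the weights are printed to three decimals, a small mismatch is expected, so the comparison uses a tolerance.

```python
import re

# Matches: dur: <total> = <w1>*<est1> + <w2>*<est2>
DUR_RE = re.compile(
    r"dur: (?P<total>[\d.]+) = "
    r"(?P<w1>[\d.]+)\*(?P<e1>[\d.]+) \+ (?P<w2>[\d.]+)\*(?P<e2>[\d.]+)"
)


def check_dur(line, tolerance=0.5):
    """Recompute the blended duration and compare it with the logged total."""
    m = DUR_RE.search(line)
    if m is None:
        raise ValueError("no dur expression found")
    v = {key: float(text) for key, text in m.groupdict().items()}
    blended = v["w1"] * v["e1"] + v["w2"] * v["e2"]
    return blended, abs(blended - v["total"]) <= tolerance


# One QuantumFIRE entry, written in the assumed w*est form.
line = (
    "casino_p2-hno_04_parasweep.1000084_0 "
    "dur: 26513.53 = 0.335*26079.79 + 0.665*26732.13"
)
blended, ok = check_dur(line)
print(round(blended, 2), ok)
```

The recomputed value agrees with the logged 26513.53 to within a tenth of a second, which supports the weighted-blend reading of these entries.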

The Gas Giant (Joined: 30 Aug 05, Posts: 65)
Message 32865 - Posted: 18 May 2010, 21:20:57 UTC

Glad to help out...shame it's not considered important enough to make it to a .57 release. WUs going past deadline due to a bug...pretty important issue.

Aurora Borealis (Joined: 8 Jan 06, Posts: 448)
Message 32875 - Posted: 19 May 2010, 6:04:21 UTC - in response to Message 32865.

"Glad to help out...shame it's not considered important enough to make it to a .57 release. WUs going past deadline due to a bug...pretty important issue."

I think that the dev currently considers it more important to get out a new stable release than to have to worry about the new bugs that may come about by putting in a fix for what is primarily a problem due to the needs of one project.

Jord (Volunteer tester, Help desk expert; Joined: 29 Aug 05, Posts: 15480)
Message 32877 - Posted: 19 May 2010, 6:37:53 UTC - in response to Message 32875.

"I think that the dev currently considers it more important to get out a new stable release than to have to worry about the new bugs that may come about by putting in a fix for what is primarily a problem due to the needs of one project."

Exactly. It's better to test this fix in a new client than to add it to what's now, finally, after 56 revisions, a reasonably stable client that adds ATI functionality.

Remember, the developers didn't find it necessary to use 6.9 as the development number, as all that was needed was to add ATI functionality. They figured we'd go on to 6.11 a good 6 months ago! We're way behind on development. Let's hope they learned you don't just 'add something' and that it'll work from the first get-go. :-)

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.