DCF Integrator

Message boards : Questions and problems : DCF Integrator

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28159 - Posted: 20 Oct 2009, 17:39:09 UTC - in response to Message 28157.  

> The only problem will be that the log file(s) will grow dramatically.
>
> Regards,
> Gundolf
>
> [edit] Since SETI is currently shut down for maintenance I can't check, but there should be a thread (by Richard Haselgrove?) with a warning that your problem might occur when using the rescheduler.

Yes, I reschedule only VLAR work to the CPU, where it is done more efficiently. Just over 2 hours.

> Perhaps the value you have set for <flops> isn't right, since it should prevent the DCF from varying that much. [/edit]

Yes, I have <flops> sections in the app_info file to force BOINC to display the actual work time required for each work unit.


I have the rescheduler set up with a batch file that Windows runs twice a day, at 12:00am and 12:00pm. These two batch downloads occurred 7 hours after the rescheduler had run.

The flops I have set up in app_info result in correctly predicted run times for the work in the cache. But for some unknown reason the system becomes extremely hungry for GPU work when the VLAR work units are finishing and, at the same time, a very short GPU work unit is run. The GPU can process some of the work in as little as 3 minutes.
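For anyone following along: the <flops> entries under discussion live in each <app_version> block of app_info.xml. A hypothetical fragment (the flops value and version number here are invented for illustration, and the surrounding <file_ref> and related elements are omitted):

```xml
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>603</version_num>
    <!-- Measured speed of this app on this host, in floating-point ops
         per second; the client divides the task's rsc_fpops_est by this
         value when predicting run time. -->
    <flops>2.5e9</flops>
</app_version>
```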

I will continue to run this setup and hopefully can help resolve the problem.

Will advise...............

ID: 28159

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28169 - Posted: 20 Oct 2009, 21:17:03 UTC
Last modified: 20 Oct 2009, 21:19:38 UTC

Jord,

In the hope of catching this behavior by BOINC sooner, I have enabled the logging on all 5 of my computers.

Hopefully one of them will do it again soon.

edit....how many minutes of logged data does the stdoutdae.old file hold?
ID: 28169

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 28172 - Posted: 20 Oct 2009, 21:36:05 UTC - in response to Message 28169.  

I don't know. A couple of hours usually.

You can increase your log file size using the <max_stdout_file_size>size_in_bytes</max_stdout_file_size> option in cc_config.xml, like so:

<cc_config>
  <log_flags>
    <blah></blah>
  </log_flags>
  <options>
    <max_stdout_file_size>8388608</max_stdout_file_size>
  </options>
</cc_config>


That's a log of 8 megabytes (8 * 1024 * 1024).
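The arithmetic is simple enough, but if you want to double-check the byte values you drop into cc_config.xml, a quick sketch (plain Python, not part of BOINC):

```python
def mb_to_bytes(mb: int) -> int:
    """Convert a size in megabytes to the byte count cc_config.xml expects."""
    return mb * 1024 * 1024

# The 8 MB example above:
print(mb_to_bytes(8))  # 8388608
```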
ID: 28172

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28183 - Posted: 21 Oct 2009, 0:10:07 UTC

Thanks..........

<max_stdout_file_size>8388608</max_stdout_file_size>

Added to all 5 cc_config files. Now I wait.
ID: 28183

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28185 - Posted: 21 Oct 2009, 2:02:11 UTC
Last modified: 21 Oct 2009, 2:54:32 UTC

Jord,

One of my computers just completed a VLAR CPU work unit. The estimated time to crunch the cache instantly doubled. I checked the computer and it had not downloaded more work, but the DCF is in the 1.198 range and now reducing each time a GPU WU finishes. Now down to 1.1325.

I have <flops> set in the app_info.xml file, and the time to completion is correct for all work units until a VLAR unit is done. As long as there is no VLAR work on the machine, it is happy with my flops estimates.

I have a 2 MB log file if anyone is interested. Will also supply app_info if needed. Oh yes......this is machine 5137307, attached to SETI only.

Still logging on all my machines.

edit.....the problem is that the file only covers 9 minutes and there is only one DCF line in the entire file. So it is probably of no use to anyone.

further edit.........even in an 8 MB file there are only 2 DCF lines. Can we turn off one of the 4 logging flags to gain more file space for the DCF?
ID: 28185

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28186 - Posted: 21 Oct 2009, 3:53:16 UTC
Last modified: 21 Oct 2009, 4:02:04 UTC

Jord,

I have several 8 MB data files now. None of them has more than 3 DCF: lines, and that is with only <dcf_debug>1</dcf_debug> logging, covering a time frame of 22 minutes. So this is pretty much a waste of my time, as I stated earlier.

Since reporting on the odd behavior of BOINC is not allowed without backup data sets for proof, again this is a waste of my time.

Have a good day.
ID: 28186

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28188 - Posted: 21 Oct 2009, 6:00:47 UTC

One final thought..................

Perhaps you are not aware of Trac ticket 812, in which David Anderson states:

> DCF is a kludge to compensate for bad FLOP estimates by projects. I don't want to make the kludge even more complicated.

I believe I have proven David Anderson's point. My flops estimates are correct, as all predicted crunch times are within 60 seconds of actual, including the predicted crunch times of the VLAR work on the CPUs.

It is only when the DCF value is allowed to change dramatically in an instant that the system becomes unbalanced, with unpredictable results.

I have seen hundreds of work units requested and downloaded. I have seen cache sizes more than double in estimated time.

Now, since this ticket was opened 10 months ago and last modified 2 months ago, it does not look like Berkeley intends to address this problem in the near future.

ID: 28188

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 28192 - Posted: 21 Oct 2009, 8:35:43 UTC - in response to Message 28188.  

> My flops estimates are correct as all predicted crunch times are within 60 seconds of actual. Including the predicted crunch times of the VLAR work on the CPU's.
>
> One of my computers just completed a VLAR CPU work unit. The cache size instantly doubled the estimated time to crunch the cache.

These two statements are incompatible. Either the flops estimate in your app_info.xml file is accurate, in which case the cache estimate will remain unchanged on completion; or the cache estimates doubled because of a DCF change, in which case your flops value is wrong (by definition, because the sole purpose of the flops value is to normalise DCF).
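Richard's point can be sketched as a toy model. The function and variable names below are illustrative only, not BOINC's actual internals, but the relationship is the one he describes: the runtime estimate is rsc_fpops_est / flops * DCF, and DCF drifts toward the ratio of actual runtime to the uncorrected estimate.

```python
def estimated_runtime(rsc_fpops_est, flops, dcf):
    """Simplified BOINC-style estimate: task size / app speed, scaled by DCF."""
    return rsc_fpops_est / flops * dcf

def converged_dcf(actual_runtime, rsc_fpops_est, flops):
    """The value DCF is pulled toward: actual runtime over the uncorrected estimate."""
    return actual_runtime / (rsc_fpops_est / flops)

fpops = 3.0e13   # hypothetical task size (floating-point operations)
flops = 2.5e9    # hypothetical app speed claimed in app_info.xml

# If the flops value is accurate, tasks finish in the predicted time and
# DCF sits at 1.0 -- cache estimates don't move when a task completes.
actual = fpops / flops
assert abs(converged_dcf(actual, fpops, flops) - 1.0) < 1e-9

# If the VLAR tasks really take twice as long as the flops value predicts,
# DCF is dragged toward 2.0 -- and every estimate in the cache doubles.
assert abs(converged_dcf(2 * actual, fpops, flops) - 2.0) < 1e-9
```

By this model, a DCF that jumps on a VLAR completion is direct evidence that the flops value is wrong for VLAR tasks, whatever it looks like for the others.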
ID: 28192

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28195 - Posted: 21 Oct 2009, 12:34:34 UTC - in response to Message 28192.  
Last modified: 21 Oct 2009, 12:40:23 UTC

> > My flops estimates are correct as all predicted crunch times are within 60 seconds of actual. Including the predicted crunch times of the VLAR work on the CPU's.
> >
> > One of my computers just completed a VLAR CPU work unit. The cache size instantly doubled the estimated time to crunch the cache.
>
> These two statements are incompatible. Either the flops estimate in your app_info.xml file is accurate, in which case the cache estimate will remain unchanged on completion; or the cache estimates doubled because of a DCF change, in which case your flops value is wrong (by definition, because the sole purpose of the flops value is to normalise DCF).

And that is the point. The DCF makes a massive change upon the completion of a VLAR work unit, which throws everything off.

Could we not implement something like this:

If <flops> is specified, then restrict the DCF to a 5% maximum transition in each adjustment. The DCF would still settle somewhere, but not in one jump.
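The 5% clamp being proposed could be sketched like this (illustrative only; the one-step update rule here is a simplification, not the client's actual DCF code):

```python
def clamped_dcf_update(dcf, observed_ratio, max_step=0.05):
    """Move DCF toward the observed actual/estimated ratio, but by no more
    than max_step (5%) of its current value per completed task."""
    lo = dcf * (1 - max_step)
    hi = dcf * (1 + max_step)
    return min(max(observed_ratio, lo), hi)

dcf = 1.0
# A VLAR completion reports a ratio of 2.0; instead of jumping straight
# to 2.0, DCF creeps upward 5% per adjustment and settles over many tasks.
for _ in range(3):
    dcf = clamped_dcf_update(dcf, 2.0)
print(round(dcf, 6))  # 1.157625
```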
ID: 28195

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 28196 - Posted: 21 Oct 2009, 12:38:27 UTC - in response to Message 28186.  

> So this is pretty much a waste of my time as I stated earlier.

How is letting BOINC log extra information a waste of your time? Are you sitting inside the computer all the time while it does its work? Are you manually checking if it does its calculations correctly, making sure it is writing each byte to the correct sector of the right platter of the disk drive?

If it truly feels like a waste of your time - to let the software do extra logging - then stop with what you're doing. Delete the present cc_config.xml files, put back your own version (if you backed them up). Also uninstall the present development version of BOINC and return to the safety of the recommended version. Stop trying to learn about the software, just let it do its job. Only update to the next recommended version when BOINC tells you there is one.

You haven't posted any of the logs, neither here nor on the alpha email list. There's nothing we can check; we can only take your word for it that you see this happen. Which, when the software can log just about everything - even if you perceive that to be a waste of your time - is totally unnecessary.

Perhaps you are not aware of this, but there are certain rules to abide by when you try to report a bug. So far you haven't followed them. If you're interested, though, take a look at How to Report Bugs Effectively. As long as you don't perceive that to be a waste of your time either. Much reading to do.
ID: 28196

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28198 - Posted: 21 Oct 2009, 12:51:05 UTC
Last modified: 21 Oct 2009, 12:55:37 UTC

An 8 MB data file contained only 2 or 3 lines which referenced the DCF, and only 22 minutes of logged data. Anyone would say that such a data set is not large enough to evaluate.

I'm sorry, Jord. My observations of this problem are going to have to be sufficient. I see no way to get sufficient data on the DCF from this side of the project.

I can suggest that the DCF be locked to a 5% maximum change when the user specifies <flops> values; the system would then handle the cache of work much better.

What can it hurt to install one line of code and try it out?


Eric has spoken before about the massive shift of the DCF. Can someone invite Eric to read this forum message and possibly respond?
ID: 28198

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 28199 - Posted: 21 Oct 2009, 12:52:56 UTC - in response to Message 28198.  

> An 8 MB data file contained only 2 or 3 lines which referenced the DCF, and only 22 minutes of logged data.

After you put in the change to the max file size, did you restart BOINC? Rule of thumb: when you make changes to the cc_config.xml file, exit BOINC and restart it.
ID: 28199

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28200 - Posted: 21 Oct 2009, 12:56:21 UTC - in response to Message 28199.  
Last modified: 21 Oct 2009, 12:57:26 UTC

> > An 8 MB data file contained only 2 or 3 lines which referenced the DCF, and only 22 minutes of logged data.
>
> After you put in the change to the max file size, did you restart BOINC? Rule of thumb: when you make changes to the cc_config.xml file, exit BOINC and restart it.


Yes.........I restarted the entire farm of computers. There was still much info being logged even though I specified DCF only. The file size reached 8 MB in 22 minutes.
ID: 28200

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 28201 - Posted: 21 Oct 2009, 13:31:47 UTC

Then either increase the max file size (add a zero to the end), or reduce the amount of things you debug.

You want to know about the DCF?
Then <dcf_debug> and <cpu_sched_debug> are probably the only ones you need. Disable the others, or remove them from your cc_config.xml file.

It doesn't hurt to read a little on what the flags do, you know? I have posted what they do plenty of times in this thread.

But if you're not sure:
<cpu_sched_debug>: problems involving the choice of applications to run.
<work_fetch_debug>: problems involving work fetch (which projects are asked for work, and how much).
<rr_simulation>: problems involving jobs being run in high-priority mode.
<dcf_debug>: shows the calculation of the duration correction factor per project.
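Putting that advice together, a cc_config.xml that logs only those two flags (same layout as the example earlier in the thread) would be:

```xml
<cc_config>
  <log_flags>
    <dcf_debug>1</dcf_debug>
    <cpu_sched_debug>1</cpu_sched_debug>
  </log_flags>
  <options>
    <max_stdout_file_size>8388608</max_stdout_file_size>
  </options>
</cc_config>
```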

ID: 28201

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28202 - Posted: 21 Oct 2009, 14:05:39 UTC

Well..................

I must eat a little crow here and apologize to one and all.

I have come to the conclusion now that my flops estimates are not "spot on", and if they were I would not be seeing this problem.

I will keep working to balance the <flops> and see if I reach nirvana.

Again....sorry for the uproar.
ID: 28202

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 28204 - Posted: 21 Oct 2009, 14:22:49 UTC

Aw shucks. Just as I'd finished looking up the first appearance of the flopcount in print.

But it's still relevant to make the point I was going to make. DCF is a safety valve. It's designed to react quickly when things go wrong, and to reset gradually as they settle back down. But there are problems when, as with SETI at the moment (and ever since that first post on 8 March), three different applications with vastly different characteristics are fighting for control of the same safety valve. The remedy is well known, and has been accepted in principle by the developers: separate DCF values for each app_version. But that's been fighting for developer time with every other item on every wish list on every project.

In the meantime, there is a mechanism which can be fine-tuned (even coarse-tuned will do) to help smooth things over pro tem. But it does require care and attention to detail. There are step-by-step posts by MarkJ and Pappa on the main SETI board: use those, not the values in that Beta post I linked - life was different then. And don't be afraid to post here - but with actual figures, copied and pasted from your computer, please: just re-typing what you believe the figures to be can add even more confusion.
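I can't reproduce MarkJ's and Pappa's step-by-step posts here, but the arithmetic behind that kind of tuning is roughly the following sketch. It assumes the client's estimate is rsc_fpops_est / flops * DCF (my understanding of the client of that era), and the function name and target value are illustrative, so treat it as a sketch rather than the published procedure:

```python
def flops_for_target_dcf(rsc_fpops_est, actual_runtime_s, target_dcf=0.2):
    """Choose a <flops> value so that DCF settles near target_dcf.

    DCF converges to actual_runtime / (rsc_fpops_est / flops),
    so solving for flops gives target_dcf * rsc_fpops_est / actual_runtime."""
    return target_dcf * rsc_fpops_est / actual_runtime_s

# Hypothetical task: estimated at 3e13 fpops, really takes 2 hours on this host.
print(flops_for_target_dcf(3.0e13, 7200))  # roughly 8.33e8
```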
ID: 28204

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28205 - Posted: 21 Oct 2009, 14:25:54 UTC

Yeah, DA's quote finally hit home. If the <flops> values are correct, then the DCF is not needed at all.
ID: 28205

Geek@Play
Joined: 20 Jan 09
Posts: 70
United States
Message 28247 - Posted: 22 Oct 2009, 21:19:52 UTC

Well, I have set up the <flops> numbers using the formulas as provided. I think the biggest problem I had before was that I did the same procedure but rounded to numbers like X.XXXe09 and used that. Apparently extreme accuracy is required here. At this instant my DCF is as follows on 5 machines.

0.1760
0.1750
0.1909
0.2380
0.2119

Now my question is: after all this work to get the DCF down near the 0.2 range, why is that the desirable value?

I do understand that with SETI, Astropulse and CUDA all using the same DCF, it will never be truly stable, and the crunch time estimates will always have some error until separate per-application DCFs are available.

Thanks to all who tolerated my ramblings earlier and finally got me on track. Must remember to take the medications sometimes.
ID: 28247

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 28248 - Posted: 22 Oct 2009, 21:34:22 UTC - in response to Message 28247.  

> Now my question is. After all this work to get the DCF down near the 0.2 range, why is that the desirable value?

I don't think it is. You're getting close to the lower limit of the DCF (0.02), beyond which it's ignored and you'll only ask for 1 second's worth of work.

Isn't the SETI DCF value you want somewhere between 0.5 and 1 for normal apps, and 0.3 and 0.8 for optimised apps? Perhaps CUDA is throwing the spanner in the works here.

> I do understand that with SETI, Astropulse and CUDA all using the same DCF, it will never be truly stable, and the crunch time estimates will always have some error until separate per-application DCFs are available.

And even then it won't be stable, as the run times of tasks are not the same across the board. The Angle Range throws that off by a big amount. And that's to say nothing of the differences between normal and optimised apps.

But now you're starting to get an insight into the difficulties that the developers (full-time and volunteer) are faced with.

> Thanks to all who tolerated my ramblings earlier and finally got me on track. Must remember to take the medications sometimes.

Just remember we all have bad days. Although you usually don't see me around when I do have one. ;-)
ID: 28248

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 28249 - Posted: 22 Oct 2009, 21:35:32 UTC - in response to Message 28247.  

> Now my question is. After all this work to get the DCF down near the 0.2 range, why is that the desirable value?

My fault. I chose it.

No single overwhelming reason, really.

I do like projects which (slightly) over-estimate their initial running times, and hence slowly edge the DCF downward. That means you start by downloading relatively few WUs, and gradually download more as BOINC gains confidence. Compare that with a project which under-estimates the initial running time: you download too many before you know what's going on, and then have a struggle to complete them in time.

Also, IIRC, at the time I first chose the figure, the stock SETI application running on a fairly standard Core2 processor under Windows tended to settle at around DCF=0.2 (optimised AK_V8 could reach 0.1, and stock AP was about 0.4). It seemed best to choose a figure which would seem not too outrageous compared with what people had seen before, and hence avoid frightening the horses.

So that figure is SETI-specific, and related to the general downward drift in DCF as the stock application has improved, and as CPU architectures have improved. If I was starting with a clean sheet and none of that historical baggage, I'd probably have targeted 0.8.
ID: 28249

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.