Message boards : BOINC client : Strange behaviour upgrading from 6.12.34 to 7.2.42
Author | Message |
---|---|
Joined: 7 Sep 05 Posts: 130 |
I have a large number of linux boxes, mainly Intel dual and quad cores and about half of them with GPUs, both NVIDIA and AMD of quite recent vintage. These mainly crunch Einstein. Until a couple of months ago, I had been using 6.12.34 on most GPU endowed hosts and 6.10.58 on most CPU-only hosts. I had upgraded a few to 7.X.X for a variety of reasons such as experimenting with new features like app_config.xml. In the main, I was quite happy to leave things alone - if it ain't broke, don't fix it!

Towards the end of last year, I added POGS, initially to just a few, but over time to most hosts. This hasn't caused any issues - everything is working acceptably.

In the last couple of months, I've built some new boxes with AMD 7850 GPUs. I started these off with a 7.2.X BOINC and installed both EAH and POGS. They have all been transitioned (if necessary) to 7.2.42 after it became the recommended version. I don't recall seeing any unusual behaviour with these.

Recently, I decided to start upgrading all the older GPU hosts to 7.2.42. I've only done a relatively small number so far and all have been from 6.12.34. Every one has shown varying degrees of the same strange behaviour. Some have had POGS only on the CPUs whilst some have had both EAH and POGS running 50/50 on the CPUs. This is exactly what I wanted - equal numbers of available CPU cores running each project, independent of what the GPU was doing. I'm not sure this will continue under V7, but I guess I'll find out shortly :-).

Prior to upgrading, hosts had work caches of 0.01/2.50 days. Because of the change in meaning of those settings (and as a precaution) I changed locally to 0.40/0.01 days so that whatever strange things happened, there should be no big work fetches on restarting. In combination with the BOINC upgrade, I also decided to do an OS upgrade, since it's quite easy to do and none of these hosts had seen such an upgrade in at least 6-12 months.
A brand new fully up-to-date .iso for my distro had just been released so I built a liveUSB version from it on a USB external drive and it was all ready to go. The actual copying of files from the liveUSB takes just a minute or two. The time consuming bit is the post-install configuration which takes about 45 mins. The OS upgrade only touches stuff on the root partition. All the user stuff, including BOINC is on a second partition which gets mounted on /home and is quite unaffected by the OS install. The new version of BOINC is installed by copying the contents of the shell archive from Berkeley 'over the top' of the previous installation. For any new BOINC version, I run 'ldd' on the executables to make sure there are no missing shared libs.

When I restart BOINC, there are automatic benchmarks because of the version change, 6.12.34 -> 7.2.42. I really haven't paid any attention to actual before and after values but I assume there must be some changes because all tasks in the cache have their estimated run times quite reduced. This is the reason why I've developed the habit of lowering the cache settings before upgrading BOINC. This is not the 'strange behaviour' I'm posting about - it doesn't really bother me that much.

The strange behaviour happens sometime later when an 'in progress' task completes, or perhaps, more likely, when it actually gets reported. I haven't seen the precise moment because I've been doing these upgrades in the evening just before going to bed. When everything has restarted normally, I don't sit around for an hour or two to wait for a task to finish.

The strange behaviour is that, at some point after the restart, the previously reduced estimates get blown out dramatically. It happens to both EAH and POGS but not necessarily in the same proportion. The hosts I started doing had GTX650 GPUs crunching EAH BRP5 tasks 2x in about 5-6 hours. POGS tasks had been taking about 2.5 hours.
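
The 'ldd' check on the executables described above can be scripted so it is easy to repeat on many hosts. The following is a minimal Python sketch, assuming the usual `ldd` output in which an unresolved library appears as `name => not found`. The function names are illustrative and are not part of BOINC.

```python
import subprocess
from typing import List


def missing_libs(ldd_output: str) -> List[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is the text before the '=>' arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on one executable and list its unresolved libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libs(result.stdout)
```

Calling `check_binary` on each executable in the BOINC directory after copying in a new version should return an empty list when nothing is missing.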
Immediately after the upgrade, the estimates were, perhaps, around 3.5 hours for BRP5 and say 1.3 hours for POGS. On checking the next morning, BRP5 tasks are estimated at perhaps 10-15 hours and POGS might be around 20-30 hours and are running in panic mode. Both types of tasks are completing in the normal, expected run times. As they complete, EAH tasks show the normal DCF reduction so it doesn't take too long for a degree of sanity to return there. Even after quite a few POGS tasks have completed, the crazy estimates remain unchanged, and POGS tasks continue in panic mode until the cache is virtually empty.

I understand that EAH still uses DCF and that POGS uses something else (I've seen the <dont_use_dcf/> tag in the state file). I've also seen somewhere that a number of tasks for a particular application have to be completed for an average crunch time to be worked out. I thought that data was kept server-side, although I could be wrong about that.

With the very latest host to be upgraded, I didn't change the OS at all. I upgraded BOINC only and restarted immediately. Same initial reduction in estimates. This time, I had deliberately arranged to have a couple of tasks close to completion and was intending to watch. I had an inopportune phone call I couldn't avoid and missed the critical bit. When I got back, the tasks had completed and been reported and the estimates had blown out. The 'ready-to-start' POGS tasks were now estimated at around 100-110 hours each. The estimate for the EAH tasks had hardly changed. It's now 18 hours later and there are just two POGS tasks on board, with no spares - one at 12% after 20 mins and one at 8% after 13 mins. Both are showing about 105 hours still to run so no hope of downloading any more for a while. Even when they are around 75% done, they will still show an estimate of many 10's of hours still to go although the real time will only be 20-30 mins.
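
To put numbers on the gap between progress and displayed estimate, here is a minimal Python sketch. The first function simply extrapolates total run time from the progress figures quoted above. The second assumes, purely for illustration, that the client blends the static per-task estimate with that extrapolation, weighted linearly by fraction done. That blend is an assumption and has not been checked against the BOINC client source. The function names are hypothetical.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Total run time extrapolated purely from progress so far."""
    return elapsed_hours / fraction_done


def blended_remaining_hours(
    static_estimate_hours: float, elapsed_hours: float, fraction_done: float
) -> float:
    """Remaining time under an ASSUMED linear blend (not confirmed BOINC code).

    The progress-based projection gets weight `fraction_done`; the static
    estimate gets the rest, so a bad static estimate fades only slowly.
    """
    progress_based = projected_total_hours(elapsed_hours, fraction_done)
    total = (
        fraction_done * progress_based
        + (1.0 - fraction_done) * static_estimate_hours
    )
    return total - elapsed_hours
```

With the figures from the post, 12% after 20 mins extrapolates to about 2.8 hours in total and 8% after 13 mins to about 2.7 hours, both close to the usual 2.5 hours. Under the assumed blend, a 110 hour static estimate would still leave 27.5 hours showing at 75% done, which is at least consistent with the 'tens of hours' described above.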
Seems like a pretty pathetic algorithm for correcting the estimates in the latest BOINCs. This host has done close to 20 POGS tasks under the new BOINC version. Surely that's enough for BOINC to come up with a better estimate?

So I guess my questions are these:-
1. Has anybody seen any reports of behaviour similar to the above?
2. Is this likely to be some sort of artifact due to V6 -> V7?
3. When there is a BOINC version change, does a project such as POGS need to see a fresh number of new tasks completed before it can come up with a reasonable new estimate? If so, why, since there is no change of hardware or science app? Surely the assumption could be made that crunch times would stay much the same?
4. Why do the run time estimates for both projects initially reduce immediately after restarting and then blow out so dramatically (particularly for POGS) when the first tasks to complete under the new version are returned?
5. Is there any way at all to convince BOINC that the POGS tasks really don't need 110 hours and that something like 2.5 hours would be much more appropriate?
6. Why does the EAH estimate get blown out at all? It gets reduced from say 6 hrs to 3.5 hrs and then ends up in the 10-15 hr range at some point. I guess I need to observe exactly when that happens.
7. On thinking back over the several examples I've now witnessed, I have the impression that the extent of the blowout of the POGS estimate seems to be inversely related to the proportion of the task that gets crunched under the new BOINC version. For example, there were two almost finished POGS tasks in the latest case, which had by far the biggest increase in the estimate. Conversely, the EAH estimate hardly changed. Does this ring any bells for anybody?

I'll be doing a few more upgrades over the coming days and will try to keep better records of certain details.
I really don't have time to follow BOINC development closely so I'm really quite ill-informed about all the changes from the way things used to work in the 'good old days' :-). Any suggestions on how to debug this more sensibly/efficiently would be appreciated. |
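
One way to keep better records around each upgrade is to snapshot the per-project correction settings from the state file before and after the restart. The following is a minimal Python sketch, not a definitive tool. The `<dont_use_dcf/>` tag is the one mentioned in the post. The element names `project`, `project_name` and `duration_correction_factor` are assumptions about the layout of `client_state.xml` and should be checked against a real file. The sample values are invented.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def summarize_projects(state_xml: str) -> List[Tuple[str, Optional[float], bool]]:
    """Return (project name, DCF value or None, uses_dcf) for each project."""
    root = ET.fromstring(state_xml)
    rows = []
    for project in root.iter("project"):
        name = project.findtext("project_name", default="(unknown)")
        dcf_text = project.findtext("duration_correction_factor")
        dcf = float(dcf_text) if dcf_text is not None else None
        # A <dont_use_dcf/> child means the project opts out of DCF.
        uses_dcf = project.find("dont_use_dcf") is None
        rows.append((name, dcf, uses_dcf))
    return rows


# Invented sample data, shaped like the assumed state file layout.
SAMPLE = """<client_state>
  <project>
    <project_name>Einstein@Home</project_name>
    <duration_correction_factor>1.85</duration_correction_factor>
  </project>
  <project>
    <project_name>pogs</project_name>
    <duration_correction_factor>1.0</duration_correction_factor>
    <dont_use_dcf/>
  </project>
</client_state>"""
```

Running `summarize_projects` on the contents of the state file just before the upgrade, just after the restart, and again after the first task is reported would show at which step the values change.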
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.