Thread 'possible problem task switching every hour??'

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 52709 - Posted: 22 Feb 2014, 5:50:37 UTC I run BOINC from start using --detach and do not use BM. I assume the client, not BM, forces the task switch. OK, the problem: I just checked one system and noticed that two tasks which normally take 3 hours each (rainbow tables) on my 2 HD-5850s were obviously hung, as BoincTasks showed their combined time to be over 3 days. There were 43 MilkyWay tasks and 15 rainbow table tasks ready. I aborted the two hung tasks and observed 2 MilkyWay tasks start up. I then checked "messages" and also checked at the MilkyWay site: for the last 24 hours no MilkyWay tasks had been uploaded. MilkyWay tasks take only 15 minutes to execute, so it would appear that they never got a time slice while the rainbow table tasks were hung. Obviously there is a problem, but it seems to me that the other tasks should have received a slice every hour. I just recently started processing those rainbow tables as they perform very well on my (old) 5850. However, looking through their web site, I am concerned that they are providing tables for hackers, as all of the big crunchers are in China. ID: 52709 ·

Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15575	Message 52716 - Posted: 22 Feb 2014, 17:43:25 UTC - in response to Message 52709. I may read this wrong in your post, but you're telling us that two tasks were stuck on your GPU, perhaps for 3 days, right? Then how do you expect them to be swapped out for other tasks? Also, just aborting tasks may not fix the problem where remnants are kept in memory. I would also exit and restart BOINC after such a thing. Having typed all that, you don't say which BOINC version this was with and what it is you want remedied. For a next time, it's also perhaps wise to use a couple of debug flags to show in the log what is happening. ID: 52716 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 52932 - Posted: 3 Mar 2014, 14:35:04 UTC OK, so there is no preemptive switching of tasks when the GPU is involved. BC must ask the app to give up and never gets a response. Anyway, it happened again, and this time I looked at the DistRT web site, since their app is the cause, and sure enough there are complaints and a solution. It seems that if a CPU benchmark is run then the GPU dries up and gets locked waiting for data to arrive, and cannot recover from the temporary lack of data. This is a bug in their code, but they claim other projects do the same, so that excuses it. I made the recommended change to the cc file to not do any benchmarks. However, I do not remember specifically running any benchmarks, but it seems that must have happened, which caused their apps to hang "forever". ID: 52932 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5135	Message 52935 - Posted: 3 Mar 2014, 15:16:26 UTC - in response to Message 52932.
> OK, so there is no preemptive switching of tasks when the GPU is involved. BC must ask the app to give up and never gets a response. Anyway, it happened again, and this time I looked at the DistRT web site, since their app is the cause, and sure enough there are complaints and a solution. It seems that if a CPU benchmark is run then the GPU dries up and gets locked waiting for data to arrive, and cannot recover from the temporary lack of data. This is a bug in their code, but they claim other projects do the same, so that excuses it. I made the recommended change to the cc file to not do any benchmarks. However, I do not remember specifically running any benchmarks, but it seems that must have happened, which caused their apps to hang "forever".
GPU tasks can be and are preemptively switched, under the same rules as CPU tasks. The difference is that GPU tasks are completely removed from GPU memory when preempted (whatever the setting for 'leave applications in memory'), and restarted from the checkpoint file - does DistRT checkpoint?
There is one exception to the 'remove from memory' rule - when benchmarks are being run. Then, the GPU tasks are suspended, but kept in GPU memory - the same as happens with CPU tasks when LAIM is set. One or two projects had problems when that policy first started, but all that I know of have updated their code successfully.
I suggest you ask DistRT to explore the difference between "exit/restart" (full preemption) and "suspend/resume" (quick pause for benchmarking), and adapt their code to support the latter as well. Let's hope they are as responsive as Matt Harvey was to a similar discussion at GPUGrid. ID: 52935 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 52936 - Posted: 3 Mar 2014, 16:20:07 UTC - in response to Message 52935. Thanks Richard, I think that explains what happened. I also discovered that after I aborted the two stuck DistRT tasks, the MilkyWay tasks all got computation errors. It seems aborting the DistRT tasks must have left their code in the GPU, which caused MilkyWay to fault. I rebooted instead of just re-reading the cc_config file, and the MilkyWay tasks are now running correctly. As far as GPUGRID is concerned, I still have problems with their "long" tasks, which lock the display and the entire computer following a power glitch or outage. As a result, I only run GPUGRID tasks on systems that have a monitor attached, as it is difficult to reset their project before the NVIDIA driver crashes on powering up. Their suggestion was to put a 60 second delay before the project starts up, to allow time to reset the project, but that is not always workable, as well as being a PITA. It may be that they have fixed this problem since last month, but I don't want to test it by pulling the plug. ID: 52936 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5135	Message 52937 - Posted: 3 Mar 2014, 16:34:20 UTC - in response to Message 52936. GPUGrid updated their Windows 'Long Run' application on 23 Jan 2014 to fix that bug - I don't think Linux was affected by that particular problem (you don't mention an OS, but for some reason I had you mentally flagged as a Linux user?). I don't think DistRT and MilkyWay should be linked by "code left in the GPU" - if they are, BOINC has another problem that needs to be fixed. But then, I don't regard MilkyWay as having very high-quality code (measured by robustness to unexpected circumstances), judging by the number of unanswered error reports on their message boards. ID: 52937 ·

Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15575	Message 52938 - Posted: 3 Mar 2014, 16:42:15 UTC - in response to Message 52932.
> it seems that if a cpu benchmark is run then the gpu dries up
??? CPU benchmarks won't be run automatically. Haven't done so since alpha version 6.11.8. They're now only started once every 5 days after a client restart. Since a client restart will remove all remnants of tasks out of memory anyway, they can't interfere.
> client: if we successfully did CPU benchmarks, don't keep doing them every 5 days unless restart the client.
So the only way that your BOINC can do CPU benchmarks is if you chose to run an ancient version, or you told BOINC to do a CPU benchmark yourself. ID: 52938 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 52939 - Posted: 3 Mar 2014, 17:05:59 UTC - in response to Message 52938. OK, I now know what happened and how I caused it. Yesterday, on a Linux system, I noticed that DistRT was using very little CPU, so I set BC to 100% CPU to add the 4th core to the processing capability. This Linux system is USB based and only runs asteroids and DistRT under CUDA 1.1. That change, from 75% to 100%, caused the benchmark to run; I saw this while watching BoincTasks, as BT showed the suspension and the benchmark. This morning I noticed the DistRT task was taking too long, and after aborting it I checked my dual ATI 5850 system, and sure enough DistRT had been running for 3 days with progress stuck at 99%. I had done the same thing on that system a few days ago: made a change to the number of cores, which caused the benchmarks to run. What drew my attention to the problem was the very low temperatures that the GPU was reporting, plus very low CPU utilization. BTW, my Linux box reports temps to BoincTasks from the Ubuntu "sensors" app. I had to mod that app to send the temp data to BT's 31417 port on my monitoring system that runs BT. Anyway, I now have a handle on what happened and how to avoid it. ID: 52939 ·

Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15575	Message 52940 - Posted: 3 Mar 2014, 17:10:48 UTC - in response to Message 52939.
> That change, from 75% to 100%, caused the benchmark to run
Yes, that's a change that I forgot to add in my previous post. You can lower the number of CPU cores you want BOINC to run with without consequences, but raising it will trigger new benchmarks. Still, it's a user action that causes it, not an automation. ID: 52940 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5135	Message 52942 - Posted: 3 Mar 2014, 17:27:38 UTC - in response to Message 52938. Last modified: 3 Mar 2014, 17:30:31 UTC
> CPU benchmarks won't be run automatically. Haven't done so since alpha version 6.11.8. They're now only started once every 5 days after a client restart. Since a client restart will remove all remnants of tasks out of memory anyway, they can't interfere.
> client: if we successfully did CPU benchmarks, don't keep doing them every 5 days unless restart the client.
> So the only way that your BOINC can do CPU benchmarks is if you chose to run an ancient version, or you told BOINC to do a CPU benchmark yourself.
Hmmm. That may be what the documentation says (and I remember seeing it the first time round), but we don't necessarily believe everything we read, do we?
09-Dec-2013 19:03:48 [---] Starting BOINC client version 7.2.34 for windows_x86_64
10-Dec-2013 16:25:59 [---] Running CPU benchmarks
10-Dec-2013 16:25:59 [---] Suspending computation - CPU benchmarks in progress
10-Dec-2013 16:26:30 [---] Benchmark results:
10-Dec-2013 16:26:30 [---] Number of CPUs: 4
10-Dec-2013 16:26:30 [---] 2258 floating point MIPS (Whetstone) per CPU
10-Dec-2013 16:26:30 [---] 7475 integer MIPS (Dhrystone) per CPU
10-Dec-2013 16:26:31 [---] Resuming computation
29-Jan-2014 20:48:31 [---] Starting BOINC client version 7.2.39 for windows_x86_64
02-Feb-2014 15:10:30 [---] Running CPU benchmarks
02-Feb-2014 15:10:30 [---] Suspending computation - CPU benchmarks in progress
02-Feb-2014 15:10:30 [NumberFields@home] [cpu_sched] Preempting wu_sf2_DS-24x13_Grp509343of819200_0 (left in memory)
02-Feb-2014 15:10:30 [NumberFields@home] [cpu_sched] Preempting wu_sf2_DS-24x13_Grp492690of819200_2 (left in memory)
02-Feb-2014 15:10:30 [boincsimap] [cpu_sched] Preempting 20130921.931155_0 (left in memory)
02-Feb-2014 15:10:30 [SETI@home] [cpu_sched] Preempting 17au13ae.5760.10701.438086664203.12.190_0 (left in memory)
02-Feb-2014 15:10:30 [boincsimap] [cpu_sched] Preempting 20130921.932392_0 (left in memory)
02-Feb-2014 15:11:01 [---] Benchmark results:
02-Feb-2014 15:11:01 [---] Number of CPUs: 4
02-Feb-2014 15:11:01 [---] 2159 floating point MIPS (Whetstone) per CPU
02-Feb-2014 15:11:01 [---] 6831 integer MIPS (Dhrystone) per CPU
02-Feb-2014 15:11:02 [---] Resuming computation
I've got plenty of examples of benchmarks run immediately after restart, but those two weren't associated with a client startup or any configuration change.
Edit - that "[SETI@home] ... Preempting ... (left in memory)" line would have been a GPU task. ID: 52942 ·
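The gap between each client start and the following benchmark run can be read straight off log excerpts like the two quoted in this post. A minimal Python sketch, assuming the event-log layout shown here (`DD-Mon-YYYY HH:MM:SS [project] message`); the helper names `parse_line` and `benchmark_delays` are hypothetical and are not part of BOINC:

```python
import re
from datetime import datetime

# Month abbreviations are mapped by hand so parsing does not depend on locale.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Assumed layout: "09-Dec-2013 19:03:48 [---] Starting BOINC client ..."
LINE_RE = re.compile(
    r"^(\d{2})-([A-Z][a-z]{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}) \[(.*?)\] (.*)$")


def parse_line(line):
    """Return (timestamp, project, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None or match.group(2) not in MONTHS:
        return None
    day, mon, year, hh, mm, ss, project, message = match.groups()
    stamp = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return stamp, project, message


def benchmark_delays(lines):
    """Yield (start_time, benchmark_time, delay) for each logged benchmark run."""
    last_start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, _project, message = parsed
        if message.startswith("Starting BOINC client"):
            last_start = stamp
        elif message == "Running CPU benchmarks" and last_start is not None:
            yield last_start, stamp, stamp - last_start


# The start and benchmark lines from the two excerpts in the post.
LOG = """\
09-Dec-2013 19:03:48 [---] Starting BOINC client version 7.2.34 for windows_x86_64
10-Dec-2013 16:25:59 [---] Running CPU benchmarks
29-Jan-2014 20:48:31 [---] Starting BOINC client version 7.2.39 for windows_x86_64
02-Feb-2014 15:10:30 [---] Running CPU benchmarks
""".splitlines()

for start, bench, delay in benchmark_delays(LOG):
    print(f"started {start}, benchmarked {bench}, delay {delay}")
```

Both delays come out at many hours or days rather than seconds, which matches the observation that these two benchmark runs were not tied to a client startup.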

Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15575	Message 52944 - Posted: 3 Mar 2014, 17:35:42 UTC - in response to Message 52942. Well, then it's simple: You found a bug. Automated benchmarks should not happen, unless after a client restart, or after changing the number of CPU cores to use from a low amount to a higher one. ID: 52944 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5135	Message 52945 - Posted: 3 Mar 2014, 18:01:10 UTC - in response to Message 52944. Last modified: 3 Mar 2014, 18:06:16 UTC
> Well, then it's simple: You found a bug. Automated benchmarks should not happen, unless after a client restart, or after changing the number of CPU cores to use from a low amount to a higher one.
Right. For that second one, I've found
28-Jan-2014 15:09:56 [---] Starting BOINC client version 7.2.39 for windows_x86_64
...
28-Jan-2014 15:10:30 [---] Benchmark results:
Note that's exactly five days (to the second) before the benchmark I logged before. So I think an automatic benchmark will run if you:
* Boot and run benchmarks
* Reboot once, within 5 days and without changing version (or doing anything to trigger a manual benchmark)
* Leave running until 5 days are up.
Edit: that's what I read from the code change in http://boinc.berkeley.edu/trac/changeset/2985faec3ec529dd18b3ec5ef2755508e18999f3/boinc-v2
The variable declaration
static bool did_benchmarks = false;
may say 'static', but I doubt its value is preserved across a restart unless saved in client_state.xml - and I don't think it is. Feel free to tell him - he's ignoring my reports, and I don't feel strongly enough about this one to divert his attention. ID: 52945 ·
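The three-step scenario above can be checked with a toy model. This is a sketch of the behaviour hypothesised in this post, not BOINC source code: the class `ToyClient` and its methods are hypothetical, and it assumes that the time of the last benchmark survives a restart while the `did_benchmarks` flag lives only in memory.

```python
from datetime import datetime, timedelta

BENCHMARK_PERIOD = timedelta(days=5)  # the "every 5 days" interval from the thread


class ToyClient:
    """Hypothetical stand-in for one run of the client process."""

    def __init__(self, last_benchmark=None):
        # Assumed to be persisted across restarts (passed in by the caller).
        self.last_benchmark = last_benchmark
        # In-memory only: every new process starts with False again,
        # like the static bool in the quoted declaration.
        self.did_benchmarks = False

    def should_benchmark(self, now):
        if self.did_benchmarks:
            return False  # already benchmarked during this run
        if self.last_benchmark is None:
            return True
        return now - self.last_benchmark >= BENCHMARK_PERIOD

    def run_benchmark(self, now):
        self.last_benchmark = now
        self.did_benchmarks = True


# Step 1: boot and run benchmarks (timestamp from the 28-Jan log lines).
first_run = ToyClient()
first_bench = datetime(2014, 1, 28, 15, 10, 30)
first_run.run_benchmark(first_bench)

# Step 2: restart within 5 days; the flag is lost, the timestamp is kept.
restarted = ToyClient(last_benchmark=first_run.last_benchmark)

# Step 3: leave running until 5 days are up (timestamp from the 02-Feb log lines).
five_days_later = datetime(2014, 2, 2, 15, 10, 30)

print(five_days_later - first_bench)                # 5 days, 0:00:00
print(first_run.should_benchmark(five_days_later))  # False: flag still set
print(restarted.should_benchmark(five_days_later))  # True: flag was reset
```

Under these assumptions, only the restarted process benchmarks again at the five-day mark, which is the pattern seen in the logs.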

rebirther Send message Joined: 21 Jun 06 Posts: 156	Message 52947 - Posted: 3 Mar 2014, 20:28:32 UTC It's better to use the cc_config.xml:
<cc_config>
  <options>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
  </options>
</cc_config>
ID: 52947 ·
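If a cc_config.xml already exists with other settings, the option can be merged in rather than overwriting the file. A minimal Python sketch using only the standard library; the element names come from the post above, while the function name `set_skip_cpu_benchmarks` is hypothetical, and reading or writing the actual file in the BOINC data directory is left to the reader:

```python
import xml.etree.ElementTree as ET


def set_skip_cpu_benchmarks(xml_text, skip=True):
    """Return cc_config XML text with <skip_cpu_benchmarks> set, keeping other settings."""
    if xml_text.strip():
        root = ET.fromstring(xml_text)
    else:
        root = ET.Element("cc_config")  # start from an empty config
    if root.tag != "cc_config":
        raise ValueError("expected <cc_config> as the root element")

    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    flag = options.find("skip_cpu_benchmarks")
    if flag is None:
        flag = ET.SubElement(options, "skip_cpu_benchmarks")
    flag.text = "1" if skip else "0"

    return ET.tostring(root, encoding="unicode")


# Example: a config that so far only enables one log flag.
existing = "<cc_config><log_flags><cpu_sched>1</cpu_sched></log_flags></cc_config>"
print(set_skip_cpu_benchmarks(existing))
```

The client still has to re-read its config file (or be restarted) before the change takes effect, as mentioned earlier in the thread.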

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5135	Message 52948 - Posted: 3 Mar 2014, 20:44:15 UTC - in response to Message 52947.
> It's better to use the cc_config.xml:
It would be better if one developer fixed the bugs, rather than 200,000 users worked round them one by one. ID: 52948 ·

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.