6.4.5 unnecessarily assigning high priority CUDA

Message boards : BOINC client : 6.4.5 unnecessarily assigning high priority CUDA
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21746 - Posted: 12 Dec 2008, 1:31:33 UTC
Last modified: 12 Dec 2008, 1:33:08 UTC

Just upgraded from 6.4.1 to 6.4.5 and noticed that my GPUGRID task is running at high priority. This causes a full CPU to be assigned to the task, which is unnecessary, IMHO. The WUs take just over 7 hours to complete on a 9800GTX+ and my deadlines are all under 48 hours. 6.4.1 had been working fine and I had 5 tasks running on a Q6700. Now I have only 4 tasks running, and every GPUGRID task is getting a dedicated CPU even when it is not needed.

http://swri.info/images/gpu_high_priority.png
ID: 21746
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 21781 - Posted: 13 Dec 2008, 23:41:16 UTC - in response to Message 21746.  
Last modified: 13 Dec 2008, 23:41:28 UTC

Have you tried putting it in low priority before claiming high priority is unnecessary?

It extremely slows down when run at low priority, because it can't feed the GPU fast enough.
ID: 21781
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 21787 - Posted: 13 Dec 2008, 23:54:40 UTC - in response to Message 21781.  

I don't think he was talking about the priority setting in the OS, but more about the message that the task was being run in high priority mode, aka EDF.
ID: 21787
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21795 - Posted: 14 Dec 2008, 14:45:18 UTC - in response to Message 21781.  
Last modified: 14 Dec 2008, 14:48:00 UTC

Have you tried putting it in low priority before claiming high priority is unnecessary?

It extremely slows down when run at low priority, because it can't feed the GPU fast enough.


Finally got around to answering this, as I was trying some different things and trying to debug why the GPUGRID task periodically hangs, causing all subsequent jobs to get a compute error. Enough of that, though, as it is a different thread.

Anyway, I went back to 6.4.1 but it made no difference; GPUGRID is still running in high priority mode. Another user had the same observation: he went back to 6.3.x and was still stuck in high priority mode. There must be some adaptive algorithm in place, and once the scheduler decides the GPU task needs to be high priority, it seems stuck in that mode. I must have gotten away with a week or two in standard priority before it became convinced I needed high priority.

I attached GPUGRID after setting the CPU count to 75% as suggested a month ago using 6.3.x, but after just a few days of watching WUs complete in under 11 hours with 96-hour deadlines, I set the CPU count to 100%. After setting 100%, which gives the so-called 4+1 CPUs, my system ran just fine and there were 5 "running" projects, no longer just 4. This ran fine for several weeks (5 tasks), and even though the GPU task was running at standard priority, it appeared to be capable of feeding the GPU, as it was still taking about 11 hours to finish a WU and the deadlines were still 96 hours. I averaged all my CPU times and elapsed times: on average it takes 880 seconds of CPU time to process 42,000 seconds of elapsed time (11 hours). This was while running 4+1 at standard BOINC priority.
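A quick back-of-the-envelope check of those averages (a Python sketch; the 880 s and 42,000 s figures are the ones quoted above):

```python
# Rough check of the averages quoted above: about 880 s of CPU time
# for every 42,000 s (11 h) of elapsed time on one GPUGRID task.
cpu_secs = 880.0
elapsed_secs = 42_000.0

fraction_of_core = cpu_secs / elapsed_secs
print(f"CPU needed to feed the GPU: {fraction_of_core:.1%} of one core")
# prints roughly 2.1% - nowhere near the full core that EDF reserves
```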

FWIW, I noticed the following: one of the other 4 tasks (this is a quad system with 4+1) would show a higher than expected time to complete. For example, if ABC was normally taking 5 hours, and I had 3 ABC tasks waiting to run, all 3 would show, say, 7 hours. However, as soon as GPUGRID completed a task, the 3 waiting ABC tasks would start dropping their expected time to complete from 7 hours back down to the normal 5 hours. It might take 4-5 hours, but eventually the ABC tasks would be back down to the exact 5 hours that they normally took. In the meantime, one of the other tasks would rise in expected completion time; for example, Einstein might have all of its waiting-to-run tasks increase from 11 hours to 13 hours. I assume this is because when the GPU task completed, its hard-threaded CPU was reassigned from the pool of 4 CPUs, and the GPU task got a different CPU and was no longer sharing the ABC CPU. I am just guessing, as I don't really know how Vista or BOINC assigns the CPUs. However, I did notice the rise and fall of expected completion time for one of the other 4 tasks, and assumed it was because they were sharing, or no longer sharing, a CPU with the GPU.

Some good things:
6.4.5 has the correct 11-hour or so expected time to complete. Previously (6.4.1) it might show 90 days to complete. Regardless of version, the actual elapsed time was about 11 hours. My system is a Q6700 (quad) with a 9800GTX+, not overclocked, no gaming, on Vista 64.

I do not think that BOINC gets enough info from CUDA to properly calculate the efficiency and utilization of the coprocessor. My CPU, a Q6700, is very efficient; the system is at home and used only occasionally, I am not running any games, and I have 4 GB of memory on a 64-bit OS. I also let Windows install updates automatically.

Possibly other GPU users have slower systems and use their GPU for gaming, and BOINC simply sets high priority to ensure those systems can finish in time.
ID: 21795
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 21805 - Posted: 14 Dec 2008, 23:31:47 UTC - in response to Message 21787.  

I don't think he was talking about the priority setting in the OS, but more about the message that the task was being run in high priority mode, aka EDF.

Oh, my bad...
ID: 21805
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21859 - Posted: 16 Dec 2008, 14:07:53 UTC

Just read in the DCF thread on the GPUGRID project: "server cannot distinguish between 8800 and GTX280" GPUs.

OK, that helps account for the high priority. Time to complete is calculated based on the really slow 8800 board, which is much slower than my 9800GTX, let alone the superior GTX280.

With SETI now in beta on GPUs, maybe the problems identifying coprocessor capability will get cleaned up.

In the meantime, when my 9800GTX locks up, the BOINC Manager still schedules jobs that get downloaded and rejected immediately.
ID: 21859
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 21860 - Posted: 16 Dec 2008, 14:15:23 UTC - in response to Message 21859.  

In the meantime, when my 9800GTX locks up, the BOINC Manager still schedules jobs that get downloaded and rejected immediately.

David is working on a fix for that, which will show up in a next client... it won't be fixed on the fly, on the server, or, sadly, by magic.
ID: 21860
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21863 - Posted: 16 Dec 2008, 16:35:02 UTC - in response to Message 21860.  
Last modified: 16 Dec 2008, 16:40:26 UTC

In the meantime, when my 9800GTX locks up, the BOINC Manager still schedules jobs that get downloaded and rejected immediately.

David is working on a fix for that, which will show up in a next client... it won't be fixed on the fly, on the server, or, sadly, by magic.


magic - MAGIC - - M A G I C

It happened (I think).

I upgraded the CUDA driver from 180.48 to 180.60 (probably should have done that earlier), rebooted (hanging my GPU in the process), and I am now seeing the BOINC Manager message "[error] Missing a CUDA coprocessor" from every project, not just GPUGRID.

The system is at home and I am accessing it via Remote Desktop from work. I have rebooted twice but the GPU is hung. Running the program GPU-Z gets an error message instead of displaying the GPU temps, so that kind of proves the GPU is hung.

From experience, I will have to get home and cycle the power to fix a GPU hang.

In the meantime, GPUGRID refuses to run and the status shows "Waiting to run (0.90 CPUs)". So the combination of seeing the status "Waiting to run" and reading the log message "Missing a CUDA coprocessor" tells me the GPU is hung. The fact that it is not going on to another task indicates it must know the GPU is hung.


Conceivably, the GPU could answer that it is present (i.e., not missing) yet still fail to initialize, which might still cause the jobs to be flushed. I won't know for sure for a couple of days.
ID: 21863
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 21869 - Posted: 16 Dec 2008, 20:09:18 UTC - in response to Message 21863.  

The system is at home and I am accessing via remote desktop from work.

Hasn't it been reported that CUDA processing is incompatible with RDP?

I've been seeing people advocating a switch to VNC if you need to fiddle with GPU drivers during working hours ;-)
ID: 21869
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 21870 - Posted: 16 Dec 2008, 21:32:06 UTC - in response to Message 21869.  

The system is at home and I am accessing via remote desktop from work.

Hasn't it been reported that CUDA processing is incompatible with RDP?

I've been seeing people advocating a switch to VNC if you need to fiddle with GPU drivers during working hours ;-)


I was unaware of any problem. However, since I seem to be reduced to grasping at straws, I will start using VNC.

This is a Home Premium 64-bit system that did not support RDP, but I downloaded some odds and ends that kludged RDP in.

Problem is that McAfee (since about 4 weeks ago) thinks that VNC is a trojan and deletes even the viewer from my office system. Ever since our management installed the brain-dead McAfee ePolicy Orchestrator, my security policy gets reverted back to "no VNC" 24 hours after I put VNC back in.
ID: 21870
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 21884 - Posted: 17 Dec 2008, 13:30:04 UTC

These images, supplied by BeemerBiker at SETI Beta, suggest that there is a serious DCF bug in BOINC v6.4.5

Default value expressed as a percentage rather than a fraction, anyone?

The Astropulse task (top image) will have a running time of approx 40-45 hours on his CPU.
ID: 21884
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 21885 - Posted: 17 Dec 2008, 13:36:17 UTC - in response to Message 21884.  

Default value expressed as a percentage rather than a fraction, anyone?

No, it just reached the highest value it can reach. There is a limit on how high DCF can get, to keep 0.1-second WUs that were estimated at 2 months (or vice versa) from pushing the DCF to an insane value that would take too long to come back to sanity.
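For illustration, the clamping described here might look something like this. This is a sketch, not the actual client code; the [0.01, 100] bounds and the raise-fast/lower-slow update rule are assumptions based on this discussion:

```python
# Sketch of a duration correction factor (DCF) update with clamping.
# The bounds and update rule are illustrative assumptions, not the
# actual BOINC client source.
def update_dcf(dcf, estimated_secs, actual_secs,
               dcf_min=0.01, dcf_max=100.0):
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        dcf = ratio                    # raise quickly when tasks overrun
    else:
        dcf += 0.1 * (ratio - dcf)     # lower slowly when tasks finish early
    return min(max(dcf, dcf_min), dcf_max)

# A 0.1-second estimate that actually runs for 2 months would otherwise
# drive DCF to ~5e7; the clamp stops it at 100.
print(update_dcf(1.0, 0.1, 5_184_000))
```

If the screenshots show exactly 100, that would be consistent with hitting such a cap rather than a percentage-vs-fraction bug.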

ID: 21885
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 21888 - Posted: 17 Dec 2008, 13:48:03 UTC - in response to Message 21885.  
Last modified: 17 Dec 2008, 13:51:53 UTC

Default value expressed as a percentage rather than a fraction, anyone?

No, it just reached the highest value it can reach. There is a limit on how high DCF can get, to keep 0.1-second WUs that were estimated at 2 months (or vice versa) from pushing the DCF to an insane value that would take too long to come back to sanity.

The host is a new join on a new project. Look at the Task list: no tasks have been reported yet (tasks are never purged at SETI Beta). I don't know whether BeemerBiker has actually completed any tasks yet - an upload server failure at the project has prevented any tasks reporting since approximately 17:30 UTC yesterday.

But SETI tasks are usually over-estimated: I'm running the same tasks on the CPU-only version of the program, and seeing reasonable DCF values below 0.5000

So either it's the default value (if no CUDA tasks have exited), or an extreme mis-interpretation of the running time (if they have exited but not been reported). Either way, it's going to be a major problem for projects like SETI where CUDA and CPU programs run on the same project and share a DCF. That's a bug, in my book.

Edit - looking back at BeemerBiker's opening posts in this thread, the immediate EDF mode reported reinforces my hunch that it's an initial DCF problem.
ID: 21888
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 21890 - Posted: 17 Dec 2008, 13:51:19 UTC - in response to Message 21888.  

Either way, it's going to be a major problem for projects like SETI where CUDA and CPU programs run on the same project and share a DCF. That's a bug, in my book.

Write a Trac ticket about it... that is the bug database.
ID: 21890
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 21893 - Posted: 17 Dec 2008, 13:59:39 UTC - in response to Message 21890.  

Either way, it's going to be a major problem for projects like SETI where CUDA and CPU programs run on the same project and share a DCF. That's a bug, in my book.

Write a Trac ticket about it... that is the bug database.

I usually like to wait for confirmation from at least two matching reports before bloating the trac database. Not having a CUDA card, I don't feel like upgrading any of my machines to 6.4.5 yet, 'recommended' 'stable' or not. Besides, my existing machines are already joined to projects, including CPDN: I don't want to risk those tasks having their estimates multiplied by 100!

These posts are by way of a public (more public than trac) "heads up": perhaps other people installing v6.4.5 and joining new projects such as SETI Beta could watch the DCF values as they go through the process, and report.
ID: 21893
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 22789 - Posted: 31 Jan 2009, 15:23:56 UTC
Last modified: 31 Jan 2009, 16:06:23 UTC

I think I've worked out one of the reasons why we get these 'high priority' complaints about v6.4.5

Refer to


(direct link)

This is from a quad (one of my Q6600s) with a single 9800GT. A CUDA task is running, and there are other projects not visible at the top of the screen (2x CPDN Beta, 1x CPDN, 2x Astropulse). I'm running a cc_config.xml with <ncpus> set to 5: the SETI app doesn't use enough CPU to need a full core, and now that the thread priorities (as opposed to the BOINC priorities) have been sorted out, it runs fast enough even when just scavenging CPU cycles from other BOINC tasks. The task switch interval is the default 60 minutes, and the cache size is currently set to 2 days.

The interesting project is Einstein. There are four tasks running high priority, plus a fifth task waiting to run. I find that when running multiple projects like this, BOINC v6.4.5 gets itself into two minds about task switching. First it wants to pre-empt SETI (big short-term debt at the moment), so it starts five CPU tasks (as per cc_config); then it realises that the GPU is idle, so it starts a sixth task to utilise the CUDA resource. Then it realises that it's got too many cores running, and suspends the CPU task it has just started.

The particular problem with Einstein is that tasks take a lot of preparation time before they get started (much longer than the 9 seconds BOINC allowed this task to run). So we have a task with positive CPU time and zero progress - BOINC worries that the runtime will be infinite, and that's why it runs the rest of the Einstein tasks in EDF. At five hours per task, and a fortnight before the first deadline, there's no other reason for the panic that I can see.
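That failure mode can be sketched like this (illustrative Python, not the client's actual estimator):

```python
# Sketch of why a task with CPU time but no progress triggers panic:
# if remaining time is projected from fraction done, then ~0% progress
# makes the projection blow up. Illustrative only, not BOINC source.
def projected_remaining(elapsed_secs, fraction_done):
    if fraction_done <= 0.0:
        return float("inf")            # no progress yet => "endless" task
    return elapsed_secs * (1.0 - fraction_done) / fraction_done

print(projected_remaining(9.0, 0.0))    # the 9-second Einstein task: inf
print(projected_remaining(9.0, 0.001))  # even 0.1% done => 8991.0 s
```

An infinite (or huge) projected remaining time against any finite deadline would be enough to push the whole project into EDF.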

I've looked through the change logs since v6.4.5, and I can't see any attention being given to task switching: anyone spot any flaws in my logic before I raise a trac ticket?

Afterthought: a nasty interaction between resource share, sub-projects, TDCF and deadlines.

At the moment, my SETI resource share is below 20%, so SETI (my only CUDA project) is pushed into massive short-term debt by the imperative to keep the CUDA card loaded. That means that my Astropulse tasks, which are also classed as part of the SETI project, never get a chance to run.

At the same time, of course, my SETI TDCF is being stabilised at the appropriate value for the CUDA app - currently 0.0423. At this value, BOINC is estimating that my Astropulse will complete in 5:01:57

Yet even using the fastest-available third-party optimisations for Astropulse (as I am), they will take at least 10 hours - TDCF should be around 0.1 for accurate estimation of optimised AP tasks, or 0.4 for SETI's stock AP application.

BOINC will eventually get round to running AP in high priority, of course, but because of the TDCF basis error, it may well not start them in time - it certainly wouldn't give them the full 42 hour runtime that would be needed if I was using stock AP apps.

There was talk on the mailing list of getting TDCF split out into a 'per application' record: this example shows why that development work should be "high priority".
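The numbers above can be put together in a small sketch. The raw server estimate is backed out from the 5:01:57 figure shown at TDCF 0.0423; the 0.1 and 0.4 values are the ones quoted in the post:

```python
# Why one shared DCF can't serve both the CUDA app and Astropulse:
# estimated runtime = raw server estimate * DCF. The raw AP estimate
# is backed out from the 5:01:57 (18117 s) shown at DCF 0.0423.
raw_ap_estimate = 18117 / 0.0423       # ~428,300 s

for dcf, label in [(0.0423, "shared CUDA DCF"),
                   (0.1,    "optimised AP"),
                   (0.4,    "stock AP")]:
    hours = raw_ap_estimate * dcf / 3600
    print(f"DCF {dcf:<6} ({label}): {hours:5.1f} h")
# shared CUDA DCF -> 5.0 h, optimised AP -> 11.9 h, stock AP -> 47.6 h
```

With a single shared value, whichever application stabilises the DCF last gets accurate estimates, and the other is off by an order of magnitude.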

(But I can micro-manage my way out of the current problem.)
ID: 22789
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 22790 - Posted: 31 Jan 2009, 16:31:11 UTC

Now you can see it all on one page:


(direct link)

Note the six running tasks, and the Astropulse at the top (oldest tasks in cache except CPDN, still unstarted, wrong time estimate).

Strangely, it's now decided that the least-processed Einstein task from last time is now safe to pre-empt, even though the most-processed one is still in EDF.
ID: 22790


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.