Thread 'Why is BOINC recognizing all four TitanX GPUs but fails to assign to them randomly?'

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67009 - Posted: 20 Jan 2016, 19:37:41 UTC

I noticed this last night and it is driving me crazy!

I have this Quad Titan X machine. The GPUs and the CPU are liquid cooled with those sealed dedicated units, so the room is always warm. Every time I look at it, or touch it, all five fans are blowing out hot air. But last night I touched them, and one was cold. Hmmm... Checked BOINC, and sure enough, #2 (out of #0...#3) wasn't running anything despite the many GPU jobs waiting...

I looked at the Event Log, at the top, and it did recognize all 4 cards. Yikes! Did I burn out an X?? But I am also running MSI Afterburner on all my machines (none of which is having this problem, and one is Dual X and one is Dual Z) to see card temps, load%, memusage%, clock etc., and it is still getting data from all cards. Hmmmm....

I rebooted the machine. Now #3 isn't getting any jobs but #2 is. Huh? I turn on the coproc_debug flag for the Event Log. Yup, it is confirming assigning jobs to #0..#2, but not #3. I check the fans to see if somehow what BOINC called #2, it is now calling #3. Nope, a different fan is now blowing cold air, and the previous cold fan is now hot. Weird...

In frustration, since PhysX on the NVIDIA control panel was set to AutoAssign, I forced PhysX to only consider CPU and none of the GPUs (just in case, but the default "auto-assign" doesn't cause any problems on any of my other single or multi-GPU machines), made sure SLI is off, etc. I am using the latest 361.43 drivers, by the way. Reboot. And now, #2 and #3 are not used. Is this spreading?!

Just for kicks, I shut down BOINC, and physically switch the HDMI connection to my 4K monitor between cards. Sure enough, every single card shows activity when it is driving the monitor, as I can see in MSI Afterburner. So, no card is dead.

Frustrated, I go to bed, leaving only #0 and #1 churning. In the morning, I find the room hot again. Check the fans, all four hot.
Look at BOINC, yup, all GPUs are running tasks. An hour later, however, #1 and #3 are not used. What? Check the fans, and confirm that it is #3 and a new one, #1. So, now #0 and #2 are working. The "blindspot" has shifted & split...

I shut down BOINC, set MSI Afterburner to record a log, run 3DMark, their high level test (Fire-something), look at the log, and all GPUs are firing at max. GPUs are healthy! And, due to the individual liquid cooling per card, none of the cards ever even gets close to 60C; they usually stay at 30-45C with the fans at around 30%, only hitting 50-55C briefly at times, which is something to be very happy about. Plus, the temperature envelope for these cards is in the 80s, so that is not the problem. These are also the temps I was seeing in MSI Afterburner when all four of my GPUs were showing 100% GPU load earlier yesterday, when BOINC was able to assign jobs to all of them.

Of course I am and have been running the latest BOINC software; see below. No, I am not overclocking/overvoltaging/overanything these cards. They are Titan X SC models the way they were set in the factory, with the Hybrid kit slapped on them.

Snippets from my current BOINC session:

1/20/2016 10:53:03 AM | | Starting BOINC client version 7.6.22 for windows_x86_64
1/20/2016 10:53:03 AM | | log flags: file_xfer, sched_ops, task, coproc_debug, unparsed_xml
1/20/2016 10:53:03 AM | | Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
....
1/20/2016 10:53:04 AM | | [coproc] launching child process at C:\Program Files\BOINC\boinc.exe
1/20/2016 10:53:04 AM | | [coproc] relative to directory C:\ProgramData\BOINC
1/20/2016 10:53:04 AM | | [coproc] with data directory "C:\ProgramData\BOINC"
1/20/2016 10:53:06 AM | | CUDA: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 361.43, CUDA version 8.0, compute capability 5.2, 4096MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | CUDA: NVIDIA GPU 1: GeForce GTX TITAN X (driver version 361.43, CUDA version 8.0, compute capability 5.2, 4096MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | CUDA: NVIDIA GPU 2: GeForce GTX TITAN X (driver version 361.43, CUDA version 8.0, compute capability 5.2, 4096MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | CUDA: NVIDIA GPU 3: GeForce GTX TITAN X (driver version 361.43, CUDA version 8.0, compute capability 5.2, 4096MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | OpenCL: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 361.43, device version OpenCL 1.2 CUDA, 12288MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | OpenCL: NVIDIA GPU 1: GeForce GTX TITAN X (driver version 361.43, device version OpenCL 1.2 CUDA, 12288MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | OpenCL: NVIDIA GPU 2: GeForce GTX TITAN X (driver version 361.43, device version OpenCL 1.2 CUDA, 12288MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | OpenCL: NVIDIA GPU 3: GeForce GTX TITAN X (driver version 361.43, device version OpenCL 1.2 CUDA, 12288MB, 4025MB available, 7468 GFLOPS peak)
1/20/2016 10:53:06 AM | | [coproc] NVIDIA library reports 4 GPUs
1/20/2016 10:53:06 AM | | [coproc] No ATI library found.
....
1/20/2016 10:53:06 AM | | Processor: 16 GenuineIntel Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz [Family 6 Model 63 Stepping 2]
1/20/2016 10:53:06 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 smep bmi2
1/20/2016 10:53:06 AM | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.10586.00)
1/20/2016 10:53:06 AM | | Memory: 63.90 GB physical, 73.40 GB virtual
1/20/2016 10:53:06 AM | | Disk: 476.39 GB total, 392.29 GB free
1/20/2016 10:53:06 AM | | Local time is UTC -8 hours
1/20/2016 10:53:06 AM | | VirtualBox version: 5.0.10
1/20/2016 10:53:06 AM | | Config: don't suspend NCI tasks
1/20/2016 10:53:06 AM | | Config: event log limit 10000 lines
1/20/2016 10:53:06 AM | | Config: report completed tasks immediately
1/20/2016 10:53:06 AM | | Config: use all coprocessors
....
1/20/2016 11:07:41 AM | Milkyway@Home | [coproc] NVIDIA instance 0; 1.000000 pending for de_80_DR8_Rev_8_5_00004_1446686708_40269073_2
1/20/2016 11:07:41 AM | Asteroids@home | [coproc] NVIDIA instance 0; 1.000000 pending for ps_151226_input_13038_16_0
1/20/2016 11:07:41 AM | Milkyway@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for de_80_DR8_Rev_8_5_00004_1446686708_40269073_2
1/20/2016 11:07:41 AM | Asteroids@home | [coproc] NVIDIA instance 2: confirming 1.000000 instance for ps_151226_input_13038_16_0
1/20/2016 11:07:41 AM | SETI@home Beta Test | [coproc] Assigning NVIDIA instance 1 to 16no11aa.11914.18087.6.40.96_0
1/20/2016 11:07:41 AM | Collatz Conjecture | [coproc] Assigning NVIDIA instance 3 to collatz_sieve_2518255562788568039424_6597069766656_1
....
1/20/2016 11:27:18 AM | SETI@home Beta Test | [coproc] NVIDIA instance 0; 1.000000 pending for 16no11aa.11914.18087.6.40.27_1
1/20/2016 11:27:18 AM | Asteroids@home | [coproc] NVIDIA instance 0; 1.000000 pending for ps_151226_input_13039_25_1
1/20/2016 11:27:18 AM | Collatz Conjecture | [coproc] NVIDIA instance 0; 1.000000 pending for collatz_sieve_2518259560612846632960_6597069766656_0
1/20/2016 11:27:18 AM | Poem@Home | [coproc] NVIDIA instance 0; 1.000000 pending for poempp_1vii_1453239133_373521905_0
1/20/2016 11:27:18 AM | SETI@home Beta Test | [coproc] NVIDIA instance 0: confirming 1.000000 instance for 16no11aa.11914.18087.6.40.27_1
1/20/2016 11:27:18 AM | Asteroids@home | [coproc] NVIDIA instance 1: confirming 1.000000 instance for ps_151226_input_13039_25_1
1/20/2016 11:27:18 AM | Collatz Conjecture | [coproc] NVIDIA instance 2: confirming 1.000000 instance for collatz_sieve_2518259560612846632960_6597069766656_0
1/20/2016 11:27:18 AM | Poem@Home | [coproc] NVIDIA instance 3: confirming 1.000000 instance for poempp_1vii_1453239133_373521905_0

Aaaaaaand, during the half hour it took me to write this post with all the copy/paste, right now, all 4 GPUs are being used.

No, the different flips between use/no-use states are not the transient times between a task being reported and a new one starting. I have ReportTasksImmediately turned on (see above), and these state changes are measured in hours between different sets of GPUs being used.

Has anybody else seen this? Any ideas? Any further debugging flags to be turned on to see what is going on when BOINC says "Assigning NVIDIA instance N to blahblah" without a matching "NVIDIA instance N: confirming...."?

Thanks
Tuna
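The mismatch asked about above ("Assigning NVIDIA instance N" with no matching "confirming" line) can be searched for mechanically in a saved Event Log. The following is a minimal Python sketch, not a BOINC tool: the two regular expressions are assumptions derived only from the [coproc] lines quoted in this post, and the function name is hypothetical. Because the quoted snippet is elided, both assigned tasks show up as unconfirmed here; in a full log a later scheduling pass would normally confirm a task that actually started.

```python
import re

# Patterns assumed from the [coproc] lines quoted in the post above.
ASSIGN = re.compile(r"Assigning NVIDIA instance (\d+) to (\S+)")
CONFIRM = re.compile(r"NVIDIA instance (\d+): confirming [\d.]+ instance for (\S+)")


def unconfirmed_assignments(lines):
    """Return {task_name: instance} for tasks that were assigned a GPU
    instance and never show a later 'confirming' line."""
    pending = {}
    for line in lines:
        m = ASSIGN.search(line)
        if m:
            pending[m.group(2)] = int(m.group(1))
            continue
        m = CONFIRM.search(line)
        if m:
            # A later confirmation clears the pending assignment.
            pending.pop(m.group(2), None)
    return pending


# Lines copied from the log excerpt in this post.
sample_lines = [
    "1/20/2016 11:07:41 AM | Milkyway@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for de_80_DR8_Rev_8_5_00004_1446686708_40269073_2",
    "1/20/2016 11:07:41 AM | Asteroids@home | [coproc] NVIDIA instance 2: confirming 1.000000 instance for ps_151226_input_13038_16_0",
    "1/20/2016 11:07:41 AM | SETI@home Beta Test | [coproc] Assigning NVIDIA instance 1 to 16no11aa.11914.18087.6.40.96_0",
    "1/20/2016 11:07:41 AM | Collatz Conjecture | [coproc] Assigning NVIDIA instance 3 to collatz_sieve_2518255562788568039424_6597069766656_1",
]

for task, instance in unconfirmed_assignments(sample_lines).items():
    print(f"instance {instance}: assigned to {task}, no later 'confirming' line")
```

To run it against a real log, the list could be replaced by the lines of the client's saved log file; the file name is left out here because the thread does not state it.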

Agentb - Joined: 30 May 15 - Posts: 265 - Message 67011 - Posted: 20 Jan 2016, 21:46:14 UTC - Last modified: 20 Jan 2016, 21:50:14 UTC

Hi Tuna

Strange indeed....

It's not clear from your post, but I am guessing: tasks complete in the normal time? They do not abort or come back invalid? They just finish and nothing is rescheduled (until some random later time)?

What happens if you disable "report completed tasks immediately"?

Some of the debug switches are here: boinc client debug

I think I'd look at some of these (quote from the wiki link):

<cpu_sched> CPU scheduler actions (preemption and resumption).
<cpu_sched_debug> Explain CPU scheduler decisions.
<sched_op_debug> Details of scheduler RPCs; also shows deferral intervals and other low-level info.
<slot_debug> Prints messages about allocation of slots, creating/removing files in slot dirs.
<state_debug> Show summary of client state after scheduler RPC and garbage collection.
<suspend_debug> Show details of processing and network suspend/resume.
<task_debug> Low-level details of process start/end (status codes, PIDs etc.), and when applications checkpoint.
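For readers who prefer to set these flags in a file rather than through the Manager, here is a minimal Python sketch that writes them out in the cc_config.xml layout (a `<log_flags>` element inside `<cc_config>`). The flag names are the ones listed in this post plus coproc_debug from the first post; the helper name is hypothetical, and the output should be merged by hand with any existing cc_config.xml in the BOINC data directory rather than overwriting it.

```python
# Flags named in this thread; each becomes <flag>1</flag> inside <log_flags>.
FLAGS = [
    "coproc_debug",
    "cpu_sched",
    "cpu_sched_debug",
    "sched_op_debug",
    "slot_debug",
    "state_debug",
    "suspend_debug",
    "task_debug",
]


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given log flags."""
    lines = ["<cc_config>", "  <log_flags>"]
    lines += [f"    <{name}>1</{name}>" for name in flags]
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


print(build_cc_config(FLAGS))
```

Expect the Event Log to grow quickly with all of these enabled, so it is reasonable to switch them off again once the problem has been captured.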

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67012 - Posted: 20 Jan 2016, 22:05:43 UTC

Everything else is normal, as far as I can tell. The GPUs that BOINC is able to schedule do get tasks. The CPU seems totally unaffected. The job durations on CPUs and GPUs seem on par (although that is hard to tell since I am running 50 projects on 10 machines). I can tell, for instance, that SETI@Home Beta has received 394 completed & validated CPU tasks, mix of OpenCL/Cuda42. And, POEM has received 66 OpenCL/GPU tasks completed & validated. Similarly for Asteroids, 49 validated cuda55 jobs. All over the last 2-3 days. Etc. So, things seem to work otherwise, except that sometimes a GPU is not "task worthy".

Even though a bunch of the flags you mention are CPU-specific, I will turn them all on and see if there are any clues...

Anybody else?

Thanks
Tuna

Richard Haselgrove - Volunteer tester - Help desk expert - Joined: 5 Oct 06 - Posts: 5131 - Message 67013 - Posted: 20 Jan 2016, 22:16:38 UTC - in response to Message 67012.

<cpu_sched_debug> covers the whole scheduler process, including both CPUs and GPUs - it simply hasn't been renamed since GPUs were added to the mix.

Are you using any app_config.xml files, especially to limit the maximum number of concurrent tasks for any application or project? I think I'm finding a reportable bug in that area, but it's not confirmed yet.

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67014 - Posted: 20 Jan 2016, 22:28:42 UTC - Last modified: 20 Jan 2016, 22:31:54 UTC

No app_config use at all. They scare me... :) [Unrelated, not to hijack this thread, but if there is a link you can provide about using app_config files, for the uninitiated in that dark art, I'd be thankful.]

I turned on all those flags. Currently BOINC is able to schedule on all 4 GPUs, so nothing to debug, but I am getting LOADS and LOADS of data in my EventLog. Even with my 10,000 lines of increased buffer setting, I might end up losing data by the time I notice there is an issue. :)

I did NOT turn off "report completed tasks immediately". Partly, I forgot. But, why should that help solve this problem? I just don't want many 100% tasks to collect in the client.

Tuna

Agentb - Joined: 30 May 15 - Posts: 265 - Message 67015 - Posted: 21 Jan 2016, 1:29:11 UTC - in response to Message 67014.

> I turned on all those flags. Currently BOINC is able to schedule on all 4 GPUs, so nothing to debug, but I am getting LOADS and LOADS of data in my EventLog. Even with my 10,000 lines of increased buffer setting, I might end up losing data by the time I notice there is an issue. :)

The wiki gives the name of the file that the GUI displays, in case the buffer is insufficient, or you forget to check it in time, etc. Yes, the logs will be large, and that is to be expected and desired.

> I did NOT turn off "report completed tasks immediately". Partly, I forgot. But, why should that help solve this problem? I just don't want many 100% tasks to collect in the client.

It appears, to my blind eye, to be a scheduling issue - you asked for some ideas, and this non-standard setting may affect scheduling.

Thank you for reporting progress so far and good luck with it.

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67057 - Posted: 21 Jan 2016, 18:01:17 UTC - in response to Message 67015.

Report: Since I turned on the flags, as far as I can tell, all GPUs are being scheduled. Either outputting all that debugging data is slowing down something just enough to make the blockage go away, or the problem was somehow related to the jobs across multiple projects that were in the queue at the time, or something else totally random. Or, maybe I have just been lucky during the last 24hrs.

I'll keep the flags on & keep watching. If there is a problem again, I will report back.

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67398 - Posted: 30 Jan 2016, 20:23:22 UTC - in response to Message 67057. - Last modified: 30 Jan 2016, 20:33:04 UTC

Caught it! It was only assigning to #0 and #3, not #1 and #2. I looked through the event log:

1/30/2016 12:10:13 PM | SETI@home Beta Test | [coproc] Assigning NVIDIA instance 1 to 16no11aa.32407.25020.12.46.1_2
1/30/2016 12:10:13 PM | | [slot] cleaning out slots/0: get_free_slot()
1/30/2016 12:10:13 PM | SETI@home Beta Test | [cpu_sched_debug] skipping GPU job 16no11aa.32407.25020.12.46.1_2; CPU committed
1/30/2016 12:10:13 PM | | [slot] removed file slots/0/init_data.xml
1/30/2016 12:10:13 PM | | [slot] removed file slots/0/boinc_temporary_exit
1/30/2016 12:10:13 PM | | [cpu_sched_debug] enforce_run_list: end

and

1/30/2016 12:13:07 PM | Poem@Home | [coproc] Assigning NVIDIA instance 2 to poempp_2k39_1453838965_1955717516_0
1/30/2016 12:13:07 PM | Poem@Home | [cpu_sched_debug] skipping GPU job poempp_2k39_1453838965_1955717516_0; CPU committed
1/30/2016 12:13:07 PM | | [cpu_sched_debug] enforce_run_list: end

Yes, all the 8 cores/16 threads of the 5960X CPU are committed at this time. And, SETI@home Beta Test needs 0.473 CPUs while Poem@Home needs 0.737. So, I understand that these guys don't get scheduled. However, there are a whoooooole bunch of other projects with tasks waiting with very low CPU need along with needing a GPU, like Asteroids@home tasks that need 0.01 CPU and 1 GPU, and SETI@Home tasks that need 0.04 CPU plus 0.3 GPU (I am using Lunatics & my own app_config for SETI). Those tasks don't even get considered. BOINC Mgr seems to just look at the same two POEM and SETI Beta tasks over and over again, fails to allocate CPU, goes back into waiting, tries again, etc.

Why doesn't it move on to something else that would keep the GPUs busy? Heck, it could give one Asteroids task to one GPU, and three SETIs to another, and make the host scream.

Tuna

PS: I have saved all the log/xml files modified during the last 30mins while I was looking at this. If someone "from the staff" wants to see them, I can share a link. Don't want to do it publicly since I don't want to go into each of ~40 files to scrub IDs/names etc.
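The behaviour described in this post can be illustrated with a toy model. The Python sketch below is one possible reading of the log lines above, not BOINC's actual scheduler code: `reserve_then_skip` reserves each idle GPU for the job at the head of the queue and leaves the GPU empty when that job's declared CPU share does not fit, while `backfill` keeps walking the queue, which is the alternative the post asks for. The CPU and GPU fractions are the ones quoted in the post; the job names, the queue order, and the 0.2 of spare CPU are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuJob:
    name: str
    cpu: float  # declared CPU fraction
    gpu: float  # declared GPU fraction


def reserve_then_skip(queue, free_cpu, n_gpus):
    """Reserve one GPU per job in queue order; a job whose CPU share
    does not fit is skipped and its reserved GPU stays idle."""
    started, idle_gpus = [], 0
    for gpu, job in enumerate(queue[:n_gpus]):
        if job.cpu <= free_cpu:
            started.append((job.name, gpu))
            free_cpu -= job.cpu
        else:
            idle_gpus += 1
    return started, idle_gpus


def backfill(queue, free_cpu, n_gpus):
    """Walk the whole queue and start every job that fits the remaining
    CPU budget and the remaining capacity of some GPU."""
    gpu_free = [1.0] * n_gpus
    started = []
    for job in queue:
        if job.cpu > free_cpu:
            continue  # move on instead of blocking the GPU
        for gpu, capacity in enumerate(gpu_free):
            if job.gpu <= capacity:
                gpu_free[gpu] = capacity - job.gpu
                free_cpu -= job.cpu
                started.append((job.name, gpu))
                break
    return started


# Two idle GPUs and a hypothetical 0.2 of a CPU still uncommitted.
queue = [
    GpuJob("seti_beta", 0.473, 1.0),
    GpuJob("poem", 0.737, 1.0),
    GpuJob("asteroids", 0.01, 1.0),
    GpuJob("seti_a", 0.04, 0.3),
    GpuJob("seti_b", 0.04, 0.3),
    GpuJob("seti_c", 0.04, 0.3),
]

print(reserve_then_skip(queue, 0.2, 2))
print(backfill(queue, 0.2, 2))
```

In this toy setup the first strategy starts nothing and leaves both GPUs idle, while the second places the Asteroids task on one GPU and the three SETI tasks on the other. Whether the real client can do this safely depends on scheduling rules the thread does not describe.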

Richard Haselgrove - Volunteer tester - Help desk expert - Joined: 5 Oct 06 - Posts: 5131 - Message 67399 - Posted: 30 Jan 2016, 20:38:41 UTC - in response to Message 67398.

Look at those SETI Beta tasks which are running (or compare the CPU time with the Elapsed time of tasks which are using the same application and have recently completed) to get a realistic measure of how much CPU is actually needed while the task is running.

If the SETI Beta tasks are the 'CUDA' type, I expect you'll find that they actually use far less than 47.3% of a CPU - though probably more than the 4% that I've defined for you in app_info.xml.

Use app_config.xml to define a value for <cpu_usage> which is closer to reality than BOINC's notoriously generous stock estimate. You could do the same for POEM, but I don't have a feel for what that figure would be.
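The suggestion above amounts to dividing CPU time by elapsed time for recently completed tasks and putting the result into `<cpu_usage>`. Here is a minimal Python sketch of that arithmetic, assuming the usual app_config.xml layout with a `<gpu_versions>` block; the app name `example_gpu_app` and the timing samples are hypothetical placeholders, and the real short app name has to be taken from the project itself.

```python
def estimate_cpu_usage(samples):
    """samples: list of (cpu_seconds, elapsed_seconds) for completed tasks.
    Returns the overall CPU fraction used while the tasks were running."""
    total_cpu = sum(cpu for cpu, _ in samples)
    total_elapsed = sum(elapsed for _, elapsed in samples)
    if total_elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_cpu / total_elapsed


def build_app_config(app_name, cpu_usage, gpu_usage=1.0):
    """Return app_config.xml text declaring CPU/GPU usage for one app."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        "    <gpu_versions>\n"
        f"      <gpu_usage>{gpu_usage:.3f}</gpu_usage>\n"
        f"      <cpu_usage>{cpu_usage:.3f}</cpu_usage>\n"
        "    </gpu_versions>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# Hypothetical timings: (CPU time, elapsed time) in seconds.
samples = [(120.0, 1000.0), (130.0, 1000.0)]
usage = estimate_cpu_usage(samples)
print(build_app_config("example_gpu_app", usage))
```

The resulting file belongs in that project's own folder under the BOINC data directory. A ratio close to 1 would mean the application really does occupy most of a core, in which case lowering `<cpu_usage>` would only hide the contention.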

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67400 - Posted: 30 Jan 2016, 20:46:43 UTC - in response to Message 67399.

Thanks, Richard. I will do that.

But the question remains: As there are tons of other WUs waiting around with declared use of 0.01 or 0.04 CPUs, how come the scheduler doesn't move on to one of those to keep the valuable GPU resource fully used? Since my "task switch every N minutes" is set to "60", does it simply reserve the GPUs to a particular set of projects for the current 60mins, and doesn't move beyond that? If so, that would seem wasteful.

Tuna

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67402 - Posted: 30 Jan 2016, 20:53:06 UTC - in response to Message 67399.

> Look at those SETI Beta tasks which are running (or compare the CPU time with the Elapsed time of tasks which are using the same application and have recently completed) to get a realistic measure of how much CPU is actually needed while the task is running.
>
> If the SETI Beta tasks are the 'CUDA' type, I expect you'll find that they actually use far less than 47.3% of a CPU - though probably more than the 4% that I've defined for you in app_info.xml.
>
> Use app_config.xml to define a value for <cpu_usage> which is closer to reality than BOINC's notoriously generous stock estimate. You could do the same for POEM, but I don't have a feel for what that figure would be.

And, they are OpenCL type, not CUDA type. As such (I think), their Run Time and CPU Time are very similar, unless there is another "elapsed time" entry I am missing: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=77274&offset=0&show_names=0&state=4&appid=

Tuna

Richard Haselgrove - Volunteer tester - Help desk expert - Joined: 5 Oct 06 - Posts: 5131 - Message 67403 - Posted: 30 Jan 2016, 21:35:02 UTC - in response to Message 67400.

> Thanks, Richard. I will do that.
>
> But the question remains: As there are tons of other WUs waiting around with declared use of 0.01 or 0.04 CPUs, how come the scheduler doesn't move on to one of those to keep the valuable GPU resource fully used? Since my "task switch every N minutes" is set to "60", does it simply reserve the GPUs to a particular set of projects for the current 60mins, and doesn't move beyond that? If so, that would seem wasteful.
>
> Tuna

That's a question which I intend to write up fully, with message logs and screen shots, for the boinc_alpha bug list, when I've recovered from the installer launch. I'm seeing it too.

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67404 - Posted: 30 Jan 2016, 21:36:40 UTC - in response to Message 67403.

Awesome! Thanks!!

Tuna Ertemalp - Joined: 23 Dec 13 - Posts: 45 - Message 67406 - Posted: 30 Jan 2016, 22:31:08 UTC - in response to Message 67402.

> And, they are OpenCL type, not CUDA type.

For the sake of completeness of data, even though irrelevant to the question at hand: On another machine, SETI Beta CUDA tasks claim 0.06 CPUs, unlike the OpenCL jobs' 0.473. So, it seems the folks behind these tasks thankfully try to do the right thing.

Tuna

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.