BOINC has stopped using my second GPU

rtX

Joined: 6 May 06
Posts: 33
United Kingdom
Message 97452 - Posted: 10 Apr 2020, 9:38:58 UTC

I have a box running two GTX 1060 GPUs. I recently installed BOINC on it and it was running both GPUs perfectly. It has suddenly started using only one of the GPUs (device 0). I was running GPUGrid and it was running fine on both. Now only one of the 4 GPUGrid WUs is running; the other 3 currently on the system are waiting to run. I added the recommended use_all_gpus option to the cc_config.xml file (which I created using Notepad++) and restarted the client, with no success: still only 1 GPU is being used. I also added Einstein@Home on the slight chance that it was a GPUGrid issue. The Einstein@Home WU is also waiting to run.

Any suggestions?

PS: Where is the event log stored?

Many thanks.
ID: 97452 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 97453 - Posted: 10 Apr 2020, 9:46:46 UTC - in response to Message 97452.  

Event Log is accessed from BOINC Manager.

Tools menu (in either Simple or Advanced view), bottom item.

Keyboard shortcut Ctrl+Shift+E

It's best to look at it immediately after (re-)starting BOINC: for situations like this, the initial lines after startup are most useful.
ID: 97453 · Report as offensive
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 97454 - Posted: 10 Apr 2020, 9:53:09 UTC

Also, please enable the coproc_debug flag in cc_config.xml (or via the event log flags option in Advanced view), then restart BOINC and post the messages.
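
For reference, here is a minimal sketch of a cc_config.xml carrying both settings mentioned so far in this thread: use_all_gpus (under options) and coproc_debug (under log_flags). The XML is held in a Python string so it can be checked for well-formedness before it is saved, which may be worth doing for a hand-made file. The helper name check_cc_config is made up for this illustration and is not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Candidate contents for cc_config.xml (illustrative; trim to what you need).
CC_CONFIG = """\
<cc_config>
  <log_flags>
    <coproc_debug>1</coproc_debug>
  </log_flags>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
"""


def check_cc_config(text):
    """Parse the XML and report whether the two settings are switched on.

    Hypothetical helper, not part of BOINC. Raises ET.ParseError if the
    text is not well-formed XML.
    """
    root = ET.fromstring(text)
    if root.tag != "cc_config":
        raise ValueError("root element should be <cc_config>")
    return {
        "use_all_gpus": root.findtext("options/use_all_gpus", "0").strip() == "1",
        "coproc_debug": root.findtext("log_flags/coproc_debug", "0").strip() == "1",
    }


if __name__ == "__main__":
    print(check_cc_config(CC_CONFIG))
```

Once the text parses cleanly, save it as cc_config.xml in the BOINC data directory and restart the client as suggested above, so that the GPU detection lines are written to the event log again.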
ID: 97454 · Report as offensive
rtX

Joined: 6 May 06
Posts: 33
United Kingdom
Message 97462 - Posted: 10 Apr 2020, 15:13:04 UTC

So, I checked the event log after a reboot. BOINC is seeing both GPUs, but just doesn't seem to want to use both with GPUGrid. Here is an extract from the event log:

10/04/2020 15:54:55 |  | Starting BOINC client version 7.14.2 for windows_x86_64
10/04/2020 15:54:55 |  | log flags: file_xfer, sched_ops, task
10/04/2020 15:54:55 |  | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
10/04/2020 15:54:55 |  | Data directory: C:\ProgramData\BOINC
10/04/2020 15:54:55 |  | Running under account [redacted]
10/04/2020 15:54:55 |  | CUDA: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version 445.75, CUDA version 11.0, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | CUDA: NVIDIA GPU 1: GeForce GTX 1060 3GB (driver version 445.75, CUDA version 11.0, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version 445.75, device version OpenCL 1.2 CUDA, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | OpenCL: NVIDIA GPU 1: GeForce GTX 1060 3GB (driver version 445.75, device version OpenCL 1.2 CUDA, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | Host name: [redacted]
10/04/2020 15:54:55 |  | Processor: 8 AuthenticAMD AMD Ryzen 5 1400 Quad-Core Processor [Family 23 Model 1 Stepping 1]
10/04/2020 15:54:55 |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 svm sse4a osvw skinit wdt tce topx page1gb rdtscp fsgsbase bmi1 smep
10/04/2020 15:54:55 |  | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18363.00)
10/04/2020 15:54:55 |  | Memory: 15.95 GB physical, 18.32 GB virtual
10/04/2020 15:54:55 |  | Disk: 236.65 GB total, 181.56 GB free
10/04/2020 15:54:55 |  | Local time is UTC +1 hours
10/04/2020 15:54:55 |  | No WSL found.
10/04/2020 15:54:55 |  | Config: use all coprocessors
10/04/2020 15:54:55 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID [redacted]; resource share 1
10/04/2020 15:54:55 | GPUGRID | URL http://www.gpugrid.net/; Computer ID [redacted]; resource share 1
10/04/2020 15:54:55 | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID [redacted]; resource share 100
10/04/2020 15:54:55 | GPUGRID | General prefs: from GPUGRID (last modified 01-Apr-2020 13:06:29)
10/04/2020 15:54:55 | GPUGRID | Host location: none
10/04/2020 15:54:55 | GPUGRID | General prefs: using your defaults
10/04/2020 15:54:55 |  | Reading preferences override file
10/04/2020 15:54:55 |  | Preferences:
10/04/2020 15:54:55 |  | max memory usage when active: 12247.32 MB
10/04/2020 15:54:55 |  | max memory usage when idle: 14696.78 MB
10/04/2020 15:54:55 |  | max disk usage: 182.93 GB
10/04/2020 15:54:55 |  | suspend work if non-BOINC CPU load exceeds 25%
10/04/2020 15:54:55 |  | (to change preferences, visit a project web site or select Preferences in the Manager)
10/04/2020 15:54:55 |  | Setting up project and slot directories
10/04/2020 15:54:55 |  | Checking active tasks
10/04/2020 15:54:55 |  | Setting up GUI RPC socket
10/04/2020 15:54:55 |  | Checking presence of 751 project files
10/04/2020 15:55:06 |  | Suspending computation - CPU is busy
10/04/2020 15:55:16 |  | Resuming computation
10/04/2020 15:58:08 | World Community Grid | Task MCM1_0161602_2005_0 exited with zero status but no 'finished' file
10/04/2020 15:58:08 | World Community Grid | If this happens repeatedly you may need to reset the project.
10/04/2020 16:01:43 | World Community Grid | Computation for task MCM1_0161602_2005_0 finished
10/04/2020 16:01:43 | World Community Grid | Starting task MCM1_0161594_2784_0
10/04/2020 16:01:45 | World Community Grid | Started upload of MCM1_0161602_2005_0_r1537514952_0
10/04/2020 16:01:48 | World Community Grid | Finished upload of MCM1_0161602_2005_0_r1537514952_0
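
The startup lines in an extract like this one can also be checked mechanically. What follows is only a sketch: the regular expression covers the NVIDIA detection lines exactly as they appear in this thread (other vendors or client versions may word them differently), and the function names are invented for the example. The sample text is copied from the log above.

```python
import re

# Pattern for the NVIDIA detection lines shown in this thread, e.g.
# "... |  | CUDA: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version ..."
GPU_LINE = re.compile(r"\|\s*(CUDA|OpenCL): NVIDIA GPU (\d+): (.+?) \(")

SAMPLE = """\
10/04/2020 15:54:55 |  | CUDA: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version 445.75, CUDA version 11.0, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | CUDA: NVIDIA GPU 1: GeForce GTX 1060 3GB (driver version 445.75, CUDA version 11.0, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)
10/04/2020 15:54:55 |  | Config: use all coprocessors
"""


def detected_gpus(log_text):
    """Return {api: sorted list of (device_number, model)} found in the log."""
    found = {}
    for line in log_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            api, device, model = match.group(1), int(match.group(2)), match.group(3)
            found.setdefault(api, set()).add((device, model))
    return {api: sorted(devices) for api, devices in found.items()}


def use_all_gpus_enabled(log_text):
    """True if the client logged that the use-all-GPUs option was read."""
    return "Config: use all coprocessors" in log_text


if __name__ == "__main__":
    print(detected_gpus(SAMPLE))
    print(use_all_gpus_enabled(SAMPLE))
```

If both devices are listed and the "use all coprocessors" line is present, as they are here, detection and configuration are probably not the problem and the cause is more likely to lie in scheduling.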


I've redacted computer name, IDs etc.

Might this be something to do with BOINC's project weighting algorithm? If so, is there a way of increasing the weighting given to GPUGrid to ensure that it gets the CPU it needs to run the GPU?

Many thanks again.
ID: 97462 · Report as offensive
rtX

Joined: 6 May 06
Posts: 33
United Kingdom
Message 97463 - Posted: 10 Apr 2020, 15:26:11 UTC - in response to Message 97462.  

I think this is to do with resource weighting. I increased the resource share for GPUGrid (so that it and WCG are equal) and updated the project from BOINC Manager. This had no obvious effect until I suspended the WCG project, at which point both GPUs started being used. My guess is that I'll have to dramatically increase the resources allocated to GPUGrid so that BOINC doesn't think it deserves less attention than WCG?
ID: 97463 · Report as offensive
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 97465 - Posted: 10 Apr 2020, 15:35:25 UTC - in response to Message 97462.  

I've redacted computer name, IDs etc.
Yes, but why? The name I don't mind, but by redacting the ID numbers you're making it harder for helpers to look up your system on the projects to see if it may have been throwing errors and such. We can't see any private information when looking up your account.

As for the extremely low resource share for GPUGrid and Einstein, what was your reasoning behind them?
ID: 97465 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 97466 - Posted: 10 Apr 2020, 15:46:39 UTC

It's possibly due to the total core count calculation. You have an N-core processor (is N 4 or 8? "8 AuthenticAMD AMD Ryzen 5 1400 Quad-Core" is a bit ambiguous).

Anyway, GPU tasks need a bit of CPU support, and the rule is "tasks can be run until N-and-a-bit cores are in use". 'A bit' is deliberately left vague: it can be made up of multiple fractional parts for different GPUs, but the fractions have to add up to strictly less than 1.

For the new GPUGrid application, 'a bit' is unfortunately very large. I'm seeing 0.973 and 0.975 here - run two of those, and you're definitely going past one.

Next question: what's happening on the CPU? Is the task that's running using a single CPU core, or is it marked 'MT' (multi-threaded) and using all N of them?

If there were N single-threaded CPU tasks running, the 'and a bit' rule would allow one GPU task to run. The solution would be to pause one of the CPU tasks (run N-1), and let the GPU task take its place. We can show you how to do that.

If, on the other hand, it's running MT, 'and a bit' is only going to allow one GPU to run at a time: you can't simply stop one of the threads. Well, you can tell it to run on fewer threads, but it's a bit fiddly. Again, we can show you how to do it.
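
To make the arithmetic concrete, here is a small sketch of the 'N-and-a-bit' rule as it is worded in this post. It is a simplified model for illustration, not BOINC's actual scheduler code, and the function name is invented. It assumes N = 8 (the figure the client reports) and the 0.973 and 0.975 fractions quoted above.

```python
def runnable_gpu_tasks(n_cores, cpu_tasks_running, gpu_cpu_fractions):
    """Count how many GPU tasks fit under the 'N-and-a-bit' rule.

    Simplified model of the rule as described in this thread: a task may
    start only while the total CPU claimed stays strictly below
    n_cores + 1. cpu_tasks_running counts cores already taken by CPU
    work (an MT task using every thread counts as n_cores).
    """
    used = float(cpu_tasks_running)
    started = 0
    for fraction in gpu_cpu_fractions:
        if used + fraction < n_cores + 1:
            used += fraction
            started += 1
    return started


if __name__ == "__main__":
    fractions = [0.973, 0.975]
    # All 8 cores busy with CPU work: 8 + 0.973 fits, adding 0.975 does not.
    print(runnable_gpu_tasks(8, 8, fractions))  # -> 1
    # One CPU task paused (N-1 running): 7 + 0.973 + 0.975 = 8.948 < 9.
    print(runnable_gpu_tasks(8, 7, fractions))  # -> 2
```

Under this model the outcome is the same whether N is 4 or 8: with N cores' worth of CPU work running, only one GPU task gets in, and freeing a single core admits the second.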
ID: 97466 · Report as offensive
rtX

Joined: 6 May 06
Posts: 33
United Kingdom
Message 97467 - Posted: 10 Apr 2020, 15:48:30 UTC

I've finally managed to persuade BOINC to use both GPUs. I changed the resource share for WCG to 1 and for GPUGrid to 100,000, and aborted a load of WCG WUs for a project which I keep getting computation errors on (on two separate machines), and which WCG keeps sending me even though I have stopped volunteering for that particular project. I'm a little frustrated that there isn't an option to override BOINC's scheduling to ensure that the GPUs are always working, and even more that WCG keeps sending me dodgy WUs. I've posted this for any perspective, and as it may offer insight to others whose GPUs do not seem to be being used.
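
For perspective on those numbers: resource shares are relative weights, so what matters is each project's share divided by the sum of all shares. The sketch below works that out for the shares in the first event log extract (1, 1 and 100) and for the new values, assuming Einstein@Home stays at 1. It is a simplification: the result is a rough long-term target and says nothing about which tasks run at any given moment. The function name is made up for the example.

```python
def share_fractions(shares):
    """Turn resource shares (relative weights) into fractions of the total."""
    total = sum(shares.values())
    return {project: share / total for project, share in shares.items()}


# Shares from the first event log extract in this thread.
before = {"Einstein@Home": 1, "GPUGRID": 1, "World Community Grid": 100}
# Shares after the change described in this post.
after = {"Einstein@Home": 1, "GPUGRID": 100000, "World Community Grid": 1}

if __name__ == "__main__":
    for label, shares in (("before", before), ("after", after)):
        for project, fraction in share_fractions(shares).items():
            print(f"{label}: {project}: {fraction:.3%}")
```

With the original settings GPUGrid's weight was 1 part in 102, a little under 1%. With the new ones it is 100,000 parts in 100,002.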
ID: 97467 · Report as offensive
rtX

Joined: 6 May 06
Posts: 33
United Kingdom
Message 97468 - Posted: 10 Apr 2020, 15:57:26 UTC - in response to Message 97467.  

Regarding redacting my computer ID - I didn't realise that this could be used to help. See below from latest event log:

10/04/2020 16:36:51 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12823750; resource share 1
10/04/2020 16:36:51 | GPUGRID | URL http://www.gpugrid.net/; Computer ID 538497; resource share 100000
10/04/2020 16:36:51 | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 6736303; resource share 1
10/04/2020 16:36:51 | World Community Grid | General prefs: from World Community Grid (last modified 10-Apr-2020 16:28:28)
10/04/2020 16:36:51 | World Community Grid | Host location: none
10/04/2020 16:36:51 | World Community Grid | General prefs: using your defaults


Thanks.
ID: 97468 · Report as offensive


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.