One Nvidia GPU unable to process after a couple of days.

BoincSpy

Joined: 28 Oct 21
Posts: 7
Message 108114 - Posted: 17 May 2022, 16:47:52 UTC

I have the following issue: one of the two NVIDIA GPUs stops computing after a couple of days. The only way to get work running on it again is to remove all the tasks or reset the project, but I have to do this every 2-3 days. This happens on a couple of machines. Here are the specs.

Project: Einstein@Home
GPUs: 2 RTX 2070s.
CPU: Intel Core i7 (8th gen), with 8 GB RAM.
Boinc Version: 7.16.20

There are plenty of GPU tasks enabled.

I have turned on the coproc_debug, cpu_sched_debug and work_fetch_debug options. Here is the event log (a rough sketch of the matching cc_config.xml is below the log). I noticed the device is not able to run because the CPU is committed, but I have set the CPU limit to 70%, so I thought there would be plenty of CPU headroom. Setting it lower makes no difference.


5/17/2022 9:39:12 AM | | Re-reading cc_config.xml
5/17/2022 9:39:12 AM | | Config: GUI RPCs allowed from:
5/17/2022 9:39:12 AM | | 172.16.0.23
5/17/2022 9:39:12 AM | | Config: use all coprocessors
5/17/2022 9:39:12 AM | | log flags: file_xfer, task, coproc_debug, cpu_sched_debug, work_fetch_debug
5/17/2022 9:39:12 AM | | [cpu_sched_debug] Request CPU reschedule: Core client configuration
5/17/2022 9:39:12 AM | | [work_fetch] Request work fetch: Core client configuration
5/17/2022 9:39:12 AM | | [cpu_sched_debug] schedule_cpus(): start
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] reserving 1.000000 of coproc NVIDIA
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: LATeah3012L08_796.0_0_0.0_32509764_0 (NVIDIA GPU, FIFO) (prio -1.000000)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] reserving 1.000000 of coproc NVIDIA
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: LATeah3012L08_796.0_0_0.0_32507847_0 (NVIDIA GPU, FIFO) (prio -1.020764)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (CPU, EDF) (prio -1.041527)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (CPU, EDF) (prio -1.041562)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (CPU, EDF) (prio -1.041597)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (CPU, EDF) (prio -1.041632)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (CPU, EDF) (prio -1.041667)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (CPU, EDF) (prio -1.041702)
5/17/2022 9:39:12 AM | | [cpu_sched_debug] enforce_run_list(): start
5/17/2022 9:39:12 AM | | [cpu_sched_debug] preliminary job list:
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 0: LATeah3012L08_796.0_0_0.0_32509764_0 (MD: no; UTS: yes)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 1: LATeah3012L08_796.0_0_0.0_32507847_0 (MD: no; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 2: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 3: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 4: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 5: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 6: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 7: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | | [cpu_sched_debug] final job list:
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 0: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 1: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 2: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 3: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 4: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 5: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 6: LATeah3012L08_796.0_0_0.0_32509764_0 (MD: no; UTS: yes)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 7: LATeah3012L08_796.0_0_0.0_32507847_0 (MD: no; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] Assigning NVIDIA instance 1 to LATeah3012L08_796.0_0_0.0_32507847_0
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] skipping GPU job LATeah3012L08_796.0_0_0.0_32507847_0; CPU committed
5/17/2022 9:39:12 AM | | [cpu_sched_debug] enforce_run_list: end
5/17/2022 9:39:14 AM | | choose_project(): 1652805554.692413
5/17/2022 9:39:14 AM | | [work_fetch] ------- start work fetch state -------
5/17/2022 9:39:14 AM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
5/17/2022 9:39:14 AM | | [work_fetch] --- project states ---
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] REC 607807.647 prio -0.104 can't request work: scheduler RPC backoff (14.56 sec)
5/17/2022 9:39:14 AM | | [work_fetch] --- state for CPU ---
5/17/2022 9:39:14 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1317305.84 busy 1072572.89
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] share 0.000
5/17/2022 9:39:14 AM | | [work_fetch] --- state for NVIDIA GPU ---
5/17/2022 9:39:14 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 278830.81 busy 0.00
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] share 0.000
5/17/2022 9:39:14 AM | | [work_fetch] ------- end work fetch state -------
5/17/2022 9:39:14 AM | Einstein@Home | choose_project: scanning
5/17/2022 9:39:14 AM | Einstein@Home | skip: scheduler RPC backoff
5/17/2022 9:39:14 AM | | [work_fetch] No project chosen for work fetch
5/17/2022 9:39:29 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/17/2022 9:39:29 AM | | choose_project(): 1652805569.760385
5/17/2022 9:39:29 AM | | [work_fetch] ------- start work fetch state -------
5/17/2022 9:39:29 AM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
5/17/2022 9:39:29 AM | | [work_fetch] --- project states ---
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] REC 607807.647 prio -1.104 can request work
5/17/2022 9:39:29 AM | | [work_fetch] --- state for CPU ---
5/17/2022 9:39:29 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1317242.70 busy 1072565.50
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] share 1.000
5/17/2022 9:39:29 AM | | [work_fetch] --- state for NVIDIA GPU ---
5/17/2022 9:39:29 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 278828.81 busy 0.00
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] share 1.000
5/17/2022 9:39:29 AM | | [work_fetch] ------- end work fetch state -------
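
For reference, the log flags and the "use all coprocessors" line at the top come from my cc_config.xml. Roughly (from memory - the allowed GUI RPC host is configured separately) it looks like this:

<cc_config>
  <log_flags>
    <file_xfer>1</file_xfer>
    <task>1</task>
    <coproc_debug>1</coproc_debug>
    <cpu_sched_debug>1</cpu_sched_debug>
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>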


Does anyone have suggestions to fix this? I have tried talking to the Einstein@Home people and didn't get too far with them.

Thanks,
Bob
ID: 108114
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 108127 - Posted: 18 May 2022, 7:57:07 UTC - in response to Message 108114.  

Lines like:
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for LATeah3012L08_796.0_0_0.0_32509764_0

The key section of each line is which NVIDIA instance the task is being assigned to.

A few things to check:
Are both GPUs actually working? Use something like GPU-Z to check that (there is also a command-line check sketched at the end of this post).
Is BOINC actually seeing both GPUs? In the event log, when BOINC first starts you should see lines a bit like these:
18/05/2022 07:44:19 | | Starting BOINC client version 7.16.20 for windows_x86_64
18/05/2022 07:44:19 | | log flags: file_xfer, sched_ops, task, sched_op_debug
18/05/2022 07:44:19 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8
18/05/2022 07:44:19 | | Data directory: C:\ProgramData\BOINC
18/05/2022 07:44:19 | | Running under account rob
18/05/2022 07:44:20 | | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1070 Ti (driver version 511.65, CUDA version 11.6, compute capability 6.1, 4096MB, 3470MB available, 8186 GFLOPS peak)
18/05/2022 07:44:20 | | CUDA: NVIDIA GPU 1: NVIDIA GeForce GTX 1070 Ti (driver version 511.65, CUDA version 11.6, compute capability 6.1, 4096MB, 3470MB available, 8186 GFLOPS peak)
18/05/2022 07:44:20 | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1070 Ti (driver version 511.65, device version OpenCL 3.0 CUDA, 8192MB, 3470MB available, 8186 GFLOPS peak)
18/05/2022 07:44:20 | | OpenCL: NVIDIA GPU 1: NVIDIA GeForce GTX 1070 Ti (driver version 511.65, device version OpenCL 3.0 CUDA, 8192MB, 3470MB available, 8186 GFLOPS peak)

18/05/2022 07:44:20 | | Windows processor group 0: 16 processors

Again, the key bits are the CUDA and OpenCL detection lines, which should list both GPU 0 and GPU 1.

Possible reasons for a GPU not running:
GPU not seated properly in its slot.
Thermal - GPUs tend to get very hot when doing computational work; you may need to de-dust them.
Power supply not working properly.
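
If you prefer a command-line check to GPU-Z (and it also works on Linux), the driver's own nvidia-smi tool should show whether both cards are visible and how hot and busy they are - something like:

nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,power.draw --format=csv

Run that while BOINC is crunching; one card sitting at 0% utilisation while the other is flat out points at scheduling rather than hardware.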
ID: 108127
BoincSpy

Joined: 28 Oct 21
Posts: 7
Message 108128 - Posted: 18 May 2022, 17:44:38 UTC - in response to Message 108127.  

Yes, BOINC sees both GPUs.

5/18/2022 10:38:41 AM | | CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 (driver version 470.99, CUDA version 11.4, compute capability 7.5, 4096MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | CUDA: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 (driver version 470.99, CUDA version 11.4, compute capability 7.5, 4096MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 (driver version 470.103.01, device version OpenCL 3.0 CUDA, 7981MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | OpenCL: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 (driver version 470.103.01, device version OpenCL 3.0 CUDA, 7982MB, 3968MB available, 7465 GFLOPS peak)

Both GPUs are working, as I use the distributed.net OpenCL application, which really pushes the GPUs.
ID: 108128
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 108130 - Posted: 18 May 2022, 20:02:20 UTC

Something strange: it appears you have two versions of the driver installed. In the first pair of lines driver version 470.99 is reported, while in the second pair 470.103.01 is reported. If you look at the lines from my computer you will see the version is the same in all four lines, 511.65. Mixed driver versions have given people problems over the years.
It may be that at some time in the past you didn't do a "clean" driver update (or Windows decided that your drivers needed updating and only did half the job). Version 470.103.01 was distributed with one of the toolkits, which aren't really needed for (most) BOINC projects.
I'd head over to the Nvidia site, get the current drivers, then do a "clean installation" (this is an option often buried in small text when you start the installation).
ID: 108130
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 108131 - Posted: 18 May 2022, 22:27:50 UTC

I would go even further: boot into Safe Mode, run DDU to remove the driver install, check the option to prevent Windows from installing its own driver, then boot back into normal mode and install the driver from the latest Nvidia package.

This gets rid of every last bit of the previous drivers; simply doing a clean install with the new driver package doesn't wipe everything out.
ID: 108131
BoincSpy

Joined: 28 Oct 21
Posts: 7
Message 108152 - Posted: 19 May 2022, 21:28:56 UTC

The machine in question is running Ubuntu, so I will have to do a clean reinstall of the NVIDIA drivers (a rough sketch of the commands I have in mind is after the list below). However, the other machine (Windows) that I have issues with has the same versions of the drivers, and it appears the CPU/GPU task scheduler is what stops the task on the other GPU from running. I suspect this for the following reasons.

1) If I play around with the Computing preferences CPU usage limit (% of CPUs), I can get the other GPU to start processing.
2) If I delete all tasks, I can get the other GPU to run for a couple of days.
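
For the Ubuntu box, the clean reinstall I have in mind is roughly the following (package names are a guess and depend on what the repo offers, so check with ubuntu-drivers first):

sudo apt-get remove --purge '^nvidia-.*'   # remove every installed NVIDIA package
sudo apt-get autoremove
ubuntu-drivers devices                     # list the recommended driver package
sudo apt install nvidia-driver-510         # or whichever version is recommended above
sudo reboot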
ID: 108152
Nick Name

Joined: 14 Aug 19
Posts: 55
United States
Message 108191 - Posted: 22 May 2022, 19:14:04 UTC

Older BOINC versions had a problem with Einstein: under some circumstances they would download more work than could be completed by the deadline. This might be your problem; it would explain why things work correctly for a couple of days after you delete tasks and then the problem comes back. I'm certain the excessive-work problem affects the 7.16 series. The easiest thing to do is update BOINC to a newer version and see if that fixes it.

Keep in mind that some of those Einstein GPU tasks actually require more than one CPU thread, so your CPU usage settings might not match the workload the way you think (see the app_config.xml sketch below).
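
One way to make that explicit is an app_config.xml in the Einstein@Home project directory, telling BOINC to budget a full CPU core per GPU task. Something like the sketch below - the app name is only my guess for the gamma-ray GPU app that runs those LATeah tasks, so check the real name in client_state.xml first:

<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

After saving it, use Options -> Read config files in the BOINC Manager (or restart the client) so it takes effect.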
Team USA forum
Follow us on Twitter
Help us #crunchforcures!
ID: 108191
BoincSpy

Joined: 28 Oct 21
Posts: 7
Message 108277 - Posted: 30 May 2022, 5:17:28 UTC - in response to Message 108191.  

The Linux client is not running 7.16.20 as reported earlier. It's running:

5/29/2022 10:00:29 PM | | Starting BOINC client version 7.18.1 for x86_64-pc-linux-gnu
5/29/2022 10:00:29 PM | | This a development version of BOINC and may not function properly.

I will dig into the issue of GPU tasks requiring more than one thread, and play around with the CPU settings.

Thanks,
BoincSpy
ID: 108277
