"Phantom" GPU devices showing up in 7.16.3 and 441.66 again

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 94593 - Posted: 30 Dec 2019, 14:31:36 UTC
Last modified: 30 Dec 2019, 14:39:15 UTC

I know how it happened and what can be done to fix it but not why.

How: I had to replace the blower fan on one of the two boards in my office desktop. Long story, but the two boards ended up back in with their slots reversed. I installed 441 after Microsoft put in 3xx, as it seems reversing the PCIe slots confuses Windows.

BOINC showed 2 CUDA and 4 OpenCL devices, with the pair of extra "phantom" GPUs attempting to crunch. Running Revo Uninstaller and then a clean install of 441 did not solve the problem. The Revo showed a mix of 339 and 441 but the clean install should have worked.

Looked at the coproc_info.xml file. Its sections, with device_num and opencl_device_index for each OpenCL entry:

header
cuda0
cuda1
opencl   device_num,opencl_device_index
OCLnv0   ===> 0,0
OCLnv1   ===> 0,0
OCLnv2   ===> 1,1
OCLnv3   ===> 1,1


C:\Users\josep\Desktop\debug coproc>fc OCLnv0.txt OCLnv1.txt
Comparing files OCLnv0.txt and OCLnv1.TXT
FC: no differences encountered


C:\Users\josep\Desktop\debug coproc>fc OCLnv2.txt OCLnv3.txt
Comparing files OCLnv2.txt and OCLnv3.TXT
FC: no differences encountered


C:\Users\josep\Desktop\debug coproc>fc OCLnv1.txt OCLnv3.txt
Comparing files OCLnv1.txt and OCLnv3.TXT
***** OCLnv1.txt
      <opencl_driver_version>441.66</opencl_driver_version>
      <device_num>0</device_num>
      <peak_flops>8186112000000.000000</peak_flops>
***** OCLnv3.TXT
      <opencl_driver_version>441.66</opencl_driver_version>
      <device_num>1</device_num>
      <peak_flops>8186112000000.000000</peak_flops>
*****

***** OCLnv1.txt
      <opencl_available_ram>3726508031.000000</opencl_available_ram>
      <opencl_device_index>0</opencl_device_index>
      <warn_bad_cuda>0</warn_bad_cuda>
***** OCLnv3.TXT
      <opencl_available_ram>3726508031.000000</opencl_available_ram>
      <opencl_device_index>1</opencl_device_index>
      <warn_bad_cuda>0</warn_bad_cuda>
*****


The GPU detect program wrote out duplicate entries for the same GPU. My fix was to delete the OCLnv1 and OCLnv3 sections and set the attributes of the coproc_info.xml file to read-only.
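
For anyone who prefers to script the read-only step, here is a minimal Python sketch. It only shows how the attribute can be set; it runs on a throwaway file because the thread does not give the location of the real coproc_info.xml, and the helper name make_read_only is made up for illustration.

```python
import os
import stat
import tempfile

def make_read_only(path):
    """Clear the write permission bits (on Windows this sets the read-only attribute)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

# Demonstrated on a temporary stand-in file, not on a real BOINC data directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "coproc_info.xml")
    with open(demo, "w") as f:
        f.write("<coprocs/>")
    make_read_only(demo)
    # The owner write bit should now be cleared.
    is_read_only = not (os.stat(demo).st_mode & stat.S_IWUSR)
    print(is_read_only)
    # Restore write permission so the temporary directory can be removed on Windows.
    os.chmod(demo, stat.S_IRUSR | stat.S_IWUSR)
```

To undo it later, the same os.chmod call with the write bit set again makes the file writable.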

Suggestion: The program that writes out that file should check for duplicates. Alternately, the program that reads it in should do a check.
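
As a rough illustration of the read-side check suggested above, here is a minimal Python sketch. It is not BOINC code. The wrapper tag names (coprocs, nvidia_opencl) and the trimmed sample are assumptions made for the example; only device_num and opencl_device_index are taken from the fc output in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for coproc_info.xml. The wrapper tags
# are assumptions; the two child tags appear in the fc output in the thread.
SAMPLE = """
<coprocs>
  <nvidia_opencl>
    <device_num>0</device_num>
    <opencl_device_index>0</opencl_device_index>
  </nvidia_opencl>
  <nvidia_opencl>
    <device_num>0</device_num>
    <opencl_device_index>0</opencl_device_index>
  </nvidia_opencl>
  <nvidia_opencl>
    <device_num>1</device_num>
    <opencl_device_index>1</opencl_device_index>
  </nvidia_opencl>
  <nvidia_opencl>
    <device_num>1</device_num>
    <opencl_device_index>1</opencl_device_index>
  </nvidia_opencl>
</coprocs>
"""

def drop_duplicate_sections(root, tag="nvidia_opencl"):
    """Remove repeated <tag> children that share device_num and opencl_device_index."""
    seen = set()
    removed = 0
    for section in list(root.findall(tag)):
        key = (section.findtext("device_num"),
               section.findtext("opencl_device_index"))
        if key in seen:
            root.remove(section)  # keep the first occurrence, drop the repeat
            removed += 1
        else:
            seen.add(key)
    return removed

root = ET.fromstring(SAMPLE)
removed = drop_duplicate_sections(root)
remaining = [(s.findtext("device_num"), s.findtext("opencl_device_index"))
             for s in root.findall("nvidia_opencl")]
print(removed, remaining)
```

Keying on the pair of numbers matches the NVIDIA case shown here, where the phantom sections were byte-for-byte identical to the real ones. A stricter check could compare the whole section instead.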

Other thoughts: the clean uninstall should have worked. Possibly I should have disconnected the ethernet to prevent Windows from re-downloading the same 339 (?) driver. I was instructed to reboot several times to remove the 441 and 339 stuff. Since I was busy with replacing the fan, I may not have responded in time to continue the uninstall.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 94594 - Posted: 30 Dec 2019, 16:47:15 UTC
Last modified: 30 Dec 2019, 16:48:39 UTC

Went back to Feb 2019 and got the zipped AMD RX-570 coproc_info that I had provided earlier in the year when the problem first arose.

There is a difference: although both coproc_info files have an extra pair of GPUs, the ATI arrangement is not the same as the NVIDIA one. In this case I deleted the last two sections before making the file read-only.

          device_num   device_index
OCLati0   0            0
OCLati1   1            1
OCLati2   2            0
OCLati3   3            1



C:\Users\josep\Desktop\debug coproc>fc OCLat0.txt OCLat1.txt
Comparing files OCLat0.txt and OCLAT1.TXT
***** OCLat0.txt
      <opencl_driver_version>2766.5</opencl_driver_version>
      <device_num>0</device_num>
      <peak_flops>5095424000000.000000</peak_flops>
***** OCLAT1.TXT
      <opencl_driver_version>2766.5</opencl_driver_version>
      <device_num>1</device_num>
      <peak_flops>5095424000000.000000</peak_flops>
*****

***** OCLat0.txt
      <opencl_available_ram>4294967296.000000</opencl_available_ram>
      <opencl_device_index>0</opencl_device_index>
      <warn_bad_cuda>0</warn_bad_cuda>
***** OCLAT1.TXT
      <opencl_available_ram>4294967296.000000</opencl_available_ram>
      <opencl_device_index>1</opencl_device_index>
      <warn_bad_cuda>0</warn_bad_cuda>
*****


The NVIDIA coproc_info lists 2 CUDA devices, so more than 2 OpenCL devices is a clue that there is a problem. The ATI file has no count of actual cards, nor do any of its OpenCL sections duplicate each other, so the ATI problem is harder to solve by just analyzing the file.
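
To make the two cases concrete, here is a small Python sketch that applies simple checks to the numbers listed in this thread. The first check is the CUDA-versus-OpenCL count clue described above. The second, flagging a repeated opencl_device_index, is only a guess based on the ATI table and not something confirmed to be reliable. The function names are made up for illustration.

```python
from collections import Counter

# (device_num, opencl_device_index) pairs as listed in the thread.
NVIDIA_OPENCL = [(0, 0), (0, 0), (1, 1), (1, 1)]
ATI_OPENCL = [(0, 0), (1, 1), (2, 0), (3, 1)]

def more_opencl_than_cuda(cuda_count, opencl_entries):
    """NVIDIA clue: more OpenCL sections than CUDA sections."""
    return len(opencl_entries) > cuda_count

def repeated_device_indexes(opencl_entries):
    """Heuristic for files without a CUDA count: indexes that occur more than once."""
    counts = Counter(index for _num, index in opencl_entries)
    return sorted(index for index, n in counts.items() if n > 1)

nvidia_suspicious = more_opencl_than_cuda(2, NVIDIA_OPENCL)
ati_repeats = repeated_device_indexes(ATI_OPENCL)
print(nvidia_suspicious)  # True
print(ati_repeats)        # [0, 1]
```

Neither check proves a phantom device on its own; they only point at sections worth inspecting by hand.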

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1283
United Kingdom
Message 94596 - Posted: 30 Dec 2019, 17:56:39 UTC

Given that the nVidia driver version 441.66 is known to have some issues when performing calculations, you would do well not to use it, despite having had it forced on you by MS. (How often do we have to say "Do NOT allow MS to update your drivers for you, but ALWAYS get the drivers from nVidia"?) The highest known "fault free" version is 431.xx.

That apart, what is the hardware configuration? Is it a single GTX 1070 Ti, or something else?

I suspect that, like many other similar routines, the one in question simply finds the file it wants and, instead of overwriting it, just appends the data to the appropriate sections.

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 94598 - Posted: 30 Dec 2019, 19:47:48 UTC - in response to Message 94593.  

The Revo showed a mix of 339 and 441 but the clean install should have worked.
If an uninstaller finds multiple drivers, that's probably why BOINC finds these GPUs as well. The trouble with all drivers is that names change and their folders may change as well, so much so that the uninstaller of the new installer doesn't necessarily know where it has to clean up all the old stuff from the previous installation when doing a clean install. This has happened quite a few times, to the point that people were at times forced to do a clean installation of Windows to get rid of previous driver remnants that didn't want to move.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 94601 - Posted: 30 Dec 2019, 20:21:26 UTC
Last modified: 30 Dec 2019, 20:27:16 UTC

Ran some more tests after talking with Dell, and it turned out the fan was not the problem. The NVidia board is running the fan at 100%, which is ruining my hearing as well as the fan.

Just removed the "read only" coproc file and started BOINC, and it wrote out a good coproc_info.xml file that actually matched the one I had edited.

The board arrangement is the same. Maybe it needed another reboot for the "cleaner" to work.

Turned out the "basic" warranty (have 40 days left) covers the video board. They wanted proof, so I took a lot of pictures. GPU-Z was helpful, as it showed 5000 rpm and "no load" on the bad board and 1100 rpm on the good one, also at no load. It also shows the history, which is as good as a video.

I think an issue should be raised about that coproc_info file. The GPU detection should never write out identical GPUs at the same address. If BOINC has no control over the program doing the writing (which I suspect), then for sure when the client reads in the info file to see what is there, it should ignore duplicates at the same bus address. Unfortunately, the ATI behavior is different.

https://stateson.net/images/coproc_normal.png

embed
Joined: 9 Mar 20
Posts: 5
Poland
Message 96753 - Posted: 13 Mar 2020, 19:56:03 UTC
Last modified: 13 Mar 2020, 20:00:06 UTC

Hi, I had a similar problem, but I lost 2 of my 4 GPUs. In my case the key was to edit the coproc_info.xml file and double the GPU entry. Thanks!

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.