Inconsistent GPU enumeration: gpus: 0,1,2.. different for additional cards


BeemerBiker
Joined: 27 Jun 08
Posts: 257
United States
Message 84583 - Posted: 29 Jan 2018, 16:11:02 UTC
Last modified: 29 Jan 2018, 16:38:28 UTC

I observed this with AMD/ATI video boards. If the motherboard has only one ATI card, that GPU is numbered GPU 0. If a second board is added to the adjacent slot, BOINC 7.8.3 renumbers the first board as GPU 1 and calls the new one GPU 0. I have not seen this problem with NVIDIA boards.

For example:
Dell Z400 with an RX-570 in the slot closest to the CPU and an HD7950 in the adjacent X16 slot.
In Windows 10, the driver (AMD v17) "location path" shows "SLT2" for the RX-570 and "SLT4" for the HD7950. Those are the correct slots for this motherboard: SLT2 is the first X16 slot, closest to the CPU, and SLT4 is the next X16 slot, further from the CPU. This system was built with the RX-570; the HD7950 was added later.

However, BOINC clearly shows the HD7950 as the "first" GPU:

    1/29/2018 7:38:26 AM OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7900 Series (driver version 2527.7, device version OpenCL 1.2 AMD-APP (2527.7), 3072MB, 3072MB available, 3315 GFLOPS peak)

    1/29/2018 7:38:26 AM OpenCL: AMD/ATI GPU 1: Radeon RX 570 Series (driver version 2527.7, device version OpenCL 2.0 AMD-APP (2527.7), 4096MB, 4096MB available, 5095 GFLOPS peak)



Another example:
An MSI-7380 with a pair of 7950s: a Gigabyte in the slot closest to the CPU and a Tahiti LE in the adjacent slot further away. The Tahiti LE has fewer shaders, so its peak GFLOPS are lower; otherwise it would be impossible to tell which board is GPU 0 and which is GPU 1, because BOINC does not report board names. For this much older motherboard, the Windows drivers do not show SLTx information; instead they show bus and device numbers, which I do not know how to interpret. Notice that the weaker board (2842 GFLOPS) is listed first, as GPU 0. The Gigabyte should have been listed first, since the Tahiti LE was added after the system was built.


    1/29/2018 9:07:32 AM OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7900 Series (driver version 2527.8, device version OpenCL 1.2 AMD-APP (2527.8), 3072MB, 3072MB available, 2842 GFLOPS peak)

    1/29/2018 9:07:32 AM OpenCL: AMD/ATI GPU 1: AMD Radeon HD 7900 Series (driver version 2527.8, device version OpenCL 1.2 AMD-APP (2527.8), 3072MB, 3072MB available, 3315 GFLOPS peak)



This bug matters because if you use TThrottle to monitor temperatures, the wrong temperatures are assigned to the graphics cards. On the remote system monitored by BoincTasks, TThrottle, GPU-Z, Radeon and Windows all enumerate the graphics boards in the correct order, but the temperature reported back by TThrottle is then associated with the wrong board. The renumbering also invalidates cc_config.xml configurations that use <exclude_gpu> when additional boards are added.
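To make the cc_config.xml problem concrete, here is a minimal exclusion in the client's <exclude_gpu> format. It is a sketch only: the project URL is a placeholder, and it assumes the entry was written while the RX-570 was the only card, and therefore ATI GPU 0.

```xml
<cc_config>
  <options>
    <!-- Hypothetical exclusion: keep the placeholder project off ATI GPU 0. -->
    <!-- When written, ATI GPU 0 was the RX-570. After the HD7950 was added, -->
    <!-- BOINC reports the HD7950 as GPU 0, so this entry now excludes the HD7950. -->
    <exclude_gpu>
      <url>http://project.example.org/</url>
      <device_num>0</device_num>
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
```

Because <device_num> refers to BOINC's own enumeration rather than to a physical slot, the same entry silently moves to whichever card BOINC reports as GPU 0.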

Again, this does not happen with NVIDIA boards.

[EDIT] I only noticed this problem because high temperatures were being reported for the RX-570, but the card that was actually hot was the HD7950.

Juha
Volunteer developer
Volunteer tester
Help desk expert

Joined: 20 Nov 12
Posts: 766
Finland
Message 84584 - Posted: 29 Jan 2018, 18:55:09 UTC - in response to Message 84583.  

BOINC doesn't number GPUs based on their physical distance to the CPU, their performance or anything like that. BOINC numbers GPUs in the order in which it receives information about them from the CAL, CUDA or OpenCL drivers.

Those other programs are probably using some lower-level interface to get the information.
Fred - efmer.com
Joined: 8 Aug 08
Posts: 551
Netherlands
Message 84601 - Posted: 30 Jan 2018, 20:43:25 UTC - in response to Message 84583.  

The card sequence is reported by the BOINC client.

If the sequence is not correct, add tthrottle.xml to the TThrottle folder.
An example is here: C:\Program Files\eFMer\TThrottle\examples
Use <Device_position>1;0</Device_position> (if I remember correctly). This should switch the two cards.
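Combined with the <GPU_SETUP> wrapper from Fred's later post in this thread, a minimal tthrottle.xml fragment for the two-card swap might look like this. It is a sketch: any other settings from the sample file in the examples folder are left out, and the exact direction of the mapping is assumed not to matter for a simple two-card swap.

```xml
<!-- Hypothetical two-card fragment: swaps how TThrottle maps BOINC's GPU 0 and GPU 1. -->
<GPU_SETUP>
<Device_position>1;0</Device_position>
</GPU_SETUP>
```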
TThrottle The way to control your CPU and GPU temperature.
BoincTasks The best view of BOINC.
My other activities
BeemerBiker
Joined: 27 Jun 08
Posts: 257
United States
Message 84617 - Posted: 31 Jan 2018, 16:46:40 UTC - in response to Message 84601.  
Last modified: 31 Jan 2018, 16:48:32 UTC

Thanks Fred! That fixed the problem with the temperatures and they now correspond to the correct GPU.

I also found that NVIDIA boards can show inconsistent enumeration too, not just ATI. I had a motherboard with an x1 slot closest to the CPU and a pair of GTX 670s in the two adjacent X16 slots. I put an x1-to-x16 riser in that x1 slot, with a GTX 650 Ti on the riser. I expected the two GTX 670s to be renumbered 1 and 2, with the card in the x1 slot becoming GPU 0. Instead, BOINC placed the new board in the middle of the order: GPUs 0, 1 and 2 were the 670, the 650 Ti and the 670 respectively. However, it was not necessary to renumber them with your Device_position: both TThrottle and BoincTasks showed the correct order. (The 45-degree reading is GPU 1, which is the 650 Ti.) Only the ATI cards required your XML file to set the ordering.


However, this does not fix the cc_config.xml problem: the GPU exclusions become incorrect when the GPUs are renumbered, so the file has to be edited by hand.
Fred - efmer.com
Joined: 8 Aug 08
Posts: 551
Netherlands
Message 86190 - Posted: 12 May 2018, 9:17:56 UTC - in response to Message 84583.  

The GPU assignment is handled by the BOINC client.
If the cards are reversed, this is how to fix it:
add <Device_position> to the <GPU_SETUP> section.

<GPU_SETUP>
<Device_position>0;1;3;2</Device_position>
</GPU_SETUP>

This swaps the readings of the third and fourth cards.

For an example file, look here: C:\Program Files\eFMer\TThrottle\examples
Move the real file here: C:\Program Files\eFMer\TThrottle
and then restart TThrottle.


Copyright © 2018 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.