Thread 'Request: better GPU labelling'

Author	Message
Dave Help desk expert Joined: 28 Jun 10 Posts: 2789	Message 102127 - Posted: 13 Dec 2020, 13:23:38 UTC - in response to Message 102124. I suspect this happens when task A on card 0 completes and task B on card 1 gets promoted to card 0, or some such. Is the Windows numbering constant? It is many years since I last ran Windows. On a side note, when Kings Cross station added another platform a few years ago, they called it platform 0 rather than renumber the existing platforms.

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 901	Message 102141 - Posted: 14 Dec 2020, 8:58:34 UTC It's even more confusing in Linux, where three different entities enumerate the GPUs in a system in three different ways. The nvidia-smi app orders the cards 0 through N in ascending order of their hex BusID. The Nvidia X Server Settings app orders the cards 0 through N much as Windows does, most of the time starting from the slot nearest the socket, but often in descending BusID order from the slot closest to the socket to the slot at the bottom of the board. Oh, and the X Server app shows its numbers in decimal rather than hex, to confuse the issue even more. And finally, BOINC orders cards by capability, so the card with the highest performance is basically numbered card 0: compute capability (CC) first, then amount of VRAM, then driver level.
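To illustrate how the orderings described above can disagree, here is a minimal Python sketch. It is not BOINC or Nvidia code: the `Gpu` record, the two helper functions, and the three sample cards are all hypothetical, and the sort keys simply follow the description in the post (ascending hex BusID for nvidia-smi; compute capability, then VRAM, then driver level for BOINC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    # Hypothetical record; field names are illustrative only.
    name: str
    bus_id: str                # PCI bus number as a hex string, e.g. "0a"
    compute_capability: tuple  # (major, minor)
    vram_mib: int
    driver: tuple              # (major, minor)


def order_by_bus_id(gpus):
    """Ascending hex bus ID, as the post describes for nvidia-smi."""
    return sorted(gpus, key=lambda g: int(g.bus_id, 16))


def order_by_capability(gpus):
    """Compute capability first, then VRAM, then driver level,
    highest first, as the post describes for BOINC."""
    return sorted(
        gpus,
        key=lambda g: (g.compute_capability, g.vram_mib, g.driver),
        reverse=True,
    )


# Made-up sample host with three dissimilar cards.
gpus = [
    Gpu("card A", "01", (6, 1), 8192, (455, 45)),
    Gpu("card B", "0a", (7, 5), 8192, (455, 45)),
    Gpu("card C", "41", (7, 5), 11264, (455, 45)),
]

for index, gpu in enumerate(order_by_bus_id(gpus)):
    # The same bus number reads differently in hex and in decimal.
    print(f"bus order  {index}: {gpu.name} (bus 0x{gpu.bus_id} = {int(gpu.bus_id, 16)})")

for index, gpu in enumerate(order_by_capability(gpus)):
    print(f"capability {index}: {gpu.name}")
```

With this sample data, "card A" is device 0 in bus order but device 2 in capability order, which is exactly the kind of mismatch that makes it hard to tell which physical card a given number refers to.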

robsmith Volunteer tester Help desk expert Joined: 25 May 09 Posts: 1317	Message 102146 - Posted: 14 Dec 2020, 12:34:42 UTC There is already at least one open report of this (or related) behaviour on GitHub (https://github.com/BOINC/boinc/issues/3200). And I do agree that the way in which BOINC appears to arbitrarily change GPU identities can be a pain, either when trying to force a task to run on a particular GPU, or when trying to work out which GPU is misbehaving.

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5148	Message 102147 - Posted: 14 Dec 2020, 12:59:13 UTC - in response to Message 102146. All BOINC GPU detection is done at BOINC startup. I have no evidence of BOINC changing device number mapping while running - although a crash and reboot (or an automated OS update) can change things while no-one's looking. Again, BOINC does not re-assign tasks from one device to another arbitrarily. But there are multiple reasons why a task running on a GPU may stop: end of task-switch interval; pre-emption so a more urgent task can run; project application crash; OS/driver crash; user operation; and probably many more. Some of these may cause the entire task to fail, but often it will become 'waiting to run'. When a device of the same type becomes available, the waiting task will be started on that device. The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored.
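The 'first come, first served' rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described in the post, not the BOINC client's actual scheduler: the function name, the tuple layouts, and the sample tasks are all hypothetical.

```python
from collections import deque


def start_waiting_tasks(waiting, free_devices):
    """Start waiting tasks on free devices, first come, first served.

    waiting:      deque of (task_name, device_type, last_device_index)
    free_devices: list of (device_type, device_index)

    The device a task last ran on is deliberately ignored; only the
    device type has to match. Tasks that cannot start stay queued.
    """
    free = list(free_devices)
    started = []
    still_waiting = deque()
    while waiting:
        task_name, wanted_type, _last_device = waiting.popleft()
        # Take the first free device of the right type, whichever it is.
        slot = next((d for d in free if d[0] == wanted_type), None)
        if slot is None:
            still_waiting.append((task_name, wanted_type, _last_device))
            continue
        free.remove(slot)
        started.append((task_name, slot))
    waiting.extend(still_waiting)
    return started


# Made-up example: task_A last ran on device 0, but only device 1 is free.
waiting = deque([("task_A", "NVIDIA", 0), ("task_B", "NVIDIA", 1)])
started = start_waiting_tasks(waiting, [("NVIDIA", 1)])
print(started)        # task_A restarts on device 1, not device 0
print(list(waiting))  # task_B keeps waiting
```

In this sketch task_A resumes on device 1 even though it last ran on device 0, which is harmless for most applications but matters for any application that refuses to restart on a different device.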

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 901	Message 102167 - Posted: 15 Dec 2020, 6:56:26 UTC - in response to Message 102147. "The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored." Which is the peril you face when pausing or interrupting a GPUGrid task midstream on a host with dissimilar card types. It is pure chance whether it restarts on the same kind of card it started on. If it restarts on a different device from the one it started on, it will fail instantly with an error message that the task cannot be restarted on a different device. So I set the "switch between tasks" interval in the Manager to 6 hours, to let a task run to completion on the same device and prevent that occurrence. No protection against crashes or power interruptions, though.

Richard Haselgrove Volunteer tester Help desk expert Joined: 5 Oct 06 Posts: 5148	Message 102186 - Posted: 15 Dec 2020, 20:10:46 UTC - in response to Message 102179. If both cards have been programmed properly, and if both cards are running within their operational envelope, then both cards should compute the same mathematical answer, within tolerance. No problem at all swapping between them midway.

Ian&Steve C. Joined: 24 Dec 19 Posts: 239	Message 102189 - Posted: 15 Dec 2020, 20:52:38 UTC - in response to Message 102167. Even when the GPUs are as close to identical as possible, it can still fail. My 2 primary hosts at GPUGRID: one uses 8x EVGA RTX 2070 Black cards. These are more or less identical, and even have the same part number. The other uses 5x EVGA RTX 2080 Ti XC Ultra cards, again with the same part number. Sometimes when I reboot the system, it will fail tasks, complaining that they restarted on a "different device". Frustrating.

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 901	Message 102196 - Posted: 16 Dec 2020, 2:08:48 UTC Last modified: 16 Dec 2020, 2:10:34 UTC On my host with three identical EVGA 2080 XC Hybrid cards, I have been blessed with not a single occurrence of that kind of error when interrupting GPUGrid tasks and restarting them. Of course, now I have jinxed myself by boasting about it.

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 901	Message 102219 - Posted: 17 Dec 2020, 19:54:47 UTC I think all it takes to cause confusion is for the cards to have different device_id numbers from the vendor. The client code looks up the device and vendor IDs via the respective ATI/AMD and Nvidia APIs, and the device_id differs between, say, an EVGA 2080 Black and an EVGA 2080 XC card. So even if both cards have the same 2080 die, they are enumerated with different device_ids, and BOINC detects them as different even though they have the same CC, memory and driver.

robsmith Volunteer tester Help desk expert Joined: 25 May 09 Posts: 1317	Message 102354 - Posted: 28 Dec 2020, 17:36:26 UTC - in response to Message 102353. A couple of things: First, their home page says "with the SUPPORT of AMD & nVidia". Second, the FAQ only discusses the use of nVidia GPUs running CUDA. The home page does not say "Using AMD & nVidia GPUs" or anything like that. I would guess that AMD have either given them some money in the past or cut them a good deal on servers (or server space). One of the pages linked from the "Join us" page is a list of compatible GPUs (http://www.gpugrid.net/forum_thread.php?id=2507), and there are no AMD GPUs on the list. Thus they don't support AMD GPUs, and indeed they no longer distribute work to a lot of older nVidia GPUs. If you really do want to use your farm of AMD GPUs on GPUGRID, I would suggest you either volunteer your services to develop some applications or dig deep in your pocket and sponsor an AMD GPU developer or two.

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.