Message boards : BOINC client : Request: better GPU labelling
Author | Message |
---|---|
Joined: 28 Jun 10 Posts: 2789 |
I suspect this happens when task A on card 0 completes and task B on card 1 gets promoted to card 0, or some such. Is the Windows numbering constant? It is many years since I last ran Windows. On a side note, when Kings Cross station added another platform a few years ago, they called it platform 0 rather than renumber the existing platforms. |
Joined: 17 Nov 16 Posts: 901 |
It's even more confusing in Linux, where three different entities enumerate the GPUs in the system in totally different ways. The nvidia-smi app orders the cards 0 through N by ascending hex BusID. The NVIDIA X Server Settings app orders the cards 0 through N basically the same as Windows does: most of the time starting from the slot nearest the socket, but often by descending BusID from the slot closest to the socket to the slot at the bottom of the board. Oh, and the X Server app numbers the cards in decimal form, to confuse the issue even more. And finally, BOINC orders cards by capability, so basically the card with the highest performance is numbered card 0: compute capability (CC) first, then amount of VRAM, then driver level. |
Joined: 25 May 09 Posts: 1317 |
There is already at least one open report of this (or related) behaviour on GitHub (https://github.com/BOINC/boinc/issues/3200). And I do agree that the way BOINC appears to arbitrarily change GPU identities can be a pain, either when trying to force a task to run on a particular GPU or when trying to work out which GPU is misbehaving. |
Joined: 5 Oct 06 Posts: 5148 |
All BOINC GPU detection is done at BOINC startup. I have no evidence of BOINC changing device number mapping while running - although a crash and reboot (or an automated OS update) can change things while no-one's looking. Again, BOINC does not re-assign tasks from one device to another arbitrarily. But there are multiple reasons why a task running on a GPU may stop: end of task-switch interval; pre-emption so a more urgent task can run; project application crash; OS/driver crash; user operation; and probably many more. Some of these may cause the entire task to fail, but often it will become 'waiting to run'. When a device of the same type becomes available, the waiting task will be started on that device. The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored. |
Joined: 17 Nov 16 Posts: 901 |
"The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored." Which is the peril you will endure on pausing or interrupting a GPUGrid task midstream on a host with dissimilar card types. It is pure chance whether it restarts on the same kind of card it started on. If it restarts on a different device from the one it started on, it triggers an instant failure with an error message that the task cannot be restarted on a different device. So I set the "switch between tasks" interval in the Manager to 6 hours, to let a task run to completion on the same device and prevent that. No protection against crashes or power interruptions, though. |
Joined: 5 Oct 06 Posts: 5148 |
If both cards have been programmed properly, and if both cards are running within their operational envelope, then both cards should compute the same mathematical answer, within tolerance. No problem at all swapping between them midway. |
Joined: 24 Dec 19 Posts: 239 |
"The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored." Even when the GPUs are as close to being exactly the same as possible, it can still fail. My 2 primary hosts at GPUGRID: one uses 8x EVGA RTX 2070 Black cards. These are more or less identical, and even have the same part number. The other uses 5x EVGA RTX 2080 Ti XC Ultra, again with the same part number. Sometimes when I reboot the system, it will fail tasks, complaining that they restarted on a "different device". Frustrating. |
Joined: 17 Nov 16 Posts: 901 |
On my host with three identical EVGA 2080 XC Hybrid cards, I have been blessed with not a single occurrence of that kind of error when interrupting GPUGrid tasks and restarting them. Of course now I have jinxed myself for boasting of such. |
Joined: 17 Nov 16 Posts: 901 |
I think all it takes to get confused is to have different device_id numbers from the vendor. The client code looks up the device and vendor ID via the respective ATI/AMD and Nvidia APIs, and the device_id differs between, say, an EVGA 2080 Black and an EVGA 2080 XC card. So even if both cards have the same 2080 die, they are enumerated with different device_ids, and BOINC detects them as different even though they have the same CC, memory and driver. |
Joined: 25 May 09 Posts: 1317 |
A couple of things: First, their home page says "with the SUPPORT of AMD & nVidia". Second, the FAQ only discusses the use of nVidia GPUs running CUDA. The home page does not say "Using AMD & nVidia GPUs" or anything like that. I would guess that AMD have either given them some money in the past or cut them a good deal on servers (or server space). One of the pages linked from the "Join us" page is a list of compatible GPUs (http://www.gpugrid.net/forum_thread.php?id=2507), and there are no AMD GPUs on the list. Thus they don't support AMD GPUs, and indeed no longer distribute work to a lot of older nVidia GPUs. If you really do want to use your farm of AMD GPUs on GPUGRID, I would suggest you either volunteer your services to develop some applications or dig deep in your pocket and sponsor an AMD GPU developer or two. |
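
To illustrate the numbering mismatch discussed earlier in the thread, here is a minimal Python sketch of two of the orderings as the posts describe them: ascending hex bus ID (attributed to nvidia-smi) and "best card first" by compute capability, then VRAM, then driver (attributed to BOINC). It is not taken from either tool's source. The `Gpu` record, the function names, and the bus IDs and driver version in the sample host are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gpu:
    name: str
    bus_id: str                          # PCI bus ID in hex, e.g. "0a:00.0"
    compute_capability: Tuple[int, int]  # (major, minor)
    vram_mb: int
    driver: Tuple[int, int]              # driver version (hypothetical value)


def bus_order(gpus: List[Gpu]) -> List[Gpu]:
    """Ascending hex bus ID, as the thread describes for nvidia-smi."""
    return sorted(gpus, key=lambda g: int(g.bus_id.split(":")[0], 16))


def capability_order(gpus: List[Gpu]) -> List[Gpu]:
    """Best card first: CC, then VRAM, then driver, as the thread describes for BOINC."""
    return sorted(
        gpus,
        key=lambda g: (g.compute_capability, g.vram_mb, g.driver),
        reverse=True,
    )


# Hypothetical host: the bus IDs and driver version are made up for illustration.
GPUS = [
    Gpu("GTX 1070", "01:00.0", (6, 1), 8192, (470, 57)),
    Gpu("RTX 2080", "0a:00.0", (7, 5), 8192, (470, 57)),
    Gpu("RTX 2080 Ti", "41:00.0", (7, 5), 11264, (470, 57)),
]

for label, order in (("bus order", bus_order), ("capability order", capability_order)):
    print(label, [f"{i}: {g.name}" for i, g in enumerate(order(GPUS))])
```

On this made-up host the GTX 1070 is device 0 under bus order but device 2 under capability order, which is the kind of mismatch that makes it hard to tell which physical card a given number refers to.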
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.