Request: better GPU labelling

Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 102127 - Posted: 13 Dec 2020, 13:23:38 UTC - in response to Message 102124.  

I suspect this happens when task A on card 0 completes and task B on card 1 gets promoted to card 0, or some such. Is the Windows numbering constant? It is many years since I last ran Windows.

On a side note, when King's Cross station added another platform a few years ago, they called it platform 0 rather than renumber the existing platforms.
ID: 102127
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 867
United States
Message 102141 - Posted: 14 Dec 2020, 8:58:34 UTC

It's even more confusing in Linux, where three different entities enumerate the GPUs in a system in completely different ways.

The nvidia-smi app orders the cards 0 through N in ascending order of their hex BusID.

The Nvidia X Server Settings app orders the cards 0 through N much as Windows does: most of the time starting from the slot nearest the socket, but often in descending BusID order from the slot closest to the socket to the slot at the bottom of the board. Oh, and the X Server app numbers the cards in decimal form, to confuse the issue even more.

And finally, BOINC orders cards by capability, so the card with the highest performance is basically numbered card 0: compute capability (CC) first, then amount of VRAM, then driver level.
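
To make the difference concrete, here is a minimal Python sketch of two of the orderings described above. It is only an illustration of the sort keys as stated in this post, not BOINC's or Nvidia's actual code. The `Gpu` class, the bus IDs and the driver versions are made up for the example, and the X Server ordering is left out because it is not consistent enough to model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gpu:
    name: str
    bus_id: int                          # PCI bus number (hypothetical values)
    compute_capability: Tuple[int, int]  # (major, minor)
    vram_mb: int
    driver: Tuple[int, int]              # driver version (hypothetical values)


# Illustrative host with three dissimilar cards.
CARDS = [
    Gpu("GTX 1070", 0x01, (6, 1), 8192, (455, 45)),
    Gpu("RTX 2080", 0x0A, (7, 5), 8192, (455, 45)),
    Gpu("RTX 2080 Ti", 0x05, (7, 5), 11264, (455, 45)),
]


def smi_order(cards: List[Gpu]) -> List[Gpu]:
    """nvidia-smi style: ascending bus ID."""
    return sorted(cards, key=lambda g: g.bus_id)


def boinc_like_order(cards: List[Gpu]) -> List[Gpu]:
    """BOINC style as described above: best card first, compared by
    compute capability, then VRAM, then driver level."""
    return sorted(
        cards,
        key=lambda g: (g.compute_capability, g.vram_mb, g.driver),
        reverse=True,
    )


for label, order in (("nvidia-smi", smi_order(CARDS)),
                     ("BOINC-like", boinc_like_order(CARDS))):
    print(label, [f"{i}: {g.name}" for i, g in enumerate(order)])
```

With these made-up cards, the same physical card ends up with a different index in each list, which is exactly the confusion being described.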
ID: 102141
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1283
United Kingdom
Message 102146 - Posted: 14 Dec 2020, 12:34:42 UTC

There is already at least one open report of this (or related) behaviour on GitHub (https://github.com/BOINC/boinc/issues/3200).

And I do agree that the way BOINC appears to change GPU identities arbitrarily can be a pain, either when trying to force a task to run on a particular GPU or when trying to work out which GPU is misbehaving.
ID: 102146
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 102147 - Posted: 14 Dec 2020, 12:59:13 UTC - in response to Message 102146.  

All BOINC GPU detection is done at BOINC startup. I have no evidence of BOINC changing device number mapping while running - although a crash and reboot (or an automated OS update) can change things while no-one's looking.

Again, BOINC does not re-assign tasks from one device to another arbitrarily. But there are multiple reasons why a task running on a GPU may stop: end of task-switch interval; pre-emption so a more urgent task can run; project application crash; OS/driver crash; user operation; and probably many more. Some of these may cause the entire task to fail, but often it will become 'waiting to run'. When a device of the same type becomes available, the waiting task will be started on that device. The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored.
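
The 'first come, first served' rule can be sketched in a few lines of Python. This is a toy model of the behaviour described above, not the client's scheduler: the function name, the task dictionaries and the `last_device` field are all invented for illustration.

```python
from collections import deque


def assign_waiting_tasks(waiting, free_devices):
    """Pair waiting tasks with free devices strictly in arrival order.

    Each task's 'last_device' field is deliberately never consulted,
    mirroring 'irrelevant and ignored' above.
    """
    free = deque(free_devices)
    assignments = {}
    while waiting and free:
        task = waiting.popleft()
        assignments[task["name"]] = free.popleft()
    return assignments


# task_B was pre-empted first, so it is at the head of the queue.
waiting = deque([
    {"name": "task_B", "last_device": 1},
    {"name": "task_A", "last_device": 0},
])

result = assign_waiting_tasks(waiting, free_devices=[0, 1])
print(result)
```

In this example both tasks resume on a different device from the one they last ran on, simply because of the order in which they joined the queue.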
ID: 102147
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 867
United States
Message 102167 - Posted: 15 Dec 2020, 6:56:26 UTC - in response to Message 102147.  

The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored.

Which is the peril you will endure when pausing or interrupting a GPUGrid task midstream on a host with dissimilar card types.
It is pure chance whether it restarts on the same kind of card it started on.
If it restarts on a different device from the one it started on, it triggers an instant failure, with an error message that the task cannot be restarted on a different device.
So I set the "switch between tasks" interval in the Manager to 6 hours, to let a task run to completion on the same device and prevent that occurrence. No protection against crashes or power interruptions, though.
ID: 102167
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 102186 - Posted: 15 Dec 2020, 20:10:46 UTC - in response to Message 102179.  

If both cards have been programmed properly, and if both cards are running within their operational envelope, then both cards should compute the same mathematical answer, within tolerance. No problem at all swapping between them midway.
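
As a rough illustration of "the same mathematical answer, within tolerance", the Python sketch below compares two result vectors with a relative tolerance. The function name, the sample values and the tolerance of 1e-6 are assumptions chosen for the example; real projects define their own validation rules.

```python
import math


def results_agree(a, b, rel_tol=1e-6):
    """True if both result vectors have the same length and every pair
    of values matches within the given relative tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


# Hypothetical outputs from the same work unit on two different cards.
card0 = [1.0000000, 2.5, -3.25]
card1 = [1.0000001, 2.5, -3.25]   # differs only by rounding noise
faulty = [1.01, 2.5, -3.25]       # differs by far more than rounding

print(results_agree(card0, card1))
print(results_agree(card0, faulty))
```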
ID: 102186
Ian&Steve C.
Joined: 24 Dec 19
Posts: 228
United States
Message 102189 - Posted: 15 Dec 2020, 20:52:38 UTC - in response to Message 102167.  

The decision is 'first come, first served': whether it's the same device as last time is irrelevant and ignored.

Which is the peril you will endure when pausing or interrupting a GPUGrid task midstream on a host with dissimilar card types.
It is pure chance whether it restarts on the same kind of card it started on.
If it restarts on a different device from the one it started on, it triggers an instant failure, with an error message that the task cannot be restarted on a different device.
So I set the "switch between tasks" interval in the Manager to 6 hours, to let a task run to completion on the same device and prevent that occurrence. No protection against crashes or power interruptions, though.


Even when the GPUs are as close to identical as possible, it can still fail.

My 2 primary hosts at GPUGRID:
One uses 8x EVGA RTX 2070 Black cards. These are more or less identical, and even have the same part number.
The other uses 5x EVGA RTX 2080 Ti XC Ultra cards, again with the same part number.

Sometimes when I reboot the system, it will fail tasks, complaining that they restarted on a "different device".

frustrating.
ID: 102189
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 867
United States
Message 102196 - Posted: 16 Dec 2020, 2:08:48 UTC
Last modified: 16 Dec 2020, 2:10:34 UTC

On my host with three identical EVGA 2080 XC Hybrid cards, I have been blessed with not a single occurrence of that kind of error when interrupting GPUGrid tasks and restarting them.

Of course now I have jinxed myself for boasting of such.
ID: 102196
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 867
United States
Message 102219 - Posted: 17 Dec 2020, 19:54:47 UTC

I think all it takes to get confused is to have different device_id numbers from the vendor. The client code looks up the device and vendor IDs via the respective ATI/AMD and Nvidia APIs, and the device_id differs between, say, an EVGA 2080 Black and an EVGA 2080 XC card.

So even if both cards have the same 2080 die, they are enumerated with different device_ids, and so BOINC detects them as different even though they have the same CC, memory and driver.
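
If that theory holds, the effect can be shown with a small Python sketch. Everything here is hypothetical: the `CardInfo` class, the two comparison functions and the device_id values are invented to illustrate the idea and do not come from the BOINC source. Only the vendor ID 0x10DE is real (it is Nvidia's PCI vendor ID).

```python
from dataclasses import dataclass
from typing import Tuple

NVIDIA_VENDOR_ID = 0x10DE  # Nvidia's PCI vendor ID


@dataclass(frozen=True)
class CardInfo:
    vendor_id: int
    device_id: int                       # hypothetical values below
    compute_capability: Tuple[int, int]
    vram_mb: int
    driver: str                          # hypothetical version string


def same_capabilities(a: CardInfo, b: CardInfo) -> bool:
    """Compare only CC, memory and driver."""
    return (a.compute_capability, a.vram_mb, a.driver) == (
        b.compute_capability, b.vram_mb, b.driver)


def same_device(a: CardInfo, b: CardInfo) -> bool:
    """Stricter check that also requires matching vendor and device IDs."""
    return (a.vendor_id, a.device_id) == (b.vendor_id, b.device_id) \
        and same_capabilities(a, b)


# Two cards on the same die whose board variants report different IDs.
black = CardInfo(NVIDIA_VENDOR_ID, 0x1111, (7, 5), 8192, "455.45")
xc = CardInfo(NVIDIA_VENDOR_ID, 0x2222, (7, 5), 8192, "455.45")

print(same_capabilities(black, xc))
print(same_device(black, xc))
```

A check like `same_capabilities` would treat the two cards as interchangeable, while one like `same_device` would not, which is the kind of mismatch suggested above.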
ID: 102219
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1283
United Kingdom
Message 102354 - Posted: 28 Dec 2020, 17:36:26 UTC - in response to Message 102353.  

A couple of things:
First, their home page says "with the SUPPORT of AMD & nVidia".
Second, the FAQ only discusses the use of nVidia GPUs running CUDA.

The home page does not say "Using AMD & nVidia GPUs" or anything like that. I would guess that AMD have either given them some money in the past or cut them a good deal on servers (or server space).
One of the pages linked from the "Join us" page is a list of compatible GPUs (http://www.gpugrid.net/forum_thread.php?id=2507), and there are no AMD GPUs on the list.

Thus they don't support AMD GPUs, and indeed no longer distribute work to a lot of older nVidia GPUs.

If you really do want to use your farm of AMD GPUs on GPUGRID, I would suggest you either volunteer your services to develop some applications or dig deep into your pocket and sponsor an AMD GPU developer or two.
ID: 102354

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.