need help debugging a problem: Linux 7.16.1

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92562 - Posted: 24 Aug 2019, 11:28:21 UTC
Last modified: 24 Aug 2019, 12:27:09 UTC

This is likely a hardware problem. It is solved by rebooting, but I would like to know what could cause it. I can now build the (Linux) client, so I could look at where this occurs and possibly come up with an error message that notifies the user that the problem has started.

--- once every couple of days ---

On a 5-GPU rig, one of the GPUs crunches for 0-1 seconds and then goes on to another work unit. A queue of "waiting to run" tasks starts building up. Because there are 4 other working GPUs, they pull from this queue, so the queue grows only slowly. After about an hour or two there might be 40 items in the queue.

There are no error messages in the event log, and the work units all eventually finish and report back OK. There is just no productivity from the GPU that has the problem (assuming it is the same GPU).

There are "error" messages in the stderr file associated with the task.
https://setiathome.berkeley.edu/result.php?resultid=7986887720

Another problem (may be a feature): The GPUs are numbered 0..X, where 0 is given to the "best" GPU and larger numbers to the "weaker" ones. I do not know why BOINC bothers to rank GPUs. There seems to be no need, and it makes it difficult to find which GPU is causing the problem, assuming the problem is a unique GPU. Why can't BOINC use the same GPU number that nVidia uses in nvidia-smi or that ATi uses in "sensors"? Currently, I have to stop the fan spinning on a GPU, look at nvidia-smi to see which GPU has the stopped fan, note its BUS-ID, look that BUS-ID up in the file coproc_info.xml, and then look at BOINC Manager to see whether that "Dx" is the same "Dx" that is crunching for only 0-1 seconds. This is very awkward, and dangerous depending on the fan type. I can post a picture of my bloody finger if anyone wants to see it.

[edit] I may have looked in the wrong event log using BoincTasks. Next time I will check the event log more carefully for an error message.
ID: 92562
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92564 - Posted: 24 Aug 2019, 13:41:39 UTC
Last modified: 24 Aug 2019, 13:58:48 UTC

Using the following two error messages:
   Device cannot be used
  Cuda device initialisation retry 1 of 6, waiting 5 secs


I cannot find any matching phrase searching recursively through the 7.16.1 source:

grep -r "Cuda device initialisation retry"

grep -r "Device cannot be used" .

I did find "Cuda device initialisation retry" in the SETI source and spotted the following as an exit during Cuda initialisation:
	  boinc_temporary_exit(180,"Cuda device initialisation failed");


Somehow this error needs to get more visibility to the user. Possibly it is buried in the event log. All I see in the Manager is a lot of tasks "waiting to run", which is NOT an error but a symptom.

I was unable to find "Device cannot be used" anywhere, but if
	  boinc_temporary_exit(180,"Cuda device initialisation failed");

is reported to the client, then they did their job, even if it is not much.
ID: 92564
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 92568 - Posted: 24 Aug 2019, 16:58:12 UTC

BOINC checks the presence, capabilities, and status of coprocessors at startup. It is not designed as a continuous hardware health monitor: other tools are available for that.

As you have discovered, petri33's "Cuda 9.00 special" app for SETI is one such tool, drawing on earlier work contributed by JasonG and Raistmer.

The sample stderr you have linked shows the problems go deeper than you describe. It starts with repeated iterations of

setiathome_CUDA: CUDA Device 5 specified, checking...
   Device cannot be used
  Cuda initialisation FAILED, Initiating Boinc temporary exit (180 secs)
setiathome_CUDA: Found 3 CUDA device(s):
implying that TWO of your five cards have become inoperable since the task was originally assigned to device 5.

One thing to remember is that Petri's CUDA 9 app is designed to drive the cards as hard as possible (well, not quite as hard as the CUDA 10 app I've recently started using, but close). That much strain is highly likely to seek out and expose any weaknesses in the hardware and power supply in use.

The PCI Express Power specification allows:

25W (maximum) from PCIe x1 or x4 slots
75W from PCIe x8 or x16 slots
75W from 6-pin auxiliary connectors
150W from 8-pin auxiliary connectors

My dual GTX 1660 Ti machine is currently drawing about 360W from the wall, falling to a little over 300W when the CPU is idled. The rest of the machine is pretty minimal (single SSD, no mechanical or optical drives), so the GPUs must be drawing nigh on 150W each. With bus power and an 8-pin connector each, they're well supplied - and have shown no sign of power stress.

In the past, you've described using riser cables from < x8 PCIe slots. I don't have access to the full rig specifications, so you'll have to do the power maths yourself - but my instinct would be to do a full power audit: how much do the cards demand under load, and how is each card powered (what combination of auxiliary connector, direct feed from slot, feed from slot via riser cable, feed (if any) from PSU to outboard riser slot). Tot up total draw, total connector capacity, and total supply capability. Make sure that each total is bigger than the previous one.
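To put numbers on that audit, here is a rough sketch in C++ with made-up figures (not your rig's - substitute your own measurements and wiring):

// power_audit.cpp - illustrative only, with hypothetical figures.
// For each card: measured draw under load, and the capacity of whatever feeds it
// (slot or riser feed + auxiliary connectors). Check draw <= capacity <= PSU rating.
#include <cstdio>
#include <vector>

struct Card {
    const char* desc;
    double draw_w;      // measured draw under load (watts)
    double capacity_w;  // slot/riser feed + auxiliary connectors (watts)
};

int main() {
    std::vector<Card> cards = {
        {"GTX 1070, x16 slot + 8-pin",      150.0, 75.0 + 150.0},
        {"GTX 1060, x1 riser + 6-pin",      120.0, 25.0 + 75.0},   // marginal
        {"GTX 1060, powered riser + 6-pin", 120.0, 75.0 + 75.0},
    };
    const double psu_w = 850.0;   // PSU rating

    double total_draw = 0.0, total_capacity = 0.0;
    bool per_card_ok = true;
    for (const Card& c : cards) {
        bool ok = c.draw_w <= c.capacity_w;
        per_card_ok = per_card_ok && ok;
        printf("%-34s draw %4.0f W  capacity %4.0f W  %s\n",
               c.desc, c.draw_w, c.capacity_w, ok ? "ok" : "UNDERPOWERED");
        total_draw += c.draw_w;
        total_capacity += c.capacity_w;
    }
    printf("totals: draw %.0f W, connector capacity %.0f W, PSU %.0f W\n",
           total_draw, total_capacity, psu_w);
    bool totals_ok = total_draw <= total_capacity && total_capacity <= psu_w;
    printf("%s\n", (per_card_ok && totals_ok) ? "looks OK on paper"
                                              : "something is undersized - recheck the wiring");
    return 0;
}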

And that's all before we consider cooling and exhaust air impact on neighbouring cards.
ID: 92568
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92569 - Posted: 24 Aug 2019, 17:22:37 UTC - in response to Message 92568.  
Last modified: 24 Aug 2019, 17:58:55 UTC


My dual GTX 1660 Ti machine is currently drawing about 360W from the wall, falling to a little over 300W when the CPU is idled


This system draws 670W at the wall, and the power supply is either a 750W or 850W Seasonic Gold. I will have to pull it out to see exactly which it is. There are two GTX 1060s on a 4-in-1 splitter, and possibly those are the problem. Next time it fails I will remove the splitter and go with just 4 GPUs.

[EDIT] It is 850 watts. I used a DeWalt inspection camera to read the label. I managed to avoid knocking any of the x1 adapters loose on the rig under the power supply.
ID: 92569
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 92570 - Posted: 24 Aug 2019, 19:27:47 UTC - in response to Message 92569.  

Mine's a Corsair 650W 80+ bronze (according to the invoice), and hasn't had time to degrade - much - yet. The 360W draw is measured: I think that's comfortably below the 80% line (520W) I should be able to sustain without ill effects.

But the question I was asking was more about the cable paths and how much power they could sustain. If I'm imagining the concept of a 4-in-1 splitter correctly, that device should be limited to 75W from the motherboard: the connected cards might be limited to 25W each if the slots are wired as x4, or might be limited to half (37.5W) each. Either way, they'll be underpowered compared to what NVidia was expecting.
ID: 92570
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 92571 - Posted: 24 Aug 2019, 20:22:14 UTC

Some splitters do have external power lines, which help the situation by cutting the power lines from the "plug" and inserting the feed from the external supply in their place.
But, as others have found, the quality of many of those splitters and stand-offs is not that high, and price is not a good guide to quality :-(
ID: 92571
Bernie Vine
Volunteer moderator
Joined: 10 Dec 12
Posts: 322
Message 92572 - Posted: 24 Aug 2019, 20:40:14 UTC
Last modified: 24 Aug 2019, 20:42:25 UTC

I have one machine running 2 GTX 1060s, one on a riser, and I get this same error about once a week. It always seems to be the card on the riser. The first time it was cured by swapping to a "better" riser. It has failed 3 times since, and each time it is cleared by shutting down the machine and unplugging/re-plugging the PCIe "USB"-type cable.

I have also seen it once on another machine. On that one a new riser and cables seem to have cleared the problem.

I am using BOINC 7.14.2.
ID: 92572
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 92573 - Posted: 24 Aug 2019, 20:46:16 UTC

I'd be prepared to place a small bet that this is a hardware problem, not related to the software version (of either BOINC or SETI) in use.
ID: 92573
Bernie Vine
Volunteer moderator
Joined: 10 Dec 12
Posts: 322
Message 92574 - Posted: 25 Aug 2019, 6:52:03 UTC - in response to Message 92573.  
Last modified: 25 Aug 2019, 6:53:23 UTC

I'd be prepared to place a small bet that this is a hardware problem, not related to the software version (of either BOINC or SETI) in use.


I agree 100%, as this has never happened to me on a machine without GPU riser cards and cables.
ID: 92574
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92576 - Posted: 25 Aug 2019, 15:10:01 UTC - in response to Message 92574.  

I'd be prepared to place a small bet that this is a hardware problem, not related to the software version (of either BOINC or SETI) in use.


I agree 100%, as this has never happened to me on a machine without GPU riser cards and cables.


Risers and cables are a symptom of adding more GPUs to a motherboard than it was designed to use, or than the OS can manage, or than the drivers can handle.

I can run nvidia-smi in a loop all day with 2 or 3 video boards, and the fan speeds and usage are reported just fine. When I add additional GPUs I start seeing "ERR" under fan speed on random GPUs, and usage varies erratically.

We are pushing the envelope: "going where no BOINC program has gone before". At least for the 2-week WOW event.
ID: 92576
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 1283
United Kingdom
Message 92578 - Posted: 25 Aug 2019, 15:27:51 UTC

You are describing the symptoms of hitting a hardware wall.
A couple of years back I was involved in the management of a high-GPU-count computer - it had 64 GPUs. During development it was found that the use of "commercially available" risers and the like was fraught with problems due to the poor standards of manufacture of these devices (coupled with really diabolically inconsistent motherboards). Out of batches of 16 risers, of various types and manufacturers, the average failure rate was in excess of 50% during the burn-in period, with a further 50% of the survivors failing the next phase of testing. If you care to look through the SETI message boards you will see a user "Tom" who describes everything you are talking about in his endeavors to get a moderately high GPU count of about a dozen stable - I've lost count of how many risers, splitters and motherboards he has gone through.
On the other hand, "we" are in the middle of commissioning a "bit of a beast" of a computer with 256 GPUs. As a trial, a 64-GPU segment of it was loaded with BOINC, which detected all GPUs first time and "demolished" the thousands of tasks thrown at it. Because these tasks were duplicated from other computers, neither they nor the computer concerned will ever appear in the list of crunchers.
So no issues with BOINC not supporting large numbers of GPUs - with a dozen or so you aren't anywhere near the limits of BOINC yet.
ID: 92578
Bernie Vine
Volunteer moderator
Joined: 10 Dec 12
Posts: 322
Message 92580 - Posted: 25 Aug 2019, 16:06:59 UTC

Risers and cables are a symptom of adding more GPUs to a motherboard than it was designed to use, or than the OS can manage, or than the drivers can handle.


The computer I was describing has 2 PCIe x16 slots on a new motherboard; however, there is not really enough space to mount the two cards, and it would also cause airflow problems, so I use a riser to move one outside.

This machine has had several of these failures.

I have 3 machines with much older motherboards that do not have 2 PCIe x16 slots, so I am using risers in other PCIe slots, and they have no problems.

I don't think running two GTX 1060 GPUs on Ubuntu 18.04, on a 4-month-old motherboard with a new Corsair 650 watt PSU, is the problem.
ID: 92580
Gary Charpentier
Joined: 23 Feb 08
Posts: 2462
United States
Message 92582 - Posted: 25 Aug 2019, 21:46:44 UTC - in response to Message 92578.  

You are describing the symptoms of hitting a hardware wall.
A couple of years back I was involved in the management of a high-GPU-count computer - it had 64 GPUs. During development it was found that the use of "commercially available" risers and the like was fraught with problems due to the poor standards of manufacture of these devices (coupled with really diabolically inconsistent motherboards). Out of batches of 16 risers, of various types and manufacturers, the average failure rate was in excess of 50% during the burn-in period, with a further 50% of the survivors failing the next phase of testing. If you care to look through the SETI message boards you will see a user "Tom" who describes everything you are talking about in his endeavors to get a moderately high GPU count of about a dozen stable - I've lost count of how many risers, splitters and motherboards he has gone through.
On the other hand, "we" are in the middle of commissioning a "bit of a beast" of a computer with 256 GPUs. As a trial, a 64-GPU segment of it was loaded with BOINC, which detected all GPUs first time and "demolished" the thousands of tasks thrown at it. Because these tasks were duplicated from other computers, neither they nor the computer concerned will ever appear in the list of crunchers.
So no issues with BOINC not supporting large numbers of GPUs - with a dozen or so you aren't anywhere near the limits of BOINC yet.

You are describing the problem of a broken trace or contact. When new, it makes contact. When you plug the card in, it flexes enough to break the connection. Thermal expansion and contraction can also make/break the connection. It is purely a mechanical problem. Better-quality components and better care in assembly are the only solution.

For your work you might seriously consider putting the rig on a shake table before you even bother with the first burn-in test.
ID: 92582
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 92593 - Posted: 26 Aug 2019, 19:21:59 UTC - in response to Message 92562.  

Another problem (may be a feature): The GPUs are numbered 0..X, where 0 is given to the "best" GPU and larger numbers to the "weaker" ones. I do not know why BOINC bothers to rank GPUs. There seems to be no need, and it makes it difficult to find which GPU is causing the problem, assuming the problem is a unique GPU. Why can't BOINC use the same GPU number that nVidia uses in nvidia-smi or that ATi uses in "sensors"?
I see no one addressed this, but BOINC takes the numbering from the GPU drivers. It'll check whether the device is CUDA or OpenCL capable and give it a device number according to compute capability, driver version, RAM size and estimated FLOPs. BOINC will number all devices this way.
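
Roughly speaking, the ordering is a comparison like the sketch below (simplified, with my own illustrative struct and field names - not BOINC's actual types; the real logic lives in the client's GPU detection code, client/gpu_nvidia.cpp):

// Sketch only - simplified struct and field names, not BOINC's actual types.
// The client ranks detected NVIDIA GPUs with a comparison along these lines and
// then hands out device numbers D0, D1, ... in that order.
#include <algorithm>
#include <cstdio>
#include <vector>

struct GpuInfo {
    const char* name;
    int    cc_major, cc_minor;  // compute capability
    int    cuda_version;        // driver/CUDA version
    double ram;                 // global memory, bytes
    double est_flops;           // estimated peak FLOPs
};

// true if a should be ranked ahead of ("better" than) b
bool more_capable(const GpuInfo& a, const GpuInfo& b) {
    if (a.cc_major     != b.cc_major)     return a.cc_major     > b.cc_major;
    if (a.cc_minor     != b.cc_minor)     return a.cc_minor     > b.cc_minor;
    if (a.cuda_version != b.cuda_version) return a.cuda_version > b.cuda_version;
    if (a.ram          != b.ram)          return a.ram          > b.ram;
    return a.est_flops > b.est_flops;
}

int main() {
    std::vector<GpuInfo> gpus = {
        {"GTX 1070",    6, 1, 10010, 8e9, 6.5e12},
        {"GTX 1660 Ti", 7, 5, 10010, 6e9, 5.5e12},
        {"GTX 1060",    6, 1, 10010, 3e9, 4.0e12},
    };
    std::sort(gpus.begin(), gpus.end(), more_capable);
    for (size_t i = 0; i < gpus.size(); i++)
        printf("D%zu: %s\n", i, gpus[i].name);  // the 1660 Ti sorts to D0 on compute capability
    return 0;
}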

CUDA itself does this almost the same way. It'll give device 0 to the fastest device, but then it'll stop indexing devices. So the next devices are not sorted by speed and can get any device ID when detected by CUDA. This means that the GPUs in any slot other than the one getting DeviceID 0 can switch their device ID numbers.

From what I read around, if you set environment variable
export CUDA_DEVICE_ORDER=PCI_BUS_ID
the GPU IDs will be ordered by pci bus IDs and will show the same output as in nvidia-smi.
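
A quick way to see what the CUDA runtime reports on a given box is a small test program like the sketch below (assuming the CUDA toolkit is installed). Run it with and without CUDA_DEVICE_ORDER=PCI_BUS_ID set and compare the order against nvidia-smi:

// list_cuda_devices.cu - sketch: print each CUDA device's index, name and PCI IDs.
// Build with: nvcc list_cuda_devices.cu -o list_cuda_devices
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // pciBusID / pciDeviceID correspond to the Bus-Id column in nvidia-smi
        printf("device %d: %s  pci %02x:%02x.0\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}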
ID: 92593
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92595 - Posted: 26 Aug 2019, 21:32:07 UTC - in response to Message 92593.  
Last modified: 26 Aug 2019, 21:35:59 UTC


From what I read around, if you set environment variable
export CUDA_DEVICE_ORDER=PCI_BUS_ID
the GPU IDs will be ordered by pci bus IDs and will show the same output as in nvidia-smi.


Thanks Jord!

I tried that, first in bash, and then ran /etc/init.d/boinc-client restart.
That did not work, so I then edited "profile" and rebooted.
I had the same problem, but at least the variable was not missing when I logged in using xterm.
I then ran /etc/init.d/boinc-client restart while in bash, but no change.

I get the following all the time. As you can see, nvidia-smi reports a different order. The BOINC Manager matches the coproc_info.xml file.

TERM=xterm
SHELL=/bin/bash
CUDA_DEVICE_ORDER=PCI_BUS_ID
SHLVL=1
LOGNAME=jstateson
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
XDG_RUNTIME_DIR=/run/user/1000
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
LESSOPEN=| /usr/bin/lesspipe %s
_=/usr/bin/printenv
jstateson@tb85-nvidia:~$ cd /var/lib/boinc-client/
jstateson@tb85-nvidia:/var/lib/boinc-client$ grep -i gtx coproc_info.xml
   <name>GeForce GTX 1660 Ti</name>
   <name>GeForce GTX 1070 Ti</name>
   <name>GeForce GTX 1070</name>
   <name>GeForce GTX 1070</name>
   <name>GeForce GTX 1060 3GB</name>
   <name>GeForce GTX 1060 3GB</name>
   <name>GeForce GTX 1060 3GB</name>
      <name>GeForce GTX 1660 Ti</name>
      <name>GeForce GTX 1070 Ti</name>
      <name>GeForce GTX 1070</name>
      <name>GeForce GTX 1070</name>
      <name>GeForce GTX 1060 3GB</name>
      <name>GeForce GTX 1060 3GB</name>
      <name>GeForce GTX 1060 3GB</name>
jstateson@tb85-nvidia:/var/lib/boinc-client$ nvidia-smi
Mon Aug 26 15:52:26 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
|100%   41C    P8    13W / 180W |     12MiB /  8117MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:02:00.0 Off |                  N/A |
|100%   46C    P8    12W / 151W |      9MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
|100%   40C    P8     8W / 120W |      9MiB /  3019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 166...  Off  | 00000000:04:00.0  On |                  N/A |
|100%   42C    P8    16W / 120W |     17MiB /  5944MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    Off  | 00000000:05:00.0  On |                  N/A |
|100%   37C    P8     9W / 151W |     18MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 106...  Off  | 00000000:08:00.0 Off |                  N/A |
|100%   41C    P5     7W / 120W |      9MiB /  3019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 106...  Off  | 00000000:0A:00.0 Off |                  N/A |
|100%   41C    P8     9W / 120W |      9MiB /  3019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+


I then went to boinc-master and did a recursive grep for CUDA_DEVICE_ORDER.
Nothing showed up, but I did get a hit on PCI_BUS_ID, though I am pretty sure it is not used for ranking:
---------- GPU_NVIDIA.CPP
    CU_DEVICE_ATTRIBUTE_PCI_BUS_ID = 33,
        (*p_cuDeviceGetAttribute)(&cc.pci_info.bus_id, CU_DEVICE_ATTRIBUTE_PCI_BUS_ID, device);


It would really be useful for debugging purposes (hardware or software) if the GPU0...GPU6 numbering shown by nVidia matched the D0..D6 shown by BoincTasks or BOINC Manager.

Back around 2007, before I retired, I took a picture of myself standing in front of a 4096-blade system that took up an entire bay - maybe 8 huge racks of servers that were being shipped to an Okinawa army base. There was a problem with the 1394a control interface. No one pointed fingers or complained about hardware. It just had to be fixed, and fixed it was, in software. I know of no way to identify GPUs other than stopping the fan and noting which device stopped. I will be more careful of where I put my finger in the future.
ID: 92595
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 92613 - Posted: 27 Aug 2019, 19:45:43 UTC - in response to Message 92595.  

I didn't mean that with that bit of code you could change the output of BOINC. BOINC will always follow the compute capability, driver version, RAM size and estimated FLOPs sequence to designate the device IDs, with the best GPU always being device 0, the second best device 1, etc. Of course, if all devices are the same on all four comparison points, then it gets tricky.

I got the code entry from this Stack Overflow thread, while this thread in the Nvidia Dev forums suggests that the device ID numbering doesn't always follow the PCIe slot order, so don't assume that just because a GPU is in the slot nearest the CPU, that device gets DeviceID 0.
ID: 92613
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92621 - Posted: 28 Aug 2019, 13:59:32 UTC
Last modified: 28 Aug 2019, 14:45:41 UTC

I have had this happen again. AFAICT it is caused by the GPUs that are on a splitter.

I was thinking about something along this line:

Instead of
 boinc_temporary_exit(180,"Cuda device initialisation failed");


do this instead, since the app knows which device failed:


 boinc_temporary_exit(180,"Cuda device=7 initialisation failed");


Unless I am mistaken, that string is passed back to the BOINC client, as it shows up in the message log.
The device ID could be extracted by the client, and it would then know which CUDA device was defective.
Could this info be used to prevent tasks from being assigned to that device?
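
Something along these lines in the app would do it - a sketch only, not the actual SETI code; "dev" stands for whatever variable already holds the assigned device number:

#include <cstdio>
#include "boinc_api.h"   // boinc_temporary_exit()

// Sketch: format the failing device number into the temporary-exit message so it
// shows up in the client's event log. Not the actual SETI code.
void report_cuda_init_failure(int dev) {
    char msg[256];
    snprintf(msg, sizeof(msg), "Cuda device=%d initialisation failed", dev);
    boinc_temporary_exit(180, msg);
}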

Another piece of the puzzle from this task (not sure how long it will remain in the SETI database).

The SETI app reports the following:
In cudaAcc_initializeDevice(): Boinc passed DevPref 7
setiathome_CUDA: CUDA Device 7 specified, checking...
   Device cannot be used
  Cuda device initialisation retry 1 of 6, waiting 5 secs...
setiathome_CUDA: Found 6 CUDA device(s):
  Device 1: GeForce GTX 1660 Ti, 5944 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 4, pciSlotID = 0
  Device 2: GeForce GTX 1070 Ti, 8117 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 19 
     pciBusID = 1, pciSlotID = 0
  Device 3: GeForce GTX 1070, 8119 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 15 
     pciBusID = 5, pciSlotID = 0
  Device 4: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 3, pciSlotID = 0
  Device 5: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 8, pciSlotID = 0
  Device 6: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9 
     pciBusID = 10, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 7


In writing the following, I assume the stderr output is from the app, not BOINC:

Clearly, the client says to use device 7 (DevPref 7). It does not know that the device is defective; if it did, it would not have recommended that device. It needs some feedback from the app to make that decision. Complicating this is that "Device cannot be used" might apply to this project's app while some other project's app might not have a problem using the device. However, the above series of messages seems strange: why is the app even trying other devices if it was given a preference of "7"? This is best answered by the project, but the BOINC developers should be aware that the app is trying devices other than what was recommended because 7 had a problem. The client (IMHO) would never launch an app unless a resource was available. ******
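
For reference, the "DevPref" the app prints should just be the device number the client hands to the app at startup. As far as I know it arrives via the BOINC API init data (older apps got a --device argument on the command line instead); a rough sketch of the app side, assuming the standard API:

#include "boinc_api.h"   // boinc_get_init_data(), APP_INIT_DATA

// Sketch only: how a GPU app finds out which device the client assigned to it.
// Field names are from the BOINC API as I understand it - check app_ipc.h.
int assigned_gpu_device() {
    APP_INIT_DATA aid;
    boinc_get_init_data(aid);
    // gpu_device_num is the index the client chose for this task;
    // the app is expected to use that device, not to go shopping for another one.
    return aid.gpu_device_num;
}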

Another factor is the pciBusID. As shown above, they are numbered 4, 1, 5, 3, 8, 10. Note that "2" is missing.

When I ran nvidia-smi on the system that generated the above stderr output, I got the following: ***
jstateson@tb85-nvidia:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost.  Reboot the system to recover this GPU 


What is interesting is that BOINC runs just fine, and the SETI apps on the other 6 GPUs also run fine, but nvidia-smi cannot get the handle to one of its GPUs and simply says to reboot the system. Handles are provided by the OS (Ubuntu 18.04). That GPU, "Device 7", is hung; nvidia-smi says its bus ID is 2, the client in the message log uses devices D0...D6, and the SETI app uses 1..7. I do not know how the numbering of the bus IDs works. One would think that the device driver's numbering would be used rather than a made-up number (1..7) or (0..6), etc.

Also, I suspect this forum is not the place to offer constructive criticism. It is a public forum for questions/problems about running the client or Manager, and criticism here tends to bring out tribal instincts from non-programmers. For Gridcoin, the programmers tend to use Steemit or Reddit. GitHub also has a forum. Maybe there is a better place to discuss this, assuming anyone really wants to.

*** I thought that was funny. It reminded me of a project for the Canadian Navy I worked on. The contract specified that the system had to run a minimum of 24 hours without rebooting. Here, there is a problem with the GPU and the driver has lost communication, so nvidia-smi recommends a reboot of the system. If the GPUs were each assigned target acquisition, that could be a real problem in a naval conflict. Fortunately, BOINC is not a mission-critical app, nor is SETI.

****** If a resource is available and an app is launched, and that app fails to use that resource and then repeatedly tries to find another resource, it seems this could cause a race between itself and the client, as I assume the client is also looking for open resources. If a resource is freed, say GPU-x, and the client gets x as a resource, possibly the app could also get that same x, which could cause a conflict. I am also seeing left-over tasks that the client cannot terminate:
7209	SETI@home	8/10/2019 3:34:45 PM	[error] garbage_collect(); still have active task for acked result blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5

It is just a guess/speculation that these are related, but they only show up on my systems that have splitters to add additional GPUs.
ID: 92621
mmonnin

Joined: 1 Jul 16
Posts: 146
United States
Message 92627 - Posted: 28 Aug 2019, 19:03:45 UTC - in response to Message 92613.  

I didn't mean that with that bit of code you could change the output of BOINC. BOINC will always follow the compute capability, driver version, RAM size and estimated FLOPs sequence to designate the device IDs, with the best GPU always being device 0, the second best device 1, etc. Of course, if all devices are the same on all four comparison points, then it gets tricky.

I got the code entry from this Stack Overflow thread, while this thread in the Nvidia Dev forums suggests that the device ID numbering doesn't always follow the PCIe slot order, so don't assume that just because a GPU is in the slot nearest the CPU, that device gets DeviceID 0.


BOINC always shows my cards as 2x 1070s, even though they are a 1070 and a 1070 Ti. They are not ordered by best/fastest/FLOPs. The 1070 is in the 1st slot and the 1070 Ti is in the 2nd slot.
ID: 92627
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 92630 - Posted: 28 Aug 2019, 21:01:55 UTC - in response to Message 92621.  

Have you considered posting the issue to the BOINC GitHub repository? Once the issue is logged, you can track whether any developers pick up the bug to be worked on. Also, invariably they will ask for a simulation of the problem. Have you started a simulation scenario of the system yet? Since it deals with loss of coprocessors, I would set the logging flag for coproc_debug at minimum, and probably add rr_simulation and maybe slot_debug (example config below).

https://boinc.berkeley.edu/sim_web.php
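
For reference, those flags go in the <log_flags> section of cc_config.xml in the BOINC data directory, then use "Options / Read config files" in the Manager or restart the client - something like:

<cc_config>
  <log_flags>
    <coproc_debug>1</coproc_debug>
    <rr_simulation>1</rr_simulation>
    <slot_debug>1</slot_debug>
  </log_flags>
</cc_config>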

I tried the environment variable suggestion on my daily driver. It did absolutely nothing - it didn't change the ordering or identification of any of the cards in the host. And yes, I logged out and back in to pick up the change to the environment. So that was a bust. I was hoping that would be the magic bullet to stop the PCI BusID reordering every time you change xorg.conf with the coolbits tweak.
ID: 92630

