Radeon RX 7900 XTX Linux (Fedora) Crashes Frequently

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112122 - Posted: 25 Jun 2023, 0:22:37 UTC

I saw user @1082528 (magic_sam)'s post about problems getting BOINC to recognize a 7900 GPU on Ubuntu LTS, but that experience seems very different from mine on Fedora. Since Fedora has been including (some) ROCm pkgs in its distribution, no third-party software is necessary to get OpenCL running. This has been (and is) working perfectly fine with a 6800XT as I write. (Although I get more invalid WUs than I would like.)

When I put the 7900 in the same system, I get frequent crashes of the desktop (but not the rest of the system) while crunching with BOINC. I've tried running it in my system on three different occasions and, eventually, it starts crashing so frequently that I cannot use the system. Einstein@Home is the only project for which I do GPU crunching. I'm posting here since I'm hoping more folks here may be using the card for other projects.

I've even had the card RMA-ed and I've been talking to the manufacturer about the problem. I originally thought it was a temperature issue, since I had strong evidence that was a problem with my 5700XT and the 7900 lacks any fan control on Linux, but I'm now convinced temps are not the issue. My best idea now is that there is a problem with AMDGPU or the OpenCL stack. There are lots of reports against AMDGPU of crashes within various games, but I don't have that problem; my issues occur outside of games.

- Is anyone else having frequent crashes with OpenCL on the 7900?

- Any advice?

I'm wondering if I should try to debug E@H as it runs, but that sounds difficult.
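Before attaching a debugger to E@H, a lighter first step might be to look at what the kernel logged around a crash. The following is only a minimal sketch: the keyword list is an assumption chosen for illustration, not an official list of amdgpu messages, and the function names are hypothetical.

```python
import subprocess

# Assumed keywords: substrings that often accompany GPU hangs or resets.
# These are illustrative guesses, not an official list of amdgpu messages.
KEYWORDS = ("timeout", "reset", "fault")


def suspicious_lines(lines, driver="amdgpu", keywords=KEYWORDS):
    """Return log lines that mention the driver and at least one keyword."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if driver in lowered and any(k in lowered for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits


def kernel_log_current_boot():
    """Read kernel messages for the current boot via journalctl.

    Returns an empty list if journalctl is not available.
    """
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "0", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()
```

Since only the desktop goes down and the rest of the system stays up, calling `suspicious_lines(kernel_log_current_boot())` right after a crash should still see the current boot's messages. Reading the kernel journal may require root or membership in a journal-reading group, depending on the system.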

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 112125 - Posted: 25 Jun 2023, 7:38:21 UTC - in response to Message 112122.  

Consumer AMD cards are generally a poor choice for compute. They are meant to be used for gaming only. The consumer drivers are woefully inadequate and poorly developed, and debugging or properly designing them for compute is not a priority for the developer.

Only the professional AMD compute cards should be used for compute. The drivers for those cards are actively developed, the opposite of the consumer drivers.

You would have better luck running the latest beta drivers from the github sources.

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112143 - Posted: 26 Jun 2023, 9:55:27 UTC - in response to Message 112125.  

Hi Keith,

That is all true. However, I've had great--well, we can call it luck, then--with AMD for the past 10 years. I wouldn't be so concerned or upset if it wasn't so far outside my typical experience. AMD is currently occupying a very unique space and I've become quite dependent on them, I now realize. This is the first experience I've ever had like this. In fact, it's probably the worst experience with hardware I've ever had, at least since the time in the 90s when I got a shaved and re-labeled CPU from a highly reputable local reseller.

I think it is likely a driver problem. Since the professional cards you mention are not even out yet, I'm hoping the drivers will improve when those cards finally hit the retail market. AMD has made a commitment to support the OSS ecosystem, so I don't think my expectations are wholly unreasonable. The consumer cards are expensive enough. If I ever had a hope of getting a PRO card, it is long gone since the price spike last year.

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 112151 - Posted: 26 Jun 2023, 21:14:20 UTC

The legacy AMD drivers for the older series seem to be mostly stable once you get past the arcane installation of OpenCL.

The newest architectures are just not ready for primetime for compute. Maybe in a year or so things will have gotten better.

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112154 - Posted: 26 Jun 2023, 23:11:44 UTC - in response to Message 112151.  

That would be good in the aggregate. But, it's a disaster for me with this card. I don't know what I'm going to do.

I feel like the new drivers are much better, overall, because the installation issues have mostly been resolved, despite the fact that my distro isn't supported. Catalyst is really, really old, too. The current drivers eliminated direct OpenCL support years ago. I think all of RDNA2 was ROCm-based only. So, this feels like an RDNA3-specific issue, not a driver development issue, although I grant that it is hard to make such a distinction; both issues could occur at the same time.

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 112156 - Posted: 27 Jun 2023, 1:44:42 UTC - in response to Message 112154.  

Please read this message for some hints about getting the 7900XTX working.
https://boinc.berkeley.edu/forum_thread.php?id=14916&postid=111107
Also check that the ICD file in /etc/OpenCL/vendors is pointing to the correct OpenCL library, probably in /usr/lib/x86_64-linux-gnu.
I've read that the vendors list often points to the wrong location for the OpenCL library.
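The ICD check suggested above can be sketched roughly as follows. This assumes the usual ICD loader convention that each `.icd` file holds a single line naming the vendor library, either as an absolute path or as a bare library name; `check_icd_dir` is a hypothetical helper name, and the library directory will differ between distros.

```python
import os
from pathlib import Path


def check_icd_dir(vendors_dir="/etc/OpenCL/vendors"):
    """Map each .icd file name to (library entry, status).

    Status is "ok" or "missing" for absolute paths, "bare-name" when the
    entry is left to the dynamic loader, and "empty" for a blank file.
    """
    report = {}
    for icd in sorted(Path(vendors_dir).glob("*.icd")):
        text = icd.read_text().strip()
        entry = text.splitlines()[0].strip() if text else ""
        if not entry:
            status = "empty"
        elif os.path.isabs(entry):
            status = "ok" if os.path.exists(entry) else "missing"
        else:
            # A bare name is resolved through the dynamic loader's search
            # path, which this sketch does not try to emulate.
            status = "bare-name"
        report[icd.name] = (entry, status)
    return report
```

More than one entry in the report, or any entry marked "missing", would be worth a closer look. A "bare-name" entry is not necessarily wrong; it only means the check has to continue with the loader's library path.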

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112158 - Posted: 27 Jun 2023, 5:04:49 UTC - in response to Message 112156.  

I had seen that thread; it's the one I was referring to in my original post. It's interesting how different the distributions can be. My symptoms are quite different: I had a few results marked as 'invalid', but very few actual errors. The errors that did occur were catastrophic, and the driver couldn't contain them, causing the whole display subsystem to fail. That's the issue I mean to bring up here. When it crunches and completes work, it seems fine. Perhaps more invalids than my 6800XT, but it's also faster, so, perhaps the same rate of invalids, but I don't know because it won't stay running for 24 hrs straight, so I can't get a good average!

I checked the ICD on my system, as you suggest. Thanks for that suggestion. AFAICT, it is correct. I only have one icd file; it's owned by the same package (ROCm-opencl v.5.5.1) that owns the library to which it points, and that .so exists and has a good sig. It also seems pretty current, but internal vs external numbering makes it hard to know.

I'm hoping the manufacturer has something helpful to tell me, but that will take a few more days. Sounds like I got a lemon, though, at this point. An expensive lemon. Everyone thought that things would get fixed, driver-wise, when the PRO cards got released, but that happened two months ago. So, I'm pretty confused.

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 869
United States
Message 112161 - Posted: 27 Jun 2023, 7:55:07 UTC - in response to Message 112158.  

The ICD file can be incorrectly linked when the distro installs the basic AMD Mesa drivers and the ICD for that set of drivers is the one pointed to. I've seen that error a lot in forum posts about AMD cards. If I remember correctly, the ROCm drivers put the ICD file in a different location, hence the conflict.

You might try power limiting the card to test for stability. It could simply be that the memory or processor is running too hot and crashing out.
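One way to try the power-limit test is through the hwmon files that the amdgpu kernel driver exposes under sysfs. This is a minimal sketch under a few assumptions: the card is `card0` (it may be another index on a given system), the `power1_cap*` files are present and expressed in microwatts, and the helper names are made up for illustration.

```python
from pathlib import Path


def find_hwmon(card="card0", root="/sys/class/drm"):
    """Return the first hwmon directory for the given DRM card, or None."""
    candidates = sorted(Path(root, card, "device", "hwmon").glob("hwmon*"))
    return candidates[0] if candidates else None


def scaled_cap(hwmon_dir, fraction):
    """Compute a power cap in microwatts as a fraction of the card's maximum,
    clamped to the range the driver reports."""
    cap_max = int(Path(hwmon_dir, "power1_cap_max").read_text())
    cap_min = int(Path(hwmon_dir, "power1_cap_min").read_text())
    return max(cap_min, min(cap_max, int(cap_max * fraction)))


def apply_cap(hwmon_dir, microwatts):
    """Write the new cap. On a real system this needs root."""
    Path(hwmon_dir, "power1_cap").write_text(str(microwatts))
```

For example, `apply_cap(hw, scaled_cap(hw, 0.8))` with `hw = find_hwmon()` would request roughly 80% of the maximum. The setting does not persist across reboots, so it is reasonably safe to experiment with.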

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112173 - Posted: 27 Jun 2023, 17:40:46 UTC - in response to Message 112161.  

Right. I was really convinced for a long while that it was thermal. But, with the manufacturer's help, I think we ruled that out. I was using boinccmd to manually throttle GPU crunching and keep temps very low, like 75 C junction temps, and that didn't improve the symptoms. Also, I let the thing run hot hot hot, like 100 C junction, for over an hour before it crashed, which wasn't worse than when I kept it cold.
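The kind of manual throttling described here could be automated along the following lines. This is a sketch under stated assumptions, not the exact procedure used in the tests above: it assumes the amdgpu hwmon directory exposes `temp*_label` / `temp*_input` files with a sensor labelled `junction` reporting millidegrees Celsius, and it pauses GPU work with `boinccmd --set_gpu_mode never <seconds>`. The function names and the 75 C default are illustrative.

```python
import subprocess
from pathlib import Path


def read_temp_c(hwmon_dir, label="junction"):
    """Return the temperature in Celsius of the sensor with this label, or None."""
    for label_file in sorted(Path(hwmon_dir).glob("temp*_label")):
        if label_file.read_text().strip() == label:
            input_file = label_file.with_name(
                label_file.name.replace("_label", "_input"))
            return int(input_file.read_text()) / 1000.0  # millidegrees -> degrees
    return None


def pause_command(seconds=60):
    """Build the boinccmd call that suspends GPU work for a limited time."""
    return ["boinccmd", "--set_gpu_mode", "never", str(seconds)]


def throttle_once(hwmon_dir, limit_c=75.0, runner=subprocess.run):
    """Pause GPU crunching if the junction temperature exceeds the limit.

    Returns True when a pause was requested, False otherwise.
    """
    temp = read_temp_c(hwmon_dir)
    if temp is not None and temp > limit_c:
        runner(pause_command(), check=False)
        return True
    return False
```

Calling `throttle_once(...)` every few seconds from a loop gives a crude duty cycle: the temporary "never" mode expires on its own, after which BOINC falls back to its normal GPU mode. Logging the value from `read_temp_c` alongside crash times is also a simple way to check whether crashes correlate with temperature at all.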

My 5700XT had a problem with thermals. It behaved *very* similarly before I increased the fan curve, then all problems went away. So, I was *sure* this problem was also thermal. But, after a lot of testing, it doesn't seem correlated to thermals. The manufacturer is quite confident about this, and I reluctantly agree, finally.

Another confusing factor is that, since this problem started so long ago, I've actually upgraded my distro's major version (and many weekly updates) during the course of testing. In that time, I've had rocm-opencl updates, mesa updates, and LLVM updates. My system still has LLVM15 libs, which I *think* is what is being used for ROCm OCL, but the default/main LLVM is v16. If there is a problem with the driver and/or OCL stack, I wonder if it could be something with LLVM? I mean, that makes some sense, right?

Paul
Joined: 8 Dec 13
Posts: 51
United States
Message 112176 - Posted: 27 Jun 2023, 20:36:25 UTC - in response to Message 112173.  

Okay, it looks like at least Sam was able to get it working and is crunching with the 7900. System looks stable, turning in lots of tasks consistently. Sam used the AMD installer on Ubuntu LTS. It's not clear how much OSS that includes, but it could be the proprietary driver, too; I don't understand that installer very well.

Devastating, but I guess that's something I have to expect. Not sure what to do. It's bad enough that I would have to do a full reinstall on an encrypted system, but to have to switch to another distro, and LTS at that...I'm used to getting the latest kernels. But, I suppose I could live without it. That will take a lot of planning, even if I decided to do it.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.