Recent frequent linux errors with amdgpu, kernel 5.5+, multiple projects

Paul

Joined: 8 Dec 13
Posts: 35
United States
Message 97232 - Posted: 2 Apr 2020, 6:20:58 UTC

Two weekends ago I did a normal system update and got kernel 5.5. Since then, I've been getting a lot of this:

Apr 01 22:24:41 hostname kernel: [drm:amdgpu_ttm_backend_bind [amdgpu]] *ERROR* failed to pin userptr
Apr 01 22:24:41 hostname kernel: ------------[ cut here ]------------
Apr 01 22:24:41 hostname kernel: kernel BUG at mm/slub.c:304!
Apr 01 22:24:41 hostname kernel: invalid opcode: 0000 [#1] SMP NOPTI
Apr 01 22:24:41 hostname kernel: CPU: 1 PID: 11895 Comm: setiathome_8.22 Not tainted 5.5.11-200.fc31.x86_64 #1
Apr 01 22:24:41 hostname kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS PRO WIFI/B450 AORUS PRO WIFI-CF, BIOS F50 11/27/2019
Apr 01 22:24:41 hostname kernel: RIP: 0010:kfree+0x23c/0x250
...
Apr 01 22:24:41 hostname kernel: ---[ end trace caf6b7bf7cc304f1 ]---


That's about 80 occurrences in roughly 10 days, and the same thing happens with Einstein@Home. I updated the kernel to 5.5.11 and also updated the BIOS, but the error came right back after boot. As far as I can tell, everything was fine before the update. Is this a known issue?
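
For anyone who wants to put a number on how often this happens, here is a minimal sketch that counts the error per day in saved kernel log text. It assumes the log was exported as plain syslog-style lines (for example with `journalctl -k`) in the same format as the excerpt above; the function name `count_pin_errors` and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches the driver error line quoted in the log excerpt above.
PIN_ERROR = re.compile(r"\*ERROR\* failed to pin userptr")
# Syslog-style prefix, e.g. "Apr 01 22:24:41 hostname kernel: ..."
STAMP = re.compile(r"^(\w{3}\s+\d{1,2}) \d{2}:\d{2}:\d{2} ")


def count_pin_errors(lines):
    """Return a Counter mapping 'Mon DD' to the number of pin failures."""
    per_day = Counter()
    for line in lines:
        if PIN_ERROR.search(line):
            m = STAMP.match(line)
            # Lines without a recognizable timestamp are grouped together.
            per_day[m.group(1) if m else "unknown"] += 1
    return per_day


# Two lines copied from the excerpt; only the first is a pin failure.
sample = [
    "Apr 01 22:24:41 hostname kernel: [drm:amdgpu_ttm_backend_bind [amdgpu]] *ERROR* failed to pin userptr",
    "Apr 01 22:24:41 hostname kernel: kernel BUG at mm/slub.c:304!",
]
print(count_pin_errors(sample))
```

To run it on a real log, pass an open file object instead of `sample`, since iterating a file yields its lines.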
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14232
Netherlands
Message 97233 - Posted: 2 Apr 2020, 8:26:50 UTC - in response to Message 97232.  

It's a problem with your kernel, not with BOINC or the applications thereunder. Search Google for "[drm:amdgpu_ttm_backend_bind [amdgpu]] *ERROR* failed to pin userptr" and you'll find a lot of patches for different kernel versions.

E.g.
This patch set is to fix a bug in amdgpu / radeon drm that results in a crash when dma_map_sg combines elements within a scatterlist table.

There are 2 shortfalls in the current kernel.

1) AMDGPU / RADEON assumes that the requested and created scatterlist table lengths returned from dma_map_sg are equal. This may not be the case with the newer dma-iommu implementation.

2) drm_prime does not fetch the length of the scatterlist via the correct dma macro; this can result in the incorrect length being used (>0) in places where dma_map_sg has updated the table elements.

The sg_dma_len macro is representative of the length of the sg item after dma_map_sg.

Example Crash :
> [drm:amdgpu_ttm_backend_bind [amdgpu]] *ERROR* failed to pin userptr

This happens in OpenCL applications, causing them to crash or hang, on either amdgpu-pro or ROCm OpenCL implementations

I have verified this fixes the above on kernel 5.5 and 5.6rc using an AMD Vega 64 GPU.
https://lkml.org/lkml/2020/3/25/204
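
To make the two shortfalls in the quoted description a bit more concrete, here is a toy userspace model in Python. It is not kernel code and not the actual patch: `Entry`, `toy_dma_map_sg` and the two walker functions are hypothetical names, and the "merge everything into one segment" behaviour is a deliberate simplification of what an IOMMU-backed dma_map_sg may do. It only illustrates why walking all original entries with the CPU-side length goes wrong once entries have been combined.

```python
from dataclasses import dataclass

PAGE = 4096


@dataclass
class Entry:
    length: int            # CPU-side length, analogous to sg->length
    dma_address: int = 0   # filled in by mapping, analogous to sg_dma_address()
    dma_length: int = 0    # filled in by mapping, analogous to sg_dma_len()


def toy_dma_map_sg(table, iova_base):
    """Toy stand-in for dma_map_sg (assumption: merges all entries into one)."""
    total = sum(e.length for e in table)
    table[0].dma_address, table[0].dma_length = iova_base, total
    for e in table[1:]:
        # Leftover entries carry no valid DMA information after the merge.
        e.dma_address, e.dma_length = 0, 0
    return 1  # number of mapped segments, fewer than len(table)


def pages_assuming_equal_lengths(table):
    """Buggy walk: visits every original entry and trusts the CPU-side length."""
    pages = []
    for e in table:
        pages.extend(e.dma_address + off for off in range(0, e.length, PAGE))
    return pages


def pages_using_dma_len(table, nents):
    """Fixed walk: visits only the mapped entries and uses the DMA length."""
    pages = []
    for e in table[:nents]:
        pages.extend(e.dma_address + off for off in range(0, e.dma_length, PAGE))
    return pages


table = [Entry(PAGE), Entry(PAGE), Entry(PAGE)]
nents = toy_dma_map_sg(table, iova_base=0x100000)
print([hex(a) for a in pages_assuming_equal_lengths(table)])
print([hex(a) for a in pages_using_dma_len(table, nents)])
```

In this model the first walk produces bogus addresses for the second and third page, while the second walk yields three consecutive page addresses starting at the mapped base.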

Copyright © 2020 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.