Solved amdgpu crash: ring gfx timeout and GPU reset

kent_dorfman766 · May 27, 2023

Any thoughts on this? I had previously run startx /usr/local/bin/openbox as root and executed glmark2 without issue. I then logged in as my regular user and created a .xinitrc file containing startup commands for openbox and tint2. When I executed startx the desktop started, but while I was opening and resizing a work xterm the display memory went nuts with repeating artifacts, so I logged in remotely and gathered some error dump info.

Using stable 13.2, XFX Radeon RX580 (not overclocked), on an E5440 Xeon SMP machine with 32 Gbytes of RAM.

I was led to believe that my configuration would be pretty stable, albeit slow, with a Radeon GPU.

Code:

dmesg portion:
[drm ERROR :amdgpu_job_timedout] ring gfx timeout, signaled seq=633, emitted seq=635
[drm ERROR :amdgpu_job_timedout] Process information: process  pid 100462 thread  pid 100462
drmn0: GPU reset begin!
amdgpu: cp is busy, skip halt cp
amdgpu: rlc is busy, skip halt rlc
drmn0: BACO reset
drmn0: GPU reset succeeded, trying to resume
[drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[drm] VRAM is lost due to GPU reset!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :amdgpu_cs_ioctl] Failed to initialize parser -85!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
[drm ERROR :uvd_v6_0_start] UVD not responding, giving up!!!
[drm ERROR :amdgpu_device_ip_set_powergating_state] set_powergating_state of IP block <uvd_v6_0> failed -1
drmn0: [drm] *ERROR* [drm ERROR :amdgpu_device_ip_resume_phase2] resume of IP block <uvd_v6_0> failed -60
drmn0: GPU reset(2) failed
drmn0: GPU reset end with ret = -60
[drm ERROR :amdgpu_job_timedout] ring gfx timeout, but soft recovered
[drm ERROR :amdgpu_job_timedout] ring gfx timeout, but soft recovered

Code:

messages portion:
May 26 23:32:43 greybox kernel: [drm ERROR :amdgpu_job_timedout] ring gfx timeout, signaled seq=633, emitted seq=635
May 26 23:32:43 greybox kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 100462 thread  pid 100462
May 26 23:32:43 greybox kernel: drmn0: GPU reset begin!
May 26 23:32:44 greybox kernel: amdgpu: cp is busy, skip halt cp
May 26 23:32:44 greybox kernel: amdgpu: rlc is busy, skip halt rlc
May 26 23:32:44 greybox kernel: drmn0: BACO reset
May 26 23:32:44 greybox kernel: drmn0: GPU reset succeeded, trying to resume
May 26 23:32:44 greybox kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
May 26 23:32:44 greybox kernel: [drm] VRAM is lost due to GPU reset!
May 26 23:32:45 greybox kernel: [drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
May 26 23:32:51 greybox syslogd: last message repeated 5 times
May 26 23:32:51 greybox devd[734]: notify_clients: send() failed; dropping unresponsive client
May 26 23:32:51 greybox kernel: [drm ERROR :amdgpu_cs_ioctl] Failed to initialize parser -85!
May 26 23:32:51 greybox kernel: [drm ERROR :uvd_v6_0_start] UVD not responding, trying to reset the VCPU!!!
May 26 23:32:54 greybox syslogd: last message repeated 3 times
May 26 23:32:54 greybox kernel: [drm ERROR :uvd_v6_0_start] UVD not responding, giving up!!!
May 26 23:32:54 greybox kernel: [drm ERROR :amdgpu_device_ip_set_powergating_state] set_powergating_state of IP block <uvd_v6_0> failed -1
May 26 23:32:55 greybox kernel: drmn0: [drm] *ERROR* [drm ERROR :amdgpu_device_ip_resume_phase2] resume of IP block <uvd_v6_0> failed -60
May 26 23:32:55 greybox kernel: drmn0: GPU reset(2) failed
May 26 23:32:55 greybox kernel: drmn0: GPU reset end with ret = -60
May 26 23:33:05 greybox kernel: [drm ERROR :amdgpu_job_timedout] ring gfx timeout, but soft recovered
May 26 23:33:15 greybox syslogd: last message repeated 1 times
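
For anyone comparing their own logs against the dump above, here is a minimal sketch of how the relevant lines could be pulled out. The function names are made up for illustration, and the patterns are taken only from the messages quoted in this post, so other driver versions may word things differently.

```sh
#!/bin/sh
# Print only the lines that mark a gfx ring timeout or a GPU reset step.
# Reads log text on stdin. Patterns come from the messages quoted in this post.
gpu_reset_events() {
    grep -E 'ring gfx timeout|GPU reset|VRAM is lost|resume of IP block'
}

# Count the lines that report a gfx ring timeout.
# Note: syslogd collapses repeats into "last message repeated N times",
# so a count taken from /var/log/messages can be lower than one from dmesg.
count_ring_timeouts() {
    grep -c 'ring gfx timeout'
}

# Example (run on the affected machine):
#   dmesg | gpu_reset_events
#   count_ring_timeouts < /var/log/messages
```

Running both over the same time window is a quick way to see whether a timeout was followed by a full reset attempt or only by a soft recovery.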

FWIW, I have not created an xorg.conf file and am using the defaults that came with installing the necessary drivers. Xorg.0.log is pretty uneventful and does indicate that amdgpu is in use on the RX580 with the Polaris10 firmware.

kent_dorfman766 · May 27, 2023

Anecdotally, the crashes may be related to uid. I've been running an X session as root for the past nine-plus hours with no hiccups while exercising the GPU with OpenGL visuals. IOW, why might the X server/drivers work under root, but barf for a non-root user? Coming from that other FOSS ecosystem, I have to ask: FreeBSD X11 doesn't require any special group membership to access the sound or video, yes? Obviously that's not an issue when running as root.

Unfortunately, since it's been 25 years since my BSDi days, my perspective is going to be somewhat skewed toward how things are done under GNU.

LibreQuest · May 27, 2023

pw groupmod video -m user

Chapter 5. The X Window System

This chapter describes how to install and configure Xorg on FreeBSD, which provides the open source X Window System used to provide a graphical environment

docs.freebsd.org

kent_dorfman766 · May 27, 2023

LibreQuest said:
pw groupmod video -m user

Appears I already am a member of the video group.
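
A minimal sketch of how that membership can be checked from a shell, for anyone following along. The helper names are hypothetical, and the /dev/dri path is an assumption about where the DRM device nodes show up on a given system.

```sh
#!/bin/sh
# Succeed if USER belongs to GROUP (primary or supplementary).
# id -Gn prints every group name for the user, separated by spaces.
in_group() {
    user=$1
    group=$2
    id -Gn "$user" | tr ' ' '\n' | grep -qx "$group"
}

# Report video group membership, then list the DRM device nodes if present.
# /dev/dri is an assumed location; adjust it if your nodes live elsewhere.
check_video_access() {
    user=${1:-$(id -un)}
    if in_group "$user" video; then
        echo "$user is in the video group"
    else
        echo "$user is NOT in the video group"
    fi
    if [ -d /dev/dri ]; then
        ls -lL /dev/dri
    fi
}

# Example:
#   check_video_access kent
```

Keep in mind that a group added with pw only takes effect for new login sessions, so the check is best run from a fresh login.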

LibreQuest · May 27, 2023

That's good news. I'm all spent on ideas in this case. I'll leave it to the experienced users. Best wishes on your system.

kent_dorfman766 · May 27, 2023

Working theory is that the taskbar program tint2 does something the GPU doesn't like, but only after I configured the taskbar options. Deleting its config file seems to have stopped the crashes. Not that it should "be able to" crash/reset the GPU, but that's the theory right now.
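
For anyone repeating this test, moving the config aside instead of deleting it keeps the suspect options around for a bug report. This is only a sketch: set_aside is a made-up helper, and the tint2rc path in the example is an assumption based on tint2's usual per-user location, since the actual path isn't stated here.

```sh
#!/bin/sh
# Rename a config file to <name>.bak.<timestamp> so it can be restored later.
# Prints the backup path on success; returns 1 if the file does not exist.
set_aside() {
    file=$1
    [ -f "$file" ] || return 1
    backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    mv "$file" "$backup" && echo "$backup"
}

# Example, assuming the usual per-user tint2 config location:
#   set_aside "$HOME/.config/tint2/tint2rc"
```

Restarting the X session afterwards lets tint2 come up with its defaults, and the saved copy can be diffed against them to narrow down which option triggers the problem.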

LibreQuest · May 27, 2023

That's an interesting find. That should make a good bug report.

kent_dorfman766 · May 29, 2023

An incompatibility was found between amdgpu and the openbox/tint2 combination. The driver didn't like some of the decoration enhancement options.

mfoacs · Mar 4, 2024

Is there a way to recover your session (or start a new one) from a hung GPU? I mean, without forcing a reboot?

kent_dorfman766 · Mar 22, 2024

mfoacs said:
Is there a way to recover your session (or start a new one) from a hung GPU? I mean, without forcing a reboot?

This is an old thread but I'll answer anyways.

After rereading my OP, the answer is a qualified maybe... I'd rather be safe than sorry where kernel errors/warnings are concerned.

mfoacs · Mar 24, 2024

Thanks for your answer, nevertheless.
For me, a recent DRM update from the "latest" package tree has solved the issue.
UEFI boot is also problematic.

GPU is 'Navi 23 [Radeon RX 6600/6600 XT/6600M]'
14.0-RELEASE-p5
drm-515-kmod-5.15.118_4
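
A rough sketch of how version details like these could be collected for a report. The function name is made up, and the '^drm-' pattern is an assumption about how the DRM packages are named on a given install.

```sh
#!/bin/sh
# Print kernel and userland versions plus installed DRM packages.
# Each step is skipped quietly on systems that lack the command.
gpu_stack_versions() {
    if command -v freebsd-version >/dev/null 2>&1; then
        # -k: installed kernel version, -u: installed userland version
        freebsd-version -k -u
    fi
    if command -v pkg >/dev/null 2>&1; then
        # -x treats the argument as a regular expression on package names
        pkg info -x '^drm-' 2>/dev/null
    fi
    return 0
}

# Example:
#   gpu_stack_versions
```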

Like you said, it's an old thread, but it might still help someone searching for answers.