GPU Crash

tjohnson · Apr 21, 2026

Hi,

I have a Radeon 6750 graphics card and am running 15.0-RELEASE. I could use some help looking at a crash that happens reliably when llamacpp is stopped and I try to restart it:

Code:

 kernel: drmn0: [gfxhub] page fault (src_id:0 ring:40 vmid:1 pasid:32777, for process  pid 102859 thread  pid 102859)
 kernel: drmn0:   in page starting at address 0x00008001001f7000 from client 0x1b (UTCL2)
 kernel: drmn0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00140A50
 kernel: drmn0:      Faulty UTCL2 client ID: CPC (0x5)
 kernel: drmn0:      MORE_FAULTS: 0x0
 kernel: drmn0:      WALKER_ERROR: 0x0
 kernel: drmn0:      PERMISSION_FAULTS: 0x5
 kernel: drmn0:      MAPPING_ERROR: 0x0
 kernel: drmn0:      RW: 0x1
 kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.1.0 timeout, signaled seq=5546, emitted seq=5548
 kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 102859 thread  pid 102859
 kernel: drmn0: GPU reset begin!
 kernel: drmn0: MODE1 reset
 kernel: drmn0: GPU mode1 reset
 kernel: drmn0: GPU smu mode1 reset
 kernel: hdac0: Unexpected unsolicited response from address 0: 00000000
 syslogd: last message repeated 7 times
 kernel: drmn0: GPU mode1 reset failed
 kernel: drmn0: ASIC reset failed with error, -60 for drm dev, drmn0
 kernel: drmn0: GPU reset succeeded, trying to resume
 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
 kernel: [drm] VRAM is lost due to GPU reset!
 kernel: [drm] PSP is resuming...
 kernel: [drm ERROR :psp_hw_start] PSP create ring failed!
 kernel: [drm ERROR :psp_resume] PSP resume failed
 kernel: [drm ERROR :amdgpu_device_fw_loading] resume of IP block <psp> failed -60
 kernel: drmn0: GPU reset(1) failed
 kernel: drmn0: GPU reset end with ret = -60
 kernel: [drm ERROR :amdgpu_job_timedout] GPU Recovery Failed: -60
 kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.1.0 timeout, signaled seq=5548, emitted seq=5548
 kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 102859 thread  pid 102859
 kernel: drmn0: GPU reset begin!
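
As a reading aid for the fault lines in the log above, here is a minimal Python sketch that splits the raw GCVM_L2_PROTECTION_FAULT_STATUS value into its fields. The bit positions are an assumption (they are not stated anywhere in this thread); they were chosen because they reproduce the decoded values the kernel itself printed (client ID 0x5, PERMISSION_FAULTS 0x5, RW 0x1). The names `FIELDS` and `decode_fault_status` are made up for illustration.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS: (shift, width) per field.
# This layout is an assumption, cross-checked only against the values in the log.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # UTCL2 client ID
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each field by shifting it down and masking to its width."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the log above.
    for name, value in decode_fault_status(0x00140A50).items():
        print(f"{name}: {value:#x}")
```

If the assumed layout is right, the same helper can be pointed at the status word from any later fault to compare it against this one.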

I tried playing with the following which didn't seem to help at all:

Code:

hw.amdgpu.vm_fault_stop="1"
hw.amdgpu.lockup_timeout="10000,10000,10000,10000"
hw.amdgpu.bad_page_threshold="-1"
hw.amdgpu.reset_method="2"
hw.amdgpu.enforce_isolation="1"
hw.amdgpu.runpm="0"
hw.amdgpu.timeout_fatal_disable="1"
hw.amdgpu.sched_hw_submission="1"

The reset method never changed with these, so apparently the card decides it. llamacpp seemed to work without any issue for some time, and now I have this behavior. I have no clue what changed. I notice it when I stop llama-server to change models, when the service crashes, etc. I run llamacpp with Vulkan. I have tried many different versions of llamacpp, including 8182 in ports, and all have the same behavior. Only a reboot seems to help.
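
One possible reason the reset method looks unchanged: in the upstream Linux amdgpu driver, the `reset_method` parameter is documented with 2 meaning mode1, which is the method the log already shows. The small Python lookup below is only a sketch of that mapping. The values are an assumption taken from the upstream parameter description, and the FreeBSD drm-kmod port may differ. `RESET_METHODS` and `describe_reset_method` are hypothetical names.

```python
# Assumption: values as described for the upstream Linux amdgpu
# "reset_method" module parameter; not confirmed for FreeBSD drm-kmod.
RESET_METHODS = {
    -1: "auto (default)",
    0: "legacy",
    1: "mode0",
    2: "mode1",
    3: "mode2",
    4: "baco",
}


def describe_reset_method(value: int) -> str:
    """Return the reset method name for a tunable value, or 'unknown'."""
    return RESET_METHODS.get(value, "unknown")


if __name__ == "__main__":
    # The value set in the tunables above.
    print(describe_reset_method(2))
```

Under that assumption, setting the value to 2 would request the same MODE1 reset that appears in the log, so no visible change would be expected.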

What am I missing?

Espionage724 · Apr 21, 2026

tjohnson said:
kernel: [drm] PSP is resuming...
kernel: [drm ERROR :psp_hw_start] PSP create ring failed!
kernel: [drm ERROR :psp_resume] PSP resume failed
kernel: [drm ERROR :amdgpu_device_fw_loading] resume of IP block <psp> failed -60

That seems interesting; I thought PSP was only on CPUs, so why would AMDGPU use it? Do you have an AMD CPU?

I'd try different drm-kmod versions (61, latest, etc.)

tjohnson · Apr 22, 2026

Yeah, I do have an AMD CPU. I'll give a different drm-kmod a shot and see what happens. Is there something in the BIOS that I do not have set right?

tjohnson · Apr 23, 2026

I was not allowing mmap in llama-server. After I removed that restriction, I stopped getting the GPU crash. Unfortunately, llama-cpp (compiled from GitHub) still crashed when running Gemma 4, but not gpt-oss. Using lldb, it seems that RADV is crashing while compiling the flash attention SPIR-V shader. This seems to occur only when running Gemma 4 with Vulkan (or running ffmpeg with Vulkan). I will see if drm-latest fixes it. At the moment, it does not appear to be a bug within llama-cpp itself.

tjohnson · Apr 24, 2026

Unfortunately, moving to drm-latest did not fix the issue. Perhaps the recently released version of mesa-devel will. For the time being, I have disabled flash attention as a workaround with Gemma 4.

tjohnson · Apr 26, 2026

Upgrading mesa-devel to 26.1.b.305 fixed the flash attention issue. The GPU crash still shouldn't happen, but the latest version of Mesa keeps the system from reaching that path.
