GPU Crash

Hi,

I have a Radeon 6750 graphics card and am running FreeBSD 15.0-RELEASE. I could use some help looking at a crash that happens reliably when llamacpp is stopped and I try to restart it:
Code:
 kernel: drmn0: [gfxhub] page fault (src_id:0 ring:40 vmid:1 pasid:32777, for process  pid 102859 thread  pid 102859)
 kernel: drmn0:   in page starting at address 0x00008001001f7000 from client 0x1b (UTCL2)
 kernel: drmn0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00140A50
 kernel: drmn0:      Faulty UTCL2 client ID: CPC (0x5)
 kernel: drmn0:      MORE_FAULTS: 0x0
 kernel: drmn0:      WALKER_ERROR: 0x0
 kernel: drmn0:      PERMISSION_FAULTS: 0x5
 kernel: drmn0:      MAPPING_ERROR: 0x0
 kernel: drmn0:      RW: 0x1
 kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.1.0 timeout, signaled seq=5546, emitted seq=5548
 kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 102859 thread  pid 102859
 kernel: drmn0: GPU reset begin!
 kernel: drmn0: MODE1 reset
 kernel: drmn0: GPU mode1 reset
 kernel: drmn0: GPU smu mode1 reset
 kernel: hdac0: Unexpected unsolicited response from address 0: 00000000
 syslogd: last message repeated 7 times
 kernel: drmn0: GPU mode1 reset failed
 kernel: drmn0: ASIC reset failed with error, -60 for drm dev, drmn0
 kernel: drmn0: GPU reset succeeded, trying to resume
 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
 kernel: [drm] VRAM is lost due to GPU reset!
 kernel: [drm] PSP is resuming...
 kernel: [drm ERROR :psp_hw_start] PSP create ring failed!
 kernel: [drm ERROR :psp_resume] PSP resume failed
 kernel: [drm ERROR :amdgpu_device_fw_loading] resume of IP block <psp> failed -60
 kernel: drmn0: GPU reset(1) failed
 kernel: drmn0: GPU reset end with ret = -60
 kernel: [drm ERROR :amdgpu_job_timedout] GPU Recovery Failed: -60
 kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.1.0 timeout, signaled seq=5548, emitted seq=5548
 kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 102859 thread  pid 102859
 kernel: drmn0: GPU reset begin!
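
For anyone reading the fault lines above, here is a minimal sketch that splits the raw GCVM_L2_PROTECTION_FAULT_STATUS value into its fields. The bit layout is an assumption taken from the upstream amdgpu gfxhub register headers for this GPU generation, and the names `FIELDS` and `decode_fault_status` are made up for illustration. With that layout, 0x00140A50 decodes to the same values the kernel printed (client ID 0x5, PERMISSION_FAULTS 0x5, RW 0x1, vmid 1).

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS:
# (field name, shift, width). Verify against the driver headers
# for your ASIC before relying on it.
FIELDS = (
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),
    ("RW", 18, 1),
    ("VMID", 20, 4),
)


def decode_fault_status(status):
    """Return a dict mapping each field name to its value in `status`."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # Value copied from the log in this thread.
    for name, value in decode_fault_status(0x00140A50).items():
        print(f"{name}: {value:#x}")
```

RW 0x1 would usually mean the faulting access was a write, but treat that reading as an assumption as well.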
I tried playing with the following, which didn't seem to help at all:
Code:
hw.amdgpu.vm_fault_stop="1"
hw.amdgpu.lockup_timeout="10000,10000,10000,10000"
hw.amdgpu.bad_page_threshold="-1"
hw.amdgpu.reset_method="2"
hw.amdgpu.enforce_isolation="1"
hw.amdgpu.runpm="0"
hw.amdgpu.timeout_fatal_disable="1"
hw.amdgpu.sched_hw_submission="1"
The reset method never changed with these, so apparently the card decides it. llamacpp worked without any issue for some time, and now I get this behavior; I have no clue what changed. I notice it whenever llama-server stops, for example when I stop it to change models or when the service crashes. I run llamacpp with Vulkan. I have tried many different versions of llamacpp, including 8182 from ports, and all behave the same. Only a reboot seems to help.
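
One possible reason the reset method did not appear to change: in the upstream Linux amdgpu parameter description, reset_method value 2 already means mode1, which is what the log shows. The sketch below parses tunables in the format posted above and names the selected method. The mapping in `RESET_METHODS` is an assumption copied from the upstream description and may differ in the drm-kmod version in use; `parse_tunables` and `describe_reset_method` are hypothetical helper names.

```python
import re

# Assumed mapping from the upstream Linux amdgpu "reset_method"
# parameter description; check it against your drm-kmod version.
RESET_METHODS = {
    -1: "auto",
    0: "legacy",
    1: "mode0",
    2: "mode1",
    3: "mode2",
    4: "baco",
}

# Matches lines of the form: name="value"
TUNABLE_RE = re.compile(r'^\s*([\w.]+)\s*=\s*"([^"]*)"\s*$')


def parse_tunables(text):
    """Parse name="value" lines into a dict; other lines are ignored."""
    tunables = {}
    for line in text.splitlines():
        match = TUNABLE_RE.match(line)
        if match:
            tunables[match.group(1)] = match.group(2)
    return tunables


def describe_reset_method(tunables):
    """Name the configured reset method, or return None if it is unset."""
    raw = tunables.get("hw.amdgpu.reset_method")
    if raw is None:
        return None
    return RESET_METHODS.get(int(raw), "unknown")


if __name__ == "__main__":
    posted = 'hw.amdgpu.reset_method="2"\nhw.amdgpu.runpm="0" \n'
    print(describe_reset_method(parse_tunables(posted)))
```

If that mapping holds for the installed driver, a different value would be needed to see whether another reset method behaves differently.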

What am I missing?
 

Code:
kernel: [drm] PSP is resuming...
kernel: [drm ERROR :psp_hw_start] PSP create ring failed!
kernel: [drm ERROR :psp_resume] PSP resume failed
kernel: [drm ERROR :amdgpu_device_fw_loading] resume of IP block <psp> failed -60
That seems interesting. I thought the PSP was only on CPUs, so why would amdgpu use it? Do you have an AMD CPU?

I'd try different drm-kmod versions (61, latest, etc.).
 
Yeah, I do have an AMD CPU. I'll give a different drm-kmod a shot and see what happens. Is there something in the BIOS that I do not have set right?
 
I had mmap disabled in llama-server. After I removed that setting, the GPU crash stopped. Unfortunately, llama-cpp (compiled from GitHub) still crashed when running Gemma 4, but not gpt-oss. According to lldb, RADV is crashing while compiling the flash attention SPIR-V shader. This only seems to occur when running Gemma 4 with Vulkan (or running ffmpeg with Vulkan). I will see if drm-latest fixes it. At the moment, it does not appear to be a bug in llama-cpp itself.
 