OpenCL crashes the GPU (AMD MI50)

Hello,

I am trying to get OpenCL running on my FreeBSD 15 PC. I have two AMD MI50 GPUs (Vega 20, Radeon Pro VII BIOS). They work great running llama.cpp with the Vulkan backend for general LLM use, so I know the hardware works.
I would also like to run some OpenCL stuff on them. This PC is only used headless; I do not use the video output.
I installed clover and the opencl-headers and compiled a few examples from https://github.com/rsnemmen/OpenCL-examples, but every one I try to run just hangs (the process becomes unkillable). FreeBSD keeps running normally, but those processes stay stuck.

/var/log/messages shows these errors when I start the test OpenCL program:
Code:
Mar 20 23:27:31 bigboss kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.2.0 timeout, signaled seq=2, emitted seq=3
Mar 20 23:27:31 bigboss kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 0 thread  pid 0
Mar 20 23:27:31 bigboss kernel: drmn0: GPU reset begin!
Mar 20 23:27:31 bigboss kernel: [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
Mar 20 23:27:31 bigboss kernel: drmn0: BACO reset
Mar 20 23:27:33 bigboss kernel: drmn0: GPU reset succeeded, trying to resume
Mar 20 23:27:33 bigboss kernel: [drm] PCIE GART of 512M enabled.
Mar 20 23:27:33 bigboss kernel: [drm] PTB located at 0x0000008000000000
Mar 20 23:27:33 bigboss kernel: [drm] VRAM is lost due to GPU reset!
Mar 20 23:27:33 bigboss kernel: [drm] PSP is resuming...
Mar 20 23:27:33 bigboss kernel: [drm] reserve 0x400000 from 0x83fec00000 for PSP TMR
Mar 20 23:27:33 bigboss kernel: drmn0: HDCP: optional hdcp ta ucode is not available
Mar 20 23:27:33 bigboss kernel: drmn0: DTM: optional dtm ta ucode is not available
Mar 20 23:27:33 bigboss kernel: drmn0: RAP: optional rap ta ucode is not available
Mar 20 23:27:33 bigboss kernel: drmn0: SECUREDISPLAY: securedisplay ta ucode is not available
Mar 20 23:27:33 bigboss kernel: [drm] kiq ring mec 2 pipe 1 q 0
Mar 20 23:27:33 bigboss kernel: [drm] UVD and UVD ENC initialized successfully.
Mar 20 23:27:33 bigboss kernel: [drm] VCE initialized successfully.
Mar 20 23:27:33 bigboss kernel: drmn0: ring gfx uses VM inv eng 0 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring gfx_low uses VM inv eng 1 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring gfx_high uses VM inv eng 4 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.0.0 uses VM inv eng 5 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.1.0 uses VM inv eng 6 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.2.0 uses VM inv eng 7 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.3.0 uses VM inv eng 8 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.0.1 uses VM inv eng 9 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.1.1 uses VM inv eng 10 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.2.1 uses VM inv eng 11 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring comp_1.3.1 uses VM inv eng 12 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
Mar 20 23:27:33 bigboss kernel: drmn0: ring sdma0 uses VM inv eng 0 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring page0 uses VM inv eng 1 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring sdma1 uses VM inv eng 4 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring page1 uses VM inv eng 5 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_0 uses VM inv eng 6 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_1 uses VM inv eng 9 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring vce0 uses VM inv eng 12 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring vce1 uses VM inv eng 13 on hub 8
Mar 20 23:27:33 bigboss kernel: drmn0: ring vce2 uses VM inv eng 14 on hub 8
Mar 20 23:27:34 bigboss kernel: drmn0: recover vram bo from shadow start
Mar 20 23:27:34 bigboss kernel: drmn0: recover vram bo from shadow done
Mar 20 23:27:34 bigboss kernel: drmn0: GPU reset(1) succeeded!
Mar 20 23:28:34 bigboss kernel: [drm ERROR :amdgpu_job_timedout] ring comp_1.2.0 timeout, signaled seq=4, emitted seq=4
Mar 20 23:28:34 bigboss kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 0 thread  pid 0
Mar 20 23:28:34 bigboss kernel: drmn0: GPU reset begin!
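
For anyone reading a log like this, a quick way to see which ring keeps timing out is to count the "ring ... timeout" lines per ring name. The following is only a minimal sketch: the function name summarize_timeouts is made up for illustration, and it assumes the message format shown in the log above.

```sh
# Hypothetical helper: count amdgpu job timeouts per ring in a log file.
summarize_timeouts() {
    # $1: log file to scan (defaults to /var/log/messages)
    log="${1:-/var/log/messages}"
    # Keep only the "ring <name> timeout" lines, print the ring name,
    # then count how often each ring appears.
    sed -n 's/.*amdgpu_job_timedout] ring \([^ ]*\) timeout.*/\1/p' "$log" | sort | uniq -c
}

# Run against the system log only if it is readable.
if [ -r /var/log/messages ]; then
    summarize_timeouts /var/log/messages
fi
```

In the excerpt above, both timeouts are on comp_1.2.0, a compute ring, which fits an OpenCL workload rather than graphics.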

Any ideas? Pointers on what should be happening here and where to look to debug this would also help. I once (maybe 18 years ago, haha) wrote kernel modules for a virtualized sound card, and more recently (10 years ago :-) ) worked on some OpenGL drivers for macOS, so I am not a total stranger to kernel debugging. Still, it would help to have some idea of what is going on, where to look, and how deep the rabbit hole goes before I embark on this.

thanks for any and all help, cheers
 
OK, so I wrote an even simpler example, and it turns out it hangs in the last call of this snippet:

Code:
    ret = clGetPlatformIDs(ret_num_platforms, platforms, NULL);
    printf("ret at %d is %d\n", __LINE__, ret);

    ret = clGetDeviceIDs( platforms[1], CL_DEVICE_TYPE_ALL, 1,
            &device_id, &ret_num_devices);
    printf("ret at %d is %d\n", __LINE__, ret);
    // Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
    printf("ret at %d is %d\n", __LINE__, ret);

clCreateContext() never returns, and I immediately see the GPU timeout and reset in the logs.
So yeah, it's not the examples...
 
Thanks, I tried that and eventually solved it while I was writing this reply!

OpenCL now works for me with rusticl:

pkg install opencl mesa-devel mesa-gallium-va clinfo

Then define the environment variables:

RUSTICL_ENABLE=radeonsi; export RUSTICL_ENABLE
OCL_ICD_VENDORDIR=/usr/local/etc/OpenCL/vendors; export OCL_ICD_VENDORDIR
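
To check that the setup took effect, something along these lines may help. It is only a sketch: the two variable values are the ones from this post, no particular .icd file name is assumed, and clinfo -l is used just to print the short platform/device tree.

```sh
# Select the radeonsi Gallium driver for rusticl (value from this post).
RUSTICL_ENABLE=radeonsi; export RUSTICL_ENABLE
# Directory the ICD loader searches for vendor .icd files (path from this post).
OCL_ICD_VENDORDIR=/usr/local/etc/OpenCL/vendors; export OCL_ICD_VENDORDIR

# List whichever ICD files are registered; no file name is assumed here.
if [ -d "$OCL_ICD_VENDORDIR" ]; then
    ls "$OCL_ICD_VENDORDIR"
else
    echo "no ICD vendor directory at $OCL_ICD_VENDORDIR"
fi

# Print the platform/device tree, or say so if clinfo is missing or fails.
if command -v clinfo >/dev/null 2>&1; then
    clinfo -l || echo "clinfo failed"
else
    echo "clinfo not installed"
fi
```

If the rusticl platform is listed with no device underneath it, the ICD is being found but the driver is not.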


The long version:
I uninstalled the clover package and installed mesa-devel. It installed a bunch of libraries, a rusticl ICD file, and some rusticl libraries. Vulkan still works (at least as far as running llama.cpp goes).
However, clinfo did not find any devices:

Code:
Number of platforms                               1
  Platform Name                                   rusticl
  Platform Vendor                                 Mesa/X.org
  Platform Version                                OpenCL 3.0
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions with Version                cl_khr_icd                                                       0x800000 (2.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             MESA
  Platform Host timer resolution                  1ns

  Platform Name                                   rusticl
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  rusticl
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No devices found in platform [rusticl?]
  clCreateContext(NULL, ...) [default]            No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.4
  ICD loader Profile                              OpenCL 3.0

As the Mesa docs say for rusticl, I tried setting RUSTICL_ENABLE=radeonsi, but that did not change anything. I thought setting RUSTICL_ENABLE=llvmpipe would at least work and use the CPU, but no, the clinfo output was unchanged.
Then I found there are a few mesa-gallium-* ports, so I looked at their contents. One of them installs radeonsi_drv_video.so (mesa-gallium-va, as in the pkg install line above), so I installed it and now it works!
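
The comparison between driver values described above can be scripted roughly like this. It is a sketch under assumptions: count_devices is a made-up helper name, and it relies on clinfo printing a "Number of devices" line per platform, as in the output earlier in this post.

```sh
# Hypothetical helper: sum the "Number of devices" counts from clinfo output on stdin.
count_devices() {
    awk '/Number of devices/ { total += $NF } END { print total + 0 }'
}

# Try each rusticl driver value in turn and report how many devices show up.
if command -v clinfo >/dev/null 2>&1; then
    for drv in radeonsi llvmpipe; do
        n=$(RUSTICL_ENABLE="$drv" clinfo 2>/dev/null | count_devices)
        echo "RUSTICL_ENABLE=$drv: $n device(s)"
    done
else
    echo "clinfo not installed"
fi
```

Before the extra port was installed, this would have reported 0 devices for both values, matching the clinfo output above.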
 