AMDGPU potential Conflict or Hardware Issue?

taiwan740 · Nov 21, 2023

Hi

For a while I've used a really old 2006 Mac Pro as my main FreeBSD desktop machine. It's sluggish, but worked fine.

I bought a more up-to-date one with more cores and more memory (a 2008 model, right on the bleeding edge...) and decided to "port" my FreeBSD instance over to that, by installing the SSD and graphics card in the newer unit.

Now, the system wouldn't crash, or panic, but the display would lock up. Basically it became unresponsive via the console, but I could ssh onto the machine and control spotifyd using another device and it would play music.

dmesg | grep drm gives me this

Code:

[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE468 0xC7).
drmn0: Trusted Memory Zone (TMZ) feature not supported
[drm] register mmio base: 0x90500000
[drm] register mmio size: 262144
[drm] add ip block number 0 <vi_common>
[drm] add ip block number 1 <gmc_v8_0>
[drm] add ip block number 2 <tonga_ih>
[drm] add ip block number 3 <gfx_v8_0>
[drm] add ip block number 4 <sdma_v3_0>
[drm] add ip block number 5 <powerplay>
[drm] add ip block number 6 <dm>
[drm] add ip block number 7 <uvd_v6_0>
[drm] add ip block number 8 <vce_v3_0>
drmn0: Fetched VBIOS from VFCT
[drm] UVD is enabled in VM mode
[drm] UVD ENC is enabled in VM mode
[drm] VCE enabled in VM mode
[drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mc.bin'
drmn0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
drmn0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[drm] Detected VRAM RAM=2048M, BAR=256M
[drm] RAM width 64bits GDDR5
[drm] amdgpu: 2048M of VRAM memory ready
[drm] amdgpu: 3072M of GTT memory ready.
[drm] GART: num cpu pages 65536, num gpu pages 65536
[drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
drmn0: successfully loaded firmware image 'amdgpu/polaris12_pfp_2.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_me_2.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_ce_2.bin'
[drm] Chained IB support enabled!
drmn0: successfully loaded firmware image 'amdgpu/polaris12_rlc.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec_2.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec2_2.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma1.bin'
drmn0: successfully loaded firmware image 'amdgpu/polaris12_uvd.bin'
[drm] Found UVD firmware Version: 1.130 Family ID: 16
drmn0: successfully loaded firmware image 'amdgpu/polaris12_vce.bin'
[drm] Found VCE firmware Version: 53.26 Binary ID: 3
drmn0: successfully loaded firmware image 'amdgpu/polaris12_smc.bin'
[drm] Display Core initialized with v3.2.149!
lkpi_iic0: <LinuxKPI I2C> on drmn0
lkpi_iic1: <LinuxKPI I2C> on drmn0
lkpi_iic2: <LinuxKPI I2C> on drmn0
[drm] UVD and UVD ENC initialized successfully.
[drm] VCE initialized successfully.
drmn0: SE 2, SH per SE 1, CU per SH 5, active_cu_number 8
[drm] fb mappable at 0x80E30000
[drm] vram apper at 0x80000000
[drm] size 8294400
[drm] fb depth is 24
[drm]    pitch is 7680
name=drmn0 flags=0x0 stride=7680 bpp=32
vgapci0: child drmn0 requested pci_get_powerstate
drmn0: Using BACO for runtime pm
lkpi_iic3: <LinuxKPI I2C> on drm1
[drm] Initialized amdgpu 3.42.0 20150101 for drmn0 on minor 0
drmn0: [drm] *ERROR* 
drmn0: [drm] *ERROR* 
drmn0: [drm] *ERROR* 
drmn0: [drm] *ERROR* [drm ERROR :amdgpu_device_delayed_init_work_handler] ib ring test failed (-60).
drmn0: Disabling VM faults because of PRT request!
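
A minimal sketch for pulling the non-empty failure lines out of a dump like the one above and decoding the number in parentheses. It assumes the driver's negative return codes follow FreeBSD's errno table, where 60 is ETIMEDOUT; the lookup table and the function name `summarize_drm_errors` are made up for this illustration and are not part of any existing tool.

```python
import re

# FreeBSD errno values that show up in this thread.
# Assumption: the driver's negative return codes map onto this table.
FREEBSD_ERRNO = {17: "EEXIST", 35: "EAGAIN", 60: "ETIMEDOUT"}


def summarize_drm_errors(dmesg_text):
    """Return (message, errno_name) for every non-empty *ERROR* line."""
    results = []
    for line in dmesg_text.splitlines():
        _, marker, message = line.partition("*ERROR*")
        message = message.strip()
        if not marker or not message:
            continue  # not an error line, or an error line with no text
        name = None
        match = re.search(r"\((-?\d+)\)", message)
        if match:
            code = abs(int(match.group(1)))
            name = FREEBSD_ERRNO.get(code, f"errno {code}")
        results.append((message, name))
    return results


# Three lines copied from the dmesg output above.
sample = "\n".join([
    "drmn0: [drm] *ERROR*",
    "drmn0: [drm] *ERROR* [drm ERROR :amdgpu_device_delayed_init_work_handler] ib ring test failed (-60).",
    "drmn0: Disabling VM faults because of PRT request!",
])

for message, name in summarize_drm_errors(sample):
    print(name, "|", message)
```

Read this way, the "ib ring test failed (-60)" line says the ring test timed out rather than returning a wrong result, which fits a GPU that stops responding.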

Additionally, running Chromium fails and gives me this:-

Code:

[31648:203497472:1121/212318.054139:ERROR:process_singleton_posix.cc(458)] readlink failed: Resource temporarily unavailable (35)
[31648:203497472:1121/212318.054178:ERROR:process_singleton_lock_posix.cc(20)] readlink(~/.config/chromium/SingletonLock) failed: Resource temporarily unavailable (35)
[31648:203497472:1121/212318.054220:ERROR:process_singleton_posix.cc(310)] readlink(~/.config/chromium/SingletonLock) failed: File exists (17)
[31648:203497472:1121/212318.054230:ERROR:process_singleton_posix.cc(334)] Failed to create ~/.config/chromium/SingletonLock: File exists (17)
[31648:203497472:1121/212318.054245:ERROR:process_singleton_posix.cc(458)] readlink failed: File exists (17)
[31648:203497472:1121/212318.054256:ERROR:process_singleton_lock_posix.cc(20)] readlink(~/.config/chromium/SingletonLock) failed: File exists (17)
[31648:203497472:1121/212318.054296:ERROR:chrome_browser_main.cc(1448)] Failed to create a ProcessSingleton for your profile directory. This means that running multiple instances would start multiple browser processes rather than opening a new window in the existing process. Aborting now to avoid profile corruption.
amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description.
If they do, bad things may happen!
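
A hedged sketch for inspecting the lock that the errors above complain about. It assumes `SingletonLock` is the usual Chromium symlink whose target has the form `hostname-pid`, and it only reports what it finds; it does not delete anything. The helper names are invented for this example.

```python
import os
from pathlib import Path


def parse_lock_target(target):
    """Split a 'hostname-pid' lock target; hostnames may contain hyphens."""
    host, sep, pid = target.rpartition("-")
    if not sep or not pid.isdigit():
        raise ValueError(f"unexpected lock target: {target!r}")
    return host, int(pid)


def pid_alive(pid):
    """True if a process with this PID exists on the local machine."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def describe_lock(profile_dir):
    """Describe the SingletonLock symlink in a Chromium profile directory."""
    lock = Path(profile_dir) / "SingletonLock"
    if not lock.is_symlink():
        return "no lock symlink present"
    host, pid = parse_lock_target(os.readlink(lock))
    state = "running" if pid_alive(pid) else "not running"
    return f"lock held by pid {pid} on host {host!r} ({state})"


if __name__ == "__main__":
    print(describe_lock(Path.home() / ".config" / "chromium"))
```

If the reported PID is not running, the lock is probably stale, for example left behind by a session that died when the display locked up.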

Anyone know what this means?

The firmware and drivers are all up to date for this particular installation, and the kernel is too. In my best Britney Spears impression, I must confess (I still believe...) that I'm running FreeBSD 15 on this particular instance. I don't *think* that is a factor, but I've been wrong once or twice in the past. Sometimes it runs the fallback driver, other times it runs fine for hours, but most of the time it uses the fallback driver and then locks up. Putting the SSD and graphics card back into the original 2006 machine "resolves" the issue.

My instinct suggests that this is either:-
A) I have some kind of configuration set up which is doing something weird with the drmn0 device. The 2006 machine is so old that the configuration fails to run there, so it never causes the conflict and there is "no issue"; on the slightly more up-to-date hardware the configuration does run and creates the conflict. (zebra diagnosis)
B) I've broken the graphics card with my meaty pig hands.

taiwan740 · Nov 22, 2023

Some additional information:-

Running dmesg while ssh'd onto the machine when it's in its non-responsive state outputs this

Code:

amdgpu: 
 failed to send message 148 ret is 0 
amdgpu: 
 last message was failed ret is 0
amdgpu: 
 failed to send message 145 ret is 0 
amdgpu: 
 last message was failed ret is 0
amdgpu: 
 failed to send message 146 ret is 0
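
A small sketch for tallying which message IDs fail in output like this, so that repeated captures can be compared. The IDs are kept as opaque strings because it is not clear from the log whether they are decimal or hexadecimal; the function name is hypothetical.

```python
import re
from collections import Counter

# The driver prints "amdgpu:" and the message on separate lines,
# so the pattern is matched against the whole text, not per line.
MSG_RE = re.compile(r"failed to send message (\w+) ret is (\d+)")


def count_failed_messages(dmesg_text):
    """Count occurrences of each failed message ID (kept as a string)."""
    return Counter(match.group(1) for match in MSG_RE.finditer(dmesg_text))


# Text copied from the dmesg output above.
sample = """amdgpu:
 failed to send message 148 ret is 0
amdgpu:
 last message was failed ret is 0
amdgpu:
 failed to send message 145 ret is 0
amdgpu:
 last message was failed ret is 0
amdgpu:
 failed to send message 146 ret is 0
"""

for message_id, count in sorted(count_failed_messages(sample).items()):
    print(message_id, count)
```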

It seems to go unresponsive if it's not being used actively, like the screensaver is trying to activate. Generally, screensavers are the first thing I deactivate, I usually just let 'em cook, but maybe it's unhappy with the no output scenario?

I'll try rebuilding drm-515-kmod... again...

taiwan740 · Nov 22, 2023

taiwan740 said:
I'll try rebuilding drm-515-kmod... again...

Rebuilding didn't fix it.

I think I've found the issue. The PCIe slot I'm trying to use is PCIe 2.0 x16, but the card is designed for PCIe 3.0 x8. Despite PCIe claiming to be "fully forwards- and backwards-compatible", that quite obviously is not true in this case. When I install the graphics card in a PCIe 2.0 x4 slot, everything works fine.

Apart from Chromium, so no loss.

What an irritating steaming pile of garbage.
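
To check the PCIe theory above instead of guessing, the negotiated link can be read from the `link xN(xM) speed A(B)` line that `pciconf -lvc` prints for the card (this usually needs root). The sketch below parses such a line and works out the usable bandwidth. The sample line is illustrative, not captured from this machine, and the function names are invented for the example.

```python
import re

# Usable throughput per lane in MB/s after encoding overhead.
# PCIe 1.x: 2.5 GT/s with 8b/10b; PCIe 2.0: 5 GT/s with 8b/10b;
# PCIe 3.0: 8 GT/s with 128b/130b.
MB_PER_LANE = {
    "2.5": 2500 * 8 / 10 / 8,
    "5.0": 5000 * 8 / 10 / 8,
    "8.0": 8000 * 128 / 130 / 8,
}

# pciconf prints current values first and the device maximum in parentheses.
LINK_RE = re.compile(r"link x(\d+)\(x(\d+)\) speed ([\d.]+)\(([\d.]+)\)")


def parse_link(pciconf_text):
    """Return the negotiated and maximum link settings, or None."""
    match = LINK_RE.search(pciconf_text)
    if match is None:
        return None
    return {
        "width": int(match.group(1)),
        "max_width": int(match.group(2)),
        "speed": match.group(3),
        "max_speed": match.group(4),
    }


def bandwidth_mb(width, speed):
    """Approximate one-direction bandwidth in MB/s for a link."""
    return width * MB_PER_LANE[speed]


# Hypothetical output for a PCIe 3.0 x8 card sitting in a PCIe 2.0 slot.
sample = "link x8(x8) speed 5.0(8.0) ASPM disabled(L0s/L1)"

info = parse_link(sample)
print("negotiated:", info["width"], "lanes at", info["speed"], "GT/s")
print("now  MB/s:", round(bandwidth_mb(info["width"], info["speed"])))
print("best MB/s:", round(bandwidth_mb(info["max_width"], info["max_speed"])))
print("x4 slot MB/s:", round(bandwidth_mb(4, "5.0")))
```

If the card reports a narrower width or lower speed than the slot and card should both support, that points at link training (seating, contacts, slot) rather than the driver. A card falling back to PCIe 2.0 speed is expected and by itself should not cause lockups.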

taiwan740 · Nov 22, 2023

Not fixed, error came back.

I've put it back in the 2006 machine, where the graphics card seems to feel most at home. Interestingly, the fan on it now doesn't run, except on start-up.

No clue, parking the issue. Not worth the effort. Got other machines that I can use instead.

bgavin · Nov 22, 2023

My granddaughter had the same monitor lockup problem as described above.
A bit of Googling shows that a large number of other folks are having the same issue.

Her machine is an AMD-based Win11 machine on an ASRock motherboard.
The GPU is an RTX 3060 made by Gigabyte.

All her drivers are the most recent nVidia version.

What appears to have fixed it was reseating, several times, both the RTX card *AND* the 12v aux power connector.
This was a new machine, so I certainly doubted that old crusty connectors were to blame.

I then booted it under the Hiren's boot CD and ran MemTest under that WinPE environment.
The machine passed several days of MemTest86+ without error.
This told me the RTX wasn't the problem after reseating the card and connectors.

This problem was chronic for her, and has been gone for two weeks since the above reseating.
I have zero love for AMD-based system boards or graphics cards.
Too hard to support and too many quirks. YMMV but mine doesn't.

taiwan740 · Nov 22, 2023

I don't bother with NVIDIA graphics cards. I think they're a scam, especially with this Generative AI feature that won't be used by 99% of its patrons, and asking over $2000 for an RTX 4090. I do have a couple of ancient NVIDIAs that are discrete cards in laptops, but they don't work as well as their AMD counterparts on open source OSes for my usage (which is just watching YouTube occasionally, something which seems to be a tall ask of late!)

I'm using a Raspberry Pi for my desktop at the moment, it doesn't waste my time with all the compatibility nonsense that I've been trying to tackle.

Once my enthusiasm comes back, I've got a couple 10+ year old Radeon cards in the back of the wardrobe I can stick into my "non-working" hardware and I'll just think up some other operating system installation purpose for them.

bgavin · Nov 23, 2023

nVidia is in business to make money... big money.
They are the leading edge of AI, and charge accordingly.
nVidia is also the leading edge for hard core gamers, and they charge accordingly for this as well.

I have 81 versions of nVidia drivers in my support tool kit as of this writing.
As to cost, it takes a lot of staff to constantly evolve this number of drivers over such a wide product range.
I wrote a utility that scans all the nVidia INF files to find the best/latest drivers that match the GUIDs I need.

I don't game at all, but I do a lot of video transcoding.
The nVidia Series 40 cards do both HEVC as well as AV1 encoding and decoding in hardware.
My big Xeon transcoding workstation takes 72 hours to transcode in software, and about 45 minutes with nVidia hardware.

My daily workstation still uses a Radeon 4770 dual monitor setup on an Intel i7 board.
This card came out in 2009 and still runs just fine today for a work machine.
As usual, I had some driver problems with it, but once I found a working set, they run fine.