Solved i915 unrecoverable GPU stutter

Hey there :]
The iGPU of my ThinkPad X200 (GMA 4500 MHD) enters a heavy stutter after roughly a minute of 100% GPU load.
(FreeBSD 13, i915kms, drm-fbsd13-kmod-5.4.92.g20210419)
It can be triggered either by a 3D game or by watching a 60fps video with the debug info overlay while not in full screen. It also happens at random, roughly every 30 minutes, when watching full-screen video.
The result is that everything lags, and the lag never goes away. It affects the mouse cursor in Sway (Wayland) and the text output of the Alacritty terminal, which draws directly with OpenGL. The lag is not limited to the WM, and restarting the WM changes nothing. Once the lag has started, it stays. Even switching from Wayland to X11 after the lag has started does not clear it. I also waited overnight, and the next morning the lag was still present. Only a reboot cures it.
Here is a video showcasing the symptoms, captured externally so as to avoid affecting the behaviour.
In chronological order:
  1. 0:07 - Lag starts after an elevated GPU load. (60fps video not fullscreen, with debug information drawn, notice the skyrocketing "Frames dropped (output)")
  2. 0:10 - Moving the mouse cursor, which now also lags
  3. 0:17 - Restarting the Video, to show the lag is not related to the video player
  4. 0:30 - Switching to an X11 window manager
  5. 0:48 - Again opening the video, showcasing the lag (The lag has survived a WM switch)
Dmesg for the sake of completeness:
Code:
---<<BOOT>>---
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 04:24:09 UTC 2021
    root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FreeBSD clang version 11.0.1 (git@github.com:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
VT(vga): resolution 640x480
CPU: Intel(R) Core(TM)2 CPU         P8800  @ 2.66GHz (2666.83-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x1067a  Family=0x6  Model=0x17  Stepping=10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xc08e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,XSAVE,OSXSAVE>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  VT-x: (disabled in BIOS) HLT,PAUSE
  TSC: P-state invariant, performance statistics
real memory  = 10401873920 (9920 MB)
avail memory = 7883034624 (7517 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <COREv4 COREBOOT>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and the knob 'bypass_before_seeding' was enabled.
ioapic0 <Version 2.0> irqs 0-23
Launching APs: 1
Timecounter "TSC-low" frequency 1333413540 Hz quality 1000
KTLS: Initialized 2 threads
random: entropy device external interface
000.000019 [4354] netmap_init               netmap: loaded module
[ath_hal] loaded
WARNING: Device "kbd" is Giant locked and may be deleted before FreeBSD 14.0.
kbd1 at kbdmux0
mlx5en: Mellanox Ethernet driver 3.6.0 (December 2020)
nexus0
vtvga0: <VT VGA driver>
cryptosoft0: <software crypto>
aesni0: No AES or SHA support.
acpi0: <COREv4 COREBOOT>
acpi0: Power Button (fixed)
ACPI Error: No handler for Region [ERAM] (0xfffff800038d2480) [EmbeddedControl] (20201113/evregion-290)
ACPI Error: Region EmbeddedControl (ID=3) has no handler (20201113/exfldio-428)
ACPI Error: Aborting method \134_SB.PCI0.LPCB.EC.BAT0._STA due to previous error (AE_NOT_EXIST) (20201113/psparse-689)
ACPI Error: No handler for Region [ERAM] (0xfffff800038d2480) [EmbeddedControl] (20201113/evregion-290)
ACPI Error: Region EmbeddedControl (ID=3) has no handler (20201113/exfldio-428)
ACPI Error: Aborting method \134_SB.PCI0.LPCB.EC.BAT1._STA due to previous error (AE_NOT_EXIST) (20201113/psparse-689)
ACPI Error: No handler for Region [ERAM] (0xfffff800038d2480) [EmbeddedControl] (20201113/evregion-290)
ACPI Error: Region EmbeddedControl (ID=3) has no handler (20201113/exfldio-428)
ACPI Error: Aborting method \134_SB.DOCK._STA due to previous error (AE_NOT_EXIST) (20201113/psparse-689)
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
cpu0: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
acpi_ec0: <Embedded Controller: GPE 0x11> port 0x62,0x66 on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: could not evaluate _ADR - AE_NOT_FOUND
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0x3400-0x3407 mem 0xe1000000-0xe13fffff,0xd0000000-0xdfffffff irq 16 at device 2.0 on pci0
agp0: <Intel GM45 SVGA controller> on vgapci0
WARNING: Device "agp" is Giant locked and may be deleted before FreeBSD 14.0.
agp0: aperture size is 256M, detected 360444k stolen memory
vgapci0: Boot video device
vgapci1: <VGA-compatible display> mem 0xe1400000-0xe14fffff at device 2.1 on pci0
em0: <Intel(R) PRO/1000 Network Connection> port 0x3000-0x301f mem 0xe1600000-0xe161ffff,0xe1624000-0xe1624fff irq 16 at device 25.0 on pci0
em0: Using 1024 TX descriptors and 1024 RX descriptors
em0: Using an MSI interrupt
em0: Ethernet address: 00:26:2d:fd:71:0c
em0: netmap queues/slots: TX 1/1024, RX 1/1024
uhci0: <Intel 82801I (ICH9) USB controller> port 0x3020-0x303f irq 16 at device 26.0 on pci0
uhci0: LegSup = 0x2f00
usbus0 on uhci0
usbus0: 12Mbps Full Speed USB v1.0
uhci1: <Intel 82801I (ICH9) USB controller> port 0x3040-0x305f irq 17 at device 26.1 on pci0
uhci1: LegSup = 0x2f00
usbus1 on uhci1
usbus1: 12Mbps Full Speed USB v1.0
uhci2: <Intel 82801I (ICH9) USB controller> port 0x3060-0x307f irq 18 at device 26.2 on pci0
uhci2: LegSup = 0x2f00
usbus2 on uhci2
usbus2: 12Mbps Full Speed USB v1.0
ehci0: <Intel 82801I (ICH9) USB 2.0 controller> mem 0xe1626000-0xe16263ff irq 18 at device 26.7 on pci0
usbus3: EHCI version 1.0
usbus3 on ehci0
usbus3: 480Mbps High Speed USB v2.0
hdac0: <Intel 82801I HDA Controller> mem 0xe1620000-0xe1623fff irq 16 at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
ath0: <Atheros AR938x> mem 0xe1500000-0xe151ffff irq 17 at device 0.0 on pci2
ar9300_flash_map: unimplemented for now
Restoring Cal data from DRAM
Restoring Cal data from EEPROM
ar9300_hw_attach: ar9300_eeprom_attach returned 0
ath0: [HT] enabling HT modes
ath0: [HT] enabling short-GI in 20MHz mode
ath0: [HT] 1 stream STBC receive enabled
ath0: [HT] 1 stream STBC transmit enabled
ath0: [HT] LDPC transmit/receive enabled
ath0: [HT] 3 RX streams; 3 TX streams
ath0: AR9380 mac 448.3 RF5110 phy 1220.0
ath0: 2GHz radio: 0x0000; 5GHz radio: 0x0000
pcib3: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pci4: <ACPI PCI bus> on pcib4
uhci3: <Intel 82801I (ICH9) USB controller> port 0x3080-0x309f irq 16 at device 29.0 on pci0
uhci3: LegSup = 0x2f00
usbus4 on uhci3
usbus4: 12Mbps Full Speed USB v1.0
uhci4: <Intel 82801I (ICH9) USB controller> port 0x30a0-0x30bf irq 17 at device 29.1 on pci0
uhci4: LegSup = 0x2f00
usbus5 on uhci4
usbus5: 12Mbps Full Speed USB v1.0
uhci5: <Intel 82801I (ICH9) USB controller> port 0x30c0-0x30df irq 18 at device 29.2 on pci0
uhci5: LegSup = 0x2f00
usbus6 on uhci5
usbus6: 12Mbps Full Speed USB v1.0
ehci1: <Intel 82801I (ICH9) USB 2.0 controller> mem 0xe1627000-0xe16273ff irq 16 at device 29.7 on pci0
usbus7: EHCI version 1.0
usbus7 on ehci1
usbus7: 480Mbps High Speed USB v2.0
pcib5: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci5: <ACPI PCI bus> on pcib5
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel ICH9M AHCI SATA controller> port 0x3408-0x340f,0x3418-0x341b,0x3410-0x3417,0x341c-0x341f,0x30e0-0x30ff mem 0xe1625000-0xe16257ff irq 17 at device 31.2 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
acpi_acad0: <AC Adapter> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
WARNING: Device "psm" is Giant locked and may be deleted before FreeBSD 14.0.
psm0: model IBM/Lenovo TrackPoint, device ID 14
acpi_acad1: <AC Adapter> on acpi0
battery0: <ACPI Control Method Battery> on acpi0
battery1: <ACPI Control Method Battery> on acpi0
acpi_button0: <Sleep Button> on acpi0
acpi_lid0: <Control Method Lid Switch> on acpi0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
Timecounters tick every 1.000 msec
hdacc0: <Conexant CX20561 (Hermosa) HDA CODEC> at cad 0 on hdac0
hdaa0: <Conexant CX20561 (Hermosa) Audio Function Group> at nid 1 on hdacc0
pcm0: <Conexant CX20561 (Hermosa) (Analog 2.0+HP/2.0)> at nid 26,22 and 24 on hdaa0
pcm1: <Conexant CX20561 (Hermosa) (Internal Analog Mic)> at nid 29 on hdaa0
unknown: <Conexant CX20561 (Hermosa) HDA CODEC Modem Function Group> at nid 2 on hdacc0 (no driver attached)
ugen5.1: <Intel UHCI root HUB> at usbus5
ugen7.1: <Intel EHCI root HUB> at usbus7
uhub0 on usbus5
uhub0: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus5
uhub1 on usbus7
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus7
ugen3.1: <Intel EHCI root HUB> at usbus3
ugen6.1: <Intel UHCI root HUB> at usbus6
uhub2 on usbus3
uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
uhub3 on usbus6
uhub3: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus6
ugen4.1: <Intel UHCI root HUB> at usbus4
ugen1.1: <Intel UHCI root HUB> at usbus1
uhub4 on usbus4
uhub4: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
uhub5 on usbus1
uhub5: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus1
ugen2.1: <Intel UHCI root HUB> at usbus2
uhub6 on usbus2
uhub6: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
ugen0.1: <Intel UHCI root HUB> at usbus0
uhub7 on usbus0
uhub7: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
Trying to mount root from ufs:ada0.eli []...
Root mount waiting for: usbus0 usbus1 usbus2 usbus3 usbus4 usbus5 usbus6 usbus7 CAM
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Samsung SSD 860 EVO 500GB RVT03B6Q> ACS-4 ATA SATA 3.x device
ada0: Serial Number S4XBNF0M917888K
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 476940MB (976773168 512 byte sectors)
cd0 at ahcich1 bus 0 scbus1 target 0 lun 0
cd0: <MATSHITA DVD-RAM UJ892 SB01> Removable CD-ROM SCSI device
cd0: Serial Number HG97 832986
cd0: 150.000MB/s transfers (SATA 1.x, UDMA5, ATAPI 12bytes, PIO 8192bytes)
cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed
GEOM_ELI: Device ada0.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
uhub3: 2 ports with 2 removable, self powered
uhub4: 2 ports with 2 removable, self powered
uhub7: 2 ports with 2 removable, self powered
uhub0: 2 ports with 2 removable, self powered
uhub5: 2 ports with 2 removable, self powered
uhub6: 2 ports with 2 removable, self powered
Root mount waiting for: usbus3 usbus7
Root mount waiting for: usbus3 usbus7
uhub2: 6 ports with 6 removable, self powered
uhub1: 6 ports with 6 removable, self powered
ugen3.2: <vendor 0x17ef product 0x1005> at usbus3
uhub8 on uhub2
uhub8: <vendor 0x17ef product 0x1005, class 9/0, rev 2.00/0.01, addr 2> on usbus3
uhub8: MTT enabled
Root mount waiting for: usbus3
uhub8: 4 ports with 4 removable, self powered
mountroot: waiting for device ada0.eli...
random: unblocking device.
ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x400-0x41f mem 0xe1628000-0xe16280ff irq 18 at device 31.3 on pci0
smbus0: <System Management Bus> on ichsmb0
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] Unable to create a private tmpfs mount, hugepage support will be disabled(-19).
Successfully added WC MTRR for [0xd0000000-0xdfffffff]: 0;
[drm] Got stolen memory base 0x7e000000, size 0x16000000
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
[drm] Connector LVDS-1: get mode from tunables:
[drm]   - kern.vt.fb.modes.LVDS-1
[drm]   - kern.vt.fb.default_mode
[drm] Connector VGA-1: get mode from tunables:
[drm]   - kern.vt.fb.modes.VGA-1
[drm]   - kern.vt.fb.default_mode
[drm] Connector HDMI-A-1: get mode from tunables:
[drm]   - kern.vt.fb.modes.HDMI-A-1
[drm]   - kern.vt.fb.default_mode
[drm] Connector DP-1: get mode from tunables:
[drm]   - kern.vt.fb.modes.DP-1
[drm]   - kern.vt.fb.default_mode
[drm] Connector HDMI-A-2: get mode from tunables:
[drm]   - kern.vt.fb.modes.HDMI-A-2
[drm]   - kern.vt.fb.default_mode
[drm] Connector DP-2: get mode from tunables:
[drm]   - kern.vt.fb.modes.DP-2
[drm]   - kern.vt.fb.default_mode
[drm] Connector DP-3: get mode from tunables:
[drm]   - kern.vt.fb.modes.DP-3
[drm]   - kern.vt.fb.default_mode
[drm] RC6 disabled, disabling runtime PM support
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] Initialized i915 1.6.0 20190822 for drmn0 on minor 0
WARNING: Device "fb" is Giant locked and may be deleted before FreeBSD 14.0.
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=800 width=1280 depth=32
cmsize=16 size=4096000
pbase=0xd000e000 vbase=0xfffff800d000e000
name=drmn0 flags=0x0 stride=5120 bpp=32
cmap[0]=0 cmap[1]=7f0000 cmap[2]=7f00 cmap[3]=c4a000
end FB_INFO
drmn0: fb0: i915drmfb frame buffer device
wlan0: Ethernet address: 10:9a:dd:a0:a7:ae
lo0: link state changed to UP
wlan0: link state changed to UP
As the lag never stops, it appears this is not related to thermal throttling (I'm not even sure the GM45 is capable of it). On DragonFlyBSD with a T500 (same iGPU) I recall a periodic "GPU hang" happening, but I don't remember the details. I have no idea how to start debugging this or where to look. What steps can I take to understand what is happening?
 
Hello,

Have you tried using modesetting or x11-drivers/xf86-video-intel?

Is multimedia/libva-intel-driver installed for video acceleration?
When on X11, I tried both modesetting and the intel driver provided by xf86-video-intel. My main WM is Wayland based, though, so that doesn't apply there. Either way, a high GPU load causes this stutter, no matter which WM or driver it comes from. So I suspect i915kms is at fault.

Libva only supports video acceleration on Gen4.5 iGPUs like this one in an obscure branch of libva on Arch Linux. So in general, hardware decoding is not a thing on these iGPUs. This was Intel's very first attempt at hardware decoding, and it is therefore broken and unsupported.
But again, this is not only caused by Video, but any 3D scene as well.
 
I was able to recreate the same behavior on Linux.
It really may be thermal related after all. The heatpad I chose to sit between the chipset and the heatsink may have been too thin.
 
It seems my hunch was correct.
The heatpad I used was 0.5mm thick, which was a bit thinner than the OEM one. I have now replaced the GPU heatpad with a 1mm one and the problem seems to be gone: after 30 minutes there are no more stutters.
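For anyone who wants to check a similar thermal hunch before opening the machine, here is a minimal sketch that logs an ACPI thermal zone reading while the GPU is under load. It rests on several assumptions: that `hw.acpi.thermal.tz0.temperature` is the relevant sysctl OID (the dmesg above shows acpi_tz0 and acpi_tz1, but which zone, if any, tracks the chipset is not known), that the value is printed in the usual form like `71.1C`, and that the 5-second interval is good enough. The function names are made up for illustration.

```python
import re
import subprocess
import time

# Assumption: zone 0 is the interesting one; try tz1 as well on this machine.
TEMP_OID = "hw.acpi.thermal.tz0.temperature"


def parse_temperature(raw: str) -> float:
    """Convert sysctl output such as '71.1C' into degrees Celsius."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)C\s*", raw)
    if match is None:
        raise ValueError(f"unexpected temperature format: {raw!r}")
    return float(match.group(1))


def read_temperature(oid: str = TEMP_OID) -> float:
    """Read one sample by calling sysctl(8) with -n (value only)."""
    result = subprocess.run(
        ["sysctl", "-n", oid], check=True, capture_output=True, text=True
    )
    return parse_temperature(result.stdout)


def log_temperature(interval_s: float = 5.0, samples: int = 360) -> None:
    """Print elapsed time and temperature so a jump can be matched to the stutter."""
    start = time.monotonic()
    for _ in range(samples):
        elapsed = time.monotonic() - start
        print(f"{elapsed:7.1f}s  {read_temperature():5.1f} C", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    log_temperature()
```

Run it in a second terminal, start the 60fps video or the 3D game, and note the reading at the moment the stutter begins. If the zone does not reflect the chipset temperature, the log will not show anything useful, so a flat curve does not rule out a thermal cause.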
 