Solved Intel (igb) Quad-Port (82576) crashing at load

Hi,

I bought an Intel Pro/1000 VT Quad-Port with an 82576 chip, connected via PCIe x4, for a machine running FreeBSD 13.
Once I put some load on the card, the system reboots without any hint on the screen or in the logs.

I did some research and found a hint regarding ASPM being enabled, but I didn't find a way to disable it.
In loader.conf I put hw.pci.enable_aspm = 0, but it didn't change anything.

As far as I can see this chip is rather common and often used.

Any idea what I could do to fix this?

Thanks!

Code:
igb0@pci0:3:0:0:    class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10e8 subvendor=0x8086 subdevice=0xa02c
    vendor     = 'Intel Corporation'
    device     = '82576 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfe420000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfe000000, size 4194304, enabled
    bar   [18] = type I/O Port, range 32, base 0xd020, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfe444000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 512(512) FLR NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 4 corrected
    ecap 0003[140] = Serial 1 001b21ffff555cc8
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ca
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
igb1@pci0:3:0:1:    class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10e8 subvendor=0x8086 subdevice=0xa02c
    vendor     = 'Intel Corporation'
    device     = '82576 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfe400000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfdc00000, size 4194304, enabled
    bar   [18] = type I/O Port, range 32, base 0xd000, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfe440000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 512(512) FLR NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 4 corrected
    ecap 0003[140] = Serial 1 001b21ffff555cc8
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ca
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
igb2@pci0:4:0:0:    class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10e8 subvendor=0x8086 subdevice=0xa02c
    vendor     = 'Intel Corporation'
    device     = '82576 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfd820000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfd400000, size 4194304, enabled
    bar   [18] = type I/O Port, range 32, base 0xc020, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfd844000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 512(512) FLR NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffff555ccc
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ca
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
igb3@pci0:4:0:1:    class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10e8 subvendor=0x8086 subdevice=0xa02c
    vendor     = 'Intel Corporation'
    device     = '82576 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfd800000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfd000000, size 4194304, enabled
    bar   [18] = type I/O Port, range 32, base 0xc000, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfd840000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 512(512) FLR NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffff555ccc
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ca
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
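
A side note on reading the dump above: each port's link line ends in ASPM L0s/L1(L0s/L1). As far as I understand pciconf(8), the value before the parentheses is what is currently enabled and the value inside is what the link supports, so ASPM appears to be still active despite the loader.conf setting. The AER figures are, I believe, counts of status bits that are set rather than totals of error events. Below is a minimal sketch that pulls both out per port. The function name summarize_pcie is made up for illustration, and the field layout is assumed from the output shown here.

```sh
#!/bin/sh
# Summarize ASPM state and AER figures per device.
# Reads `pciconf -lvc` style output on stdin; layout assumed from the dump above.
summarize_pcie() {
    awk '
        # Print one summary line for the device collected so far.
        function emit() {
            if (dev != "")
                printf "%s aspm=%s fatal=%s nonfatal=%s corrected=%s\n", dev, aspm, fatal, nonfatal, corrected
        }
        # Device header, e.g. "igb0@pci0:3:0:0:    class=0x020000 ..."
        /^[A-Za-z0-9_]+@pci/ {
            emit()
            dev = $1
            sub(/@.*/, "", dev)
            aspm = fatal = nonfatal = corrected = "n/a"
        }
        # Link line, e.g. "link x4(x4) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)"
        / ASPM / {
            for (i = 1; i < NF; i++)
                if ($i == "ASPM") {
                    aspm = $(i + 1)
                    sub(/\(.*/, "", aspm)   # keep only the part before "("
                }
        }
        # AER line, e.g. "ecap 0001[100] = AER 1 0 fatal 0 non-fatal 4 corrected"
        / AER / {
            for (i = 2; i <= NF; i++) {
                if ($i == "fatal")     fatal = $(i - 1)
                if ($i == "non-fatal") nonfatal = $(i - 1)
                if ($i == "corrected") corrected = $(i - 1)
            }
        }
        END { emit() }
    '
}
```

Assumed usage on the affected machine: pciconf -lvc | summarize_pcie. If ASPM were really off, I would expect the aspm= field to read "disabled".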
 
If the system just reboots, I'd suspect a hardware issue; otherwise you'd see a kernel panic or similar.
Anyhow, does it occur if you disable MSI-X?
 
Did you test this card on another OS? I'd test it with a live Linux distro first; if the problem is present there too, try plugging it into another machine. I think that would help rule out a broken card.
 
I will try disabling MSI-X.
I haven't tested the card on another OS yet. Sadly, this is the only machine I have with a PCIe slot.

I can try to install a different OS and try again. The onboard card (RealTek) works without issues so far.
 
I could imagine the system is triple faulting and hence rebooting without any further information. If that's true, it could be more challenging to debug.

As was suggested, trying a different OS version or a different OS type is a worthwhile test.
 
iperf output is like this before it dies:
Code:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  11.3 MBytes  95.1 Mbits/sec                 
[  5]   1.00-2.00   sec  11.2 MBytes  94.2 Mbits/sec                 
[  5]   2.00-3.00   sec  11.2 MBytes  94.1 Mbits/sec                 
[  5]   3.00-4.00   sec  11.2 MBytes  94.2 Mbits/sec                 
[  5]   4.00-5.00   sec  1.68 MBytes  14.1 Mbits/sec                 
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec

I just tried it again; now I have a kernel panic on the screen, but only AFTER the reboot.
It really just suddenly dies and reboots. No hints, nothing. No other device is connected at the moment.
I have not managed to disable MSI-X yet; sysctl doesn't know this identifier.

The kernel panic now seems to be about the filesystem:
panic: ufs_dirbad: /: bad dir ino 963090 at offset 512: mangled entry
 
Make sure your crashes are saved (see the Handbook on how to configure crash dumps).
But from what you mentioned, I'd say it's a triple fault.

The later ufs_dirbad panic is most likely a consequence of the immediate reboot before it. You need to boot into rescue mode, or even boot off a CD/USB, and run fsck on that filesystem.
 
Any usable information in /var/crash/?
I need to find a way to get it to boot again. Currently it panics at boot after the last load test.
Maybe I need to reinstall and try again. Afterwards I will try to get details from there.

EDIT: It doesn't get to the boot loader and stops at db>, but won't accept any input.
 
This is on 13-STABLE
Code:
root@tsukihi:/home/freebsd # sysctl -a | grep igb |grep msix
dev.igb.1.iflib.disable_msix: 0
dev.igb.0.iflib.disable_msix: 0
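
A note on making this setting stick: as far as I know, iflib reads disable_msix when the driver attaches, so changing it with sysctl on a running system may have no effect until the next boot. Below is a minimal sketch of the boot-time form, assuming the four ports attach as igb0 through igb3 as in the pciconf output earlier in the thread. The system-wide alternative is commented out because it affects every PCI device.

```conf
# /boot/loader.conf
# Ask iflib not to use MSI-X on each igb port (assumes units 0-3).
dev.igb.0.iflib.disable_msix="1"
dev.igb.1.iflib.disable_msix="1"
dev.igb.2.iflib.disable_msix="1"
dev.igb.3.iflib.disable_msix="1"

# Alternative: disable MSI-X for all PCI devices.
# hw.pci.enable_msix="0"
```

After a reboot, the MSI-X line in pciconf -lc for each port should no longer end in "enabled" if the setting took effect.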
 
I ran fsck to fix it; it said it fixed something, but the panic remains.

I will reinstall and try again.
Thanks for your input so far. I was hoping there was an easy solution like the old "disable ACPI" days ;-)

I will come back with news.
 
Just try a live CD image instead of wiping your current system: plug in a USB stick with some Linux live distro. The problem could also be an underpowered power supply or bad memory; I have had similar problems. Try cleaning the RAM contacts with a rubber eraser and reseating the modules in the motherboard. Hopefully the problem is not in your motherboard or other upgradable parts.
 
Is that a custom kernel? The db> prompt is the DDB debugger prompt. It could be that it's expecting input on the serial console; I've had this issue before.

If this is just a test machine and you can reinstall it, I'd do that and choose ZFS for /. It can survive these sudden reboots better than UFS (I don't have any evidence other than my own experience to support that).

Along with blind0ne's advice, it doesn't hurt to run memtest+ or similar to stress the memory too.
 
Regarding the question of whether the card is genuine: is there a way to find out?

As for the kernel: it's an OPNsense instance, but I am about to test it with Debian soon.

memtest+ is on the list next. The system (Fujitsu S930) looks clean (was refurbished), but I will check the RAM anyway.
 
I set dumpdev="AUTO" in rc.conf, but no dump in /var/crash.
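
For reference, a minimal sketch of the crash dump settings and how they could be checked. The dumpdir value is the usual default and is spelled out here only for clarity. Keep in mind that a dump is written only if the kernel actually gets as far as a panic; a hard reset or triple fault would leave /var/crash empty even with everything configured correctly, which would match what is seen here.

```conf
# /etc/rc.conf
dumpdev="AUTO"        # let the system pick a suitable swap device for dumps
dumpdir="/var/crash"  # where savecore(8) stores the dump on the next boot

# To verify after a reboot (run as root):
#   dumpon -l      # lists the currently configured dump device
#   savecore -C    # reports whether a dump is waiting to be saved
```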

I set:
Code:
dev.igb.0.iflib.disable_msix=1
dev.igb.0.eee_control=0
hw.pci.do_power_suspend=0

But to no avail. As soon as I run iperf, it starts at the maximum link speed (about 100 Mbit/s) and drops to 0. After 2-3 seconds at 0 Mbit/s, the machine reboots.

I disabled as many BIOS settings as possible (power management etc.), but to no avail.
Disabling ASPM does not seem to be possible via FreeBSD or the BIOS.
 
Bash:
Device            1K-blocks            Used        Avail Capacity
/dev/ada0p3        8388608                0            8388608        0%

I found something about issues with SMBus, but the "hack" of covering two pins with tape does not seem to have worked so far.
 
Alright, I finally put Ubuntu Live Desktop (20.04 LTS) on a USB stick.
When I ran apt update && apt install via the Intel card, it also rebooted.

What does that tell us? Most likely a hardware issue.

I can't test the Intel card in another machine, so I need to either
a) buy a new NIC
b) buy new RAM

I can't upgrade the power supply, as the thin client has an external power supply (like a notebook). Supplying more power would be difficult, but I don't really think this is the problem. The other NIC (onboard) seems to work and is inactive while I run my test.

What would you suggest? I will run the memtest now.
 
Limit your RAM to one module during tests, it will speed up the checks too.
Insufficient power could be a problem if the PSU is weak. You could at least put a wattmeter in line to see whether the power draw is too much for that PSU.

Some other things I'd do:
Test NIC somewhere else
Test different NIC in that PCIe slot

If you are under warranty you could try to RMA that NIC too.

An obvious question but still - have you checked the /var/log/messages for doublefault messages, etc.? Both on FreeBSD and Linux.
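
A minimal sketch of such a check follows. The function name scan_crash_log and the pattern list are my own assumptions and certainly not exhaustive; adjust them to whatever your kernel actually prints.

```sh
#!/bin/sh
# Print log lines that hint at a panic or hardware fault, with line numbers.
# Pass one or more log files as arguments.
scan_crash_log() {
    grep -n -i -E 'panic|double fault|fatal trap|machine check|hardware error|pcie bus error' "$@"
}
```

On FreeBSD this would be run as scan_crash_log /var/log/messages. On an installed Linux system the kernel log of the previous boot can be read with journalctl -k -b -1 instead; a live USB session normally keeps no logs across the reboot, so there may be nothing to look at there.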