Very strange system behavior with intel-ix-kmod driver

Hi all. I'm seeing very strange system behavior with the custom Intel ix driver installed from ports. I need SR-IOV functionality with my Intel 10Gb NIC, so I built the net/intel-ix-kmod driver from ports and added if_ix_updated_load="YES" to /boot/loader.conf on a FreeBSD 12.1-RELEASE-p7 system.
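For reference, the steps were roughly the standard ports procedure plus the loader knob mentioned above:
Code:
# build and install the driver from ports
cd /usr/ports/net/intel-ix-kmod
make install clean

# then enable it at boot in /boot/loader.conf
if_ix_updated_load="YES"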

But after rebooting, the boot fails at the zfs:zroot mount point. How can these even be related?

Dmesg on a normal boot (if_ix_updated_load disabled in loader.conf):
Code:
...
Jul 16 20:25:53 msrv kernel: ses0: da7,pass8 in Array Device Slot 19, SAS Slot: 1 phys at slot 0
Jul 16 20:25:53 msrv kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Jul 16 20:25:53 msrv kernel: ses0:  phy 0: parent 5001438023a1de26 addr 5000c50059178479
Jul 16 20:25:53 msrv kernel: ses0: da4,pass5 in Array Device Slot 20, SAS Slot: 1 phys at slot 0
Jul 16 20:25:53 msrv kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Jul 16 20:25:53 msrv kernel: ses0:  phy 0: parent 5001438023a1de26 addr 5000c50059232ab9

Jul 16 20:25:53 msrv kernel: Trying to mount root from zfs:zroot/ROOT/default []... 

Jul 16 20:25:53 msrv kernel: uhub0: 26 ports with 26 removable, self powered
Jul 16 20:25:53 msrv kernel: ugen0.2: <ISSC ISSCEDRBTA> at usbus0
Jul 16 20:25:53 msrv kernel: ugen0.3: <vendor 0x0557 product 0x7000> at usbus0
Jul 16 20:25:53 msrv kernel: uhub1 on uhub0
...

Dmesg on a failed boot (with if_ix_updated_load="YES"):
Code:
...
Jul 16 20:15:53 msrv kernel: ses0: da7,pass8 in Array Device Slot 19, SAS Slot: 1 phys at slot 0
Jul 16 20:15:53 msrv kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Jul 16 20:15:53 msrv kernel: ses0:  phy 0: parent 5001438023a1de26 addr 5000c50059178479
Jul 16 20:15:53 msrv kernel: ses0: da4,pass5 in Array Device Slot 20, SAS Slot: 1 phys at slot 0
Jul 16 20:15:53 msrv kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Jul 16 20:15:53 msrv kernel: ses0:  phy 0: parent 5001438023a1de26 addr 5000c50059232ab9

Jul 16 20:15:53 msrv kernel: Mounting from zfs:zroot/ROOT/default failed with error 5
The boot stops at this point.

pciconf -lvc ix0:
Code:
ix0@pci0:1:0:0: class=0x020000 card=0x000c8086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x8(x8) speed 5.0(5.0) ASPM disabled(L0s)
    cap 03[e0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 00e0edffff9eba54
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     6 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304

uname -a
Code:
FreeBSD msrv.example.com 12.1-RELEASE-p7 FreeBSD 12.1-RELEASE-p7 GENERIC  amd64

Thank you for any suggestions on how to overcome this.
 
Compile a custom kernel? Try again? It seems like a conflict.
What should I do in the kernel configuration file? Disable the built-in Intel ix driver?
I tried several times, with several weeks between attempts. No luck. The last try was yesterday, and after another failed attempt I decided to post this help request. I have no idea how a NIC driver can prevent ZFS from operating normally.
 
I have only seen something similar in a lab environment: enable a NIC, and the disk goes haywire... It is a kernel problem in my opinion. No need to disable things in the kernel; do a buildworld and a buildkernel. The disk is OK; the NIC driver is the cause of the conflict. Build several kernels and try them all, one with the default config, others by trial and error... you will fix it, keep going, don't give up!
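The usual rebuild from source is roughly this (a sketch; it assumes the sources are in /usr/src and the GENERIC config, adjust to your setup):
Code:
cd /usr/src
make -j4 buildworld
make -j4 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
# reboot, then:
make installworld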
 
Hard to tell. Clearly there is a problem that causes an EIO (I/O error, errno 5) when mounting ZFS, and clearly it is triggered by the presence of the intel-ix kernel module. You never told us what your disk controller is, but it is very likely itself a PCIe card, given that you seem to have many SAS disks in an enclosure, and those don't get connected to the motherboard SATA ports (duh). It seems incredibly unlikely that there is a fundamental destructive interaction (like bad interrupt sharing), since that doesn't happen with modern PCIe any longer. Note that the problem doesn't necessarily have to be in the disk device driver itself; it could be somewhere else in the I/O stack, or even in ZFS. And when I say "problem", I don't mean that these pieces of code necessarily have a bug, only that they are where the breakage shows up.

My hunch #1 is that there is a bug in the intel-ix kernel module, and that the disk driver is the innocent victim of that bug. It could be as simple as a write through a bad pointer: most of the time that write hits nothing interesting, but when the disk driver or ZFS is present, they happen to be in the wrong place at the wrong time. What you could try: disable all your SAS disks and boot from something else (SATA, a USB stick), but with the intel-ix module loaded. If things start working, then you've learned that either the problem only affects the combination of the two, or that the innocent victim isn't important. If things still fail, you know the intel-ix module is bad on its own, and you have something you can start working on.

Hunch #2 is very mundane: it could be a power problem. Modern high-function PCIe cards can draw amazing amounts of power (I have stories to tell of Broadcom/LSI cards that used so much power that, with inadequate cooling, they torched themselves), and perhaps having both the disk and network cards running at the same time causes power glitches. But those problems should not be exactly repeatable. The easiest way to address that is to either beef up the power supply or disable other things.

The next suggestion is to go to the kernel mailing list, but there you'll need a lot more detail (like models of both devices, dmesg, and such), and hope that one of the developers has a similar hardware configuration.
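A rough sketch of collecting that detail (standard tools; the output filenames are just examples):
Code:
pciconf -lvbc > pciconf.txt   # all PCI devices, verbose, with capabilities
dmesg -a > dmesg.txt          # full kernel message buffer
kldstat -v > kldstat.txt      # loaded kernel modules
zpool status > zpool.txt      # pool and disk layout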
 
The story continues. I changed the Intel NIC to a Chelsio one: 'T420-CR Unified Wire Ethernet Controller'. If I add

Code:
t4fw_cfg_load="YES"
t5fw_cfg_load="YES"
t6fw_cfg_load="YES"
if_cxgbe_load="YES"

as stated in "man cxgbe", the boot hangs and "Mounting from zfs:zroot/ROOT/default failed with error 5" appears again.
I can load the if_cxgbe driver manually after a successful boot without any problems. Maybe this information will help clarify the issue.
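For reference, loading it by hand after boot is just something like this (whether the firmware config module needs loading explicitly may vary):
Code:
# load the firmware config module and the Chelsio driver by hand
kldload t4fw_cfg
kldload if_cxgbe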

P.S. Small update: if I leave only if_cxgbe_load="YES" in loader.conf, the system boots successfully. Still, this is strange and confusing behavior.
 
Chelsio one: 'T420-CR Unified Wire Ethernet Controller'. If I add

Code:
t4fw_cfg_load="YES"
t5fw_cfg_load="YES"
t6fw_cfg_load="YES"
if_cxgbe_load="YES"
P.S. Small update. If I leave only "if_cxgbe_load="YES"" in loader.conf, I can boot the system successfully.
Try setting only the firmware for the T420, not all the others:
/boot/loader.conf
Code:
t4fw_cfg_load="YES"
if_cxgbe_load="YES"
Alternatively you could try loading the kernel modules from /etc/rc.conf:
Code:
kld_list="if_cxgbe t4fw_cfg"
 
Also check /var/log/messages for any output related to the network card and firmware; many/most devices that require firmware report success or failure there.
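For example:
Code:
# look for driver / firmware messages from the Chelsio card
grep -i -e cxgbe -e t4fw /var/log/messages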
 
Got some updates. With this configuration:
Code:
root@msrv:~ # cat /etc/iov/cxgbe0.conf
PF {
        device: cxgbe0;
        num_vfs: 4;
}
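For reference, this is the format consumed by iovctl(8); applying it looks roughly like this (either by hand or at boot via rc.conf):
Code:
# create the VFs from the config file
iovctl -C -f /etc/iov/cxgbe0.conf

# or apply it automatically at boot via /etc/rc.conf
iovctl_files="/etc/iov/cxgbe0.conf"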
I get these errors:
Code:
root@msrv:~ # pciconf -lvc

...
none9@pci0:1:1:0:       class=0xffffff card=0x00000000 chip=0x48011425 rev=0xff hdr=0x7f
    vendor     = 'Chelsio Communications Inc'
    device     = 'T420-CR Unified Wire Ethernet Controller [VF]'
pciconf: list_caps: bad header type

And in boot messages:
Code:
+t4vf0: <Chelsio T420-CR VF> at device 1.0 on pci1
+t4vf0: failed to find a usable interrupt type.  allowed=7, msi-x=0, msi=0, intx=1device_attach: t4vf0 attach returned 6
+t4vf0: <Chelsio T420-CR VF> at device 1.4 on pci1
+t4vf0: failed to find a usable interrupt type.  allowed=7, msi-x=0, msi=0, intx=1device_attach: t4vf0 attach returned 6
+t4vf0: <Chelsio T420-CR VF> at device 2.0 on pci1
+t4vf0: failed to find a usable interrupt type.  allowed=7, msi-x=0, msi=0, intx=1device_attach: t4vf0 attach returned 6
+t4vf0: <Chelsio T420-CR VF> at device 2.4 on pci1
+t4vf0: failed to find a usable interrupt type.  allowed=7, msi-x=0, msi=0, intx=1device_attach: t4vf0 attach returned 6

I have had a very unsuccessful experience with the SR-IOV feature: on my two servers, with different NICs, I can't get it working (https://forums.FreeBSD.org/threads/enabling-sr-iov-on-intel-driver.70647/post-486578).
 