Intel Pro 10G (ix) freezing during high traffic

absduser


I am seeing a recurring issue on FreeBSD 11.1-RELEASE with the ix driver (3.1.13-k): under high traffic, my network card stops passing traffic. Here are the particulars:

- at the time of the "crash" the NIC is pushing out ~1GB of traffic (primarily outbound) and around 75-90k pps
- the NIC is still pingable locally, but cannot accept or send traffic externally

netstat -m: (at the time of the crash)
Code:
94039/21086/115125 mbufs in use (current/cache/total)
65737/12069/77806/16775612 mbuf clusters in use (current/cache/total/max)
65737/11934 mbuf+clusters out of packet secondary zone in use (current/cache)
1018/8622/9640/8387806 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/2485275 9k jumbo clusters in use (current/cache/total/max)
0/0/0/1397967 16k jumbo clusters in use (current/cache/total/max)
159055K/63897K/222953K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 sendfile syscalls
0 sendfile syscalls completed without I/O request
0 requests for I/O initiated by sendfile
0 pages read by sendfile as part of a request
0 pages were valid at time of a sendfile request
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed
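The "denied" and "delayed" lines in the netstat -m output above are all zero, which suggests mbuf exhaustion is not what stalls the card. A minimal Python sketch for checking this automatically from a captured `netstat -m` dump follows; the function names `mbuf_pressure` and `has_mbuf_pressure` are made up for illustration, and the parsing assumes the line layout shown in this post.

```python
import re

def mbuf_pressure(netstat_m_output):
    """Collect every 'requests for ... denied/delayed' line as {label: [counts]}."""
    counters = {}
    for line in netstat_m_output.splitlines():
        # e.g. "0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)"
        m = re.match(r"\s*([\d/]+) requests for (.+?) (denied|delayed)\b", line)
        if m:
            counts = [int(n) for n in m.group(1).split("/")]
            counters[f"{m.group(2)} {m.group(3)}"] = counts
    return counters

def has_mbuf_pressure(netstat_m_output):
    """True if any denied/delayed counter is non-zero."""
    return any(any(c) for c in mbuf_pressure(netstat_m_output).values())

# Lines copied from the output in this post.
sample = """\
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
"""
print(has_mbuf_pressure(sample))
```

Running the same check on dumps taken before and during a stall would show whether any of these counters move at all.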

sysctl dev.ix | grep interrupt_rate: (at the time of the crash)
Code:
dev.ix.1.queue7.interrupt_rate: 0
dev.ix.1.queue6.interrupt_rate: 0
dev.ix.1.queue5.interrupt_rate: 0
dev.ix.1.queue4.interrupt_rate: 0
dev.ix.1.queue3.interrupt_rate: 0
dev.ix.1.queue2.interrupt_rate: 0
dev.ix.1.queue1.interrupt_rate: 0
dev.ix.1.queue0.interrupt_rate: 0
dev.ix.0.queue7.interrupt_rate: 500000
dev.ix.0.queue6.interrupt_rate: 500000
dev.ix.0.queue5.interrupt_rate: 500000
dev.ix.0.queue4.interrupt_rate: 500000
dev.ix.0.queue3.interrupt_rate: 500000
dev.ix.0.queue2.interrupt_rate: 500000
dev.ix.0.queue1.interrupt_rate: 500000
dev.ix.0.queue0.interrupt_rate: 500000
(prior to crash those figures were in flux)
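At the time of the stall every queue on a device reported the same interrupt_rate (all 0 on ix1, all 500000 on ix0), whereas the values normally fluctuate. A rough Python sketch for spotting that pattern in captured `sysctl dev.ix` output is below. Treating "all queues identical" as a stall indicator is an assumption drawn only from this one observation, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

def queue_rates(sysctl_output):
    """Parse 'dev.ix.N.queueM.interrupt_rate: V' lines into {ixN: {M: V}}."""
    rates = defaultdict(dict)
    for line in sysctl_output.splitlines():
        m = re.match(r"dev\.ix\.(\d+)\.queue(\d+)\.interrupt_rate:\s*(\d+)", line.strip())
        if m:
            unit, queue, rate = (int(g) for g in m.groups())
            rates[f"ix{unit}"][queue] = rate
    return dict(rates)

def uniform_devices(rates):
    """Return {device: value} for devices whose queues all report one identical value."""
    return {
        dev: next(iter(queues.values()))
        for dev, queues in rates.items()
        if len(queues) > 1 and len(set(queues.values())) == 1
    }

# A shortened excerpt of the output in this post.
sample = """\
dev.ix.1.queue1.interrupt_rate: 0
dev.ix.1.queue0.interrupt_rate: 0
dev.ix.0.queue1.interrupt_rate: 500000
dev.ix.0.queue0.interrupt_rate: 500000
"""
print(uniform_devices(queue_rates(sample)))
```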

systat -tcp 1: (at the time of the crash)
Code:
                   /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   ||||||||||||||||||||||||||||||||||||||


             TCP Connections                       TCP Packets
           0 connections initiated           75646 total packets sent
           0 connections accepted            64016 - data
           0 connections established           258 - data (retransmit by dupack)
           0 connections dropped               258 - data (retransmit by sack)
           0 - in embryonic state            11373 - ack-only
           0 - on retransmit timeout             0 - window probes
           0 - by keepalive                      0 - window updates
           0 - from listen queue                 0 - urgent data only
                                                 0 - control
                                                 0 - resends by PMTU discovery

             TCP Timers                      40420 total packets received
       20044 potential rtt updates           16807 - in sequence
       20599 - successful                       58 - completely duplicate
          11 delayed acks sent                   0 - with some duplicate data
           0 retransmit timeouts              3208 - out-of-order
           0 persist timeouts                  416 - duplicate acks
           0 keepalive probes                20599 - acks
           0 - timeouts                          0 - window probes
                                                 1 - window updates
                                                 0 - bad checksum

vmstat -i: (NOT at the time of the crash)
Code:
interrupt                          total       rate
irq5: uart2                        12665          0
irq18: ehci0 uhci5                     2          0
irq19: uhci2 uhci4                    27          0
cpu0:timer                     128218130       1725
cpu1:timer                      70989642        955
cpu4:timer                      77084729       1037
cpu23:timer                     57337377        771
cpu12:timer                     56820954        764
cpu6:timer                      74254569        999
cpu7:timer                      74549149       1003
cpu2:timer                      79381091       1068
cpu20:timer                     58652387        789
cpu10:timer                     59629106        802
cpu8:timer                      59224673        797
cpu22:timer                     58609410        788
cpu9:timer                      58162364        782
cpu16:timer                     57266397        770
cpu18:timer                     57564243        774
cpu19:timer                     56484023        760
cpu15:timer                     56022798        754
cpu5:timer                      76716821       1032
cpu11:timer                     58499243        787
cpu13:timer                     55827216        751
cpu17:timer                     56234589        756
cpu21:timer                     57115392        768
cpu3:timer                      77477070       1042
cpu14:timer                     57042863        767
irq256: igb0:que 0             180546537       2429
irq257: igb0:que 1             162536944       2187
irq258: igb0:que 2             155586807       2093
irq259: igb0:que 3             172733041       2324
irq260: igb0:que 4             103526741       1393
irq261: igb0:que 5             199118299       2679
irq262: igb0:que 6             157922942       2124
irq263: igb0:que 7             120417137       1620
irq264: igb0:link                      2          0
irq274: mps0                    41945045        564
irq275: mps1                    22103463        297
irq276: mps2                         511          0
irq277: ahci0:ch0                 210612          3
irq278: ahci0:ch1                 210884          3
irq279: ahci0:ch2                    133          0
irq280: ahci0:ch3                    133          0
irq281: ahci0:ch4                    133          0
irq293: ix0:q0                  61730991        830
irq294: ix0:q1                  22327729        300
irq295: ix0:q2                  90093508       1212
irq296: ix0:q3                  71431567        961
irq297: ix0:q4                  58330475        785
irq298: ix0:q5                  45056527        606
irq299: ix0:q6                  40896578        550
irq300: ix0:q7                  49991675        673
irq301: ix0:link                      28          0
irq302: ix1:q0                  70982750        955
irq303: ix1:q1                  48796468        656
irq304: ix1:q2                  71915750        967
irq305: ix1:q3                 100512928       1352
irq306: ix1:q4                  50342892        677
irq307: ix1:q5                  70621409        950
irq308: ix1:q6                  78816402       1060
irq309: ix1:q7                  40005799        538
irq310: ix1:link                      20          0
irq311: mps3                   512411663       6893
irq312: mps4                   417590889       5618
Total                         4797892342      64544

systat -ifstat -match igb0 -pps: (NOT at the time of the crash, but under similar network conditions, different card)
Code:
                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   |||||||||||||||||||||||||||||||

      Interface           Traffic               Peak                Total
           igb0  in     42.438 Kp/s         64.016 Kp/s            1.502 Gp
                 out    89.479 Kp/s         99.960 Kp/s            3.201 Gp

sysctls:
Code:
hw.ix.rxd=4096
hw.ix.txd=4096
net.isr.maxthreads="-1"

net.inet.tcp.drop_synfin=1
net.inet.ip.portrange.hifirst=62000
net.inet.ip.portrange.hilast=64000
security.mac.portacl.port_high=65535
net.inet.ip.fw.one_pass=0
net.inet.tcp.mssdflt=1460
net.inet.tcp.recvspace=2263000
net.inet.tcp.sendspace=2263000
net.inet.tcp.minmss=1300
net.inet.tcp.syncache.rexmtlimit=0
net.inet.tcp.tso=0
net.inet.tcp.cc.algorithm=htcp

kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
hw.intr_storm_threshold=10000

I have also tried these, with no difference:
Code:
dev.ix.0.fc=0
ifconfig ix0 -tso
hw.ix.enable_aim=0

Also of interest: after the crash, the AER line in the PCI capability listing for the card reported fatal errors:
Code:
ix0@pci0:131:0:0: class=0x020000 card=0x00018086 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10-Gigabit X540-AT2'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 link x8(x8) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 1 fatal 1 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 a0369fffff3e4538

At boot that line read:
Code:
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
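Since the AER line is the only place where anything changed between boot and the stall, it may be worth diffing it automatically. The Python sketch below extracts the three figures printed on that line and reports the difference; it only compares the numbers as printed and makes no claim about how they should be interpreted. `aer_counts` and `aer_delta` are illustrative names, not part of any tool.

```python
import re

# Matches e.g. "ecap 0001[100] = AER 2 1 fatal 1 non-fatal 1 corrected"
AER_RE = re.compile(r"AER \d+ (\d+) fatal (\d+) non-fatal (\d+) corrected")

def aer_counts(line):
    """Return the fatal / non-fatal / corrected figures from an AER line, or None."""
    m = AER_RE.search(line)
    if m is None:
        return None
    fatal, non_fatal, corrected = (int(g) for g in m.groups())
    return {"fatal": fatal, "non-fatal": non_fatal, "corrected": corrected}

def aer_delta(before, after):
    """Per-field difference between two aer_counts() results."""
    return {key: after[key] - before[key] for key in before}

# The two lines quoted in this post.
at_boot = "    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected"
after_crash = "    ecap 0001[100] = AER 2 1 fatal 1 non-fatal 1 corrected"

print(aer_delta(aer_counts(at_boot), aer_counts(after_crash)))
```

For the two lines above, the fatal and non-fatal figures each go up by one while the corrected figure is unchanged.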

Install info:

dmesg:
Code:
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> mem 0xf7e00000-0xf7ffffff,0xf7dfc000-0xf7dfffff irq 17 at device 0.0 numa-domain 1 on pci10
ix0: Using MSIX interrupts with 9 vectors
ix0: Ethernet address: a0:36:9f:3e:7f:2c
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 8/4096, RX 8/4096

System:
Code:
FreeBSD 11.1-RELEASE-p6 #0: Tue Dec 19 13:52:29 PST 2017
    user@11_1:/usr/src/sys/amd64/compile/kernel.11_1amd64 amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz (2400.14-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206c2  Family=0x6  Model=0x2c  Stepping=2
 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
 Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 274882101248 (262148 MB)
avail memory = 267105476608 (254731 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <123011 APIC1930>
FreeBSD/SMP: Multiprocessor System Detected: 24 CPUs
FreeBSD/SMP: 2 package(s) x 6 core(s) x 2 hardware threads

Thus far, to work around the problem I either reconfigure all networking onto another card (including ix1, which also crashes eventually) or "HUP" the NIC:
Code:
devctl disable ix0
devctl enable ix0
devctl suspend ix0
devctl resume ix0
There are no messages of any kind about failures or errors in logs or on console.
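Because nothing is logged when the card stalls, the manual reset could in principle be automated with a small watchdog. The following Python sketch is one way to do that under several assumptions: it issues only the `devctl disable` / `devctl enable` pair (whether the suspend/resume pair is also needed is not clear from the above), it uses a single ping to an external host as the health probe, and the names `reset_commands`, `check_and_reset`, and `ping_probe` are invented for this example.

```python
import subprocess

def reset_commands(ifname):
    """The devctl disable/enable pair from the post, as argument lists."""
    return [["devctl", "disable", ifname], ["devctl", "enable", ifname]]

def check_and_reset(ifname, probe, run=subprocess.run):
    """If probe() reports failure, run the reset commands once.

    Returns True when a reset was issued, False when the probe succeeded.
    `run` is injectable so the logic can be exercised without touching hardware.
    """
    if probe():
        return False
    for cmd in reset_commands(ifname):
        run(cmd, check=True)
    return True

def ping_probe(host):
    """Assumed health check: one ICMP echo to a host beyond the local segment."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Hypothetical usage, run as root from cron or a loop (192.0.2.1 is a placeholder):
#   check_and_reset("ix0", lambda: ping_probe("192.0.2.1"))
```

This only hides the symptom; it does not explain the AER errors or the stall itself.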
 

Phishfry


Why are you using FreeBSD 11.1-RELEASE? It has been EOL since September 2018.
The ix driver has gone through many versions in the two years since that release.
I would also recommend trying the ports version of the module, net/intel-ix-kmod, if you still have trouble after updating.
 