Solved (Workaround) FreeBSD 10.1, sudden network down

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#1
Hi,

I have a server running FreeBSD 10.1 with ZFS and NFS. It has suddenly gone down twice within 10 days during high traffic; the server is not pingable from anywhere until I reboot it.

The first boot was without any network tuning; the second boot used the following network tuning parameters:
Code:
# Network tuning
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
net.inet.tcp.delayed_ack=0
net.inet.tcp.path_mtu_discovery=0
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=65536
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65536
net.inet.tcp.sendbuf_max=16777216
The server reports Intel(R) PRO/1000 Network Connection 7.4.2, and the NIC is detected as em(4) instead of igb(4).

Previously this machine ran FreeBSD 9.1 with more than 365 days of uptime and no issues, using the same physical cable connected to the same switch port.

Any ideas, or might it be a driver issue?
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#2
The FreeBSD 10.1 server's network stopped responding again during high throughput; that's the third time within ten days. Something is definitely wrong with this version of FreeBSD. No error messages have been found in dmesg or /var/log/messages.
 

gkbsd

Member

Thanks: 36
Messages: 75

#5
Is there any downside to not using the Intel driver and letting FreeBSD handle the network card? If it's doable, it would help to run temporarily with the default FreeBSD driver and see whether the problem still occurs. If the problem no longer appears, you can be sure your problem is driver related.

Guillaume
 

wblock@

Administrator
Staff member
Administrator
Moderator
Developer

Thanks: 3,616
Messages: 13,850

#6
It is not clear what you mean by that. The Intel drivers in FreeBSD are provided by Intel. Did you download and compile a driver from somewhere else?
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#7
I'm using the default driver that ships with FreeBSD 10.1; no additional compilation or driver installation has been done.
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#9
I'm using a Supermicro server with a Xeon E1230-V2 and an integrated Intel NIC, which is detected as em(4). The problem is that FreeBSD 10.1 goes down under some workloads (three times within 10 days so far). Before that, the same machine ran FreeBSD 9.1 for nearly 365 days without any issue.
 

nforced

Member

Thanks: 8
Messages: 87

#11
I have to say I have the same problem with my Intel® PRO/1000 PT Dual Port Server Adapter - http://www.intel.com/content/www/us/en/network-adapters/gigabit-network-adapters/pro-1000-pt-dp.html
It started right after upgrading from 10.0 to 10.1. The network goes down with "destination host unreachable" and similar errors as soon as NFSv4 load starts; everything returns to normal after about 30-60 seconds without any human intervention.

My system has always used the em driver, so no difference in that respect.

Code:
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe020-0xe03f mem 0xf7d80000-0xf7d9ffff,0xf7d60000-0xf7d7ffff irq 16 at device 0.0 on pci1
em0: Using an MSI interrupt
em0: Ethernet address: 00:15:17:78:6d:76
em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe000-0xe01f mem 0xf7d20000-0xf7d3ffff,0xf7d00000-0xf7d1ffff irq 17 at device 0.1 on pci1
em1: Using an MSI interrupt
em1: Ethernet address: 00:15:17:78:6d:77
Could this be related? From https://www.freebsd.org/releases/10.1R/relnotes.html:
The em(4) driver has been updated to version 7.4.2. [r269196]

Any clues?
 

nforced

Member

Thanks: 8
Messages: 87

#12
It seems the em 7.4.2 driver is the cause. I found many people having such problems, starting here: http://lists.freebsd.org/pipermail/freebsd-bugs/2014-September/058075.html

Here is what I just did, and it looks like the problem is gone now (at least I can't reproduce it the way I could before):

Code:
ifconfig:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
So I added -tso4 in /etc/rc.conf to the interface serving NFSv4, then restarted the network:

Code:
# service netif restart
# service routing restart
(Restarting routing was also needed after netif.)
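For anyone replicating this, a hypothetical /etc/rc.conf entry with TSO4 disabled would look something like the line below; the interface name and address are placeholders of my own, not taken from this thread:

```shell
# /etc/rc.conf -- example entry only; em0 and the address are made up
ifconfig_em0="inet 192.0.2.10 netmask 255.255.255.0 -tso4"
```

The flag takes effect the next time netif starts for that interface.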

Here's the result
Code:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO>
I can't see any noticeable performance difference after this change, but I still feel bad about doing it. I wonder whether it would be better to replace this driver with the one shipped with 9.2...

belon_cfy, can you try whether this works for you? I am curious :)
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#13
By the way, the network can be brought back by turning the interface off and on again:
Code:
ifconfig em1 down
ifconfig em1 up
As you suspected, I also think the problem might be caused by the Intel 7.4.2 driver. I swapped the HDD into a new server with an Intel I210 NIC two days ago and ran the same test again; no issues so far.
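The down/up bounce can be automated as a crude watchdog while waiting for a proper fix. A sketch in sh; the function name, probe address, and the overridable commands are my own illustration, not from this thread:

```shell
#!/bin/sh
# Watchdog sketch for the hang described above: if a probe host stops
# answering pings, bounce the interface with down/up. The ping and
# ifconfig commands are parameters so the logic can be dry-run.
bounce_if_down() {
    iface=$1
    probe=$2
    ping_cmd=${3:-ping}
    ifconfig_cmd=${4:-ifconfig}

    if ! $ping_cmd -c 1 "$probe" > /dev/null 2>&1; then
        $ifconfig_cmd "$iface" down
        $ifconfig_cmd "$iface" up
    fi
}
```

Something like `bounce_if_down em1 192.0.2.1` could then be run from cron every minute as a stopgap, though obviously this only papers over the driver problem.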
 

nforced

Member

Thanks: 8
Messages: 87

#14
Code:
ifconfig em1 down
ifconfig em1 up
Sure, that's OK too, but it won't work if you are connected over SSH through that same interface ;)
Yup, from what I've read there is definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#15
Sure, that's OK too, but it won't work if you are connected over SSH through that same interface ;)
Yup, from what I've read there is definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
Thanks for your suggestion. Will disabling TSO4 impact network performance?
 

nforced

Member

Thanks: 8
Messages: 87

#16
Thanks for your suggestion. Will disabling TSO4 impact network performance?
I personally can't see any difference. I've read that TSO is generally a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. from the mailing list
http://lists.freebsd.org/pipermail/freebsd-stable/2014-September/080088.html

So, either run with TSO disabled or reduce the rsize, wsize of all NFS
mounts to 32768 (which reduces the # of mbufs in a transmit list to 19).
2. Revert to an "older" version of the driver, because I don't need anything from this change on my machine:
http://svnweb.freebsd.org/base?view=revision&revision=269196

MFC of R267935: Sync the E1000 shared code to Intel internal, and more importantly add new I218 adapter support to em.
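To make point 1 from the mailing list concrete, reducing rsize/wsize on an NFS mount would look something like the /etc/fstab entry below; the server name and paths are placeholders of my own, not from this thread:

```
# hypothetical /etc/fstab entry with reduced NFS transfer sizes
server:/export  /mnt/export  nfs  rw,nfsv4,rsize=32768,wsize=32768  0  0
```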
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#17
I personally can't see any difference. I've read that TSO is generally a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. from the mailing list


2. Revert to "older" version of the driver because I don't need any of this for my machine:
I'm now switching back to the previous server with the Intel PRO/1000 NIC and stress testing it again with -tso. I will update you all in a few days.
 

nforced

Member

Thanks: 8
Messages: 87

#19
Code:
netstat -m
2048/6262/8310 mbufs in use (current/cache/total)
2046/1592/3638/503540 mbuf clusters in use (current/cache/total/max)
2046/1496 mbuf+clusters out of packet secondary zone in use (current/cache)
0/4/4/251769 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74598 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41961 16k jumbo clusters in use (current/cache/total/max)
4604K/4765K/9369K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
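The all-zero "denied" counters above argue against plain mbuf exhaustion as the cause of the hang. A quick way to check that programmatically (a sketch; the function name is mine, and it reads `netstat -m` text on stdin so it can be tested with canned output):

```shell
#!/bin/sh
# Count "denied" lines from `netstat -m` output whose leading counter
# field is nonzero; prints 0 when no denials have occurred.
count_denials() {
    awk '/denied/ && $1 !~ /^0(\/0)*$/ { bad++ } END { print bad + 0 }'
}
```

Usage would be `netstat -m | count_denials`; anything other than 0 suggests the kernel is running out of mbufs or clusters.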
 

nforced

Member

Thanks: 8
Messages: 87

#21
Hey, things still work normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since the driver is compiled into the kernel, you need to recompile the kernel without it before you can use the module version, or compile the "correct" version into the kernel (I got build errors trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive I couldn't find.

So in the end I'll just keep it this way.

Here is one interesting page I found: https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#22
Hey, things still work normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since the driver is compiled into the kernel, you need to recompile the kernel without it before you can use the module version, or compile the "correct" version into the kernel (I got build errors trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive I couldn't find.

So in the end I'll just keep it this way.

Here is one interesting topic I found https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
Thanks for sharing the NIC tuning.

However, I found that the default values for igb and em are the same, 1024:
Code:
hw.em.rxd: 1024
hw.em.txd: 1024
hw.igb.rxd: 1024
hw.igb.txd: 1024
hw.ix.txd: 2048
hw.ix.rxd: 2048
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#23
Since disabling TSO on em(4) does help, I will mark this thread as solved with the workaround provided by nforced.
 
Last edited by a moderator:
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#24
The same issue happened today: one of the em ports went down on another server. The server is still working fine and accessible from the management network, but the storage network did not work until I rebooted the whole server.

That server already has TSO disabled and is running FreeBSD 9.3.
 