Solved (Workaround) FreeBSD 10.1, sudden network down

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#1
Hi,

I have a server running FreeBSD 10.1 with ZFS and NFS. It has suddenly gone down twice within 10 days during high traffic; the server is not pingable from anywhere until I reboot it.

The first boot was without any network tuning; the second boot used the following network tuning parameters:
Code:
# Network tuning
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
net.inet.tcp.delayed_ack=0
net.inet.tcp.path_mtu_discovery=0
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=65536
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65536
net.inet.tcp.sendbuf_max=16777216
The server reports Intel(R) PRO/1000 Network Connection 7.4.2, and the NIC is detected as em(4) instead of igb(4).

Previously this machine ran FreeBSD 9.1 with more than 365 days of uptime and no issues, using the same physical cable connected to the same switch port.

Any ideas, or might it be a driver issue?
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#2
The FreeBSD 10.1 server's network stopped responding again during high throughput; that's the third time within ten days. Something is definitely wrong with this version of FreeBSD. No error messages have been found in dmesg or /var/log/messages.
 

gkbsd

Member

Thanks: 36
Messages: 75

#5
Is there any downside to not using the Intel driver and letting FreeBSD handle the network card? If it's doable, it would help to run temporarily with the default FreeBSD driver and see whether the problem still occurs. If the problem no longer appears, you can be sure your problem is driver related.

Guillaume
 

wblock@

Administrator
Staff member
Administrator
Moderator
Developer

Thanks: 3,616
Messages: 13,850

#6
It is not clear what you mean by that. The Intel drivers in FreeBSD are provided by Intel. Did you download and compile a driver from somewhere else?
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#7
I'm using the default driver that ships with FreeBSD 10.1; no additional compilation or driver installation has been done.
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#9
I'm using a Supermicro server with a Xeon E1230-V2 and an integrated Intel NIC, which is detected as em(4). The problem is that FreeBSD 10.1 goes down under some workloads (three times within 10 days so far). Before that, the same machine ran FreeBSD 9.1 for nearly 365 days without any issue.
 

nforced

Member

Thanks: 8
Messages: 87

#11
I have to say I have the same problem with my Intel® PRO/1000 PT Dual Port Server Adapter - http://www.intel.com/content/www/us/en/network-adapters/gigabit-network-adapters/pro-1000-pt-dp.html
It started right after upgrading from 10.0 to 10.1. The network goes down with "destination host unreachable" and similar errors as soon as NFSv4 load starts; everything returns to normal after about 30-60 seconds without any human intervention.

My system has always used the em driver, so no difference in that respect.

Code:
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe020-0xe03f mem 0xf7d80000-0xf7d9ffff,0xf7d60000-0xf7d7ffff irq 16 at device 0.0 on pci1
em0: Using an MSI interrupt
em0: Ethernet address: 00:15:17:78:6d:76
em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe000-0xe01f mem 0xf7d20000-0xf7d3ffff,0xf7d00000-0xf7d1ffff irq 17 at device 0.1 on pci1
em1: Using an MSI interrupt
em1: Ethernet address: 00:15:17:78:6d:77
Could this be related? From https://www.freebsd.org/releases/10.1R/relnotes.html:
The em(4) driver has been updated to version 7.4.2. [r269196]

Any clues?
 

nforced

Member

Thanks: 8
Messages: 87

#12
It seems the em 7.4.2 driver is the cause. I found many people having such problems, starting here: http://lists.freebsd.org/pipermail/freebsd-bugs/2014-September/058075.html

Here is what I just did, and it looks like the problem is gone now (at least I can't reproduce it the way I could before):

Code:
ifconfig:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
So I added -tso4 in /etc/rc.conf to the interface serving NFSv4, then restarted the network:

Code:
# service netif restart
# service routing restart
(Restarting routing was also needed after netif.)
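For anyone replicating this, a hypothetical /etc/rc.conf entry with TSO4 disabled would look something like the line below; the interface name and address are placeholders of my own, not taken from this thread:

```shell
# /etc/rc.conf -- example entry only; em0 and the address are made up
ifconfig_em0="inet 192.0.2.10 netmask 255.255.255.0 -tso4"
```

The flag takes effect the next time netif starts for that interface.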

Here's the result
Code:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO>
I can't see any noticeable performance difference after this change, but I still feel bad about doing it. I wonder whether it would be better to replace this driver with the one shipped with 9.2...

belon_cfy, can you try whether this works for you? I am curious :)
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#13
By the way, the network can be brought back by turning the interface off and on again:
Code:
ifconfig em1 down
ifconfig em1 up
As you suspected, I also think the problem might be caused by the Intel 7.4.2 driver. I swapped the HDD into a new server with an Intel I210 NIC two days ago and ran the same test again; no issues so far.
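The down/up bounce can be automated as a crude watchdog while waiting for a proper fix. A sketch in sh; the function name, probe address, and the overridable commands are my own illustration, not from this thread:

```shell
#!/bin/sh
# Watchdog sketch for the hang described above: if a probe host stops
# answering pings, bounce the interface with down/up. The ping and
# ifconfig commands are parameters so the logic can be dry-run.
bounce_if_down() {
    iface=$1
    probe=$2
    ping_cmd=${3:-ping}
    ifconfig_cmd=${4:-ifconfig}

    if ! $ping_cmd -c 1 "$probe" > /dev/null 2>&1; then
        $ifconfig_cmd "$iface" down
        $ifconfig_cmd "$iface" up
    fi
}
```

Something like `bounce_if_down em1 192.0.2.1` could then be run from cron every minute as a stopgap, though obviously this only papers over the driver problem.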
 

nforced

Member

Thanks: 8
Messages: 87

#14
Code:
ifconfig em1 down
ifconfig em1 up
Sure, that's OK too, but it won't work if you are connected over SSH through that same interface ;)
Yup, from what I've read there is definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#15
Sure, that's OK too, but it won't work if you are connected over SSH through that same interface ;)
Yup, from what I've read there is definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
Thanks for your suggestion. Will disabling TSO4 impact network performance?
 

nforced

Member

Thanks: 8
Messages: 87

#16
Thanks for your suggestion. Will disabling TSO4 impact network performance?
I personally can't see any difference. I've read that TSO is generally a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. from the mailing list
http://lists.freebsd.org/pipermail/freebsd-stable/2014-September/080088.html

So, either run with TSO disabled or reduce the rsize, wsize of all NFS
mounts to 32768 (which reduces the # of mbufs in a transmit list to 19).
2. Revert to an "older" version of the driver, because I don't need anything from this change on my machine:
http://svnweb.freebsd.org/base?view=revision&revision=269196

MFC of R267935: Sync the E1000 shared code to Intel internal, and more importantly add new I218 adapter support to em.
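To make point 1 from the mailing list concrete, reducing rsize/wsize on an NFS mount would look something like the /etc/fstab entry below; the server name and paths are placeholders of my own, not from this thread:

```
# hypothetical /etc/fstab entry with reduced NFS transfer sizes
server:/export  /mnt/export  nfs  rw,nfsv4,rsize=32768,wsize=32768  0  0
```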
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#17
I personally can't see any difference. I've read that TSO is generally a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. from the mailing list


2. Revert to "older" version of the driver because I don't need any of this for my machine:
I'm now switching back to the previous server with the Intel PRO/1000 NIC and stress testing it again with -tso. I will update you all in a few days.
 

nforced

Member

Thanks: 8
Messages: 87

#19
Code:
netstat -m
2048/6262/8310 mbufs in use (current/cache/total)
2046/1592/3638/503540 mbuf clusters in use (current/cache/total/max)
2046/1496 mbuf+clusters out of packet secondary zone in use (current/cache)
0/4/4/251769 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74598 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41961 16k jumbo clusters in use (current/cache/total/max)
4604K/4765K/9369K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
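The all-zero "denied" counters above argue against plain mbuf exhaustion as the cause of the hang. A quick way to check that programmatically (a sketch; the function name is mine, and it reads `netstat -m` text on stdin so it can be tested with canned output):

```shell
#!/bin/sh
# Count "denied" lines from `netstat -m` output whose leading counter
# field is nonzero; prints 0 when no denials have occurred.
count_denials() {
    awk '/denied/ && $1 !~ /^0(\/0)*$/ { bad++ } END { print bad + 0 }'
}
```

Usage would be `netstat -m | count_denials`; anything other than 0 suggests the kernel is running out of mbufs or clusters.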
 

nforced

Member

Thanks: 8
Messages: 87

#21
Hey, things still work normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since the driver is compiled into the kernel, you need to recompile the kernel without it before you can use the module version, or compile the "correct" version into the kernel (I got build errors trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive I couldn't find.

So in the end I'll just keep it this way.

Here is one interesting page I found: https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#22
Hey, things still work normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since the driver is compiled into the kernel, you need to recompile the kernel without it before you can use the module version, or compile the "correct" version into the kernel (I got build errors trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive I couldn't find.

So in the end I'll just keep it this way.

Here is one interesting topic I found https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
Thanks for sharing the NIC tuning.

However, I found that the default values for igb and em are the same, 1024:
Code:
hw.em.rxd: 1024
hw.em.txd: 1024
hw.igb.rxd: 1024
hw.igb.txd: 1024
hw.ix.txd: 2048
hw.ix.rxd: 2048
 
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#23
Since disabling TSO on em(4) does help, I will mark this thread as solved with the workaround provided by nforced.
 
Last edited by a moderator:
OP

belon_cfy

Well-Known Member

Thanks: 7
Messages: 260

#24
The same issue happened today: one of the em ports went down on another server. The server is still working fine and accessible from the management network, but the storage network did not work until I rebooted the whole server.

That server already has TSO disabled and is running FreeBSD 9.3.
 