Solved (Workaround) FreeBSD 10.1, sudden network down

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Hi,

I have a server with FreeBSD 10.1 installed, running ZFS and NFS. It has suddenly gone down twice within 10 days, both times during high traffic, and it is not pingable from any other server until I reboot it.

The first boot was without any network tuning; the second boot was with the following network tuning parameters:
Code:
# Network tuning
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
net.inet.tcp.delayed_ack=0
net.inet.tcp.path_mtu_discovery=0
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=65536
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65536
net.inet.tcp.sendbuf_max=16777216
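For reference, entries like these normally go in /etc/sysctl.conf so they are applied at boot; being runtime sysctls, they can also be set on the running system with sysctl(8), e.g.:
Code:
# apply a runtime sysctl immediately (example values from the list above)
sysctl kern.ipc.somaxconn=4096
sysctl net.inet.tcp.delayed_ack=0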
The server's NIC uses the Intel(R) PRO/1000 Network Connection 7.4.2 driver and is detected as emX rather than igb.

Previously the same machine ran FreeBSD 9.1 with more than 365 days of uptime without any issue, using the same physical cable connected to the same switch port.

Any ideas, or might it be a driver issue?
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

The FreeBSD 10.1 server's network stopped responding again during high throughput; that is the third time within ten days. Something is definitely wrong with this version of FreeBSD. No error messages were found in dmesg or /var/log/messages.
 

gkbsd

Member

Reaction score: 36
Messages: 75

Hello,

Have you tried disabling your network tuning just to check how it goes?

Regards,
Guillaume
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Hi,

Yes, the problem occurred both with and without the network tuning parameters.
 

gkbsd

Member

Reaction score: 36
Messages: 75

Is there any downside to not using the Intel driver and letting FreeBSD handle the network card? If it's doable, it would help to run with the default FreeBSD driver temporarily and see if the problem still occurs. If the problem no longer appears, you will know that your problem is driver related.

Guillaume
 

wblock@

Administrator
Staff member
Administrator
Moderator
Developer

Reaction score: 3,631
Messages: 13,850

It is not clear what you mean by that. The Intel drivers in FreeBSD are provided by Intel. Did you download and compile a driver from somewhere else?
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

I'm using the default driver that ships with FreeBSD 10.1; no additional compilation or driver installation has been done.
 

wblock@

Administrator
Staff member
Administrator
Moderator
Developer

Reaction score: 3,631
Messages: 13,850

Then you are using the Intel driver. em(4) and igb(4) are for different Intel cards.
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

I'm using a Supermicro server with a Xeon E3-1230 V2 and an integrated Intel NIC, which is detected as em(4). The problem is that under some workloads the network on FreeBSD 10.1 goes down (three times within 10 days so far). Before that, the same machine ran FreeBSD 9.1 for nearly 365 days without any issue.
 

nforced

Member

Reaction score: 9
Messages: 87

I have to say I have the same problem with my Intel® PRO/1000 PT Dual Port Server Adapter - http://www.intel.com/content/www/us/en/network-adapters/gigabit-network-adapters/pro-1000-pt-dp.html
It started right after the upgrade from 10.0 to 10.1. The network goes down with "destination host unreachable" and similar errors to the host right after NFSv4 load starts; everything goes back to normal after about 30-60 seconds without any human intervention.

My system has always used the em driver, so no difference there.

Code:
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe020-0xe03f mem 0xf7d80000-0xf7d9ffff,0xf7d60000-0xf7d7ffff irq 16 at device 0.0 on pci1
em0: Using an MSI interrupt
em0: Ethernet address: 00:15:17:78:6d:76
em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xe000-0xe01f mem 0xf7d20000-0xf7d3ffff,0xf7d00000-0xf7d1ffff irq 17 at device 0.1 on pci1
em1: Using an MSI interrupt
em1: Ethernet address: 00:15:17:78:6d:77
Could this be related? https://www.freebsd.org/releases/10.1R/relnotes.html says:
The em(4) driver has been updated to version 7.4.2. [r269196]

Any clues?
 

nforced

Member

Reaction score: 9
Messages: 87

It seems like the em 7.4.2 driver is the cause. I found many people having similar problems, starting here: http://lists.freebsd.org/pipermail/freebsd-bugs/2014-September/058075.html

Here is what I just did, and it looks like the problem is gone now (at least I can't reproduce it the way I could before):

Code:
ifconfig:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
So I added -tso4 to the interface serving NFSv4 in /etc/rc.conf, then ran:

# service netif restart

I also had to run # service routing restart.
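For reference, the rc.conf entry ends up looking something like this (em0 and the address here are just placeholders, not my actual config):
Code:
# /etc/rc.conf -- example only; adjust interface name and address
ifconfig_em0="inet 192.0.2.10 netmask 255.255.255.0 -tso4"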

Here's the result
Code:
options=4009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO>
I can't see any noticeable performance difference after this change, but I feel bad about doing it. I wonder if it would be better to replace this driver with the one shipped with 9.2...

belon_cfy, can you try whether this works for you? I am curious :)
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

By the way, it can be recovered by turning the network interface off and on again:
Code:
ifconfig em1 down
ifconfig em1 up
Same as you, I suspect the problem might be caused by the Intel 7.4.2 driver. Two days ago I moved the HDD to a new server with an Intel I210 NIC and ran the same test again; no issues so far.
 

nforced

Member

Reaction score: 9
Messages: 87

ifconfig em1 down
ifconfig em1 up
Sure, that's OK too, but it's not going to work if you are connected over SSH through that interface ;)
Yup, from what I've read there's definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Sure, that's OK too, but it's not going to work if you are connected over SSH through that interface ;)
Yup, from what I've read there's definitely something wrong with this driver version; I never had problems with the previous one. Anyway, disabling TSO4 is a workaround for me while this gets sorted out :beer:
Thanks for your suggestion. Will disabling TSO4 impact network performance?
 

nforced

Member

Reaction score: 9
Messages: 87

Thanks for your suggestion. Will disabling TSO4 impact network performance?
I personally can't see any difference. From what I've read, TSO in general is a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. From the mailing list (example mount options after this list):
http://lists.freebsd.org/pipermail/freebsd-stable/2014-September/080088.html

So, either run with TSO disabled or reduce the rsize, wsize of all NFS
mounts to 32768 (which reduces the # of mbufs in a transmit list to 19).
2. Revert to an "older" version of the driver, because I don't need any of this for my machine:
http://svnweb.freebsd.org/base?view=revision&revision=269196

MFC of R267935: Sync the E1000 shared code to Intel internal, and more importantly add new I218 adapter support to em.
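To illustrate point 1, a client mount with the reduced sizes would look roughly like this (server name and paths here are placeholders, not from this thread):
Code:
# example NFSv4 mount with reduced read/write sizes
mount -t nfs -o nfsv4,rsize=32768,wsize=32768 server:/export /mnt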
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

I personally can't see any difference. From what I've read, TSO in general is a good thing to have on a NIC and reduces CPU load, but in my setup the CPU doesn't go over ~20% during high NFSv4 load, so I am OK with this.
When I have the time I will try two other things:

1. From the mailing list


2. Revert to an "older" version of the driver, because I don't need any of this for my machine:
I'm now switching back to the previous server with the Intel PRO/1000 NIC and stress testing it again with -tso. I will update you all in a few days.
 

nforced

Member

Reaction score: 9
Messages: 87

Code:
netstat -m
2048/6262/8310 mbufs in use (current/cache/total)
2046/1592/3638/503540 mbuf clusters in use (current/cache/total/max)
2046/1496 mbuf+clusters out of packet secondary zone in use (current/cache)
0/4/4/251769 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74598 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41961 16k jumbo clusters in use (current/cache/total/max)
4604K/4765K/9369K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Could it be triggered by a jumbo frame network environment? The MTU on all of my NFS servers has been set to 9000.
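For context, the jumbo frame configuration on those interfaces is basically of this form (interface name and address are placeholders):
Code:
# /etc/rc.conf -- example only
ifconfig_em1="inet 192.0.2.20 netmask 255.255.255.0 mtu 9000"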
 

nforced

Member

Reaction score: 9
Messages: 87

Hey, things have still worked normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since it's compiled into the kernel, one needs to rebuild the kernel without it before being able to use the module version, or compile the "correct" version into the kernel (I got build errors while trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive somewhere that I can't find.

So in the end I'll just keep it that way.

Here is one interesting page I found: https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
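Note that hw.em.rxd and hw.em.txd are loader tunables, so they only take effect after a reboot; the values in effect can be checked with sysctl:
Code:
# read back the descriptor ring sizes currently in use
sysctl hw.em.rxd hw.em.txd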
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Hey, things have still worked normally here ever since I disabled TSO :)
I tried reverting to the old driver but hit a rock: since it's compiled into the kernel, one needs to rebuild the kernel without it before being able to use the module version, or compile the "correct" version into the kernel (I got build errors while trying this)... Also, I can't find the previous version on any Intel site; maybe they have an archive somewhere that I can't find.

So in the end I'll just keep it that way.

Here is one interesting page I found: https://fasterdata.es.net/host-tuning/nic-tuning/

Code:
/boot/loader.conf
hw.em.rxd=4096
hw.em.txd=4096
This seems to boost things a bit for me.
Thanks for sharing the NIC tuning.

However, I found that the default values for igb and em are the same, namely 1024.
Code:
hw.em.rxd: 1024
hw.em.txd: 1024
hw.igb.rxd: 1024
hw.igb.txd: 1024
hw.ix.txd: 2048
hw.ix.rxd: 2048
 
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

Since disabling TSO on emX does help, I will mark this thread as solved with the workaround provided by nforced.
 
Last edited by a moderator:
OP

belon_cfy

Well-Known Member

Reaction score: 7
Messages: 260

The same issue happened today: one of the em ports went down on another server. The server is still working fine and accessible from the management network, but the storage network does not work until I reboot the whole server.

The server already has TSO disabled and is running FreeBSD 9.3.
 

tingo

Daemon

Reaction score: 368
Messages: 1,945

Time to ping the Intel NIC guy on the freebsd-stable mailing list, perhaps?
 