10.2 RELEASE, re0, watchdog timeout

SergeySmirnykh · Feb 27, 2016

Hi, friends!

I have a server acting as a router on a network; about 400Mb/s of traffic passes through it.
Until recently everything worked reasonably well. Yesterday I updated the system to:

Code:

# uname -a
FreeBSD PPTP 10.2-RELEASE FreeBSD 10.2-RELEASE #0: Fri Feb 26 12:21:31 SAMT 2016  root@PPTP-14:/usr/src/sys/amd64/compile/mykernel  amd64

The system has two network cards:

Code:

# ifconfig -a | less
re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
  ether fc:aa:14:11:88:65
  inet 46.243.224.14 netmask 0xfffffc00 broadcast 46.243.227.255
  media: Ethernet 1000baseT <full-duplex>
  status: active
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO>
  ether a0:36:9f:41:48:8f
  inet 10.0.0.14 netmask 0xffffff00 broadcast 10.0.0.255
  media: Ethernet autoselect (1000baseT <full-duplex>)
  status: active

Errors began to appear in /var/log/messages:

Code:

Feb 27 19:27:10 PPTP-14 kernel: re0: watchdog timeout
Feb 27 19:27:10 PPTP-14 kernel: re0: link state changed to DOWN
Feb 27 19:27:13 PPTP-14 kernel: re0: link state changed to UP
Feb 27 19:27:13 PPTP-14 devd: Executing '/etc/rc.d/dhclient quietstart re0'
Feb 27 19:28:21 PPTP-14 kernel: re0: watchdog timeout
Feb 27 19:28:21 PPTP-14 kernel: re0: link state changed to DOWN
Feb 27 19:28:25 PPTP-14 kernel: re0: link state changed to UP
Feb 27 19:28:25 PPTP-14 devd: Executing '/etc/rc.d/dhclient quietstart re0'

As a result, the connection drops for a while each time. How can I fix this?

ZOleg · Feb 29, 2016

I have the same situation on my servers with Realtek 8168 network cards. With the FreeBSD driver the card occasionally lost its connection; with the Realtek driver the connection is always stable.
Try the driver from Realtek - http://www.realtek.com/downloads/do...d=5&Level=5&Conn=4&DownTypeID=3&GetDown=false

SirDice · Feb 29, 2016

Please use freebsd-update(8) to update to the latest patches.

SirDice · Feb 29, 2016

ZOleg said:
Try the driver from Realtek - http://www.realtek.com/downloads/do...d=5&Level=5&Conn=4&DownTypeID=3&GetDown=false

Don't. As you can see from the list they only have drivers for FreeBSD 7.x and 8.x, both of which are End-of-Life.

ZOleg · Feb 29, 2016

This driver builds and works well with FreeBSD 10.x. I use it on my servers and it works much better than the FreeBSD driver.

SirDice · Feb 29, 2016

ZOleg said:
This driver builds and works well with FreeBSD 10.x. I use it on my servers and it works much better than the FreeBSD driver.

It looks like that driver is actually older than the one from FreeBSD.

SergeySmirnykh · Mar 2, 2016

Hi, everybody!
I installed the driver from the vendor's website, as advised in the second post.
Although it is listed there as being for version 8, it also works on version 10, and so far it works perfectly.

weiclin · Jul 11, 2016

I have the same issue with an RTL8111GR; the system is 10.3-RELEASE-p5.
Downloading and compiling the driver from the website solved the problem for me.

MAM · Sep 12, 2016

SirDice said:
Don't. As you can see from the list they only have drivers for FreeBSD 7.x and 8.x, both of which are End-of-Life.

SirDice said:
It looks like that driver is actually older than the one from FreeBSD.

I have been plagued by the problem described here since 10.0. For me it was even worse, because sometimes the LAN card locked up so badly that only a full reboot brought it back. With the main router failing, the whole network goes haywire, so the situation became unbearable after a few weeks.

As a last resort I took the "old driver" that has been mentioned here and gave it a try.

"Old but stable", you could say. For two months now: no dropouts, no hangs, no overruns!

Comparing the sources makes it clear why: the old driver does not use the timeout feature at all! It contains
conditions that prevent the timeout code from being compiled in on newer releases,
like:

Code:

#if OS_VER < VERSION(9,0)
    /* Clear the timeout timer. */
    ifp->if_timer = 0;
#endif
...
#if OS_VER < VERSION(7,0)
static void re_watchdog(ifp)
struct ifnet *ifp;
{
    struct re_softc *sc;

    sc = ifp->if_softc;

    printf("re%d: watchdog timeout\n", sc->re_unit);
    ifp->if_oerrors++;

    re_txeof(sc);
    re_rxeof(sc);
    re_init(sc);

    return;
}
#endif

Could it be that, for some unknown reason, these #if guards were lost in the current kernel sources?
Could it be that the currently shipped driver is therefore much older than it could be?

Obviously, current cards don't work very well with those timers...

gustopn · Dec 10, 2016

I have the same problem on FreeBSD 11.0-STABLE #1 r309170.
It keeps logging:

Code:

kernel: re0: watchdog timeout

The problem is not the number of interrupts on a CPU core, but the time the core spends handling them.
When interrupt time on a core goes above 89%, I get the watchdog timeout.
This may not happen on every machine: with a more powerful core handling the interrupts it may not be
an issue, but here, on a 1500 MHz 4-core CPU, it happens often, especially with full-duplex transfers.
When I transfer files in only one direction at gigabit speed, interrupt time reaches about 60% on one
core, which does not yet trigger the watchdog timeout. But as soon as I start a transfer in the
opposite direction, the problem appears.

My temporary solution is to use the pf firewall to limit the sending side to 500Mb and the outgoing
TCP ACKs to about 5Mb, which amounts to the same bandwidth.
Of course, that is not nice, because it then drops a lot of ACK packets.

The next problem is that the pf firewall makes things even worse: it uses queues for its QoS,
and when the qlimit is too high that also triggers a watchdog timeout, because QoS adds latency.
Setting the TCP ACK qlimit to a small number like 5 or 10 makes that problem go away.
But this is a second cause of watchdog timeouts: increased latency makes it time out
even when there are very few interrupts (around 5000) and the core's interrupt time is at an
unproblematic 30%. In contrast, the Realtek device can handle up to 20 000 interrupts
and a higher interrupt time (up to 80%) without a problem when transferring in one direction,
with the firewall either disabled or enabled.

I also noticed that disabling MSI-X helps a little, but that may also be due to some problem on the mainboard,
since my USB3 controller seems to have trouble with it too:

Code:

kernel: xhci1: Unable to map MSI-X table

That's why I put the following in my /boot/loader.conf:

Code:

hw.pci.enable_msix=0
hw.re.msix_disable=1

Now it only uses MSI and so should use fewer interrupts. That does not help much with the other causes,
and it does not reduce the amount of time spent on interrupts, but it does make the watchdog timeout
trigger less often on concurrent transfers.

Wikipedia says that MSI-X helps to distribute interrupts among CPU cores. That does not happen with MSI-X
for me at all. With MSI enabled, interrupts from different devices are still distributed to different cores, so the
load is still spread when several devices are generating interrupts, like the SATA disk controller.

Interestingly, when I disable both MSI-X and MSI, the interrupts are distributed over more CPU cores, but the problem
does not disappear. That distribution happens only with the pf firewall disabled; maybe some lock inside the pf
firewall causes it to use only one core. Disabling MSI is not a good idea either, although it gives the best
performance and some people in forums suggest it. However, I do not think the problem has anything to do with
MSI, just as it has nothing to do with RX or TX checksum offloading in hardware or with VLAN tagging.
I can explain those effects simply: the higher CPU load causes the NIC to transfer less, or more slowly. That is why some
people no longer see watchdog timeouts after disabling these hardware accelerations.
But the real reason is different: the network card is being throttled and transfers 30% less traffic.

From all of that I assume that the watchdog timeout is not part of the solution but part of the problem.
The hw.re.intr_filter and dev.re.%d.int_rx_mod options, which make the driver moderate interrupts,
may work on a single-core CPU, but for me they cause the driver not to recover; the system then has to be rebooted manually.
My other assumption is that the watchdog timer runs on a different CPU and so does not notice the load on
the NIC; it stops the NIC (and usually starts it again) because it thinks the NIC has become unresponsive.

The second problem is the driver itself, which does not seem to be up to date any more.
Under high interrupt load, FreeBSD's own re driver causes the NIC to lose track of what is going on.
In my experience high interrupt load should not be a problem: I see small ARM and MIPS devices
under 100% interrupt load all the time when doing I/O. Losing track should not be a problem either,
since it is quite normal (and I am even enforcing it with my pf firewall QoS) for network hardware to
start dropping frames under heavy load. But that should not necessarily mean that it has to fail and
restart itself. Assuming that a core runs at 3 GHz or more is as bad an idea as assuming that the
NIC knows "what time it is" (when the kernel driver and the interrupts are not handled on the same core).

I tried to compile Realtek's own driver, which people say solves this problem completely, but it no longer
compiles on FreeBSD-11. Basically there are two ways to deal with this problem. The first is to throttle
the NIC. One can set it to 100 Mbit/s mode, which works perfectly but is of course very slow. Alternatively,
one can limit the bandwidth using a firewall, which adds even more latency and thus more risk of a watchdog
timeout, but still gives about 5x the bandwidth of 100 Mbit/s mode. So that is an argument for it.
The second possibility is to order some cheap USB3 gigabit NIC, like I did, and hope that it works better,
but then why have an onboard gigabit NIC on PCIe 1x at all? That's a waste of hardware.

Either way, FreeBSD should either update the re driver or throw it out. It would be nice to have the original
Realtek driver working on FreeBSD-11 to see whether it does better. The last real update to FreeBSD's re driver
happened in FreeBSD 9, when all of this was committed, and that was years ago. Since then, only minor adjustments
have been made. Making the watchdog timer adjustable through sysctl or loader.conf variables would also be an
interesting idea: one could raise it and experiment a bit, so that it does not fire so often on computers
with slow CPU cores.
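
The tunable-watchdog idea can be illustrated with a small user-space model. This is only a sketch and not code from the re driver: the struct and function names (tx_watchdog, wd_tick, wd_fires_for_stall) and the tick granularity are hypothetical, made up for illustration. It models a countdown that is armed when a frame is queued, disarmed when the transmission completes, and fires if it reaches zero first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a NIC transmit watchdog. */
struct tx_watchdog {
    int timeout_ticks; /* ticks to wait for a TX completion (the assumed tunable) */
    int remaining;     /* ticks left before firing; 0 means disarmed */
    int fired;         /* number of "watchdog timeout" events so far */
};

static void wd_init(struct tx_watchdog *wd, int timeout_ticks)
{
    assert(wd != NULL && timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->remaining = 0;
    wd->fired = 0;
}

/* A frame was queued for transmission: (re)arm the countdown. */
static void wd_on_transmit(struct tx_watchdog *wd)
{
    wd->remaining = wd->timeout_ticks;
}

/* The hardware reported that pending transmissions completed: disarm. */
static void wd_on_tx_complete(struct tx_watchdog *wd)
{
    wd->remaining = 0;
}

/* Periodic tick. Returns true when the watchdog fires on this tick. */
static bool wd_tick(struct tx_watchdog *wd)
{
    if (wd->remaining == 0 || --wd->remaining != 0)
        return false;
    wd->fired++; /* a real driver would log and reinitialize the NIC here */
    return true;
}

/*
 * Simulate one queued frame whose completion is delayed by stall_ticks.
 * Returns true if the watchdog fired before the completion arrived.
 */
bool wd_fires_for_stall(int timeout_ticks, int stall_ticks)
{
    struct tx_watchdog wd;

    wd_init(&wd, timeout_ticks);
    wd_on_transmit(&wd);
    for (int t = 0; t < stall_ticks; t++) {
        if (wd_tick(&wd))
            return true;
    }
    wd_on_tx_complete(&wd);
    return false;
}
```

In this model, wd_fires_for_stall(5, 4) reports no timeout, wd_fires_for_stall(5, 5) reports one, and raising the timeout to 10 lets the same 5-tick stall pass. That is the effect a sysctl or loader.conf tunable would have in principle; whether it would actually help on slow cores is untested here.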

girgen@ · Jul 3, 2018

This is discussed in a couple of bug reports as well, so I'm just cross linking here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227979
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=166724
