em driver problems? taskq em goes up to 100% CPU

ibmed · Dec 30, 2008

Hi,
I seem to have the following problem:

Preamble:
There are two FreeBSD boxes that do ipfw nat.
Both worked with natd until some time ago, when it became clear that we needed a better solution. So I upgraded the sources and recompiled the kernel to include the ipfw nat features. The boxes have onboard msk network cards that worked fine under natd.
After moving to ipfw nat the overall load on the boxes dropped significantly. But: when the traffic load increased, the boxes started to print the following messages:
kernel: msk0: Rx FIFO overrun!
kernel: msk1: Rx FIFO overrun!
and after some time they just hang.

I assumed there was a problem with the msk driver and urgently bought two Intel cards (Intel <EXPI9404PT> PRO/1000 PT Quad Port (OEM) PCI-E x4 10/100/1000Mbps), one for each box. That solved the FIFO problem, BUT: there's now a problem with spontaneous peaks of system load. For no reason I can identify, the system processes taskq em0, em1, em2 begin to eat 100% of CPU (and stay like that for about 1 or 2 minutes). When that happens, traffic stops passing through the box. The problem occurs on both boxes.

I tried upgrading both of them to RELENG_7_0, RELENG_7_1 (RC2 currently), RELENG_7 - the problem remains.

Any suggestions on what I can do to solve the problem?
I have no idea what causes it or how to fix it. I would really appreciate your help.

And just to be thorough, here are my kernel options:
options IPFIREWALL
options IPFIREWALL_VERBOSE
options IPFIREWALL_VERBOSE_LIMIT=400
options IPFIREWALL_DEFAULT_TO_ACCEPT
options IPFIREWALL_FORWARD
options IPFIREWALL_NAT
options LIBALIAS
options DUMMYNET
options IPDIVERT

fastforwarding is on, polling is off:
net.inet.ip.fastforwarding: 1

em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
        ether 00:00:10:01:71:02
        inet 10.1.71.2 netmask 0xffffff00 broadcast 10.1.71.255
        media: Ethernet autoselect (1000baseTX <full-duplex>)
        status: active

Foobar · Jan 30, 2009

Hello,

did you find a solution to this 'em' problem?

I have a lot of boxes with different hardware, e.g. motherboards and CPUs (Intel Core2Duo, Xeon), but all of them have various Intel Ethernet cards that are detected as em interfaces.
Sometimes on these FreeBSD boxes (7.1-RELEASE) em0 or em1 freezes while eating 100% CPU.
The boxes run: quagga (ospf, zebra), ng_nat, mpd5.2, net-snmp53 and cron scripts...

I thought it was an SMP problem.

trev · Jan 31, 2009

To state the obvious, something is using the NIC at a very high rate. Find that something and you've found the culprit.
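A rough sketch of where to start looking, using stock FreeBSD tools. The function name em_snapshot and the default interface em0 are only illustrative, and it is meant to be run while a spike is in progress:

```sh
#!/bin/sh
# em_snapshot is a hypothetical helper name, not an existing tool.
em_snapshot() {
    ifname="${1:-em0}"    # interface to watch; em0 is just an example

    # One batch-mode display of the top 20 entries, including system
    # processes (-S) and individual threads (-H), so the em taskq
    # threads show up alongside userland processes.
    top -b -S -H -d 1 20

    # Interrupt counts and rates per device since boot.
    vmstat -i

    # Packets, errors and bytes per second on the interface.
    # This keeps printing until interrupted with Ctrl-C.
    netstat -w 1 -I "$ifname"
}
```

Call it as `em_snapshot em1` for another port. If the packet rate reported by netstat stays ordinary while a taskq thread is pegged, that would suggest the load is not simply a matter of traffic volume.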

A possible work-around is to use device polling on the NIC (you'll need to rebuild your kernel unless you've already enabled it). Although this will reduce performance, it may allow you to diagnose what is causing your NIC to work overtime.
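For reference, a minimal sketch of turning polling on, assuming the interface is em0 and that the kernel configuration does not yet include polling support. The kernel options are shown as comments because they go into the kernel configuration file, not into a shell:

```sh
# Kernel configuration file additions (rebuild and reboot afterwards):
#   options DEVICE_POLLING
#   options HZ=1000

# Enable polling on one interface (em0 is an example name).
ifconfig em0 polling

# POLLING should now appear in the options=... line.
ifconfig em0

# Inspect the polling tunables and counters.
sysctl kern.polling

# Switch back to interrupt-driven operation.
ifconfig em0 -polling
```

Polling is enabled per interface, so each em port on the quad-port card would need its own `ifconfig ... polling`.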

Djn · Jan 31, 2009

Polling isn't a bad alternative, either: the latency increases by at most the time between two polls (so 1 ms, if you poll at HZ=1000), and you replace the interrupt handling overhead with the somewhat cheaper polls. Definitely worth trying if there's high load.
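The latency bound is just one poll interval, i.e. 1/HZ. A small sketch of the arithmetic (the function name poll_interval_us is made up for illustration):

```sh
#!/bin/sh
# Worst-case extra latency added by polling is one poll interval.
# poll_interval_us is a hypothetical helper: it takes HZ and prints
# the interval between two polls in microseconds.
poll_interval_us() {
    hz="$1"
    echo $((1000000 / hz))
}

poll_interval_us 1000   # prints 1000  (1 ms)
poll_interval_us 100    # prints 10000 (10 ms)
```

So with a lower HZ the added latency grows accordingly, which is why HZ=1000 is the usual choice when polling is enabled.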

However, it sounds like the problem at hand here isn't necessarily caused by excessive network traffic...
