High CPU usage, all interrupts

So this is a pfSense box running their latest release (FreeBSD 10.1-RELEASE-p25), and I've got an odd problem that's probably not pfSense-specific, so I thought I'd see if anyone here had some pointers.

Basically the box is consistently chewing up 30-40% CPU all the time. It's all apparently in interrupt processing. The CPU usage stays the same whether the box is passing traffic or not.

NICs involved:

Code:
re1:

rgephy1: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus2
rgephy1:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re1: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet> port 0xcc00-0xccff mem 0xfe2ff000-0xfe2ff0ff irq 16 at device 0.0 on pci3
re1: Chip rev. 0x10000000
re1: MAC rev. 0x00000000
miibus2: <MII bus> on re1

re0: 

rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xdc00-0xdcff mem 0xfe5ff000-0xfe5fffff,0xd0000000-0xd000ffff irq 16 at device 0.0 on pci1
re0: Using 1 MSI-X message
re0: Chip rev. 0x3c000000
re0: MAC rev. 0x00400000
miibus0: <MII bus> on re0

bge0:

brgphy0: <BCM57780 1000BASE-T media interface> PHY 1 on miibus1
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: <Broadcom BCM57780 A1, ASIC rev. 0x57780001> mem 0xfe4f0000-0xfe4fffff irq 16 at device 0.0 on pci2
bge0: CHIP ID 0x57780001; ASIC REV 0x57780; CHIP REV 0x577800; PCI-E
miibus1: <MII bus> on bge0

What top looks like:

Code:
last pid: 69559;  load averages:  0.77,  0.85,  0.80                                                        up 6+18:45:26  19:53:46
68 processes:  3 running, 64 sleeping, 1 waiting
CPU:  0.0% user,  0.3% nice,  0.7% system, 38.0% interrupt, 61.0% idle
Mem: 28M Active, 126M Inact, 163M Wired, 779M Buf, 3515M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root        2 155 ki31     0K    32K RUN     1 248.2H 128.86% idle
   12 root       21 -72    -     0K   336K WAIT    1  75.9H  76.17% intr
    0 root       11 -92    0     0K   176K -       1  31:28   0.00% kernel
   15 root        1 -16    -     0K    16K -       1   8:08   0.00% rand_harvestq

And from systat:

Code:
Interrupts
191k total
188k bge0 re1
2110 hpet0+ 20
     uhci0 ehci
     uhci3 23
 354 re0 256

Again, note that with no network traffic, that high rate of interrupts continues.

Any ideas?
 
So this is a pfSense box running their latest release (FreeBSD 10.1-RELEASE-p25), and I've got an odd problem that's probably not pfSense-specific, so I thought I'd see if anyone here had some pointers.
Sure, but someone will be along shortly to point out PC-BSD, FreeNAS, NAS4Free, and all other FreeBSD derivatives.

Basically the box is consistently chewing up 30-40% CPU all the time. It's all apparently in interrupt processing. The CPU usage stays the same whether the box is passing traffic or not.
...
Code:
Interrupts
191k total
188k bge0 re1
2110 hpet0+ 20
     uhci0 ehci
     uhci3 23
 354 re0 256
Again, note that with no network traffic, that high rate of interrupts continues.

Any ideas?
Both of these devices are sharing IRQ 16. In a case like that, FreeBSD has to cycle through all of the drivers registered on that interrupt, in the hope that one of them will handle it. If all of the drivers go "nope, not mine", then the interrupt may not be dismissed and may trigger again.

You'll want to get those 2 devices onto their own unshared interrupts. Your system's BIOS may have an advanced configuration option called "interrupt swizzling". There may be other BIOS options you can try, like "Modern ordering". There isn't a standard name for this option, unfortunately.

If that doesn't do it, try physically relocating one of the cards to a different slot. I assume one of these devices is on an expansion card and the other one is on the motherboard?

Once they are each on their own interrupt, you can see if the interrupt rate drops to a reasonable level. If not, you may have a broken piece of hardware.
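
A minimal sketch for checking which devices share an interrupt line, assuming the boot messages use the "name: <description> ... irq N at device ..." layout seen in this thread. The function name irq_groups is made up for illustration; it only reads the file it is given and changes nothing.

```shell
# irq_groups FILE
# Print every IRQ that more than one device attached to, based on
# dmesg-style attach lines ("re0: <...> irq 16 at device 0.0 on pci1").
# Hypothetical helper: it parses text only.
irq_groups() {
    awk '
        / irq [0-9]+ / {
            dev = $1
            sub(/:$/, "", dev)              # "re0:" -> "re0"
            for (i = 1; i < NF; i++)
                if ($i == "irq") irq = $(i + 1)
            if (irq in devs)
                devs[irq] = devs[irq] " " dev
            else
                devs[irq] = dev
            count[irq]++
        }
        END {
            # Only report lines with more than one device on them.
            for (irq in devs)
                if (count[irq] > 1)
                    printf "irq %s: %s\n", irq, devs[irq]
        }
    ' "$1"
}
```

On the box itself this would be run as irq_groups /var/log/dmesg.boot. The per-source totals and rates from vmstat -i are then the place to watch for the rate dropping once a device has moved to its own line.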
 
It's not just those two devices, it's basically everything. :)

I'm old enough to remember manually setting IRQs on ISA cards and the like, but things have changed so much that I don't even have a sense if this is bizarre or not:

Code:
grep "irq 16" /var/log/dmesg.boot
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xdc00-0xdcff mem 0xfe5ff000-0xfe5fffff,0xd0000000-0xd000ffff irq 16 at device 0.0 on pci1
vgapci0: <VGA-compatible display> port 0xecd8-0xecdf mem 0xfe800000-0xfebfffff,0xc0000000-0xcfffffff irq 16 at device 2.0 on pci0
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
bge0: <Broadcom BCM57780 A1, ASIC rev. 0x57780001> mem 0xfe4f0000-0xfe4fffff irq 16 at device 0.0 on pci2
re1: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet> port 0xcc00-0xccff mem 0xfe2ff000-0xfe2ff0ff irq 16 at device 0.0 on pci3
atapci0: <Intel ICH7 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf irq 16 at device 31.1 on pci0

My gut tells me this is probably solved by some /boot/loader.conf tunable dealing with MSI-X or something like that.
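
Before reaching for tunables, a rough way to see which drivers actually attached with MSI or MSI-X is to pull the "Using N MSI-X message" lines out of the boot messages. The sketch below does that; msi_users is a hypothetical helper name, and the line format is assumed from the re0 output quoted earlier in this thread (other drivers may word the message differently and would be missed).

```shell
# msi_users FILE
# List devices whose attach messages report MSI or MSI-X, e.g.
# "re0: Using 1 MSI-X message" -> "re0 MSI-X 1".
# Hypothetical helper: it only reads the given file.
msi_users() {
    awk '
        $2 == "Using" && $4 ~ /^MSI(-X)?$/ {
            dev = $1
            sub(/:$/, "", dev)              # "re0:" -> "re0"
            printf "%s %s %s\n", dev, $4, $3
        }
    ' "$1"
}
```

Run as msi_users /var/log/dmesg.boot. On FreeBSD, sysctl hw.pci.enable_msi hw.pci.enable_msix reports whether MSI and MSI-X are allowed system-wide; a device that never shows up in the list above is presumably still on a legacy, possibly shared, interrupt line.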

Oh, bge0 is onboard (this is an older Dell SFF box), one Realtek is a PCI card, the other is a PCI-e card. So I can remove each of those, but not move them around since the board only has these two slots.

I do see now there's a newer BIOS out, so I need to try that.
 
One thing you might try is disabling some unused hardware in BIOS - things like PS/2 mouse controller, floppy drive, even USB, etc. Sometimes that can help the system more effectively assign IRQs since there is less hardware it thinks it needs to manage.
 
I wouldn't be surprised if the high interrupts are caused by the Realtek cards. Those cards are cheap and have pretty shitty performance. If at all possible use Intel network cards.
 
Argh! I forgot to follow-up here. Personally, I don't feel the hate for Realtek - they perform well enough for home use under Windows and Linux, and I've not really had too many issues myself under FreeBSD. I sometimes wonder if the problem there is that every no-name vendor uses them and we simply can't keep track of all the quirks.

That said, the problem is gone after a BIOS update. The box has been rebooted a few times and the problem has not reappeared, so I'm going to say the update was the fix. Performance has been fine - the onboard Broadcom chip is my LAN side connection and one of the Realteks (PCI) is on a very slow DSL line, the other (PCI-e) is on a 100/100 FiOS line. I have no problems saturating that in both directions and the CPU usage barely registers, so I'm quite happy.

Oh also for the record, the two cards in case anyone is wondering:

http://smile.amazon.com/Protronix-Gigabit-Ethernet-Profile-Controller/dp/B008UG5588 (PCI)
http://smile.amazon.com/Protronix-Gigabit-Ethernet-Profile-Controller/dp/B008FAELF2 (PCI-e)

And the "refurb(?)" Dell (someone assumed refurb meant bios would be up to date, oops): http://smile.amazon.com/Dell-OptiPlex-380-Microsoft-Professional/dp/B00OQT5J2A

All in all a nice sub-$100 home firewall that has the horsepower to run a bunch of extra pfsense stuff (like IDS).
 
Personally, I don't feel the hate for Realtek - they perform well enough for home use under Windows and Linux, and I've not really had too many issues myself under FreeBSD. I sometimes wonder if the problem there is that every no-name vendor uses them and we simply can't keep track of all the quirks.
Early on, they released some absolutely awful chips which gave them a very bad reputation. They managed to work around some of the faults in their (binary-only) drivers for Windows, but their public programming documentation was never updated to show the workarounds. Presumably they didn't want to admit to problems. One particular open-source developer apparently got fed up with this and flamed Realtek pretty badly in the comments in his driver.

I haven't had any problems with modern Realtek chips, but I only use them when they're built-in on motherboards. For add-in cards I use Intel or Broadcom.

This sort of thing was by no means specific to Realtek. The first 2 revisions of DEC's DE500 10/100 card didn't autonegotiate properly - the first revision didn't negotiate at all, the second tried to negotiate but did it wrong. I think they got it right the third time.

Some vendors who were shipping one brand of controller chip on their card would sometimes change to a completely different chip, without changing the model number of their card or updating the packaging. Changing to a Realtek chip was a rather common cost-saving measure. That's also the cause of some of the dissatisfaction.
 
Ha, from ~2003, src/sys/pci/if_rl.c used to say this:

Code:
/*
 * The RealTek 8139 PCI NIC redefines the meaning of 'low end.' This is
 * probably the worst PCI ethernet controller ever made, with the possible
 * exception of the FEAST chip made by SMC. The 8139 supports bus-master
 * DMA, but it has a terrible interface that nullifies any performance
 * gains that bus-master DMA usually offers.
 *
 * ...
 *
 * It's impossible given this rotten design to really achieve decent
 * performance at 100Mbps, unless you happen to have a 400Mhz PII or
 * some equally overmuscled CPU to drive it.
 */
 
It is speculated that the very comment quoted (or similar comments from Linux developers) is behind Realtek's unwillingness to provide any real programming info on their chips. In other words, it soured all relations between them and open source device driver developers.
 
I had some Realtek-based NICs in the past. Some just stopped working and needed a complete loss of AC power to start working again; just powering off or rebooting didn't fix it. My old (2009) home desktop with an onboard Realtek chip still runs into that trouble from time to time.
 
If everybody stopped retelling their old (very old) war stories about RealTek NICs and got on with today's life, every day would be so much better.
RealTek NICs of today work. Sometimes someone comes across one that fails or isn't supported yet; that happens for other vendors too.
Now, could we end this thread, please?
 
Realtek cards are not OK for a router. Enable netmap (IPS on OPNsense) and then download a few popular Linux distros via torrent.
Your box will lock up soon enough if you're using the default MTU. The card stops processing packets.
A workaround seems to be using a higher MTU, but that's just black magic. The card seems to reset the queue when overloaded and things keep working.

Also, you'll reach maybe 70% of your Gigabit connection vs 90%+ with an Intel card.

So, it kind of works, but it's not great, especially if you have a decent WAN and a few diverse users in your LAN.
 
Got any real information? Like links to specific cards and matching writeups with detailed description of setup, failure and measurements?
Or is this just another FUD comment?
 
No FUD here. Everything is available through your favourite search engine.

"NIC supported by netmap are mandatory on the server used as packet generator/receiver: Chelsio (the best one!), Intel (em, ixgbe). RealTek (re) NIC are supported but avoid them at all cost!"

http://bsdrp.net/documentation/examples/setting_up_a_forwarding_performance_benchmark_lab

"Netmap usage improve the receiving packet rate to about 580Kpps only: It's strange that it didn't reach the maximum Ethernet frame rate (1.48Mpps) with netmap.[...]Still not able to reach the maximum Ethernet throughput with netmap !?!? Realtek chipset limitation ?"

http://bsdrp.net/documentation/exam...s_apu?s[]=netmap#netmap_s_pkt-gen_performance
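
For context on the 1.48 Mpps figure in the quote: it is the gigabit line rate for minimum-size frames, where each 64-byte frame also occupies 20 bytes of preamble, start delimiter and inter-frame gap on the wire. A minimal sketch of that arithmetic follows; line_rate_pps is a name made up for this example.

```shell
# line_rate_pps BITS_PER_SECOND FRAME_BYTES
# Theoretical maximum Ethernet frames per second for a given link speed
# and frame size (frame size includes the FCS). The extra 20 bytes are
# preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12).
# Hypothetical helper, integer arithmetic only.
line_rate_pps() {
    echo $(( $1 / (($2 + 20) * 8) ))
}

line_rate_pps 1000000000 64   # prints 1488095
```

Against that ceiling, the 580 Kpps reported in the quote works out to roughly 39% of line rate for minimum-size frames.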

On one of the G revisions, netmap works fine in software mode (while using a lot of CPU), but the card collapses under load and nobody seems to be interested in fixing the open source drivers or patching the non-free ones (designed for FreeBSD 8).
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206932
It could also be the PC itself since those cards rely more on the host than Intel cards, but that doesn't change the general message.

There is almost no development done on Realtek cards in FreeBSD which is perfectly understandable given how unreliable documentation seems to be and how differently different hardware revisions behave.

So, I'll re-iterate, don't put Realtek cards in a router.
 
A test of just one embedded platform is a very small sample to support a statement as broad as "don't put Realtek cards in a router", IMHO.
As you write yourself "it could also be the PC itself" - how do we know that the APU1 isn't a very slow pony, caused by something else?
More tests of different RealTek chips on different motherboards, or (better!) tests of both Intel and Realtek cards on the same motherboard (and other hardware) would be better to document if this is a real problem.

Do you have any tests with Intel network cards / chips?
 
It's not just my test. Another FreeBSD project, FreeNAS, recommends Intel cards as well and the reasons they give are the same.

If you want more tests, just search online; there are plenty of reports of Realtek cards not being able to handle high throughput or failing for various reasons. To be fair, if I don't use netmap with the patched drivers, the card in my router doesn't give up, so things have improved, but I do think netmap is important to have in a modern router+firewall.

Both the netmap and bsdrp projects have used Intel cards in their tests with no problem. They only mention issues with the Realtek cards.
 