Solved Many TCP connections -> FreeBSD networking unreachable

Hi,

I'm running FreeBSD 12 as a load balancer (HAProxy). When my application slows down and TCP connections cannot be processed fast enough, the server becomes completely unreachable: TCP, ICMP, CARP, everything IP-related is 100% down.

At first I discovered the usual messages:

Code:
kernel: sonewconn: pcb 0xfffff80072ff11e8: Listen queue overflow: 193 already in queue awaiting acceptance (3980 occurrences)

This was fixed by increasing kern.ipc.soacceptqueue to 4096; these messages are now gone.
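(For reference, this is roughly how I applied it; the second entry just makes it persistent across reboots:)

Code:
# apply at runtime
sysctl kern.ipc.soacceptqueue=4096
# same value in /etc/sysctl.conf so it survives a reboot
kern.ipc.soacceptqueue=4096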

Unfortunately the server still becomes 100% unreachable when several thousand connections hit the server (but only in situations where my application is too slow to process them).

I'd like to avoid this: the server should remain reachable in this situation, especially the other services that do not receive thousands of connections. I guess I have to increase some tuning parameters, but I can't figure out which ones. There are no kernel messages and nothing in the logs.

Any idea how to debug/analyze this?

FWIW, the specs: Xeon E5 CPU and Intel X540 NIC (ix).


Regards
- Frank
 
Are you using PF? Maybe the limit of states is reached.

Yes, PF is in use, but the state limit is not reached (usually only 1% usage).

Try to disable any HW offloading done by the NIC?

Oh yes, good idea! I was under the impression that I'd already disabled everything, but I had missed some features:

Code:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8538b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO>
    media: Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)

So I disabled all remaining HW offloading features and flow control:

Code:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=803828<VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
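(For the record, I cleared the remaining flags with something along these lines; exact flag names depend on the driver version, and flow control on ix(4) is a separate sysctl:)

Code:
# clear the remaining VLAN offload features (interface name as above)
ifconfig ix0 -vlanhwtag -vlanhwcsum -vlanhwtso -vlanhwfilter
# disable flow control on the port (0 = none)
sysctl dev.ix.0.fc=0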

Let's see if this improves the situation. Thanks for the replies/suggestions!


Regards
- Frank
 
Unfortunately, even after disabling all HW offloading features, the server is still unreachable when hit by many TCP connections.

A ping during this situation nicely demonstrates this:

Code:
icmp_seq=62 ttl=59 time=1.37 ms
icmp_seq=63 ttl=59 time=1.20 ms
icmp_seq=64 ttl=59 time=1.23 ms
icmp_seq=65 ttl=59 time=1.34 ms
icmp_seq=66 ttl=59 time=106 ms
icmp_seq=67 ttl=59 time=2602 ms
icmp_seq=68 ttl=59 time=2372 ms
icmp_seq=69 ttl=59 time=2221 ms
icmp_seq=70 ttl=59 time=5413 ms
icmp_seq=71 ttl=59 time=11797 ms
icmp_seq=74 ttl=59 time=8816 ms
icmp_seq=83 ttl=59 time=1.32 ms


The response times skyrocket, and then after seq 74 the server is completely unreachable for 9 seconds. (This was just a test, so the downtime was pretty short this time.)

Is there something I can do to prevent this from happening? A sysctl to tune? Or should I replace the NIC?


Regards
- Frank
 
Is there anything in dmesg or /var/log/messages that gives any additional insight? I know, it's usually the most painful conclusion, but I once had a similar situation of "unreachable under load" and that was simply caused by a faulty NIC.
 
Might be a dumb question, but is it definitely the NIC? Nothing in front (firewall, proxy, etc.) that might be causing issues?

I was considering this too, so I double-checked, and it's definitely this FreeBSD server that is to blame. Everything else in the network is fully operational; only this server goes down.

Is there anything in dmesg or /var/log/messages that gives any additional insight? I know, it's usually the most painful conclusion, but I once had a similar situation of "unreachable under load" and that was simply caused by a faulty NIC.

Unfortunately no log messages, no kernel errors, etc. (Except the "listen queue overflow", but that was fixed as mentioned earlier.)

I've already performed a failover to another server (with the exact same hardware, software and configuration). Unfortunately FreeBSD behaves exactly the same on the other server. So I'd say it's not a hardware defect (but it could still be a driver malfunction). I've also upgraded the firmware of this (Dell) server to the latest releases, but this did not change anything.

Also, is the whole network stack down, or just that NIC?
Can you test on another iface (even lo0)?

It's a dual-port NIC, which is currently part of an active/passive LAGG. But I've already tried LACP with both NIC ports active, and when many TCP connections hit the server, the LACP links of both NIC ports become unavailable. The switch reports that both links are detached from the LACP port group in this situation:

Code:
DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 11058 %% Interface Te2/0/16 detached from ch2.
DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 11059 %% Interface Te1/0/16 detached from ch2.
DOT3AD[dot3ad_core_lac]: dot3ad_lac.c(252) 11062 %% ch2 is down.


I guess this simply confirms that no communication is possible when this happens, not even LACP.

LACP made this somewhat worse, so I've switched back to the active/passive LAGG.

However, I will test whether or not traffic is able to pass lo0. That's an interesting question.
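(A quick sketch of what I plan to run on the local console during the next stall, just to see whether loopback is affected too:)

Code:
# run locally on the server while the outage is happening
ping -c 10 127.0.0.1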

This issue is so weird. :-/


Regards
- Frank
 
I'd consider double checking your switch configuration; if you're using LAGG or LACP, you'll have to make sure the settings on those ports are really aligned with what you're setting on the server end. Does your switch support logging to a syslog server? You might want to check for errors on that end as well.
 
I'd consider double checking your switch configuration; if you're using LAGG or LACP, you'll have to make sure the settings on those ports are really aligned with what you're setting on the server end. Does your switch support logging to a syslog server? You might want to check for errors on that end as well.

Yes, that was the first thing I double-checked. I took great measures to ensure that it's not the switch that is causing this, and I'm pretty confident the switch is not to blame.
LAGG active/passive is working perfectly fine. LACP would in theory also work fine; it just does not play well when IP communication stalls 100% (as seen from the previously posted switch logs). Recovery time seems to be a bit better without LACP (not by much, however).
Actually I'm using LAGG+LACP on a different FreeBSD server with great success (but with only a few TCP connections).

The switch logs are silent when the outage occurs (the switch only got notice of it when LACP was enabled).
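(For reference, the failover lagg is set up in rc.conf roughly like this; interface names and the address are simplified:)

Code:
# /etc/rc.conf (simplified, example address)
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 inet 192.0.2.10/24"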
 
Alright, thanks for clarifying. This may be a total shot in the dark, but is it possible that this applies to your problem?

According to the other OP, putting hw.ix.enable_msix=0 into /boot/loader.conf might resolve this?
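If you want to give it a try, I'd check the current value first; the tunable goes into /boot/loader.conf and only takes effect after a reboot. Roughly:

Code:
# current value (1 = MSI-X enabled, the default)
sysctl hw.ix.enable_msix
# /boot/loader.conf entry suggested in that thread
hw.ix.enable_msix=0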
 
According to the other OP, putting hw.ix.enable_msix=0 into /boot/loader.conf might resolve this?

OK, I tested this on the spare server and changed hw.ix.enable_msix from 1 to 0. After a reboot, the server immediately had networking issues, even without any network load. I reverted the change and networking was again working as expected. I couldn't find much information about the recommended setting for the X540 on Dell, so I'll assume that disabling MSI-X is just a no-go in this setup.
 
Glad you only tried this on the spare server.
I'm afraid there may be a driver issue with your cards.

There also appears to be an Intel NIC driver in ports, net/intel-ix-kmod, that you could try out, though I don't know whether it is any better. I'd recommend cross-checking versions to see whether it's a real option for you or not.

Apparently, those NICs have some history of freezing under load. Then again, those postings might just be flukes:

Either way, a similar issue on FreeNAS suggested turning off TSO and VLANHWTSO (options -tso, -vlanhwtso for ifconfig):

You may want to give that a try on your spare server. I have to admit, I'm out of my depth here, since I don't have that exact same NIC to test before giving any recommendation.
 
There also appears to be an Intel NIC driver in ports, net/intel-ix-kmod, that you could try out, though I don't know whether it is any better. I'd recommend cross-checking versions to see whether it's a real option for you or not.

It looks like I'm already using a newer version of the driver:

Code:
$ sysctl -a | grep driver_version
dev.ix.1.iflib.driver_version: 4.0.1-k
dev.ix.0.iflib.driver_version: 4.0.1-k


Either way, a similar issue on FreeNAS suggested turning off TSO and VLANHWTSO (options -tso, -vlanhwtso for ifconfig):

All HW offloading features are already turned off.

I have to admit, I'm out of my depth here, since I don't have that exact same NIC to test before giving any recommendation.

Thanks for your suggestions!


Regards
- Frank
 
What SirDice said. Fix your main problem rather than waste time on this secondary problem. If anything, I would protect the server by using a *smaller* accept queue length, so that clients get an early indication of the server falling behind instead of waiting a very long time in the queue, and so that the server is not overloaded to the point where it can't respond to other events.
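A rough sketch of that direction (the value is picked arbitrarily; HAProxy's own backlog keyword could be lowered to match):

Code:
# shrink the kernel's accept queue cap again so clients fail fast
sysctl kern.ipc.soacceptqueue=512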
 
...so that the server is not overloaded to the point where it can't respond to other events.

The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle. Mind you, this server is just a load balancer; the application is running on a different server.
My expectation is that any reasonably sized server should be able to handle several thousand incoming connections without a full IP outage. I think that's not too much to ask for.

Regards
- Frank
 
The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle.
I said “overloaded” because of the following:
when my application slows down and TCP connections cannot be processed fast enough, the server becomes completely unreachable
In general the accept queue should almost always be empty because the load balancer would pass on the incoming connection to a web server (or whatever) as soon as possible. That the queue overflows indicates this is not happening. Increasing the queue size won't fix the underlying problem. It is much more likely that the problem is in some user code or configuration than in the kernel, the network driver, or the offload logic on the NIC. At least that is the sequence in which you should debug.

As for debugging, one idea is to look at how the load balancer passes on connections. Maybe use tcpdump or ktrace. I'm guessing that is where the problem lies.
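For example (interface, ports and process name assumed, adjust to your setup):

Code:
# watch the client side and the connections haproxy opens towards the backends
tcpdump -ni lagg0 'tcp port 80 or tcp port 443'
# attach to the running haproxy process during a stall, then inspect the dump
ktrace -p $(pgrep -n haproxy)
kdump -f ktrace.out | less
ktrace -C   # stop tracing when done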
 
How is the lagg compiled: within the kernel (device lagg) or as a module (in loader: if_lagg_load="YES")?
(Just asking; I don't believe this matters, but you can of course experiment by compiling lagg differently if you like.)
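(A quick way to check, for example:)

Code:
# prints if_lagg.ko if lagg is loaded as a module; no output means it is compiled in
kldstat | grep if_lagg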

Regarding:
Code:
kernel: sonewconn: pcb 0xfffff80072ff11e8: Listen queue overflow: 193 already in queue awaiting acceptance (3980 occurrences)
What are your settings for the following sysctls?
sysctl kern.ipc.somaxconn
sysctl net.isr.maxqlimit
sysctl net.route.netisr_maxqlen
Maybe in your case those limits could be increased? (From the code I would suspect that maxqlimit in particular might be related to the above log.)
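If you decide to raise them, this is roughly where the setting would go; as far as I know net.isr.maxqlimit is a boot-time tunable, so it belongs in /boot/loader.conf and needs a reboot (the value below just doubles the default as an example):

Code:
# /boot/loader.conf
net.isr.maxqlimit=20480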

Regarding:
The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle.
If resources aren't an issue, then maybe you could enable debug or tracing:
sysctl net.link.lagg.lacp.debug=1

 
OK, so I've replaced the Intel X540 NIC with a Chelsio T520-BT NIC, just to be sure that it's not a driver issue. And yes, it's not a driver issue. The outage also occurs with the Chelsio NIC.

I've already mentioned that this issue also affects LACP (which was only enabled for testing purposes). Someone pointed out that LACP is on layer 2. This led to the assumption that it's not a TCP/IP issue after all (and that all my TCP/IP tuning should be considered useless).

In another test I noticed that the system does not respond to key presses while the TCP/IP traffic freeze is happening (but the system recovers after several seconds and then processes the key presses). So this confirms that it's not just a TCP/IP issue, I guess?

Any ideas?
 
When I was playing with HAProxy I wondered whether, under heavy load, the FreeBSD kern.ipc.somaxconn value (default 128) should match the maxconn parameter in the HAProxy configuration.
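Something like this, as a sketch only (the frontend name is made up; on FreeBSD 12 kern.ipc.soacceptqueue is the current name, with somaxconn kept as an alias; if I remember correctly HAProxy uses the frontend maxconn as the listen backlog unless backlog is set explicitly):

Code:
# kernel listen-queue cap
sysctl kern.ipc.soacceptqueue=4096
# haproxy.cfg: keep the frontend limits in the same ballpark
frontend www
    bind :80
    maxconn 4096
    backlog 4096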
 