Solved Many TCP connections -> FreeBSD networking unreachable

Hi,

I'm running FreeBSD 12 as a load balancer (HAProxy). When my application slows down and TCP connections cannot be processed fast enough, the server becomes completely unreachable: TCP, ICMP, CARP, everything IP-related is 100% down.

At first I discovered the usual messages:

Code:
kernel: sonewconn: pcb 0xfffff80072ff11e8: Listen queue overflow: 193 already in queue awaiting acceptance (3980 occurrences)

This was fixed by increasing kern.ipc.soacceptqueue to 4096; these messages are now gone.
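(For reference, this is roughly how I applied it; the second entry just makes it persistent across reboots:)

Code:
# apply at runtime
sysctl kern.ipc.soacceptqueue=4096
# same value in /etc/sysctl.conf so it survives a reboot
kern.ipc.soacceptqueue=4096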

Unfortunately the server still becomes 100% unreachable when several thousand connections hit the server (but only in situations where my application is too slow to process them).

I'd like to avoid this: the server should remain reachable in this situation, especially the other services that do not receive thousands of connections. I guess I have to increase some tuning parameters, but I can't figure out which ones. There are no kernel messages and nothing in the logs.

Any idea how to debug/analyze this?

FWIW, the specs: Xeon E5 CPU and Intel X540 NIC (ix).


Regards
- Frank
 
Are you using PF? Maybe the limit of states is reached.

Yes, PF is in use, but the state limit is not reached (usually only 1% usage).

Try to disable any HW offloading done by the NIC?

Oh yes, good idea! I was under the impression that I'd already disabled everything, but I had missed some features:

Code:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8538b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO>
    media: Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)

So I disabled all remaining HW offloading features and flow control:

Code:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=803828<VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
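(For the record, I cleared the remaining flags with something along these lines; exact flag names depend on the driver version, and flow control on ix(4) is a separate sysctl:)

Code:
# clear the remaining VLAN offload features (interface name as above)
ifconfig ix0 -vlanhwtag -vlanhwcsum -vlanhwtso -vlanhwfilter
# disable flow control on the port (0 = none)
sysctl dev.ix.0.fc=0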

Let's see if this improves the situation. Thanks for the replies/suggestions!


Regards
- Frank
 
Unfortunately, even after disabling all HW offloading features, the server is still unreachable when hit by many TCP connections.

A ping during this situation nicely demonstrates this:

Code:
icmp_seq=62 ttl=59 time=1.37 ms
icmp_seq=63 ttl=59 time=1.20 ms
icmp_seq=64 ttl=59 time=1.23 ms
icmp_seq=65 ttl=59 time=1.34 ms
icmp_seq=66 ttl=59 time=106 ms
icmp_seq=67 ttl=59 time=2602 ms
icmp_seq=68 ttl=59 time=2372 ms
icmp_seq=69 ttl=59 time=2221 ms
icmp_seq=70 ttl=59 time=5413 ms
icmp_seq=71 ttl=59 time=11797 ms
icmp_seq=74 ttl=59 time=8816 ms
icmp_seq=83 ttl=59 time=1.32 ms


The response times skyrocket, and then after seq 74 the server is completely unreachable for 9 seconds. (This was just a test, so the downtime was pretty short this time.)

Is there something I can do to prevent this from happening? A sysctl to tune? Or should I replace the NIC?


Regards
- Frank
 
Is there anything in dmesg or /var/log/messages that gives any additional insight? I know, it's usually the most painful conclusion, but I once had a similar situation of "unreachable under load" and that was simply caused by a faulty NIC.
 
Might be a dumb question, but is it definitely the NIC? Nothing in front (firewall, proxy, etc.) that might be causing issues?

I was considering this too, so I double-checked, and it's definitely this FreeBSD server that is to blame. Everything else in the network is fully operational; only this server goes down.

Is there anything in dmesg or /var/log/messages that gives any additional insight? I know, it's usually the most painful conclusion, but I once had a similar situation of "unreachable under load" and that was simply caused by a faulty NIC.

Unfortunately no log messages, no kernel errors, etc. (Except the "listen queue overflow", but that was fixed as mentioned earlier.)

I've already performed a failover to another server (with the exact same hardware, software and configuration). Unfortunately FreeBSD behaves exactly the same on the other server. So I'd say it's not a hardware defect (but it could still be a driver malfunction). I've also upgraded the firmware of this (Dell) server to the latest releases, but this did not change anything.

Also, is the whole network stack down, or just that NIC?
Can you test on another iface (even lo0)?

It's a dual-port NIC, which is currently part of an active/passive LAGG. But I've already tried LACP with both NIC ports active, and when many TCP connections hit the server, the LACP links of both NIC ports become unavailable. The switch reports that both links are detached from the LACP port group in this situation:

Code:
DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 11058 %% Interface Te2/0/16 detached from ch2.
DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 11059 %% Interface Te1/0/16 detached from ch2.
DOT3AD[dot3ad_core_lac]: dot3ad_lac.c(252) 11062 %% ch2 is down.


I guess this simply confirms that no communication is possible when this happens, not even LACP.

LACP made this somewhat worse, so I've switched back to the active/passive LAGG.

However, I will test whether or not traffic is able to pass lo0. That's an interesting question.
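(A quick sketch of what I plan to run on the local console during the next stall, just to see whether loopback is affected too:)

Code:
# run locally on the server while the outage is happening
ping -c 10 127.0.0.1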

This issue is so weird. :-/


Regards
- Frank
 
I'd consider double checking your switch configuration; if you're using LAGG or LACP, you'll have to make sure the settings on those ports are really aligned with what you're setting on the server end. Does your switch support logging to a syslog server? You might want to check for errors on that end as well.
 
I'd consider double checking your switch configuration; if you're using LAGG or LACP, you'll have to make sure the settings on those ports are really aligned with what you're setting on the server end. Does your switch support logging to a syslog server? You might want to check for errors on that end as well.

Yes, that was the first thing I double-checked. I took great measures to ensure that it's not the switch that is causing this, and I'm pretty confident the switch is not to blame.
LAGG active/passive is working perfectly fine. LACP would in theory also work fine; it just does not play well when IP communication stalls 100% (as seen from the previously posted switch logs). Recovery time seems to be a bit better without LACP (not by much, however).
Actually I'm using LAGG+LACP on a different FreeBSD server with great success (but with only a few TCP connections).

The switch logs are silent when the outage occurs (the switch only got notice of it when LACP was enabled).
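(For reference, the failover lagg is set up in rc.conf roughly like this; interface names and the address are simplified:)

Code:
# /etc/rc.conf (simplified, example address)
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 inet 192.0.2.10/24"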
 
Alright, thanks for clarifying. This may be a total shot in the dark, but is it possible that this applies to your problem?

According to the other OP, putting hw.ix.enable_msix=0 into /boot/loader.conf might resolve this?
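If you want to give it a try, I'd check the current value first; the tunable goes into /boot/loader.conf and only takes effect after a reboot. Roughly:

Code:
# current value (1 = MSI-X enabled, the default)
sysctl hw.ix.enable_msix
# /boot/loader.conf entry suggested in that thread
hw.ix.enable_msix=0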
 
According to the other OP, putting hw.ix.enable_msix=0 into /boot/loader.conf might resolve this?

OK, I tested this on the spare server and changed hw.ix.enable_msix from 1 to 0. After a reboot, the server immediately had networking issues, even without any network load. I reverted the change and networking was again working as expected. I couldn't find much information about the recommended setting for the X540 on Dell, so I'll assume that disabling MSI-X is just a no-go in this setup.
 
Glad you only tried this on the spare server.
I'm afraid there may be a driver issue with your cards.

There also appears to be an Intel NIC driver in ports, net/intel-ix-kmod, that you could try out, though I don't know whether it is any better. I'd recommend cross-checking versions to see whether it's a real option for you or not.

Apparently, those NICs have some history of freezing under load. Then again, those postings might just be flukes:

Either way, a similar issue on FreeNAS suggested turning off TSO and VLANHWTSO (options -tso, -vlanhwtso for ifconfig):

You may want to give that a try on your spare server. I have to admit, I'm out of my depth here, since I don't have that exact same NIC to test before giving any recommendation.
 
There also appears to be an Intel NIC driver in ports, net/intel-ix-kmod, that you could try out, though I don't know whether it is any better. I'd recommend cross-checking versions to see whether it's a real option for you or not.

It looks like I'm already using a newer version of the driver:

Code:
$ sysctl -a | grep driver_version
dev.ix.1.iflib.driver_version: 4.0.1-k
dev.ix.0.iflib.driver_version: 4.0.1-k


Either way, a similar issue on FreeNAS suggested turning off TSO and VLANHWTSO (options -tso, -vlanhwtso for ifconfig):

All HW offloading features are already turned off.

I have to admit, I'm out of my depth here, since I don't have that exact same NIC to test before giving any recommendation.

Thanks for your suggestions!


Regards
- Frank
 
What SirDice said. Fix your main problem rather than waste time on this secondary problem. If anything, I would protect the server by using a *smaller* accept queue length, so that clients get an early indication of the server falling behind instead of waiting a very long time in the queue, and so that the server is not overloaded to the point where it can't respond to other events.
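A rough sketch of that direction (the value is picked arbitrarily; HAProxy's own backlog keyword could be lowered to match):

Code:
# shrink the kernel's accept queue cap again so clients fail fast
sysctl kern.ipc.soacceptqueue=512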
 
...so that the server is not overloaded to the point where it can't respond to other events.

The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle. Mind you, this server is just a load balancer; the application is running on a different server.
My expectation is that any reasonably sized server should be able to handle several thousand incoming connections without a full IP outage. I think that's not too much to ask for.

Regards
- Frank
 
The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle.
I said “overloaded” because of the following:
when my application slows down and TCP connections cannot be processed fast enough, the server becomes completely unreachable
In general the accept queue should almost always be empty because the load balancer would pass on the incoming connection to a web server (or whatever) as soon as possible. That the queue overflows indicates this is not happening. Increasing the queue size won't fix the underlying problem. It is much more likely that the problem is in some user code or configuration than in the kernel, the network driver, or the offload logic on the NIC. At least that is the sequence in which you should debug.

As for debugging, one idea is to look at how the load balancer passes on connections. Maybe use tcpdump or ktrace. I'm guessing that is where the problem lies.
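For example (interface, ports and process name assumed, adjust to your setup):

Code:
# watch the client side and the connections haproxy opens towards the backends
tcpdump -ni lagg0 'tcp port 80 or tcp port 443'
# attach to the running haproxy process during a stall, then inspect the dump
ktrace -p $(pgrep -n haproxy)
kdump -f ktrace.out | less
ktrace -C   # stop tracing when done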
 
How is the lagg compiled: within the kernel (device lagg) or as a module (in loader: if_lagg_load="YES")?
(Just asking; I don't believe this matters, but you can of course experiment by compiling lagg differently if you like.)
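(A quick way to check, for example:)

Code:
# prints if_lagg.ko if lagg is loaded as a module; no output means it is compiled in
kldstat | grep if_lagg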

Regarding:
Code:
kernel: sonewconn: pcb 0xfffff80072ff11e8: Listen queue overflow: 193 already in queue awaiting acceptance (3980 occurrences)
What are your settings for the following sysctls?
sysctl kern.ipc.somaxconn
sysctl net.isr.maxqlimit
sysctl net.route.netisr_maxqlen
Maybe in your case those limits could be increased? (From the code I would suspect that maxqlimit in particular might be related to the above log.)
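If you decide to raise them, this is roughly where the setting would go; as far as I know net.isr.maxqlimit is a boot-time tunable, so it belongs in /boot/loader.conf and needs a reboot (the value below just doubles the default as an example):

Code:
# /boot/loader.conf
net.isr.maxqlimit=20480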

Regarding:
The FreeBSD server is at no point "overloaded" in the sense that CPU or RAM is a bottleneck; it's in fact ~70% idle.
If resources aren't an issue, then maybe you could enable debug or tracing:
sysctl net.link.lagg.lacp.debug=1

 
OK, so I've replaced the Intel X540 NIC with a Chelsio T520-BT NIC, just to be sure that it's not a driver issue. And yes, it's not a driver issue. The outage also occurs with the Chelsio NIC.

I've already mentioned that this issue also affects LACP (which was only enabled for testing purposes). Someone pointed out that LACP is on layer 2. This led to the assumption that it's not a TCP/IP issue after all (and that all my TCP/IP tuning should be considered useless).

In another test I noticed that the system does not respond to key presses while the TCP/IP traffic freeze is happening (but the system recovers after several seconds and then processes the key presses). So this confirms that it's not just a TCP/IP issue, I guess?

Any ideas?
 
When I was playing with HAProxy I wondered whether, under heavy load, the FreeBSD kern.ipc.somaxconn value (default 128) should match the maxconn parameter in the HAProxy configuration.
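Something like this, as a sketch only (the frontend name is made up; on FreeBSD 12 kern.ipc.soacceptqueue is the current name, with somaxconn kept as an alias; if I remember correctly HAProxy uses the frontend maxconn as the listen backlog unless backlog is set explicitly):

Code:
# kernel listen-queue cap
sysctl kern.ipc.soacceptqueue=4096
# haproxy.cfg: keep the frontend limits in the same ballpark
frontend www
    bind :80
    maxconn 4096
    backlog 4096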
 