sonewconn: pcb 0xff...: Listen queue overflow: ... already in queue awaiting acceptance

I have been running FreeBSD servers for many years with default settings without seeing this error message until November 2015.

My /etc/sysctl.conf was:
Code:
# Postgresql 9.2
kern.ipc.shmmax=1073741824
kern.ipc.shmall=262144

That’s all.

When I saw this message the first time, I Googled and added a new line:
Code:
kern.ipc.somaxconn=4096

However, a few days ago I saw the message again on another server that already had the increased value. Apache was not working, and I had no idea what to do.

The servers were running FreeBSD 10.1, Apache 2.4.17, Ruby on Rails with Passenger, Exim, Dovecot and PostgreSQL. I’m using pf. A few months ago I saw that pf complained about the number of established connections. I don’t know whether that could be related.

I’m not even sure it was caused by Apache. netstat(1) didn’t tell me much, but then I didn’t really know how to read it or what to look for. I was stressed and simply restarted Apache, which solved the issue.

Apart from the regular updates, there were few changes on the servers.

1. I switched to HTTPS two days before I saw this error for the first time. My sites are behind CloudFlare, and between CloudFlare and my sites the communication was plain HTTP. I generated self-signed keys with the following command:

Code:
openssl req -x509 -days 1825 -newkey rsa:2048 -sha256 -nodes -keyout something.key -out something.crt


And switched to HTTPS. This was new; however, I have been running Exim, PostgreSQL and Dovecot with self-signed keys for years.

I did not change anything else in Apache. I use more or less the default (example) SSL config, and Apache is running with the event MPM. I read that it might be a problem with SSL; I thought that had been solved.

2. CloudFlare recently started to support HTTP/2, and I turned it on. However, between CloudFlare and my servers I’m not using HTTP/2; it only works between the browsers and CF. I don’t know how they implemented it or how many connections they use. What I do know is that I have never seen too many Passenger processes. I usually see 7-15, depending on the server, including the watchdog, the core and the launcher process. I can’t remember seeing too many httpd processes either.

3. When I saw the error message the second time (on the second server), Passenger was stuck at 100% CPU usage. I don’t know whether Passenger caused this error message or whether this error made Passenger get stuck.

Since I had no better idea, I switched to FreeBSD 10.2, upgraded to Apache 2.4.18, and upgraded the Passenger gems.

I also prayed and opened this topic.

I appreciate any input and ideas: what should I look for, and what settings other than kern.ipc.somaxconn should I change?

Thank you.
 
A netstat -n -p tcp and netstat -s -p tcp could be helpful here when you next see the error. A pfctl -si would also add some context when this happens.
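For example, something along these lines when the error recurs (FreeBSD commands; the grep filter is just a suggestion, and the output is host-specific):

Code:
netstat -n -p tcp
netstat -s -p tcp | grep -i listen
netstat -Lan
pfctl -si
sysctl kern.ipc.somaxconn

netstat -Lan shows the per-socket listen queues (qlen/incqlen/maxqlen), which should tell you which daemon's queue is actually overflowing, and the "listen queue overflows" counter in netstat -s -p tcp shows how often it has happened since boot.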
 
Apache is, for whatever reason, not able to handle the number of incoming connections. Because Apache can't accept new connections fast enough, the system starts queuing them. This is where somaxconn comes in. Normally you won't notice this queuing, but if the application is slow enough even this queue will fill up, and that is what produces those "Listen queue overflow" messages.

In your case I'd take a look at your web applications; something is causing them to handle traffic more slowly. Slow enough for connections to start queuing.
 
Unless the hardware is old, it may not need to be upgraded. Sometimes you can simply increase somaxconn (the added queue length is hardly noticeable). And bigger hardware isn't always the correct solution; your system might be starved of I/O, which is what causes the slowness. In my opinion a better solution is to add another server and use something like net/haproxy to load-balance between them. The added benefit is that you can take a host off-line without interrupting your websites (so you can update the OS, for example). It's better to have 3 or 4 smaller machines than 1 big one.

A setup which I have built several times consists of 2 HAProxy hosts (with a CARP address between them) and 3 or more webservers as backends. This gives you a lot of flexibility and resilience. It's also capable of handling a large amount of incoming traffic.
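As a rough sketch, the HAProxy side of such a setup could look like this (all names and addresses here are invented; the shared CARP address itself is configured with ifconfig on the hosts, not in HAProxy):

Code:
frontend www
    bind 192.0.2.100:80          # the shared CARP address
    default_backend webservers

backend webservers
    balance roundrobin
    option httpchk GET /health   # hypothetical health-check URL
    server web1 192.0.2.11:80 check
    server web2 192.0.2.12:80 check
    server web3 192.0.2.13:80 check

The check keyword makes HAProxy poll each backend and stop sending traffic to any that fail.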
 
I agree. I will get a new machine with an SSD and upgrade another one, so I will have three.

CloudFlare does more or less the same as net/haproxy, plus a hundred other things (caching, security/protection, DNS). However, it does not support automatic fail-over, only manual fail-over or fail-over programmed via their API.

As far as I remember, net/haproxy supports auto fail-over. I will consider using it.

The database is still a single point of failure. I can do manual fail-over with PostgreSQL, as I use streaming replication. I doubt it is safe to do it with scripts; there are too many possible errors.
 
As far as I remember, net/haproxy supports auto fail-over. I will consider using it.
It does indeed. What we did was create a small PHP (or Ruby on Rails, or whatever) script that simply returns a "200 OK" or a "503 Unavailable". Because it's a dynamic script, it also tests whether PHP (or Ruby, or whatever) works. We also have a small local script that can force the check to return "503" while the server is still running. This is useful to temporarily remove a host from the pool without actually bringing it down.
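The health-check endpoint can be as small as this sketch (a shell CGI; the flag-file path and URL are made up, and the thread's version was PHP or Rails rather than shell):

```shell
#!/bin/sh
# Minimal CGI health check, a sketch. HAProxy polls this URL via
# "option httpchk"; returning 503 pulls the host out of the pool
# without stopping the web server. Touching the flag file forces a 503.
FLAG="${HEALTHCHECK_FLAG:-/var/run/healthcheck.disable}"

health_status() {
    if [ -e "$FLAG" ]; then
        echo "503 Service Unavailable"
    else
        echo "200 OK"
    fi
}

# Emit a CGI response with the chosen status.
printf 'Status: %s\r\nContent-Type: text/plain\r\n\r\n' "$(health_status)"
```

Note that a shell CGI only proves the web server can execute CGIs; the point of writing it in PHP or Rails is that the check then also exercises the application runtime itself.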


The database is still a single point of failure. I can do manual fail-over with PostgreSQL, as I use streaming replication. I doubt it is safe to do it with scripts; there are too many possible errors.
Yes, this is something to watch out for. I'm currently working with a client to add load-balancing/fail-over for MySQL. This will also be handled by HAProxy. The idea is to have multiple (read-only) slaves that can handle most of the traffic (most of it is read access anyway), connections to those are load-balanced by HAProxy. Only writes will go to the master directly.
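A hypothetical haproxy.cfg fragment for that idea (names and addresses invented): writes go to one local port that points at the master, reads to another that balances over the slaves:

Code:
listen mysql-write
    bind 127.0.0.1:3306
    mode tcp
    server master 192.0.2.10:3306 check

listen mysql-read
    bind 127.0.0.1:3307
    mode tcp
    balance leastconn
    server slave1 192.0.2.11:3306 check
    server slave2 192.0.2.12:3306 check

The application then has to connect to the right port depending on whether a query writes.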
 
All I can say is: make sure everything is up to date and there isn't some bug in Apache or Passenger that's causing it. Then take a closer look at the web applications themselves. It's possible an application starts misbehaving when it gets a certain number of concurrent connections. Run the applications on a test server and test the heck out of them; run a massive benchmark against them, just to see if anything breaks. If the application uses a database, make sure that runs smoothly too. The application may be working correctly, but if it can't get its data quickly enough, that will cause problems as well.
 
You are right, I just had no time to post more logs as I was in a meeting. It’s actually Passenger.

There is no load on the server right now and still: http://pastebin.com/4TDpW34R

Last time it was the same: Passenger at 100% CPU, but then I didn’t let it run without load. This time I did let it run without any traffic, and it’s still at 100% after an hour.

The question is whether using net/haproxy on the same server would help or not.

I don’t have two other servers to run the proxies. Let’s say there are two servers running Apache. What if I put net/haproxy on both of them?

Server 1 => haproxy => Server 1 Apache / Server 2 Apache
Server 2 => haproxy => Server 1 Apache / Server 2 Apache

CloudFlare would also proxy, but it doesn’t do failover. If Apache is failing on Server 1 but net/haproxy is still working fine, this setup should solve the issue by directing all the traffic to Server 2.

However, if the whole network is struggling on Server 1 then this wouldn’t help.

I have been using Passenger + Apache for years without problems. I have had these problems since I switched to SSL. Is it possible that I’m using the wrong worker? Passenger recommends the event MPM.
 