[ Apache + Nginx ] Issue with availability on high traffic

Hello,
I'm hoping to get some help on this subject.
For context: I've already spent quite some time and resources on this problem, but I couldn't find the cause.

Our architecture might be "overkill", but we mostly inherited it and are building on it.
The "working" situation:
  • 5 FreeBSD servers, installed with 13.2-RELEASE, and Apache 2.4.58 for web contents
    • Server arch : amd64, 6 cores multithreaded CPU, 64GB RAM, 2 SATA disks for systems (ZFS mirror), 2 Samsung SSD 870 disks for fast access on disk cache data (ZFS mirror)
    • Each server has the exact same configuration and serves the same types of content
    • IPFW is configured on the host to allow access only from specific IP addresses
    • Each server has a jail with the web server and data inside a ZFS dataset, so we can move it quickly if we need a new server.
    • The service is mostly served through CGI because of internal technologies
    • Apache is configured with the worker MPM module
      • Apache config:
        <IfModule mpm_worker_module>
            ServerLimit             32
            StartServers            8
            ThreadLimit             512
            MaxRequestWorkers       16384
            ThreadsPerChild         512
            MinSpareThreads         512
            MaxSpareThreads         1024
            MaxConnectionsPerChild  10000
        </IfModule>
    • Each server is configured with these sysctls
      • Code:
        security.jail.allow_raw_sockets=1
        security.jail.mount_allowed=1
        
        kern.ipc.somaxconn=32768
        net.inet.tcp.maxtcptw=200000
        net.inet.icmp.icmplim=50
        net.inet.icmp.drop_redirect=1
        net.inet.tcp.icmp_may_rst=0
        net.inet.tcp.blackhole=2
        net.inet.udp.blackhole=1
        net.link.ether.inet.log_arp_wrong_iface=0
        net.inet.tcp.msl=2500
        net.inet.tcp.sendspace=262144
        net.inet.tcp.recvspace=262144
        net.inet.tcp.sendbuf_max=16777216
        net.inet.tcp.recvbuf_max=16777216
        net.inet.tcp.sendbuf_inc=32768
        net.inet.tcp.finwait2_timeout=500
        net.inet.tcp.fast_finwait2_recycle=1
        net.inet.ip.intr_queue_maxlen=4096
  • 4 FreeBSD servers, installed with 14.3-RELEASE, nginx-full 1.28.2 installed
    • server arch: amd64, 4 cores multithreaded CPU, 64GB RAM, 2 SATA disks for system (ZFS mirror)
    • each server has the exact same configuration
    • nginx is configured to load-balance across the 5 Apache servers
      • the upstream is composed of the 5 servers' jail IP addresses with the SSL port: a.b.c.d:443
      • some pieces of configuration
        NGINX:
        worker_processes 8;
        
        events {
            worker_connections 8192;
            accept_mutex off;
            use kqueue;
        }
        # [...]
        http {
            # [...]
            client_header_buffer_size 16k;
            large_client_header_buffers 4 16k;
        
            sendfile on;
            keepalive_timeout 65;
        
            gzip_proxied any;
            # [...]
            server {
                listen 443 ssl;
                http2 on;
                
                # [using https://ssl-config.mozilla.org/ intermediate configuration for SSL]
                
                location / {
                    access_log /path/to/access.log combined;
                    proxy_hide_header Upgrade;
                    proxy_hide_header X-Powered-By;
                    proxy_connect_timeout 120s;
                    proxy_read_timeout 120s;
                    proxy_send_timeout 120s;
                    proxy_redirect off;
                    proxy_ssl_verify off;
                    # [fronts is a 5 lines upstream block only with `server a.b.c.d:443;`]
                    proxy_pass https://fronts;
                    proxy_set_header Host $host;
                    proxy_set_header X-Real-IP $remote_addr;
                    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                    proxy_set_header X-Forwarded-Proto $scheme;
                }
                # [...]
            }
            # [...]
        }
    • The sysctls are mostly the same
    • PF is configured
  • All servers are geographically close to each other, so the ping delay is well under 1ms, and the link is 1Gbit/s
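As a side note on the Apache MPM figures above: with the worker MPM, MaxRequestWorkers cannot exceed ServerLimit × ThreadsPerChild (Apache lowers it with a warning otherwise), so it is worth double-checking that the numbers line up; here they do (32 × 512 = 16384). A minimal shell check, with the values copied from the config above:

```shell
# Sanity-check the worker MPM limits quoted above: MaxRequestWorkers
# may not exceed ServerLimit * ThreadsPerChild.
SERVER_LIMIT=32
THREADS_PER_CHILD=512
MAX_REQUEST_WORKERS=16384

CAPACITY=$((SERVER_LIMIT * THREADS_PER_CHILD))
if [ "$CAPACITY" -ge "$MAX_REQUEST_WORKERS" ]; then
    echo "OK: capacity $CAPACITY covers MaxRequestWorkers $MAX_REQUEST_WORKERS"
else
    echo "WARN: MaxRequestWorkers $MAX_REQUEST_WORKERS exceeds capacity $CAPACITY"
fi
```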
What is not working:
Since the setup might be overkill, I'm trying to reduce the number of Apache servers from 5 to 2 (the nginx upstream block configured with 2 lines), but this causes problems.
What I can see in the nginx error and access logs:
  • error log: "no live upstreams", meaning that nginx has marked the 2 remote backends as down
  • access logs: a lot of HTTP 502 Bad Gateway responses
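A quick way to see whether those 502s arrive in bursts (which would point at backends being marked down for a fail window, rather than steady overload) is to bucket them per second. A sketch, using a few fabricated sample lines in place of the real access log (the path and log lines below are placeholders):

```shell
# Fabricated sample standing in for the real nginx access log (combined format).
cat > /tmp/access.sample <<'EOF'
10.0.0.1 - - [10/Jan/2025:12:00:01 +0000] "GET / HTTP/1.1" 502 157 "-" "curl/8.0"
10.0.0.1 - - [10/Jan/2025:12:00:01 +0000] "GET / HTTP/1.1" 502 157 "-" "curl/8.0"
10.0.0.2 - - [10/Jan/2025:12:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"
EOF

# Count 502 responses per second; bursts aligned on ~10s boundaries suggest
# upstreams being marked down for a fail window rather than real overload.
awk '$9 == 502 { sub(/^\[/, "", $4); print $4 }' /tmp/access.sample \
    | sort | uniq -c
```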
On the Apache server:
  • I can see that they are serving requests, and there are no errors in the logs (host or jail) indicating that we've reached any specific limit.
  • There is still memory available on the system (of course), and the CPU is about 60% idle.
  • The netstat -4Lan command does not show any queueing.
  • An apache_exporter + Prometheus + Grafana dashboard doesn't show anything indicating Apache is suffering.
  • Direct access to an Apache server (via an `/etc/hosts` entry, or curl with specific options) shows that the Apache web server responds in time.
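For reference, the kind of direct-to-backend probe mentioned above can be scripted with curl's --resolve option, which pins the vhost name to one jail IP without touching /etc/hosts. The IP and hostname below are placeholders, not the real setup:

```shell
BACKEND_IP="192.0.2.10"    # placeholder: one Apache jail IP
VHOST="www.example.org"    # placeholder: the public vhost name

# Printed rather than executed here; drop the leading `echo` to actually probe.
echo curl -sk -o /dev/null \
    --resolve "${VHOST}:443:${BACKEND_IP}" \
    -w 'code=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
    "https://${VHOST}/"
```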

I've asked some friends, and also an AI chat (someone suggested I try), but none of what I tried worked: sysctls, Apache config, nginx config.

Would you have any recommendations?
 
This does not look like an Apache capacity issue; rather, nginx seems to be incorrectly marking your backends as "down".

Indicators:
  • Apache responds fine when accessed directly (curl, /etc/hosts)
  • CPU/RAM are not exhausted
  • no listen queue issues
  • yet nginx reports "no live upstreams" and returns 502

Likely causes:

1. Upstream failure handling (critical with only 2 backends)
By default nginx marks backends down quite aggressively: with the defaults max_fails=1 and fail_timeout=10s, a single failed or timed-out request takes a server out for 10 seconds. With only 2 servers, that quickly leads to a total outage.

Code:
upstream fronts {
    least_conn;

    server a.b.c.d:443 max_fails=5 fail_timeout=10s;
    server e.f.g.h:443 max_fails=5 fail_timeout=10s;

    keepalive 100;
}

Code:
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 3;
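One caveat with retries: since nginx 1.9.13, requests with a non-idempotent method (POST, LOCK, PATCH) are not retried on the next upstream unless that is explicitly allowed. Only enable this if duplicate submissions are safe for the application:

```nginx
# Also retry POST/LOCK/PATCH on another backend — only if replays are harmless.
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
```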

2. No keepalive → high connection overhead
Without upstream keepalive, every proxied request opens a fresh TCP + TLS connection to Apache. For the upstream `keepalive` directive to take effect, proxied requests must use HTTP/1.1 with an empty Connection header:

Code:
proxy_http_version 1.1;
proxy_set_header Connection "";

3. CGI as bottleneck
CGI (a new process forked per request) causes latency spikes under load.
With 5 backends this is hidden; with 2 it is enough for nginx to mark them as failed.

Why it works with 5 servers but not with 2:

With 5 backends, occasional slow responses are absorbed: one server marked down still leaves four others.
With 2, a few slow or failed requests within one fail_timeout window are enough to mark both down, and nginx answers 502 with "no live upstreams" until the window expires.


Fixing the upstream config and enabling keepalive should already help.
Long-term: replace CGI (e.g. with FastCGI).
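As a hedged sketch of that long-term direction: if the CGI programs can run persistently behind a FastCGI process manager, Apache 2.4 can route them through mod_proxy_fcgi instead of forking per request. The socket path and file pattern below are assumptions, not your actual setup:

```apache
# Requires mod_proxy and mod_proxy_fcgi to be loaded.
# Route what used to be forked CGI to a persistent FastCGI daemon
# listening on a Unix socket (path is a placeholder).
<FilesMatch "\.cgi$">
    SetHandler "proxy:unix:/var/run/app.fcgi.sock|fcgi://localhost/"
</FilesMatch>
```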

If you can share your full upstream block or relevant error logs, this can be confirmed more precisely.
 