Solved Network loss on FreeBSD internet gateway since installation of Caddy web server

Hello,

I'm facing a very strange problem. Yesterday I replaced the nginx web server on my home internet gateway with Caddy 2 (from pkg). The gateway runs FreeBSD 12.2-RELEASE and does NAT and firewalling (pf) for my LAN. It has been running for ages with no particular network problems.

My motivation for using the Caddy web server was to learn something new and to gain QUIC support and Let's Encrypt automation. The catch is that Caddy can't open restricted ports as root and then drop privileges. I added this to /boot/loader.conf so that Caddy could run as a non-privileged user and still open ports 80 and 443:
Code:
mac_portacl_load="YES"
And this to /etc/sysctl.conf:
Code:
security.mac.portacl.port_high=1023
net.inet.ip.portrange.reservedlow=0
net.inet.ip.portrange.reservedhigh=0
security.mac.portacl.suser_exempt=1
security.mac.portacl.rules=uid:1007:tcp:80,uid:1007:tcp:443,uid:1007:udp:80,uid:1007:udp:443
and rebooted.

Caddy was working, but I got warnings in the logs:
Code:
2022/02/24 17:57:40 failed to increase receive buffer size: set udp [::]:443: setsockopt: no buffer space available. See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.

Following the documentation, I increased kern.ipc.maxsockbuf to 3014656 (and later to 6014656).

Later that day I had a network disconnection: PCs on the LAN lost internet access, etc. I rebooted the FreeBSD gateway and the network was back.
Then we started to watch Netflix (from a device on the LAN); after ~20 minutes of streaming the network went down. Another reboot, Netflix again, and 20 minutes later the network went down again.

I disabled Caddy in rc.conf, commented out every modification I had made in /boot/loader.conf and /etc/sysctl.conf, rebooted, and started Netflix again: it went smoothly, no more network downtime…

On the gateway, ifconfig output was exactly the same before and during the network outage: no change in the interfaces' status. I found no message in the logs that would shed light on the root cause. The relevant ifconfig output is:

Code:
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=81249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER>
    ether x:x:x:x:71:40
    inet 192.168.0.1 netmask 0xffffff00 broadcast 192.168.0.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=81249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER>
    ether x:x:x:x:71:41
    inet a.b.c.20 netmask 0xffffff00 broadcast a.b.c.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

em0 being LAN side and em1 Internet side.

Symptoms on the gateway were "evolving": right after the connection went down I would get "ping: sendto: No buffer space available" when trying to ping internet hosts, but after a few minutes it would turn into "ping: sendto: No route to host".
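
Since the logs stay silent, one way to get more evidence at the moment of an outage is to record buffer, routing and pf state at regular intervals. A minimal sketch, assuming a POSIX sh: the snapshot function name is made up for illustration, and the command list (netstat -m for mbuf usage, netstat -rn for routes, netstat -idn for per-interface drops, pfctl -si for pf counters) is only a suggested starting point, not a definitive diagnostic set.

```sh
#!/bin/sh
# Hypothetical helper: append one timestamped diagnostic snapshot to file $1.
# Tools that are missing on the host are noted instead of aborting.
snapshot() {
    out="$1"
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        for cmd in "netstat -m" "netstat -rn" "netstat -idn" "pfctl -si"; do
            printf '%s\n' "--- $cmd"
            tool=${cmd%% *}
            if command -v "$tool" >/dev/null 2>&1; then
                # Word splitting of $cmd is intentional: tool name plus flags.
                $cmd 2>&1 || printf '(%s failed)\n' "$cmd"
            else
                printf '(%s not available)\n' "$tool"
            fi
        done
    } >> "$out"
}
```

Calling it from a loop such as "while :; do snapshot /var/log/netdiag.log; sleep 30; done" (as root, so that pfctl can read the pf state; the log path is just an example) would leave a trail to compare just before and just after the network drops.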

Any help greatly appreciated… I might have missed something obvious but I'm pretty sure my config is OK and should work.
Thanks!
 
I've already undone everything. The modifications in loader.conf and sysctl.conf in particular are now commented out. Using a jail would still require enabling these modifications again, because network tuning and the MAC ACL are handled by the host, not by the jail. Bhyve is a no-go for other reasons.
 
Just stick to nginx which 'just works'™ and is thoroughly and widely tested and supported?

Nginx should announce support for QUIC in the main branch anytime soon. It was planned for the end of 2021 but AFAIR openssl lacked support for it. OTOH - why would you need QUIC support on a home gateway anyways?
I would prefer a battle-tested, slim and fast web server like nginx any day over playing around with something that has 35 times (!!!) the install size and obviously doesn't work reliably. For LetsEncrypt automation just use security/acme.sh, which is a simple, lightweight shell script. Baking the whole certificate handling into the web server sounds like a pretty bad idea IMHO...
 
sko, the debate is not about what I want or what I need, it's more about finding where things start to misbehave. Your answer helps no-one, especially not the FreeBSD community and the way it looks from outside. Telling people they don't need something because it does not work correctly on FreeBSD is not the best answer.
From my short experience with FreeBSD and UNIX sysadmin in general (~23 years) I feel like there is something wrong that does not come from Caddy but from the setup modifications I've made to accommodate the software. I'm not sure yet and I'll run some more tests ASAP (same setup but Caddy not running, for example).
In the meantime, I'm looking for answers / help to find out whether I've done something wrong with my setup, even though I've followed the handbook to set up MAC PORTACL.
In the end, there is a chance that Nginx+QUIC would need the same network tuning adjustments that make my system unstable.
 
In FreeBSD, behavior analysis is essential to understanding what is happening. Some engineers here, such as SirDice, may agree with what I'm about to say. You should analyze the behavior of this component with specific settings. It doesn't have to be an advanced test; sometimes a simple analysis is enough to identify the problem. I'm not handing you the solution but rather the opportunity to work it out, all the more because this kind of problem is common. What was the first event? That is what we need to establish.

Let's do an analysis with tcpdump and see what's going on.
 
So far I've tested with:
kern.ipc.maxsockbuf=6014656 alone + netflix running
MAC PORTACL alone + netflix running
MAC PORTACL + Caddy running + netflix running

no network outage.

No time for a test with kern.ipc.maxsockbuf + MAC PORTACL + Caddy; I'll do it properly next week. In the meantime, I'll let a tcpdump run with rolling pcap files.
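
For the rolling capture, tcpdump can rotate its own files: -C sets the size of each file (in millions of bytes) and -W caps the number of files, so the oldest one gets overwritten. A rough sketch, assuming em1 is the interface to watch; the rolling_capture function, the DRY_RUN switch, the 128-byte snap length, the ring of 20 files of 100 MB and the /var/tmp path are all arbitrary choices for illustration.

```sh
#!/bin/sh
# Hypothetical wrapper around tcpdump's built-in file rotation.
# $1 = interface, $2 = output file prefix.
# With DRY_RUN set, the command is printed instead of executed.
rolling_capture() {
    iface="$1"
    prefix="$2"
    # -n: no name resolution, -s 128: truncate each packet to 128 bytes,
    # -C 100 -W 20: ring of 20 files of about 100 MB each
    set -- tcpdump -n -i "$iface" -s 128 -C 100 -W 20 -w "$prefix"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Started as root with "rolling_capture em1 /var/tmp/em1.pcap", this would keep at most about the last 2 GB of traffic, which should be enough to look at the seconds before the WAN goes away.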
 
Followup:

Tonight I enabled kern.ipc.maxsockbuf=6014656 in /etc/sysctl.conf and rebooted. Caddy does not launch at startup yet, so after the reboot the state was:

kern.ipc.maxsockbuf=6014656
MAC PORTACL

I made an ssh connection to the server as soon as sshd was available and started tcpdump at 17:25:51.
At 17:26:28 the WAN became unavailable and the tcpdump output filled with retransmissions.
The logs are empty; all I got were these 2 lousy lines, with nothing in between:

Code:
Mar  1 17:25:51 host kernel: em1: promiscuous mode enabled
Mar  1 17:27:41 host kernel: em1: promiscuous mode disabled

Before rebooting I made the mistake of trying to revive the network with this command: /etc/rc.d/dhclient restart em1. Only afterwards did I think about grabbing info from netstat -r:
Code:
Internet:
Destination        Gateway            Flags     Netif Expire
0.0.0.0/8          link#2             U           em1
default            A.B.C.254          UGS         em1
A.B.C.0/24         link#2             U           em1
A.B.C.D            link#2             UHS         lo0
localhost          link#3             UH          lo0
192.168.0.0/24     link#1             U           em0
192.168.0.1        link#1             UHS         lo0
192.168.1.0/24     link#4             U         wlan0
192.168.1.1        link#4             UHS         lo0

I'm afraid that dhclient might be responsible for the "0.0.0.0/8..." line. Anyway, the proper routing table after a clean reboot is:

Code:
Internet:
Destination        Gateway            Flags     Netif Expire
default            A.B.C.254          UGS         em1
A.B.C.0/24         link#2             U           em1
A.B.C.D            link#2             UHS         lo0
127.0.0.1          link#3             UH          lo0
192.168.0.0/24     link#1             U           em0
192.168.0.1        link#1             UHS         lo0
192.168.1.0/24     link#4             U         wlan0
192.168.1.1        link#4             UHS         lo0

I rebooted, then started the Caddy server and a Netflix client on the LAN:

kern.ipc.maxsockbuf=6014656
MAC PORTACL
Caddy running

-> no network outage in more than 3 hours.
Network is in perfect condition so far: UL 686Mbps / DL 931Mbps / ping 6ms.

Not sure what could be the next move.
 
One month later, no problem at all. My network outages were probably not related to Caddy or any of the other setup changes.
 