pf 4.5 issues with NAT on FreeBSD 9.0 i386

Update: I originally thought the problem was with unbound, but it seems to only have manifested itself there. Original post follows, but I believe the real issue is due to pf. See the second post for info.

I'm having a strange problem with unbound 1.4.14 on FreeBSD 9.0-RELEASE for i386. Before upgrading FreeBSD, I ran unbound (I forget the exact version, but it was only a minor version or two before 1.4.14) and nsd in separate jails on 8.2 on the same machine without issues. After upgrading to 9.0, recompiling all ports (portmaster -af), and updating all the files in the jail, unbound has ceased to function correctly.

Before I get too deep, let me explain my setup. The machine in question is a gateway for a home network, and has two interfaces (in reality, it has four, but the other two are irrelevant here): vlan2, which is the services network where unbound resides, and fxp0, which is the WAN interface. Bear in mind that unbound is being run on the same machine doing routing and not on a different machine on the same vlan. One of the many IP addresses on the vlan2 interface on the router is assigned to the jail for unbound, which is 192.168.1.220, and nsd is running (and working) as my authoritative/slave non-recursive server on 192.168.1.219. Unbound has no issues querying the stub zones hosted on the nsd server. However, when trying to query anything else (i.e. public domains), unbound is unable to resolve anything. NAT is done on fxp0 by pf, and appears to be working without any issues.

Here's a quick diagram for anyone who's a more visual thinker:
Code:
                 ----------- router -----------
                |                              |
                |   unbound (.220)             |
                |        |                     |
VLAN2 hosts <========= vlan2         fxp0 ==(NAT'ed via pf)==> Internet
                |        |                     |
                |     nsd (.219)               |
                |                              |
                 ------------------------------
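
For readers trying to reproduce this layout, a NAT rule of the kind described might look like the sketch below, written in the older nat-rule syntax that pf 4.5 uses. It is not the ruleset from this machine: the /24 prefix and the macro names are assumptions, and only the interface name comes from the post.

```
# pf.conf sketch (assumed, not the poster's actual rules)
ext_if = "fxp0"              # WAN interface, from the post
svc_net = "192.168.1.0/24"   # services network on vlan2; the /24 is an assumption

# Translate IPv4 traffic from the services network to the WAN address.
nat on $ext_if inet from $svc_net to any -> ($ext_if)
```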

So, why can't unbound query internet DNS servers? Well, I'm not entirely sure. If I try to look up a public domain name (e.g. www.google.com), tcpdump shows lots of queries going out to the root name servers. Unbound queries all the root servers repeatedly, and even though these queries are answered in a timely manner, it appears that unbound ignores the replies. I even updated my named.cache to the latest version, which is:
Code:
;       last update:    Jun 8, 2011
;       related version of root zone:   2011060800
...but this made no difference--it still hammers the root servers for some unknown reason. I've checked both pf states and the log files, and from what I can tell, both appear to validate that it is receiving the reply packets that I'm seeing in tcpdump. I also tried running it outside the jail, and removing anything but the bare necessities (like stub zones), and rebuilding the config from the sample, and the same problem occurs. I've also checked the config with unbound-checkconf, and it finds no errors. What changed so drastically between 8.2 and 9.0 that might cause this?
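
The packet and state checks described above can be repeated with something like the following sketch. `dns_states_for` is a hypothetical helper name, not part of pf; the interface and jail address in the usage comments are the ones from this post.

```shell
#!/bin/sh
# Sketch only: dns_states_for is a made-up helper, not part of pf.

# Read `pfctl -ss` output on stdin and keep the state entries that
# mention the given address and have a port 53 endpoint.
dns_states_for() {
    grep -F -- "$1" | grep -E ':53([^0-9]|$)'
}

# Run as root on the gateway while a lookup is in flight:
#   tcpdump -n -i fxp0 udp port 53
#   pfctl -ss | dns_states_for 192.168.1.220
```

If states exist and tcpdump shows replies on the WAN interface, but the resolver still times out, the replies are being lost somewhere between pf and the socket.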

I've attached one of my more stripped-down configurations for unbound, along with a very verbose log file of a single query for www.google.com via dig (eventually returning SERVFAIL after more than 30 seconds). Any ideas on where to go from here would be greatly appreciated.

Edit: If you need more info, such as packet traces, just ask.
 


Messing around a bit, I tried changing the "outgoing-interface" to 0.0.0.0. Lo and behold, it worked--well, outside of the jail, anyway. This makes me suspect some change in how pf's NAT works or in how packets are routed internally on FreeBSD.
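
The change amounts to something like the fragment below. Only the 0.0.0.0 value comes from this thread; the rest of the `server:` clause is omitted, and the commented-out line is an assumption about what the jailed configuration used.

```
server:
    # Jailed setup: send upstream queries from the jail address (assumed value).
    # outgoing-interface: 192.168.1.220

    # Workaround outside the jail: let the kernel choose the source address.
    outgoing-interface: 0.0.0.0
```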

I did see in UPDATING that on 2011-06-28, pf was updated to v4.5, the latest version before the syntax was changed. I originally thought this had made it into 8.2, but it seems 8.2 was released earlier than I recall. If this is indeed the case, is there any way to get the old behavior (allowing for NAT'ed daemon jails), short of reverting to 8.2?

Edit: if an admin can move this thread to the "Firewalls" section, that would be appreciated.
 
Try:
Code:
pfctl -sn
Does it show an IPv6 NAT rule? A colleague of mine had this exact same problem, and he told me that IPv6 is the default with 9.0 and that after specifying "inet" in the pf rules, things went back to normal.

You can try this out; it seems to me like the same issue.
 
IPv6 is, for practical purposes, absent from the machine. It's in the kernel, as it's required for pf [Edit: Actually, it used to be, but isn't any more], but the only interface with addresses is lo0. My nat rules are explicitly for inet traffic only, and states show up when unbound attempts a lookup, but it's acting like it's not getting a reply.

Stranger still, others I've talked to on IRC say they have the same setup (jailed services on the same machine running NAT) and report no issues with 9.0. I'm in the process of writing my own program to test things out more thoroughly--I'll report back with results when I have them.
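
One quick way to take unbound out of the picture is to send a single query from the jail address straight to a root server. This is a minimal sketch, assuming `dig` is installed; `probe_root` is a hypothetical helper name, and 198.41.0.4 is a.root-servers.net.

```shell
#!/bin/sh
# probe_root SRC_ADDR [SERVER]: send one non-recursive NS query for the root
# zone from SRC_ADDR. SERVER defaults to 198.41.0.4 (a.root-servers.net).
probe_root() {
    dig -b "$1" @"${2:-198.41.0.4}" . NS +norecurse +time=3 +tries=1
}

# Example, from inside the jail or on the host:
#   probe_root 192.168.1.220
```

A timeout here, while tcpdump shows the reply arriving on the WAN interface, would point at NAT or the host's packet handling rather than at unbound.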

Update: Tested running unbound in a jail on an aliased lo0 address (127.0.0.2) that was NAT'ed to the external interface on a FreeBSD install in a VM (VirtualBox). This worked, though there are obviously a few differences between this setup and the real hardware. I'm going to reinstall 9.0 on the problematic machine, and reinstall all ports, and see if the problem persists.

Update 2: I cannot replicate the problem in a VM, even when I get it very close to the same setup that's on the physical hardware. As such, I think this is hardware-related, and thus I'm not going to pursue it further.
 
Fixed! Sort of...

Going to post this as a new message in case someone wants to test out my theory.

I replaced the switch the machine was connected to, and because the new switch has a couple fewer ports, I removed the lagg interface (originally ganging together two vr NICs with LACP) and went to a single NIC connection. Also, the "internet" is now on another VLAN, instead of fxp0. Why did this fix things? Well, my only guess is that my setting of net.inet.ip.check_interface=1 could have been interfering. Reply packets would arrive on fxp0, but then would be redirected (via pf's NAT) to an address on vlan2, which sat on lagg0 and in turn on vr0/vr1. Thus, when unbound was jailed, replies coming back on fxp0 had to be moved over to vlan2/lagg0/vr0 & vr1, which would raise red flags with check_interface enabled. When unbound was outside the jail and sending from 0.0.0.0, it would simply bind to the WAN address and send from there, so it could receive on the same interface the packets were arriving on. Now that everything is on different VLANs of the same physical interface, nothing has to move between physical interfaces (nor can it); traffic only moves between VLANs on that one interface.

I haven't tested this hypothesis as I simply don't have the time or energy to do so, but if someone else wants to take up the task, please post results here.
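
For anyone who does want to test it, the idea would be to recreate the jailed setup with NAT across two physical interfaces and compare lookups with the check on and off. This is a sketch, assuming root on the gateway; `set_check_interface` is a hypothetical wrapper, and nothing here confirms that this sysctl is the cause.

```shell
#!/bin/sh
# set_check_interface 0|1: switch net.inet.ip.check_interface at runtime
# (not persistent across reboots) and print the resulting value.
set_check_interface() {
    case "$1" in
        0|1) ;;
        *) echo "usage: set_check_interface 0|1" >&2; return 2 ;;
    esac
    sysctl net.inet.ip.check_interface="$1" &&
        sysctl -n net.inet.ip.check_interface
}

# Example: retry a lookup from the jail after each of these.
#   set_check_interface 0
#   set_check_interface 1
```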
 