Bind (named) doesn't work after internet connection changed

At home, I run a FreeBSD server, which acts as a router, firewall, and DNS/DHCP/NTP/... server. It has one internal ethernet, which all computers in the house are connected to. It has two external ethernet ports, one each for the old and the new internet providers: the old one has been used for many years, the new one was added last week. I use pf as both a firewall (to make sure no obnoxious traffic enters, and to filter some undesired connections), and to provide NAT of the many internal hosts to the outside world. Selecting which outside internet provider to use requires only "route change default <which gateway>". All that works fine, except that named fails. And without name service, nothing is really usable.

So let's talk about the DNS configuration. I run a full installation of bind, because (a) our external internet has always been unreliable, and I want the internal network to continue functioning, and (b) I have quite a few internal computers that should have names, but are intentionally invisible from the outside world, so they are not listed in the public name service that runs in the cloud (that's called split horizon DNS). The way named is configured is: It is the authoritative server for our internal domain zone (and the inverses required), and otherwise it uses the default setup of going to the root servers (not to 8.8.8.8 or to our internet provider's DNS server). The config in /etc/resolv.conf is to use 192.168.0.1 as the name server, and that is also given (via DHCP) to all internal clients.

Here is what goes wrong: If I just switch the route command (above) to use the new internet connection, all DNS queries for external things fail; my DNS server returns 2(SERVFAIL). Restarting the named server (with "service named restart") after switching to the new connection does not help, but it at least gives me one extra message in the log: "managed-keys-zone: No DNSKEY RRSIGs found for '.': success" (whatever that might mean). If I don't restart named, it eventually puts out lots of warning messages: "validating <domainname>com.wlan0/NS: bad cache hit (wlan0/DS)". Again, I have no idea what this means, and note that my server does not have any WiFi hardware and no wlan0 device.
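
For reference, a minimal sketch of how the symptom can be captured from the shell. The helper name dig_status and the queried name www.example.org are placeholders of mine, not part of my setup; the only thing taken from the configuration is the resolver address 192.168.0.1. Running the same query with dig's +cd flag (checking disabled) may help to tell a DNSSEC validation failure apart from a plain resolution failure, assuming the query reaches named at all.

```shell
#!/bin/sh
# Print the response code (NOERROR, SERVFAIL, ...) found in dig output on stdin.
# dig_status is a hypothetical helper name chosen for this sketch.
dig_status() {
    # The header line looks like:
    #   ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4711
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Usage against the live resolver (not run here; www.example.org is a placeholder):
#   dig @192.168.0.1 www.example.org a     | dig_status
#   dig @192.168.0.1 www.example.org a +cd | dig_status   # +cd: ask the server to skip DNSSEC validation
```

If the +cd query succeeds while the plain one returns SERVFAIL, that would point towards validation rather than connectivity.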

My suspicion is that the DNS server has some form of internal cache (perhaps related to DNSSEC and keys), and that cache refers to what external IP address it received the cached data from. And I suspect that the /usr/local/etc/namedb/working/managed-keys.bind file plays a role in that. I've tried flushing all caches and reloading the named daemon with "rndc flush" and "rndc reload", plus restarting it with "service named restart", but none of that helps.

I have verified that the new internet service passes all packets, including DNS packets to/from port 53, using nc. The problem seems to be solely on my end, with my named refusing to cooperate. So what am I doing wrong? How does named remember what IP address it saw recently, to know to behave badly when that IP address changes?
 
do you have a query-source statement ?
No, it's commented out, just like in the sample config that ships with the bind package. My named.conf is mostly like the sample, with just the following changes:
Code:
listen-on { 192.168.0/24; };

// Primary zones: Zones for which we are the primary
zone "example.com" {
    type primary;
    file "/usr/local/etc/namedb/primary/example.zone";
};
zone "0.168.192.in-addr.arpa" {
    type primary;
    file "/usr/local/etc/namedb/primary/example_inverse.zone";
};
(and obviously example is in reality my domain name)
 
if you tcpdump while bind is failing do you see any queries going out to the root servers ?
if they go is the src address the current ext ip ?
 
If you reboot with the new ISP default route, and it works, that would rule out the firewall.

If it doesn't work, then turning off the packet filters momentarily for a test might say something useful.
 
Problem found, at the bottom of this post. Not solved yet.

if you tcpdump while bind is failing do you see any queries going out to the root servers ?
if they go is the src address the current ext ip ?
Yes, and the source addresses are correct (I've sometimes forgotten to do NAT, which leads to wrong source addresses, turning the Ethernet into a black hole). I do see queries going out to the root servers, and I do see responses getting back from them. But I'm not good enough at speed-reading the tcpdump output to see whether the DNS queries are correct.

If you reboot with the new ISP default route, and it works, that would rule out the firewall.
Done, and doesn't change any symptoms: Everything works, except the named on my server seems to ignore the answers to queries.

So to debug this further, I ran named by hand from the command line with the -g switch, to see debug messages. And I found some interesting error messages, such as "non-improving referral", "FORMERR resolving" and "lame server". A little web searching gave me a hint: this usually means that something upstream of me is intercepting my DNS queries, and answering them for me, unfortunately with imperfect answers. I learned a nice way to diagnose this: First, run "dig @a.root-servers.net . ns", meaning ask the A root server for the NS (name servers) information for the root of the internet. The answer that comes back must have the aa flag, must not have the ra flag nor the ad flag, and must contain a list of all root servers. And when I do the same thing with my new upstream internet provider, I get answers with the ra flag, meaning I'm not talking to the real root server. For fun, do the same thing with the +tls flag (meaning use DNS-over-TLS) and to the B root server, and then you get the real answer, meaning the interceptor in my provider doesn't know about TLS (few DNS servers do).
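
The flag check described above can be sketched as a small shell filter. This is only an illustration under my own assumptions: the function name classify_dig_flags and the three labels it prints are made up for this sketch, and it only looks at the ";; flags:" header line that dig prints.

```shell
#!/bin/sh
# Classify the header flags of a dig answer read from stdin.
# Prints "intercepted" if ra or ad is set, "authoritative" if aa is set
# (and neither ra nor ad), and "unknown" otherwise.
# classify_dig_flags is a hypothetical helper name chosen for this sketch.
classify_dig_flags() {
    # Extract the flag list from a line such as:
    #   ;; flags: qr aa rd; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 27
    flags=$(sed -n 's/^;; flags: \([^;]*\);.*/\1/p' | head -n 1)
    case " $flags " in
        *" ra "*|*" ad "*) echo "intercepted" ;;    # a recursive resolver answered
        *" aa "*)          echo "authoritative" ;;  # what a real root server should send
        *)                 echo "unknown" ;;
    esac
}

# Usage against the live network (not run here):
#   dig @a.root-servers.net . ns      | classify_dig_flags
#   dig +tls @b.root-servers.net . ns | classify_dig_flags
```

The check is deliberately crude: it does not verify that the answer actually lists all root servers, which still has to be checked by eye.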

So the answer is: My upstream provider (a) intercepts my DNS queries, (b) is probably not doing that terribly well, otherwise I wouldn't have noticed, and (c) forgot to tell me that I should be using their DNS server. So I'm contacting their tech support.
 
In the meantime, I had switched to using the Quad9 server (9.9.9.9) over TLS, which worked fine: because it is TLS, the new internet provider's proxy couldn't intercept it.

But then I got the answer from my new ISP's tech support: Yes, they have their own authoritative DNS server pair, and they use that to proxy any DNS requests. But the tech who installed the connection forgot to tell me about DHCP and DNS. And it turns out their DNS server is also very very fast, with typical query times of 130-150 ms. Given that this works so well, the TLS-based remote server is going on the back burner.
 