local_unbound : communications error to 127.0.0.1#53: timed out

Hello,

I’m using local_unbound as my local resolver:
Code:
$ cat /etc/resolv.conf
# Generated by resolvconf
search foo.bar

nameserver 127.0.0.1
options edns0

(I also have a BIND server as authoritative nameserver for my domain names, listening on public IP A.B.C.D:53.)

Quite often, local_unbound fails / timeout for unknown reasons and I can’t find a way to debug that behavior:

Code:
$ host ips.updown.io
ips.updown.io has address 45.32.74.41
ips.updown.io has address 135.181.102.135
ips.updown.io has address 91.121.222.175
ips.updown.io has address 104.238.159.87
ips.updown.io has address 45.76.104.117
ips.updown.io has address 192.99.37.47
ips.updown.io has address 45.63.29.207
ips.updown.io has address 178.63.21.176
ips.updown.io has address 45.32.107.181
ips.updown.io has address 104.238.136.194
ips.updown.io has IPv6 address 2a01:4f8:141:441a::2
ips.updown.io has IPv6 address 2001:19f0:5801:1d8::1
ips.updown.io has IPv6 address 2001:41d0:2:85af::1
ips.updown.io has IPv6 address 2001:19f0:4400:402e::1
ips.updown.io has IPv6 address 2001:19f0:7001:45a::1
ips.updown.io has IPv6 address 2001:19f0:6c01:145::1
ips.updown.io has IPv6 address 2001:19f0:6001:2c6::1
ips.updown.io has IPv6 address 2607:5300:60:4c2f::1
ips.updown.io has IPv6 address 2001:19f0:9002:11a::1
ips.updown.io has IPv6 address 2a01:4f9:c010:d5f9::1
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached

My config looks like this:

Code:
$ cat /var/unbound/forward.conf
# Generated by resolvconf

forward-zone:
    name: "foo.bar"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "multi.uribl.com"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "spamhaus.net"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "abusix.zone"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "."
    forward-addr: A.B.C.D@53
    forward-addr: 193.110.81.9@853#dns0.eu
    forward-addr: 2a0f:fc80::@853#dns0.eu
    forward-addr: 185.253.5.9@853#dns0.eu
    forward-addr: 2a0f:fc81::@853#dns0.eu

Code:
$ cat /var/unbound/unbound.conf
# This file was generated by local-unbound-setup.
# Modifications will be overwritten.
server:
    username: unbound
    interface: 127.0.0.1
    directory: /var/unbound
    chroot: /var/unbound
    pidfile: /var/run/local_unbound.pid
    auto-trust-anchor-file: /var/unbound/root.key
    use-caps-for-id: yes
    num-threads: 2
    outgoing-range: 460
    tls-cert-bundle: "/usr/local/share/certs/ca-root-nss.crt"
    prefetch: yes
    cache-max-ttl: 86400
    cache-min-ttl: 3600

include: /var/unbound/forward.conf
include: /var/unbound/lan-zones.conf
include: /var/unbound/control.conf
include: /var/unbound/conf.d/*.conf

Any idea how I can debug?
 
I would assume a firewall problem to be consistent and not random. What I see is only transient / random.
And it happens when I query 127.0.0.1:53 -> I have a pretty open pf rule for lo0 "pass quick on $loc_if all keep state".

I’ll dig in that direction, but it feels weird.

Question: if one or more upstream DNS from my forward.conf are unreachable, is that possible that the host command issues a "communications error to 127.0.0.1#53: timed out" error, even though the problem is not with local_unbound (127.0.0.1#53)?
 

It’s all default values, and considering those pretty high defaults I think I’m good (also, nothing in my logs about reaching a limit).

I’ve also cleaned up my forward.conf: it’s way better but things are not perfect yet.

On PF’s side, set skip on lo0 doesn’t change the result but doesn’t hurt.
I’ve a bunch of rdr rules using very high ports numbers (between 44K and 58K) and they were not bound to a specific interface. In an attempt to avoid possible conflicts / unwanted traffic interceptions by these rules, I’ve restricted those rules to my WAN interface:
from
Code:
rdr inet proto …
to
Code:
rdr pass on $ext_if inet proto …

Not sure it changes anything, but it does not hurt either.

There is still something odd / wrong about my local-unbound and its forwarding behaviour. During a test, I’ve set up a forward.conf like this:

Code:
forward-zone:
    name: "foo.bar"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "multi.uribl.com"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "spamhaus.net"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "abusix.zone"
    forward-addr: A.B.C.D@53

forward-zone:
    name: "."
    forward-addr: (only invalid / unreachable DNS)
    forward-addr: (only invalid / unreachable DNS)

then:
Code:
OK:

# host ips.updown.io
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached

NOT OK:

# host foo.bar
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached

Obviously, host foo.bar should have matched the first config block and the query should have been answered by DNS A.B.C.D. A direct request to A.B.C.D is properly answered.
 
I’ve tested dig, host, drill, etc. Results are the same.
I’ve also used ktrace and tcpdump to track the origin of this
Code:
communications error to 127.0.0.1#53: timed out
message, but I have still no clue…

The service (local_unbound, not unbound) is up and is under high scrutiny ;)

You can find a partial output of a ktrace for a dig command here: https://pastebin.com/T6BFKvDH (this paste will expire in 6 months). I just removed most of the init process of the command, where it loads files/libs.
 
try to enable these unbound options and look at results

log-queries: <yes or no>
Prints one line per query to the log, with the log timestamp and IPaddress, name, type and class. Note that it takes time to print these lines which makes the server(significantly) slower.
Default: no

log-replies: <yes or no>
Prints one line per reply to the log, with the log timestamp and IPaddress, name, type, class, return code, time to resolve, from cache andresponse size. Note that it takes time to print these lines which makes the server(significantly) slower.
Default: no

log-tag-queryreply: <yes or no>
Prints the word ‘query’ and ‘reply’ withlog-queries: andlog-replies:.This makes filtering logs easier.
Default: no (backwards compatible)
 
I also had my fair share of unfortunate problems with local-unbound(8) and 9 out 10 times the problem was me misconfiguring IPFW or not thinking about which service writes into /etc/resolv.conf at what occasion. dhclient(8) usually conflicts with local-unbound(8) for instance which is annoying if you don't know what to look for.
I can’t find a way to debug that behavior
I had the same impression for a long time. So here is what I do since I read the relevant man pages thoroughly:
  • local-unbound(8) will not log anything if you do not put unbound.log into the configured chroot which defaults to /var/unbound: touch /var/unbound/unbound.log && chown unbound:unbound /var/unbound/unbound.log
  • Then I do local-unbound-control verbosity 5 and local-unbound-control log_reopen or alternatively:
Code:
# create a /etc/unbound/conf.d/logging.conf
# the default settings would include this into your configuration
server:
    logfile: /var/unbound/unbound.log
    verbosity: 5
    # verbosity from 0 to 5, defaulting to 1

Normally verbosity level 1 already tells me what is going on. If it says anything about permissions in the logfile you most likely misconfigured your firewall.
 
don't forget `tcpdump`. The packets never lie. Opening a saved tcpdump pcap in wireshark, and looking for anomalies, is worth it, if you can correlate with unbound log or time of failure.
 
Thank you all.

I’m already logging into syslog (to feed Splunk where I can analyse/correlate). Current config:

Code:
# cat conf.d/log.conf
server:
    use-syslog: yes
    log-queries: yes
#   log-replies: yes
    log-tag-queryreply: yes
    log-local-actions: yes
    log-servfail: yes
    verbosity: 2
    #verbosity: 4
    statistics-interval: 60
    statistics-cumulative: no
    extended-statistics: yes

I tried log-replies but found no benefit.
verbosity: 4 was not helpful either, I’ll try 5 with log-replies for few days.

My resolv.conf is stable (no DHCP client, etc.).
 
Back
Top