Solved Zabbix server behind firewall not reachable

jbo@

Developer
Hi,

I'm successfully running a Zabbix server on a host in a local network (192.168.8.0). The Zabbix server has a network interface with the IP 192.168.8.14. All other servers that run the Zabbix agent and are on the same LAN are able to talk to the Zabbix server without any problems.

The architecture of the local network looks like this:
Code:
ISP fiber ---> gold1 ---> bromine
gold1 is the router/firewall that connects the three different LANs to the internet. It runs pf and net/haproxy. Behind it are a number of webservers, S3 compatible nodes and so on.
bromine is the host that runs the Zabbix server.

The problem is that a Zabbix agent "on the internet" is not able to reach the Zabbix server. I used security/nmap to check for port 10051. On a local machine I see that the port is open (on gold1 and on bromine).
However, if I run the same command on a host "on the internet" against gold1, I can see that port 10051 is filtered:
Code:
root@hydrogen1:~ # nmap -sS -p10051 zabbix.foo.bar

Starting Nmap 7.40 ( https://nmap.org ) at 2018-08-02 15:57 CEST
Nmap scan report for zabbix.foo.bar (AA.BB.CC.DD)
Host is up (0.0095s latency).
rDNS record for AA.BB.CC.DD
PORT      STATE    SERVICE
10051/tcp filtered zabbix-trapper

Nmap done: 1 IP address (1 host up) scanned in 0.80 seconds

In contrast, ports 80 and 443 are marked as open and I'm able to query websites through gold1 just as expected:
Code:
root@hydrogen1:~ # nmap -sS -p80,443 zabbix.foo.bar

Starting Nmap 7.40 ( https://nmap.org ) at 2018-08-02 15:57 CEST
Nmap scan report for zabbix.foo.bar (AA.BB.CC.DD)
Host is up (0.0097s latency).
rDNS record for AA.BB.CC.DD
PORT    STATE SERVICE
80/tcp  open  http
443/tcp open  https

Nmap done: 1 IP address (1 host up) scanned in 1.47 seconds

I assume that this is a simple firewall issue but I'm not able to figure it out. Here's a copy of /etc/pf.conf of gold1:
Code:
if_wan="igb3"
if_lan="igb1"   # Office
if_lan2="igb4"  # Servers
if_lan3="igb0"  # Management
if_loc="lo0"

# Options
set block-policy drop

# Scrub
scrub in all

# Ignore loopback interface
set skip on $if_loc

# NAT
nat on $if_wan from $if_lan:network to any -> ($if_wan) static-port
nat on $if_wan from $if_lan2:network to any -> ($if_wan) static-port
nat on $if_wan from $if_lan3:network to any -> ($if_wan) static-port

# Redirects

# Deal with bruteforcers
table <bruteforce> persist
block quick from <bruteforce>

#pass quick on $if_lan

block in log all
#antispoof for $if_wan
pass out keep state

# Block anything coming from sources that we have no back routes for
#block in log from no-route to any

pass quick on $if_lan all # Allow all traffic on internal interface(s)
pass quick on $if_lan2 all # Allow all traffic on internal interface(s)
pass quick on $if_lan3 all # Allow all traffic on internal interface(s)

pass in on $if_wan proto tcp from any to any port 10051 keep state
pass in on $if_wan proto tcp from any to any port {80, 443} keep state
pass in quick on $if_lan proto tcp from any to any port ssh flags S/SA keep state (max-src-conn 10, max-src-conn-rate 50/3600, overload <bruteforce> flush global)

The net/haproxy instance running on gold1 is configured as follows (I removed all the parts regarding port 80 and 443):
Code:
global
        log /var/run/log local0 info
        log /var/run/log local0 notice
        daemon
        maxconn 8000
        nbproc 3
        tune.ssl.default-dh-param 2048
        user nobody
        group nobody

defaults
        log global
        option httplog
        option dontlognull
        mode http
        timeout connect 5s
        timeout client 5min
        timeout server 5min
        option forwardfor
        errorfile 400 /usr/local/etc/haproxy/errorfiles/400.http
        errorfile 403 /usr/local/etc/haproxy/errorfiles/403.http
        errorfile 408 /usr/local/etc/haproxy/errorfiles/408.http
        errorfile 500 /usr/local/etc/haproxy/errorfiles/500.http
        errorfile 502 /usr/local/etc/haproxy/errorfiles/502.http
        errorfile 503 /usr/local/etc/haproxy/errorfiles/503.http
        errorfile 504 /usr/local/etc/haproxy/errorfiles/504.http

frontend zabbix
        bind *:10051
        mode tcp

        default_backend zabbix_10051

backend zabbix_10051
        server zabbix01 192.168.8.14:10051 check
        mode tcp

I'd appreciate any kind of help (and generic feedback) on this.
 
Try removing the check, if I recall correctly the default check is a HTTP GET, which is going to fail on a regular TCP port. HAProxy will then mark the backend server as 'offline' and won't route traffic to it.

This 'offline' backend server is easily spotted if you enable the web status page of HAProxy. It will be bright red, backend servers that are online are green. The page will also show you some nice statistics for each frontend, backend and individual servers.

Note that port 10051 is used for 'Active' checks; these are done by the agent and reported to the server. The 'normal' operation is for the server to contact the agent on port 10050. Also make sure Server and ServerActive are correctly configured on the agent.
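
If it helps, a quick way to double-check those two settings is to read them straight out of the agent configuration file. This is only a minimal Python sketch, assuming the usual Key=value format with # comments. The function names and the sample values are made up for illustration, and IPv6 literals in ServerActive are not handled. As far as I know, the agent falls back to port 10051 when ServerActive gives no port.

```python
def read_agent_settings(conf_text: str) -> dict:
    """Collect Key=value pairs from zabbix_agentd.conf style text."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not Key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def active_targets(settings: dict, default_port: int = 10051) -> list:
    """Return (host, port) pairs the agent would contact for active checks."""
    targets = []
    for entry in settings.get("ServerActive", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            targets.append((host, int(port)))
        else:
            # No explicit port given: assume the default trapper port.
            targets.append((entry, default_port))
    return targets


# Illustrative values only, not taken from a real configuration file.
sample = "Server=zabbix.foo.bar\nServerActive=zabbix.foo.bar\n"
print(active_targets(read_agent_settings(sample)))
```

On the remote agent, the host printed here is the one that has to be reachable on the listed port from the outside.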
 
Thank you for your quick reply.

Unfortunately, that didn't solve my problem. I'm still getting the exact same behavior. I did ensure that everything was properly reloaded. Just to be sure I completely rebooted gold1.
If this can be of any help, here's the log file of the Zabbix agent running on the remote machine:
Code:
root@hydrogen1:~ # tail -f /var/log/zabbix_agentd.log
 37661:20180802:171945.379 IPv6 support:          YES
 37661:20180802:171945.379 TLS support:           YES
 37661:20180802:171945.379 **************************
 37661:20180802:171945.379 using configuration file: /usr/local/etc/zabbix34/zabbix_agentd.conf
 37661:20180802:171945.379 agent #0 started [main process]
 37662:20180802:171945.379 agent #1 started [collector]
 37663:20180802:171945.379 agent #2 started [listener #1]
 37664:20180802:171945.379 agent #3 started [listener #2]
 37665:20180802:171945.380 agent #4 started [listener #3]
 37666:20180802:171945.380 agent #5 started [active checks #1]
 37666:20180802:171948.400 active check configuration update from [zabbix.foo.bar:10051] started to fail (cannot connect to [[zabbix.foo.com]:10051]: [4] Interrupted system call)

My HAProxy configuration file also has a rule for zabbix.foo.bar that serves the web interface on port 443. I don't think that would be a problem, as it's listening on a different port, but just to be sure, here's the relevant part:
Code:
global
        log /var/run/log local0 info
        log /var/run/log local0 notice
        daemon
        maxconn 8000
        nbproc 3
        tune.ssl.default-dh-param 2048
        user nobody
        group nobody

defaults
        log global
        option httplog
        option dontlognull
        mode http
        timeout connect 5s
        timeout client 5min
        timeout server 5min
        option forwardfor
        errorfile 400 /usr/local/etc/haproxy/errorfiles/400.http
        errorfile 403 /usr/local/etc/haproxy/errorfiles/403.http
        errorfile 408 /usr/local/etc/haproxy/errorfiles/408.http
        errorfile 500 /usr/local/etc/haproxy/errorfiles/500.http
        errorfile 502 /usr/local/etc/haproxy/errorfiles/502.http
        errorfile 503 /usr/local/etc/haproxy/errorfiles/503.http
        errorfile 504 /usr/local/etc/haproxy/errorfiles/504.http

frontend http
        bind *:80
        bind *:443 ssl crt-list /usr/local/etc/haproxy/certs_list
        mode http
        redirect scheme https if !{ ssl_fc } # Redirect http requests to https

        use_backend zabbix if { hdr(host) -i zabbix.foo.bar }

backend zabbix
        server zabbix01 192.168.8.14:80 check
        rspadd Content-Security-Policy:\ upgrade-insecure-requests

frontend zabbix
        bind *:10051
        mode tcp

        default_backend zabbix_10051

backend zabbix_10051
        server zabbix01 192.168.8.14:10051
        mode tcp

Thank you for the comment on passive vs. active checks. I've only been using Zabbix for about half a year now. I had a network with about 20 hosts running; I'm just migrating the Zabbix server to a different host.
I did a lot of reading on active vs. passive checks. Back then I concluded that active checks are what I want to be using. I did configure everything correctly to make it work, but if you recommend using passive checks only, I'm happy to hear why.
 
Active checks are useful if the agent is buried somewhere deep in a remote network and can't be reached directly from the server. Or if you have multiple agents sitting behind a single IP address (NAT). But you do need to change the item type to "Zabbix agent (active)". The standard templates all use "Zabbix agent".

The difference between active and passive checks is simple. Passive ("Zabbix agent") checks are done from server to agent: the server contacts the agent on port 10050 and requests item X, the agent compares the requester against its Server setting and, if it matches, returns the value of the item. This is what's used by default on all the standard templates. Active ("Zabbix agent (active)") checks are a little more involved. At regular intervals the agent contacts the server on the address set by ServerActive (port 10051 by default) and requests a list of items. The agent will then send the values of those items to the server.
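
To make the active case a bit more concrete, here is a rough Python sketch of the first message an agent sends to the server when it asks for its list of items. It assumes the framing described in the Zabbix protocol documentation as I remember it: the bytes ZBXD, a flags byte 0x01, an 8-byte little-endian length, then a JSON body. The function names are mine, and "hydrogen1" is only assumed to be the Hostname configured on the agent.

```python
import json
import struct

ZBX_HEADER = b"ZBXD\x01"  # protocol magic plus flags byte 0x01


def build_active_checks_request(hostname: str) -> bytes:
    """Frame the 'active checks' request the agent sends to the server."""
    payload = json.dumps(
        {"request": "active checks", "host": hostname}
    ).encode("utf-8")
    # The length field is 8 bytes, little-endian, right after the header.
    return ZBX_HEADER + struct.pack("<Q", len(payload)) + payload


def parse_packet(packet: bytes) -> dict:
    """Inverse of the framing above: check the header, return the JSON body."""
    if packet[:5] != ZBX_HEADER:
        raise ValueError("not a Zabbix packet")
    (length,) = struct.unpack("<Q", packet[5:13])
    return json.loads(packet[13:13 + length].decode("utf-8"))


request = build_active_checks_request("hydrogen1")  # assumed agent Hostname
print(parse_packet(request))
```

The point is simply that this connection is opened by the agent, so it is the path from the agent to the server's port 10051 that has to be open, not the other way around.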

Looking at the error message 'cannot connect to ...', check whether the agent is able to make the outgoing connection (nc -zv zabbix.foo.com 10051); it may be blocked by a local firewall. Run tcpdump(1) on the HAProxy machine and see if you can actually see the connection come in.
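
If nc is not available on the agent host, roughly the same test can be done with a few lines of Python. This is a sketch, not a replacement for nmap: the function name and the three labels are my own, chosen to mirror what nmap prints. A refused connection means the packet reached a host that answered with a reset, while a timeout usually means something on the path silently dropped it, which is what "set block-policy drop" in pf does.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port roughly the way nmap reports it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # handshake completed, something is listening
    except ConnectionRefusedError:
        return "closed"        # host answered with a reset, nothing listening
    except (socket.timeout, TimeoutError):
        return "filtered"      # no answer at all, packets are probably dropped
    except OSError as exc:
        return f"error: {exc}"  # DNS failure, no route to host, and so on
```

Calling probe_tcp("zabbix.foo.com", 10051) from the agent host and comparing the result with port 443 on the same name shows whether only the Zabbix port is affected.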
 
I ran the command that you suggested and it seems that there's indeed an issue:
Code:
root@hydrogen1:~ # nc -zv zabbix.foo.com 10051
nc: connect to zabbix.foo.com port 10051 (tcp) failed: Operation timed out
That remote host (hydrogen1) runs pf as well. However, I have a pass out quick rule in there:
Code:
vpnclients = "10.8.0.0/24"
if_wan = "igb0"
if_vpn = "tun0"
if_loc = "lo0"

# OpenVPN by default runs on udp port 1194
udpopen = "{ 1194 }"
icmptypes = "{ echoreq, unreach }"

# Normalize all packets
scrub in all

# Don't filter at all on localhost
set skip on $if_loc

# Allow OpenVPN clients to access the internet through this OpenVPN server
nat on $if_wan inet from $vpnclients to any -> $if_wan

# Redirect SSH to machines behind this proxy
rdr on $if_wan proto tcp from any to any port 2230 -> 10.8.0.18 port 22
rdr on $if_wan proto tcp from any to any port 2231 -> 10.8.0.6 port 22
rdr on $if_wan proto tcp from any to any port 2232 -> 10.8.0.22 port 22
rdr on $if_wan proto tcp from any to any port 2233 -> 10.8.0.34 port 22
rdr on $if_wan proto tcp from any to any port 2234 -> 10.8.0.38 port 22
rdr on $if_wan proto tcp from any to any port 2235 -> 10.8.0.10 port 22

# Deal with bruteforcers
table <bruteforce> persist
block quick from <bruteforce>

block in log all
pass in on $if_wan proto udp from any to $if_wan port $udpopen
pass in on $if_wan proto tcp from any to any port 8080 keep state
pass in on $if_wan proto tcp from any to any port 22 keep state
pass in on $if_wan proto tcp from any to any port 80 keep state
pass in on $if_wan proto tcp from any to any port 443 keep state
pass in on $if_wan proto tcp from any to any port {1883, 8883} keep state
pass in on $if_wan proto tcp from any to any port {2230, 2231} keep state
pass in on $if_wan proto tcp from any to any port {10050, 10051} keep state
pass out quick
pass in on $if_vpn from any to any
pass in inet proto icmp all icmp-type $icmptypes

pass in quick on $if_wan proto tcp from any to any port ssh flags S/SA keep state (max-src-conn 10, max-src-conn-rate 50/3600, overload <bruteforce> flush global)

Any idea? :/
 
Just got back to the system, only to see that it's now working. I'm not sure why it took a couple of hours for it to start working, especially as I properly restarted the gold1 host.

Anyway, thanks for your help - much appreciated!
 
For anyone who comes across this in the future: the reason why it "suddenly worked" is that there was another firewall involved, sitting between my `ISP fiber` and `gold1`, which blocked port 10051.
 