Failover with multiple WAN uplinks

jbo@ · Aug 20, 2018

Hi,

I have a FreeBSD machine which acts as a router/DHCP/DNS server. I also have two uplinks to two different ISPs. I'd like to implement failover between the two WAN uplinks to give the LAN clients behind it a more stable internet connection.

While researching this I came across this forum thread pretty quickly. My setup is exactly the same as the one shown in the OP's post. For the sake of simplicity, I'll borrow his illustration (with changed IPs and interface names):

Code:

NETWORK PROVIDER 0            NETWORK PROVIDER 1
         \                            /
          \                          /
           \                        /
            \                      /
             \                    /
          Router 0            Router 1
         192.168.2.1         192.168.3.1
              \                  /
        +------\----------------/------+
        |       \              /       |
        |      igb0          igb1      |
        |  192.168.2.2   192.168.3.2   |
        |                              |
        |         FreeBSD Box          |
        |                              |
        |                    ix0       |
        |                192.168.1.1 --|------------- Switch
        |                              |
        +------------------------------+

While reading I keep coming across lagg(4) (which is also the entire content of the first post in the linked forum thread). I did read about lagg(4) and I think I have a pretty good understanding of how it works. However, I fail to understand how it helps in this scenario. My FreeBSD box is connected to two routers (one provided by each ISP) that I am running in DMZ/bridge mode. If I understand correctly, lagg(4) checks the link state of each port and acts upon it. However, the connection between each router and my FreeBSD box is always working, even if my ISP is dead (or somebody tripped over the fiber cable).
My question: Is this correct or am I misunderstanding? If lagg(4) is the way to go, how exactly does it work?

I'd appreciate any kind of help on this.

leebrown66 · Aug 20, 2018

You are correct that lagg(4) is only checking the physical link state and useless for your application.

I don't know of anything in ports, but you could do this with a cron job that fires a script which pings out of both interfaces to a known static host, then adjusts the routing table accordingly. I do this at work on a non-FreeBSD system with 3 ISPs.

That is, unless you can get both ISPs to send RIP/BGP at you for 0.0.0.0/0, in which case net/quagga will be your friend.
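
A minimal FreeBSD sketch of such a cron-driven check, under assumptions not stated in the thread: each uplink gets its own probe target (8.8.8.8 and 1.1.1.1 here), pinned to that uplink's router with a static host route, so a probe always exercises one specific ISP no matter where the default route currently points. The gateway addresses come from the diagram; the probe targets and all function names are made up for illustration.

```sh
#!/bin/sh
# Sketch only. Assumes static host routes pin each probe target to one uplink,
# set up once at boot, for example:
#   route add -host 8.8.8.8 192.168.2.1
#   route add -host 1.1.1.1 192.168.3.1
# Hypothetical table of "probe-target:gateway" pairs, in order of preference.
UPLINKS="8.8.8.8:192.168.2.1 1.1.1.1:192.168.3.1"

# probe TARGET: succeed if at least one echo reply arrives.
# FreeBSD ping(8): -c is the packet count, -t the overall timeout in seconds.
probe() {
    ping -q -c 3 -t 5 "$1" >/dev/null 2>&1
}

# pick_gateway: print the gateway of the first uplink whose probe succeeds.
pick_gateway() {
    for pg_pair in $UPLINKS; do
        pg_target=${pg_pair%%:*}
        pg_gw=${pg_pair##*:}
        if probe "$pg_target"; then
            echo "$pg_gw"
            return 0
        fi
    done
    return 1
}

# apply_gateway GW: repoint the default route unless it already uses GW.
apply_gateway() {
    ag_current=$(route -n get default 2>/dev/null | awk '/gateway:/ {print $2}')
    if [ "$ag_current" != "$1" ]; then
        # Fall back to "add" in case no default route exists at the moment.
        if route change default "$1" 2>/dev/null || route add default "$1"; then
            logger "failover: default route now via $1"
        fi
    fi
}

# Only act when invoked as "script run", so the functions can be sourced alone.
if [ "${1:-}" = "run" ]; then
    if gw=$(pick_gateway); then
        apply_gateway "$gw"
    else
        logger "failover: no uplink answered"
    fi
fi
```

A crontab line such as "* * * * * /usr/local/sbin/wan-failover.sh run" (path hypothetical) would run it once a minute. One side effect of the host routes is that LAN traffic to the probe targets is also pinned to the respective uplink.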

jbo@ · Aug 20, 2018

Thank you for the clarification!

leebrown66 said:
I don't know of anything in ports, but you could do this with a cron job that fires a script which pings out of both interfaces to a known static host, then adjusts the routing table accordingly. I do this at work on a non-FreeBSD system with 3 ISPs.

Would it be possible for you to share that script? Even though it's not designed for a FreeBSD machine I'm sure it would still help to get me started.
In fact, I have three ISPs as well. I reduced it to two for the sake of simplicity of the original post as the number of uplinks shouldn't matter here.

sko · Aug 20, 2018

Unfortunately FreeBSD's PF doesn't support address pools and load balancing, so if you aren't strictly bound to FreeBSD on your router you might want to have a look at OpenBSD's solution:
https://www.openbsd.org/faq/pf/pools.html
OpenBSD's routing domains are also a very interesting topic. Although not directly related to fallback, they can be abused for some fallback scenarios.

OTOH, FreeBSD and OpenBSD both support equal-cost multipath routing, although on FreeBSD you need a custom kernel built with options RADIX_MPATH.
You are then able to add multiple routes for each network (also for the default route or 0.0.0.0) and set their weight to prefer one route over the other. However, if one route goes down it has to be removed from the routing table, or the system will keep trying to send packets (especially for active states) through the broken link. I haven't really looked into how PF on FreeBSD can handle multiple routes and how it reacts if one link/route is down, but I suspect you'd have to at least kill all states that used the now-defunct route.
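
As a rough sketch of the cleanup described above: replace the dead default route, then flush pf's state table so existing connections stop following the defunct path. The gateway address is taken from the diagram; DRY_RUN and the function names are hypothetical helpers added here, and flushing all states is the bluntest option, since it also drops healthy connections.

```sh
#!/bin/sh
# run CMD...: execute the command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# switch_default NEW_GW: move the default route and drop stale pf states.
switch_default() {
    # A failing delete (no default route present) is not treated as fatal.
    run route delete default
    run route add default "$1" || return 1
    # Flush the whole pf state table; established connections must reconnect.
    run pfctl -F states
}
```

Running it as "DRY_RUN=1; switch_default 192.168.3.1" only prints the three commands, which is a safe way to check the sequence before letting it touch the routing table.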

You may also want to have a look at policy-based routing using multiple FIBs; they can be (ab)used to separate or categorize traffic from/to different WAN interfaces, so they might play a role in a fallback configuration.
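
One way multiple FIBs could help here, sketched under assumptions: with net.fibs=2 set in /boot/loader.conf, FIB 1 can carry its own default route through the second router, and setfib(1) then lets a probe run against that uplink regardless of where the main default route points. This also assumes the directly connected networks are present in FIB 1 (governed by the net.add_addr_allfibs sysctl). The DRY_RUN helper and the function names are hypothetical.

```sh
#!/bin/sh
# run CMD...: execute the command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_fib FIB GATEWAY: give the FIB its own default route.
setup_fib() {
    run setfib "$1" route add default "$2"
}

# probe_fib FIB HOST: ping HOST using the routing table of the given FIB.
# FreeBSD ping(8): -c is the packet count, -t the overall timeout in seconds.
probe_fib() {
    run setfib "$1" ping -q -c 3 -t 5 "$2"
}
```

For the diagram's second uplink this would be "setup_fib 1 192.168.3.1" once at boot, followed by "probe_fib 1 1.1.1.1" from the monitoring script.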

The major problem with WAN fallback is properly detecting the link state. An active link doesn't imply a working IP connection, which in turn doesn't imply that your ISP is actually routing your traffic. ISPs deploy insanely stupid and annoying stuff that prevents any easy detection. DSL lines are exceptionally notorious as they often return "something" even if the link is down; cheap DSL routers are an abomination from hell as they usually intercept and reply to DNS or HTTP requests to _any_ address/IP when the DSL link is down! So for DSL at least, a proper, normal "modem" is needed that allows you to use PPP directly from your gateway.
So whatever you do, you almost always have to tinker around to find a reliable way to detect if your WAN connection actually routes traffic to/from the outside world.

I had to build failover on 2 FreeBSD hosts and ended up using a custom script on both. Usually I check connectivity with netcat to a known-to-always-work address like Google's 8.8.8.8 or Cloudflare's 1.1.1.1 services, although those can be served from within the ISP's own network, in which case you don't catch egress routing problems at your ISP.
On timeout the script cycles through the list of egress interfaces, updates ext_if in pf.conf, deletes/updates the default route, and stops dhclient on the interface with the "broken" connection so that it doesn't overwrite the default route with zeros after unsuccessful requests.
I have uploaded one of the scripts as a snippet on GitLab (it's not the DSL one, as that used a wonky LTE uplink as fallback and was littered with really ugly hacks to overcome some stupidities of the LTE router and ISP):
https://gitlab.com/snippets/1746774
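
The flow described above (probe, rotate to the next egress interface, rewrite ext_if, reload pf) might look roughly like this. It is not the linked snippet, just an illustration: the interface names come from the diagram, ext_if is assumed to be a macro defined on a line of its own in pf.conf, all function names are invented here, and the default-route and dhclient handling is left out.

```sh
#!/bin/sh
# uplink_ok: try a TCP connection to a public DNS server (port 53).
# nc(1): -z only checks that the connection opens, -w is the timeout in seconds.
uplink_ok() {
    nc -z -w 3 8.8.8.8 53 >/dev/null 2>&1
}

# next_iface CURRENT IFACE...: print the interface after CURRENT, wrapping
# around to the first one (also used when CURRENT is not in the list).
next_iface() {
    ni_current=$1
    shift
    ni_first=$1
    ni_take=0
    for ni_i in "$@"; do
        if [ "$ni_take" -eq 1 ]; then
            echo "$ni_i"
            return 0
        fi
        if [ "$ni_i" = "$ni_current" ]; then
            ni_take=1
        fi
    done
    echo "$ni_first"
}

# set_ext_if CONF IFACE: rewrite the ext_if macro line in CONF in place.
set_ext_if() {
    sei_tmp=$(mktemp /tmp/pfconf.XXXXXX) || return 1
    sed "s/^ext_if[[:space:]]*=.*/ext_if=\"$2\"/" "$1" > "$sei_tmp" &&
        cat "$sei_tmp" > "$1"
    sei_rc=$?
    rm -f "$sei_tmp"
    return "$sei_rc"
}

# failover_once CONF IFACE...: if the uplink is dead, move pf to the next interface.
failover_once() {
    fo_conf=$1
    shift
    uplink_ok && return 0
    fo_cur=$(sed -n 's/^ext_if[[:space:]]*=[[:space:]]*"\(.*\)"/\1/p' "$fo_conf")
    fo_new=$(next_iface "$fo_cur" "$@")
    set_ext_if "$fo_conf" "$fo_new" && pfctl -f "$fo_conf"
}
```

Called from cron as "failover_once /etc/pf.conf igb0 igb1", it does nothing while the probe succeeds and otherwise flips ext_if to the other interface and reloads the ruleset.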

leebrown66 · Aug 21, 2018

joel.bodenmann said:
Would it be possible for you to share that script? Even though it's not designed for a FreeBSD machine I'm sure it would still help to get me started.
In fact, I have three ISPs as well. I reduced it to two for the sake of simplicity of the original post as the number of uplinks shouldn't matter here.

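CAVEAT: these are Linux commands, and the switches differ on FreeBSD. For example, the overall deadline is ping -w on Linux but ping -t on FreeBSD (Linux uses -t for the TTL, which is -m on FreeBSD), and the source address/interface is -I on Linux but -S on FreeBSD.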

Code:

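#!/bin/sh
# Link state per ISP: BM (Black Mountain, a radio link), Verizon and WAN
Link_BM=unknown
Link_Verizon=unknown
Link_WAN=unknown
Msg_BM=normal

# awk exits 1 when the summary line reports 100% packet loss, 0 otherwise
/bin/ping -c5 -w15 -I10.1.248.254 -q www.google.com | awk '$0 ~ /packet\ loss/ {print $0;exit $6 == "100%" || $8 == "100%"}'
if [ $? -eq 1 ]; then
 # every packet was lost: wait, then retry with more packets
 sleep 7
 /bin/ping -c10 -w15 -I10.1.248.254 -q www.google.com | awk '$0 ~ /packet\ loss/ {print $0;exit $6 == "100%" || $8 == "100%"}'
 if [ $? -eq 1 ]; then
  if [ -f /tmp/lsm_bm_down ]; then
   # marker left by the previous run: second failure in a row, link is down
   Msg_BM=FAIL
   Link_BM=down
  else
   # first failure: leave a marker and keep treating the link as up
   echo "Black Mountain failed, pass 1"
   Link_BM=up
   Msg_BM=failing
   touch /tmp/lsm_bm_down
  fi
 else
  Link_BM=up
  if [ -f /tmp/lsm_bm_down ]; then
   rm /tmp/lsm_bm_down
  fi
 fi
else
 Link_BM=up
 if [ -f /tmp/lsm_bm_down ]; then
  rm /tmp/lsm_bm_down
 fi
fi
# NOTE: kept as posted; this line unconditionally forces BM to "up",
# overriding the result computed above
Link_BM=up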

# Verizon: single probe; down only if the previous run failed as well
/bin/ping -c2 -w5 -Iisp.8 -q www.google.com | awk '$0 ~ /packet\ loss/ {print $0;exit $6 == "100%" || $8 == "100%"}'
if [ $? -eq 1 ]; then
 if [ -f /tmp/lsm_verizon_down ]; then
  Link_Verizon=down
 else
  echo "Verizon failed, pass 1"
  Link_Verizon=up
  touch /tmp/lsm_verizon_down
 fi
else
 Link_Verizon=up
 if [ -f /tmp/lsm_verizon_down ]; then
  rm /tmp/lsm_verizon_down
 fi
fi

# WAN: probe, retry once after a pause; down as soon as both attempts fail
/bin/ping -c2 -w5 -I10.1.5.1 -q www.google.com | awk '$0 ~ /packet\ loss/ {print $0;exit $6 == "100%" || $8 == "100%"}'
if [ $? -eq 1 ]; then
 sleep 7
 /bin/ping -c10 -w15 -I10.1.5.1 -q www.google.com | awk '$0 ~ /packet\ loss/ {print $0;exit $6 == "100%" || $8 == "100%"}'
 if [ $? -eq 1 ]; then
  Link_WAN=down
 else
  Link_WAN=up
 fi
else
 Link_WAN=up
fi

Actual_params=params-B.${Link_BM}-V.${Link_Verizon}-W.${Link_WAN}
if [ /etc/shorewall/${Actual_params} -ef /etc/shorewall/params ]; then
 /bin/echo "No action (BM: ${Msg_BM})"
else
 /bin/logger Link BM      ${Link_BM}
 /bin/logger Link Verizon ${Link_Verizon}
 /bin/logger Link WAN     ${Link_WAN}
 /bin/unlink /etc/shorewall/params
 /bin/ln -s /etc/shorewall/params-B.${Link_BM}-V.${Link_Verizon}-W.${Link_WAN} /etc/shorewall/params
 /sbin/shorewall restart
 /usr/sbin/conntrack -D
 echo "Switched to params-B.${Link_BM}-V.${Link_Verizon}-W.${Link_WAN}" | mail -s "ALERT: Server Room" -r ****@nyingma.org ****@ratnaling.org
fi

The logic is roughly like this:
Try to ping Google 5 times (giving up after 15 seconds) via ISP BM (10.1.248.254 is policy routed out that provider).
If all pings fail (it's a radio link), wait 7 seconds and try again with 10 packets.
If all of those fail too, mark the ISP as failing and wait for the cron job to fire again. If it still fails on the next run, consider it DOWN.
If either attempt succeeds, remove the failing/down mark for the ISP.

Ditto for Verizon
Ditto for WAN
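
For a FreeBSD port, the two-pass logic above could be sketched as follows. ping(8) on FreeBSD exits 0 when at least one reply arrives, so the awk parsing is not needed; -t is the overall timeout and -S the source address there. The marker-file naming mirrors the Linux script; STATE_DIR and the function names are hypothetical, and probing from a given source address only tests that ISP if traffic from that address is policy routed out of it, as in the setup described above.

```sh
#!/bin/sh
# Directory for the "failed once" marker files (hypothetical knob).
STATE_DIR="${STATE_DIR:-/tmp}"

# probe_twice SRC HOST: 5 pings; if none is answered, pause and try 10 more.
# FreeBSD ping(8) exits 0 if at least one reply was received.
probe_twice() {
    ping -q -c 5 -t 15 -S "$1" "$2" >/dev/null 2>&1 && return 0
    sleep 7
    ping -q -c 10 -t 15 -S "$1" "$2" >/dev/null 2>&1
}

# link_state NAME RC: print "up" or "down" given the probe's exit status RC.
# A first failure only leaves a marker file (link "failing", still reported
# as up); a failure on the next run, with the marker present, reports "down".
link_state() {
    ls_marker="$STATE_DIR/lsm_$1_down"
    if [ "$2" -eq 0 ]; then
        rm -f "$ls_marker"
        echo up
    elif [ -f "$ls_marker" ]; then
        echo down
    else
        : > "$ls_marker"
        echo up
    fi
}
```

In a cron-driven script this would be used as "probe_twice 192.168.2.2 www.google.com; Link_ISP0=$(link_state isp0 $?)", with the address taken from the diagram at the top of the thread.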

Based upon the results from that I have up/down for each ISP. That makes for 8 combinations, so I have a different configuration parameter file for each:

Code:

[root@RLServices shorewall]# ls -l params*
lrwxrwxrwx. 1 root root   36 Aug 17 13:31 params -> /etc/shorewall/params-B.up-V.up-W.up
-rw-r--r--. 1 root root  276 Apr  9  2016 params-B.down-V.down-W.down
-rw-r--r--. 1 root root  267 Apr  9  2016 params-B.down-V.down-W.up
-rw-r--r--. 1 root root  267 Apr  9  2016 params-B.down-V.up-W.down
lrwxrwxrwx. 1 root root   35 Jan 16  2016 params-B.down-V.up-W.up -> params-B.down-V.up-W.up-DP.priority
-rw-r--r--. 1 root root  259 Apr 15  2016 params-B.down-V.up-W.up-DP.normal
-rw-r--r--. 1 root root  278 Dec 15  2017 params-B.down-V.up-W.up-DP.priority
-rw-r--r--. 1 root root  126 Oct 18  2017 params-B.up-V.down-W.down
-rw-r--r--. 1 root root  124 Apr  9  2016 params-B.up-V.down-W.up
-rw-r--r--. 1 root root   67 Apr  9  2016 params-B.up-V.up-W.down
-rw-r--r--. 1 root root  402 May 12 13:12 params-B.up-V.up-W.up
-rw-r--r--. 1 root root 1937 Jul 27 12:21 params.std

So params is the file the firewall tool (shorewall) expects, and I just change the link, if needed, to point to the correct parameter file. The firewall is then restarted, the change is logged, and an email is sent out.

Hope that helps.
