Solved carp / arp trouble

Recently I've been changing our routers/firewalls at work to a redundant configuration, using lagg(4), vlan(4)s, carp(4), pf(4), and pfsync(4). However, I've run into a strange problem and I'm at a loss as to how to troubleshoot it at this point.

We have two routers right now, in a redundant config as follows:

Code:
# router01's rc.conf (redacted)
hostname="router01"
dumpdev="NO"

watchdogd_enable="YES"
unbound_enable="YES"
sshd_enable="YES"
openntpd_enable="YES"
dhcpd_enable="YES"

pf_enable="YES"

ifconfig_re0="up"
ifconfig_re1="up"

cloned_interfaces="lagg0 vlan24 vlan64 vlan254 vlan300"

ifconfig_lagg0="laggproto lacp laggport re0 laggport re1"

ifconfig_vlan24="inet x.x.x.131/29 vhid 131 advskew 127 pass pw131 vlan 24 vlandev lagg0"
ifconfig_vlan24_alias0="inet x.x.x.133/32 vhid 133 advskew 0 pass pw133"
ifconfig_vlan64="inet 192.168.64.2/24 vlan 64 vlandev lagg0"
ifconfig_vlan64_alias0="inet 192.168.64.1/32 vhid 64 advskew 0 pass pw64"
ifconfig_vlan254="inet 192.168.254.251/24 vlan 254 vlandev lagg0"
ifconfig_vlan254_alias0="inet 192.168.254.254/32 vhid 254 advskew 127 pass pw254"
ifconfig_vlan300="inet 172.31.255.249/29 vlan 300 vlandev lagg0"

pfsync_enable="YES"
pfsync_syncdev="vlan300"

defaultrouter="x.x.x.129"
gateway_enable="YES"

static_routes="vl224"
route_voice="-net 192.168.224.0/24 192.168.254.2"
The second router's config (router02) is identical, with the following exceptions:
  • It uses em0/em1, as it has Intel NICs instead of the crummy Realteks.
  • Anywhere router01's advskew is 0, router02's is 127, and vice versa.
The goal is for router01 to be the master for 192.168.64.0/24 (which is routed out x.x.x.133) and for x.x.x.133, and for router02 to be the master for 192.168.254.0/24 (routed out x.x.x.131) and for x.x.x.131.

For the most part everything seemed to be working, and I've set up preemption for carp(4) and things look good in ifconfig(8) (from router01):

Code:
re0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=82099<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
  ether 00:30:48:dc:21:a6
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect (1000baseT <full-duplex,master>)
  status: active
re1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=82099<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
  ether 00:30:48:dc:21:a6
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect (1000baseT <full-duplex>)
  status: active
pflog0: flags=0<> metric 0 mtu 33160
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1500
  pfsync: syncdev: vlan300 syncpeer: 224.0.0.240 maxupd: 128 defer: off
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
  options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
  inet6 ::1 prefixlen 128
  inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
  inet 127.0.0.1 netmask 0xff000000
  nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=82099<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
  ether 00:30:48:dc:21:a6
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect
  status: active
  laggproto lacp lagghash l2,l3,l4
  laggport: re0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
  laggport: re1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
vlan24: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=1<RXCSUM>
  ether 00:30:48:dc:21:a6
  inet x.x.x.131 netmask 0xfffffff8 broadcast x.x.x.135 vhid 131
  inet x.x.x.133 netmask 0xffffffff broadcast x.x.x.133 vhid 133
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect
  status: active
  vlan: 24 parent interface: lagg0
  carp: BACKUP vhid 131 advbase 1 advskew 127
  carp: MASTER vhid 133 advbase 1 advskew 0
vlan64: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=1<RXCSUM>
  ether 00:30:48:dc:21:a6
  inet 192.168.64.2 netmask 0xffffff00 broadcast 192.168.64.255
  inet 192.168.64.1 netmask 0xffffffff broadcast 192.168.64.1 vhid 64
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect
  status: active
  vlan: 64 parent interface: lagg0
  carp: MASTER vhid 64 advbase 1 advskew 0
vlan254: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=1<RXCSUM>
  ether 00:30:48:dc:21:a6
  inet 192.168.254.251 netmask 0xffffff00 broadcast 192.168.254.255
  inet 192.168.254.254 netmask 0xffffffff broadcast 192.168.254.254 vhid 254
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect
  status: active
  vlan: 254 parent interface: lagg0
  carp: BACKUP vhid 254 advbase 1 advskew 127
vlan300: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
  options=1<RXCSUM>
  ether 00:30:48:dc:21:a6
  inet 172.31.255.249 netmask 0xfffffff8 broadcast 172.31.255.255
  nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
  media: Ethernet autoselect
  status: active
  vlan: 300 parent interface: lagg0

However, today problems started with routing traffic out the x.x.x.133 address. What's really odd is that it's almost impossible to troubleshoot. Connecting to the router (via its private address) and running tcpdump(1) on vlan24 shows me no IP traffic at all (though I can see carp traffic), despite the router being the master. Trying to ping the default gateway (x.x.x.129) gives me:
Code:
ping: sendto: Invalid argument
This is not related to PF or my PF ruleset, as it happens even with PF disabled (pfctl -d).

I suspect this is somehow related to arp(8) problems, as I get repeated dmesg(8) spam of:
Code:
arpresolve: can't allocate llinfo for x.x.x.129 on vlan24
as if it can't allocate memory for the default gateway's ARP/IP information. This message has appeared ever since I first set things up, but things were working before (or at least appeared to be). The ARP tables for the other interfaces (vlans) are fine; only vlan24's is empty.

Any idea of where to start?
 
There's an error I made while copying and pasting, but I cannot edit the original post. Instead of "route_voice", that should read "route_vl224". (I named it differently when adding it to the two routers, and copied one of the pastes from one router's config and the other paste from the other's.)

Edit: I also forgot to mention that only the 192.168.64.0/24 network and the x.x.x.133 address were affected. There were no issues on 192.168.254.0/24 or x.x.x.131.

Edit: Okay, I've resolved the issue and I think I know what is going on.

When a carp IP address is in the "BACKUP" state, it doesn't exist as far as routing/subnet information is concerned. So, while router01 was in the backup state for x.x.x.131/29 and the "MASTER" for x.x.x.133/32, it didn't know that the subnet on vlan24 was a /29: the only address carrying that mask didn't "exist" because it was in the backup state, and the one active address, x.x.x.133/32, covers nothing but itself. So, without the knowledge of that mask, it couldn't figure out which interface the default gateway (x.x.x.129) was on!
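
One way this theory could be checked from the shell is sketched below. The function name show_gateway_path is hypothetical and not from the original post; it only wraps standard FreeBSD tools (ifconfig(8), netstat(1), route(8), arp(8)) and assumes it is run as root on the affected router.

```shell
# Hypothetical helper (not from the original post): print what the kernel
# knows about reaching a gateway through a given interface.
show_gateway_path() {
    ifname="$1"   # e.g. vlan24
    gw="$2"       # e.g. the redacted default gateway, x.x.x.129

    # Addresses, netmasks, and carp states (MASTER/BACKUP) on the interface.
    ifconfig "$ifname"

    # IPv4 routing table: look for a connected route that covers the gateway.
    netstat -rn -f inet

    # The route and interface the kernel would choose for the gateway.
    route -n get "$gw"

    # ARP entries learned on this interface only.
    arp -n -i "$ifname" -a
}
```

Calling it as show_gateway_path vlan24 followed by the real gateway address, and finding no connected /29 route on vlan24, would be consistent with the explanation above.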

So to fix it, I simply swapped which address was the primary (the non-alias one, with the /29 mask) in rc.conf on router01's vlan24 interface. Now everything appears to be working again!
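
Based on that description, router01's vlan24 lines would presumably end up looking something like the following. This is a reconstruction, not a paste from the actual router: it assumes the vhid, advskew, and pass values stay with their original addresses, and that only the primary/alias roles (and with them the /29 and /32 masks) trade places.

```shell
# router01's rc.conf, vlan24 only (reconstructed; addresses redacted as in the post)
# The address router01 is MASTER for (vhid 133) now carries the /29 mask,
# so the connected route for the subnet exists while this router is active.
ifconfig_vlan24="inet x.x.x.133/29 vhid 133 advskew 0 pass pw133 vlan 24 vlandev lagg0"
# The address router01 is BACKUP for (vhid 131) becomes the /32 alias.
ifconfig_vlan24_alias0="inet x.x.x.131/32 vhid 131 advskew 127 pass pw131"
```

router02 would keep its existing layout under this assumption, since it is already the master for the address that holds the /29 mask there.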

If any carp gurus would care to comment on this (or know whether this is even the preferred way to handle a configuration like this), I would love to hear your thoughts.
 