Network goes away after some time

wwelvaert · Apr 25, 2024

I've run into a strange issue attempting to deploy FreeBSD production servers for the first time. I've deployed 3 servers (2 web and 1 dbase) running FreeBSD 13.3 on OMC/Kamatera. When I boot the servers I can connect fine using ssh or curl (plain http to the web server home page). However, after not connecting for some time (30 minutes or so) I'm not able to connect again to the servers. Both ssh and curl time out using any variety of clients or networks.

When this happens I can still access the server via the remote console on the Kamatera cloud platform and the way to restore connectivity is simply to do ping google.com on the server. After the first couple of pings I can connect fine from the outside world using ssh or curl.

While I am unable to log in from the outside world I don't see any output from tcpdump -i vmx0. Disabling the firewall (IPFW) or flushing the rules does not appear to change anything.

Output of ifconfig vmx0:

Code:

vmx0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether 00:50:56:1c:6f:5b
	inet 103.98.214.13 netmask 0xffffff00 broadcast 103.98.214.255
	media: Ethernet autoselect
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

My etc.conf file:

Code:

sshd_enable="YES"

# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"

ifconfig_vmx0="inet 103.98.214.13 netmask 255.255.255.0"

ifconfig_vmx1_alias0="inet 172.16.0.4 netmask 255.255.255.0"

# Enable firewall.
firewall_enable="YES"
firewall_quiet="YES"
firewall_logdeny="YES"
firewall_script="/etc/ipfw.rules"

# Enable Apache web server.
apache24_enable="yes"
hostname="web24_coolcalc_com"

#fstabcheckdisks_enable="YES"

#hwaddress ether 00:50:56:1c:6f:5b
defaultrouter="103.98.214.254"

#ifconfig_vmx1_alias0="inet 172.16.0.4 netmask 255.255.255.0"
#hwaddress ether 00:50:56:3a:d9:d1

Any help is much appreciated!

mer · Apr 25, 2024

wwelvaert said:
When this happens I can still access the server via the remote console on the Kamatera cloud platform and the way to restore connectivity is simply to do ping google.com on the server. After the first couple of pings I can connect fine from the outside world using ssh or curl.

So it sounds like for a while you can access the server from the outside, then after a period of inactivity you can't so you have a way to get a console to the device where you then ping out from the server.
Is that about right?
If so it sounds like firewall type of issues. If you set up a cron job on the server to say ping google.com every 5 or 10 minutes that would keep "holes" open from the server out.

SirDice · Apr 25, 2024

I'm more inclined to consider a MAC address table timing out on the switch.

wwelvaert · Apr 25, 2024

mer said:
So it sounds like for a while you can access the server from the outside, then after a period of inactivity you can't so you have a way to get a console to the device where you then ping out from the server.
Is that about right?

Yes, that's exactly my predicament...

mer said:
If so it sounds like firewall type of issues. If you set up a cron job on the server to say ping google.com every 5 or 10 minutes that would keep "holes" open from the server out.

I agree setting up a cron job might work but it seems a bit hack-ish, I'm hoping to find a more elegant solution first.

The output of ipfw list below, not sure if anything in there stands out. I'm not well versed in firewall rules but wrote the custom rules script because I have to open Postgres port 5432 to the local network (there's probably an easier way to do that)

Code:

00005 allow ip from any to any via xl0
00010 allow ip from any to any via lo0
00101 check-state :default
00110 allow tcp from any to 8.8.8.8 53 out via vmx0 setup keep-state :default
00111 allow udp from any to 8.8.8.8 53 out via vmx0 keep-state :default
00112 allow tcp from any to 8.8.4.4 53 out via vmx0 setup keep-state :default
00113 allow udp from any to 8.8.4.4 53 out via vmx0 keep-state :default
00200 allow tcp from any to any 80 out via vmx0 setup keep-state :default
00220 allow tcp from any to any 443 out via vmx0 setup keep-state :default
00250 allow icmp from any to any out via vmx0 keep-state :default
00260 allow udp from any to any 123 out via vmx0 keep-state :default
00280 allow tcp from any to any 22 out via vmx0 setup keep-state :default
00299 deny log ip from any to any out via vmx0
00300 deny ip from 192.168.0.0/16 to any in via vmx0
00301 deny ip from 172.16.0.0/12 to any in via vmx0
00302 deny ip from 10.0.0.0/8 to any in via vmx0
00303 deny ip from 127.0.0.0/8 to any in via vmx0
00304 deny ip from 0.0.0.0/8 to any in via vmx0
00305 deny ip from 169.254.0.0/16 to any in via vmx0
00306 deny ip from 192.0.2.0/24 to any in via vmx0
00307 deny ip from 204.152.64.0/23 to any in via vmx0
00308 deny ip from 224.0.0.0/3 to any in via vmx0
00310 deny icmp from any to any in via vmx0
00315 deny tcp from any to any 113 in via vmx0
00320 deny tcp from any to any 137 in via vmx0
00321 deny tcp from any to any 138 in via vmx0
00322 deny tcp from any to any 139 in via vmx0
00323 deny tcp from any to any 81 in via vmx0
00330 deny ip from any to any frag offset in via vmx0
00332 deny tcp from any to any established in via vmx0
00410 allow tcp from any to me 22 in via vmx0 setup limit src-addr 2 :default
00420 allow tcp from 172.16.0.3 to me 5432 in via vmx1 setup limit src-addr 5 :default
00499 deny log ip from any to any in via vmx0
00999 deny log ip from any to any
65535 deny ip from any to any

wwelvaert · Apr 25, 2024

SirDice said:
I'm more inclined to consider a MAC address table timing out on the switch.

That's way out of my area of expertise. How would one go about confirming or fixing such a problem?

mer · Apr 25, 2024

MAC address thing:
It sounds like your servers are at an ISP/Hosting solution of some kind. In order for the outside world to access them, typically you have DNS entries that get you to an IP address then the IP address gets you the host (MAC address).
MAC addresses are usually populated by "arp", so in a packet capture you would see DNS lookup, then icmp/arp "who has ....".
Every bit of network gear between you (your PC) and the server may do the lookup (who has) and the result gets stored in the arp table of the device (that's simplified, switches and routers to stuff that says an ip is on a port).
The entries in the arp table usually have a timeout where they age out, so any network device in the path can lose the "who is that".

When you can't ssh to the server, have you tried to ping the ip address or do a dns lookup on it? assuming the IP address is not changing, doing a ping of the server IP from somewhere else should reestablish the plumbing for you. Since you have an alternative path for the console you can verify if the IP address changes

cy@ · Apr 25, 2024

mer said:
So it sounds like for a while you can access the server from the outside, then after a period of inactivity you can't so you have a way to get a console to the device where you then ping out from the server.
Is that about right?
If so it sounds like firewall type of issues. If you set up a cron job on the server to say ping google.com every 5 or 10 minutes that would keep "holes" open from the server out.

Yes, sounds like firewall issues. Some firewalls will time out inactive TCP sessions. At $JOB we have a 30 minute timeout for inactive TCP sessions. Configuring TCP sessions, setsockopt() or globally, works around the problem.

I doubt ICMP ECHOs will keep TCP sessions alive. Though I suppose some firewall admin may block initiating egress sessions if an IP has been inactive for a while. I don't see why one might do this but people do weird things. Captive gateways used by hotels, hospitals, and other public wireless access points do this. I suppose one could apply the same rules to wired connections. If they do that, then I don't see the point except to make life difficult.

If the server machine has a DHCP assigned address, you may want to see if it has expired. Some places place limits on IP address assignments because they haven't allocated enough NATed IPs -- a little short sighted.

SirDice · Apr 25, 2024

vmx(4) seems to imply a VMWare virtual machine. I've seen the virtual switches in VMWare do some really wonky things. Like only being able to access some hosts but not others, in the same freaking broadcast domain. Try turning the vm off. Not reboot, off. Then back on. That usually clears the weirdness on the virtual switches.

VladiBG · Apr 25, 2024

It may be caused by duplicate MAC address in the VM.

pming · Apr 25, 2024

I haven't used VMware in a while, did you install open-vm-tools? This could help with compatibility issues.

wwelvaert · Apr 26, 2024

Hi all. Thanks for many responses and suggestions. I was out of the office this afternoon but will respond back after trying these out tomorrow.

wwelvaert · Apr 26, 2024

cy@ said:
Yes, sounds like firewall issues. Some firewalls will time out inactive TCP sessions. At $JOB we have a 30 minute timeout for inactive TCP sessions. Configuring TCP sessions, setsockopt() or globally, works around the problem.

Thanks for the feedback. Just to clarify, it's not just my session that times out. When this happens, nobody in the office can connect via ssh nor curl, no matter which machine or network they are using on the client side. It's as if the server has totally gone out to lunch.

SirDice · Apr 26, 2024

VladiBG said:
It may be caused by duplicate MAC address in the VM.

That's another possibility, yes. That will cause really weird network issues. Super annoying to debug.

VladiBG · Apr 26, 2024

Yes your entire connection to the virtual switch is lost until another Gratuitous ARP is send to the virtual switch so it can learn the MAC address and update it's ARP cache. You should contact your VMware support so they can investigate why this is happening.
Here's a similar old topic with the VMware:

https://kb.vmware.com/s/article/1008184

VM Ping/ARP issue

We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source. We have about 250 VM’s running on about 20 HP BL465C blades installed on two HP C7000 chassis, using the HP Virtual Connect interconnect modules...

communities.vmware.com

Andriy · Apr 26, 2024

wwelvaert could you please grep /var/run/dmesg.boot for vmx?
If you can do a verbose boot and then grep the file, then it would be even better.

wwelvaert · Apr 26, 2024

A few things I can add from tinkering with the servers the past few hours (still working on the other suggestions & requests, thank you all)

1) We powered off the servers overnight but that did not seem to solve the issue.

2) When the problem occurs, ping does nothing.

Code:

$ ping 104.129.130.82
PING 104.129.130.82 (104.129.130.82): 56 data bytes
^C
--- 104.129.130.82 ping statistics ---
74 packets transmitted, 0 packets received, 100.0% packet loss

3) On one of the servers we disabled the firewall (IPFW) and rebooted the server. When the server is running without firewall (obviously not an option in "the real world") we were not able to duplicate the problem. However, we have a another FreeBSD server running in the same datacenter with the same firewall configuration (IPFW with the basic "workstation" firwall type) and it is not affected. The only difference between the affected and not-affected servers is that the not-affected server has been running for a long time and is running FreeBSD 13.2. The affected servers were spun up recently and are running FreeBSD 13.3 and 14.0.

4) When the issue occurs doing service ipfw stop does not fix the problem, neither does ipfw -q flush followed by ipfw -q add 00010 allow tcp from any to me in

wwelvaert · Apr 26, 2024

Below is the output of cat /var/run/dmesg.boot | grep -i "vmx"

Code:

vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x2000-0x200f mem 0xfdffc000-0xfdffcfff,0xfdffd000-0xfdffdfff,0xfdffe000-0xfdffffff irq 18 at device 0.0 on pci3
vmx0: Using 512 TX descriptors and 512 RX descriptors
vmx0: Using 2 RX queues 2 TX queues
vmx0: Using MSI-X interrupts with 3 vectors
vmx0: Ethernet address: 00:50:56:1c:6f:5b
vmx0: netmap queues/slots: TX 2/512, RX 2/512
vmx1: <VMware VMXNET3 Ethernet Adapter> port 0x3000-0x300f mem 0xfdefc000-0xfdefcfff,0xfdefd000-0xfdefdfff,0xfdefe000-0xfdefffff irq 19 at device 0.0 on pci11
vmx1: Using 512 TX descriptors and 512 RX descriptors
vmx1: Using 2 RX queues 2 TX queues
vmx1: Using MSI-X interrupts with 3 vectors
vmx1: Ethernet address: 00:50:56:3a:d9:d1
vmx1: netmap queues/slots: TX 2/512, RX 2/512
vmx0: link state changed to UP
vmx1: link state changed to UP

pming · Apr 26, 2024

wwelvaert said:
My etc.conf file:

Code:

ifconfig_vmx0="inet 103.98.214.13 netmask 255.255.255.0" ifconfig_vmx1_alias0="inet 172.16.0.4 netmask 255.255.255.0"

shouldn't there be an entry like vmx1="..." or is this supposed to work as intended?

mer · Apr 26, 2024

wwelvaert said:
When the problem occurs, ping does nothing.

Is this a ping from outside to the server or the other way?

wwelvaert said:
The only difference between the affected and not-affected servers is that the not-affected server has been running for a long time and is running FreeBSD 13.2. The affected servers were spun up recently and are running FreeBSD 13.3 and 14.0.

That is a good data point; I have nothing to say beyond that.

cy@ · Apr 26, 2024

wwelvaert said:
Thanks for the feedback. Just to clarify, it's not just my session that times out. When this happens, nobody in the office can connect via ssh nor curl, no matter which machine or network they are using on the client side. It's as if the server has totally gone out to lunch.

Does it still have an IP? Has it's DHCP lease expired?

If it still has an IP, can you ping the gateway? But before you do that, look at the arp table. Then look at the arp table after you ping the gateway.

VladiBG · Apr 26, 2024

remove this line:

00310 deny icmp from any to any in via vmx0

You don't want to block all ICMP. If you want to deny echo only it's another story.
When you deny the entire ICMP protocol you also break the MTU & MSS Path MTU Discovery as they depend of the ICMP (DF bit)

Try to allow the following

290 allow icmp from any to me icmptypes 3,4,11

And if you want to allow echo

291 allow icmp from any to me icmptypes 8

Maximum transmission unit - Wikipedia

en.wikipedia.org

wwelvaert · Monday at 6:41 PM

By way of update, we "fixed" the problem by going back to Linux on Linode (now Akamai). We've had servers running there for quite a few years, it was my personal preference to try FreeBSD (old hippie at heart) but since this is a work project I can't justify going on an expedition. I wish there were more/turn-key FreeBSD server options available but as it were we're back with Debian on Linode.

Network goes away after some time

wwelvaert

mer

SirDice

Administrator

wwelvaert

wwelvaert

mer

cy@

SirDice

Administrator

VladiBG

pming

wwelvaert

wwelvaert

SirDice

Administrator

VladiBG

VM Ping/ARP issue

Andriy

wwelvaert

wwelvaert

pming

mer

cy@

VladiBG

Maximum transmission unit - Wikipedia

wwelvaert