Network router experiencing odd stalls

Hello,
I am running two identical FreeBSD 11.0 servers. They are used to connect about 850 clients to the internet via DHCP and custom authentication scripts.

The servers are running a MySQL server and a PostgreSQL server.
Each server has 13 network ports: one internal motherboard igb port, plus two HotLava Shasta six-port Gigabit Ethernet cards that also use the igb driver.

The RAM is 16 GB and the CPU is a Ryzen 7 1700 (8 cores).

Under normal use, the system forwards 800 Mbps from the internet port to the various clients. Bandwidth control is handled via dummynet and IPFW rules. The CPU load runs at about 10% to 20%.
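To give an idea of the setup, the shaping rules look roughly like the sketch below; the pipe numbers, rates, interface name, and client subnet here are made-up examples rather than my real ruleset:

Code:
# hypothetical shaping rules -- pipe numbers, rates, interface, and subnet are examples
ipfw pipe 10 config bw 10Mbit/s      # download cap for one service tier
ipfw pipe 20 config bw 5Mbit/s       # upload cap for the same tier
ipfw add 1000 pipe 10 ip from any to 10.0.0.0/16 out via igb1
ipfw add 1010 pipe 20 ip from 10.0.0.0/16 to any in via igb1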

The problem I have is that when I copy large files on the system (for example, when I back up the PostgreSQL server or do a tar backup of the system), the network connections slow down or stop working completely. During one of these backups in non-peak hours, internet usage can drop from 300 Mbps down to 20 Mbps. At this time, the CPU load does not seem to be any higher than normal.

I have switched from interrupt mode to polling, with very little change in the problem.
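For reference, the switch was roughly the following; this assumes the kernel was built with options DEVICE_POLLING and that the driver supports it, and the user_frac value is just an example:

Code:
# assumes "options DEVICE_POLLING" in the kernel config and driver support
ifconfig igb0 polling                # enable polling on one interface
sysctl kern.polling.user_frac=50     # example value: CPU share reserved for userland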

I hope someone here can help me identify the cause and, hopefully, a solution for this problem. I can't keep delaying database backups forever.

I did notice this on the previous setup, which was running a 6-core Athlon CPU and two Hot Lava Vesuvius cards (em driver), but the problem was very minor and did not affect the entire network. On that system the CPU load would max out during these backups.
 

I have done some network tuning:

Code:
$ cat loader.conf
kern.vty=vt
net.graph.maxdata=262140
net.graph.maxalloc=262140
net.graph.recvspace=256000
net.graph.maxdgram=256000
net.graph.threads=100
net.link.ifqmaxlen=2048
net.inet.ip.fw.dyn_buckets=1024  # duplicate entry; overridden by the 65536 setting below
net.inet.ip.dummynet.io_fast=1
net.inet.tcp.ecn.enable=1
net.inet.ip.fw.dyn_buckets=65536
net.inet.ip.fw.dyn_max=65536
net.inet.ip.fw.dyn_ack_lifetime=120
net.inet.ip.fw.dyn_syn_lifetime=10
net.inet.ip.fw.dyn_fin_lifetime=2
net.inet.ip.fw.dyn_short_lifetime=10
net.inet.ip.intr_queue_maxlen=4096
kern.maxusers=1500
kern.random.harvest.mask=351
net.inet.tcp.soreceive_stream="1"
hw.igb.rx_process_limit="-1"

Code:
$ cat sysctl.conf
kern.ipc.shm_allow_removed=1
net.inet.ip.forwarding=1
kern.maxfiles=250000
kern.maxfilesperproc=32768
net.inet.ip.dummynet.hash_size=65536
kern.ipc.nmbclusters=12255534  # changed from 512000 10-5-2016
kern.ipc.maxsockbuf=134217728
#kern.ipc.nmbjumbop=512000 # changed from 256000 10-5-2016
kern.ipc.somaxconn=8192
net.inet.tcp.sendspace=67108864
net.inet.tcp.recvspace=67108864
net.inet.udp.recvspace=67108864
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.graph.maxdgram=128000
net.graph.recvspace=128000
net.inet.tcp.delayed_ack=0
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535
net.inet.icmp.icmplim=0
net.inet.ip.fastforwarding=1
net.inet.ip.dummynet.io_fast=1
 
Maybe it's not the CPU that's being hit hard but your I/O? If the backup chokes the buses, there will be very little left for the data transfers.

To be honest, I would suggest taking the DB functionality off this host and onto its own machine. I would use the 'old' server for the routing functions and dedicate this new host to the DBs.
 
Maybe it's not the CPU that's being hit hard but your I/O? If the backup chokes the buses, there will be very little left for the data transfers.

To be honest, I would suggest taking the DB functionality off this host and onto its own machine. I would use the 'old' server for the routing functions and dedicate this new host to the DBs.
I had been considering that option. It would be a PITA to do, as the system is running live and I don't want to take it offline to transfer the data to a different machine.

To be honest, I was hoping there was a solution that keeps the setup running as is, especially since changes can breed problems, and upset customers are not good for business.
 
Just curious, did you run vmstat -w 1? A non-zero number in the 'b' column indicates a process was blocked waiting for resources. Frequently seeing a non-zero number here would, I believe, be a sign of an overloaded machine. Does iostat -w 1 reveal anything?
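That is, something like this in two terminals while a backup is running (both are stock utilities):

Code:
vmstat -w 1       # watch the 'b' (blocked) column
iostat -x -w 1    # per-device throughput and %busy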
 
Just curious, did you run vmstat -w 1? A non-zero number in the 'b' column indicates a process was blocked waiting for resources. Frequently seeing a non-zero number here would, I believe, be a sign of an overloaded machine. Does iostat -w 1 reveal anything?
I ran pg_dump for a few minutes and started to see a few non-zero values in the 'b' column. Is there a way to know which resources are lacking, to see if anything can be changed to help this issue?

I did see ada0 reaching 150+ Mbps at that time too. I forgot to mention that the servers are using two SSDs connected via SATA interfaces.
 
Is there a way to know which resources are lacking, to see if anything can be changed to help this issue?

That part is beyond me. I do know that when I run tar on my NAS it sometimes pins the 'b' column for a few intervals, and the machine is pretty much 100% busy doing what it is doing. You can always consider upgrades, but that ends up being a 'weakest link in the chain' scenario. I think you should give good consideration to this:

To be honest, I would suggest taking the DB functionality off this host and onto its own machine. I would use the 'old' server for the routing functions and dedicate this new host to the DBs.

...or migrate your routing/firewall/VPN functions onto devices made especially for that. Consider the Juniper SRX series devices.
https://www.juniper.net/us/en/products-services/security/srx-series/
 
Also consider trying systat -vmstat. There is a Procs section, and the bottom left-hand corner has disk info. Hope this helps you find the answer you are looking for.
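The exact invocations I have in mind, plus top in I/O mode, which I believe can help show which processes generate the disk traffic:

Code:
systat -vmstat 1      # refresh every second; Procs at the top, disk stats bottom-left
top -m io -o total    # per-process I/O counters, sorted by total operations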

systat(1)
 
idprio(1) seems to have helped keep the network going without stalling. I will look at the systat -vmstat output for more information.
 
idprio(1) seems to have helped keep the network going without stalling. I will look at the systat -vmstat output for more information.
I spoke too soon. With more network load, the network traffic completely stalls when running pg_dump to an external flash drive with "idprio 16 pg_dump". The HDD load is low (less than 50 Mbps), and the 'b' column in "vmstat -w 1" shows a 1 much of the time.
 
I spoke too soon. With more network load, the network traffic completely stalls when running pg_dump to an external flash drive with "idprio 16 pg_dump". The HDD load is low (less than 50 Mbps), and the 'b' column in "vmstat -w 1" shows a 1 much of the time.

*grasping at straws*

For pg_dump, it's likely the PostgreSQL server processes that are creating the load, not the pg_dump tool itself, so that's what idprio would need to act on, especially if it's outputting to an otherwise unused SSD. (Does gstat show the drives fairly idle?)
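If the backend's read rate is the problem, throttling the dump itself might be worth a try. A rough sketch, assuming misc/cstream is installed; the database name, rate, and output path are made up:

Code:
# throttle the dump to ~10 MB/s so the backend reads less aggressively
# (database name, rate, and output path are hypothetical)
idprio 16 pg_dump mydb | cstream -t 10000000 > /backup/mydb.dump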

Might be time for dtrace. Check out sysutils/DTraceToolkit and hotkernel, or even https://github.com/brendangregg/DTrace-tools/blob/master/sched/tstates.d
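Even a stock-dtrace one-liner that samples kernel stacks can show where the time goes; the probe rate and duration below are arbitrary choices:

Code:
# sample on-CPU kernel stacks at 997 Hz for 10 seconds
dtrace -x stackframes=100 \
  -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { exit(0); }'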

See https://www.slideshare.net/brendangregg/eurobsdcon-2017-system-performance-analysis-methodologies for even more things to try. There's some off-CPU profiling material in there which might help with those wait states.
 
You could try putting your database on a different drive than the system drive. High activity on the drive that handles the system and log files can be the reason for your slowdown, especially with a UFS-formatted drive. There are fewer problems like this with ZFS.
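Roughly like the sketch below, if a spare disk is available; the device name, mount point, and filesystem options are only examples:

Code:
# hypothetical: dedicate a spare disk to the database files
gpart create -s gpt ada2
gpart add -t freebsd-ufs ada2
newfs -j /dev/ada2p1                 # UFS with soft updates journaling
mount /dev/ada2p1 /var/db/postgres   # or add a matching /etc/fstab entry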
 
You could try putting your database on a different drive than the system drive. High activity on the drive that handles the system and log files can be the reason for your slowdown, especially with a UFS-formatted drive. There are fewer problems like this with ZFS.
Thanks for the suggestion, but the database files are already on a second, separate drive (not just a partition on the primary drive).
 
The problem does seem to be related to the PostgreSQL server. I ran a pg_dump from a remote machine, and after a while the server dropped network connections. The connections resumed once I stopped pg_dump. I also noticed slow network behavior which eventually took the entire network (all interfaces) offline! When I stopped the PostgreSQL server, the network immediately recovered! What is odd is that I never saw PostgreSQL using heavy CPU cycles in top. The processes were nowhere near the top of the process list.

Does anyone have any ideas about this?

I am looking into moving the PostgreSQL server off this machine in the near future, but I would like to be able to complete a database backup for the move without taking the entire network down.
 
What is odd is that I never saw PostgreSQL using heavy CPU cycles in top. The processes were nowhere near the top of the process list.

I'm guessing here; maybe it's because the CPU is not the chokepoint, but your I/O. Maybe the motherboard bus is maxing out. The CPU is sitting there twiddling its thumbs, saying it's an easy day at the office because shipping and receiving doesn't have enough staff. When I run tar jobs on my soon-to-be-replaced, first and dear-to-me FreeBSD server, the CPU does not max out, but the machine does become very sluggish. The 'b' column during vmstat -w 5 does show frequent non-zero values.

I'm not sure what more you can do. Upgrade the server hardware? Offload PostgreSQL to another server? Offload networking to a device made for networking (consider Juniper, which runs Junos, itself built on FreeBSD technology)?
 