Swap battle stories

I get the impression from reading this thread that there may be some interesting stories to share on the subject of using or not using swap. Here's mine.

Once upon a time I was working at a place where we had several Hadoop clusters that ran on our own bare metal. Old school, I know. We suddenly started having problems with jobs failing or timing out. This was puzzling because the jobs that ran on the cluster hadn't changed, and we had actually upgraded the cluster software and hardware. We looked into the matter further, and discovered that the HDFS data nodes were dropping out and coming back due to network timeouts. This was also puzzling because our network infrastructure had not changed.

It turned out that some genius was running a Perl map/reduce job, and was relying on the fact that 32-bit Perl could use a maximum of 4GB of RAM. You can probably guess what happened at this point. The upgraded nodes had 64-bit Perl on them. The M/R job started taking insane amounts of memory. This memory pressure forced the HDFS daemons into swap because they were idle most of the time. Then the master node would send them a request which would time out because it took so long to restore the daemons' working set from spinning rust. Eventually the daemon would come back and say "I'm not dead yet!" to the master node. Everything was normal and snappy when the Perl job wasn't running.

We decided at that point that swap = bad, mmkay? and configured all Hadoop nodes without it forevermore. I wonder if that still applies in these days of fast NVMe "disks", though.
 
Similar issue with Perl. We were maintaining a large, self-written network monitoring system that consisted of various parts: a process to scan all network devices and pull in all information, configurations, etc.; a process to check all the configurations for ACLs (each device had to have a standard set of ACLs), IOS versions, etc.; and a nice web interface where network engineers could verify everything. It all worked quite well and was well written.

Scanning more than 10,000 routers and switches took a lot of time, however. So the scanning process was rebuilt to spawn several sub-processes that could scan several devices concurrently. That greatly improved the scanning rate: we could scan the same number of devices in much less time. Unfortunately, during the rollout of a new, improved version of the scanner, a bug was introduced. Instead of spawning only a fixed number of sub-processes, it basically turned into a fork bomb and kept creating more and more processes until the machine screeched to a halt with a gazillion processes running.


This memory pressure forced the HDFS daemons into swap because they were idle most of the time. Then the master node would send them a request which would time out because it took so long to restore the daemons' working set from spinning rust.
Without swap those processes would have gotten killed by the OOM killer.

We decided at that point that swap = bad, mmkay?
With or without swap those machines would have croaked regardless. Swap wasn't the issue here; it did its best to keep processes from getting killed under the memory pressure. Combating symptoms doesn't help if you don't deal with the actual cause of the problem (the Perl job that ate all the memory).
 
I never looked into it deeply, but resource limits can sometimes help against this kind of indecent behaviour.
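On FreeBSD that would be rctl(8). A minimal sketch, assuming the racct framework is enabled and using a made-up user name 'scanner' for the scanning jobs:
Code:
# racct/rctl must be switched on in /boot/loader.conf first (needs a reboot):
#   kern.racct.enable=1

# Cap the scanner user at 4 GB of memory and 200 processes:
rctl -a user:scanner:memoryuse:deny=4g
rctl -a user:scanner:maxproc:deny=200

# Show the rules currently in effect:
rctl

A maxproc deny rule is exactly the kind of thing that would have stopped the fork bomb above before the machine ground to a halt.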
 
These examples are in fact one reason I like to have ample swap configured: it makes it a lot easier to identify the problem. Without swap the processes just get killed; then you have an incident and must start a post-mortem analysis, which is not delightful: you don't know where to start and must go through all the logfiles to get an idea of the timeline and what might have happened. With swap, things become slow, and then it takes only one glance at the statistics to see what is going on, and another to catch the culprit. (Well, given that somebody indeed looks at the machines... but such things could also be detected and reported by some monitoring tool.)
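For completeness, that "one glance" usually amounts to something like this (just an illustration; the exact columns differ a bit between releases):
Code:
# How much swap is in use, human readable:
swapinfo -h

# Who is eating the memory (FreeBSD top in batch mode, sorted by resident size):
top -b -o res | head -20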

I am just observing that FreeBSD does a very good job managing the swap - I am actually surprised how well it does. Like all of us who compile from source, I had to tackle the git matter. And it was not a two-liner: after I decided to do it properly, it grew to (the usual) 3000 lines.
Before, I was using chroot for the compiling environment. (A jail is nothing more than a chroot with network compartmentalization, so why use a jail for compiling, which does not need the network?) Now I have decided to change to bhyve, which should make the building entirely independent of the host machine (and then also provide test environments for anything).
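For the curious, booting such a FreeBSD build guest by hand looks roughly like this (a sketch along the lines of the handbook example; the VM name build0, the disk path and the sizes are invented, and wrappers like sysutils/vm-bhyve hide most of it):
Code:
# Load the guest kernel, then run it with 2 vCPUs and 2 GB of RAM:
bhyveload -m 2G -d /vm/build0.img build0
bhyve -c 2 -m 2G -A -H -P \
    -s 0,hostbridge -s 1,lpc -s 2,virtio-blk,/vm/build0.img \
    -l com1,stdio build0

# Release the VM instance afterwards:
bhyvectl --destroy --vm=build0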

Currently it runs: three of them, on 8 GB of RAM. I did not bother to install additional RAM yet (I would need to power down for that):
Code:
last pid: 68538;  load averages:  4.26,  4.38,  4.40    up 4+23:47:49  12:19:24
209 processes: 2 running, 207 sleeping
CPU:  1.2% user,  0.0% nice, 98.5% system,  0.0% interrupt,  0.3% idle
Mem: 4772M Active, 384M Inact, 491M Laundry, 1991M Wired, 16M Buf, 188M Free
ARC: 782M Total, 171M MFU, 408M MRU, 5852K Anon, 12M Header, 185M Other
     189M Compressed, 466M Uncompressed, 2.47:1 Ratio
Swap: 20G Total, 9719M Used, 11G Free, 47% Inuse, 48K In

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
66189 root         20 134  i10  3131M   586M kqread   2 291:04 190.90% bhyve
61548 root         27 134  i10  3134M   471M kqread   2 398:29  97.46% bhyve
61861 root         27 134  i10  3134M  1415M kqread   2 397:28  93.12% bhyve

I am really surprised that this still works well. Paging is not slowing things down noticeably, and the execution times look good. (It probably always finds something to compute - but it obviously does create some wear on the $29 QLC drives.)
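That wear is easy to keep an eye on, by the way (just an illustration; the counters are standard vm.stats sysctls):
Code:
# Pages going to / coming back from the swap devices:
sysctl vm.stats.vm.v_swappgsin vm.stats.vm.v_swappgsout

# Per-device I/O load while the guests are paging:
iostat -x 1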

What I am generally noticing is a tendency to try to squeeze the last 0.5% of performance out of machines. For what benefit? I finally got a new (used) board yesterday (the long-desired Haswell+, dual-socket, registered-ECC Xeon), and I had a hard time finding all the switches in the CMOS setup that were keeping the OS from throttling things: all C-states and cpufreq support were disabled, with the main effect that it would always draw 310 mA from the wall plug when idle. Now it is at 165 mA (37 W, with one CPU installed), and I can probably live with that...
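Once the firmware allows it, the OS side is just a few knobs (a sketch; these are the stock rc.conf and sysctl names, and the values depend on the CPU driver):
Code:
# /etc/rc.conf: scale the CPU frequency with the load, allow deep C-states when idle
powerd_enable="YES"
performance_cx_lowest="Cmax"
economy_cx_lowest="Cmax"

# Check what the hardware offers and what is currently in use:
sysctl dev.cpu.0.freq_levels dev.cpu.0.cx_supported hw.acpi.cpu.cx_lowest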
 
Reading this alongside <https://forums.freebsd.org/threads/how-much-swap.79804/> …

… have ample swap …

Always.

… FreeBSD does a very good job …

👍

For fun, two recent screenshots of GhostBSD in seamless mode in VirtualBox:
  • around 1 G memory
  • 8.50 G swap (probably entered as 8 … when I installed the system)
  • LibreOffice Writer, GIMP, Firefox after loading https://app.element.io/

2021-05-03 07:12:51 GhostBSD 13-RELEASE with 1 GB memory.png


2021-05-03 07:15:35 GhostBSD 13-RELEASE with 1 GB memory, Element.png
 
I don't understand why this continues to be debated. You need a certain amount of RAM to get the 'work' done, and swap is free. So, through a due-diligence engineering process, determine the amount of RAM required and put that in the machine. Then add in a generous amount of swap. Spinning disks are too slow for you? Then use solid state.
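Adding that generous swap later is cheap, too; on FreeBSD a swap file can be bolted on without repartitioning (a sketch of the handbook recipe; the size and path are arbitrary):
Code:
# Create a 4 GB swap file and lock down its permissions:
dd if=/dev/zero of=/usr/swap0 bs=1m count=4096
chmod 0600 /usr/swap0

# /etc/fstab line to attach it via md(4) at boot:
md99  none  swap  sw,file=/usr/swap0,late  0  0

# Activate it now and verify:
swapon -aL
swapinfo -h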

I am not a server expert, but I have never had an issue with my FreeBSD machines because they had swapped out some stuff. But then again, I don't cheap out on RAM, and I give my machines generous swap too.
 