Solved FreeBSD Inactive memory

Hello,

Out of ten machines, one has 90% of its memory being put into Inactive. We have done a lot of analysis but still cannot solve the issue without rebooting the OS. Could someone please help me fix this?
Code:
Mem: 78M Active, 1503M Inact, 160M Wired, 112M Buf, 260M Free
Thanks in advance - Sugumar
 
(Please correct me if I am wrong!)
Having inactive memory is a normal state.
Inactive means memory that was allocated by some program but has not been accessed for a while. Maybe the memory's owner is idle or sleeping. This memory is marked to be moved to swap if some program wants more memory.
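The per-queue page counters behind top's summary line can be inspected directly; a quick sketch, assuming a FreeBSD host (counter names from the vm.stats.vm sysctl tree):

```shell
# Page counts for each VM queue; multiply by hw.pagesize for bytes
sysctl vm.stats.vm.v_active_count
sysctl vm.stats.vm.v_inactive_count
sysctl vm.stats.vm.v_wire_count
sysctl vm.stats.vm.v_free_count
sysctl hw.pagesize
```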
 
A side note: memory management is one of several behaviors which, under Windows, differ significantly from every other operating system in common use today: OS X, Linux, Android, iOS, *BSD, Solaris and Haiku all handle things like physical RAM and disk reads/writes in similar ways.[1] Windows may be the most widely used OS on the planet, but in the grand scheme of things it's the weird black sheep of the family.

[1]: I'm just mindlessly assuming that Windows is what the OP is used to, as this question is frequently asked by Windows users new to Linux.
 
Sorry for the 5 year necro-bump, but I figured this was the most appropriate thread to continue.

Code:
70 processes:  1 running, 69 sleeping
CPU:  2.8% user,  0.0% nice,  2.9% system,  0.0% interrupt, 94.2% idle
Mem: 541M Active, 21G Inact, 4281M Laundry, 4859M Wired, 454M Buf, 681M Free
ARC: 1666M Total, 964M MFU, 339M MRU, 445K Anon, 135M Header, 227M Other
     688M Compressed, 2499M Uncompressed, 3.63:1 Ratio
Swap: 4096M Total, 4073M Used, 23M Free, 99% Inuse

I'm in a situation where my FreeBSD 12.1-RELEASE system is running out of swap, yet there's 21GB of Inact memory.

There's 32GB of physical RAM, and the largest single consumer of memory is MySQL at ~8GB. Every other process is <500MB and many <100MB.

Typically I see the extra memory consumed by ZFS ARC (under Wired), and therefore available for use by other processes, but right now that nominally available memory seems to be stuck in Inact.
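If ARC is suspected of crowding out application memory, it can be capped. A sketch, assuming the standard vfs.zfs.arc_max tunable (value in bytes; 8 GiB here is an arbitrary example):

```shell
# Cap the ZFS ARC at 8 GiB via a loader tunable (takes effect at boot)
echo 'vfs.zfs.arc_max="8589934592"' >> /boot/loader.conf

# On recent releases the limit can also be lowered at runtime:
sysctl vfs.zfs.arc_max=8589934592
```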

I temporarily stopped MySQL to see what would happen, but that just made things worse - the memory it freed up quickly disappeared, and now there's 25GB of Inact memory. I can't even restart MySQL!

I'm struggling to understand why the system is hitting 100% swap yet not releasing that huge amount of Inact memory.

[edit] Looking at https://wiki.freebsd.org/Memory , I still don't understand why the memory has not gone from the Inact to Free state. There's no unusual disk activity (there's a blip every few seconds), so it's not like the system is desperately trying to flush dirty pages to disk (also, I have sync=always on zroot, since it has an SSD SLOG). After my initial diagnostic actions 45+ minutes ago, the large majority of physical RAM is still in the Inact state.
 
If you're on 5.7 you could try this - use tcmalloc instead of jemalloc (though it sounds like you'll have to reboot first to be able to start MySQL) to see if it helps: https://forums.freebsd.org/threads/...lines-exhaust-ram-and-swap.72733/#post-464070

Yes, I had this problem previously, but I changed to tcmalloc some months ago. I don't think it's related, because the large majority of memory marked Inact cannot be attributed to any process (or ZFS ARC). If it was a process with a massive memory leak, top would reflect the allocation for that process.
 
Several hours later it seems to have resolved itself - there's now 19G Free.

I'd still really like to know what could cause such a drastic amount of memory to be marked Inact when:

- No process is obviously using it
and
- ZFS ARC is not obviously using it
and
- Free memory is low enough that the system starts to actively swap, to the point of exhausting all swap

Code:
Mem: 332M Active, 4257M Inact, 300M Laundry, 7249M Wired, 207M Buf, 19G Free
ARC: 3381M Total, 2531M MFU, 520M MRU, 762K Anon, 141M Header, 189M Other
     993M Compressed, 3192M Uncompressed, 3.22:1 Ratio
Swap: 4096M Total, 4096M Free
 
From RTFM tuning(7), I'd try sysctl vm.overcommit=4 (or 6, set at least bit 2) and sysctl kern.ipc.shm_use_phys=1. I have no answer to your question, but I guess the VM algorithms might work sub-optimally with your 8:1 RAM:swap ratio. The general recommendation is 1:1 - 1:2 for systems with >4GB RAM.
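Those knobs can be tried at runtime first and only persisted if they help; a sketch:

```shell
# Bit 2 of vm.overcommit: count non-wired physical memory as swap
sysctl vm.overcommit=4
# Back SysV shared memory with unpageable physical RAM
sysctl kern.ipc.shm_use_phys=1

# Persist across reboots once validated
printf 'vm.overcommit=4\nkern.ipc.shm_use_phys=1\n' >> /etc/sysctl.conf
```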

 
From RTFM tuning(7), I'd try sysctl vm.overcommit=4 (or 6, set at least bit 2) and sysctl kern.ipc.shm_use_phys=1.

Interesting. Unsure about kern.ipc.shm_use_phys (this seems to apply to shared memory?) but vm.overcommit looks promising. On this page (which I am assuming is a patch that was incorporated into the mainstream kernel) bit 2 is described as:

"Bit 2 allows to count non-wired physical memory as swap. This is like the swap reservation on Solaris going. Additionally, free_reserved pages (exported as vm.stats.vm.v_free_target) are never allowed to be allocated (from the userspace) to help avoid deadlocks."

I don't really understand what this means. I'm trying to get my head around the concept of considering physical memory as swap.

I have no answer to your question, but I guess the VM algorithms might work sub-optimally with your 8:1 RAM:swap ratio. The general recommendation is 1:1 - 1:2 for systems with >4GB RAM.

Recommended swap size seems to be an age old question (and debate). I had a quick look again and there is some contention over this FAQ. I suspect that documentation (including best-practice advice) has not kept pace with major changes to FreeBSD over the years. For example, this page, which I came across when trying to solve my issue, talks about 3.0-CURRENT, and describes a 7MB kernel as being large. (It also recommends 2X swap). The handbook as a whole was last modified in 2019, but that particular page is obviously way out of date.

I also notice that the current tuning(7) man page mentions using ipfw to limit bandwidth on T1 connections. :)

Having more swap may have helped in this instance, but I feel like that's simply hiding the symptoms of the true problem. There's still the issue of 21GB of RAM being sidelined for hours, for no apparent reason.
 
Update. My problem appears to be NFS related.

There's currently (and was earlier) a backup in progress, which rsyncs into this machine, and accesses an NFS share on another machine.

remote server ->[rsync]-> local server 1 ->[NFS]-> local server 2

rsync's delta algorithm reads the entire file at both ends when it believes it has changed, which means that for large files, a local cache can quickly fill.
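If the delta algorithm itself is the problem, rsync can be told to skip it with --whole-file, trading network bandwidth for far less read traffic at both ends (the paths below are placeholders):

```shell
# -W/--whole-file disables the delta-transfer algorithm, so changed
# files are copied outright instead of being read in full on both sides
rsync -a --whole-file /backup/src/ remote:/backup/dst/
```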

Unmounting the NFS share immediately frees up the Inact memory.

The only references I can find to NFS caching seem to be related to writes. I can't see an obvious way to reduce the size of (or completely clear) an NFS cache, like you can with ZFS ARC. Perhaps I should ask this in a new thread?
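Absent a dedicated knob, the unmount workaround can at least be scripted; a sketch with a hypothetical mount point:

```shell
# Inspect current NFS mounts and their options
nfsstat -m

# Cycle the mount to release the cached pages (hypothetical path)
umount /mnt/nfs-share && mount /mnt/nfs-share
```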
 
Interesting. Unsure about kern.ipc.shm_use_phys (this seems to apply to shared memory?)
Obviously yes. If your DB uses much shared memory -- which DBs usually do -- then "This feature allows the kernel to remove a great deal of internal memory management page-tracking overhead [...]"
but vm.overcommit looks promising. On this page (which I am assuming is a patch that was incorporated into the mainstream kernel) bit 2 is described as: "Bit 2 allows to count non-wired physical memory as swap. This is like the swap reservation on Solaris going. Additionally, free_reserved pages (exported as vm.stats.vm.v_free_target) are never allowed to be allocated (from the userspace) to help avoid deadlocks." I don't really understand what this means. I'm trying to get my head around the concept of considering physical memory as swap.
Yes, seems that's the page from the original patch. RTFM tuning(7) puts it clearer: "Bit 2 allows to count most of the physical memory as allocatable, except wired and free reserved pages"
[...outdated docs...]
[...] Having more swap would may have helped in this instance, but I feel like that's simply hiding the symptoms of the true problem. There's still the issue of 21GB of RAM being sidelined for hours, for no apparent reason.
  1. If I got it right, the basic VM algorithms did not change fundamentally in the past decades; they are general-purpose, trying to perform well for most use cases, and have several implicit assumptions. Your workload should be covered well, but you're violating one of the implicit assumptions: that swap size >= RAM is available. Thus you may gain better performance by tuning some VM knobs.
  2. The wiki says that for inactive memory "Pages are scanned by the page daemon (starting from the head of the queue) when there is a memory shortage [...]". ==> there was no memory pressure.
  3. You could also try setting kern.maxusers to 1/2x or 1.5-2x of the current value (in loader.conf(5)). I noticed it is a prime number, so you may use e.g. primes 500 550 to grab one from a range.
  4. Test your changes (ministat(1), hwpmc(4), hwloc(7) etc.pp.)
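For point 3, a sketch of picking a prime and setting the tunable (509 is just one prime from the suggested range):

```shell
# primes(6) lists primes starting from the first bound
primes 500 550

# Set the chosen value as a loader tunable (reboot required)
echo 'kern.maxusers="509"' >> /boot/loader.conf
```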
 
Generally inactive (standby cache) usage is fine, because the memory can be immediately allocated to binaries that request it.

I think the issue stems from people not being happy with cache usage when it is combined with a poor swap algorithm that swaps out memory early while there is cache that could be freed instead, or even when there is lots of totally unused memory (I've seen it happen). Some of us think swap should only be used for disaster avoidance (to avoid OOM); others feel it is preferable to swap out what is believed to be dormant memory in preference to flushing some cache. Windows is by far the worst OS for this - horrible default behaviour and untunable. Linux is the best, with its tunable 'vm.swappiness' sysctl. FreeBSD's default behaviour is the best of the three defaults, but I would like it to be tunable.
 
[...] Some of us think swap should only be used for disaster avoidance (to avoid OOM); others feel it is preferable to swap out what is believed to be dormant memory in preference to flushing some cache. Windows is by far the worst OS for this - horrible default behaviour and untunable. Linux is the best, with its tunable 'vm.swappiness' sysctl. FreeBSD's default behaviour is the best of the three defaults, but I would like it to be tunable.
It is tunable via sysctl -d vm.swap_idle_{enabled,threshold{1,2}}
Code:
vm.swap_idle_enabled: Allow swapout on idle criteria
vm.swap_idle_threshold1: Guaranteed swapped in time for a process
vm.swap_idle_threshold2: Time before a process will be swapped out
From RTFM tuning(7): The vm.swap_idle_enabled sysctl is useful in large multi-user systems where you have lots of users entering and leaving the system and lots of idle processes. Such systems tend to generate a great deal of continuous pressure on free memory reserves. Turning this feature on and adjusting the swapout hysteresis (in idle seconds) via vm.swap_idle_threshold1 and vm.swap_idle_threshold2 allows you to depress the priority of pages associated with idle processes more quickly then the normal pageout algorithm. This gives a helping hand to the pageout daemon. Do not turn this option on unless you need it, because the tradeoff you are making is to essentially pre-page memory sooner rather than later, eating more swap and disk bandwidth. In a small system this option will have a detrimental effect but in a large system that is already doing moderate paging this option allows the VM system to stage whole processes into and out of memory more easily.
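Enabling it looks like this; the threshold values below are illustrative, in idle seconds:

```shell
# Allow swapout based on idle criteria
sysctl vm.swap_idle_enabled=1
# Guaranteed resident time before a process becomes eligible (seconds)
sysctl vm.swap_idle_threshold1=2
# Idle time after which a process will be swapped out (seconds)
sysctl vm.swap_idle_threshold2=10
```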
 
For those who stumble across this: inactive memory is the file cache. It is memory which is associated with a vnode (virtual file system id of a file on the system) and has known content. So instead of re-reading that part, the pages get re-activated from inactive.

Now if you have memory which is user-code mapped and has not been touched for some time, that gets dropped to swap, and those pages can then be used as file cache (yes, inactive memory) to speed other things up.

TL/DR: Inactive memory good. Free memory bad.
 
It is tunable via sysctl -d vm.swap_idle_{enabled,threshold{1,2}}
Code:
vm.swap_idle_enabled: Allow swapout on idle criteria
vm.swap_idle_threshold1: Guaranteed swapped in time for a process
vm.swap_idle_threshold2: Time before a process will be swapped out
From RTFM tuning(7): The vm.swap_idle_enabled sysctl is useful in large multi-user systems where you have lots of users entering and leaving the system and lots of idle processes. Such systems tend to generate a great deal of continuous pressure on free memory reserves. Turning this feature on and adjusting the swapout hysteresis (in idle seconds) via vm.swap_idle_threshold1 and vm.swap_idle_threshold2 allows you to depress the priority of pages associated with idle processes more quickly then the normal pageout algorithm. This gives a helping hand to the pageout daemon. Do not turn this option on unless you need it, because the tradeoff you are making is to essentially pre-page memory sooner rather than later, eating more swap and disk bandwidth. In a small system this option will have a detrimental effect but in a large system that is already doing moderate paging this option allows the VM system to stage whole processes into and out of memory more easily.
Thanks for this recommendation. We have poudrière running in a jail here. When it is building ports for days (say chromium, firefox, rust, and a few other memory-hog packages), the base OS throws swap_pager errors and we end up power-cycling the box.

We are now trying out rctl(8) to limit resources available to the jail. If it works, great; this tuning is also worth trying.
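Assuming the tool meant here is rctl(8), a sketch of capping a jail's resident memory (the jail name and limit are placeholders):

```shell
# Resource accounting must be enabled first (loader tunable, reboot needed)
echo 'kern.racct.enable=1' >> /boot/loader.conf

# Deny allocations once the jail's resident memory exceeds 24 GiB
rctl -a jail:poudriere:memoryuse:deny=24g

# Show current usage for that jail
rctl -u jail:poudriere
```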
 
I'm still working through MySQL 5.6 on FreeBSD 12.1 and swap usage and inactive memory. This is my current situation on a machine with 32GB RAM:
Code:
last pid: 28620;  load averages:  0.36,  0.40,  0.40   up 11+14:17:15  10:08:08
89 processes:  1 running, 88 sleeping
CPU:  3.4% user,  0.0% nice,  0.1% system,  0.0% interrupt, 96.5% idle
Mem: 2791M Active, 23G Inact, 2018M Laundry, 2672M Wired, 1572M Buf, 1220M Free
Swap: 3840M Total, 670M Used, 3170M Free, 17% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
68732 mysql        30  20    0    10G  8780M select   6  20.3H  50.99% mysqld
So on this server I've given MySQL memory - probably too much memory - but it all works fine all day long.

Then a mysqldump process runs periodically, and although there's lots of inactive memory, usually (but not every time!) one or two percent dribbles to swap.

Every time I restart mysqld, swap goes back to near enough zero, so mysqld is definitely the process that has been given swap.

I'm finding it hard to reproduce on other machines, but I can definitely mimic what happens if swap reaches 100% - nothing good.

I think what is happening is that there is sudden memory pressure during the backup, and something (jemalloc? page daemon?) decides it hasn't got time to do the laundry or free up inactive pages, so it swaps. Or allocates pages out of the swap?

Now if you have memory which is user-code mapped and has not been touched for some time, that gets dropped to swap, and those pages can then be used as file cache (yes, inactive memory) to speed other things up.

Does that mean some of the inactive pages that belong to MySQL are swapped out, and some of the "inactive" space is given to MySQL for the immediate memory demands?

And that's why neither inactive memory nor swap decrease in size - the swapped out pages are the older inactive MySQL pages, so still counted as part of that process? And the newer, under memory pressure, pages are in the current inactive pages?

I'm looking at trying tcmalloc as one option, and I'll also look at mysqltuner and give MySQL less memory to play with - but keen to understand what I'm seeing.
 
You have to gather memory usage data over time to make any conclusions. Swap usage normally indicates some usage spikes, but there's no way to tell what happened from a single top output.

Swapped pages get deallocated when they are no longer needed (≈ when process terminates) or when something else needs to be swapped out. Otherwise they will stay forever. Can be cleared by removing and adding the swap device with swapoff/swapon.
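Cycling the swap device as described might look like this (the device name is hypothetical; make sure enough RAM is free to absorb the swapped-out pages first):

```shell
# Show swap devices and current usage
swapinfo -h

# Disable then re-enable the device; swapoff pages everything back into RAM
swapoff /dev/ada0p3
swapon /dev/ada0p3
```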

Having said all that, we used to have an older FreeBSD system that gradually consumed swap space over time for no apparent reason even though there was still over a hundred GB RAM free.
 
You have to gather memory usage data over time to make any conclusions. Swap usage normally indicates some usage spikes, but there's no way to tell what happened from a single top output.
Thank you.

My top output shows the current situation after running for 11 days and 14 hours.

At the start there was no swap used (or maybe ~52M or something like that.)

When mysqldump runs, then usually (90% of the time) near the end of that, swap has increased by 1% or so. I have a script running every minute that stores top & swap usage so I can tie the swap increase to mysqldump running.
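A minimal sketch of such a per-minute logger (run from cron; the log path and line counts are arbitrary):

```shell
#!/bin/sh
# Append a timestamped memory/swap snapshot; schedule with:
#   * * * * * /usr/local/bin/memlog.sh
LOG=/var/log/memlog.txt
{
  date
  top -b | head -12   # batch mode: summary header plus the top processes
  swapinfo -h
  echo
} >> "$LOG"
```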

Restart mysqld and the swap is released.

So my usage spikes are when mysqldump is running but it seems to also be an interaction with what else MySQL is doing at the time. When all quiet (or on a dev. copy) mysqldump works with no spike and no swap usage. On production, the load will depend on what the users are up to.

It looks like MySQL is asking for more memory to do something when mysqldump is running, and although there seems to be lots of "Inactive" RAM and even a bit of Free RAM, jemalloc/page daemon decides the only thing to do is swap.

Easiest solution is to give MySQL less memory (the swap issue only started when I tried to "optimise" things by giving InnoDB a bigger buffer pool) and/or try tcmalloc, but I'm keen to understand what I'm seeing. Does it sound possible that the memory request(s) are making "something" decide that swap is the only option? I think the answer is yes.
 
What is hitting swap is memory that was unused for longer than the file contents. The kernel pushes out stuff like that xterm you have on the 27th workspace and uses that memory for cache.

Inactive memory is half-free memory. It can be claimed, but the cost is a potential disc access in the near future.

All is good with your system, you won't find a problem because there is none.

Edit: there is one, but it is the reaper being summoned. Not that inactive grows.
 
Thanks - but when that swap hits 100% then a process (most likely mysqld) will get killed - that is a problem for me.

If I do nothing the swap will reach 100%. I've not had the nerve to wait that long on the production box - I have restarted mysqld to release all the swap space used - but on test machines I've seen what happens if I let swap get to 100%: along comes the reaper and, chop, a process gets terminated.

There is a burst of pressure on memory allocation and the OS decides to allocate it out of swap instead of RAM.

22.32 last night - swap at 14%, 23G Inactive, 684M free, mysqldump process well under way:
Code:
Tue Oct 13 22:32:00 NZDT 2020
last pid: 17648;  load averages:  1.39,  0.71,  0.38  up 11+02:41:07    22:32:00
70 processes:  2 running, 68 sleeping
CPU:  0.8% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.1% idle
Mem: 2608M Active, 23G Inact, 2172M Laundry, 2724M Wired, 1576M Buf, 684M Free
Swap: 3840M Total, 575M Used, 3265M Free, 14% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND 
17606 root          1 103    0    29M    13M CPU8     8   1:52 100.00% mysqldump
68732 mysql        30  20    0  9810M  8775M select   9  19.5H  44.87% mysqld

Two minutes later - 2% more in swap:
Code:
Tue Oct 13 22:34:00 NZDT 2020
last pid: 17682;  load averages:  1.13,  0.85,  0.48  up 11+02:43:07    22:34:00
84 processes:  2 running, 82 sleeping
CPU:  0.8% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.1% idle
Mem: 6771M Active, 19G Inact, 2207M Laundry, 2713M Wired, 1575M Buf, 692M Free
Swap: 3840M Total, 622M Used, 3217M Free, 16% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
17606 root          1  52    0    36M    20M RUN     13   3:36 100.00% mysqldump
68732 mysql        31  20    0  9819M  8735M select   9  19.5H  20.46% mysqld

Inactive has dropped - 4G released ... mysqldump finishes, then there are some gpg and tar steps - not much time passes, and within ten minutes inactive becomes Free (but starts to get used immediately, which is a good thing - we don't want unused RAM):
Code:
Tue Oct 13 22:44:00 NZDT 2020
last pid: 17829;  load averages:  0.94,  1.04,  0.78  up 11+02:53:07    22:44:00
53 processes:  3 running, 49 sleeping, 1 zombie
CPU:  0.8% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.1% idle
Mem: 6700M Active, 3507M Inact, 2208M Laundry, 2427M Wired, 1351M Buf, 17G Free
Swap: 3840M Total, 622M Used, 3217M Free, 16% Inuse
 
Sorry, you are right. Summoning the reaper is a problem. Inactive should be freed at once, I did not see this condition on my machines at any time. When the reaper turned up, swap was full and inactive was down to some kB. And then he took out my login process... freeing a lot of resources.
 
The kernel pushes out stuff like that xterm you have on that 27th workspace and uses that memory for cache.
In earlier FreeBSD releases, the kernel did that proactively. The reason was to free up memory in advance, so it could be made available quickly if needed. Typically, on a machine running X11, after a while (several days, IIRC) you could see that the getty(8) processes that were started by init(8) on the unused VTYs (syscons) were swapped out, even though there was plenty of free memory available. These swapped processes were displayed in square brackets in ps(1).

Current FreeBSD versions (at least 12.x) don’t do that anymore by default. I have a stable/12 machine here with three months of uptime, but the getty(8) processes have not been swapped out, even though they were never used. I think the feature can be re-activated with sysctl if needed.
 
Thanks for this recommendation. We have poudrière running in a jail here. When it is building ports for days (say chromium, firefox, rust, and a few other memory-hog pkgs), the base OS throws swap-page error and we end up power-cycling the box.
Not sure, but I think poudriere can be configured to use TMPFS (memory file system) for temporary build files. This might cause memory shortage when building large packages with many dependencies.
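poudriere's tmpfs use is set in poudriere.conf; a sketch of dialing it back (option names as documented in the sample config):

```shell
# /usr/local/etc/poudriere.conf
USE_TMPFS=no        # or "wrkdir"/"data" to keep tmpfs only for small dirs
# TMPFS_LIMIT=8     # optional tmpfs cap in GiB when tmpfs stays enabled
```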
 