swapinfo avail and swap partition size mismatch by a factor of 4.

Therefore, if such a restriction doesn't solve that problem on your machine immediately, something else must be configured in an unusual way.
Yes, and this "unusual configuration" is simply having >= 32GB RAM to make this issue very obvious.
Very few desktop users have 32GB or more RAM installed.
Many users ridiculed me just because they did not experience this issue with their low-RAM configurations and apparently could not imagine such absurd system behaviour.

I investigated that issue in depth here in the forums years ago; I forget whether it was on 10 or 11.
ZFS ARC was always cut down to a defined maximum, so this should not be a factor.
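(For reference, I cap it with the usual ZFS loader tunable; the 4G value here is just an example:)
Code:
# /boot/loader.conf
vfs.zfs.arc_max="4G"    # limit the ZFS ARC to at most 4 GB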

Back then, this behaviour was confirmed by several users who also had >= 32GB installed.
The only way I know of to keep the system snappy, without the annoying lags when it swaps back in stuff that it swapped out "just because", was to disable swap entirely.
This is not good practice, but it is manageable if you take care that some amount of memory, say ~5-10GB, always stays free.
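(On FreeBSD that boils down to something like the following; to keep it off across reboots you would also comment out the swap line in /etc/fstab:)
Code:
# turn off all swap devices listed in /etc/fstab
swapoff -a
# verify nothing is paging to disk anymore
swapinfo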

cfs : Very good suggestion! This would also help avoid the extreme disk thrashing while swapping things back in.
But I would definitely prefer a sysctl like swappy='off' to restore the original UNIX swapping behaviour.
 
getting my use case called "ridiculous"
You didn't describe a "use case", as there's no description of how this ridiculous amount of swap could ever be used in a sane way. And yes, it's impossible.
The sustained random read latency of 3D XPoint today is around the speed RAM had in the mid 90s.
We're talking about access times of at least a few µs, compared to modern RAM in the single-digit ns range, so that's a factor of roughly 1000. Sure, if your swap device is as fast as possible, this helps a bit in situations where swapping is unavoidable. Still, heavy constant swapping kills the performance of any machine with any workload.

As mentioned above, no matter how quickly you can access a page in swap (where 1000 times slower than a memory cell is the current optimum), it can't be used from there. It must be copied to RAM first, and another page must be copied to swap to make room for it.
 
Yes, and this "unusual configuration" is simply having >= 32GB RAM to make this issue very obvious.
That's very unlikely. My server with 64GB, many virtual machines and jails, which is also used for desktop stuff remotely, doesn't use any swap unless there's only very little free RAM (around 1GB). I monitor that closely because I sometimes run large builds on it and want to make sure it can stand the load.
 
You and many others simply seem not to understand that this sick swap behaviour is connected with low system load/activity.
Servers like yours are typically not quasi-idle like a desktop.
 
Sorry, that's nonsense as well. My server is idling most of the time, as it only "powers" a private house with two parties.

It's not that I doubt what you observe, but I very much doubt the conclusions you present.
 
You didn't describe a "use case", as there's no description of how this ridiculous amount of swap could ever be used in a sane way. And yes, it's impossible.

It's hard for me to see the return on investment of your strategy of digging in hard on that. But ok, let's play.

"The really short version of the story is that Varnish knows it is not running on the bare metal but under an operating system that provides a virtual-memory-based abstract machine. For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it."

Quoted from "You're Doing It Wrong" by Poul-Henning Kamp (ACM Queue), 10 years ago.

The article is not a 1:1 mapping to our discussion, but that is the VM subsystem, and it is swapping. And in any case, if you are willing to make hard absolute statements, I imagine you are willing to put in the effort of doing some research for a rebuttal, so I won't do that work for you ;-).

-Cristian
 
*Sigh*
It seems to be of no use to talk about this topic with people who either are not developers who know the details of how FreeBSD swapping works, or have never experienced using a 32+GB desktop.

Whatever, when I have enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).
 
I want to thank Snurg and Mjölnir for actually addressing my question. I have to say, coming back to the forums after several years and getting my use case called "ridiculous", in the first sentence of the first answer, by a moderator, was not a great experience.
He didn't call your use case "ridiculous", but the configuration with that unusually large amount of swap.
And there was no other content in the message.
You could have provided that. Or do it now; we're curious...
 
  • I am a developer, I graduated in operating systems at university, and I had to write Linux kernel code back then. No, I haven't looked deeply into the FreeBSD kernel code so far, but I do know how virtual memory management works
  • I do use a "32+GB desktop". This server has a full install of KDE and several applications, and that's used regularly. You don't want to tell me that using it remotely changes swap behavior…
All in all, your conclusions just make no sense. Something else must be wrong in your setup.
 
Whatever, when I have enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).
sysctl -d vm.swap_idle_{enabled,threshold{1,2}}
Code:
vm.swap_idle_enabled: Allow swapout on idle criteria
vm.swap_idle_threshold1: Guaranteed swapped in time for a process
vm.swap_idle_threshold2: Time before a process will be swapped out
EDIT: To work around your issue, you could also reduce your RAM & send me the RAM modules that you pull out ;) According to some info I stumbled upon on the Internet, I can put a 16 GB RAM module into the memory slot of my ThinkPad, but I only have 4GB soldered onto the MB + 8 GB in the slot... Thx in advance!
EDIT2: And send more money fast! I need a new M.2 LTE modem card!
 
"The really short version of the story is that Varnish knows it is not running on the bare metal but under an operating system that provides a virtual-memory-based abstract machine. For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it."
Given the context you put this quote in, do you understand the article? Because, yes, it is possible to write (special-purpose!) software actively "exploiting" virtual memory. Doing so, it has to make sure swapping in/out is reduced to a minimum. The article describes (among other things) how that is achieved.

Now, this special-purpose software is a high-performance cache for potentially huge amounts of data. Do you want to run that? If not, the conclusion that you could in any way benefit from such a huge swap space is badly flawed.
 
Given the context you put this quote in, do you understand the article? Because, yes, it is possible to write (special-purpose!) software actively "exploiting" virtual memory. Doing so, it has to make sure swapping in/out is reduced to a minimum. The article describes (among other things) how that is achieved.

Now, this special-purpose software is a high-performance cache for potentially huge amounts of data. Do you want to run that? If not, the conclusion that you could in any way benefit from such a huge swap space is badly flawed.

Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"? Progress! ;-).

I think my use case is as old as the hills. I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result. The problem is the opposite of "embarrassingly parallel", meaning it can't work by processing pieces of the input at a time. We could re-write the program to use mmap, but that is not a trivial change, and it would be more expensive than paying $134 for 1 TB of NVMe.

-Cristian
 
sysctl -d vm.swap_idle_{enabled,threshold{1,2}}
EDIT: To work around your issue, you could also reduce your RAM
The idea might be good: disable idle swapping and set the thresholds to a few days...
But if I remember correctly, it was the regular pager that I was unable to convince to stop swapping out.
I'll try that :)
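Probably something along these lines (untested; assuming the thresholds are in seconds, as the small defaults suggest):
Code:
# keep idle-based swapout disabled (the default anyway)
sysctl vm.swap_idle_enabled=0
# or, when experimenting with it enabled, push both thresholds out to ~3 days
sysctl vm.swap_idle_enabled=1
sysctl vm.swap_idle_threshold1=259200
sysctl vm.swap_idle_threshold2=259200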

But, honestly, I don't want to lose memory :)
These are old 4GB PC3-10600R modules; I bought a dozen of them cheaply years ago for <2 euros/GB and filled up my PC.
Non-parity RAM is way more expensive... ouch :)
 
The idea might be good: disable idle swapping and set the thresholds to a few days...
The thing is, it's disabled by default. And enabling it should only lower the priority of idle processes' pages faster, so they can be swapped out earlier. It shouldn't change anything about not swapping out pages unless there's a (foreseeable) need…
 
Zirias, OT, but IIUC from the conversations on <freebsd-hackers>, someone is currently reworking the OOM-killer. Plus IMHO, with these new low-latency NVRAM storage technologies Optane & NVMe, the BeaSD's VM swap implementation could be enhanced to honour a swap device priority and stagger swap devices. And now with that review incident... IMHO you're the perfect candidate to either help with coding or writing tests or review that stuff.
EDIT: If a noob like me can get an account on Phabricator quickly, you, as a proven ports(7) author (& maintainer?), should get one even quicker.
 
Plus IMHO, with these new low-latency NVRAM storage technologies Optane & NVMe, the BeaSD's VM swap implementation could be enhanced to honour a swap device priority and stagger swap devices.
Nothing against making swap perform better. But it's impossible to solve a few underlying problems, so (yes, in the absence of a special-purpose program carefully choosing its data structures in a way that minimizes page faults) heavy swapping will always be a performance killer. In fact, the mentioned program's design actually avoids heavy swapping while still allocating huge amounts of virtual memory.
IMHO you're the perfect candidate to either help with coding or writing tests or review that stuff.
Probably not, because knowing the theory and having touched some kernel code at some time is far from enough to be qualified for such reviews. I'd need to invest a lot of time first to get familiar with the FreeBSD kernel :eek:
 
That's a ridiculous amount, completely overkill.
I disagree. It depends on what the application wants to do.

There are scenarios where it is a lot cheaper to pull in a precomputed working set from swap than to build it anew (and the application may not be designed to write it to a file). Then, when there are multiple such working sets, used only occasionally and not in parallel, you have the use case.

I have 16GB of swap for a machine with 96GB of memory. That's more than enough.

On smaller machines I always had more swap in active use than memory installed. There is obviously a delay when switching to another application, but that is faster than starting the individual applications only when used.

Sadly, with Rel.12 this doesn't work well anymore, because the knobs to tune swapping behaviour have disappeared (they are now in the NUMA stuff, individual per NUMA domain, and not accessible at runtime).
Edit:
The maximum swap size had been increased recently because of the well-known issue that FreeBSD by default has very high swappiness and likes to swap out literally the whole RAM in some scenarios. For this reason it can be bad if the swap is smaller than RAM.
That's by design of the VM system (and not only FreeBSD): every memory page is logically mapped to some diskspace.
But I have never seen FreeBSD actually make use of it. It always shrinks the ARC first - which is not what I want, because when the ARC is gone, all disk access gets slow, whereas after paging out one application and pulling in another, things would continue to work normally after that delay (if the apps aren't used in parallel).
Up to Rel.11 one could set different thresholds for ARC shrinking and for pageout, but -as said above- no longer on Rel.12.

There are many users who would prefer the traditional simple Unix swap method of only swapping when actually needed (e.g. swappiness = 0, while default FreeBSD behaves like swappiness = 99).

How do you make that happen? Mine won't swap unless needed.

Your reply is a typical example of the denial of the fact that FreeBSD actually swaps "just because".
With ARC set to a sensible maximum, say, 1GB, there is still a lot of swapping going on even when, say, only 20% of memory is actually used (i.e. not "free memory"). It is just insane that FreeBSD manages to fill a 2GB swap partition when only ~8GB of the 48GB RAM has ever been in use since boot.
What are you doing? My desktop has only 8G, no ARC limit, and no swapping. Except when building two llvm in parallel, and then only a few MB.
Only when running for days, over time a few things are moved to swap.

You don't help the OP by making up usecases that you think would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.
On a smaller scale I had these use cases, and they did work.
2G mem installed and ~5G swap in use. Certainly much slower than with all-RAM available, but still usable. Now, given current solid-state storage, it may well work at bigger scales also. And the price difference between 64G and 256G RAM is significant.
 
Probably not, because knowing the theory and having touched some kernel code at some time is
enough to @least write or outline tests & constraints. The underlying theory did not change for decades, IIUC.
far from enough to be qualified for such reviews.
Pppffhhh... You would be surprised how many beginner's bugs even a wizard guru is able to commit @4:00 A.M. (local time).... :eek:
I'd need to invest a lot of time first to get familiar with the FreeBSD kernel :eek:
Not true. VM has been VM for decades, and being a noob can even be advantageous, because that noob guy asks nasty questions that the wizard might forget when s/he's in the flow.
 
What are you doing? My desktop has only 8G, no ARC limit, and no swapping. Except when building two llvm in parallel, and then only a few MB.
Only when running for days, over time a few things are moved to swap.
I hate rebuilding my desktop session, so I just sleep the PC instead of starting it up/shutting it down every day.
So there are quite long uptimes in which swap usage and inactive/laundry memory grow to insane amounts.
I usually reboot only after updates.
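You can watch it grow with the usual tools, e.g.:
Code:
% swapinfo -h
% top -b | head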
 
Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.

Then you can experience the disruptive feeling when your system suddenly starts to swap in gigabytes that the friendly VM swapped out through the night, and you have no idea whether it will become responsive again in seconds, or whether it is better to go smoke one or make a tea, as this swap-in can sometimes take quite a while.
Ah, now it becomes clearer. (I don't leave the desktop on overnight anymore.)

Yes, it does that - it moves things out to swap after a very long time (hours/days), so this is not related to vm.swap_idle* (which talks about a few seconds).
I don't know where this is controlled or if it can be tuned, but I would agree that it should be fixed or made tuneable: if the machine can run with this working set today, it should also be able to run with it tomorrow in the same fashion.

Anyway, I can confirm the effect from a different angle: I used to tune my server (contrary to the desktop) to page out ASAP and not shrink the ARC. Then with R.12 the respective knob was gone. Consequently, on the first day after boot things were not to my liking (under load the ARC was too small to fill up the L2ARC), and I considered putting in more RAM. But after 2-3 days it had levelled out; swap usage was nominal as before, and so was the behaviour.

Whatever, when I have enough time and brain free and am in the mood, I'll look at the memory/swap management code and try to find out what needs to be patched to implement a swappiness sysctl, or maybe a build option for zero swappiness (probably easier).
Wow, have fun with that. (If You make it tuneable in both directions, I'm interested.)
 
I hate rebuilding my desktop session, so I just sleep the PC instead of starting it up/shutting it down every day.
So there are quite long uptimes in which swap usage and inactive/laundry memory grow to insane amounts.
I usually reboot only after updates.
Aha. That whole suspend/resume stuff is not 100% sound (very likely also due to broken ACPI BIOSes). I have numerous issues; unfortunately, you guys keep posting interesting stuff here & I don't find the time to write halfway qualified bug reports (I also have the pride to @least try to find a fix or workaround)... So would you agree to just reboot once a week?
 
Aha. That whole suspend/resume stuff is not 100% sound (very likely also due to broken ACPI BIOSes). I have numerous issues; unfortunately, you guys keep posting interesting stuff here & I don't find the time to write halfway qualified bug reports (I also have the pride to @least try to find a fix or workaround)...
Yes, it's sometimes hard to find all the suspend/resume breakers.
But it's sort of a challenge, too.

So would you agree to just reboot once a week?
Uh-uh. Question: how often should one update for a reasonably safe machine?
Code:
% uptime
7:45PM  up 35 days,  5:54, 8 users, load averages: 0.36, 0.66, 0.64
%
I think you are right, I should update and reboot more often... last reboot was when uptime >60 days.

Edit: Actual uptime is way less, as I sleep the PC multiple times every day.
 
Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"? Progress! ;-).

I think my use case is as old as the hills. I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result.
I smell FORTRAN...
We could re-write the program to use mmap, but that is not a trivial change, and it would be more expensive than paying $134 for 1 TB of NVMe.

-Cristian
Do you read it all at once? If yes, mmap will not change much. If you re-read it multiple times, a tmpfs may help.
FreeBSD is good at caching file content, and so is ZFS. We need more details about what you are doing.
Until then, swapping on multiple NVMe devices is your second-best way. The best one is much more RAM.
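FreeBSD interleaves paging across all active swap devices, so listing several in /etc/fstab should be enough; the device names here are made up:
Code:
# /etc/fstab -- one line per NVMe swap partition (example device names)
/dev/nvd0p2  none  swap  sw  0  0
/dev/nvd1p2  none  swap  sw  0  0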

And there is a difference in VM details between FreeBSD and Linux. I read those codebases some time ago.
 
Ok, so we moved away from "ridiculous" and "impossible" to "huge" and "if not"? Progress! ;-).
No, the fact that a single application working in a huge virtual address space can perform in an acceptable manner if it is written in a way that is aware of the problem (simplest possible case: "sequential processing") just doesn't invalidate the generic reasoning. Add some other memory-hungry processes to the picture and you're in the "heavy swapping" scenario again. And the use cases made up in this thread still just won't work.

So, instead of acting offended by people telling you how bad that idea is, you could just have explained quickly that this is for a special-purpose host/application.

I think my use case is as old as the hills. I have a big data set and a program that works by loading it, doing processing for a long time, and then writing the result. The problem is the opposite of "embarrassingly parallel", meaning it can't work by processing pieces of the input at a time. We could re-write the program to use mmap, but that is not a trivial change, and it would be more expensive than paying $134 for 1 TB of NVMe.
TBH, reading this article about the design of that web cache, mmap(2) was the very first thing I had in mind as a means to make something like this, taking advantage of the 64-bit address space, work without a very weird system configuration. Just maybe, there might be a small performance penalty for having to go through the filesystem (although I don't think it would be relevant; you typically have to test such things to be sure…)

In any case, if you can't make sure there is some "access pattern" (to the memory pages) in your processing that isn't completely random, the resulting performance will be just as bad as in your common memory-pressure situation.
 