swapinfo avail and swap partition size mismatch by a factor of 4.

Yes, it does that - it does move things out to swap after a very long time (hours/days), so this is not related to vm.swap_idle* (which operate on timescales of a few seconds).
I don't know where this is controlled or if it can be tuned, but I would agree that it should be fixed or made tunable: if the machine can run with this working set today, it should also be able to run with it tomorrow in the same fashion.
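If someone wants to look at those knobs on their own box, here is a minimal sketch that just reads the vm.swap_idle_* values from tuning(7) via sysctlbyname(3). It only inspects them; as said above, these do not seem to be what controls the long-term swap-out discussed here.
```c
/* Minimal sketch: read the vm.swap_idle_* tunables via sysctlbyname(3).
 * Compile on FreeBSD with: cc -o swapidle swapidle.c */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	const char *names[] = {
		"vm.swap_idle_enabled",
		"vm.swap_idle_threshold1",	/* seconds */
		"vm.swap_idle_threshold2",	/* seconds */
	};

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int val;
		size_t len = sizeof(val);

		if (sysctlbyname(names[i], &val, &len, NULL, 0) == -1)
			perror(names[i]);
		else
			printf("%s = %d\n", names[i], val);
	}
	return (0);
}
```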
I still think what you see here is the effect of other things, one candidate being the "periodic" jobs. They are I/O heavy, so ARC (and other caches as well) will want RAM. Maybe some of them also need RAM directly. Then of course, if free memory runs short, the pages of all these long-idling processes are swapped out first.

This shouldn't even be a problem if whatever needs the RAM reliably gave it back when needed – then you'd of course notice when using those idle applications, but only briefly. But that's what I described earlier: ARC (up to 12.x) was very reluctant to really give back memory. Maybe it isn't the only example.
 
And the use-cases made up in this thread still just won't work
I think you're referring to what I came up with? Please elaborate, and keep in mind that at one place I explicitly noted locality (i.e. in terms of memory access). As for the other use cases -- well, these are explicitly mentioned in tuning(7), and FMLU the mathematical facts underlying CS didn't change in the last decades...
 
Sorry Mjölnir, the two cases I've seen from you were
  • tmpfs, which doesn't make sense, because if at all, it can only swap out file contents, and if that were something happening regularly, using tmpfs would be moot, because a regular on-disk filesystem would perform better.
  • heavy multi-user systems, well, partially, because adding a lot of swap there only serves to prevent the OOM killer, so the system keeps running (somehow), but performance will be bad. The tunables you mention are a partial mitigation, because if you swap out idle processes more aggressively, the chances that actively used processes get stalled by heavy swapping are slightly lower.
Did I overlook something else? :-/
 
Not sure if this is an example of what you are talking about, but MySQL imports and exports can cause MySQL to eat up swap (and then get killed by the OOM killer) on machines with 32G RAM, with most of that memory in the inactive state. And I see this on machines without ZFS, so I don't think it's related to ZFS/ARC. This is on 12.x, sometimes 11.x. In my brief testing so far, 13.x seems to handle it a lot better - swap doesn't get touched.
 
I also mentioned
  • running applications with a large RAM footprint & high RAM locality, which perform computations on chunks of subsets of their data (at least ~seconds/chunk) - see the sketch after this list.
  • On those "heavy multi-user systems", the manpage tuning(7) & I explicitly noted lots of idle processes. You might have overlooked that small but important detail.
  • On tmpfs(5), I refined that to occasional spikes of large usage, which is not unreasonable IMHO.
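To make the first bullet concrete, here is a minimal sketch of that access pattern (CHUNK_SIZE, the sizes, and process_chunk() are invented for the example): a large anonymous allocation is worked through chunk by chunk, and madvise(2) hints let the pager prefetch the next chunk and deprioritize the finished one, so paging can overlap with computation instead of stalling it.
```c
/* Illustrative sketch of the "large footprint, high locality" use case:
 * process a big anonymous allocation one chunk at a time and hint the
 * pager with madvise(2).  CHUNK_SIZE and process_chunk() are invented. */
#include <sys/mman.h>
#include <stddef.h>

#define CHUNK_SIZE	(256UL * 1024 * 1024)	/* 256 MB per chunk */

static void
process_chunk(unsigned char *p, size_t len)
{
	/* placeholder workload: touch every page of the current chunk */
	for (size_t i = 0; i < len; i += 4096)
		p[i]++;
}

int
main(void)
{
	size_t total = 16 * CHUNK_SIZE;		/* 4 GB working set */
	unsigned char *buf = mmap(NULL, total, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (buf == MAP_FAILED)
		return (1);

	for (size_t off = 0; off < total; off += CHUNK_SIZE) {
		/* ask the VM to bring the next chunk in while we work */
		if (off + CHUNK_SIZE < total)
			madvise(buf + off + CHUNK_SIZE, CHUNK_SIZE,
			    MADV_WILLNEED);

		process_chunk(buf + off, CHUNK_SIZE);

		/* tell the VM this chunk won't be needed for a while */
		madvise(buf + off, CHUNK_SIZE, MADV_DONTNEED);
	}
	munmap(buf, total);
	return (0);
}
```
Whether hints like this pay off depends entirely on how predictable the access pattern is, which is exactly the locality point above.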
 
In any case, if you can't make sure the "access pattern" (to the memory pages) in your processing is something other than completely random, the resulting performance will be just as bad as in your common memory-pressure situation.

I honestly think you are wrong. I believe you are underestimating what these nvme technologies can do today. They can give you 800 Mb/s of sustained random reads, today, at commodity hardware prices. This is literally swapping from RAM to slower RAM. The latency response at high queue depths does not resemble what you are used to for storage at all. They behave like RAM.

Earlier SSDs, from 5 to 10 years ago, did not behave like that. At low queue depths they had great latency response, but that dropped very quickly as the queue depths increased. They were constrained by their controllers and by the SATA bus and protocols. Until about 3 years ago, Intel Optane / 3D XPoint was the only really different one in terms of actual sustained random read behavior resembling RAM when used over NVMe/PCIe. They are still number one, but some of the higher-end commodity drives are getting close. And by the way, these drives are not good at everything: 3D XPoint is slower for sequential write performance, so if your use case is accelerating a database redo log, you want something else. But yay sustained random read.

-Cristian
 
I believe you are underestimating what these nvme technologies can do today. They can give you 800 Mb/s of sustained random reads, today, at commodity hardware prices. This is literally swapping from RAM to slower RAM. The latency response at high queue depths does not resemble what you are used to for storage at all. They behave like RAM.
The limiting factor is not only the transfer rate but indeed the access time. Looking for numbers, I found these are in the single-digit µs range for those modern drives, which is awesome, but still a factor of 1000 away from modern RAM (single-digit ns range). So the performance loss from swapping is still substantial when compared to just accessing physical RAM.

The other question would be whether this could still be "acceptable" in practice with this kind of modern hardware. A definitive answer to this is only possible by testing/benchmarking, but let's do the maths for a hypothetical scenario anyways:

Let's assume this (uncommon) scenario of a single application working in a huge allocation, and let's assume it needs 1GB of pages that are currently swapped out. In this scenario we'll probably have many "superpages" of size 2M (on amd64), but as FreeBSD splits superpages back into regular 4k pages under memory pressure, we'll also have some of those.

A lot of assumptions are needed here; I'll further assume we find 128MB in 32768 regular pages and 896MB in 448 superpages (and a similar split for the pages that must be swapped out to make room). With this, we must transfer a total of 66432 pages to/from swap in order to swap in this 1GB of memory. With an access time of 8µs per page, we end up with 0.53s in sum just waiting for transfers to start. Even assuming some of the pages can be arranged contiguously on the swap device, say it's 0.4s. Add the transfer itself, 2GB at 4GB/s (0.5s), and this makes for a total of 0.9s.

Now, if our application has 300GB to process and can organize its accesses in a way that every piece of memory has to be swapped only once, the additional processing time due to swapping will be ~4.5 minutes. With random and repeated accesses causing every piece to be swapped, say, 5 times, you're already at 22.5 minutes…
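For anyone who wants to redo this back-of-the-envelope estimate, here is a quick sanity check. All inputs are the assumptions from the paragraphs above (the 4k/2M page mix, 8µs access time, the 0.4s contiguity guess, 4GB/s transfer rate), not measurements.
```c
/* Back-of-the-envelope check of the swap-in estimate above.
 * All inputs are assumptions taken from the text, not measurements. */
#include <stdio.h>

int
main(void)
{
	double small_pages = 128.0 * 1024 / 4;	/* 128 MB in 4k pages = 32768 */
	double super_pages = 896.0 / 2;		/* 896 MB in 2M pages = 448   */
	double transfers   = 2 * (small_pages + super_pages);	/* in + out = 66432 */

	double access_s   = transfers * 8e-6;	/* 8 us per page access: ~0.53 s */
	double access_adj = 0.4;		/* the "some contiguity" guess   */
	double transfer_s = 2.0 / 4.0;		/* 2 GB at 4 GB/s: 0.5 s         */
	double per_gb_s   = access_adj + transfer_s;	/* ~0.9 s per 1 GB chunk */

	printf("raw access overhead : %.2f s\n", access_s);
	printf("per 1 GB chunk      : %.2f s\n", per_gb_s);
	printf("300 GB, each once   : %.1f min\n", 300 * per_gb_s / 60);
	printf("300 GB, 5 times each: %.1f min\n", 5 * 300 * per_gb_s / 60);
	return (0);
}
```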

Yes, this is nowhere near the catastrophic figures with old drives. But it's still substantial, although it does look "acceptable", depending on the use case.

Side note: your "typical" memory-pressure scenario caused by just lots of processes will still be a lot worse, because you can assume far more regular 4k pages (2GB entirely in 4k pages already means over 500,000 page transfers), and most of the time you pay the access time for each and every one of them.


If you achieved one thing, it's giving me an appetite for new hardware, although I don't need it for my private desktop. I rest my case that swap can't "replace" RAM; still, these speeds are awesome, of course.
 
Edit:
The maximum swap size had been increased recently because of the well-known issue that FreeBSD by default has very high swappiness and likes to swap out literally the whole RAM in some scenarios. And for this reason it can be bad if the swap is smaller than the RAM.
I would be happy to read about it if somebody points me to this well-known issue.
 
I would be happy to read about it if somebody points me to this well-known issue.
I explained it a couple of times here already. It is not a specific FreeBSD issue, rather a behaviour of (a certain kind of) VMM design. I learned about it when I was working with AIX some 25 years ago. At that time it was simply impossible to run AIX with less swap than RAM.

The native behaviour of the VMM is that it expects every memory location to be backed by some file location, and that it can, at any time, re-fetch the memory contents from that file. For this to be possible, all modified (aka "dirty") memory locations need a place in swap where they can be put (and later re-fetched). Which basically translates to: you need at least as much swap as you have RAM.

Over time this behaviour was gradually remedied on public demand: with installed memory getting bigger and bigger, and swap not even intended to be used because of its slowness, people did not want to reserve huge swap spaces for no practical benefit. And so the VMM was somehow tuned to cope with the situation of not having (enough) swap space. But this is only an add-on, and it doesn't work well under all conditions.
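One place where this kind of accounting is still visible on FreeBSD is the vm.swap_reserved counter (which also comes up further down in this thread). A minimal sketch, under the assumption that this sysctl reports the global reservation in bytes: map some writable anonymous memory and watch the reservation grow, even though nothing has been written to swap yet.
```c
/* Sketch: writable anonymous memory gets "charged" against swap at
 * mapping time.  Assumption: vm.swap_reserved is the global
 * reservation counter in bytes. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdint.h>

static uint64_t
swap_reserved(void)
{
	uint64_t v = 0;
	size_t len = sizeof(v);

	if (sysctlbyname("vm.swap_reserved", &v, &len, NULL, 0) == -1)
		perror("vm.swap_reserved");
	return (v);
}

int
main(void)
{
	size_t sz = 1UL << 30;			/* 1 GB */
	uint64_t before = swap_reserved();

	void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	uint64_t after = swap_reserved();
	/* The delta should be roughly the mapping size: the pages are
	 * reserved against swap although none of them is dirty yet. */
	printf("reserved before: %ju\nreserved after:  %ju\ndelta: %ju\n",
	    (uintmax_t)before, (uintmax_t)after, (uintmax_t)(after - before));

	munmap(p, sz);
	return (0);
}
```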

AFAIK linux does have a different VMM design (not derived from the BSD lineage) that does not have this behaviour.
 
I explained it a couple of times here already. It is not a specific FreeBSD issue, rather a behaviour of (certain kind of) VMM designs.
Sorry, didn't read that far.

> The native behaviour of the VMM is that it expects every memory location to be backed by some file location

Yes, I've read about it in the old design-and-implementation-of-the-freebsd-operating-system. I put swap on an SSD, so in case this scenario is on by default, it's OK. Currently I have 47GB swapped out with 32GB of RAM, and the browser works just fine.^)

> AFAIK linux does have a different VMM design (not derived from the BSD lineage) that does not have this behaviour.

Yes, there are some tweaks to affect pressure and VM overcommit. What puzzled me is that FreeBSD has different pagers (algorithms or services to move pages) for different types of VM pages, but Linux afaik declares just one. Can't get my head around it, but ok... The problem I'm currently trying to solve, for fun and profit, is to calculate how much swap is used per process. Procstat gives me a lot of VM kernel mappings via the kinfo_vmentry struct, but it seems that's not what I want. The numbers don't correspond to the swapinfo output. The FreeBSD design book doesn't mention kinfo_vmentry at all. My guess is that if I want to get the usage of physical memory on the SSD, the VM structures won't help me much.
 
Sorry, didn't read that far.

> The native behaviour of the VMM is that it expects every memory location to be backed by some file location

Yes, I've read about it in the old design-and-implementation-of-the-freebsd-operating-system. I put swap on an SSD, so in case this scenario is on by default, it's OK. Currently I have 47GB swapped out with 32GB of RAM, and the browser works just fine.^)
Oops. I won't ask what you're doing. ;) (I never used more than 4 gig for a browser)

But then, if we run a VM with, say, 15 gig of memory, the guest will access these memory pages in a random fashion. Then, when they are not quickly used again, the host will see them as dirty+idle, and will after some time write them to swap.
So if we have 32 gig installed and start two VMs with 15 gig each, it should fit perfectly. But instead, after a day or so, we have 30 gig in swap.

Yes, there are some tweaks to affect pressure and VM overcommit. What puzzled me is that FreeBSD has different pagers (algorithms or services to move pages) for different types of VM pages, but Linux afaik declares just one. Can't get my head around it, but ok...
Yes, there are various threads, and each of them tries to manage a specific item. It's like a corporation with multiple intelligences, each doing its particular job, and the final outcome should be the thing we want...

The effect is, on Linux (I haven't used it since 1995, so my knowledge is not current) things run fine as long as memory suffices, but when it starts to move things out, there is a remarkable performance hit. On the Berkeley side the transition is smooth - it starts to page out really early and then gradually increases - you don't notice the precise point where physical memory is full.

The problem I'm currently trying to solve, for fun and profit, is to calculate how much swap is used per process.
Oh yeah. That's a nice one. :)
It's quite difficult. I never bothered to really get through with that.
Procstat gives me a lot of VM kernel mappings via the kinfo_vmentry struct, but it seems that's not what I want. The numbers don't correspond to the swapinfo output.
No, at that high level the figures won't match. Swapinfo shows what has actually been written out - and the pager decides on that at its own discretion. Then there is vm.swap_reserved, which is what has been counted as potentially to be written out. It should be possible to sum up the pages that make up this figure, and then, step by step, get deeper into the mesh.
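For the per-process part, a rough sketch of the "sum up the pages" idea with libprocstat(3): for every anonymous, swap-backed mapping of a process, count the pages that are not resident. The estimate formula is my own simplification; it only gives an upper bound for "possibly swapped out" (never-touched pages are counted too), and as said above it will not match swapinfo(8), which counts what the pager has actually written out.
```c
/* Rough per-process estimate: sum the non-resident pages of all
 * anonymous (swap-backed) mappings.  Upper bound only; it will not
 * match swapinfo(8).  Compile with: cc -o pswap pswap.c -lprocstat */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <libprocstat.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s pid\n", argv[0]);
		return (1);
	}
	pid_t pid = (pid_t)atoi(argv[1]);

	struct procstat *ps = procstat_open_sysctl();
	unsigned int nprocs, nmaps;
	struct kinfo_proc *kp = procstat_getprocs(ps, KERN_PROC_PID, pid, &nprocs);
	if (kp == NULL || nprocs == 0) {
		fprintf(stderr, "no such process\n");
		return (1);
	}

	struct kinfo_vmentry *vm = procstat_getvmmap(ps, kp, &nmaps);
	uint64_t notresident = 0;
	for (unsigned int i = 0; vm != NULL && i < nmaps; i++) {
		if (vm[i].kve_type != KVME_TYPE_SWAP &&
		    vm[i].kve_type != KVME_TYPE_DEFAULT)
			continue;	/* only anonymous, swap-backed objects */
		uint64_t pages = (vm[i].kve_end - vm[i].kve_start) / PAGE_SIZE;
		if (pages > (uint64_t)vm[i].kve_resident)
			notresident += pages - vm[i].kve_resident;
	}
	printf("pid %d: ~%ju kB of anonymous memory not resident "
	    "(upper bound for swapped out)\n",
	    (int)pid, (uintmax_t)(notresident * PAGE_SIZE / 1024));

	procstat_freevmmap(ps, vm);
	procstat_freeprocs(ps, kp);
	procstat_close(ps);
	return (0);
}
```
Getting at the real per-object swap usage would mean digging into the swap pager's own bookkeeping, which is exactly the "step by step deeper" part.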
 