swapinfo avail and swap partition size mismatch by a factor of 4.

Zirias

Daemon

Reaction score: 1,181
Messages: 2,140

Yes, it does that – it moves things out to swap after a very long time (hours/days), so this is not related to vm.swap_idle* (which operates on a timescale of a few seconds).
I don't know where this is controlled or if it can be tuned, but I would agree that it should be fixed or made tuneable: if the machine can run with this working set today, it should also be able to run with it tomorrow in the same fashion.
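For context, these are the knobs in question – a sketch only, per tuning(7); the values shown are illustrative, not recommendations:

```shell
# Show the current idle-swapout settings (FreeBSD):
sysctl vm.swap_idle_enabled vm.swap_idle_threshold1 vm.swap_idle_threshold2

# Enable swapping out of long-idle processes; the thresholds are
# idle times in seconds (illustrative values only):
sysctl vm.swap_idle_enabled=1
sysctl vm.swap_idle_threshold1=2
sysctl vm.swap_idle_threshold2=10
```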
I still think what you see here is the effect of other things, one candidate being the "periodic" jobs. They are I/O heavy, so ARC (and also other caches) will want RAM. Maybe some of them also need RAM directly. Then of course, if free memory runs short, pages of all these long idling processes are swapped out first.

This shouldn't even be a problem if whatever needs the RAM would reliably give it back when needed – then you'd of course notice a delay when using those idle applications, but only briefly. But that's what I described earlier: ARC (up to 12.x) was very reluctant to really give back memory. Maybe it isn't the only example.
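One hedged mitigation for that ARC behavior is to cap the ARC size so it leaves headroom for application working sets – the 8G figure below is purely illustrative:

```shell
# /boot/loader.conf – cap ZFS ARC at 8 GB (illustrative value):
# vfs.zfs.arc_max="8G"

# On recent FreeBSD the cap can also be adjusted at runtime
# (value in bytes):
sysctl vfs.zfs.arc_max=$((8 * 1024 * 1024 * 1024))

# Observe the current ARC size:
sysctl kstat.zfs.misc.arcstats.size
```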
 

Mjölnir

Daemon

Reaction score: 1,488
Messages: 2,114

And the use-cases made up in this thread still just won't work
I think you're referring to what I came up with? Please elaborate, and keep in mind that at one place I explicitly noted locality (i.e. in terms of memory access). As for the other use cases – well, these are explicitly mentioned in tuning(7), and FMLU the mathematical facts underlying CS haven't changed in the last decades...
 

Zirias

Daemon

Reaction score: 1,181
Messages: 2,140

Sorry Mjölnir, the two cases I've seen from you were
  • tmpfs, which doesn't make sense, because if anything, it can only swap out file contents, and if that happened regularly, using tmpfs would be moot, because a regular on-disk filesystem would perform better.
  • heavy multi-user systems – well, partially, because adding a lot of swap there only serves to prevent the OOM killer, so the system keeps running (somehow), but performance will be bad. The tunables you mention are a partial mitigation, because if you swap out idle processes more aggressively, the chances that actively used processes get stalled by heavy swapping are slightly lower.
Did I overlook something else? :-/
 

richardtoohey2

Well-Known Member

Reaction score: 261
Messages: 499

Not sure if this is an example of what you are talking about, but MySQL imports and exports can cause MySQL to eat up swap (and then get killed by OOM) on machines with 32G RAM, most of it in the inactive state. And I see this on machines without ZFS, so I don't think it's related to ZFS/ARC. This is on 12.x, sometimes 11.x. In my brief testing so far, 13.x seems to handle it a lot better – swap doesn't get touched.
 

Mjölnir

Daemon

Reaction score: 1,488
Messages: 2,114

I also mentioned
  • running applications with a large RAM footprint & high RAM locality, which perform computations on chunks of subsets of their data (at least ~seconds/chunk).
  • On those "heavy multi-user systems", the manpage tuning(7) & I explicitly noted lots of idle processes. You might have overlooked that small but important detail.
  • On tmpfs(5), I refined my point to occasional spikes of large usage, which is not unreasonable IMHO.
 
OP

cfs

New Member

Reaction score: 3
Messages: 6

In any case, if you can't make sure there is some "access pattern" (to the memory pages) in your processing that isn't completely random, the resulting performance will be just as bad as your common memory-pressure situation.

I honestly think you are wrong. I believe you are underestimating what these NVMe technologies can do today. They can give you 800 MB/s of sustained random reads, today, at commodity hardware prices. This is literally swapping from RAM to slower RAM. The latency response at high queue depths does not resemble what you are used to from storage at all. They behave like RAM.

Earlier SSDs, from 5 to 10 years ago, did not behave like that. At low queue depths they had great latency response, but that dropped very quickly as queue depths increased. They were constrained by their controllers and the SATA bus and protocols. Until 3 years ago or so, Intel Optane / 3D XPoint was the only really different one in terms of sustained random read behavior resembling RAM when used over NVMe/PCIe. They are still number one, but some of the higher-end commodity drives are getting close. And by the way, these drives are not good at everything: 3D XPoint is slower for sequential writes, so if your use case is accelerating a database redo log, you want something else. But yay, sustained random read.

-Cristian
 

Zirias

Daemon

Reaction score: 1,181
Messages: 2,140

I believe you are underestimating what these NVMe technologies can do today. They can give you 800 MB/s of sustained random reads, today, at commodity hardware prices. This is literally swapping from RAM to slower RAM. The latency response at high queue depths does not resemble what you are used to from storage at all. They behave like RAM.
The limiting factor is not only the transfer rate but indeed the access time. Looking for numbers, I found these are in the single-digit µs range for those modern drives, which is awesome, but still a factor of ~1000 slower than modern RAM (single-digit ns range). So the performance loss from swapping is still substantial compared to just accessing physical RAM.
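Plugging in concrete values for that factor – both latencies below are the assumed "single-digit" figures from above, not measurements:

```python
RAM_ACCESS = 5e-9    # assumed RAM access time: 5 ns
NVME_ACCESS = 5e-6   # assumed NVMe access time: 5 µs

ratio = NVME_ACCESS / RAM_ACCESS          # ≈ 1000x slower

# Touching one million cold 4k pages at random:
pages = 1_000_000
in_ram = pages * RAM_ACCESS               # ≈ 0.005 s if resident in RAM
via_swap = pages * NVME_ACCESS            # ≈ 5 s if every touch faults to swap
```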

The other question would be whether this could still be "acceptable" in practice with this kind of modern hardware. A definitive answer is only possible by testing/benchmarking, but let's do the maths for a hypothetical scenario anyway:

Let's assume this (uncommon) scenario of a single application working in one huge allocation, and let's assume it needs 1GB of pages that are currently swapped out. In this scenario we'll probably have many "superpages" of size 2M (on amd64), but as FreeBSD splits superpages back into regular 4k pages under memory pressure, we'll also have some of those.

A lot of assumptions are needed here; I'll further assume 128MB in 32768 regular pages and 896MB in 448 superpages (and a similar split for the pages that must be swapped out to make room). With this, we must transfer a total of 66432 pages to/from swap in order to swap in this 1GB of memory. With an access time of 5µs, we'd end up with roughly 0.33s in sum just waiting for transfers to start. Even assuming some of the pages can be arranged contiguously on the swap device, say it's 0.25s. Add the transfer itself for 2GB at 4GB/s (0.5s), and this makes for a total of about 0.75s.

Now, if our application has 300GB to process and can organize its accesses in a way that every piece of memory has to be swapped only once, the additional processing time due to swapping will be ~4 minutes. With random and repeated accesses causing every piece to be swapped, say, 5 times, you're already at ~19 minutes…
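The back-of-the-envelope figures above can be checked with a few lines of Python; the page sizes, counts, access time and bandwidth are the assumptions from the text, not measured values, and this version skips the contiguity savings, so it lands slightly above the text's totals:

```python
# Recompute the hypothetical swap-in cost from the assumptions above.
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

small_pages = (128 * 2**20) // PAGE_4K         # 32768 regular 4k pages
super_pages = (896 * 2**20) // PAGE_2M         # 448 superpages (2M)
pages_total = 2 * (small_pages + super_pages)  # swap out + swap in: 66432

ACCESS_TIME = 5e-6                             # 5 µs per access (assumed)
seek_cost = pages_total * ACCESS_TIME          # ≈ 0.33 s waiting for transfers

BANDWIDTH = 4 * 2**30                          # 4 GB/s (assumed)
transfer_cost = (2 * 2**30) / BANDWIDTH        # 2 GB moved → 0.5 s

per_gb = seek_cost + transfer_cost             # ≈ 0.83 s per GB swapped in

# Scale to a 300 GB working set, each GB swapped once vs. 5 times:
once_minutes = 300 * per_gb / 60               # ≈ 4.2 minutes
five_minutes = 5 * 300 * per_gb / 60           # ≈ 20.8 minutes
```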

Yes, this is nowhere near the catastrophic figures with old drives. But it's still substantial, although it may look "acceptable", depending on the use case.

Side note: your "typical" memory-pressure scenario caused by just lots of processes will still be a lot worse, because you can assume many more regular 4k pages and, most of the time, the full access time for each and every one of them.


If you achieved one thing, it's inducing in me a wish for new hardware, although I don't need it for my private desktop. I rest my case that swap can't "replace" RAM, but these speeds are awesome, of course.
 