swapinfo avail and swap partition size mismatch by a factor of 4.

Hello,

TL;DR:

My problem: I allocated 256G for swap during installation of 12.2 (zfs install option, single disk). Real memory is 64G. The partition was created correctly at that size; however, both swapinfo and top report only 64G of available swap.

The last line in dmesg output is telling:

Code:
WARNING: reducing swap size to maximum of 65536MB per unit

but searching the web I have not been able to find a reference to this limit or a way to change it.

The system is a fresh 12.2 install plus an update from source to 12-STABLE.

Longer version:

Rich (BB code):
root@caleuche 00:04:46 /usr/home/cfs
# swapinfo -h
Device          512-blocks     Used    Avail Capacity
/dev/nvd1p3      134217728       0B      64G     0%

root@caleuche 00:08:10 /usr/home/cfs
# diskinfo -t nvd1p3
nvd1p3
        4096            # sectorsize
        274877906944    # mediasize in bytes (256G)
        67108864        # mediasize in sectors
        0               # stripesize
        210763776       # stripeoffset
        4177            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        SHGP31-1000GM-2 # Disk descr.
        AD0AN99691060B214       # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Seek times:
        Full stroke:      250 iter in   0.008899 sec =    0.036 msec
        Half stroke:      250 iter in   0.008701 sec =    0.035 msec
        Quarter stroke:   500 iter in   0.018557 sec =    0.037 msec
        Short forward:    400 iter in   0.015540 sec =    0.039 msec
        Short backward:   400 iter in   0.015513 sec =    0.039 msec
        Seq outer:       2048 iter in   0.063381 sec =    0.031 msec
        Seq inner:       2048 iter in   0.063434 sec =    0.031 msec

Transfer rates:
        outside:       102400 kbytes in   0.065011 sec =  1575118 kbytes/sec
        middle:        102400 kbytes in   0.064456 sec =  1588681 kbytes/sec
        inside:        102400 kbytes in   0.064432 sec =  1589272 kbytes/sec


root@caleuche 00:14:46 /usr/home/cfs
# gpart list nvd1
Geom name: nvd1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 244190640
first: 6
entries: 128
scheme: GPT
Providers:
1. Name: nvd1p1
   Mediasize: 209715200 (200M)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 24576
   Mode: r0w0e0
   efimedia: HD(1,GPT,6d0640a8-900d-11eb-8ce6-a0369f3e8c80,0x6,0xc800)
   rawuuid: 6d0640a8-900d-11eb-8ce6-a0369f3e8c80
   rawtype: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
   label: efiboot0
   length: 209715200
   offset: 24576
   type: efi
   index: 1
   end: 51205
   start: 6
2. Name: nvd1p2
   Mediasize: 524288 (512K)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 209739776
   Mode: r0w0e0
   efimedia: HD(2,GPT,6d0d0cfa-900d-11eb-8ce6-a0369f3e8c80,0xc806,0x80)
   rawuuid: 6d0d0cfa-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: gptboot0
   length: 524288
   offset: 209739776
   type: freebsd-boot
   index: 2
   end: 51333
   start: 51206
3. Name: nvd1p3
   Mediasize: 274877906944 (256G)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 210763776
   Mode: r1w1e0
   efimedia: HD(3,GPT,6d12c01c-900d-11eb-8ce6-a0369f3e8c80,0xc900,0x4000000)
   rawuuid: 6d12c01c-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: swap0
   length: 274877906944
   offset: 210763776
   type: freebsd-swap
   index: 3
   end: 67160319
   start: 51456
4. Name: nvd1p4
   Mediasize: 725115469824 (675G)
   Sectorsize: 4096
   Stripesize: 0
   Stripeoffset: 210763776
   Mode: r1w1e1
   efimedia: HD(4,GPT,6d161641-900d-11eb-8ce6-a0369f3e8c80,0x400c900,0xa8d4400)
   rawuuid: 6d161641-900d-11eb-8ce6-a0369f3e8c80
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: zfs0
   length: 725115469824
   offset: 275088670720
   type: freebsd-zfs
   index: 4
   end: 244190463
   start: 67160320
Consumers:
1. Name: nvd1
   Mediasize: 1000204886016 (932G)
   Sectorsize: 4096
   Mode: r2w2e3


root@caleuche 00:15:36 /usr/home/cfs
# nvmecontrol identify nvme1ns1
Size:                        244190646 blocks
Capacity:                    244190646 blocks
Utilization:                 244190646 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #01
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read Not Reported
Optimal I/O Boundary:        0 blocks
NVM Capacity:                0 bytes
Globally Unique Identifier:  00000000000000000000000000000000
IEEE EUI64:                  ffffffffffffffff
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Best
LBA Format #01: Data Size:  4096  Metadata Size:     0  Performance: Best

root@caleuche 00:15:56 /usr/home/cfs
# uname -a
FreeBSD caleuche 12.2-STABLE FreeBSD 12.2-STABLE r369525 GENERIC  amd64

Code:
root@caleuche 00:29:44 /usr/src
# dmesg | tail
ums0 on uhub0
ums0: <KYE Optical Mouse, class 0/0, rev 1.10/0.00, addr 1> on usbus0
ums0: 3 buttons and [XYZ] coordinates ID=0
uhid0 on uhub0
uhid0: <NOVATEK USB Keyboard, class 0/0, rev 1.10/1.12, addr 2> on usbus0
Security policy loaded: MAC/ntpd (mac_ntpd)
WARNING: autofs_trigger_one: cv_wait_sig for /n/ failed with error 4
Accounting enabled
Accounting disabled
WARNING: reducing swap size to maximum of 65536MB per unit

Code:
root@caleuche 00:29:48 /usr/src
# dmesg | head -n 60
---<<BOOT>>---
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.2-STABLE r369525 GENERIC amd64
FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
VT(efifb): resolution 1024x768
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz (4200.20-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906e9  Family=0x6  Model=0x9e  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0xc000000<IBPB,STIBP>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 68719476736 (65536 MB)
avail memory = 66711527424 (63621 MB)
[...]

Thanks in advance,

-Cristian
 
The last line in dmesg output is telling:

Code:
WARNING: reducing swap size to maximum of 65536MB per unit
I can't find that documented right now, but it's pretty straightforward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation, because even for total swap this would be ridiculously large for almost any machine.
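To make the arithmetic concrete, here is a tiny sh sketch (plain arithmetic, nothing FreeBSD-specific; the variable names are made up) of how the per-unit cap maps onto the OP's 256G partition:

```shell
#!/bin/sh
# With a 65536 MB cap per swap device (the number from the kernel warning),
# a requested 256 GB of swap would need this many separate devices:
per_unit_mb=65536
want_mb=$((256 * 1024))
units=$(( (want_mb + per_unit_mb - 1) / per_unit_mb ))   # round up
echo "$units"   # prints 4
```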

A sane swap size for 64GB of RAM is somewhere in the range of 8GB to 16GB. A reason for this: swap is extremely slow compared to RAM. So you NEVER want regularly accessed data in swap; otherwise your machine will be busy swapping in and out and unable to get anything else done. Swap is only for storing memory pages that aren't accessed for a reasonable period of time, so the RAM is free for something else.

If you ever run into a situation where 64GB + 16GB of swap isn't enough, you want the OOM killer: that way ONE process is sacrificed, but the machine keeps working. If it can just swap more instead, it will do so and become completely unusable.
 
Or this? I didn't calculate..
Hm, this seems to be the maximum for total swap. On my 8GB RAM desktop, it's set to 15656752, which is a bit short of 60GB, on my server with 64GB RAM, the value is 130478656, which would be nearly 498 GB.

I calculated with the standard page size of 4k here; I guess that's what's meant.

Still, just don't try to use silly amounts of swap ;)
 
IIRC the maximum supported swap partition size is 64G on 12.2.
In 13 it has been increased to either 128G or 256G.

If you want more than this, you need multiple swap partitions.
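If you go that route, and assuming four freebsd-swap partitions with GPT labels swap0 through swap3 exist (the labels here are made up for illustration), the /etc/fstab entries would look roughly like this:

```
# /etc/fstab: four swap devices of up to 64G each (hypothetical labels)
/dev/gpt/swap0   none   swap   sw   0   0
/dev/gpt/swap1   none   swap   sw   0   0
/dev/gpt/swap2   none   swap   sw   0   0
/dev/gpt/swap3   none   swap   sw   0   0
```

Each device is then swapped on at boot and counted separately against the per-unit limit.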

Edit:
The maximum swap size was increased recently because of the well-known issue that FreeBSD by default has very high swappiness and likes to swap out literally the whole RAM in some scenarios. For this reason it can be bad if the swap is smaller than RAM.
 
I can't find that documented right now, but it's pretty straightforward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation, because even for total swap this would be ridiculously large for almost any machine.
Please RTFM tuning(7). Notable exceptions to "almost" below.
A sane swap size for 64GB of RAM is somewhere in the range of 8GB to 16GB. A reason for this: swap is extremely slow compared to RAM.
This gap has narrowed significantly with NVMe devices, which the OP has.
So you NEVER want regularly accessed data in swap; otherwise your machine will be busy swapping in and out and unable to get anything else done. Swap is only for storing memory pages that aren't accessed for a reasonable period of time, so the RAM is free for something else.
Not correct. There are two notable use cases that need a lot of swap, even on systems with huge RAM, where the system will not be "unable to achieve anything else":
  1. When a tmpfs(5) is used heavily in terms of size. tmpfs(5) uses swap space when RAM is exhausted.
  2. Systems where many users with a huge accumulated working set in RAM log in & out frequently, leaving lots of idle processes, e.g. a number-crunching server of an R&D department or a large build(7) server. Then the admin may want to tune the VM system according to tuning(7), i.e. set vm.swap_idle_enabled & the two vm.swap_idle_threshold{1,2} sysctls to swap out processes very early.
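For the second use case, those tuning(7) knobs can be set persistently in /etc/sysctl.conf; the threshold values below are placeholders shown for illustration, not recommendations:

```
# /etc/sysctl.conf: swap out idle processes early (see tuning(7))
vm.swap_idle_enabled=1
vm.swap_idle_threshold1=2     # seconds idle before a process becomes a candidate
vm.swap_idle_threshold2=10    # seconds idle before it is swapped out eagerly
```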
If you ever run into a situation where 64GB + 16GB swap isn't enough,
WhenEVER (;)) you're tempted to use total terms like any, (n)ever, all, none, etc., please carefully question yourself & search for notable, reasonable exceptions.
you want the OOM killer, that way, ONE process is sacrificed, but the machine keeps working. If it can just swap more instead, it will do so, and become completely unusable.
Not correct. See above.
On the actual question: all I found is that you may succeed by increasing kern.maxswzone. As far as I know, the number of swap devices is limited (hardwired) to 4.
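For anyone who wants to experiment with that: kern.maxswzone is a loader tunable, so it would go into /boot/loader.conf. The value below is a placeholder, not a tested recommendation:

```
# /boot/loader.conf: raise the memory reserved for swap metadata.
# Placeholder value; read tuning(7) and loader.conf(5) before changing it.
kern.maxswzone="134217728"
```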
EDIT corrected statement about tmpfs(5). I didn't initially write what I thought... typical mismatch between thoughts - words.
 
When a tmpfs(5) is used heavily in terms of size. tmpfs(5) resides in swapspace.
No, tmpfs(5) uses memory; it can use the total amount of memory, which includes swap. It will use RAM first if that's available and will only move things to swap if there are more pressing needs for the memory. If you need to literally store gigabytes of data, you shouldn't be using tmpfs(5) for that, as it directly competes with things like the file and process cache stored in memory. Since the disks are fast enough, it makes more sense to just create a "real" filesystem for it in that case.
 
This gap has narrowed significantly with NVMe devices, which the OP has.
No. The difference in access time is still several orders of magnitude. Add to that that swap can't be physically addressed by the CPU, so you always have the copying overhead.
When a tmpfs(5) is used heavily in terms of size. tmpfs(5) resides in swapspace.
Plain out wrong. It can swap out actual file contents, but not metadata. And then, for accessing these contents, they must be swapped back in.
Systems where many users with a huge accumulated working set in RAM log in & out frequently, leaving lots of idle processes, e.g. a number-crunching server of an R&D department or a large build(7) server. Then the admin may want to tune the VM system according to tuning(7), i.e. set vm.swap_idle_enabled & the two vm.swap_idle_threshold{1,2} sysctls to swap out processes very early.
If you have huge amounts of RAM actually accessed frequently, nothing will help. The machine will just constantly swap in/out.
WhenEVER (;)) you're tempted to use total terms like any, (n)ever, all, none, etc., please carefully question yourself & search for notable, reasonable exceptions.
Please know what you claim before expressing strong opinions.
 
If you have huge amounts of RAM actually accessed frequently, nothing will help. The machine will just constantly swap in/out.
Please file a bug report on the manual page tuning(7)? The important factor is the statement "lots of idle processes", which should better read "significant portions of the working set in RAM not accessed for a long time", where "long" is, as far as I understand, on the order of at least seconds. If you have e.g. a bunch of engineers doing fluid-dynamics calculations with a huge working set in RAM but high locality (only a small data subset is accessed for a few seconds, then the next subset, etc.; another example I could imagine would be weather forecasting), the system will benefit from huge swap space and not constantly swap in & out.

A similar effect occurs when the use case generally makes "normal" use of a tmpfs(5), but occasional peaks in usage occur. Then "swap space is the saving grace of UNIX and even if you do not normally use much swap, it can give you more time to recover from a runaway program before being forced to reboot". This statement from tuning(7) remains true. It recommends swap = 2*RAM if RAM <= 4GB, else swap = RAM.
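The tuning(7) rule of thumb quoted above can be written out as a tiny sh sketch (the function name is made up for illustration):

```shell
#!/bin/sh
# tuning(7) rule of thumb: swap = 2*RAM if RAM <= 4 GB, else swap = RAM.
suggested_swap_gb() {
    ram_gb=$1
    if [ "$ram_gb" -le 4 ]; then
        echo $((ram_gb * 2))
    else
        echo "$ram_gb"
    fi
}
suggested_swap_gb 4     # prints 8
suggested_swap_gb 64    # prints 64
```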
Please know what you claim before expressing strong opinions.
Please try to understand the implications & demands of the two use cases I gave as exceptions.
 
I can't find that documented right now, but it's pretty straightforward: the size of a single swap space (you can have multiple) is limited to 64GB. It probably doesn't need much documentation, because even for total swap this would be ridiculously large for almost any machine.
It was 64GB per instance/unit (maximum of 4?), so you can have 256GB; just make it 4 partitions on 4 disks (or 4 files).

But, as you and SirDice said, 256GB is just wasted space. Even if you save kernel dumps, 64GB is more than enough.
 
Please file a bug report on the manual page tuning(7)?
Well, I personally don't care what ancient stuff is in there. The current text block about swap dates back to 2014 and is only a slight rewrite of the earlier block from 2009, when it was first changed from unconditionally recommending 2x RAM size. The 1x RAM recommendation is still there, but it is pretty much obsolete on systems with 64GB (or more) of RAM.
"swap space is the saving grace of UNIX and even if you do not normally use much swap, it can give you more time to recover from a runaway program before being forced to reboot"
THIS statement has remained unchanged since the introduction of tuning(7) with FreeBSD 5, in 2003. It's not wrong now, but it desperately needs a warning that you can indeed "overdo" it.
Please try to understand the implications & demands of the two use cases I gave as exceptions.
You don't help the OP by making up usecases that you think would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.
 
To the use cases Mjölnir described I want to add the typical desktop usage.
It is just an insanely bad user experience to wait seconds to minutes when switching to another Firefox window, watching top constantly in swread while the rest of the computer acts sluggish to unusable, despite almost all memory being free.

There are many users who would prefer the traditional, simple Unix swap method of only swapping when actually needed (e.g. swappiness = 0, while default FreeBSD behaves like swappiness = 99).

Why is the idea of introducing a swappiness option being rejected so much by some people?
 
Well, I personally don't care what ancient stuff is in there. The current text block about swap dates back to 2014 and is only a slight rewrite of the earlier block from 2009, when it was first changed from unconditionally recommending 2x RAM size. The 1x RAM recommendation is still there, but it is pretty much obsolete on systems with 64GB (or more) of RAM.
When such a large system is used by a large number of users concurrently, or runs a reasonably large number of processes many of which are idle, the statements in tuning(7) remain true. They are not obsolete.
THIS statement remained unchanged since the introduction of tuning(7) with FreeBSD 5, in 2003. It's not wrong now, but desperately needs a warning that you can indeed "overdo" it.
Please file a bug report on tuning(7) and/or talk to the VM wizards on <freebsd-hackers>.
You don't help the OP by making up usecases that you think would profit from ridiculous amounts of swap; they won't. Of course, to understand why, you must understand how virtual memory and swap work in general.
Please enlighten me or point me in the right direction (links on the web).
 
To the use cases Mjölnir described I want to add the typical desktop usage.
It is just an insanely bad user experience to wait seconds to minutes when switching to another Firefox window, watching top constantly in swread while the rest of the computer acts sluggish to unusable, despite almost all memory being free.

There are many users who would prefer the traditional, simple Unix swap method of only swapping when actually needed (e.g. swappiness = 0, while default FreeBSD behaves like swappiness = 99).

Why is the idea of introducing a swappiness option being rejected so much by some people?
There seem to be at least two misunderstandings.
  1. This is exactly the problem that becomes more likely if you just add more swap. You should avoid it by limiting swap to a sane size.
  2. FreeBSD doesn't swap "just because". It does so only when necessary. You don't happen to use ZFS on 12.x or earlier? ARC is always wired (think about it: what good would a cache for a disk be if it could itself be stored on disk), and the old implementation was very reluctant to give memory back, so the system swaps other things out instead. If you're not on 13 yet, limit the amount of ARC to prevent that. If that doesn't help, your system simply has too little RAM for your workload. Swap is never a replacement for RAM.
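For reference, capping ARC is a loader tunable as well; it would go into /boot/loader.conf. The 3G value below is only an example for a small desktop, not a general recommendation:

```
# /boot/loader.conf: cap the ZFS ARC (example value for an 8GB machine)
vfs.zfs.arc_max="3G"
```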
Mjölnir just no. I already explained in depth why such amounts of swap make no sense at all and do harm. If you don't want to understand, that isn't my problem.

Side note: at least the installer does a somewhat sane thing and never automatically suggests a swap partition larger than 4GB.
It's far from a "perfect" implementation though; it doesn't seem to look at the actual RAM size at all.
 
FreeBSD doesn't swap "just because". It does so only when necessary. You don't happen to use ZFS on 12.x or earlier? ARC is always wired (think about it: what good would a cache for a disk be if it could itself be stored on disk), and the old implementation was very reluctant to give memory back, so the system swaps other things out instead. If you're not on 13 yet, limit the amount of ARC to prevent that. If that doesn't help, your system simply has too little RAM for your workload. Swap is never a replacement for RAM.
Your reply is a typical example of the denial of the fact that FreeBSD actually swaps "just because".

With ARC set to a sensible maximum, say 1GB, there is still a lot of swapping going on even when, say, only 20% of memory is actually in use (i.e. not "free memory"). It is just insane that FreeBSD is able to fill a 2GB swap partition when only ~8GB of 48GB RAM has ever been used since boot.
This insane behaviour only stops when one deactivates swapping and just takes care that sufficient RAM stays free.

I haven't tried FreeBSD 13, so I cannot yet tell whether this insane swappiness has become better.
 
I don't know how you get your system to do anything like that. My desktop has 8GB RAM and ARC limited to 3G (and nothing else "tuned"), with chromium and a dozen tabs plus several other applications running, leaving right now less than 1G of free RAM; not a single byte of swap is used.

Yes, this is an observed fact, not "denial" of facts. And no, this wasn't different on 11 and 12. My HDD is slow, I notice every bit of having to wait for swap immediately.
 
Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.

Then you can experience the disruptive feeling when your system suddenly starts to swap in the gigabytes that the friendly VM swapped out through the night, and you have no idea whether it will become responsive again in seconds or whether it is better to go smoke one or make a tea, as this swap-in can sometimes take quite a while.

I repeat, this is very disruptive and annoying.

And this is a commonly observed experience if you use a PC with plenty of RAM, not one with only a puny amount of RAM like the majority of "desktop" users have.
 
Just leave a few big memory eaters like LibreOffice, several tab-filled Firefox windows and the like idle for a long time, say, overnight.
I do that regularly, and I had exactly this problem UNTIL I restricted ARC to a sane maximum, back on 11.x. Since then it has never happened again. It was always ARC that forced other things to be swapped out; that probably happens during the periodic(8) jobs in the middle of the night. Swapping back in will only happen once you access the data in the swapped-out pages.

Therefore, if such a restriction doesn't solve that problem on your machine immediately, something else must be configured in an unusual way.

Finally, it's possible this is not needed any more on 13; OpenZFS' ARC returns memory very quickly. I have to try whether that is good enough to avoid running into a heavy swapping situation even without restricting ARC.
 
Just a sidenote from someone who's got no clue about the VM system & what all that swap is about (I wonder how I managed to pass the university exams...): a multitasking OS is well capable of performing other tasks while waiting for some memory pages to be swapped in, especially when it runs on multi-core hardware.
 
Oh, wow. Then please provoke memory congestion with lots of swap activity. Observe. Then ask yourself: how many of your processes get anywhere without needing disk I/O?

Spoiler: very few.
 
Oh, wow. Then please, provoke a memory congestion with lots of swap activity.
First I have to look up what that is... So it's not possible that tasks which only access pages present in RAM get CPU time slices while some pages are swapped in or out?
 
So it's not possible that tasks which only access pages present in RAM get CPU time slices while some pages are swapped in or out?
Yes, but then you don't have memory congestion.
 
I want to thank Snurg and Mjölnir for actually addressing my question. I have to say, coming back to the forums after several years and getting my use case called "ridiculous" in the first sentence of the first answer, by a moderator, was not a great experience, especially with no other content in the message. I imagine you guys have to tell newbies "don't do that!" a lot, and this was my first post from a new account. But still, not great. Poul-Henning Kamp wrote "You're Doing It Wrong" more than 10 years ago.

Snurg, an idea in case it's useful: these days you can get a 16 GB Intel Optane drive (3D XPoint memory in NVMe/M.2 form factor) on eBay for around the price of a large Starbucks coffee. If you have a free PCIe x1 slot, you can buy an M.2-to-PCIe-x1 adapter for the cost of a second large coffee. The sustained random-read latency of 3D XPoint today is around the speed RAM had in the mid-90s. I am guessing that swapping to that would make your life happier.
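For completeness, once such a device shows up (say as nvd2; the device name and the label below are hypothetical), turning the whole thing into swap only takes a few commands:

```
# Device name nvd2 is hypothetical; adjust to what your system reports.
gpart create -s gpt nvd2                       # fresh GPT on the Optane
gpart add -t freebsd-swap -l optane0 nvd2      # one partition, whole device
swapon /dev/gpt/optane0                        # enable it right away
swapinfo -h                                    # verify it shows up
```

Add a matching /etc/fstab line if it should survive a reboot.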

-Cristian
 
First I have to look up what that is... So it's not possible that tasks which only access pages present in RAM get CPU time slices while some pages are swapped in or out?
That's not the point. Sure, this is possible. But just find a process that does anything useful without needing disk I/O. Also remember that every execve(2) needs disk I/O, plus memory pages to load the binary into. There's just no way a machine can do anything useful unencumbered while heavily swapping. If you don't believe it, test it.
 