Server layout: rootfs on USB flash drive or NVMe?

AIUI you want your special vdevs to have exactly the same redundancy as your main pool, which would complicate adding one of those to our 8-way raidz3.
 
If it were my system, I would get a pair of quality NVMe SSDs as large as I could afford. Consider heatsinks at the design stage.
I bought 2 poor-to-medium quality NVMe drives, used, 256 GB. The board has PCIe 3.0 x4 lanes IIRC, and NVMe 1.3. So plugging in hyper-dyper-throw-your-money-at-me NVMe drives just doesn't make sense. These are already much faster than the rotating SATA HDDs, and that's all I want; since reads can be done in parallel, even the cheap "slow" ones will saturate the PCIe bus. I have to keep an eye on performance per money.
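As a sanity check on the "saturate the bus" claim, the PCIe 3.0 x4 ceiling can be estimated from the textbook per-lane rate (8 GT/s with 128b/130b encoding, roughly 985 MB/s per lane before protocol overhead). These are generic figures, not measurements of this board:

```shell
# Rough PCIe 3.0 x4 bandwidth ceiling.
# 8 GT/s per lane with 128b/130b encoding ~= 985 MB/s per lane
# (before protocol overhead), so four lanes top out near 4 GB/s --
# well above what a SATA HDD or a cheap NVMe drive can deliver.
lanes=4
mb_per_lane=985
echo "PCIe 3.0 x${lanes} ceiling: ~$((lanes * mb_per_lane)) MB/s"
```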
Heat sinks: yes, these are on my list. The NVMe drives will get one, but the RAM will not. This is the first time in my life that I do what "modders" do to pimp up their hardware... ;) They have blinkin' lights on their RAM DIMMs and NVMe and so on, crazy!!!
Mirror them using ZFS. But see the caveat below about copy-on-write file systems -- you may wish to use a GEOM mirror for the whole disk, and have multiple types of file systems.
Why should I want a GEOM mirror? I would use that for swap, but ATM I think I don't need to mirror swap, because that box is not "mission critical": if it crashes, it crashes and I lose some fresh data, let's say the last 15 minutes. OK, so be it, I don't care.
I would also install swap here, but keep watch and move it to a SATA mirror if the swap gets too intensive.
If the swap gets too intensive I can double the RAM from 2 x 8 to 2 x 16 GB. For the time being, I think 16 GB is much more than I need, with sufficient headroom.
[...] A SLOG never needs to be larger than main memory.
THIS is the kind of information I need. I didn't find this anywhere.
I would always consider putting a special VDEV on the SSDs. This keeps the file system metadata for the SATA disks on fast media. Special VDEVs are a challenge to size correctly.
Why do I have to worry about the correct sizing? Can't I use ZVOLs for these (zpool cache, log, special and dedup vdevs)?
So put it adjacent to the swap space, and move the swap to SATA if more space is required. Plan for this when you partition the disks.
The NVMe drives are big enough to hold 2 x 32 GB swap, i.e. 2 x max RAM. I seldom needed more than this, and that was when a program went completely crazy; more swap would only delay the inevitable crash. I'm 100% NOT going to swap to the SATA disks.
I would try to place my important VMs on the SSDs. But beware: you don't want both the hypervisor and the VM client using copy-on-write file systems. And you don't want VMs swapping madly to any underlying SSD.
OK, thanks, I'll keep that in mind and read up on why this is so.
I would never build a NAS without 100% redundancy. I would mirror the SATA disks... and migrate storage between SSD and SATA as needs dictate.
But I don't need redundancy for all the data. Some of it is "scratch" data, so why should I mirror it? If it's gone, it's gone; I can download or create it again.
 
L2ARC requires memory for its allocation tables; roughly 100 MB per 1 GB stored in L2ARC as a rule of thumb. So L2ARC is always the wrong tool against memory pressure on low-end systems such as yours, as you will only worsen the problem.
OK. So my naive plan to have 2 x 64 = 128 GB of L2ARC cache (50/50 for the mirror and the scratch) is nonsense, because this would need 2 x 6.4 ≈ 13 GB of RAM. And I have only 16 GB total. So either I cut down the size of the cache drastically or skip it completely.
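That back-of-envelope calculation can be written down once and reused for other candidate sizes. This is a planning figure only, based on the rule of thumb quoted above; the real overhead depends on the average L2ARC block size:

```shell
# Planning figure only: estimate the RAM consumed by L2ARC headers
# using the rule of thumb quoted above (~100 MB RAM per 1 GB of
# L2ARC). The real overhead depends on the average L2ARC block size.
l2arc_gb=128                          # planned total L2ARC, in GB
ram_mb=$((l2arc_gb * 100))            # ~100 MB RAM per GB of L2ARC
echo "${l2arc_gb} GB L2ARC -> ~$((ram_mb / 1024)) GB RAM for headers"
```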

This also means I gain much space on the NVMe SSDs; the first candidate for a reasonable use of it is the base OS. Then the decision is clear: no OS on the internal USB thumb drive; instead, install to the NVMe SSDs. Mirror most of it, and put on the striped "scratch" zpool whatever does not need redundancy: /var/cache, /var/crash, /var/obj (historically /usr/obj) and so on.
 
Please post the output of zpool list -v. We are in the swamp now.
Alan, I'm still waiting for the first 6 TB HDD to arrive, and ATM I have the first NVMe drive, but the second comes tomorrow, or Thursday, or Friday. OK? And I still have to "shoot" a second SATA HDD with 6 TB, which is not so easy because some nerds are paying extraordinarily high prices for used hardware. Can you believe that they even pay 5 €/TB for a DEFECTIVE HDD?!!! Yes, they do! That's CRAZY!!! I have time. I bid, and when others bid more, so be it, I'll bid in the next auction. And no SMR please, and no such crap as Barracuda or "Green". ATM I have two old SATA SSDs with 256 GB in the box, just for testing. zpool list -v would show something like: 158 GB DATA, 158 GB SCRATCH. The box is switched off, so I can't produce real command output, OK? Cheers.
 
Why should I want a GEOM mirror? I would use that for swap, but ATM I think I don't need to mirror swap, because that box is not "mission critical": if it crashes, it crashes and I lose some fresh data, let's say the last 15 minutes. OK, so be it, I don't care.
You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirrored UFS file system. See my comments above re COW client file systems provisioned on COW server file systems.
Why do I have to worry about the correct sizing [of a special VDEV]? Can't I use ZVOLs for these (zpool cache, log, special and dedup vdevs)?
ZFS terminology can get intense. I don't think that you mean ZVOL, which is a specialized ZFS dataset that presents as a raw block device (virtual disk) rather than a file system. As an example, I use ZVOLs to provision iSCSI storage from my ZFS server to my KVM server for running Windows VMs.

VDEVs may be either for storage or support. The support VDEV classes include the special VDEV. Expanding VDEVs is a lot easier these days than it used to be, but you still need to size them, and special VDEVs are tricky to size. You don't really know how big they need to be until you have instantiated them. So you need to have a contingency plan to grow, if required. Placing the metadata from a slow disk on an SSD special VDEV can deliver significant benefits in some circumstances (e.g. metadata intense applications like find(1)).
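Since the required size is guesswork up front, a back-of-envelope sketch may still help with the initial partitioning. The metadata fraction below is an assumed planning value, not a measured one; block statistics from a populated pool (e.g. via zdb) would give real numbers:

```shell
# Back-of-envelope special VDEV sizing. The metadata fraction is an
# ASSUMED planning value (a few tenths of a percent of pool data is a
# common starting point); measure a populated pool for real numbers.
pool_tb=4                   # data expected on the slow pool, in TB
frac_permille=5             # assumed metadata fraction: 0.5%
gb=$((pool_tb * 1000 * frac_permille / 1000))
echo "~${gb} GB of metadata for ${pool_tb} TB of data at 0.${frac_permille}%"
```

Small files redirected to the special VDEV (special_small_blocks) would come on top of this, which is one reason these VDEVs are hard to size in advance.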
 
I have space. But if you use them, MIRROR them on different drives. Or don't use them at all.
Code:
special                     -      -      -        -         -      -      -      -         -
  mirror-6              31.5G  4.20G  27.3G        -        1G    56%  13.3%      -    ONLINE
    gpt/SSD_A_SPECIAL     33G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
 
You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirror'd UFS file system. See my comments re COW client file systems provisioned on COW server file systems above.
What is the disadvantage of providing a ZVOL to a non-COW fs VM? I'd like to have it all on ZFS if at all possible, so that zpool(8) and geom(4) do not interfere. Unfortunately we should not swap to a ZVOL, although this is explicitly handled in one of the rc/service(8) scripts in /etc/rc.d.
ZFS terminology can get intense. I don't think that you mean ZVOL, which is a specialized ZFS dataset that presents as a raw block device (virtual disk) rather than a file system.
Oh, I think my understanding was OK, but maybe the wording was not 100% accurate.
VDEVs may be either for storage or support.
It's all "storage" & "data". Metadata (the support vdev) also needs to land on a physical storage device eventually, and it's a kind of data itself... Maybe the most accurate terms are userdata and metadata.
The special vdev holds both kinds of data, userdata (small files) AND metadata, hence the name, while cache and log are (more or less) userdata-only and dedup is metadata-only; although I strongly suspect they all need to store some internal metadata.

Now what I wanted to do is use a ZVOL from a zpool on the faster devices as a special, cache and log vdev for a zpool on the slower devices, because I want to avoid using partitions: they are not flexible, the size is fixed, and changing it requires copying all the data to another device and back after resizing. In contrast, ZVOLs can be resized easily; that's the LVM side of ZFS.
Why should that not be possible? Are there any disadvantages to using a ZVOL as a special, cache or log vdev for another zpool(8) on different (and significantly slower) devices?

If is does have disadvantages or is not possible at all, that would be a HUGE disappointment. The main motivation to use a LVM is to get this flexibility to resize partitions (or whatever you call these: parts of the size of a storage device) on demand and not getting stuck to your wrong guess at creation time.
 
Let me rephrase: the "zvol" is anything that is not cache, log, or special.
I'm just going to paste my setup; it's just my config, zpool list -v, currently on Linux Mint (doesn't matter), after doing zpool import -a.
Code:
zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
FOURTERRA     4.55T  4.40T   149G        -         -     2%    96%  1.00x    ONLINE  -
  sdf1        4.55T  4.40T   149G        -         -     2%  96.8%      -    ONLINE
KEEP           104G  1.31M   104G        -      858G     0%     0%  1.00x    ONLINE  -
  mirror-0    72.5G   112K  72.5G        -      858G     0%  0.00%      -    ONLINE
    sdd         73G      -      -        -      858G      -      -      -    ONLINE
    sdc         73G      -      -        -      880G      -      -      -    ONLINE
special           -      -      -        -         -      -      -      -         -
  mirror-1    31.5G  1.20M  31.5G        -      900G     0%  0.00%      -    ONLINE
    sdd         32G      -      -        -      900G      -      -      -    ONLINE
    sdc         32G      -      -        -      922G      -      -      -    ONLINE
logs              -      -      -        -         -      -      -      -         -
  sdd           32G      0  31.5G        -      900G     0%  0.00%      -    ONLINE
cache             -      -      -        -         -      -      -      -         -
  sdc           64G      0  64.0G        -         -     0%  0.00%      -    ONLINE
SSD            530G   211G   319G        -      430G     0%    39%  1.00x    ONLINE  /mnt/SSD
  indirect-0      -      -      -        -         -      -      -      -    ONLINE
  indirect-1      -      -      -        -         -      -      -      -    ONLINE
  mirror-2     498G   207G   291G        -      430G     0%  41.5%      -    ONLINE
    sdd        500G      -      -        -      430G      -      -      -    ONLINE
    sdc        501G      -      -        -      452G      -      -      -    ONLINE
  indirect-3      -      -      -        -         -      -      -      -    ONLINE
  indirect-4      -      -      -        -         -      -      -      -    ONLINE
  indirect-7      -      -      -        -         -      -      -      -    ONLINE
special           -      -      -        -         -      -      -      -         -
  mirror-6    31.5G  4.21G  27.3G        -      900G    56%  13.4%      -    ONLINE
    sdd         33G      -      -        -      898G      -      -      -    ONLINE
    sdc         34G      -      -        -      920G      -      -      -    ONLINE
TREETERRA     2.53T   211G  2.33T        -     2.91T     0%     8%  1.00x    ONLINE  -
  sde         2.54T   211G  2.33T        -     2.91T     0%  8.14%      -    ONLINE
logs              -      -      -        -         -      -      -      -         -
  sdc           32G      0  31.5G        -      922G     0%  0.00%      -    ONLINE
cache             -      -      -        -         -      -      -      -         -
  sdc           64G  1.62G  62.4G        -         -     0%  2.53%      -    ONLINE
The "zvol" (the real data) is here: 1. sdf1, 2. mirror-0, 3. mirror-2, 4. sde. PS: I must do a better job of getting GPT labels going.
 
Status to satisfy Alain De Vos
Storage:
  1. 2 x SATA III rotating HDD 6 TB (datacenter quality), used; 1 arrived a minute ago, one still missing because crazy <censored> bid moon prices for used hardware
  2. 2 x NVMe 1.3 M.2 SSD 256 GB (poor-to-medium, consumer-grade quality), used; 1 already built in, ETA of the other: today or tomorrow
    IMHO the quality matches the capability of the mainboard (PCIe 3.0 x4) when accessed in parallel. Cheap heat sinks will be bought today.
  3. RAM: 2 x 8 = 16 GB DDR4, 3.2 GHz effective frequency; MB/CPU can handle 2.933 GHz max. => no heat sink needed IMHO (location: Berlin, GER, no special cooling facilities)
    IMHO it's reasonable to account for approx. 50% of the RAM to support (cache & log) & manage the NV storage, and the other 50% for the OS, services and VMs
  4. I want to make 2 zpools: one mirrored for DATA and one striped for SCRATCH: 2 x 6 TB raw = 66% mirrored + 2 x 33% striped = 4 TB DATA + 4 TB SCRATCH
  5. Most likely the OS will be on a 3rd and 4th zpool on the SSDs, 50/50 mirrored/striped like the big zpools on the HDDs.
Decision to upgrade to 2 x 16 = 32 GB RAM is at ~50%, because:
  • L2ARC cache needs 1/10 of its size in RAM. I did not expect it to be that much.
  • A cache sized to handle average/unforeseen access patterns with sufficient effectiveness is usually in the lower one-digit percent range; a smaller cache (in the permille range) will only be effective with very special access patterns. Since this machine is most likely going to be a general-purpose home server, I would need L2ARC cache sizes of 2 x 40-120 GB (1-3%) for 2 x 4 TB zpools => 2 x 4-12 GB RAM for L2ARC, but I want to grant only ~8 GB total, i.e. for ARC + what the L2ARC needs in RAM. So I can try to start with the lowest reasonable value of 2 x 40 GB L2ARC, which consumes 2 x 4 GB RAM; plus the ARC will take approx. 1-2 GB, plus other ZFS RAM usage. The sum is at least 9-10 GB, more likely 12 GB(?), leaving 4-7 GB RAM for OS + services + VMs. That's not much, but it might be OK for the time being.
 
"one mirrored for DATA and one striped for SCRATCH": good thinking. L2ARC size: in memory or on disk?
I tried to be as precise as possible but here it is again:
L2ARC on NV (non-volatile) storage: 1-3% of what it caches = 2 x 40-120 GB on 2 NV (fast, NVMe) SSDs to cache 2 x 4 TB zpools (MIRROR and SCRATCH) on 2 rotating HDDs (6 TB each) => the L2ARC eats 1/10 of its NV size in RAM = 2 x 4-12 GB of RAM.
This means I can start with the 16 GB RAM I already have, but ZFS will eat approx. 10-12 GB of it. That leaves 4-6 GB for OS + applications; not much, but I'll see, maybe it's perfectly enough for some time.
 
ZFS does...; in sysctl.conf you can specify min and max (ARC) sizes, depending on your situation. It's OK for now; over time we'll have a better idea. But then you must have a partition free, or zpool space free, to implement that newer, better idea.
Now let's say I have 10 GB of RAM: you don't want a special, log, or cache device on a specific zpool to have 10 GB.
You take a multiple of 10 GB, 2x or 4x, no? Not 10% of it. I'm not all-knowing... :)
 
My rule of thumb: SSDs are bad, they die.
1% die: one in a hundred.
You make a ZFS mirror; there is no performance loss.
Both die: 0.01%, one in ten thousand.
So I have chosen this setup.
OK, I lose 50% of the possible capacity. But I have speed & reliability combined.
And for me an SSD has 10x the writing speed of a spinning drive.
If both die at the same time, I'll let you know.
 
Status to satisfy Alain De Vos


root@freebsd:~ # diskinfo -vciStw /dev/ada0
/dev/ada0
512 # sectorsize
6001175126016 # mediasize in bytes (5.5T)
11721045168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
11628021 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
HGST HUS726060ALE610 # Disk descr.
K1HBXDRB # Disk ident.
ahcich0 # Attachment
No # TRIM/UNMAP support
7200 # Rotation rate in RPM
Not_Zoned # Zone Mode

I/O command overhead:
time to read 10MB block 0.049873 sec = 0.002 msec/sector
time to read 20480 sectors 5.130729 sec = 0.251 msec/sector
calculated command overhead = 0.248 msec/sector

Seek times:
Full stroke: 250 iter in 4.847092 sec = 19.388 msec
Half stroke: 250 iter in 3.573786 sec = 14.295 msec
Quarter stroke: 500 iter in 3.478386 sec = 6.957 msec
Short forward: 400 iter in 1.040928 sec = 2.602 msec
Short backward: 400 iter in 2.685817 sec = 6.715 msec
Seq outer: 2048 iter in 0.078152 sec = 0.038 msec
Seq inner: 2048 iter in 0.522972 sec = 0.255 msec

Transfer rates:
outside: 102400 kbytes in 0.437139 sec = 234250 kbytes/sec
middle: 102400 kbytes in 0.510745 sec = 200491 kbytes/sec
inside: 102400 kbytes in 0.920451 sec = 111250 kbytes/sec

Asynchronous random reads:
sectorsize: 938 ops in 3.491822 sec = 269 IOPS
4 kbytes: 791 ops in 3.647868 sec = 217 IOPS
32 kbytes: 735 ops in 3.783458 sec = 194 IOPS
128 kbytes: 685 ops in 3.778035 sec = 181 IOPS
1024 kbytes: 411 ops in 4.413781 sec = 93 IOPS

Synchronous random writes:
0.5 kbytes: 18713.5 usec/IO = 0.0 Mbytes/s
1 kbytes: 19467.2 usec/IO = 0.1 Mbytes/s
2 kbytes: 21087.1 usec/IO = 0.1 Mbytes/s
4 kbytes: 13919.8 usec/IO = 0.3 Mbytes/s
8 kbytes: 14450.4 usec/IO = 0.5 Mbytes/s
16 kbytes: 14185.1 usec/IO = 1.1 Mbytes/s
32 kbytes: 13914.2 usec/IO = 2.2 Mbytes/s
64 kbytes: 14132.5 usec/IO = 4.4 Mbytes/s
128 kbytes: 14946.7 usec/IO = 8.4 Mbytes/s
256 kbytes: 15920.7 usec/IO = 15.7 Mbytes/s
512 kbytes: 18183.0 usec/IO = 27.5 Mbytes/s
1024 kbytes: 22316.1 usec/IO = 44.8 Mbytes/s
2048 kbytes: 31420.9 usec/IO = 63.7 Mbytes/s
4096 kbytes: 43258.7 usec/IO = 92.5 Mbytes/s
8192 kbytes: 68586.7 usec/IO = 116.6 Mbytes/s

root@freebsd:~ # diskinfo -vciStw /dev/ada1
/dev/ada1
512 # sectorsize
256060514304 # mediasize in bytes (238G)
500118192 # mediasize in sectors
0 # stripesize
0 # stripeoffset
496149 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
SanDisk SD8TB8U256G1001 # Disk descr.
171344425156 # Disk ident.
ahcich1 # Attachment
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
Not_Zoned # Zone Mode

I/O command overhead:
time to read 10MB block 0.035386 sec = 0.002 msec/sector
time to read 20480 sectors 4.328172 sec = 0.211 msec/sector
calculated command overhead = 0.210 msec/sector

Seek times:
Full stroke: 250 iter in 0.041690 sec = 0.167 msec
Half stroke: 250 iter in 0.064857 sec = 0.259 msec
Quarter stroke: 500 iter in 0.094409 sec = 0.189 msec
Short forward: 400 iter in 0.063015 sec = 0.158 msec
Short backward: 400 iter in 0.044831 sec = 0.112 msec
Seq outer: 2048 iter in 0.093662 sec = 0.046 msec
Seq inner: 2048 iter in 0.411172 sec = 0.201 msec

Transfer rates:
outside: 102400 kbytes in 0.273888 sec = 373875 kbytes/sec
middle: 102400 kbytes in 0.256306 sec = 399522 kbytes/sec
inside: 102400 kbytes in 0.292901 sec = 349606 kbytes/sec

Asynchronous random reads:
sectorsize: 140210 ops in 3.002973 sec = 46690 IOPS
4 kbytes: 220256 ops in 3.001690 sec = 73377 IOPS
32 kbytes: 46323 ops in 3.008458 sec = 15398 IOPS
128 kbytes: 12068 ops in 3.032022 sec = 3980 IOPS
1024 kbytes: 1494 ops in 3.278153 sec = 456 IOPS

Synchronous random writes:
0.5 kbytes: 1144.1 usec/IO = 0.4 Mbytes/s
1 kbytes: 1087.5 usec/IO = 0.9 Mbytes/s
2 kbytes: 1121.6 usec/IO = 1.7 Mbytes/s
4 kbytes: 766.2 usec/IO = 5.1 Mbytes/s
8 kbytes: 761.8 usec/IO = 10.3 Mbytes/s
16 kbytes: 1107.0 usec/IO = 14.1 Mbytes/s
32 kbytes: 1016.7 usec/IO = 30.7 Mbytes/s
64 kbytes: 1449.3 usec/IO = 43.1 Mbytes/s
128 kbytes: 1560.8 usec/IO = 80.1 Mbytes/s
256 kbytes: 2240.4 usec/IO = 111.6 Mbytes/s
512 kbytes: 3912.4 usec/IO = 127.8 Mbytes/s
1024 kbytes: 6466.4 usec/IO = 154.6 Mbytes/s
2048 kbytes: 11962.9 usec/IO = 167.2 Mbytes/s
4096 kbytes: 23648.5 usec/IO = 169.1 Mbytes/s
8192 kbytes: 47014.8 usec/IO = 170.2 Mbytes/s

diskinfo: /dev/nvme0: ioctl(DIOCGMEDIASIZE) failed, probably not a disk.
root@freebsd:~ # diskinfo -vciStw /dev/nda0
/dev/nda0
512 # sectorsize
256060514304 # mediasize in bytes (238G)
500118192 # mediasize in sectors
0 # stripesize
0 # stripeoffset
SK hynix BC511 HFM256GDJTNI-82A0A # Disk descr.
CY04N08281060530A # Disk ident.
nvme0 # Attachment
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM

I/O command overhead:
time to read 10MB block 0.016807 sec = 0.001 msec/sector
time to read 20480 sectors 2.627173 sec = 0.128 msec/sector
calculated command overhead = 0.127 msec/sector

Seek times:
Full stroke: 250 iter in 0.027117 sec = 0.108 msec
Half stroke: 250 iter in 0.031419 sec = 0.126 msec
Quarter stroke: 500 iter in 0.053075 sec = 0.106 msec
Short forward: 400 iter in 0.035027 sec = 0.088 msec
Short backward: 400 iter in 0.042218 sec = 0.106 msec
Seq outer: 2048 iter in 0.268218 sec = 0.131 msec
Seq inner: 2048 iter in 0.203938 sec = 0.100 msec

Transfer rates:
outside: 102400 kbytes in 0.146975 sec = 696717 kbytes/sec
middle: 102400 kbytes in 0.146425 sec = 699334 kbytes/sec
inside: 102400 kbytes in 0.145927 sec = 701721 kbytes/sec

Asynchronous random reads:
sectorsize: 533723 ops in 3.000711 sec = 177866 IOPS
4 kbytes: 536045 ops in 3.000729 sec = 178638 IOPS
32 kbytes: 80728 ops in 3.004943 sec = 26865 IOPS
128 kbytes: 20242 ops in 3.017207 sec = 6709 IOPS
1024 kbytes: 2645 ops in 3.149376 sec = 840 IOPS

Synchronous random writes:
0.5 kbytes: 2575.5 usec/IO = 0.2 Mbytes/s
1 kbytes: 2470.7 usec/IO = 0.4 Mbytes/s
2 kbytes: 594.1 usec/IO = 3.3 Mbytes/s
4 kbytes: 493.1 usec/IO = 7.9 Mbytes/s
8 kbytes: 508.1 usec/IO = 15.4 Mbytes/s
16 kbytes: 525.0 usec/IO = 29.8 Mbytes/s
32 kbytes: 551.7 usec/IO = 56.6 Mbytes/s
64 kbytes: 527.0 usec/IO = 118.6 Mbytes/s
128 kbytes: 655.5 usec/IO = 190.7 Mbytes/s
256 kbytes: 823.8 usec/IO = 303.5 Mbytes/s
512 kbytes: 1184.0 usec/IO = 422.3 Mbytes/s
1024 kbytes: 1863.2 usec/IO = 536.7 Mbytes/s
2048 kbytes: 3222.2 usec/IO = 620.7 Mbytes/s
4096 kbytes: 5842.9 usec/IO = 684.6 Mbytes/s
8192 kbytes: 11272.3 usec/IO = 709.7 Mbytes/s
Result summary of these (naive) diskinfo(8) tests:
  • The SATA HDD makes about 110-235 MB/s read, up to 115 MB/s@8 MB random writes and 270 IOPS@512 B
  • The SATA SSD makes ~ 350-400 MB/s read, up to 170 MB/s@2-8 MB random writes and ~75k IOPS@4 kB
    (the 1st run gave up to ~260 MB/s@1 MB random writes)
  • The NVMe SSD makes ~700 MB/s read, up to 700 MB/s@4-8 MB random writes and 180k IOPS@4 kB and 512 B
Nothing special, all as expected. These results strengthen my decision to create two L2ARC cache vdevs, in addition to the log and special vdevs, on the NVMe SSDs for the two zpool(8)s on the rotating HDDs, although the L2ARC cache vdevs will eat much precious RAM, and against sko's advice. My experience is that storage access speed and IOPS can never be too high and latency can never be too low. Likely his advice was motivated by the fact that often (usually?) the availability of free RAM is valued higher than the speed of storage access. Or his guess of my use cases implies that I'll need so much RAM that there's not enough left to manage a reasonably sized L2ARC cache.
  • I cut my first estimate for the L2ARC cache size in half, below 1% (my estimate for the smallest reasonable cache size for average common access patterns) to 0.5%, to reduce its RAM usage.
With that smaller cache, my extrapolation/guess is that ZFS with L2ARC caches of 2 x 20 GB (serving 2 x 4 TB zpools) will use about 2 x 2 = 4 GB of RAM for the headers; with the ARC and other ZFS usage that sums up to a total of 6-8 GB RAM, so there'll be 8-10 GB RAM left for the OS and applications (excl. what the BIOS and onboard GPU take). This matches my guess of a ~50/50 ZFS/other RAM usage. It still smells like I'm going to double the RAM soon, but this setup looks OK as a starting point.
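The revised budget can be re-checked with the same rule of thumb. All figures below are the planning numbers from this thread; the ARC-plus-overhead term is an assumption, not a measurement:

```shell
# Re-check the revised RAM budget using the thread's rule of thumb
# (~100 MB of RAM per GB of L2ARC). All figures are planning numbers.
l2arc_gb=$((2 * 20))                  # two 20 GB L2ARC devices
headers_gb=$((l2arc_gb * 100 / 1024)) # L2ARC header overhead in RAM
arc_other_gb=4                        # ASSUMED ARC + other ZFS usage
total_ram_gb=16
zfs_gb=$((headers_gb + arc_other_gb))
echo "ZFS: ~${zfs_gb} GB, left for OS/apps: ~$((total_ram_gb - zfs_gb)) GB"
```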

Now I'll start the installation of a very basic FreeBSD onto an md(4) that will eventually be copied to a ZFS on the internal USB thumb drive.
  • Why do I use an md(4) in-memory disk?
    Non-volatile flash memory (used on SSDs etc.) can be overwritten about 1,000 times (cheap consumer devices) to 100,000 times (expensive professional devices). Thus every write access brings the SSD nearer to death, and that's why we want to reduce the number of write accesses. As long as I'm installing the basic software, not only does the data of the filesystem change (this does not matter, because that will be written to the flash drive anyway), but more importantly its metadata changes very often (this does matter, IIUC). Thus I'll wait to copy it over to the USB thumb drive until the wizards in this forum have told me which basic tools, plus maybe some goodies from the ports(7), I should install (other thread).
  • Why do I not install onto the NVMe SSDs? Because
  1. I'd like to have the base OS separated from the storage devices, so changing them will be easier, and
  2. it's not yet clear how many partitions I'll have on the NVMe SSDs, and their sizes. I'll follow gpw928's advice and
1. reserve swap space on the HDDs (+1 partition, size = 1.5 x RAM size) and
2. reserve space on the NVMe SSDs for 2 geom(4) devices (2 partitions), one a mirror and the other striped.
  • Why do I choose ZFS and not UFS on the (slow) internal USB 2.x flash drive?
    It can have a small L2ARC cache on the NVMe SSDs, a feature that geom(4) does not offer (the geom_cache module does something else), and this will speed up read access drastically once the ZFS cache module starts to work, i.e. sometime during the boot process.
 
[...] I would also install swap here, but keep watch and move it to a SATA mirror if the swap gets too intensive.
Reasoning: we want to minimize the number of writes to NV flash, because it wears out, and consumer-grade NV flash -- which is what I have, and even worse, already used -- wears out very quickly. So I'll have 2 extra swap partitions on the rotating devices; whether these are mirrored or not is another topic. ATM my decision is that I don't need the safety of a mirrored swap; this can easily be changed if I flip my decision.
And I constantly watch out for a good opportunity to "shoot" a 3rd NVMe M.2 SSD, to have a spare at hand so I can replace a broken one ASAP.
[...] They are a challenge to size correctly. So put it adjacent to the swap space, and move the swap to SATA if more space is required. Plan for this when you partition the disks.
OK, when I had my private 14-CPU Sun E4500, a machine that burns 2 kW of electric power and blows you away because it sounds like a 737 starting beside you, I did all this; but I had ~20+ disks in the 2 arrays, so I never cared about partitions or disks, because it was easy and natural to just use whole disks for ANYTHING. But times have changed, and on the single-vdev zpools on my laptops it was nonsense to care about these support vdevs. That's why I never needed to know the preliminaries.
  • None of your answers so far say so explicitly, but I read them as implying that support vdevs have to live on partitions or whole disks, and there might be a SILENT ASSUMPTION that it's crystal clear they could never live on ZVOLs, because of an unspoken assumption that I know this. No, I don't. According to my limited understanding, I can (regularly) use a ZVOL like a disk or partition, but there are a few exceptions: e.g. I can, but shall not, use a ZVOL as a swap device, and some other use cases are possible but should be avoided (horribly bad performance).
  • Q: Can the support vdevs live on ZVOLs or do they have to live on raw partitions (whole disk does not apply to my setup)?
  • Q: If they can't live on a ZVOL (on another zpool, JFTR), then WTH is the LVM part of ZFS good for??? You're asking me to burn fixed partition sizes into my mini-mainframe like in 1986??? WTF, I did that 40 years ago! I'm not gonna do that kind of <censored> now in 2026, with a reasonably modern OS and an advanced filesystem!
I would try to place my important VMs on the SSDs. But beware: you don't want both the hypervisor and the VM client using copy-on-write file systems.
OK, I have a dim memory that I understood the caveats of a COW fs housing another COW fs a long time ago, but I can't reconstruct that reasoning. Would you be so kind as to give me some keywords, links, or a short outline of the reasons?
 
Now what I wanted to do is use a ZVOL from a zpool on the faster devices as a special, cache and log vdev for a zpool on the slower devices, because I want to avoid using partitions: they are not flexible, the size is fixed, and changing it requires copying all the data to another device and back after resizing. In contrast, ZVOLs can be resized easily; that's the LVM side of ZFS.
Why should that not be possible? Are there any disadvantages to using a ZVOL as a special, cache or log vdev for another zpool(8) on different (and significantly slower) devices?
Interesting idea.

I see that you have posted above asking about the viability of using ZVOLs for VDEVs. I'm not certain of the answer. ZFS capabilities change all the time. Maybe it's possible. I suggest you plug in a thumb drive and try it.

VDEVs used for "support" generally exist to furnish superior hardware performance. I have never heard of a ZVOL being used to furnish a "support" VDEV -- there you are putting the whole ZFS stack in the way of the underlying fast hardware. E.g. ZVOLs can be sparse, which is a red flag for the latency absolutely required by a log VDEV. You are going to a place where one would have to be intrepid, and I would want to see a whole lot of prior art.
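For what it's worth, the "try it" experiment can be shaped without sacrificing a thumb drive by using file-backed pools. This is purely a hypothetical sketch (pool names and image paths are invented, it needs root and a ZFS-capable system); it only probes whether zpool add accepts the ZVOL at all, not whether the result is safe or fast:

```shell
# Hypothetical experiment: will ZFS accept a ZVOL as a support vdev?
# File-backed pools stand in for real disks; all names are invented.
# Requires root and ZFS. Destroy both pools when done.
truncate -s 1G /tmp/fast.img /tmp/slow.img
zpool create fastpool /tmp/fast.img
zpool create slowpool /tmp/slow.img
zfs create -V 256M fastpool/special0            # the candidate ZVOL
zpool add slowpool special /dev/zvol/fastpool/special0
zpool status slowpool                           # did ZFS accept it?
```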
 
mjolnir: Why would I want a geom mirror?
You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirrored UFS file system. See my comments above re COW client file systems provisioned on COW server file systems.
You didn't say WHY in your comments above.
[...] Expanding VDEVs is a lot easier these days than it used to be, but you still need to size them, [...]
I understand that my OS wants a fixed-size swap device, and I give it one. I understand that for some use cases (DB stuff and VMs) it will be better to use a geom device with no ZFS between geom and the physical storage device, so I'll create two geoms, a mirror and a stripe, on the NVMe SSDs (beside the zpools). But I refuse to burn in fixed sizes for vdevs when everyone (e.g. you) tells me, and I read everywhere, that it's hard to guess in advance how much will be needed. This is the most natural use case to throw at an LVM, isn't it?
 
OK, I have a dim memory that I understood the caveats of a COW fs housing another COW fs a long time ago, but I can't reconstruct that reasoning. Would you be so kind as to give me some keywords, links, or a short outline of the reasons?
Google: VM client cow file system on VM server cow file system amplifies rewrites
Performance Penalty: This recursive behavior can cause excessive I/O overhead and flash endurance issues, with some studies showing up to 29.5x write amplification and 71% performance degradation in high-write workloads.
 
Google: VM client cow file system on VM server cow file system amplifies rewrites
WRITE AMPLIFICATION!!! Ya ya ya, now I remember! But only the term... the words...
OK, thanks a lot. I'll quickly find good explanations, and likely I'll only have to read halfway through (the first few paragraphs of pages) before I can reconstruct the rest.
 
You didn't say WHY in your comments above.
I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW/COW. ZFS is COW. If your client is COW, then you may want UFS on the server side, and you may want a mirror for redundancy on the server. [You can still have ZFS, and all its advantages, on the client side.]
 
I see that you have posted above asking about the viability of using ZVOLs for VDEVs. I'm not certain of the answer. ZFS capabilities change all the time. Maybe it's possible. I suggest you plug in a thumb drive and try it.
THANKS a lot! FYI, the following seems to be some kind of personal diary/log, so I'll delete it from the forum and put it where it belongs: on the preliminary mirror (256 GB SATA SSD + 6 TB SATA HDD, what a funny combination) on my shiny "new" server ;)
 
I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW/COW.
Yes, exactly! Note that the point of interest is OUTSIDE your quotation marks :)
And you had the silent assumption that I knew that COW-on-COW should be avoided, but I didn't.
Silent assumptions, unspoken agreements, whatever you call them: these beasts are among the nastiest, most insidious pitfalls in human communication (and, sadly, in software development too).
 
I understand that my OS wants a fixed-size swap device, and I give it one. I understand that for some use cases (DB stuff and VMs) it will be better to use a geom device with no ZFS between geom and the physical storage device, so I'll create two geoms, a mirror and a stripe, on the NVMe SSDs (beside the zpools). But I refuse to burn in fixed sizes for vdevs when everyone (e.g. you) tells me, and I read everywhere, that it's hard to guess in advance how much will be needed. This is the most natural use case to throw at an LVM, isn't it?
ZFS offers a lot of things that you don't get from LVM. Proximity to hardware using "support" (log, special, cache) VDEVs is one of them.

At its most basic, the building block for a VDEV is a hardware device of fixed size. Support VDEVs need to be close to the hardware to get the required performance. Adding an "LVM-like" PV, VG, and LV layer (for functionality at the cost of performance) is a topic for another forum.

However, the ZFS development team has been working hard at improving VDEV expansion options, but these mostly apply to the VDEVs you are likely to use on the data storage side of things. I'd have to check the latest docs, but the last time I expanded a mirror (which is what you are likely to use for "support" VDEVs), the easiest way was to add an extra, larger mirror component (or two), removing the original VDEV component(s) after the resilver. Knowing you may need to do that for a "support" VDEV is simply being forewarned -- and may impact your disk partitioning decisions.

For instance, I always leave a 10% to 20% portion of each SSD with an unused UFS file system initialised with newfs -t. This provides immediate over-provisioning. But it also admits that I might later want to change my mind about the layout...

[At its most simple, "mirror" might mean a single disk (slice). "support" VDEVs should always have a redundancy at least the equal of the "storage" VDEVs with which they are associated.]
 