Server layout: rootfs on USB flash drive or NVMe?

I said:
I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW/COW. ZFS is COW. If your client is COW, then you may want UFS on the server side, and you may want a mirror for redundancy on the server. [You can still have ZFS, and all its advantages, on the client side.]
You said:
Yes, exactly! Note that the point of interest is OUTSIDE your quotation marks :)
And you had this silent assumption that I knew that COW/COW should be avoided, but I didn't.
Silent assumptions, unspoken agreements, whatever you call them: these beasts are among the nastiest, most insidious pitfalls in human communication (and sadly in software development, too).
I had raised the issue of "avoiding COW/COW" in an earlier post:
I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems.
 
 
gpw928 said:
I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems.
  1. See, you wrote "you don't want CoW/CoW" several times (three? four?), but not WHY. That's the point of interest. Why should I (or anyone else) blindly follow your advice when it doesn't include at least a link where I can read an explanation, or a short outline of the reasoning? Even a few keywords are often enough: in this case, "write amplification" would have triggered my memory instantly.
  2. When I asked for an explanation WHY (twice), you just repeated "I already wrote that you don't want CoW/CoW" or "see my explanation above on CoW/CoW" -- which was NOT an explanation, but a statement like "it is a fact". That naturally makes one suspicious, and is a reason NOT to follow the advice of someone who is either not able or not willing to answer the question about the WHY.
Fortunately the fog has lifted.
 
Status update & a nice argument to reuse & buy used parts:
  • When you buy your equipment in internet auctions, watch out for the details!
    I searched for M.2 NVMe 1.3 SSDs, because the target board has only PCIe 3.0 x4, so plugging in more modern, faster devices wouldn't make sense.
    I decided 2 SSDs of 256 GB each would fit my needs. But on the photo of the device you can see it's actually 1024 GB (when you enlarge it), so I clicked "buy now" quickly and paid a little bit more. Now I have a 1 TB NVMe 1.3 SSD for the price of a 256 GB one. :)
NOTE: NO ONE WILL BE INTERESTED! THIS IS KIND OF A PERSONAL LOG!
If you read this far, you may consider reading more interesting topics in the forum.
OK, now I'm going to shuffle the 3 SATA SSDs I have and do the following:
  1. Build the 1 TB NVMe SSD (arrived today) into the server beside its 256 GB counterpart and create the partitions for swap, the support vdevs (2x cache, log, special) for the 2 zpools on the bigger HDDs (only one ATM), a geom mirror, and maybe a zpool (mirrored).
    I'll decide later how to use the free ~700 GB of the bigger SSD.
  2. Install FreeBSD 15-STABLE on my old laptop onto one of the 2 small SATA SSDs that I bought to revive it, plus KDE so I have a GUI.
  3. Install FreeBSD 15-STABLE onto the other small SATA SSD in an external USB 3.x SATA case; this one will go into my main laptop later, plus KDE so I can start with a GUI quickly.
  4. Pull the 1 TB SATA SSD out of my main laptop and put it into the server.
  5. Put the SATA SSD from step 3 into my main laptop; hopefully the GUI doesn't need any more configuration, so I instantly have graphical Internet access etc. Time without a working GUI & Internet access is zero thanks to the revived old laptop.
  6. Log into the server via ssh(1).
    The server still has no OS installed; it is booted from the tweaked FreeBSD installer that starts sshd(8).
  7. Create the partitions on the HDD: swap, plus a mirrored and a striped zpool.
  8. Mirror (resilver) the zpool on the HDD from the 1 TB SATA SSD (previously in my main laptop). The mirror side on the HDD will naturally be much larger than needed for now.
  9. Export it via NFS: zfs set sharenfs=on pool/data-mirror
  10. Mount the old data via NFS from the server on the two laptops.
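Steps 9 and 10 might look like this in practice. This is only a sketch: the pool name is the one from the plan, the server hostname and client mount point are hypothetical examples, and the server additionally needs the NFS services (nfs_server_enable, mountd_enable, rpcbind_enable) set in rc.conf:

```shell
# On the server: share the dataset via NFS.  ZFS handles the export
# itself once the sharenfs property is set (see zfsprops(7)):
zfs set sharenfs=on pool/data-mirror

# On each laptop: mount the export.  The NFS path is the dataset's
# mountpoint (default /pool/data-mirror); /mnt/data is an example:
mkdir -p /mnt/data
mount -t nfs freezer:/pool/data-mirror /mnt/data
```

Options like `sharenfs="-network 192.168.1.0/24"` can restrict the export to the local network instead of plain `on`.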
 
NOTE: NO ONE WILL BE INTERESTED! THIS IS KIND OF A PERSONAL LOG!
If you read this far, you may consider reading more interesting topics in the forum.
*blink-blink* ...I'm interested in why you, or anyone, would think that.

I'm fairly sure present and future guests & users would thank you for your human-generated content/updates/information (you must surely realize that others may be attempting the same or a similar thing, and they may very well appreciate all the information they can get).
 
Status now: two small mirrors on the NVMe SSDs, one ZFS and one UFS/GEOM. The one half of the data mirror on the big HDD is already supported by a striped cache and mirrored log and special vdevs on the NVMe.

Code:
root@freezer:~ # gmirror status
        Name    Status  Components
mirror/igloo  COMPLETE  nda1p12 (ACTIVE)
                        nda0p12 (ACTIVE)
root@freezer:~ # zpool iostat -v
                         capacity     operations     bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
data                    504K  3.37T      0      1    333  29.4K
  gpt/data1                0  3.33T      0      0     66  4.19K
special                    -      -      -      -      -      -
  mirror-2              504K  47.5G      0      1    133  17.2K
    gpt/dataspecial0       -      -      0      0     66  8.61K
    gpt/dataspecial1       -      -      0      0     66  8.61K
logs                       -      -      -      -      -      -
  mirror-1                 0  11.5G      0      0    133  8.03K
    gpt/datalog0           -      -      0      0     66  4.01K
    gpt/datalog1           -      -      0      0     66  4.01K
cache                      -      -      -      -      -      -
  gpt/datacache0           0  12.0G      0      0  1.28K    122
  gpt/datacache1           0  12.0G      0      0  1.28K    122
---------------------  -----  -----  -----  -----  -----  -----
freezer                2.73G  54.3G      0      0    872  97.1K
  gpt/freezer          2.73G  54.3G      0      0    872  97.1K
cache                      -      -      -      -      -      -
  gpt/bootcache0        701M  1.31G      0      0    556  19.8K
  gpt/bootcache1        689M  1.32G      0      0    536  20.5K
---------------------  -----  -----  -----  -----  -----  -----
icebox                  444K  55.5G      0      0    242  1.57K
  mirror-0              444K  55.5G      0      0    242  1.57K
    gpt/zfsmirror0         -      -      0      0    121    803
    gpt/zfsmirror1         -      -      0      0    121    803
---------------------  -----  -----  -----  -----  -----  -----
scratch                 576K  2.14T      0      0    607  16.2K
  gpt/scratch1             0  2.09T      0      0     42  2.66K
special                    -      -      -      -      -      -
  gpt/scratchspecial0   208K  23.5G      0      0     42  3.73K
  gpt/scratchspecial1   368K  23.5G      0      0     42  4.66K
logs                       -      -      -      -      -      -
  gpt/scratchlog0          0  5.50G      0      0     42  2.55K
  gpt/scratchlog1          0  5.50G      0      0    438  2.55K
cache                      -      -      -      -      -      -
  gpt/scratchcache0        0  12.0G      0      0    834     77
  gpt/scratchcache1       4K  12.0G      0      0    834     99
---------------------  -----  -----  -----  -----  -----  -----
root@freezer:~ # top|head -8
last pid: 91409;  load averages: 0.30, 0.32, 0.26  up 0+03:23:21  09:42:58
51 processes:  2 running, 47 sleeping, 2 waiting
CPU:  0.1% user,  0.0% nice,  0.3% system,  0.0% interrupt, 99.7% idle
Mem: 14M Active, 136M Inact, 1448M Wired, 266K Buf, 14G Free
ARC: 641M Total, 162M MFU, 353M MRU, 128K Anon, 14M Header, 91M Other
     445M Compressed, 1064M Uncompressed, 2.39:1 Ratio
Swap: 16G Total, 16G Free
So ZFS is smart enough not to occupy RAM as long as the zpools are empty.
 
For performance reasons I'm going to remove the mirror of the cache.
A zpool(8) cache vdev cannot be a ZFS mirror -- RTFM zpoolconcepts(7); it could sit on a geom(4) mirror though, IIUC. Please read the description of my setup carefully:
  • All L2ARC caches are striped over partitions on 2 NVMe SSDs.
[...] and log & use stripe.
  • The redundancy level of the intent log and special vdev should match that of the zpool(8); thus I use mirrored log and special vdevs for my mirrored zpool, and striped log and special vdevs for my striped scratch zpool.
  • Please do not repeat this bad advice to others, as it violates common sense and practice (*).
vfs.zfs.txg.timeout=5.
This seems to be already the default in FreeBSD 15 (I'm on -STABLE):

root@freezer:~ # sysctl vfs.zfs.txg.timeout
vfs.zfs.txg.timeout: 5
root@freezer:~ # fgrep vfs.zfs.txg.timeout /{boot{/defaults,}/loader,etc/sysctl}.conf
root@freezer:~ #

Now my plan is to stay with -STABLE on at least one of my laptops, and on the server I will go to 15.2-RELEASE, but keep a boot environment for -STABLE, so I can help with testing and bug reports.

ETA of the 2nd HDD is in 3 days, so I have enough time to polish the setup of my old laptop and save the data from my main laptop to the existing half of my data mirror. When the 2nd HDD arrives and joins the mirror, resilvering starts and my data will be safe. Second, I constantly try to "shoot" another 256 GB NVMe M.2 2280 SSD to have a spare. Since they can only be overwritten about 1000 times, these are consumables, and experience shows that they die suddenly, without any prior warning.
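The "about 1000 overwrites" figure can be turned into a rough endurance estimate. A back-of-the-envelope sketch; both the 1000 P/E cycles and the 20 GB/day write volume are assumptions, not datasheet values:

```shell
# Rough SSD endurance: capacity (GB) x P/E cycles / 1000 = total terabytes
# writable (TBW).  256 GB drive, ~1000 program/erase cycles assumed:
capacity_gb=256
pe_cycles=1000
tbw=$((capacity_gb * pe_cycles / 1000))
echo "${tbw} TBW"                       # -> 256 TBW

# At an assumed 20 GB written per day, expected lifetime in years:
awk -v tbw="$tbw" 'BEGIN { printf "%.0f years\n", tbw * 1000 / 20 / 365 }'
# -> 35 years
```

In practice wear leveling, write amplification, and heavy workloads (like a ZFS log vdev) can eat into this considerably, so keeping a cold spare is still sensible.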
 
I don't advise anybody anything; everyone must use their own "intelligence". But yes, about the mirror I used the wrong wording.
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
 
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
This is OK if the log and cache vdevs are on devices that are faster than SSD_A and SSD_B. But your log vdev has a lower redundancy level than the zpool, and that violates common practice, advice, and common sense, too. It might be OK though, if the services and data it houses are not "mission critical", i.e. you can easily handle the loss of a few minutes of data.
 
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
The sizes of your support vdevs seem strange to me.

  1. I computed the sum of the sizes of all small files in my data:

    find / -type f -size 1 -print > files.1blk.list
    ...
    find / -type f -size 16 -print > files.16blk.list
    wc -l files.1blk.list (result/2000 = MB on disk of all 1-block files)
    ...
    wc -l files.16blk.list (result/125 = MB on disk of all 16-block (7.5-8 kB) files)

    The sum is ~0.4% for my data on the half-filled 1 TB SSD in my laptop. Of course this number highly depends on the type of data. Then I estimated a factor of 1.5 to account for filesystem metadata for these small files; plus, the special vdev also holds other metadata of its zpool, IIUC: I guess the parts that are accessed more often go into the special vdev, while less frequently accessed metadata stays on the slower device(s). Your special vdev's size is ~1/6 of the user-data zpool, and I guess this could be oversized. Of course it depends on how high you set special_small_blocks; I extrapolated from what I already have to the storage size of my NAS box and set:

    zfs set special_small_blocks=8K data
    zfs set special_small_blocks=8K scratch
  2. Your log vdevs are 64 GB combined. Remember that the intent log never needs to be larger than your RAM. I'd guess that 50% of RAM size is more than sufficient for common workloads.
  3. The combined size of your L2ARC cache is 128 GB.
    3.1. This will occupy up to 10% of its size in RAM, i.e. you'll have to account for ~13 GB of RAM just to manage the L2ARC, plus ZFS needs RAM for the ARC and other uses.
    3.2. Your L2ARC cache size is ~25% of the data it caches; IMHO a size in the single-digit percent range is usually fully sufficient for common workloads.
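The RAM overhead in point 3.1 is quick to check with arithmetic. A sketch; the 10% figure is the rule of thumb from the text (a worst case for small records), and the ~70 bytes of header RAM per cached record is an assumption that varies by OpenZFS version:

```shell
# Worst-case L2ARC indexing cost, using the ~10%-of-its-size rule of thumb
# for a 128 GB L2ARC:
awk 'BEGIN { printf "%.1f GB\n", 128 * 0.10 }'       # -> 12.8 GB

# With large records the real overhead is far lower.  Assuming ~70 bytes of
# header RAM per cached record and a 128K recordsize:
awk 'BEGIN { printf "%.0f MB\n", (128 * 2^30 / 2^17) * 70 / 2^20 }'
# -> 70 MB
```

So whether the "10%" rule bites depends almost entirely on the record sizes of the cached data; a pool full of tiny files sits near the worst case.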
 
AIUI you want your special vdevs to have exactly the same redundancy as your main pool, which would complicate adding one of those to our 8-way raidz3.
Assuming that your machine has 2 M.2 slots (or other connectors for fast NVRAM), you can use a mirror and always have a spare that you glue inside the case of the machine. You can compute the probability of failure, and maybe it's low enough for you.
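That failure probability can be sketched numerically. A simplified model assuming independent failures and a made-up 1% annual failure rate per device (substitute your drive's datasheet AFR); real-world risk is higher because the window that matters is the resilver time after the first failure, and failures are often correlated:

```shell
# Probability that BOTH sides of a 2-way mirror die within the same year,
# assuming independent failures with a 1% annual failure rate (AFR) each:
awk 'BEGIN {
    afr  = 0.01          # annual failure rate per device (assumption)
    both = afr * afr     # both fail independently in the same year
    printf "mirror loss per year: %.4f%%\n", both * 100
}'
# -> mirror loss per year: 0.0100%
```

Even with these optimistic assumptions, the point stands: a mirror turns a 1-in-100 annual risk into roughly 1-in-10,000, and a cold spare shortens the vulnerable rebuild window further.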
 
The high usage of my special devices drops down when the poudriere build finishes. I think build jails open a lot of little files.
Yes, IIUC it creates a fresh jail(8) for every port it builds, installs the build-dependency ports, runs make install, and deletes the build jail afterwards. That means a lot of stress for the underlying storage: many files are created and then deleted when the build finishes, i.e. many blocks are occupied and freed shortly after. This wears out the SSDs. We want to minimize the number of writes to the NVRAM, and maybe we can also speed up the builds:
  • IIRC poudriere uses thin jails; if not, tweak it to do that.
  • Use ccache and tweak the builds to reuse the commonly created intermediate files (mainly *.o). ccache(1) can also be used with distcc(1), so you can let other machines in your network help compile. If needed, limit their resources so that distcc does not hinder their normal operation.
    portfind -d ccache also shows that sccache claims to be usable with poudriere.
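For the ccache suggestion, the relevant knob is CCACHE_DIR in poudriere.conf, which is a documented poudriere option; the cache path and size below are examples, not recommendations:

```shell
# /usr/local/etc/poudriere.conf -- point poudriere at a shared ccache
# directory so object files survive across throwaway build jails:
CCACHE_DIR=/var/cache/ccache

# One-time setup on the host: create the directory and cap the cache size
# (20G is an example; CCACHE_DIR here is the standard ccache env variable):
mkdir -p /var/cache/ccache
env CCACHE_DIR=/var/cache/ccache ccache -M 20G
```

Because the cache lives outside the jails, repeated builds of the same port (or ports sharing sources) hit the cache instead of rewriting the same objects to the SSD.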
 