Server layout: rootfs on USB flash drive or NVMe?

I said:
I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW/COW. ZFS is COW. If your client is COW, then you may want UFS on the server side, and you may want a mirror for redundancy on the server. [You can still have ZFS, and all its advantages, on the client side.]
You said:
Yes, exactly! Note that the point of interest is OUTSIDE your quotation marks :)
And you had this silent assumption that I knew that COW/COW should be avoided, but I didn't.
Silent assumptions, unspoken agreements, whatever you call them: these beasts are among the nastiest, most insidious pitfalls in human communication (and sadly in software development, too).
I had raised the issue of "avoiding COW/COW" in an earlier post:
I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems.
 
 
gpw928 said:
I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems.
  1. See, you wrote "you don't want CoW/CoW" several times (three? four?), but not WHY. That's the point of interest. Why should I (or anyone else) blindly follow your advice when it doesn't include at least a link where I can read an explanation, or a short outline of the reasoning? Even a few keywords are often enough: in this case, "write amplification" would have triggered my memory instantly.
  2. When I asked for an explanation WHY (twice), you just repeated "I already wrote that you don't want CoW/CoW" or "see my explanation above on CoW/CoW" -- which was NOT an explanation, but a statement like "it is a fact". That naturally makes one suspicious, and is a reason NOT to follow the advice of someone who is either not able or not willing to answer the question about the WHY.
Fortunately the fog has lifted.
 
Status update & a nice argument to reuse & buy used parts:
  • When you buy your equipment in internet auctions, watch out for the details!
    I searched for M.2 NVMe 1.3 SSDs, because the target board has only PCIe 3.0 x4, so plugging in more modern, faster devices wouldn't make sense.
    I decided 2 SSDs of 256 GB each would fit my needs. But on the photo of the device you can see it's actually 1024 GB (when you enlarge it), so I clicked "buy now" quickly and paid a little bit more. Now I have a 1 TB NVMe 1.3 SSD for the price of a 256 GB one. :)
NOTE: NO ONE WILL BE INTERESTED! THIS IS KIND OF A PERSONAL LOG!
If you read this far, you may consider reading more interesting topics in the forum.
OK, now I'm going to shuffle the 3 SATA SSDs I have and do the following:
  1. Build the 1 TB NVMe SSD (arrived today) into the server beside its 256 GB counterpart and create the partitions for swap, the support vdevs (2x cache, log, special) for the 2 zpools on the bigger HDDs (only one ATM), a geom mirror, and maybe a zpool (mirrored).
    I'll decide later how to use the free ~700 GB of the bigger SSD.
  2. Install FreeBSD 15-STABLE on my old laptop onto one of the 2 small SATA SSDs that I bought to revive it, plus KDE so I have a GUI.
  3. Install FreeBSD 15-STABLE onto the other small SATA SSD in an external USB 3.x SATA case; this one will go into my main laptop later, plus KDE so I can start with a GUI quickly.
  4. Pull the 1 TB SATA SSD out of my main laptop and put it into the server.
  5. Put the SATA SSD from step 3 into my main laptop; hopefully the GUI doesn't need any more configuration, so I instantly have graphical Internet access etc. Time without a working GUI & Internet access is zero thanks to the revived old laptop.
  6. Log into the server via ssh(1).
    The server still has no OS installed; it is booted from the tweaked FreeBSD installer that starts sshd(8).
  7. Create the partitions on the HDD: swap, plus a mirrored and a striped zpool.
  8. Mirror (resilver) the zpool on the HDD from the 1 TB SATA SSD (previously in my main laptop). The mirror side on the HDD will naturally be much larger than needed for now.
  9. Export it via NFS: zfs set sharenfs=on pool/data-mirror
  10. Mount the old data via NFS from the server on the two laptops.
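Steps 9 and 10 might look like this in practice. This is only a sketch: the pool name is the one from the plan, the server hostname and client mount point are hypothetical examples, and the server additionally needs the NFS services (nfs_server_enable, mountd_enable, rpcbind_enable) set in rc.conf:

```shell
# On the server: share the dataset via NFS.  ZFS handles the export
# itself once the sharenfs property is set (see zfsprops(7)):
zfs set sharenfs=on pool/data-mirror

# On each laptop: mount the export.  The NFS path is the dataset's
# mountpoint (default /pool/data-mirror); /mnt/data is an example:
mkdir -p /mnt/data
mount -t nfs freezer:/pool/data-mirror /mnt/data
```

Options like `sharenfs="-network 192.168.1.0/24"` can restrict the export to the local network instead of plain `on`.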
 
NOTE: NO ONE WILL BE INTERESTED! THIS IS KIND OF A PERSONAL LOG!
If you read this far, you may consider reading more interesting topics in the forum.
*blink-blink* ...I'm interested in why you, or anyone, would think that.

I'm fairly sure present and future guests & users would thank you for your human-generated content/updates/information (you must surely realize that others may be attempting the same or a similar thing, and they may very well appreciate all the information they can get).
 
Status now: two small mirrors on the NVMe SSDs, one ZFS and one UFS/GEOM. The one half of the data mirror on the big HDD is already supported by a striped cache and mirrored log and special vdevs on the NVMe.

Code:
root@freezer:~ # gmirror status
        Name    Status  Components
mirror/igloo  COMPLETE  nda1p12 (ACTIVE)
                        nda0p12 (ACTIVE)
root@freezer:~ # zpool iostat -v
                         capacity     operations     bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
data                    504K  3.37T      0      1    333  29.4K
  gpt/data1                0  3.33T      0      0     66  4.19K
special                    -      -      -      -      -      -
  mirror-2              504K  47.5G      0      1    133  17.2K
    gpt/dataspecial0       -      -      0      0     66  8.61K
    gpt/dataspecial1       -      -      0      0     66  8.61K
logs                       -      -      -      -      -      -
  mirror-1                 0  11.5G      0      0    133  8.03K
    gpt/datalog0           -      -      0      0     66  4.01K
    gpt/datalog1           -      -      0      0     66  4.01K
cache                      -      -      -      -      -      -
  gpt/datacache0           0  12.0G      0      0  1.28K    122
  gpt/datacache1           0  12.0G      0      0  1.28K    122
---------------------  -----  -----  -----  -----  -----  -----
freezer                2.73G  54.3G      0      0    872  97.1K
  gpt/freezer          2.73G  54.3G      0      0    872  97.1K
cache                      -      -      -      -      -      -
  gpt/bootcache0        701M  1.31G      0      0    556  19.8K
  gpt/bootcache1        689M  1.32G      0      0    536  20.5K
---------------------  -----  -----  -----  -----  -----  -----
icebox                  444K  55.5G      0      0    242  1.57K
  mirror-0              444K  55.5G      0      0    242  1.57K
    gpt/zfsmirror0         -      -      0      0    121    803
    gpt/zfsmirror1         -      -      0      0    121    803
---------------------  -----  -----  -----  -----  -----  -----
scratch                 576K  2.14T      0      0    607  16.2K
  gpt/scratch1             0  2.09T      0      0     42  2.66K
special                    -      -      -      -      -      -
  gpt/scratchspecial0   208K  23.5G      0      0     42  3.73K
  gpt/scratchspecial1   368K  23.5G      0      0     42  4.66K
logs                       -      -      -      -      -      -
  gpt/scratchlog0          0  5.50G      0      0     42  2.55K
  gpt/scratchlog1          0  5.50G      0      0    438  2.55K
cache                      -      -      -      -      -      -
  gpt/scratchcache0        0  12.0G      0      0    834     77
  gpt/scratchcache1       4K  12.0G      0      0    834     99
---------------------  -----  -----  -----  -----  -----  -----
root@freezer:~ # top|head -8
last pid: 91409;  load averages: 0.30, 0.32, 0.26  up 0+03:23:21  09:42:58
51 processes:  2 running, 47 sleeping, 2 waiting
CPU:  0.1% user,  0.0% nice,  0.3% system,  0.0% interrupt, 99.7% idle
Mem: 14M Active, 136M Inact, 1448M Wired, 266K Buf, 14G Free
ARC: 641M Total, 162M MFU, 353M MRU, 128K Anon, 14M Header, 91M Other
     445M Compressed, 1064M Uncompressed, 2.39:1 Ratio
Swap: 16G Total, 16G Free
So ZFS is smart enough not to occupy RAM as long as the zpools are empty.
 
For performance reasons I'm going to remove the mirror of the cache.
A zpool(8) cache vdev cannot be a ZFS mirror -- RTFM zpoolconcepts(7); it could sit on a geom(4) mirror though, IIUC. Please read the description of my setup carefully:
  • All L2ARC caches are striped over partitions on 2 NVMe SSDs.
[...] and log & use stripe.
  • The redundancy level of the intent log and special vdev should match that of the zpool(8); thus I use mirrored log and special vdevs for my mirrored zpool, and striped log and special vdevs for my striped scratch zpool.
  • Please do not repeat this bad advice to others, as it violates common sense and practice (*).
vfs.zfs.txg.timeout=5.
This seems to be already the default in FreeBSD 15 (I'm on -STABLE):

root@freezer:~ # sysctl vfs.zfs.txg.timeout
vfs.zfs.txg.timeout: 5
root@freezer:~ # fgrep vfs.zfs.txg.timeout /{boot{/defaults,}/loader,etc/sysctl}.conf
root@freezer:~ #

Now my plan is to stay with -STABLE on at least one of my laptops, and on the server I will go to 15.2-RELEASE, but keep a boot environment for -STABLE, so I can help with testing and bug reports.

ETA of the 2nd HDD is in 3 days, so I have enough time to polish the setup of my old laptop and save the data from my main laptop to the existing half of my data mirror. When the 2nd HDD arrives and joins the mirror, resilvering starts and my data will be safe. Second, I constantly try to "shoot" another 256 GB NVMe M.2 2280 SSD to have a spare. Since they can only be overwritten about 1000 times, these are consumables, and experience shows that they die suddenly, without any prior warning.
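The "about 1000 overwrites" figure can be turned into a rough endurance estimate. A back-of-the-envelope sketch; both the 1000 P/E cycles and the 20 GB/day write volume are assumptions, not datasheet values:

```shell
# Rough SSD endurance: capacity (GB) x P/E cycles / 1000 = total terabytes
# writable (TBW).  256 GB drive, ~1000 program/erase cycles assumed:
capacity_gb=256
pe_cycles=1000
tbw=$((capacity_gb * pe_cycles / 1000))
echo "${tbw} TBW"                       # -> 256 TBW

# At an assumed 20 GB written per day, expected lifetime in years:
awk -v tbw="$tbw" 'BEGIN { printf "%.0f years\n", tbw * 1000 / 20 / 365 }'
# -> 35 years
```

In practice wear leveling, write amplification, and heavy workloads (like a ZFS log vdev) can eat into this considerably, so keeping a cold spare is still sensible.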
 
I don't advise anybody anything; everyone must use their own "intelligence". But yes, about the mirror I used the wrong wording.
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
 
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
This is OK if the log and cache vdevs are on devices that are faster than SSD_A and SSD_B. But your log vdev has a lower redundancy level than the zpool, and that violates common practice, advice, and common sense, too. It might be OK though, if the services and data it houses are not "mission critical", i.e. you can easily handle the loss of a few minutes of data.
 
Here's my current setup:
Code:
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD                      530G   131G   399G        -         -     0%    24%  1.00x    ONLINE  -
  mirror-0               500G   117G   383G        -         -     0%  23.3%      -    ONLINE
    gpt/SSD_B            501G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_A            502G      -      -        -         -      -      -      -    ONLINE
special                     -      -      -        -         -      -      -      -         -
  mirror-1              29.5G  14.0G  15.5G        -         -    37%  47.6%      -    ONLINE
    gpt/SSD_A_SPECIAL     30G      -      -        -         -      -      -      -    ONLINE
    gpt/SSD_B_SPECIAL     34G      -      -        -         -      -      -      -    ONLINE
logs                        -      -      -        -         -      -      -      -         -
  gpt/SSD_LOG             32G    92K  31.5G        -         -     0%  0.00%      -    ONLINE
  gpt/SSD_log_stripe      32G    12K  31.5G        -         -     0%  0.00%      -    ONLINE
cache                       -      -      -        -         -      -      -      -         -
  gpt/SSD_CACHE           64G  16.9G  47.1G        -         -     0%  26.3%      -    ONLINE
  gpt/SSD_cache_stripe    64G  16.8G  47.2G        -         -     0%  26.3%      -    ONLINE
x@myfreebsd:/SSD/home/x $
The sizes of your support vdevs seem strange to me.

  1. I computed the sum of the sizes of all small files in my data:

    find / -type f -size 1 -print > files.1blk.list
    ...
    find / -type f -size 16 -print > files.16blk.list
    wc -l files.1blk.list (result/2000 = MB on disk of all 1-block files)
    ...
    wc -l files.16blk.list (result/125 = MB on disk of all 16-block (7.5-8 kB) files)

    The sum is ~0.4% for my data on the half-filled 1 TB SSD in my laptop. Of course this number highly depends on the type of data. Then I estimated a factor of 1.5 to account for filesystem metadata for these small files; plus, the special vdev also holds other metadata of its zpool, IIUC: I guess the parts that are accessed more often go into the special vdev, while less frequently accessed metadata stays on the slower device(s). Your special vdev's size is ~1/6 of the user-data zpool, and I guess this could be oversized. Of course it depends on how high you set special_small_blocks; I extrapolated from what I already have to the storage size of my NAS box and set:

    zfs set special_small_blocks=8K data
    zfs set special_small_blocks=8K scratch
  2. Your log vdevs are 64 GB combined. Remember that the intent log never needs to be larger than your RAM. I'd guess that 50% of RAM size is more than sufficient for common workloads.
  3. The combined size of your L2ARC cache is 128 GB.
    3.1. This will occupy up to 10% of its size in RAM, i.e. you'll have to account for ~13 GB of RAM just to manage the L2ARC, plus ZFS needs RAM for the ARC and other uses.
    3.2. Your L2ARC cache size is ~25% of the data it caches; IMHO a size in the single-digit percent range is usually fully sufficient for common workloads.
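The RAM overhead in point 3.1 is quick to check with arithmetic. A sketch; the 10% figure is the rule of thumb from the text (a worst case for small records), and the ~70 bytes of header RAM per cached record is an assumption that varies by OpenZFS version:

```shell
# Worst-case L2ARC indexing cost, using the ~10%-of-its-size rule of thumb
# for a 128 GB L2ARC:
awk 'BEGIN { printf "%.1f GB\n", 128 * 0.10 }'       # -> 12.8 GB

# With large records the real overhead is far lower.  Assuming ~70 bytes of
# header RAM per cached record and a 128K recordsize:
awk 'BEGIN { printf "%.0f MB\n", (128 * 2^30 / 2^17) * 70 / 2^20 }'
# -> 70 MB
```

So whether the "10%" rule bites depends almost entirely on the record sizes of the cached data; a pool full of tiny files sits near the worst case.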
 
AIUI you want your special vdevs to have exactly the same redundancy as your main pool, which would complicate adding one of those to our 8-way raidz3.
Assuming that your machine has 2 M.2 slots (or other connectors for fast NVRAM), you can use a mirror and always have a spare that you glue inside the case of the machine. You can compute the probability of failure, and maybe it's low enough for you.
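That failure probability can be sketched numerically. A simplified model assuming independent failures and a made-up 1% annual failure rate per device (substitute your drive's datasheet AFR); real-world risk is higher because the window that matters is the resilver time after the first failure, and failures are often correlated:

```shell
# Probability that BOTH sides of a 2-way mirror die within the same year,
# assuming independent failures with a 1% annual failure rate (AFR) each:
awk 'BEGIN {
    afr  = 0.01          # annual failure rate per device (assumption)
    both = afr * afr     # both fail independently in the same year
    printf "mirror loss per year: %.4f%%\n", both * 100
}'
# -> mirror loss per year: 0.0100%
```

Even with these optimistic assumptions, the point stands: a mirror turns a 1-in-100 annual risk into roughly 1-in-10,000, and a cold spare shortens the vulnerable rebuild window further.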
 
The high usage of my special devices drops down when the poudriere build finishes. I think build jails open a lot of little files.
Yes, IIUC it creates a fresh jail(8) for every port it builds, installs the build-dependency ports, runs make install, and deletes the build jail afterwards. That means a lot of stress for the underlying storage: many files are created and then deleted when the build finishes, i.e. many blocks are occupied and freed shortly after. This wears out the SSDs. We want to minimize the number of writes to the NVRAM, and maybe we can also speed up the builds:
  • IIRC poudriere uses thin jails; if not, tweak it to do that.
  • Use ccache and tweak the builds to reuse the commonly created intermediate files (mainly *.o). ccache(1) can also be used with distcc(1), so you can let other machines in your network help compile. If needed, limit their resources so that distcc does not hinder their normal operation.
    portfind -d ccache also shows that sccache claims to be usable with poudriere.
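For the ccache suggestion, the relevant knob is CCACHE_DIR in poudriere.conf, which is a documented poudriere option; the cache path and size below are examples, not recommendations:

```shell
# /usr/local/etc/poudriere.conf -- point poudriere at a shared ccache
# directory so object files survive across throwaway build jails:
CCACHE_DIR=/var/cache/ccache

# One-time setup on the host: create the directory and cap the cache size
# (20G is an example; CCACHE_DIR here is the standard ccache env variable):
mkdir -p /var/cache/ccache
env CCACHE_DIR=/var/cache/ccache ccache -M 20G
```

Because the cache lives outside the jails, repeated builds of the same port (or ports sharing sources) hit the cache instead of rewriting the same objects to the SSD.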
 