> If it were my system, I would get a pair of quality NVMe SSDs as large as I could afford. Consider heatsinks at the design stage.

I bought 2 poor-to-medium-quality NVMe drives, used, 256 GB each. The board has PCIe 3.0 x4 lanes IIRC, and NVMe 1.3, so plugging in hyper-dyper-throw-your-money-at-me NVMe drives just doesn't make sense. These are already much faster than the rotating SATA HDD, and that's all I want; and since reads can be done in parallel, even the cheap "slow" ones will saturate the PCIe bus. I have to keep an eye on performance per money.
> Mirror them using ZFS. But see the caveat below about copy-on-write file systems -- you may wish to use a GEOM mirror for the whole disk, and have multiple types of file systems.

Why should I want a GEOM mirror? I would use that for swap, but ATM I think I don't need to mirror swap, because that box is not "mission critical": if it crashes, it crashes and I lose some fresh data, let's say the last 15 minutes. OK, so be it, I don't care.
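Mirroring with ZFS, as suggested, is a one-liner. A minimal sketch, assuming the NVMe drives carry GPT partitions labelled NVME_A_ZFS and NVME_B_ZFS (labels are my invention, not from this thread):

```shell
# Create a pool whose single vdev is a mirror of the two NVMe
# partitions; ZFS can then self-heal from either side on checksum
# errors, which a plain GEOM mirror cannot do.
zpool create fastpool mirror gpt/NVME_A_ZFS gpt/NVME_B_ZFS
zpool status fastpool
```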
> I would also install swap here, but keep watch and move it to a SATA mirror if the swap gets too intensive.

If the swap gets too extensive I can double the RAM from 2 x 8 to 2 x 16 GB. For the time being, I think 16 GB is pretty much more than I need, with sufficient headroom.
> [...] A SLOG never needs to be larger than main memory.

THIS is the kind of information I need. I didn't find this anywhere.
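A back-of-the-envelope check of why that holds, assuming worst-case sync-write ingest at 10 GbE line rate (~1250 MB/s) and the OpenZFS default transaction-group interval of 5 seconds; both numbers are assumptions, not from this thread:

```shell
# A SLOG only has to buffer the sync writes of the transaction
# groups currently in flight, so roughly 2 txg intervals of the
# fastest incoming stream bound its useful size.
link_mb_s=1250                            # 10 GbE, roughly
txg_seconds=5                             # OpenZFS default
slog_mb=$((link_mb_s * txg_seconds * 2))
echo "useful SLOG size: about ${slog_mb} MB"
```

Even this aggressively sized SLOG stays well under the 16 GB of main memory here.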
> I would always consider putting a special VDEV on the SSDs. This is essentially a cache for the file system metadata on the SATA disks. They are a challenge to size correctly.

Why do I have to worry about the correct sizing? Can't I use ZVOLs for these (zpool cache, log, special and dedup vdevs)?
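One way to estimate how large a special vdev would need to be is to ask zdb for the block statistics of an existing pool; the pool name tank below is a placeholder:

```shell
# zdb -bb walks the pool and prints a table of space usage broken
# down by block type; summing the rows other than "ZFS plain file"
# data gives a rough idea of the metadata footprint a special vdev
# would have to hold.
zdb -bb tank
```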
> So put it adjacent to the swap space, and move the swap to SATA if more space is required. Plan for this when you partition the disks.

The NVMes are big enough to hold 2 x 32 GB swap, i.e. 2 x max RAM. I seldom needed more than this, and that was when a program went completely crazy; more swap would only increase the ETA of the crash. I'm 100% NOT going to swap to the SATA disks.
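The partition plan could look like the following sketch, with swap placed adjacent to the ZFS partition so the swap space can later be handed back without reshuffling everything. Device name and labels are assumptions:

```shell
# One of the two 256 GB NVMe drives; repeat with B labels on nda1.
gpart create -s gpt nda0
gpart add -t freebsd-swap -s 32g -l NVME_A_SWAP nda0   # 1 x max RAM
gpart add -t freebsd-zfs        -l NVME_A_ZFS  nda0    # the rest
```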
> I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems. And you don't want VMs swapping madly to any underlying SSD.

OK, thanks, I'll keep that in mind and read up on why this is so.
> I would never build a NAS without 100% redundancy. I would mirror the SATA disks... and migrate storage between SSD and SATA as needs dictate.

But I don't need redundancy for all the data. Some is "scratch" data, so why should I mirror it? If it's gone it's gone; I can download or create it again.
> L2ARC requires memory for its allocation tables; roughly 100 MB per 1 GB stored in L2ARC as a rule of thumb. So L2ARC is always the wrong tool against memory pressure on such low-end systems, as you will worsen the problem.

OK. So my naive plan to have 2 x 64 = 128 GB of L2ARC cache (50/50 for the mirror and the scratch) is nonsense, because this would need 2 x 6.4 = 12.8 GB of RAM, and I have only 16 GB total. So either I cut down the size of the cache drastically or skip it completely.
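The arithmetic behind that conclusion, using the 100 MB-per-GB rule of thumb quoted above (the real ratio depends on recordsize):

```shell
# L2ARC needs ARC headers in main memory for every block it holds;
# at ~100 MB of headers per 1 GB of L2ARC, a 128 GB cache device
# eats most of a 16 GB machine.
l2arc_gb=128
overhead_mb=$((l2arc_gb * 100))
echo "header overhead: ${overhead_mb} MB"    # 12800 MB = 12.8 GB
```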
> Please post "zpool list -v". We are here now in the swamp.

Alan, I'm still waiting for the 1st 6 TB HDD to arrive, and ATM I have the 1st NVMe but the 2nd comes tomorrow, or Thursday, or Friday, OK? And I still have to "shoot" a 2nd 6 TB SATA HDD, which is not so easy because some nerds are paying extraordinarily high prices for used hardware. Can you believe that they even pay 5 €/TB for a DEFECTIVE HDD?! Yes, they do! That's CRAZY! I have time. I bid, and when others bid more, so be it, I'll bid in the next auction. And no SMR please, and no crap like Barracuda or "Green" drives. ATM I have two old 256 GB SATA SSDs in the box, just for testing. zpool list -v would show something like: 158 GB DATA, 158 GB SCRATCH. The box is switched off, so I can't produce real command output, OK? Cheers.
> Why should I want a GEOM mirror? I would use that for swap, but ATM I think I don't need to mirror swap, because that box is not "mission critical": if it crashes, it crashes and I lose some fresh data, let's say the last 15 minutes. OK, so be it, I don't care.

You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirrored UFS file system. See my comments re COW client file systems provisioned on COW server file systems above.
> Why do I have to worry about the correct sizing [of a special VDEV]? Can't I use ZVOLs for these (zpool cache, log, special and dedup vdevs)?

ZFS terminology can get intense. I don't think that you mean ZVOL, which is a specialized ZFS dataset that presents as a raw block device (virtual disk) rather than a file system. As an example, I use ZVOLs to provision iSCSI storage from my ZFS server to my KVM server for running Windows VMs.
special - - - - - - - - -
mirror-6 31.5G 4.20G 27.3G - 1G 56% 13.3% - ONLINE
gpt/SSD_A_SPECIAL 33G - - - - - - - ONLINE
gpt/SSD_B_SPECIAL 34G - - - - - - - ONLINE
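What "presents as a raw block device" means in practice can be shown in two commands; the pool and dataset names are invented for illustration:

```shell
# A ZVOL is created with -V and a fixed (but later resizable)
# volume size; it appears under /dev/zvol/ instead of being
# mounted, ready to export via iSCSI or attach to a VM.
zfs create -V 100g tank/winvm
ls -l /dev/zvol/tank/winvm
```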
> You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirrored UFS file system. See my comments re COW client file systems provisioned on COW server file systems above.

What is the disadvantage of providing a ZVOL to a non-COW-fs VM? I'd like to have it all on ZFS if at all possible, so that zpool(8) and geom(4) do not interfere. Unfortunately we should not swap to a ZVOL, although this is explicitly handled in one of the rc/service(8) scripts in /etc/rc.d.
> ZFS terminology can get intense. I don't think that you mean ZVOL, which is a specialized ZFS dataset that presents as a raw block device (virtual disk) rather than a file system.

Oh, I think my understanding was OK, but maybe the wording was not 100% accurate.
> VDEVs may be either for storage or support.

It's all "storage" and "data". Metadata (the support vdev) also needs to land on a physical storage device eventually, and it's a kind of data itself... Maybe the most accurate terms are userdata and metadata.
zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
FOURTERRA 4.55T 4.40T 149G - - 2% 96% 1.00x ONLINE -
sdf1 4.55T 4.40T 149G - - 2% 96.8% - ONLINE
KEEP 104G 1.31M 104G - 858G 0% 0% 1.00x ONLINE -
mirror-0 72.5G 112K 72.5G - 858G 0% 0.00% - ONLINE
sdd 73G - - - 858G - - - ONLINE
sdc 73G - - - 880G - - - ONLINE
special - - - - - - - - -
mirror-1 31.5G 1.20M 31.5G - 900G 0% 0.00% - ONLINE
sdd 32G - - - 900G - - - ONLINE
sdc 32G - - - 922G - - - ONLINE
logs - - - - - - - - -
sdd 32G 0 31.5G - 900G 0% 0.00% - ONLINE
cache - - - - - - - - -
sdc 64G 0 64.0G - - 0% 0.00% - ONLINE
SSD 530G 211G 319G - 430G 0% 39% 1.00x ONLINE /mnt/SSD
indirect-0 - - - - - - - - ONLINE
indirect-1 - - - - - - - - ONLINE
mirror-2 498G 207G 291G - 430G 0% 41.5% - ONLINE
sdd 500G - - - 430G - - - ONLINE
sdc 501G - - - 452G - - - ONLINE
indirect-3 - - - - - - - - ONLINE
indirect-4 - - - - - - - - ONLINE
indirect-7 - - - - - - - - ONLINE
special - - - - - - - - -
mirror-6 31.5G 4.21G 27.3G - 900G 56% 13.4% - ONLINE
sdd 33G - - - 898G - - - ONLINE
sdc 34G - - - 920G - - - ONLINE
TREETERRA 2.53T 211G 2.33T - 2.91T 0% 8% 1.00x ONLINE -
sde 2.54T 211G 2.33T - 2.91T 0% 8.14% - ONLINE
logs - - - - - - - - -
sdc 32G 0 31.5G - 922G 0% 0.00% - ONLINE
cache - - - - - - - - -
sdc 64G 1.62G 62.4G - - 0% 2.53% - ONLINE
I tried to be as precise as possible, but here it is again:

> one mirrored for DATA and one striped for SCRATCH

Good thinking.

> L2ARC size

In memory or on disk?
root@freebsd:~ # diskinfo -vciStw /dev/ada0
/dev/ada0
512 # sectorsize
6001175126016 # mediasize in bytes (5.5T)
11721045168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
11628021 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
HGST HUS726060ALE610 # Disk descr.
K1HBXDRB # Disk ident.
ahcich0 # Attachment
No # TRIM/UNMAP support
7200 # Rotation rate in RPM
Not_Zoned # Zone Mode
I/O command overhead:
time to read 10MB block 0.049873 sec = 0.002 msec/sector
time to read 20480 sectors 5.130729 sec = 0.251 msec/sector
calculated command overhead = 0.248 msec/sector
Seek times:
Full stroke: 250 iter in 4.847092 sec = 19.388 msec
Half stroke: 250 iter in 3.573786 sec = 14.295 msec
Quarter stroke: 500 iter in 3.478386 sec = 6.957 msec
Short forward: 400 iter in 1.040928 sec = 2.602 msec
Short backward: 400 iter in 2.685817 sec = 6.715 msec
Seq outer: 2048 iter in 0.078152 sec = 0.038 msec
Seq inner: 2048 iter in 0.522972 sec = 0.255 msec
Transfer rates:
outside: 102400 kbytes in 0.437139 sec = 234250 kbytes/sec
middle: 102400 kbytes in 0.510745 sec = 200491 kbytes/sec
inside: 102400 kbytes in 0.920451 sec = 111250 kbytes/sec
Asynchronous random reads:
sectorsize: 938 ops in 3.491822 sec = 269 IOPS
4 kbytes: 791 ops in 3.647868 sec = 217 IOPS
32 kbytes: 735 ops in 3.783458 sec = 194 IOPS
128 kbytes: 685 ops in 3.778035 sec = 181 IOPS
1024 kbytes: 411 ops in 4.413781 sec = 93 IOPS
Synchronous random writes:
0.5 kbytes: 18713.5 usec/IO = 0.0 Mbytes/s
1 kbytes: 19467.2 usec/IO = 0.1 Mbytes/s
2 kbytes: 21087.1 usec/IO = 0.1 Mbytes/s
4 kbytes: 13919.8 usec/IO = 0.3 Mbytes/s
8 kbytes: 14450.4 usec/IO = 0.5 Mbytes/s
16 kbytes: 14185.1 usec/IO = 1.1 Mbytes/s
32 kbytes: 13914.2 usec/IO = 2.2 Mbytes/s
64 kbytes: 14132.5 usec/IO = 4.4 Mbytes/s
128 kbytes: 14946.7 usec/IO = 8.4 Mbytes/s
256 kbytes: 15920.7 usec/IO = 15.7 Mbytes/s
512 kbytes: 18183.0 usec/IO = 27.5 Mbytes/s
1024 kbytes: 22316.1 usec/IO = 44.8 Mbytes/s
2048 kbytes: 31420.9 usec/IO = 63.7 Mbytes/s
4096 kbytes: 43258.7 usec/IO = 92.5 Mbytes/s
8192 kbytes: 68586.7 usec/IO = 116.6 Mbytes/s
root@freebsd:~ # diskinfo -vciStw /dev/ada1
/dev/ada1
512 # sectorsize
256060514304 # mediasize in bytes (238G)
500118192 # mediasize in sectors
0 # stripesize
0 # stripeoffset
496149 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
SanDisk SD8TB8U256G1001 # Disk descr.
171344425156 # Disk ident.
ahcich1 # Attachment
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
Not_Zoned # Zone Mode
I/O command overhead:
time to read 10MB block 0.035386 sec = 0.002 msec/sector
time to read 20480 sectors 4.328172 sec = 0.211 msec/sector
calculated command overhead = 0.210 msec/sector
Seek times:
Full stroke: 250 iter in 0.041690 sec = 0.167 msec
Half stroke: 250 iter in 0.064857 sec = 0.259 msec
Quarter stroke: 500 iter in 0.094409 sec = 0.189 msec
Short forward: 400 iter in 0.063015 sec = 0.158 msec
Short backward: 400 iter in 0.044831 sec = 0.112 msec
Seq outer: 2048 iter in 0.093662 sec = 0.046 msec
Seq inner: 2048 iter in 0.411172 sec = 0.201 msec
Transfer rates:
outside: 102400 kbytes in 0.273888 sec = 373875 kbytes/sec
middle: 102400 kbytes in 0.256306 sec = 399522 kbytes/sec
inside: 102400 kbytes in 0.292901 sec = 349606 kbytes/sec
Asynchronous random reads:
sectorsize: 140210 ops in 3.002973 sec = 46690 IOPS
4 kbytes: 220256 ops in 3.001690 sec = 73377 IOPS
32 kbytes: 46323 ops in 3.008458 sec = 15398 IOPS
128 kbytes: 12068 ops in 3.032022 sec = 3980 IOPS
1024 kbytes: 1494 ops in 3.278153 sec = 456 IOPS
Synchronous random writes:
0.5 kbytes: 1144.1 usec/IO = 0.4 Mbytes/s
1 kbytes: 1087.5 usec/IO = 0.9 Mbytes/s
2 kbytes: 1121.6 usec/IO = 1.7 Mbytes/s
4 kbytes: 766.2 usec/IO = 5.1 Mbytes/s
8 kbytes: 761.8 usec/IO = 10.3 Mbytes/s
16 kbytes: 1107.0 usec/IO = 14.1 Mbytes/s
32 kbytes: 1016.7 usec/IO = 30.7 Mbytes/s
64 kbytes: 1449.3 usec/IO = 43.1 Mbytes/s
128 kbytes: 1560.8 usec/IO = 80.1 Mbytes/s
256 kbytes: 2240.4 usec/IO = 111.6 Mbytes/s
512 kbytes: 3912.4 usec/IO = 127.8 Mbytes/s
1024 kbytes: 6466.4 usec/IO = 154.6 Mbytes/s
2048 kbytes: 11962.9 usec/IO = 167.2 Mbytes/s
4096 kbytes: 23648.5 usec/IO = 169.1 Mbytes/s
8192 kbytes: 47014.8 usec/IO = 170.2 Mbytes/s
diskinfo: /dev/nvme0: ioctl(DIOCGMEDIASIZE) failed, probably not a disk.
root@freebsd:~ # diskinfo -vciStw /dev/nda0
/dev/nda0
512 # sectorsize
256060514304 # mediasize in bytes (238G)
500118192 # mediasize in sectors
0 # stripesize
0 # stripeoffset
SK hynix BC511 HFM256GDJTNI-82A0A # Disk descr.
CY04N08281060530A # Disk ident.
nvme0 # Attachment
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
I/O command overhead:
time to read 10MB block 0.016807 sec = 0.001 msec/sector
time to read 20480 sectors 2.627173 sec = 0.128 msec/sector
calculated command overhead = 0.127 msec/sector
Seek times:
Full stroke: 250 iter in 0.027117 sec = 0.108 msec
Half stroke: 250 iter in 0.031419 sec = 0.126 msec
Quarter stroke: 500 iter in 0.053075 sec = 0.106 msec
Short forward: 400 iter in 0.035027 sec = 0.088 msec
Short backward: 400 iter in 0.042218 sec = 0.106 msec
Seq outer: 2048 iter in 0.268218 sec = 0.131 msec
Seq inner: 2048 iter in 0.203938 sec = 0.100 msec
Transfer rates:
outside: 102400 kbytes in 0.146975 sec = 696717 kbytes/sec
middle: 102400 kbytes in 0.146425 sec = 699334 kbytes/sec
inside: 102400 kbytes in 0.145927 sec = 701721 kbytes/sec
Asynchronous random reads:
sectorsize: 533723 ops in 3.000711 sec = 177866 IOPS
4 kbytes: 536045 ops in 3.000729 sec = 178638 IOPS
32 kbytes: 80728 ops in 3.004943 sec = 26865 IOPS
128 kbytes: 20242 ops in 3.017207 sec = 6709 IOPS
1024 kbytes: 2645 ops in 3.149376 sec = 840 IOPS
Synchronous random writes:
0.5 kbytes: 2575.5 usec/IO = 0.2 Mbytes/s
1 kbytes: 2470.7 usec/IO = 0.4 Mbytes/s
2 kbytes: 594.1 usec/IO = 3.3 Mbytes/s
4 kbytes: 493.1 usec/IO = 7.9 Mbytes/s
8 kbytes: 508.1 usec/IO = 15.4 Mbytes/s
16 kbytes: 525.0 usec/IO = 29.8 Mbytes/s
32 kbytes: 551.7 usec/IO = 56.6 Mbytes/s
64 kbytes: 527.0 usec/IO = 118.6 Mbytes/s
128 kbytes: 655.5 usec/IO = 190.7 Mbytes/s
256 kbytes: 823.8 usec/IO = 303.5 Mbytes/s
512 kbytes: 1184.0 usec/IO = 422.3 Mbytes/s
1024 kbytes: 1863.2 usec/IO = 536.7 Mbytes/s
2048 kbytes: 3222.2 usec/IO = 620.7 Mbytes/s
4096 kbytes: 5842.9 usec/IO = 684.6 Mbytes/s
8192 kbytes: 11272.3 usec/IO = 709.7 Mbytes/s
> [...] I would also install swap here, but keep watch and move it to a SATA mirror if the swap gets too intensive.

Reasoning: we want to minimize the amount of writes to the NV RAM, because it wears out, and consumer-grade NV RAM -- which is what I have, even worse, already used -- wears out very quickly. So I'll have 2 extra swap partitions on the rotating devices; whether these are mirrored or not is another topic. ATM my decision is that I don't need the safety of a mirrored swap; this can easily be changed if I flip my decision.
> [...] They are a challenge to size correctly. So put it adjacent to the swap space, and move the swap to SATA if more space is required. Plan for this when you partition the disks.

OK, when I had my private 14-CPU Sun E4500 machine, which burned 2 kW of electric power and blew you away because it sounded like a 737 starting up beside you, I did all this. But I had ~20+ disks in the 2 arrays, so I never cared about partitions or disks, because it was easy and natural to just use whole disks for ANYTHING. But times have changed, and on the single-vdev zpools on my laptops it was nonsense to care about these support vdevs. That's why I never needed to know about the preliminaries.
> I would try to place my important VMs on the SSDs. But beware, you don't want both the hypervisor and the VM client using copy-on-write file systems.

OK, I have a dim memory that I understood the caveats of a COW fs housing another COW fs a long time ago, but I can't reconstruct that reasoning. Would you be so kind as to give me some keywords, links, or a short outline of the reasons?
> Now what I wanted to do is to use a ZVOL from a zpool on the faster devices as a special, cache and log vdev for a zpool on the slower devices, because I want to avoid using partitions, because they are not flexible: the size is fixed, and changing it requires copying all data to another device and copying it back after resizing. In contrast, ZVOLs can be resized easily; that's the LVM side of ZFS.

Interesting idea.
Why should that not be possible? Are there any disadvantages to using a ZVOL as a special, cache or log vdev for another zpool(8) on different (and significantly slower) devices?
mjolnir: Why would I want a geom mirror?
> You might want to provision storage to a VM which is not ZFS on the server side, e.g. using a mirrored UFS file system. See my comments re COW client file systems provisioned on COW server file systems above.

You didn't say WHY in your comments above.
> [...] Expanding VDEVs is a lot easier these days than it used to be, but you still need to size them, [...]

I understand that my OS wants a fixed-size swap device, and I give it one. I understand that for some use cases (DB stuff and VMs) it will be better to use a GEOM device with no ZFS between GEOM and the physical storage device, so I'll create two GEOMs: a mirror and a stripe on the NVMe SSDs (besides the zpools). But I refuse to burn in fixed sizes for vdevs where everyone (e.g. you) tells me, and I read everywhere, that it's hard to guess in advance how much will be needed. This is the most natural use case that you would want to throw at an LVM, isn't it?
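The two GEOM devices described above could be created like this; the partition labels are placeholders:

```shell
# Redundant gmirror for VM/DB storage, non-redundant gstripe for
# scratch; both sit directly on NVMe partitions with no ZFS layer.
gmirror label -v gm0 /dev/gpt/NVME_A_VM  /dev/gpt/NVME_B_VM
gstripe label -v gs0 /dev/gpt/NVME_A_SCR /dev/gpt/NVME_B_SCR
newfs -U /dev/mirror/gm0     # UFS with soft updates on the mirror
newfs -U /dev/stripe/gs0
```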
> OK, I have a dim memory that I understood the caveats of a COW fs housing another COW fs a long time ago, but I can't reconstruct that reasoning. Would you be so kind as to give me some keywords, links, or a short outline of the reasons?

Google: VM client COW file system on VM server COW file system amplifies rewrites
Performance Penalty: This recursive behavior can cause excessive I/O overhead and flash endurance issues, with some studies showing up to 29.5x write amplification and 71% performance degradation in high-write workloads.
> Google: VM client COW file system on VM server COW file system amplifies rewrites

WRITE AMPLIFICATION! Ya, ya, ya, now I remember! But only the term... the words...
> You didn't say WHY in your comments above.

I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW-on-COW. ZFS is COW. If your client is COW, then you may want UFS on the server side, and you may want a mirror for redundancy on the server. [You can still have ZFS, and all its advantages, on the client side.]
> I see that you have posted above asking about the viability of using ZVOLs for VDEVs. I'm not certain of the answer. ZFS capabilities change all the time. Maybe it's possible. I suggest you plug in a thumb drive and try it.

THANKS a lot! FYI, the following seems to be some kind of personal diary/log, so I'll delete it from the forum and put it where it belongs: the preliminary mirror (256 GB SATA SSD + 6 TB SATA HDD, what a funny combination) on my shiny "new" server.
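The "just try it" experiment needs nothing but a scratch pool; every name below is hypothetical, and this belongs on a test box only:

```shell
# Carve a small ZVOL out of the fast pool and see whether zpool
# will accept it as a cache vdev for another pool.
zfs create -V 1g fastpool/l2arc_test
zpool create testpool /dev/da9          # the thumb drive
zpool add testpool cache /dev/zvol/fastpool/l2arc_test
zpool status testpool
```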
> I believe that I did. I said "You might want to provision storage to a VM which is not ZFS on the server side". It's all about avoiding COW-on-COW.

Yes, exactly! Note that the point of interest is OUTSIDE your quotation marks.
> I understand that my OS wants a fixed-size swap device, and I give it one. [...] But I refuse to burn in fixed sizes for vdevs where everyone (e.g. you) tells me, and I read everywhere, that it's hard to guess in advance how much will be needed. This is the most natural use case that you would want to throw at an LVM, isn't it?

ZFS offers a lot of things that you don't get from LVM. Proximity to hardware using "support" (log, special, cache) VDEVs is one of them.
newfs -t. This provides immediate over-provisioning. But it also admits that I might later want to change my mind about the layout...
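For reference, the TRIM knobs mentioned: -t at newfs time, or tunefs on an existing file system (the device path is assumed):

```shell
# Enable TRIM so UFS tells the SSD which blocks are free again,
# which is what makes the over-provisioning effective.
newfs -U -t /dev/mirror/gm0
tunefs -t enable /dev/mirror/gm0     # or later, on an unmounted fs
```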