ZFS Choosing drives for ZFS Virtualization server

nerozero

Active Member

Reaction score: 18
Messages: 166

Hello,

Could you please recommend a reliable and fast disk for a server that should host 3-4 virtual machines which use large data volumes? ( e.g. a 2TB volume per VM: zfs create -sV2T -o volblocksize=4k -o volmode=dev zvmpool/zcloud-disk0 )

At the moment I found those quite promising:
  • Western Digital 12TB Ultrastar DC HC520, Datasheet
  • Seagate Exos X16 14TB (ST14000NM001G), Datasheet
At the moment I have issues such as snapshot sends being quite slow despite very high IOPS (send speed averaging 5-6MB/s at ~350-400 IOPS).
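For reference, the numbers above can be reproduced roughly like this (the snapshot name is just a placeholder; pv comes from sysutils/pv):

```shell
# Per-disk IOPS and bandwidth while the send is running
zpool iostat -v zvmpool 5

# Raw send throughput of one snapshot, discarded after measuring
zfs send zvmpool/zcloud-disk0@snap | pv > /dev/null
```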

Thanks in advance
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

If you need high IOPS, especially for Windows guests, which completely thrash their filesystems (particularly on boot), go for flash storage. Nothing else will work here.

If you have to use spinning disks, don't use a few big disks; use many small disks and combine them into many mirror vdevs. The more ZFS can spread the load over all disks, the better the pool will perform; mirror vdevs roughly add up the IOPS of their providers, so they are the fastest option here.
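For example, a pool of many small mirrors could be built like this (a sketch; pool and device names are placeholders, adjust to your system):

```shell
# Six small disks as three mirror vdevs, striped together by ZFS
zpool create zvmpool \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5

# Capacity and IOPS can later be grown by adding another mirror vdev
zpool add zvmpool mirror da6 da7
```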
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

If you want to run VMs you need some kind of block storage for them, which is zvols on zfs.


BTW: the statement in that article "snapshots are better than zvols" is completely bogus. It's like saying rsync is better than a disk; they have nothing to do with each other. The author seems to have some misunderstandings of ZFS terminology, so I'd take the statements made in that blog entry with a very big grain of salt, especially because zvol snapshots work just like any other snapshots on ZFS:

Code:
[root@vhost1 ~]# zfs list -r nvme-zones
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
nvme-zones                                              315G   545G    24K  /nvme-zones
nvme-zones/140b4070-5212-edc0-d133-ef581dd1f930-disk0  64.0G   545G  23.0G  -
nvme-zones/7abf5fdb-42fc-6897-fc58-a4b63e2e45ba-disk0   157G   545G   135G  -
nvme-zones/9f8f2d3e-5b25-c5e9-af53-cef3aeabb48a-disk0  32.4G   545G  17.0G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  61.4G   545G  53.1G  -

[root@vhost1 ~]# zfs list -rt snapshot nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0 | tail
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_06.25.00--5d  1.93M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_07.25.00--5d  2.38M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_08.25.00--5d  2.19M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_09.25.00--5d  2.19M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_10.25.00--5d  2.47M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_11.25.00--5d  2.03M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_12.25.00--5d  2.40M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_13.25.00--5d  2.64M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_14.25.00--5d  2.58M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_15.25.00--5d  2.12M      -  53.1G  -

The *-disk0 datasets are zvols backing KVM and bhyve VMs on SmartOS, of which we take a lot of snapshots (currently >200 in total for those 4 zvols). Snapshots of zvols also occupy only the changed blocks, not the whole size of the zvol.

The statement they made is true for Linux LVM snapshots, which are horribly inefficient and slow: while a snapshot exists, all changes are written into a separate copy-on-write area, which then has to be merged back into the original volume, so they need _A LOT_ of free space (and an eternity to finish). We used that crap for a few years, and it is pretty much useless in production for snapshot-based backups; I hope I never have to touch that stuff ever again...
 

usdmatt

Daemon

Reaction score: 602
Messages: 1,543

If you want to run VMs you need some kind of block storage for them, which is zvols on zfs.

You don't necessarily need zvols to run VMs. I have more bhyve machines using basic file images than zvols, as I just find it quicker and easier for small machines than having a separate dataset for every disk.

The issue they discuss seems to be related to zvols reserving their entire provisioned space by default. If this reservation were inherited by snapshots, each snapshot would require the pool to have that much space available. They appear to show this in an example, although I've no idea whether it is still the case, and I pretty much always create zvols as sparse, so I wouldn't see this issue either way.
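The default reservation behaviour is easy to check (pool and dataset names here are hypothetical):

```shell
# Thick zvol: refreservation equals the full volume size
zfs create -V 1T tank/thick-disk0
zfs get refreservation tank/thick-disk0

# Sparse zvol (-s): no reservation, space is allocated on write
zfs create -s -V 1T tank/sparse-disk0
zfs get refreservation tank/sparse-disk0

# An existing reservation can also be dropped afterwards
zfs set refreservation=none tank/thick-disk0
```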

Regarding LVM. Yes, I used this for guest image storage once and used snapshots each night to pull copies off. Glad I'm not trying to do that anymore.

For VM hosting, unless you need massive amounts of space, you just can't beat SSDs these days. The performance is an order of magnitude better than traditional disks. Even so, that throughput seems incredibly slow; I can fairly easily flatline gigabit Ethernet when sending ZFS snapshots, even off a basic pool of HDDs.
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

You don't necessarily need zvols to run VMs. I have more bhyve machines using basic file images than zvols as I just find it quicker and easier for small machines, rather than having a separate dataset for every disk.
Of course file-based images are always possible, but zvols have a bunch of benefits, e.g. using a different block size than the underlying pool (important for Windows guests) without performance loss, and they are far easier to handle IMHO, especially regarding snapshots for backup/cloning/migration purposes.
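For example (names are placeholders; note that volblocksize can only be set at creation time, not changed afterwards):

```shell
# A sparse zvol whose block size matches the guest filesystem,
# e.g. 4k for an NTFS guest formatted with 4k clusters
zfs create -s -V 100G -o volblocksize=4k -o volmode=dev tank/win-disk0

zfs get volblocksize tank/win-disk0
```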

The issue they discuss seems to be related to zvols reserving their entire provisioned space by default.
I have no idea how one comes up with such an idea: you should never, ever let a pool run more than ~80-85% full. Provisioning 100% of it with one zvol would render the pool completely useless, as ZFS couldn't even write any metadata. That's a supreme example of "shooting yourself in the foot".
 

gpw928

Aspiring Daemon

Reaction score: 241
Messages: 556

I recommend a close look at Backblaze Drive Stats for Q2 2021.

Backblaze has 8,359 of the ST14000NM001G drives, but they are only about 6 months old. The annualized failure rate is 1.58%.

They also have 8,400 of the WDC WUH721414ALE6L4 (Ultrastar DC HC530 14TB) of similar age. The annualized failure rate is 0.49%.
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

Thank you guys !

especially for windows guests who completely trash their filesystems
Thank god, no Windows; only BSD servers hosted so far. Thank you for the information about improving IOPS; unfortunately SSDs are not an option at the moment, too expensive...
Someone may find this visual illustration of ZFS pool layouts vs. performance and fault tolerance useful: Link

Thank you! I found this site quite informative!
 

usdmatt

Daemon

Reaction score: 602
Messages: 1,543

provisioning 100% with one zvol would render that pool completely useless as zfs couldn't even write any metadata

Sorry, what I actually mean is that a zvol reserves the amount of space you give it, not that someone creates a zvol the size of the entire pool. For example, if you create a 1TB volume on a 10TB pool, the zvol will reserve 1TB of space. You have to use the sparse option to avoid this (or remove the reservation afterwards).

They seem to suggest that a 1TB zvol that has reserved 1TB of space will want to reserve space for each snapshot, although I haven't bothered to clarify whether they mean each snapshot reserves another 1TB, or just enough for the data it actually references.

Either way, I'd never heard of this issue before this thread and couldn't replicate it, so either they were doing something wrong or it's an old problem that no longer exists.
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

They seem to suggest that a 1TB zvol that has reserved 1TB of space, will want to reserve space for each snapshot, although I haven't bothered to try and clarify whether they mean each snapshot will want to reserve another 1TB, or just enough for the actual data referenced.
The snapshots don't reserve any more space than they actually need.
As said: we have hundreds (or even thousands) of snapshots on that one machine alone, many of them for zvols of 100-500GB. If they reserved the space of their parent dataset/zvol, that host would need several hundred TB of free space... Only if you clone a snapshot and promote the clone will it inherit the reservations of the original dataset/zvol (as it is then a fully usable copy, no longer a mere snapshot).
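A minimal sketch of that clone/promote path (dataset and snapshot names are hypothetical):

```shell
# A clone starts out as a cheap, dependent copy of the snapshot
zfs clone tank/vm-disk0@backup tank/vm-disk0-copy

# Promoting it makes it an independent, fully usable dataset;
# only then do the original's reservations apply to it
zfs promote tank/vm-disk0-copy
```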
 

Argentum

Aspiring Daemon

Reaction score: 289
Messages: 608

Hello,

Could you please recommend a reliable and fast disk for a server that should host 3-4 virtual machines which use large data volumes? ( e.g. a 2TB volume per VM: zfs create -sV2T -o volblocksize=4k -o volmode=dev zvmpool/zcloud-disk0 )

At the moment I found those quite promising:
  • Western Digital 12TB Ultrastar DC HC520, Datasheet
  • Seagate Exos X16 14TB (ST14000NM001G), Datasheet
At the moment I have issues such as snapshot sends being quite slow despite very high IOPS (send speed averaging 5-6MB/s at ~350-400 IOPS).
To increase the speed of this storage system, I strongly advise using a (small) fast SSD (just a single drive will do) as an L2ARC cache in this configuration. Also prepare to mirror these HDDs.
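Attaching a single SSD as L2ARC is a one-liner (pool and device names are placeholders):

```shell
# Add an SSD as a cache (L2ARC) device to an existing pool
zpool add zvmpool cache nvd0

# The device then shows up under "cache" in the pool layout
zpool iostat -v zvmpool
```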
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

To increase the speed of this storage system I strongly advise to use (small) fast SSD (just a single drive will do) in this configuration as L2ARC cache. Prepare to mirror these HDD-s.
Yes, this is planned, but right now the primary focus is on the hard drives.
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

You can't go wrong with HGST.
To achieve higher IOPS, use more hard disks (more spindles) with less capacity each.
For example, for a total capacity of 12TB it's better to use 14 1TB disks in RAID6 than 2 12TB disks in RAID1.
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

You can't go wrong with HGST.
Yeah, I just now have 2 "new" dead HGST disks on my desk, both with dead controllers (no response on SATA at all, though the disk is spinning)...

Guys, have you experienced disk failures recently? I'm asking because since September 12th I've had 3 HDDs and 1 SSD die in my office... wondering whether this is a curse of some kind, bad karma, or whether I should blame solar activity...
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

If those dead HGST disks of yours are not under warranty, can you take off the controller board and take a picture of the contact pads that connect to the heads of the disk? I have had bad experiences with entry-level WD disks with oxidized contacts, which look like this picture:
(Note: this is a random example picture from the internet)

wd.jpg


When I clean those contacts with an eraser and alcohol, the hard disk starts working again for another year or two.
 

ralphbsz

Son of Beastie

Reaction score: 2,352
Messages: 3,241

You can't go wrong with HGST.
To achieve higher IOPS, use more hard disks (more spindles) with less capacity each.
For example, for a total capacity of 12TB it's better to use 14 1TB disks in RAID6 than 2 12TB disks in RAID1.
But note that you pay a high price in power consumption (most disks draw 8-10W, largely independent of whether they store 1TB or 12TB) and in reliability: with 14 disks, the probability that at least one disk fails is roughly 7 times higher than with only 2 disks (assuming similar per-disk reliability). That's particularly true if you buy older (used) 1TB disks, which may already have used up much of their useful life. So you need more redundancy when using many disks. In your example you did provide that: 14 disks in RAID-6 can handle 2 faults, unlike a 2-disk mirror, which can only handle 1. The math for estimating which version really is more reliable is difficult.

This is a very complex tradeoff, and many factors need to be considered, depending on environment.
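The first part of that tradeoff is easy to put into rough numbers. A back-of-the-envelope sketch, assuming independent failures and an illustrative 1.5% annual failure rate per disk (and ignoring rebuild windows entirely):

```shell
# Probability of at least one disk failing within a year,
# assuming independent failures with a 1.5% AFR per disk
awk 'BEGIN {
    p = 0.015
    printf "P(>=1 failure),  2 disks: %.4f\n", 1 - (1 - p)^2
    printf "P(>=1 failure), 14 disks: %.4f\n", 1 - (1 - p)^14
}'
```

With these numbers the 14-disk pool sees a failure 6-7 times as often as the 2-disk one, which is exactly why the extra parity of RAID-6 matters there; whether it fully compensates depends on rebuild times and the other factors above.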
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

No, I'm not talking about old used 1TB disks; it was just an example for comparing enterprise disks to midline 7200rpm disks.
Hard disks with >=160 IOPS (random 128K read) and >165 IOPS (50% write / 50% read) are marketed as enterprise disks. Those are SAS disks running at 10K or 15K RPM, in capacities of 300GB, 600GB, 900GB, 1.2TB, 1.8TB and 2.4TB, with a warranty of 3 or 5 years.
Hard disks with <=100 IOPS (random 128K read) and <=115 IOPS (50% write / 50% read) are marketed as midline disks. Those are SAS/SATA disks running at 7.2K RPM, in capacities from 2TB up to 18TB, with a warranty of 1 or 3 years.

If you want to compare their MSRP prices:
900GB, 12G SAS, 2.5'' Enterprise 15K rpm, 512n or 4kn - $765.99 (Seagate Savvio 15K ST900MP0146 / ST900MP0006, now known as Seagate Enterprise Performance 15K v6 or Exos 15E900, ~$288) specification and Exos 15E900 spec
1.2TB, 12G SAS, 2.5'' Enterprise 10K rpm, 512n - $492.99 (Seagate Savvio 10K, ~$350) specification
1TB, SAS, 2.5'' Midline, 7.2K rpm, 512n - $245.99 (Seagate ST1000NX0453 Exos 7E200, ~$200) specification
12TB, 12G SAS, 3.5'' Midline, 7.2K rpm, Helium, 512e - $1,202.99 (Ultrastar DC HC520 (He12), ~$500) specification

MSRP prices are taken from the HPE website.
 

fcorbelli

Active Member

Reaction score: 61
Messages: 189

I do not suggest HDDs for primary storage, only for internal backup.
For VMs, go with NVMe drives on a PCIe adapter board, in a simple mirror.
Samsung 980 Pro (not very good, but cheap).

For internal backup, 2 WD Gold 16TB (Hitachi rebranded, in fact), not in a mirror; backups go to either one.

For restore tests, 4x or 8x cheap SSDs like the Samsung 860/870 EVO.
 

fcorbelli

Active Member

Reaction score: 61
Messages: 189

Some more explanations (I have BSD servers all around the world):
VMs on HDDs are just fine (on a budget) if you do NOT have to make frequent snapshots and backups.
If the VMs do almost nothing all day, and you only make nightly backups, spinning drives can be used.
But in ALL other cases it is a big no-no.

Latency is so high, and bandwidth so low, that making a snapshot and backup of a large, in-use VM will take (almost) forever.
With such small numbers (3-4 VMs is really next to nothing, a SOHO setup) it is hard not to go for even cheap DC500 SSDs, mirrored (no fancy RAID-Z, go straight to a mirror/RAID-1).
Of course, the "normal" configuration is:

2x small SSDs (mirror) = OS
Xx NVMe (mirror on a PCIe card) = primary data
Various big HDDs = internal backup, no redundancy
Xx SSDs (stripe or RAID0) = internal space to extract the backups to be checked.


Because the most important question is:
HOW precious is the data inside the VMs?

Short version: if you really cannot afford even cheap SSDs, put no more than 1-2 VMs on each mirrored pair of cheap (not so big) HDDs (typically 4 free SATA ports = 2 mirrored volumes = 4 HDDs + 2 SSDs for the OS).
Do NOT, repeat, do NOT use a more complex configuration for such a small installation.
I do not suggest an L2ARC or ZIL cache: if it is kept simple, MAYBE it will work.

You need to think, in advance, about what you will do WHEN (not if, but when) something breaks, on both the hardware and the software side.

If you want something fast, buy cheap NVMe drives.

 