ZFS Choosing drives for ZFS Virtualization server

nerozero

Active Member

Reaction score: 18
Messages: 166

Hello,

Could you please recommend a reliable and fast disk for a server that should host 3-4 virtual machines which use large data volumes? ( e.g. a 2TB volume per VM: zfs create -sV2T -o volblocksize=4k -o volmode=dev zvmpool/zcloud-disk0 )

At the moment I found those quite promising:
  • Western Digital 12TB Ultrastar DC HC520, Datasheet
  • Seagate Exos X16 14TB (ST14000NM001G), Datasheet
At the moment I have issues such as snapshot sends being quite slow despite very high IOPS (send speed averaging 5-6MB/s at ~350-400 IOPS).
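For reference, the numbers above can be reproduced roughly like this (the snapshot name is just a placeholder; pv comes from sysutils/pv):

```shell
# Per-disk IOPS and bandwidth while the send is running
zpool iostat -v zvmpool 5

# Raw send throughput of one snapshot, discarded after measuring
zfs send zvmpool/zcloud-disk0@snap | pv > /dev/null
```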

Thanks in advance
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

If you need high IOPS, especially for Windows guests, which completely thrash their filesystems (particularly on boot), go for flash storage. Nothing else will work here.

If you have to use spinning disks, don't use a few big disks; use many small disks and combine them into many mirror vdevs. The more ZFS can spread the load over all disks, the better the pool will perform; mirror vdevs roughly add up the IOPS of their providers, so they are the fastest option here.
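For example, a pool of many small mirrors could be built like this (a sketch; pool and device names are placeholders, adjust to your system):

```shell
# Six small disks as three mirror vdevs, striped together by ZFS
zpool create zvmpool \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5

# Capacity and IOPS can later be grown by adding another mirror vdev
zpool add zvmpool mirror da6 da7
```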
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

If you want to run VMs you need some kind of block storage for them, which is zvols on zfs.


BTW: the statement in that article "snapshots are better than zvols" is completely bogus. It's like saying rsync is better than a disk; they have nothing to do with each other. The author seems to have some misunderstandings of ZFS terminology, so I'd take the statements made in that blog entry with a very big grain of salt, especially because zvol snapshots work just like any other snapshots on ZFS:

Code:
[root@vhost1 ~]# zfs list -r nvme-zones
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
nvme-zones                                              315G   545G    24K  /nvme-zones
nvme-zones/140b4070-5212-edc0-d133-ef581dd1f930-disk0  64.0G   545G  23.0G  -
nvme-zones/7abf5fdb-42fc-6897-fc58-a4b63e2e45ba-disk0   157G   545G   135G  -
nvme-zones/9f8f2d3e-5b25-c5e9-af53-cef3aeabb48a-disk0  32.4G   545G  17.0G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  61.4G   545G  53.1G  -

[root@vhost1 ~]# zfs list -rt snapshot nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0 | tail
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_06.25.00--5d  1.93M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_07.25.00--5d  2.38M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_08.25.00--5d  2.19M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_09.25.00--5d  2.19M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_10.25.00--5d  2.47M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_11.25.00--5d  2.03M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_12.25.00--5d  2.40M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_13.25.00--5d  2.64M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_14.25.00--5d  2.58M      -  53.1G  -
nvme-zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@2021-09-20_15.25.00--5d  2.12M      -  53.1G  -

The *-disk0 datasets are zvols backing KVM and bhyve VMs on SmartOS, of which we take a lot of snapshots (currently >200 in total for those 4 zvols). Snapshots of zvols also occupy only the changed blocks, not the whole size of the zvol.

The statement they made is true for Linux LVM snapshots, which are horribly inefficient and slow: while a snapshot exists, all changes are written into a separate copy-on-write area, which then has to be merged back into the original volume, so they need _A LOT_ of free space (and an eternity to finish). We used that crap for a few years, and it is pretty much useless in production for snapshot-based backups; I hope I never have to touch that stuff ever again...
 

usdmatt

Daemon

Reaction score: 602
Messages: 1,543

If you want to run VMs you need some kind of block storage for them, which is zvols on zfs.

You don't necessarily need zvols to run VMs. I have more bhyve machines using basic file images than zvols, as I just find it quicker and easier for small machines than having a separate dataset for every disk.

The issue they discuss seems to be related to zvols reserving their entire provisioned space by default. If this reservation were inherited by snapshots, each snapshot would require the pool to have that much space available. They appear to show this in an example, although I've no idea whether it is still the case, and I pretty much always create zvols as sparse, so I wouldn't see this issue either way.
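The default reservation behaviour is easy to check (pool and dataset names here are hypothetical):

```shell
# Thick zvol: refreservation equals the full volume size
zfs create -V 1T tank/thick-disk0
zfs get refreservation tank/thick-disk0

# Sparse zvol (-s): no reservation, space is allocated on write
zfs create -s -V 1T tank/sparse-disk0
zfs get refreservation tank/sparse-disk0

# An existing reservation can also be dropped afterwards
zfs set refreservation=none tank/thick-disk0
```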

Regarding LVM. Yes, I used this for guest image storage once and used snapshots each night to pull copies off. Glad I'm not trying to do that anymore.

For VM hosting, unless you need massive amounts of space, you just can't beat SSDs these days. The performance is an order of magnitude better than traditional disks. Even so, that throughput seems incredibly slow; I can fairly easily flatline gigabit Ethernet when sending ZFS snapshots, even off a basic pool of HDDs.
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

You don't necessarily need zvols to run VMs. I have more bhyve machines using basic file images than zvols as I just find it quicker and easier for small machines, rather than having a separate dataset for every disk.
Of course file-based images are always possible, but zvols have a bunch of benefits, e.g. using a different block size than the underlying pool (important for Windows guests) without performance loss, and they are far easier to handle IMHO, especially regarding snapshots for backup/cloning/migration purposes.
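For example (names are placeholders; note that volblocksize can only be set at creation time, not changed afterwards):

```shell
# A sparse zvol whose block size matches the guest filesystem,
# e.g. 4k for an NTFS guest formatted with 4k clusters
zfs create -s -V 100G -o volblocksize=4k -o volmode=dev tank/win-disk0

zfs get volblocksize tank/win-disk0
```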

The issue they discuss seems to be related to zvols reserving their entire provisioned space by default.
I have no idea how one comes up with such an idea: you should never, ever let a pool run more than ~80-85% full. Provisioning 100% of it with one zvol would render the pool completely useless, as ZFS couldn't even write any metadata. That's a supreme example of "shooting yourself in the foot".
 

gpw928

Aspiring Daemon

Reaction score: 241
Messages: 556

I recommend a close look at Backblaze Drive Stats for Q2 2021.

Backblaze has 8,359 of the ST14000NM001G drives, but they are only about 6 months old. The annualized failure rate is 1.58%.

They also have 8,400 of the WDC WUH721414ALE6L4 (Ultrastar DC HC530 14TB) of similar age. The annualized failure rate is 0.49%.
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

Thank you guys !

especially for windows guests who completely trash their filesystems
Thank god, no Windows; only BSD servers hosted so far. Thank you for the information about improving IOPS; unfortunately SSDs are not an option at the moment, too expensive...
Someone may find this visual illustration of ZFS pool layouts vs. performance and fault tolerance useful: Link

Thank you! I found this site quite informative!
 

usdmatt

Daemon

Reaction score: 602
Messages: 1,543

provisioning 100% with one zvol would render that pool completely useless as zfs couldn't even write any metadata

Sorry, what I actually mean is that a zvol reserves the amount of space you give it, not that someone creates a zvol the size of the entire pool. For example, if you create a 1TB volume on a 10TB pool, the zvol will reserve 1TB of space. You have to use the sparse option to avoid this (or remove the reservation afterwards).

They seem to suggest that a 1TB zvol that has reserved 1TB of space will want to reserve space for each snapshot, although I haven't bothered to clarify whether they mean each snapshot reserves another 1TB, or just enough for the data it actually references.

Either way, I'd never heard of this issue before this thread and couldn't replicate it, so either they were doing something wrong or it's an old problem that no longer exists.
 

sko

Aspiring Daemon

Reaction score: 403
Messages: 708

They seem to suggest that a 1TB zvol that has reserved 1TB of space, will want to reserve space for each snapshot, although I haven't bothered to try and clarify whether they mean each snapshot will want to reserve another 1TB, or just enough for the actual data referenced.
The snapshots don't reserve any more space than they actually need.
As said: we have hundreds (or even thousands) of snapshots on that one machine alone, many of them for zvols of 100-500GB. If they reserved the space of their parent dataset/zvol, that host would need several hundred TB of free space... Only if you clone a snapshot and promote the clone will it inherit the reservations of the original dataset/zvol (as it is then a fully usable copy, no longer a mere snapshot).
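A minimal sketch of that clone/promote path (dataset and snapshot names are hypothetical):

```shell
# A clone starts out as a cheap, dependent copy of the snapshot
zfs clone tank/vm-disk0@backup tank/vm-disk0-copy

# Promoting it makes it an independent, fully usable dataset;
# only then do the original's reservations apply to it
zfs promote tank/vm-disk0-copy
```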
 

Argentum

Aspiring Daemon

Reaction score: 289
Messages: 608

Hello,

Could you please recommend a reliable and fast disk for a server that should host 3-4 virtual machines which use large data volumes? ( e.g. a 2TB volume per VM: zfs create -sV2T -o volblocksize=4k -o volmode=dev zvmpool/zcloud-disk0 )

At the moment I found those quite promising:
  • Western Digital 12TB Ultrastar DC HC520, Datasheet
  • Seagate Exos X16 14TB (ST14000NM001G), Datasheet
At the moment I have issues such as snapshot sends being quite slow despite very high IOPS (send speed averaging 5-6MB/s at ~350-400 IOPS).
To increase the speed of this storage system, I strongly advise using a (small) fast SSD (just a single drive will do) as an L2ARC cache in this configuration. Also prepare to mirror these HDDs.
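Attaching a single SSD as L2ARC is a one-liner (pool and device names are placeholders):

```shell
# Add an SSD as a cache (L2ARC) device to an existing pool
zpool add zvmpool cache nvd0

# The device then shows up under "cache" in the pool layout
zpool iostat -v zvmpool
```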
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

To increase the speed of this storage system I strongly advise to use (small) fast SSD (just a single drive will do) in this configuration as L2ARC cache. Prepare to mirror these HDD-s.
Yes, this is planned, but right now the primary focus is on the hard drives.
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

You can't go wrong with HGST.
To achieve higher IOPS, use more hard disks (more spindles) with less capacity each.
For example, for a total capacity of 12TB it's better to use 14 1TB disks in RAID6 than 2 12TB disks in RAID1.
 
OP
nerozero

nerozero

Active Member

Reaction score: 18
Messages: 166

You can't go wrong with HGST.
Yeah, I just now have 2 "new" dead HGST disks on my desk, both with dead controllers (no response on SATA at all, though the disk is spinning)...

Guys, have you experienced disk failures recently? I'm asking because since September 12th I've had 3 HDDs and 1 SSD die in my office... wondering whether this is a curse of some kind, bad karma, or whether I should blame solar activity...
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

If those dead HGST disks of yours are not under warranty, can you take off the controller board and take a picture of the contact pads that connect to the heads of the disk? I have had bad experiences with entry-level WD disks with oxidized contacts, which look like this picture:
(Note: this is a random example picture from the internet)

wd.jpg


When I clean those contacts with an eraser and alcohol, the hard disk starts working again for another year or two.
 

ralphbsz

Son of Beastie

Reaction score: 2,352
Messages: 3,241

You can't go wrong with HGST.
To achieve higher IOPS, use more hard disks (more spindles) with less capacity each.
For example, for a total capacity of 12TB it's better to use 14 1TB disks in RAID6 than 2 12TB disks in RAID1.
But note that you pay a high price in power consumption (most disks draw 8-10W, largely independent of whether they store 1TB or 12TB) and in reliability: with 14 disks, the probability that at least one disk fails is roughly 7 times higher than with only 2 disks (assuming similar per-disk reliability). That's particularly true if you buy older (used) 1TB disks, which may already have used up much of their useful life. So you need more redundancy when using many disks. In your example you did provide that: 14 disks in RAID-6 can handle 2 faults, unlike a 2-disk mirror, which can only handle 1. The math for estimating which version really is more reliable is difficult.

This is a very complex tradeoff, and many factors need to be considered, depending on environment.
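The first part of that tradeoff is easy to put into rough numbers. A back-of-the-envelope sketch, assuming independent failures and an illustrative 1.5% annual failure rate per disk (and ignoring rebuild windows entirely):

```shell
# Probability of at least one disk failing within a year,
# assuming independent failures with a 1.5% AFR per disk
awk 'BEGIN {
    p = 0.015
    printf "P(>=1 failure),  2 disks: %.4f\n", 1 - (1 - p)^2
    printf "P(>=1 failure), 14 disks: %.4f\n", 1 - (1 - p)^14
}'
```

With these numbers the 14-disk pool sees a failure 6-7 times as often as the 2-disk one, which is exactly why the extra parity of RAID-6 matters there; whether it fully compensates depends on rebuild times and the other factors above.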
 

VladiBG

Daemon

Reaction score: 554
Messages: 1,201

No, I'm not talking about old used 1TB disks; it was just an example for comparing enterprise disks to midline 7200rpm disks.
Hard disks with >=160 IOPS (random 128K read) and >165 IOPS (50% write / 50% read) are marketed as enterprise disks. Those are SAS disks running at 10K or 15K RPM, in capacities of 300GB, 600GB, 900GB, 1.2TB, 1.8TB and 2.4TB, with a warranty of 3 or 5 years.
Hard disks with <=100 IOPS (random 128K read) and <=115 IOPS (50% write / 50% read) are marketed as midline disks. Those are SAS/SATA disks running at 7.2K RPM, in capacities from 2TB up to 18TB, with a warranty of 1 or 3 years.

If you want to compare their MSRP prices:
900GB, 12G SAS, 2.5'' Enterprise 15K rpm, 512n or 4kn - $765.99 (Seagate Savvio 15K ST900MP0146 / ST900MP0006, now known as Seagate Enterprise Performance 15K v6 or Exos 15E900, ~$288) specification and Exos 15E900 spec
1.2TB, 12G SAS, 2.5'' Enterprise 10K rpm, 512n - $492.99 (Seagate Savvio 10K, ~$350) specification
1TB, SAS, 2.5'' Midline, 7.2K rpm, 512n - $245.99 (Seagate ST1000NX0453 Exos 7E200, ~$200) specification
12TB, 12G SAS, 3.5'' Midline, 7.2K rpm, Helium, 512e - $1,202.99 (Ultrastar DC HC520 (He12), ~$500) specification

MSRP prices are taken from the HPE website.
 

fcorbelli

Active Member

Reaction score: 61
Messages: 189

I do not suggest HDDs for primary storage, only for internal backup.
For VMs, go with NVMe drives on a PCIe adapter board, in a simple mirror.
Samsung 980 Pro (not very good, but cheap).

For internal backup, 2 WD Gold 16TB (Hitachi rebranded, in fact), not in a mirror; backups go to either one.

For restore tests, 4x or 8x cheap SSDs like the Samsung 860/870 EVO.
 

fcorbelli

Active Member

Reaction score: 61
Messages: 189

Some more explanations (I have BSD servers all around the world):
VMs on HDDs are just fine (on a budget) if you do NOT have to make frequent snapshots and backups.
If the VMs do almost nothing all day, and you only make nightly backups, spinning drives can be used.
But in ALL other cases it is a big no-no.

Latency is so high, and bandwidth so low, that making a snapshot and backup of a large, in-use VM will take (almost) forever.
With such small numbers (3-4 VMs is really next to nothing, a SOHO setup) it is hard not to go for even cheap DC500 SSDs, mirrored (no fancy RAID-Z, go straight to a mirror/RAID-1).
Of course, the "normal" configuration is:

2x small SSDs (mirror) = OS
Xx NVMe (mirror on a PCIe card) = primary data
Various big HDDs = internal backup, no redundancy
Xx SSDs (stripe or RAID0) = internal space to extract the backups to be checked.


Because the most important question is:
HOW precious is the data inside the VMs?

Short version: if you really cannot afford even cheap SSDs, put no more than 1-2 VMs on each mirrored pair of cheap (not so big) HDDs (typically 4 free SATA ports = 2 mirrored volumes = 4 HDDs + 2 SSDs for the OS).
Do NOT, repeat, do NOT use a more complex configuration for such a small installation.
I do not suggest an L2ARC or ZIL cache: if it is kept simple, MAYBE it will work.

You need to think, in advance, about what you will do WHEN (not if, but when) something breaks, on both the hardware and the software side.

If you want something fast, buy cheap NVMe drives.

 