ZFS basic questions

Hi Everyone,

I've started to use ZFS instead of the venerable UFS and would like to clarify some points about vdevs and zpools. My understanding is that the primary purpose of vdevs is to allow the combination of several disks to be used as a single virtual disk, and the primary purpose of zpools is to allow a filesystem to be extended with new vdevs. A zpool is a collection of vdevs, and a vdev is a collection of disks.

Disks can be combined into a vdev according to the following vdev types:
  • striped: blocks of a file may be distributed across all disks with no redundancy
  • mirrored: each block of a file is present on each disk
  • raidz: like striped, but one disk is reserved for parity
  • raidz2: like striped, but two disks are reserved for parity
In contrast, the zpools are always striped. So blocks of a file may be distributed across all vdevs in the zpool with no redundancy.

I think, and I'm looking for confirmation here, that a vdev must consist of disks of identical size, and a zpool must consist of identical vdevs. Is that right?

For example, I may create a vdev out of four 1TB disks using raidz2, resulting in a total of 3TB storage. I then create a zpool on top of this vdev. If later I want to add new disks to this zpool, I can't extend the vdev, since those stay fixed after they are created. Instead, I have to create another vdev, but that also must hold four 1TB disks using raidz2. Once this new vdev is added to the zpool, the total storage space will be 6TB.

Is this correct? Or is there a way to add disks of different models/sizes to zpools or even vdevs?
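To make my scenario concrete, I imagine the commands would look roughly like this (device names are just placeholders, I haven't actually run this):
  # create a pool with a single raidz2 vdev of four disks
  zpool create tank raidz2 da0 da1 da2 da3
  # later, grow the pool by adding a second raidz2 vdev of four more disks
  zpool add tank raidz2 da4 da5 da6 da7
  zpool list tank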
 
Your 4x 1TB raidz2 will be 2TB, not 3.
I think you can add anything to the pool (single disks, mirrors, whatever) if you are happy with the resulting redundancy level.
Anyway, you can test it with files or mdconfig (create test devices with mdconfig and create pools out of them).
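For example (file names and sizes arbitrary):
  # create four 1 GB backing files and turn them into md devices
  truncate -s 1g /tmp/zt0 /tmp/zt1 /tmp/zt2 /tmp/zt3
  mdconfig -a -t vnode -f /tmp/zt0    # prints md0
  mdconfig -a -t vnode -f /tmp/zt1    # md1, and so on
  mdconfig -a -t vnode -f /tmp/zt2
  mdconfig -a -t vnode -f /tmp/zt3
  # build a throwaway raidz2 pool and play with it
  zpool create testpool raidz2 md0 md1 md2 md3
  zpool status testpool
  # clean up afterwards
  zpool destroy testpool
  mdconfig -d -u 0 && mdconfig -d -u 1 && mdconfig -d -u 2 && mdconfig -d -u 3
  rm /tmp/zt0 /tmp/zt1 /tmp/zt2 /tmp/zt3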
 
Welcome to the ZFS world!

Some of your questions:
"raidz2: like striped, but two disks are reserved for parity"
To be precise: it's the space of two disks that is used for parity; the parity information is distributed over all the disks in the vdev. RAIDZ1 is not the same as RAID-5, but you can say it is RAID-5-like.

"I may create a vdev out of four 1TB disks using raidz2, resulting in a total of 3TB storage."
Using raidz2 with four 1TB disks gets you a total of 4TB of raw disk space; 2TB is used for parity and 2TB can be used for your data.

You can use different sized disks in one vdev, but only the smallest disk's worth of space is used from each disk.
You can combine different sized vdevs in a single pool, but then you will have an imbalance in storage and in (evenly distributed) performance.

"So blocks of a file may be distributed across all vdevs in the zpool with no redundancy."
You get redundancy (with mirrors or RAIDZ) from the vdevs in the pool, not from the pool as such.

Some links I'd like to suggest, especially from a beginner's perspective:

Ad #2: The article as a whole has a certain angle. Some comments, especially on expansions: ZFS in the trenches | BSD Now 123 (as the writer mentions at Addressing some feedback).
 
My go-to answer is:
Pick up a copy of FreeBSD Mastery: ZFS and FreeBSD Mastery: Advanced ZFS by Michael W Lucas and Allan Jude. For me, they're the easiest-to-understand books about ZFS I've read.

Everything below is my understanding and may not be 100% correct.
Identical sizes: not really. Ideally you want them to be, but ZFS sizes a vdev to its smallest device. If you create a mirror with a 1TB and a 2TB disk, ZFS says it's "a mirror sized 1TB".
This is an old article, but still makes sense to me:

Adding to a zpool: I think you can always add new disks to a striped pool, but that only increases the size, adds no redundancy, and the pool breaks if a disk is removed or fails. This is a typical mistake when folks have a single disk that they are trying to mirror, so read up on the difference between add and attach.
I think the add/attach distinction also comes into play when adding disks/vdevs to other redundancy types.
I just looked it up in my references; they all say "You can't add to a RAID-Z VDEV". That means you can't add a disk to your existing RAIDZ2 vdev.
I think you could add a mirror vdev instead, but ideally you want enough disks to create a matching RAIDZ2 vdev and add that to the pool. That would give you a pool striped across two RAIDZ2 vdevs.
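As I understand it, the difference in commands is roughly this (pool and device names made up):
  # attach: add a device to an existing disk/mirror, gaining redundancy
  zpool attach tank da0 da1               # da0 becomes a mirror of da0+da1
  # add: append a new top-level vdev, striped with the existing ones
  zpool add tank da2                      # no redundancy gained; may need -f if it
                                          # mismatches the pool's existing redundancy
  zpool add tank raidz2 da3 da4 da5 da6   # add a whole new raidz2 vdev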

One thing about the sizes of the devices: ZFS will expand.
Say you have a mirror with two 500GB drives. You replace one drive with a 1TB drive and ZFS resilvers; then you replace the other 500GB drive with a 1TB drive and ZFS resilvers again. Now you have a mirror of 1TB drives.
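In commands, that would be something like the following (device names made up; I believe the extra space only shows up automatically with autoexpand on, otherwise zpool online -e after the fact):
  zpool set autoexpand=on tank
  zpool replace tank ada0 ada2    # swap the first 500GB drive for a 1TB one, wait for resilver
  zpool replace tank ada1 ada3    # then the second one, wait again
  zpool list tank                 # capacity should now reflect the 1TB drives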
 
All the "disks" in a VDEV have to be the same size. If they are not exactly the same size, you can force the use of the smallest common denominator (and waste the extra space).

You cannot add more "disks" to a VDEV once it is created (except you can convert a single "disk" to a mirror).

The only way to expand VDEV is to replace all the "disks" with larger ones (one at a time, because you have to resilver).

A zpool is composed of one or more VDEVs. I'm not certain here, but I don't believe that the VDEVs have to be the same size or construction (but they should be; otherwise silly things are possible). You can add more VDEVs to a zpool after it is created, but the resulting striping may be sub-optimal.

ZFS can use as storage media anything that is a FreeBSD GEOM storage provider, which is what I really meant by "disk" above. So you may compose your VDEVs from:
  • a raw disk;
  • a disk partition;
  • file-backed storage;
  • a GEOM provider, e.g. GELI, concat, mirror, multipath, memory disk, ...; see geom(8).
Not all GEOM classes may make sense in context of ZFS, but some are very handy, e.g.
  • GELI has traditionally been used for encryption (ZFS now also has native encryption);
  • as covacat mentioned, mdconfig(8) is really handy for quickly crafting small test VDEVs;
  • gconcat(8) can be used to aggregate small disks into a single, larger, "disk".
GEOM classes can be stacked on top of each other.
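As a small illustration of the stacking, a GELI-on-partition setup might look roughly like this (partition names hypothetical):
  # encrypt two partitions and use the resulting .eli providers as mirror members
  geli init -s 4096 /dev/ada0p3
  geli init -s 4096 /dev/ada1p3
  geli attach /dev/ada0p3             # creates /dev/ada0p3.eli
  geli attach /dev/ada1p3
  zpool create securetank mirror ada0p3.eli ada1p3.eli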
 
All the "disks" in a VDEV have to be the same size. If they are not exactly the same size, you can force the use of the smallest common denominator (and waste the extra space).

You cannot add more "disks" to a VDEV once it is created (except you can convert a single "disk" to a mirror).
I think the "use smallest" is the default behavior. At least from my experience.
Mirrors: one can add providers to a mirror. Take a mirror of two providers; you can attach a third provider. Why? Well, after resilvering is complete, you have the data stored across three physical devices. "So what?" Well, now detach one and you have an instant backup that you can move to a different machine.
Or, if the third device you attach is bigger: after it resilvers, detach one of the original two smaller ones and attach another new, bigger one. After resilvering you have a 3-way mirror with two big devices and one small. Now detach the small one and you are back to a mirror of two devices, but now they are bigger.
That's how you upgrade the devices to larger sizes without losing redundancy. I've done that exact thing a couple of times. I wish I had the right hardware for all the devices to be in hot-swappable trays; it would have been a snap.
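The command sequence is roughly (device names made up; wait for each resilver to finish before the next step):
  # start with a mirror of two small drives, ada0 + ada1
  zpool attach tank ada0 ada2    # attach the first big drive
  zpool detach tank ada0         # drop one small drive
  zpool attach tank ada1 ada3    # attach the second big drive
  zpool detach tank ada1         # drop the other; the mirror is now ada2+ada3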
 
I think the "use smallest" is the default behavior. At least from my experience.
As my 8-year-old 3 TB spindles failed, I have been replacing them with 4 TB spindles.

I have just remade the tank (5 spindles @ raidz1 to 7 spindles @ raidz2).

The new pool creation failed specifically because the disk sizes were mis-matched. I had to force it.
 
The new pool creation failed specifically because the disk sizes were mis-matched. I had to force it.
Interesting. Did you try going from the raidz1 to raidz2 as part of it? I was basically upgrading a mirror and leaving it a mirror so did not change the configuration of the zpool.
What version of OS did you do this on? Mine was done on pre-FreeBSD 12, so if you were going across OS versions maybe that had corner cases.
Anyway, "my personal experience" trumps all theory.
 
One thing about the sizes of the devices: ZFS will expand.
Say you have a mirror with two 500GB drives. You replace one drive with a 1TB drive and ZFS resilvers; then you replace the other 500GB drive with a 1TB drive and ZFS resilvers again. Now you have a mirror of 1TB drives.
Additionally, if you have mixed drives, say 2x1TB and 2x2TB, you can make a zpool of 4x1TB using all four drives (e.g. a 3TB raidz), and create a mirror of 2x1TB in the "wasted" space on the 2x2TB drives. That gives you redundancy across 4TB of usable space, rather than only the 3TB provided by 2x1TB mirrored plus 2x2TB mirrored. Some may see that as risky, others as being frugal and maximising the use of resources :)
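A rough sketch of that layout, assuming the 1TB drives are ada0/ada1 and the 2TB drives are ada2/ada3 (all names and sizes illustrative):
  # the 2TB drives get two ~1TB partitions each; the 1TB drives get one
  gpart create -s gpt ada2
  gpart add -t freebsd-zfs -s 931g ada2    # first ~1TB slice
  gpart add -t freebsd-zfs ada2            # the rest of the disk
  # (repeat for ada3; give ada0 and ada1 a single freebsd-zfs partition each)
  zpool create tank raidz1 ada0p1 ada1p1 ada2p1 ada3p1    # ~3TB usable
  zpool create scratch mirror ada2p2 ada3p2               # ~1TB usable in the leftover space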
 
Dave01's suggestion is excellent, and it is exactly what I do at home: I have a pair of drives, 3TB and 4TB, and I use that for one 3TB mirror pair, plus one 1TB separate pool that's non-redundant (for scratch storage of backups).

My only warning about such setups is that you need to be very careful. For example, in dave01's example, you have six 1TB partitions. When setting those up, you need to make 100% sure that you assign the partitions to pools correctly. Creating a mirror pair out of two partitions that are on the same disk will be really bad: not only will it be unsafe from a durability point of view, it will also have lousy performance. Once again, I would recommend using gpart to name the partitions with human-readable names, and then using those names.
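For example, with hypothetical label names:
  # label each partition after the disk/bay it lives on, then build pools from the labels
  gpart add -t freebsd-zfs -l big-bay0 ada2
  gpart add -t freebsd-zfs -l big-bay1 ada3
  # the labels show up under /dev/gpt/ and are unambiguous
  zpool create scratch mirror gpt/big-bay0 gpt/big-bay1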

Why do I mention that? My colleagues and I did exactly that once at a customer site. On a $100M computer, systematically, for thousands of disks. We got in BIG trouble.
 
Interesting. Did you try going from the raidz1 to raidz2 as part of it? I was basically upgrading a mirror and leaving it a mirror so did not change the configuration of the zpool.
I purchased two new 12 TB WD Gold SATA disks. They are on product run-out sale (WD are re-branding their high-end disks to HGST). I connected them externally as a mirrored pool on USB (3.1 Gen 2 StarTech) SATA adapters. That allowed me to send the original internal RAIDZ1 tank to the USB mirror. The system ran for a couple of days like that (with a desk fan pointed at the WD Golds). I then configured an internal RAIDZ2 tank with two additional spindles (7 x 3 TB), and sent the tank back to the new RAIDZ2 pool. The new 12 TB disks have just been redeployed for an off-site backup rotation for snapshots of the tank.

My original plan was to vacate the tank to a pool (single VDEV) consisting of a GEOM concat of three USB attached spindles. But I couldn't resist the common sense suggestion made by VladiBG of getting the 12 TB disks. It allowed the tank to be vacated safely (redundancy preserved at all times), and the ZFS server off-site backup capacity sorted for years to come. It cost some dollars, but well spent, I think. [The ZFS server has on-line backups of everything else, so its off-site backups are critical.]
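The evacuation itself boils down to snapshot-based replication; the shape of it (pool and snapshot names illustrative, not the exact commands I typed) was roughly:
  # replicate the old RAIDZ1 tank onto the USB mirror pool
  zfs snapshot -r tank@evacuate
  zfs send -R tank@evacuate | zfs receive -F usbpool/tank
  # destroy and re-create the internal pool as a 7-disk RAIDZ2, then send everything back
  zfs snapshot -r usbpool/tank@return
  zfs send -R usbpool/tank@return | zfs receive -F tank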
What version of OS did you do this on?
It's FreeBSD 13.0-RELEASE, on the (standard) delayed patch cycle.
 
Additionally, if you have mixed drives, say 2x1TB and 2x2TB, you can make a zpool of 4x1TB using all four drives (e.g. a 3TB raidz), and create a mirror of 2x1TB in the "wasted" space on the 2x2TB drives. That gives you redundancy across 4TB of usable space, rather than only the 3TB provided by 2x1TB mirrored plus 2x2TB mirrored. Some may see that as risky, others as being frugal and maximising the use of resources :)
Or make a GEOM gconcat(8) of the 2x1TB spindles (1+1), and then combine it with the 2x2TB spindles to provision a VDEV of 6TB total, (1+1)+2+2 = 2+2+2, as a 6TB stripe, a 4TB RAIDZ1, or a 2TB triple mirror.
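A sketch of that (device names made up; the concat provider appears under /dev/concat/):
  # may need: kldload geom_concat
  gconcat label small0 /dev/ada0 /dev/ada1         # two 1TB spindles become one ~2TB provider
  # now there are three ~2TB providers: concat/small0, ada2, ada3
  zpool create tank raidz1 concat/small0 ada2 ada3       # ~4TB RAIDZ1
  # or: zpool create tank concat/small0 ada2 ada3          (6TB stripe)
  # or: zpool create tank mirror concat/small0 ada2 ada3   (2TB triple mirror)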
 


Nice.

More discussion of the Ars Technica article: <https://old.reddit.com/r/homelab/comments/o11j53/-/>
 