ZFS 4 drives coming - raid-z1, or what

I’ve got 4 3tb 7200rpm drives coming and the plethora of options for raid have me scratching my head in wonder and confusion. I can use y’all’s help...

I want to be able to lose a single drive and be ok. I also want the fastest possible write performance. I’m ok sacrificing read performance. Which is my best bet?

So far, I think raid-z2 or raid-10, but I’m not sure.

Thanks,

Will
 
First, let's assume you are interested in streaming writes.

RAID10 gives you decent capacity (2 disks' worth) and reasonable write performance: no parity calculation is necessary, though checksum calculation still is, so that's not a very big gain. For every byte you write from user space, you need to write 2 bytes to disk. It also gives you tolerance to any fault of a single disk or stripe, but a second fault while recovering the first disk is 50% likely to be fatal.

Compare that to RAID-Z1. It gives you 50% more capacity (3 disks' worth), but maybe you don't care. For write performance, you need to calculate parity (which may not be a big deal, given that the CPU already has to calculate checksums, and you may be flush with disk). The important thing is this: it may give you good write performance too, since for every byte written from user space, you only need to write 1.333 bytes to disk! So your throughput may be 1.5 times better than RAID10 at the disk level, which probably makes up for the parity calculation. The problem with RAID-Z1 is the fault tolerance: any fault while recovering the first disk, and you're dead.

For this reason, I would strongly consider RAID-Z2, which can handle any two disks failing. Capacity is like RAID10 (2 disks' worth). Write performance will be the worst of the lot: with 4 drives, every byte you write becomes 2 bytes on disk (data plus two parities across the stripe), so the on-disk amplification is the same as RAID10, with two parity calculations on top of it. On the other hand, it will handle way more disk failures.
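
For reference, the three layouts would be created along these lines (a sketch only; I'm assuming the four drives show up as ada1 through ada4 and a pool named tank, so adjust to your device names):

Code:
# striped mirrors ("RAID10"): two 2-way mirror vdevs
zpool create tank mirror ada1 ada2 mirror ada3 ada4
# single parity across all four drives
zpool create tank raidz1 ada1 ada2 ada3 ada4
# double parity across all four drives
zpool create tank raidz2 ada1 ada2 ada3 ada4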

Second, if you are doing small writes, in particular updates within existing files (in the style of a transactional database), the rules change. Then it's all about block sizes, and how parity updates are done.

I would actually benchmark your workload before making a decision. You could use a few dd commands as a workload generator, and see how much throughput you get, and compare RAID10 with RAID-Z1. But personally, I would give up some write throughput, and go for the higher reliability of RAID-Z2 instead; it seems unlikely to me that your need for throughput is so urgent that you can't afford the extra durability of your data.
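
If it helps, a crude dd-based streaming test could look like this (the dataset name is made up; note that with compression enabled, zeroes compress away to nothing, so either turn compression off for the test or use real data):

Code:
# streaming write: 16 GiB in 1 MiB records, fsync before reporting
dd if=/dev/zero of=/tank/test/bigfile bs=1m count=16384 conv=fsync
# streaming read: export/import the pool or reboot first so the ARC doesn't serve it from RAM
dd if=/tank/test/bigfile of=/dev/null bs=1m

Rebuild the pool in each layout, run the same commands, and compare the MB/s figures dd prints.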
 
If you want to boot off the drives, be careful what you choose for configuration.
Mirrors are typically fastest on reads because any device can satisfy a read; writes are not complete until all devices have been written.

ralphbsz has good suggestions; the best one is to benchmark your needs.
 
Unless you *really* have to (or for (very) big pools), always use mirrored vdevs. They are faster and MUCH more flexible and easier to expand/swap out than raidz.

As a rule of thumb: the more vdevs, the faster the pool - so especially with few drives, mirrors will almost always be the fastest option.

You should also consider that resilvering raidz takes _A LOT_ of time and imposes a much higher load on the system than resilvering a mirror - on slower CPUs this can stretch resilvering out to several days or even weeks for very big drives. Resilvering a mirror is usually limited only by disk and controller bandwidth and is therefore very fast.
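
Either way, when a drive does die the replacement procedure is the same; only the time it takes differs (pool and device names here are placeholders):

Code:
# put the new drive in the failed one's slot, then:
zpool replace tank ada2
# follow the resilver progress and the estimated completion time
zpool status -v tank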
 
Unless you *really* have to (or for (very) big pools), always use mirrored vdevs. They are faster and MUCH more flexible and easier to expand/swap out than raidz.
Except when two drives in the same mirror die, then the whole thing goes down. Something to keep in the back of your head.
 
/dev/gpt/zfs0 is created via labels on partitions, eg: "gpart add -t freebsd -a 1m -l zfs0 ada0"; then you can view your labels on your partitions of your disks via "gpart show -l" ... for details see the man page of gpart(8)
 
The answer depends on scenarios.

In general I always use mirrors (modern disks are typically large enough not to have the problem of too small volumes), both for HDDs and SSDs and NVMe (if it's simple, maybe it will work).

If the volume is only working space, then stripe all (raid0) on 4 disks.

Obviously, as suggested, the processed data must then be moved to more reliable storage (mirror).
 
/dev/gpt/zfs0 is created via labels on partitions, eg: "gpart add -t freebsd -a 1m -l zfs0 ada0"; then you can view your labels on your partitions of your disks via "gpart show -l" ... for details see the man page of gpart(8)
would not that be freebsd-zfs?
 
Except when two drives in the same mirror die, then the whole thing goes down. Something to keep in the back in your head.
same goes for raidz-1... also: no variant of mirrors or raidz is an excuse for having no backups.
 
Hmmm... so, I'm torn between striping over two mirrored pairs (RAID 10?) and RAID-Z1, which will give me another 3 TB of storage at the expense of a second fault during a rebuild being 100% problematic, vs 50% problematic with RAID 10. The discussion about partitioning makes me curious... I was planning on using all four drives bare, without setting up any partitions. Do I need partitions, or do they offer any performance advantage?
 
Lots of opinions about raw vs partitions, but all "3TB" drives may not have the same amount of space. I ran across a link a long time ago discussing this (I don't recall where it was or have it around), and it made sense to me at the time, so I've always put down partitions, even when they are just for data, sized nearly as big as possible: on a 3TB drive, make the partition 2.8TB. On SSDs there may be some benefit to using partitions and not using the whole disk; free space that the device can use for wear leveling.
Some of the benefits may be theoretical, some may be real, but I've done it that way for a long time and can't remember why.

Edit: this is what I read about doing partitions instead of raw devices. Again, it made sense to me, others may have different opinions.

 
would not that be freebsd-zfs?
You are correct. And though my example command was pretty brief, I hope you might infer that making GPT labels that make sense to you is best overall. Ideally labels that help you find the exact disk causing trouble when the time comes to know that information.

Plain old numbers like my simple example may or may not work toward this goal. In my personal PC, they work fine. I think of 0 as being the drive on top inside my tower, and 3 being the drive on bottom of my tower.
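
In case it's useful, the full per-disk sequence might look roughly like this (labels and device names are just examples; repeat for each drive with its own label, then build the pool from the labels):

Code:
gpart create -s gpt ada0
gpart add -t freebsd-zfs -a 1m -l zfs0 ada0
# ...same for ada1..ada3 with labels zfs1..zfs3, then for example:
zpool create tank mirror gpt/zfs0 gpt/zfs1 mirror gpt/zfs2 gpt/zfs3
gpart show -l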
 
same goes for raidz-1...
True. But with a, let's say, 8 drive RAID 10, 4 drives can die if they're the 'right' ones and everything will be fine. Or two 'bad' ones (in the same mirror) and the whole pool will go down. That's a risk that's often overlooked. With an 8 drive RAID-Z2 any two drives can fail and the data would still be accessible.

also: no variant of mirrors or raidz is an excuse for having no backups.
Completely agree. No amount of RAID is a substitute for good backups.

Do I need partitions or do they offer any performance advantage?
No performance advantage or disadvantage. There is an administrative advantage though, a labeled partition will make it easier to identify the disk and what's on it.
 
Except when two drives in the same mirror die, then the whole thing goes down. Something to keep in the back in your head.
Also considering that these are likely identical drives all bought at the same time it's very likely that you'll have more than one drive fail at around the same age. I would strongly consider RAIDZ2.

Yes, backups are nice, but you'll still have data loss unless the second drive decides to fail right after your backup finishes. In my experience, running a backup is exactly the kind of stress test that will cause an old drive to fail.
 
The discussion about partitioning makes me curious... I was planning on using all four drives bare, without setting up any partitions. Do I need partitions or do they offer any performance advantage?
Essentially it comes down to possible differences in the number of sectors available on disks that are nominally the same size.
A single sector's difference (new HDD firmware version, whatever the reason) is enough to cause problems. This is why, generally, you make partitions leaving a few MB of space at the tail, just to be safe, in case a replacement disk turns out to have a slightly smaller effective size.
No performance difference.
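
As a sketch of the "leave some slack" idea (the size here is only an example, not a recommendation):

Code:
# an explicit -s leaves the tail of a "3 TB" drive unused as replacement slack
gpart add -t freebsd-zfs -a 1m -s 2780g -l zfs0 ada0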

No amount of RAID is a substitute for good backups.
Good backup = verified backup.
Professional backup = at least 5 backups on different media, with different software, all verified.
Franco's backup = at least 7 backups on different media, different software, different hardware, and 3 different WANs, all verified.

For example, it is not trivial to restore files with paths longer than (about) 260 characters on Windows machines, unless you enable LongPathsEnabled (which is generally not recommended; programs often limit themselves to internal buffers of 255 characters).
Restoring filenames with non-ASCII (UTF) characters is not obvious either (for example, FreeBSD/Linux ZFS/ext4 are not 100% compatible).
In short, a "good" restore is quite different from a "copy-paste".
 
Good backup = verified backup.
Oh, yes. You wouldn't be the first that diligently backed up everything for many years only to find out when you actually needed those backups they're worthless. Don't test your backup strategy, test your restore procedures!
 
Yes, backups are nice, but you'll still have data loss unless the second drive decides to fail right after your backup finishes. In my experience, running a backup is exactly the kind of stress test that will cause an old drive to fail.
A backup is not really stressful, because it only reads the data (with last-access updates turned off).
Writing a backup, and extracting one, is much more stressful.
That's why, typically, you should keep multiple backups, stored forever (never throwing old data away), and use cheap consumer-grade SSDs in a stripe for the deduplicated restores that need to be checked: when they fail (daily scrub), just throw them away and replace them without mercy.
 
Oh, yes. You wouldn't be the first that diligently backed up everything for many years only to find out when you actually needed those backups they're worthless. Don't test your backup strategy, test your restore procedures!
This is the golden rule, to which I add: never delete a single copied byte.
The day you purge a 3-year-old backup will be the day you need a file that is 3 years and 1 day old.
Guaranteed
 
And I add: set up something that sends you reports in which anything strange stands out.
Code:
zpaqfranz v51.17-experimental journaling archiver, compiled Apr 25 2021
franz: -n 1000
Dir compare (7 dirs to be checked), ignoring .zfs and :$DATA
Creating 7 scan threads
04/05/2021 04:55:52 Scan dir |00| <</tank/condivisioni/>>
04/05/2021 04:55:52 Scan dir |01| <</temporaneo/dedup/1/condivisioni/>>
04/05/2021 04:55:52 Scan dir |02| <</temporaneo/dedup/2/tank/condivisioni/>>
04/05/2021 04:55:52 Scan dir |03| <</temporaneo/dedup/3/tank/condivisioni/>>
04/05/2021 04:55:52 Scan dir |04| <</monta/nas1_condivisioni/>>
04/05/2021 04:55:52 Scan dir |05| <</monta/nas2_condivisioni/>>
04/05/2021 04:55:52 Scan dir |06| <</copia1/backup1/sincronizzata/condivisioni/>>


Parallel scan ended in 448.644000
Free 0           458.568.938.496      427.08 GB    <</tank/condivisioni/>>
Free 1           345.316.097.024      321.60 GB    <</temporaneo/dedup/1/condivisioni/>>
Free 2           345.316.097.024      321.60 GB    <</temporaneo/dedup/2/tank/condivisioni/>>
Free 3           345.316.097.024      321.60 GB    <</temporaneo/dedup/3/tank/condivisioni/>>
Free 4         3.269.358.272.512        2.97 TB    <</monta/nas1_condivisioni/>>
Free 5         4.029.367.296.000        3.66 TB    <</monta/nas2_condivisioni/>>
Free 6         8.948.942.921.728        8.14 TB    <</copia1/backup1/sincronizzata/condivisioni/>>

=============================================
Dir 0            544.830.246.673      466.094 21.046 <</tank/condivisioni/>>
Dir 1            544.830.246.673      466.094 23.983 <</temporaneo/dedup/1/condivisioni/>>
Dir 2            544.830.246.673      466.094 24.322 <</temporaneo/dedup/2/tank/condivisioni/>>
Dir 3            544.830.246.673      466.094 22.822 <</temporaneo/dedup/3/tank/condivisioni/>>
Dir 4            544.831.201.401      466.125 251.026 <</monta/nas1_condivisioni/>>
Dir 5            544.830.613.627      466.104 287.066 <</monta/nas2_condivisioni/>>
Dir 6            544.827.548.665      466.092 448.645 <</copia1/backup1/sincronizzata/condivisioni/>>
=============================================
               3.813.810.350.385    3.262.697 452.844 sec (3.47 TB)

Dir 0 (master) time 21.046 <</tank/condivisioni/>>
size           544.830.246.673 (files 466.094)
-------------------------
= /temporaneo/dedup/1/condivisioni/ scantime 23.983000 
-------------------------
= /temporaneo/dedup/2/tank/condivisioni/ scantime 24.322000 
-------------------------
= /temporaneo/dedup/3/tank/condivisioni/ scantime 22.822000 
-------------------------
Dir 4 (slave) IS DIFFERENT time 251.026 <</monta/nas1_condivisioni/>>
size           544.831.201.401 (files 466.125)
excess  (not in 0) /tank/condivisioni/Utenti/XXX/CORRISPONDENZA CLIENTIXXXXXXX/
(...)
-------------------------
Dir 5 (slave) IS DIFFERENT time 287.066 <</monta/nas2_condivisioni/>>
size           544.830.613.627 (files 466.104)
excess  (not in 0) /tank/condivisioni/Utenti/XXX/CORRISPONDENZA CLIENTI/XXXXXXX/
(...)
-------------------------
Dir 6 (slave) IS DIFFERENT time 448.645 <</copia1/backup1/sincronizzata/condivisioni/>>
size           544.827.548.665 (files 466.092)
missing (not in 6) /tank/condivisioni/.cestino/ut-02/Utenti/XXX/CORRISPONDENZA CLIENTI/XXXXXreinsegnements.eml
missing (not in 6) /tank/condivisioni/Utenti/XXX/CORRISPONDENZA CLIENTI/XXXXETTRONICA/reinsegnements.eml
-------------------------
In this example: 6 backups, 3 good, 3 to be checked.
 
Having automated/tested backups in place is the only way to squeeze out peak performance without making regrettable choices. You also need to consider your tolerance to data loss. If you back up once a day, are you OK potentially losing data from one day (if things go south, like a power supply failing at the wrong point) in order to get more performance? Can you recreate the data if it is lost? (Compilation outputs or downloads, for example.)

That said, there are a few reasonably simple things you can do with ZFS to squeeze more write performance out. For any of these, be sure to understand the implications for your workload.

If you can tolerate it (almost certainly a bad choice for a mission-critical database), disabling sync on a dataset ( zfs set sync=disabled ... ) lets you hide the latency of bursty writes on any (raidz1/2, raid10) layout; I do this on /usr/obj, for example, as the contents can always be recreated if the power goes out. Having a UPS in place (and actively monitored with sysutils/apcupsd) can mitigate some [1] of the risk here.
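
For what it's worth, that's a one-liner per dataset (the dataset name here is just an example):

Code:
zfs set sync=disabled zroot/usr/obj
zfs get sync zroot/usr/obj    # verify; child datasets inherit the setting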

There's also the vfs.zfs.vdev.bio_flush_disable sysctl to consider. This stops asking the drives to flush data from their on-disk cache to the medium; so long as you never lose power, this should be OK. Note that your power supply failing counts as losing power even if you have a UPS. Did I mention backups?
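
If that trade-off is acceptable to you, it's an ordinary tunable; assuming it exists under that name on your ZFS version, something like:

Code:
sysctl vfs.zfs.vdev.bio_flush_disable=1
# to make it stick across reboots:
echo 'vfs.zfs.vdev.bio_flush_disable=1' >> /etc/sysctl.conf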

If you're mainly doing large writes to big files, bumping up the recordsize can also help (fewer calls up/down the chain per MB written), and it improves compression too. I'd put this in as a near-zero-risk tuning option.
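
For example, on a dataset that mostly holds large backup images (the name is hypothetical):

Code:
# affects newly written blocks only; existing files keep their old record size
zfs set recordsize=1M tank/backups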

Compression itself (typically with lz4 or zstd-fast) can make writes faster if you have compressible data. A faster CPU can come into play in this case, too.
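
Again a per-dataset property; lz4 is the safe default choice, and zstd is available with OpenZFS 2.0 (FreeBSD 13):

Code:
zfs set compression=lz4 tank/backups
# later, see how much it actually bought you
zfs get compressratio tank/backups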

As always, benchmarking your workload (as ralphbsz mentioned) is the best indicator as far as performance is concerned, and better than anyone's opinion (including my own) on a forum. Unfortunately, the performance of any filesystem changes (for the worse) as the storage fills up and fragments, so take your benchmark results on a fresh system as a "best case" and go from there.

[1] In the event of a kernel crash, you're certainly more exposed to an unexpected (in written user data; not the zpool/zfs filesystem) state with a dataset's sync disabled. While I'd say there's always a chance for corruption / unexpected state after a crash, databases in particular go through a lot of trouble to make sure it doesn't happen, and disabling sync undoes much of that protection. Did I mention backups?
 
...
If you can tolerate — almost certainly a bad choice for a mission-critical database — disabling sync ...
There's also the vfs.zfs.vdev.bio_flush_disable sysctl to consider. ...
Did I mention backups?
I really don't agree, unless we're talking about a really seasoned system engineer.

Any change that reduces storage reliability is risky.
Too risky.
Performance gains are typically tiny, hardly noticeable.
In the "SOHO world" database backups once a day may be enough.
Maybe.

In the "real" world, certainly not.
Imagine what would happen if Amazon lost a whole day's orders, or a forum like this a whole day's posts.

There are certain methods, master-slave replication, datatable replication, dump, and so on.
But really the last thing you want during an emergency is to have the slightest doubt about reliability.

If you have truly "disposable" data (e.g. a dynamic site cache), I suggest... using a ramdisk.

Normally, split things up: data to keep (mirror), large temporary data (stripe on cheap SSDs), small temporary data (ramdisk).

Remembering the most important thing of all, often underestimated.
The typical command is save, not memorize or write.

Short version: minimal improvement, high risk, a 0.5% difference can cost days of work and swearing
 
Uh... wow. I'm just a humble end user. I don't want to lose data, but it's not mission-critical stuff. Let me add some more info. Here's the deal: I have a Mac Pro w/12 Xeon cores and 32GB of RAM. The box has a 500GB PCIe SSD and four SATA slots. I bought four 3 TB 7200RPM enterprise drives and plan to run FreeBSD 13 w/OpenZFS with the SSD and the four additional drives on the machine (currently it runs FreeBSD 12.2 w/ZFS and a sloppy set of 3 mismatched drives). The four drives will be used for my other system's backup storage (think Time Machine, or rsyncs).

So far, based on the discussion, I've tentatively decided:
1. Partitioning is good - that way, when a drive bites it, I can replace it with any similar capacity drive and be up and running again quickly
2. Raid 10 is the best 4 drive raid for 1/2x write performance, 2x read performance, able to lose 1 drive from each mirror.
3. Raid Z1 is the best 4 drive raid for 1/3x write performance, 3x read performance, able to lose any 1 drive.
4. Raid Z2 is the best 4 drive raid for 1/4x write performance, 2x read performance, able to lose any 2 drives.
5. I'm not gonna mess with sysctls, nun uh, not gonna do it!

I'm not 100% sure I've gotten the write and read performance right, but I look forward to your corrections :).
 