ZFS Best Practice

Hi,

I set up a ZFS file server a number of years ago. I'm currently upgrading the capacity of this server. When I set it up, I had 4 x 2.5 TB hard drives set up in raidz1. I'm migrating to 4 x 8 TB hard drives. I wanted to know if raidz1 is still the best method of having redundancy, and whether there is any updated guidance on the best way to lay out FreeBSD with ZFS.

Thanks
 

With just 4 drives at 8 TB each I would probably go for a stripe of two mirrors instead. With 3+1 RAIDZ1 and the rebuild times for 8 TB drives I'd feel too nervous about another disk going bad at the same time... A bit less space (2*8 TB instead of 3*8 TB) but better speed.

Resilver rates are basically limited to the speed of one drive - and considering a typical 7200rpm 8TB drive will write around 170-200MB/s sustained (in reality - at first it looks much faster, but that's just data already in the ARC), at 50% pool usage it'll take around 8 * 1000 * 1000 MB / 170 MB/s * 0.50 ≈ 23,500 s, i.e. roughly 7 hours...
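
That estimate as a quick back-of-the-envelope calculation (the 170 MB/s sustained rate and 50% pool usage are just the assumptions above - plug in your own numbers):

Code:
# Rough resilver-time estimate: a resilver is limited to roughly the
# sustained write speed of the single replacement drive, and only the
# allocated part of the pool has to be rewritten.

drive_size_mb = 8 * 1000 * 1000   # 8 TB drive, in MB
write_speed_mb_s = 170            # sustained write speed of one 7200rpm drive (assumption)
pool_usage = 0.50                 # fraction of the pool actually allocated

seconds = drive_size_mb / write_speed_mb_s * pool_usage
print(f"best-case resilver time: ~{seconds / 3600:.1f} hours")
# -> best-case resilver time: ~6.5 hours  (call it 7, if everything goes well)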

If everything goes well, that is. I recently resilvered a 140TB RAID-Z2 array (2x(4+2) 10TB 7200rpm drives), and it took 4 days...
Well, actually 10 days, since it restarted when the server ran out of memory when it was almost done the first time - it turns out resilver pulls a lot of metadata (this server has/had something like 160M files & directories, 24,000 ZFS filesystems & 400 snapshots per filesystem) into the ARC cache, which grew past its limits and ... *boom*. Turns out 256GB of RAM wasn't enough with the kernel settings we used :)


Some lessons learned for us:

- Do _not_ set/modify kern.maxvnodes (the ZFS tuning guide is wrong here) - and especially do _not_ modify it after the system is up and running. It seems the arc_max limit isn't really respected then, for some strange reason, so the ARC will grow way past arc_max. Also - setting it to high numbers like 25000000 (25M) uses up a lot of RAM (like 11GiB) that could better be used as ARC cache instead.

- 600 Samba smbd processes use a lot of RAM (like 100-200MB per process).

- Resilver reads a lot of data into the ARC/memory

- Setting vfs.zfs.arc_max to 128GB is too much (with our Samba usage) - 96GB is better. (The default (90% of RAM) is way too high.)

- But then you'd better increase vfs.zfs.arc_meta_limit to 50GB (roughly 50% of arc_max) instead of the default (25% of arc_max) so more metadata fits into it - see the rough sizing sketch after this list.

- Buy more RAM. :)
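
A rough sizing sketch along those lines (the RAM, process count and per-process figures are just the numbers from this box - treat them as assumptions, not a recipe):

Code:
# Back-of-the-envelope ARC sizing for a Samba-heavy ZFS box.
# Idea: budget RAM for the OS and the smbd processes first, cap the ARC
# at what is comfortably left, and give metadata a bigger slice than the default.

total_ram_gb = 256        # RAM in the box
smbd_processes = 600      # peak number of Samba smbd processes
smbd_rss_gb = 0.2         # ~100-200 MB per smbd; use the upper end to be safe
os_headroom_gb = 16       # kernel, other daemons, slop (assumption)

samba_gb = smbd_processes * smbd_rss_gb               # ~120 GB for Samba alone
arc_max_gb = total_ram_gb - samba_gb - os_headroom_gb
arc_meta_gb = 0.5 * arc_max_gb                        # 50% of arc_max for metadata
                                                      # (the default is 25%)

# vfs.zfs.arc_max / vfs.zfs.arc_meta_limit are set in bytes on FreeBSD
print(f"vfs.zfs.arc_max        ~{arc_max_gb:.0f} GB ({int(arc_max_gb * 2**30)} bytes)")
print(f"vfs.zfs.arc_meta_limit ~{arc_meta_gb:.0f} GB ({int(arc_meta_gb * 2**30)} bytes)")
# We ended up even lower (96 GB) to leave extra headroom for resilver
# metadata, the vnode cache, etc.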
 
- Do _not_ set/modify kern.maxvnodes (the ZFS tuning guide is wrong here) - and especially do _not_ modify it after the system is up and running. It seems the arc_max limit isn't really respected then, for some strange reason, so the ARC will grow way past arc_max.

Ah yes. I think I wrote a word about that one recently. kern.maxvnodes is what used to be the inode cache, which is kept by the kernel. And as long as the kernel keeps these in its own cache, they are tagged as <referenced by some other service> in the ARC, and cannot be evicted.
So the ARC will run evictions, and try to honor arc_max, but cannot get rid of these blocks.
 
Hi, I think you need to consider the trade-off between resilience, capacity, and performance.

With RAIDZ2 you can lose any two disks and still be running. During resilvering, performance may be a (potentially very serious) problem.

However, none of the other options are as resilient.

In managed facilities (where I was not allowed to see or touch the hardware) I have experienced the results of an engineer pulling the wrong disk from a RAID5 set with one dead disk. RAID6 would have saved the situation.

My ZFS server is "too big" to back up in its entirety. Resilience is the primary requirement. It justifies RAIDZ2. I have RAIDZ1. That was a mistake (and hard to fix after the fact).
 
With today's disk drives, configuring a RAID system that can only tolerate a single fault is dangerous. As a matter of fact, the former CTO of NetApp called selling RAID-5 (what ZFS calls RAID-Z1) "professional malpractice". The reason is this: today's drives have gotten VERY large, for example 10TB, but the error rate of drives has not gotten better, for example 10^-14 hard read errors per bit. Now, a 10TB drive happens to have roughly 10^14 bits, so if you read a 10TB drive once, you expect roughly one bit error, meaning one unreadable sector, meaning one IO error.

Now, if you have a single-fault-tolerant RAID array and one disk fails (for example smoke comes out of the electronics board, or the spindle bearing seizes, or a head crashes hard), you have to read all the other disks in the array. The probability of finding at least one read error in doing this is very high (see the math above), so you will get data loss, meaning at least one place in the RAID array is unrecoverable, and recovery will not complete!
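
The arithmetic behind that, as a quick sketch (the 10^-14 rate is the usual consumer spec-sheet figure; enterprise drives are often quoted at 10^-15):

Code:
# Probability of hitting at least one unrecoverable read error (URE) while
# rebuilding a single-parity array: you must read every surviving disk end to end.

import math

ber = 1e-14              # unrecoverable read errors per bit read (spec-sheet figure)
disk_bits = 10e12 * 8    # one 10 TB disk = 8e13 bits
surviving_disks = 3      # e.g. a 4-disk RAID-Z1 after one failure

bits_read = surviving_disks * disk_bits
p_ure = 1 - math.exp(-ber * bits_read)   # Poisson approximation

print(f"expected UREs during rebuild: {ber * bits_read:.1f}")
print(f"P(at least one URE):          {p_ure:.0%}")
# -> ~2.4 expected errors, ~91% chance the rebuild hits at least one

(In ZFS's favor: a resilver only reads allocated blocks, and a URE costs you individual blocks/files rather than the whole pool - but the odds above are why single parity feels thin at these sizes.)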

If you want a reasonably reliable RAID system today, you need to guard against double faults. With four disks, I would do RAID-Z2, which gives you two disks' worth of capacity, but can handle the failure of any two disks (or more likely the failure of one disk, plus one disk error).
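
For the original 4 x 8 TB question that trade-off looks roughly like this (raw arithmetic only - real pools lose a bit more to metadata, padding and slop space):

Code:
# Usable space vs. fault tolerance for 4 x 8 TB drives (raw numbers only).

size_tb = 8

layouts = {
    # name: (data drives, faults survived)
    "raidz1 (3+1)":        (3, "any 1 disk"),
    "raidz2 (2+2)":        (2, "any 2 disks"),
    "2 x mirror (stripe)": (2, "1 disk per mirror (2 total, if you're lucky)"),
}

for name, (data_drives, faults) in layouts.items():
    print(f"{name:22s} usable ~{data_drives * size_tb} TB   survives: {faults}")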

In addition, you should have backups. At least of the important data. Backups and RAID protect against different things. Backup is against an admin who by mistake does "rm -Rf *". Or against a file you deleted in earnest, and then a week later discovered you shouldn't have. Or against there being a fire at your location that destroys all four disks.
 
Just seconding all the things ralphbsz says above.

Stripes of mirrors should be used for performance (IOPS) sensitive environments where you have sufficient backups. For most home users, raidz2/3 (depending on the size of the array) are much better choices for durability.

And RAID != backup. If your data is important to you (think family photos or financial records, not your stash of mp3s and dvd rips), you need a copy somewhere outside your home.
 
The best opportunity you will get to understand the performance, capacity, and resilience trade-offs is while you are building the new system and before you commission it.

Real understanding will come if you take the time to get a copy of bonnie++, or iozone, and test the various disk configuration options. (With bonnie++, disable the getchar/putchar tests which are way too slow for "big disks".)
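
For example, something along these lines gets you a useful baseline (flags as I understand bonnie++(8): -d test directory, -s file size in MB (make it at least 2x RAM so the ARC can't hide the disks), -f skips the per-char tests, -u the user to run as - double-check against your local man page):

Code:
# Minimal sketch: run bonnie++ against a dataset on the new pool, skipping the
# per-char (getchar/putchar) tests, which take forever on big disks.

import subprocess

cmd = [
    "bonnie++",
    "-d", "/tank/benchmark",   # hypothetical dataset on the new pool
    "-s", "65536",             # 64 GB of test file, assuming 32 GB of RAM
    "-f",                      # skip the per-char IO tests
    "-u", "nobody",            # bonnie++ won't run as root without -u
]
subprocess.run(cmd, check=True)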

If possible, test your real application(s).

You can even pull a disk, and see what re-silvering does to the outcomes.

When I first saw RAIDZ, I was very pleasantly surprised at how well it took advantage of striping.

I agree that local RAID != backup. The problem is that NAS storage capacity has outstripped the economic technologies for traditional backup methods. I triage my data. The top terabyte gets off-site backup (rsync to a pair of eSATA disks, with one off-site at all times). My life is a lot easier because that triage was designed into the ZFS server data layouts from the outset.

However, large commercial sites do use RAID for backups. It's simply an issue of how many copies you keep, and where they are physically located. Dell/EMC have made a killing with Data Domains (which are remarkably similar to ZFS servers) because, at some point, securing all the backup data onto tape regularly becomes either physically or economically impossible.
 
The Data Domain product is literally a backup. Backup doesn't have to be tape; a separate copy of data on spinning rust or flash can certainly be a backup. The main features in my mind that distinguish a "backup": physical distance (it shouldn't be under the same roof) and some level of offline-ness: the data isn't being manipulated in real time by the "active" primary system. It's the rm -Rf issue; if one fumbled keystroke (or malicious user) can remove all copies of the data, it's not a backup.

My current backup is an 8TB encrypted (geli) zpool on a USB drive that lives at work. Every month or two it comes home, I update the backup over the weekend (ZFS send/recv plus expiration of the oldest snapshots), and then it goes back to work. I also back up photos to iCloud, and important documents to an encrypted rclone destination on Google Drive. So not all that expensive. 😁
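
The monthly update is essentially just an incremental send/receive plus pruning. A minimal sketch of that routine (pool/dataset names and snapshot labels are made up; the geli attach / zpool import steps and error handling are left out):

Code:
# Minimal sketch of the monthly refresh: snapshot, incremental zfs send into
# the USB pool, then expire the oldest snapshot on both sides.

import subprocess
from datetime import date

src = "tank/home"                      # live dataset (made-up name)
dst = "backup/home"                    # dataset on the geli-backed USB pool
prev = "monthly-2024-01"               # newest snapshot already on the backup
cur = f"monthly-{date.today():%Y-%m}"

# 1. take the new snapshot on the live pool
subprocess.run(["zfs", "snapshot", "-r", f"{src}@{cur}"], check=True)

# 2. incremental send | receive onto the USB pool
send = subprocess.Popen(["zfs", "send", "-R", "-i", f"@{prev}", f"{src}@{cur}"],
                        stdout=subprocess.PIPE)
subprocess.run(["zfs", "receive", "-F", dst], stdin=send.stdout, check=True)
send.stdout.close()
send.wait()

# 3. expire the oldest snapshot on both sides (keep whatever window you like)
oldest = "monthly-2023-01"
subprocess.run(["zfs", "destroy", "-r", f"{src}@{oldest}"], check=True)
subprocess.run(["zfs", "destroy", "-r", f"{dst}@{oldest}"], check=True)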

Edit: And yes, the backup can/should certainly be on RAID for durability, too. That's actually a good way to think of it: RAIDZn for durability (unlikely to break), backups for recoverability (disaster / whoopsie / virus / ransomware).
 
My current backup is an 8TB encrypted (geli) zpool on a USB drive that lives at work. Every month or two it comes home

Same way here - all my firewall configs, DNS jails, dhcp configs, etc.
But only 50-60MB :)
I don't manage the amount of data that the users who post here do, but I second the point about backing up to something external, outside the servers/datacenter.
 