ZFS Storage scheme with 4+1 (12+TB) drives

Hello there,

I am building a new home media server using an HP MicroServer N40L with 8 GB RAM + 4 x 3 TB WD Green HDDs + 1 x 256 GB SSD. Basically I intend to use it for personal data (1 TB at the most) that needs redundancy, plus everything else that doesn't need redundancy (4K movies, games, etc.).

I'm pretty new to FreeBSD and trying to wrap my head around ZFS: what would be the optimal storage scheme with these 4 + 1 drives for a home server? The SSD is for the OS (with way too much headroom, of course), but what about the rest? I really don't care about possible data loss for anything *above* that 1 TB of data, i.e. my movies etc. By the way, I will run NextCloud + Plex too.

Thanks for any helpful insight in advance.
 
Thank you for the quick reply, but with which RAID-Z level exactly? And what about mirror vdev(s), perhaps?
 
RAID-Z1.
If you choose mirrors you will lose too much storage space. With RAID-Z1 across 4 disks of 3 TB each, the usable space is 9 TB. With mirrored vdevs, the zpool would be two mirrors of 3 TB each, for a total of 6 TB of usable storage space.
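To make the two layouts concrete, here is a minimal sketch (the gpt/disk1 .. gpt/disk4 labels are just placeholders for whatever your drives are called):

    # RAID-Z1 across all four 3 TB disks: ~9 TB usable, survives any single disk failure.
    zpool create tank raidz1 gpt/disk1 gpt/disk2 gpt/disk3 gpt/disk4

    # Two striped mirrors: ~6 TB usable, survives one failure per mirror pair.
    zpool create tank mirror gpt/disk1 gpt/disk2 mirror gpt/disk3 gpt/disk4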
 
OOPS EDIT: When I first wrote this, I thought you were going to use four 12 TB drives. Now I see that you have 12 TB total = four 3 TB drives. Quite a few updates below because of that.

On the important data, where you said you "need redundancy": How valuable is your data? How big would the damage be if one file, or the whole file system, was gone? How much work would it be to restore from an off-site backup (which you certainly have if your data is actually valuable)? How disruptive would a multi-day outage be while you rebuild your system and restore from backup? Most likely the answer is: you really don't want to take that risk.

Here's why I'm asking: With today's very large disk drives (which you are using), and the rate of uncorrectable errors not having improved with time (it is still spec'ed at around 10^-15 per bit, plus or minus an order of magnitude, and the real-world rate is considerably worse), single-fault-tolerant RAID is no longer good enough. If one drive fails, you need to read all the other drives to rebuild, and the probability of hitting an error while reading all three remaining drives is uncomfortably high, meaning you may well hit an error during RAID reconstruction.

You can do the math yourself: 3 TB x 3 drives (after 1 has failed) x 8 bits/byte x 10^-15 errors/bit = 0.072. That's the expected number of read errors when you have to do a full read. Since that number is relatively small, you can roughly say: the probability that your 1-fault-tolerant RAID will be damaged during a rebuild is roughly 7% (and roughly 93% of the time you will survive a drive failure). For really important data, where loss or a multi-day outage would hurt, you don't want to take a 7% risk. That defeats the purpose of RAID.
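If you want to redo this arithmetic for different drive sizes, it's a one-liner (assuming the same 10^-15 per-bit error rate):

    # 3 TB per drive, 3 surviving drives, 8 bits per byte, 1e-15 errors per bit
    echo '3 * 10^12 * 3 * 8 / 10^15' | bc -l
    # -> 0.072, i.e. roughly a 7% chance of hitting at least one error during the rebuild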

The good news is that your valuable data is pretty small: only 1 TB out of the 12 TB of raw capacity. I would definitely store that part of the data with at least 2-fault tolerance. You could use ZFS RAID-Z2 for that. But all parity- or Reed-Solomon-based encoding schemes suffer from bad performance for small writes (ZFS partly mitigates that with its copy-on-write, log-structured design, but only partly). And since your capacity is so large, here is a proposal: for that 1 TB, store it 4-way mirrored. That uses 4 TB of space out of your 12 TB available, about 1/3. Actually, your 1 TB space estimate might be too optimistic. If you think you need 1 TB, then you should probably reserve 2 TB of space, use 4-way mirroring, and there goes your first 8 TB of disk space.

That leaves you with 4-8TB of disk space for the "unimportant" stuff. Even that I would not store without redundancy. Why? Not because of the risk of loss of data, but because of the work and hassle of having to recreate it, or restore it from backup. I would use RAID-Z1 (single fault tolerant RAID) for that, which gives you a formatted capacity of 3-6TB. If you get lucky (about 93% of the time), that is enough redundancy that if you get a disk failure, you can just put a new disk in and reconstruct everything without data loss. If you get unlucky (the other 7%), then it sucks being you, but that's a good tradeoff.

How to implement this? Sadly, I don't know a way to tell ZFS "take this device, and logically partition it into two volumes, take the left one and put it into a 4-way-mirrored pool, and take the right one and put it into a RAID-Z1 pool". It would be cool if ZFS could do that itself and change the size of the volumes dynamically, but I don't know how to do it (or whether it is even possible). So here is how I do it: take each of the raw drives and use gpart to partition it, creating a 1-2 TB partition (give them logical names like "valuable1" through "valuable4") and a partition for the rest ("scratch1" and so on). Then use the zpool command to make two storage pools: the first one you create as a 4-way mirror out of all the "valuable" partitions, and the second one you configure as RAID-Z1 out of the "scratch" partitions. Then set up your file systems. I would definitely use symbolic names in gpart; it makes management so much easier than wrestling with /dev/ada2p3.
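As a rough sketch of the above (the ada0-ada3 device names and the 1 TB split are assumptions; adjust to your hardware and your space estimate):

    # Partition each drive: a 1 TB "valuable" slice plus the rest as "scratch".
    for n in 0 1 2 3; do
        gpart create -s gpt ada$n
        gpart add -t freebsd-zfs -s 1T -l valuable$((n + 1)) ada$n
        gpart add -t freebsd-zfs -l scratch$((n + 1)) ada$n
    done

    # Pool 1: 4-way mirror for the valuable data.
    zpool create valuable mirror gpt/valuable1 gpt/valuable2 gpt/valuable3 gpt/valuable4

    # Pool 2: RAID-Z1 across the leftover space for everything else.
    zpool create scratch raidz1 gpt/scratch1 gpt/scratch2 gpt/scratch3 gpt/scratch4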

Lastly, the SSD. Using it as a boot drive is a great idea, it makes booting really fast. But be aware: You have NO redundancy here! On the other hand, there is no valuable data on the boot drive. But if your SSD fails, your computer will be down for at least many hours (perhaps days), while you drive to the store, get a new SSD, and reinstall the OS. And reinstalling the OS from scratch and getting all the little tuning and configuration right will take a long time (BTDT, tedious). Now, does this mean you should buy a second SSD right away? I think that's a waste of money for typical home users. Here would be my proposal: Have a cron job that once a day makes a full backup of the boot SSD onto your scratch ZFS file system. Then if the SSD dies, you need to (a) buy a replacement, (b) temporarily boot from a USB stick or DVD, (c) copy the backup to the new SSD, (d) do some minor tweaking to make the new SSD bootable, and (e) go drink, because the system is back up and running, with just an hour or two of work. There are many other variations that are possible. For example, you could reserve a tiny amount of space on the four big drives to have a backup bootable system there, and update that regularly. Then if your SSD dies, you just take it out, and temporarily boot from hard-disk, until you get the replacement SSD in the mail. That might be easiest if you use root-on-ZFS, and use snapshots and copy to update the backup. There are many options.
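One possible shape for the cron-job idea above, as a sketch (assuming a UFS root on the SSD and a /scratch/backups directory on the big pool; both names are placeholders):

    # /etc/crontab entry: level-0 dump of the root filesystem every night at 03:00.
    # dump(8) with -L snapshots the live filesystem first, so it can run on a running system.
    0 3 * * * root dump -0Lauf /scratch/backups/root-ssd.dump /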

As you said, most likely your root/boot file system will not fill the SSD. You can partition it with gpart and use the leftovers for a ZFS cache. There are various ways to use it (logs, L2ARC, and all that). Personally, I wouldn't even waste my time on that. Why? For a home user, the performance of ZFS on spinning disk drives is typically adequate. Sure, it will run faster with an SSD cache, but will the extra speed actually buy you real-world happiness? Enough to balance the complexity and extra work of setting it up?
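If you do want to try it anyway, attaching and detaching an L2ARC device is cheap and reversible (the gpt/ssdcache label for the leftover SSD partition is a placeholder):

    zpool add scratch cache gpt/ssdcache      # attach the spare SSD partition as L2ARC
    zpool remove scratch gpt/ssdcache         # detach it again if it turns out not to be worth it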

There is also a technical argument against it: SSDs don't like being written to; modern flash chips have remarkably bad write endurance. By using your only (!) boot disk as a ZFS cache, you are shortening its life, and increasing the probability that it will die. And as described above: if your boot disk dies, it will be a big hassle, and ruin your whole day. Now the write endurance is not a big effect, but for home users, this makes the convenience <-> performance tradeoff even more unbalanced.

Last two pieces of advice: Run smartd, and check regularly for any signs of problems with your disks. And scrub your ZFS file systems regularly. I do mine every 3 days (which is probably excessive), but 2-4 weeks as a scrubbing interval seems to be industry standard practice.
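On FreeBSD, both of those take only a few lines to set up; a sketch, assuming smartmontools from ports/packages and a roughly monthly scrub interval:

    # smartd comes with sysutils/smartmontools; copy smartd.conf.sample to
    # /usr/local/etc/smartd.conf and adjust it before starting the daemon.
    sysrc smartd_enable=YES
    service smartd start

    # /etc/periodic.conf: let periodic(8) start a scrub once the last one is >30 days old.
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="30"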
 
Get rid of the WD Green drives. Don't try to use them with ZFS. Just ... don't. It's not worth the headaches they will cause. Green drives don't play nicely with RAID. Seriously, just turf those drives, they aren't worth the metal they're made from.
 

I don't know about that. My 512 GB WD Greens worked in a RAID-Z1 setup for the entire life cycle of my previous desktop computer (8 years). In fact, they outlasted the other components.

But perhaps due to higher data density...
 
Get rid of the WD Green drives. Don't try to use them with ZFS. Just ... don't. It's not worth the headaches they will cause. Green drives don't play nicely with RAID. Seriously, just turf those drives, they aren't worth the metal they're made from.
You exaggerate.

Now, notice that I didn't say "you're wrong", because you are definitely somewhat correct. The Green drives are optimized for capacity and designed for use in single-disk systems. One of the biggest effects of that is that they will just about kill themselves attempting to re-read data after a read error, in spite of the fact that RAID allows the computer to reconstruct the data from the other drives. The overall system ends up working very badly, because the RAID layer first has to wait 10 or 20 seconds for the defective drive to retry over and over, which may even create OS problems (like I/O timeouts).

Even worse: I think the Green drives implement power-saving tricks that turn parts of the drive off when "idle", and they expect the workload to actually go idle (like most single-user computers do). This tends to play very badly with RAID systems, because (a) scrubbing means things never go completely idle, and (b) RAID can leave an individual disk idle for a while even though the workload continues on the other disks. The Greens are famous for their variable RPM and long seek and spin-up times. The changing RPM becomes a liability when RAID ends up trying to use all the disks in sync, but one is much slower than the others.

But here is where you exaggerate: if the guy already owns them, and they are paid for, they will work. It might be a bit slow, but with a good fault-tolerant RAID, it will be reliable enough. Throwing them in the trash is an overreaction. If he were building a system from scratch, he should use enterprise-grade nearline drives (I think WD sells them as "Black", although the real enterprise drives manufactured by WD are sold under the Hitachi or HGST brand). As a matter of fact, for a smallish RAID system I personally would only buy HGST disks (funny coincidence, that's what's in my server!).
 
Wow, thanks everybody! :D I'll try to reflect on as much as I can, in no particular order.
  • as for the drives themselves - yep, I kind of "inherited" these drives, so that's what I have for the time being, BUT I am more than willing to invest in more suitable ones (especially after what was written by ralphbsz + phoenix :)). I understand HGST drives are superb, but what do you think about WD Reds? I know they are slower than HGST (RPM and all), but aren't they cooler, quieter and less power hungry? I'd sacrifice some performance for the sake of these latter aspects in this MicroServer - after all, I'm aiming for a home environment.

  • ralphbsz - if your math is "resilvered" for 4x4 TB or 4x6 TB of raw storage, would you still suggest basically the same proportions?

  • it seems to be a good idea to create 2 separate storage pools, one mirrored for the sensitive data and one RAID-Z1 for everything else. My sensitive data currently weighs about 500 GB, hence I counted twice that. I also have an HP RDX removable disk backup unit with 1 TB cartridges, so there you go, your separate backup solution. :cool:
 
Having 2 disks mirrored and 2 disks in RAIDZ1 doesn't make much sense to me. You could just do 2 separate mirrors instead...

Personally I would probably be eyeing up RAIDZ2 across all 4 disks - if I was happy with ~5TB of space, especially with cheaper disks as it maintains redundancy during a failure+resilver. Not sure how much 4K video that would hold though.

I use reds all the time and am pretty happy with them. They are our default go-to when we need a standard SATA disk.

Just to add: There's the option of putting the system on the pool and using the SSD as cache or log, as already mentioned. I'm not sure either of those would really be that effective for you though. If any of the software you plan to install uses a database, that would heavily benefit from SSD (not that you'll really be stressing it) so I'd consider either having the system + applications on the SSD using UFS, or a ZFS pool on the SSD for the system, and a UFS partition for database storage (then back them up to somewhere on the data pool - note of course a catastrophic failure could take all the disks out so that's not a "true" or foolproof backup).
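A rough sketch of that last layout, with an assumed device name of ada4 for the SSD and made-up partition sizes (root-on-ZFS needs more setup than this; see the handbook):

    gpart create -s gpt ada4
    gpart add -t freebsd-boot -s 512k ada4                        # boot code partition
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada4
    gpart add -t freebsd-zfs -s 200g -l ssdsys ada4               # system pool
    gpart add -t freebsd-ufs -l ssddb ada4                        # leftover: UFS for databases
    zpool create system gpt/ssdsys
    newfs -U /dev/gpt/ssddb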
 
If your WD Greens are one of the older models, search Google for wdidle and wdtler. It's also a good idea to check their SMART status (attribute ID 1 and ID 193 / C1).
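For the SMART check, something like this works with smartctl from smartmontools (ada0 is a placeholder for your drive):

    smartctl -A /dev/ada0 | egrep 'Raw_Read_Error_Rate|Load_Cycle_Count'   # attributes 1 and 193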
 
I've just skimmed through the link you gave, Vladi. Uhmazing! So I set the head-parking interval on my Greens from 8 seconds to 300 seconds, all right. :p

Anyway, so it's either 2 separate mirrors or 1 RAID-Z2 with all the disks. Hm... Maybe the latter would mean the least hassle... or is it a 2x2-wide RAID-Z2 configuration, or perhaps a 1x4-wide RAID-Z3... just thinking out loud... o_O
 
if your math is "resilvered" for 4x4Tb or 4x6Tb raw storage would you still suggest basically the same proportions?
See below regarding the size of your "valuable" data. I would still recommend putting "valuable" data on the most fault-tolerant storage scheme you can afford (multiple way mirrored is a good example). The question then becomes: Do you have enough raw disk space to afford making all data this fault tolerant? If yes, then life is easy. But most likely you don't (even with 4 or 6TB drives). In that case, you have to split your disks into "valuable" and "scratch" areas, and for the "scratch" use a storage scheme that reduces risk to a small but tolerable level, mostly to minimize work of recreating the data (like RAID-Z1).

Now, what should your "most fault tolerant scheme" be? As I said above: If you have 4 drives, I would use 4-way mirrored. Sure, the storage efficiency is horrible (25%), but it can handle any failure that's statistically likely. And with just 4 drives, you can not do better. You can instead use RAID-Z2 (2-fault tolerant, storage efficiency is 50%), but given that the 1TB for your valuable data is not a huge part of your total space, I don't see the point of compromising storage efficiency for fault tolerance on that part.

In a nutshell, the problem of storing data on a small (consumer grade) system is that you don't have enough disks to spread the risk over. You have 4 disks (I have 2 at home), which means your best fault tolerance is 3-fault tolerant, and that gives you an efficiency of 25%, and there is nothing we can do about that. A professional-grade system uses hundreds or thousands of disks, and then groups them into larger groups. For example, I have worked on systems that use 11 disks at a time to be 3-fault tolerant (with a storage efficiency of 73%), and on systems that spread the data over more than 50 disks at a time and are around 90% efficient yet can handle a half dozen faults. But home or small business users don't have the 11, 50 or thousand disks that are required for such schemes.

My sensitive data currently weighs about 500 GB, hence I counted twice that.
That seems sensible: If you are currently using 500GB, then budget for 1TB. It's hard to estimate future growth rates (I personally use 20% per year for my home data, but everyone's use pattern varies), so a factor of 2 is probably good for a few years without having to redo your storage. But: a lot of how you set up your system depends on (a) your estimate of space usage, and (b) your tolerance of risk of data loss, separately for valuable and scratch data.

And from an efficiency point of view, this is not crazy: You have a total of 12 (=4x3) or 16 or 24TB of raw space. If you carve out the first TB for valuable data which is stored using 4-way mirroring, and then use the rest using RAID-Z1 for scratch space, you end up with a usable capacity of 7TB (=1+6, with 3TB drives), or 10 or 16TB (with 4 or 6TB drives). That is an overall efficiency that is better than 50%, yet you have a good compromise in your level of protection.

Clearly, an alternative way of doing this would be to take your 4 drives and put them into a RAID-Z2, and use that for both valuable and scratch data. Efficiency is 50%, fault tolerance is 2 faults, and system administration is easier - in particular if your estimate of "1 TB of valuable data" turns out to be wrong.
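The single-pool variant is about as simple as it gets (labels are again placeholders); you could still keep the valuable data in its own dataset, which makes it easy to put a quota or different properties on it:

    zpool create tank raidz2 gpt/disk1 gpt/disk2 gpt/disk3 gpt/disk4
    zfs create -o quota=1T tank/valuable    # optional fence around the "valuable" estimate
    zfs create tank/scratch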
 
I see. So either I go e.g. for 2 x 500 GB partitions in a mirror sitting next to everything else in a stripe, all wrapped up in 1 storage pool, or I have the whole bunch of storage sit in RAID-Z2. You know what? With only 4 disks I'd really go for dodging physical faults; I don't really care about performance, plus I have an external backup along with (commercial) cloud access. So having everything in one array is getting more and more appealing by the minute: RAID-Z2 or mirrors. Final thoughts?

PS. Speaking of performance, though: should I care about the ZFS vs. 4K drives issue these days? In other words, should I deal with "tricks" like gnop (treating my WD Greens as 4K drives) before "zpool create", or not?
 
In that case, with 4 drives I would do RAID-Z2. And you absolutely need to deal with 4K drives. But I think that is completely automated these days; when I set up ZFS on a new FreeBSD 11.0 system there was nothing special I had to do. Check the handbook.
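For what it's worth, on recent FreeBSD you can force and verify the 4K alignment without gnop (tank is a placeholder pool name):

    sysctl vfs.zfs.min_auto_ashift=12   # make sure new vdevs are created 4K-aligned
    zdb -C tank | grep ashift           # should report ashift: 12 after the pool is created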
 
While I like the Calomel site for its ZFS information, the benchmark numbers for 24 drives seem very low. 24 SSDs should give way more performance. He is using a SAS expander instead of raw controllers and shows a single LSI 9207-8i. That is not leet.
The rest of the information is top notch.
 