20 TB RAID-Z3 hardware guidance

Nutshell:

I am putting together an ~20 TB RAID-Z3 FreeBSD box to replace an existing 4 TB RAID-Z2 FreeNAS appliance. I intend the system to function primarily as a ZFS NAS platform with NFS/Samba, but I will also be installing Postgres and Apache on it. I'm new to FreeBSD, and my intention to use ECC memory, plus the need to support a large number of SATA devices, has moved me into hardware territory I'm not very comfortable with, so I'd appreciate any advice or sanity checks.

Details:

I have been very happy with FreeNAS; my 5 x 1.5 TB zpool has run more-or-less flawlessly for about three years. However, it's close to capacity (roughly 400 GB left) and one drive is coughing up checksum errors at a distressing rate. I'd also like to upgrade the NAS to function as a modest-weight server (Apache, Postgres, and eventually I'd like to get my HDHomeRun working on it as well).

My general mindset / guiding principles, in rough order of priority:
  1. Data integrity: I'm confident that ZFS will preserve the integrity of my data once it's in the pool. However, I fret over in-RAM errors; hence, I've decided to go with ECC memory on this system.
  2. Low cost: I'm going for the "I" in RAID. I've come to recognize that an ECC system is a relatively big step up in price, but am accepting that. The bulk of the cost is going to be in the drives, and I don't want to go wild there.
  3. Out-of-the-box functionality: From looking over the supported hardware, I don't think I'm straying into esoteric components, but would really appreciate warnings if I am heading towards devices that require heroic efforts to install.
  4. Reliability: I'm willing to accept moderately frequent failure rates with the drives, given the redundancy and ease of purchase. For the other components, I'd rather not be swapping them out every year. I tend to value warranties not so much because I expect to use them, but because I see them as a measure of confidence the vendor has for their own products.
  5. Low power: As much as possible I'm looking for low power consumption. It is not necessary that the machine be a high-performance compute farm.
  6. Let it sit: I'm intentionally building far beyond my current needs. I don't mind building systems, but I'd rather use them and just "let them be" for as long as possible. 20 TB is chosen to put off a rebuild for as long as possible.

I generally shop at NewEgg, and have found (probably unsurprisingly) that they're not really geared toward hard-core server components. If it's kosher for the forum, I would be interested if people have recommendations for other vendors that are focused on more appropriate hardware. I am providing NewEgg links below because they're what I have in my notes, and they provide full details on each part. At the moment, I am considering:

  • CPU: Xeon E5-2603 quad-core 1.8 GHz (LGA2011, $220, 80 W) or Xeon E3-1230 V2 Ivy Bridge quad-core 3.3 GHz (LGA1155, $240, 69 W); AFAICT these are the cheapest ECC-supporting chips on their site.
  • Motherboard: handwringing over Intel S1200BTLR ($230 LGA1155) or Intel S2600CP4 ($550 dual LGA2011). The latter seems overkill, but might make it easier to hook up the required number of drives. I have focused on Intel boards because they seem to have the best warranty.
  • Disks: I calculate I need 11 x 2 TB drives to get "20 TB", and RAID-Z3 plus a hot spare takes me to 15 drives total (a rough sketch of the zpool layout follows this list). Historically I've used WD Green drives and have been reasonably happy, but am open to other suggestions. I'm intrigued by the Red line, and would be willing to pay extra for the two additional years of warranty (with over a dozen drives I expect to be sending a few back) and the apparently lower power consumption. I'm also a little worried that they may not be appropriate for ZFS; I have read anecdotes on the web from apparently satisfied ZFS users, but a call to WD support asking about ZFS suitability was met with the telephone equivalent of a blank stare.
  • Host Bus Adapters: I think this is the way to go? If I understand SAS right, I can attach 4 SATA drives to each port with special cabling (e.g. this?). That means I'd need 4 SAS ports. I have the vague-but-possibly-misguided sense that I'd be better off with 2 cards x 2 ports, rather than 1 card with 4 ports. LSI appears to have the best support for open operating systems, so I was leaning toward two 3081E-R (3Gb/s $215) or 9211-8i (6Gb/s $260).
  • Expander: Or is this the way to go? The RES2SV240 ($280) lists 6 SAS ports with 6 Gb/s. Comparing an HBA to the expander, I don't have a sense how I/O is moved (or bottlenecked) through one or the other, or if there are other considerations I'm oblivious to.
  • RAM: No real research yet until I've settled on the CPU and board. I will aim for 16-32 GB because even with ECC, RAM is cheap (and I'm spoiled at work with 500 GB systems). I would like some thoughts on registered/buffered memory; what are the advantages of either/both? I recognize I need to match the RAM to the motherboard, but don't know if I should be seeking out, say, boards that use registered memory.
  • Case: Cheap. I found a 4U / 15x3.5" rack mount for $100.
  • Power supply: I have a spare SeaSonic X650. I've used it on two other systems, including the FreeNAS box, and was impressed with the low power consumption. However, I've read suggestions that it can provide unstable power, and would be willing to consider alternatives.
  • NIC: Planning to use the on-board interface(s).
  • SSD: I'll use an SSD for boot and system; I think I have a spare M4. I'll probably have a non-zpooled HDD for scratch/swap.
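
For concreteness, here's roughly the zpool layout I'm picturing, assuming the 14 pool drives show up as da0-da13 and the spare as da14 (device names are placeholders since I haven't settled on a controller yet; 4K-sector alignment is a separate detail I still need to sort out):

  # one 14-drive RAID-Z3 vdev (11 data + 3 parity) plus an in-chassis hot spare
  zpool create tank \
      raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 \
      spare da14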

If you've read this far thank you. Any guidance is most appreciated. I both work and play in Linux, but am looking forward to picking up experience in FreeBSD!
 
listentoreason said:
If you've read this far thank you. Any guidance is most appreciated. I both work and play in Linux, but am looking forward to picking up experience in FreeBSD!
You might want to look at http://www.tmk.com/raidzilla2 (draft article, not yet formally posted). It describes a similar build (32 TB raw, but more compute power and a noisier chassis). You may find the "lessons learned" sections particularly useful. Also, as this was my second big storage server design/build, it benefited from my prior experience.
 
Terry_Kennedy said:
You might want to look at http://www.tmk.com/raidzilla2 (draft article, not yet formally posted). It describes a similar build (32 TB raw, but more compute power and a noisier chassis). You may find the "lessons learned" sections particularly useful. Also, as this was my second big storage server design/build, it benefited from my prior experience.

That's definitely helpful! A few questions:
  • You mention a battery backup; am I correct in interpreting that it is specifically for the RAID controller? If so, what's the logic behind that? Is the controller a potential source of unusually bad failure in a power outage?
  • What was your thinking in going with a RAID card in JBOD mode rather than using an HBA? From another of your articles (RAIDzilla II and Backblaze Pod compared) you indicate you're focused on speed. I had (probably naïvely) presumed that the HBAs would be focused on throughput, with the RAID card diverting some of its resources to the various RAID management features. It appears you're using high-end drives; are they still the limiting factor for I/O, or is the card the bottleneck?
  • Per your comments on ZFS autoreplace not being implemented - should I just keep a replacement drive on the shelf, rather than having it "hot" in the system? With RAID-Z3 I should be heavily "hot protected" as is.

I appreciated the comments about backups. Many years ago I looked at tapes but was distressed by the cost. I use a sneaker-net with my parents: I keep external rsync'ed USB HDD backups on a shelf at my house (1-2 TB drives), and additional copies at Dad's. When we visit each other a fresh copy goes to his home, and the stale one returns here to be refreshed. Updates are discrete (not cron'ed), which provides a crude one-off snapshot. I'm sure there's some bit-rot on them, but I'd expect to recover the majority of files. I have some time-constrained backups (e.g., "only files between Jan-2009 and May-2012") that can stay put to reduce the number of drives driven back and forth; I just need to remember to fire them up occasionally to keep the spindles limber. He's far enough away to hopefully limit catastrophic loss at both sites to Yucatan-scale asteroid impacts. Once the new system is built, one option is to disassemble the old FreeNAS zpool, rebuild it with a few spare 1.5 TB drives to bump up the capacity a bit, send that down to his house, and set up rsync over SSH. I'm reluctant to punch a hole in his router, though.
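
In case it's useful context, the refresh step is nothing fancier than rsync to whichever external drive happens to be in town, and the eventual off-site variant would just be rsync over SSH; the paths and hostname below are made up for illustration:

  # refresh the local copy on the external backup drive
  rsync -aH --delete /mnt/tank/media/ /mnt/usb_backup/media/

  # the eventual remote variant, pushed over SSH to a box at Dad's
  rsync -aHz -e ssh /mnt/tank/media/ backup@dads-nas:/backup/media/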

Thanks again for sharing your build!
 
listentoreason said:
You mention a battery backup; am I correct in interpreting that it is specifically for the RAID controller? If so, what's the logic behind that? Is the controller a potential source of unusually bad failure in a power outage?
The more memory and autonomous processing there is between the operating system and the physical media, the greater the chance of data loss during a power or hardware failure. The 3Ware controller I'm using has onboard memory. If there's no battery backup, it defaults to not telling the operating system "all done" until it actually writes the data to the drive(s). With battery backup, it reports completion to the operating system and then goes and writes the data to the drive(s). You can override this behavior if desired.

What was your thinking in going with a RAID card in JBOD mode rather than using a HBA? From another of your articles (RAIDzilla II and Backblaze Pod compared) you indicate you're focused on speed. I had (proabably naïvely) presumed that the HBAs would be focused on throughput, with the RAID card diverting some of its resources to the various RAID management features.
I like the 3Ware cards because of their excellent support under FreeBSD, including web-based management and complete sysutils/smartmontools support (see my other topic about missing passthru devices on mpt(4) controllers making drives invisible to smartctl). These cards can be configured to export a number of drives as a single volume to the operating system, export each drive as an individual volume, or act as a dumb HBA. In dumb mode you don't get any of the advantages of controller caching.
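
As a hedged example (controller number, port number, and device node will vary with the exact card), pulling SMART data from the disk on port 0 of the first 3Ware controller looks something like this:

  # SMART attributes for the drive on 3ware port 0, via the controller's device node
  smartctl -a -d 3ware,0 /dev/twa0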

I think one of the bottlenecks in the Pod is the use of SATA expanders - a single hard drive (non-SSD) can't saturate a SATA channel, but 4 or 8 disks behind a port expander can.
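
Back-of-the-envelope, using assumed round numbers (a 3 Gb/s SATA link is ~300 MB/s after 8b/10b encoding, and a 7200 RPM drive streams on the order of 130 MB/s sequentially):

  # drives needed to saturate one 3 Gb/s link under those assumptions
  echo "scale=1; 300/130" | bc    # ~2.3, so 4 or 8 drives behind one link will bottleneck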

It appears you're using high-end drives; are they still the limiting factor for I/O, or is the card the bottleneck?
I believe the limit is caused by the number of ZFS vdevs in the pool. A larger number of vdevs, each with a smaller number of drives, should provide better I/O throughput. Going from 3 x 5 to 4 x 4 wasn't a big win, as ZFS apparently likes 2^n+1 drives for raidz1, 2^n+2 for raidz2, etc. per vdev.

One issue is getting useful benchmarks - benchmarks/iozone wants files at least 2x the size of main memory, and a full run (all record sizes) can take days. However, the results are nice.
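
For a sense of what a run looks like (the file-size cap and target path here are just examples; the cap should be at least twice your RAM):

  # full automatic iozone sweep, test file capped at 64 GB, results saved in spreadsheet format
  iozone -a -g 64g -f /tank/iozone.tmp -b iozone-results.xls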

Per your comments on ZFS autoreplace not being implemented - should I just keep a replacement drive on the shelf, rather than having it "hot" in the system? With RAID-Z3 I should be heavily "hot protected" as is.
It depends on how rapidly you can get physical access to the system to replace the drive. It is more likely that other drives will fail during a rebuild (RAID) / resilver (ZFS) - it isn't Murphy's Law, but due to them being stressed more with the additional I/O. With the "hot" spare in the chassis, I can perform a ZFS replace without needing to physically access the server. However, you will want to have spare drive(s) on the shelf, as you don't want to run your pool in degraded mode any longer than you have to.
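
With the spare already in the chassis, the swap is something you can do from an SSH session; the pool and device names below are only illustrative:

  # swap the failed disk for the in-chassis spare, then watch the resilver progress
  zpool replace tank da3 da14
  zpool status tank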

I appreciated the comments about backups.
Yes - it's something that I don't think gets enough consideration, even here.

Thanks again for sharing your build!
Glad to help!
 
Have you considered going to RAIDZ2 or RAIDZ1 (or mirror?) - maybe with smaller drives in more VDEVs?

Whilst yes, RAIDZ3 will give you better protection within a VDEV, for the (relatively) small number of drives you're talking about, it is a serious amount of overhead and you're throwing away a lot of performance.

I.e., instead of

11x2 TB + 3 parity + spare = 15

maybe

(5x2 TB + 1x2 TB) + (5x2 TB + 1x2 TB) + 2x2 TB spare = 14 (2 RAIDZ1 VDEVs, and 2 hot spares)

You'll have 2 VDEVs for improved performance and shorter rebuild times, plus 2 hot spares for improved resiliency in case of drive failure (better than plain RAIDZ1, though RAIDZ3 will obviously still be better). All with one less drive.

Because you've got 2 VDEVs, you can actually have 1 failure in each VDEV at the same time (prior to resilvering) and still be OK. If the rebuilds to the hot spares complete in time, you can then sustain another 2 drive failures (1 in each VDEV) before a further failure (i.e., the 5th overall) takes you out.
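
In zpool terms that layout would be something along these lines (device names are placeholders):

  # two 6-drive RAID-Z1 vdevs (5 data + 1 parity each) plus two hot spares
  zpool create tank \
      raidz1 da0 da1 da2 da3 da4 da5 \
      raidz1 da6 da7 da8 da9 da10 da11 \
      spare da12 da13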


Sure, strictly going by failure probability RAIDZ3 will give you better fault tolerance, but you will pay for it... you are going to have backups anyway, right? Each to their own of course, but in an array that size, RAIDZ3 is overkill IMHO. But that's for you to decide.

Even enterprise array vendors (e.g., NetApp) are only using RAID6 (RAID-DP), and even with RAID-DP they're happy to recommend larger numbers of SATA drives per RAID group than you have here.
 
throAU said:
Have you considered going to RAIDZ2 or RAIDZ1 (or mirror?) - maybe with smaller drives in more VDEVs?

Whilst yes, RAIDZ3 will give you better protection within a VDEV, for the (relatively) small number of drives you're talking about, it is a serious amount of overhead and you're throwing away a lot of performance.

Hm, I hadn't thought of that; I didn't realize there were significant performance issues. My thinking was that the parity drives were fixed in number regardless of the size of the pool, so the cost per TB drops as you increase the total number of drives in the pool. I have a general philosophical preference for large storage volumes, since I don't like having to reshuffle directory structures when the volume fills up.
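
To put rough numbers on that reasoning (counting only the drives in the vdev, spares excluded):

  # fixed 3-drive parity as a share of a RAID-Z3 vdev, shrinking as the vdev grows
  echo "scale=1; 100*3/6"  | bc    # 6-drive RAID-Z3: 50% of raw space is parity
  echo "scale=1; 100*3/14" | bc    # 14-drive RAID-Z3: about 21% is parity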

You spurred me to do some searching for benchmarks. There's some interesting data at calomel.org describing comparative benchmarking on FreeBSD 9.1. I plotted out their data, and the major trend I see is "more drives = faster", the exception being writes, which appear to plateau in their data (I presume there may be CPU bottlenecking at large array sizes?). In particular, at large array sizes the performance difference between Z1 and Z3 appears relatively small in their data, on the order of 10% for reads and 15% for writes.

[Attachment: RAID-Z Read Benchmarks.png]
[Attachment: RAID-Z Write Benchmarks.png]


It sounds like you've come across more dramatic differences; could it be older zpool versions, or different hardware? Or am I looking at apples when I should be considering oranges? The calomel data was intriguing since they tested quite large arrays (they have data for up to 24 drives). I found some other benchmarks while googling, but they were for 5 drives (total) and less. If you have links to additional data I'd really like to check them out.

Offhand, I'm willing to tolerate some performance degradation for more redundancy and for just having a single big pot o' storage, but I'm also considering your comments regarding the use of two vdevs and the redundancy that would bring.

Thanks,
CAT
 
