New fileserver: ZFS pool layout

tl;dr: Quite a few alternative ways of configuring a 10-disk file server using ZFS for data storage, some discussion of the advantages/disadvantages of each alternative, and a request for help in deciding which alternative is best for the described situation.

Goal of this thread: Air my thoughts around this task and hopefully get other people's thoughts on the matter, with the ultimate goal of giving people who are going through this process one more source of well-reasoned argumentation for/against the various alternatives in the described situation.

I'm going to set up a new file server in the near future.
Drive bays: 10 (6 internal + 4 in a drive bay fitting in the 5.25" slots)
RAID controllers: On-board (ICH10, 6-port SATA) + HighPoint RocketRAID 2320 (PCI-Express 4x, 8-port SATA)
RAM: 8GB

I'm going for 10x 1.5TB drives, and am trying to find the 'best' balance between data redundancy and capacity.

Considered configuration alternatives

FreeBSD installed on data pool:

(A) RaidZ-2 of 5 drives + RaidZ-2 of 5 drives, 0 hotspares. 60% data capacity. Will be keeping a number of 'cold-spare' drives. The file server will be located in my home, so the time it takes to replace a drive with a 'cold spare' should be minimal.
(B) RaidZ-2 of 9 drives + 1 hot spare. 70% data capacity.
(C) RaidZ-2 of 6 drives + RaidZ-1 of 3 drives + 1 hot spare. 60% data capacity.

FreeBSD not installed on data pool:

(D) Install OS/applications/etc on a 2-disk mirror, adding 7 drives to RaidZ-2, with 1 hotspare for the RaidZ-2 vdev. (data pool gets 50% data capacity)
(E) Install OS on a mirror of 2 USB keys, and use RaidZ-2 of 9 drives + 1 hotspare (70% data capacity)
(F) Install OS on a mirror of 2 USB keys, and use 2x RaidZ-2 of 5 drives in a single pool. 0 hotspares. (60% data capacity)
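For concreteness, here is roughly how alternatives A and B would be created; the device names (ada0-ada4, da0-da4) are just placeholders for however the drives actually show up:

Code:
# Alternative A: two 5-drive RAIDZ-2 vdevs striped in one pool, no spares
zpool create tank raidz2 ada0 ada1 ada2 ada3 ada4 raidz2 da0 da1 da2 da3 da4

# Alternative B: one 9-drive RAIDZ-2 vdev plus a hot spare
zpool create tank raidz2 ada0 ada1 ada2 ada3 ada4 da0 da1 da2 da3 spare da4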

Arguments for/against

Alternatives A-C would cause OS read/write operations to affect the performance of the data storage pool. However, since this is a home file server, redundancy + amount of storage is more important than a minor performance loss.

Alternatives D-F might allow for simpler reinstallation of the OS, since the data pool can be marked/considered 'no-touch' until the host OS is up and running, thus reducing the risk of human error during such an operation.

Alternatives E & F will let me use all 10 drive bays for the data pool, and keep the OS off of the data pool.

Alternative A seems to be the most redundant alternative, but it offers less capacity than B/E and has no hot spare.
Alternatives B & E allow for any two drives to fail at the same time. If resilvering finishes in time, they allow for a total of three drives failing in succession without receiving attention from the system admin.
Alternative C doesn't feel much better than B, possibly because the things stored on the RAIDZ-1 are only two disk crashes away from the void.
Alternative D will, however, reduce the data pool's total size by 20 percentage points. The available storage loss will be somewhat less than this, since the base OS is not installed on the data pool.
Alternative F allows for any two drives to fail at any time. With a bit of luck, four drives may fail at the same time (two in each vdev) without data loss. No automatic resilvering due to the lack of hot spares.

Questions:
I'm landing on A for redundancy or B for capacity, but I feel very unsure about which to pick, and even about my 'measurement' of redundancy. I basically need more pros/cons before I can make a decision; any input is welcome. :)
If a single disk fails in a RaidZ-2, will the pool become unavailable until resilvering is done?
Any thoughts or suggestions?
 
I would go this way:

1. Install FreeBSD (both / and /usr) on RAID1 (use gmirror)
Do it on flash/pendrives attached via USB; these can be SDHC cards through adapters, connected internally in the case. 2-4GB devices will do here. DO NOT CREATE SWAP HERE.

You can later make the flash/pendrives READ ONLY.
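A minimal sketch of the gmirror part, assuming the two flash devices show up as da0 and da1 (adjust to your actual device names):

Code:
# load the mirror class and make it persist across reboots
gmirror load
echo 'geom_mirror_load="YES"' >> /boot/loader.conf

# mirror the two flash devices, then install onto /dev/mirror/gm0
gmirror label -v -b round-robin gm0 /dev/da0 /dev/da1
gmirror status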

2. Physical 'layout' of the disks
-- 5 drives attached to ICH10
-- 5 drives attached to HighPoint

3. IMHO best option choice for the ZFS pool
RAIDZ (5 drives) on ICH10 <-- STRIPE --> RAIDZ (5 drives) on HighPoint

This way You have 80% capacity of the installed disks and also have a 100% increase of speed because of the STRIPE between the controllers.

IMHO a better option here would be adding another 1.5TB drive as a hot spare for these two RAIDZ/RAID5 vdevs, rather than going for a STRIPE of RAIDZ-2/RAID6.

You have 6 + 8 ports (14 drives max), so if You plan to expand that system, then maybe 2 * RAIDZ (4 drives) would be better, and later extend with additional disks for a 3-way STRIPE of 4-disk RAIDZ vdevs, and You will still have 2 'slots' for hot spares.
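A rough sketch of that growth path (device names made up); ZFS lets you add another RAIDZ vdev and spares to an existing pool at any time:

Code:
# start with two 4-disk RAIDZ vdevs striped in one pool
zpool create tank raidz ada0 ada1 ada2 ada3 raidz da0 da1 da2 da3

# later: extend the stripe with a third 4-disk RAIDZ vdev, and add a hot spare
zpool add tank raidz da4 da5 da6 da7
zpool add tank spare ada4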



... about Your questions: RAIDZ/RAIDZ-1 (RAID5 equivalent) can keep working with ONE BROKEN drive, RAIDZ-2 (equivalent of RAID6) can keep working with TWO BROKEN drives, so the pool stays available while a failed disk is being resilvered.
 
Two dual-parity raidgroups are surely overkill, even if you're the type to wear both belt and suspenders :)
Alt B should satisfy your requirement for redundancy with reasonable space efficiency. You could even drop the hot spare, as the array will keep working with the loss of 2 disks.
 
I really like Vermaden's suggestion. FreeBSD on mirrored flash/pendrives in read-only mode is the way to go! I'm planning to do that on my new server.
 
The operating system goes on mirrored USB flash drives; it might or might not be read-only. I hardly see any reason for swap in a file server setup. Swapping server processes will not serve data :)
You may however use the data pool for things like /usr/src, /usr/ports, /usr/obj etc. These do not normally see much traffic.

The ZFS layout... depends :) If you will be writing a lot, it is best to use RAID10, that is, 5 vdevs of mirrored disks. Even better, you can pair the drives so that each mirror spans both controllers. If a controller dies (it has happened to me that the integrated ICH did weird things), you still have your array intact.
If you want more capacity, just replace pairs of drives with larger drives.

Pros: highest possible performance, highest reliability, can grow easily/cheaply, fastest resilver
Cons: only 50% of capacity
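A sketch of such a layout, assuming the ICH10 disks appear as ada0-ada4 and the HighPoint disks as da0-da4, so that each mirror spans both controllers:

Code:
# RAID10: five 2-way mirrors, each mirror pairing one disk from each controller
zpool create tank \
  mirror ada0 da0 \
  mirror ada1 da1 \
  mirror ada2 da2 \
  mirror ada3 da3 \
  mirror ada4 da4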

It is a very bad idea to have a raidz setup with lots of drives in a single vdev. First, resilvering will take eons, may never complete, and in some cases may be lethal (if it so happens that a second drive dies during the resilver). Write performance is no better than that of a single drive. If you will be using raidz, make sure you have hot spares; you may not have much time between a disk failing and the resilver needing to start.

It is a good idea to use disks from different manufacturers and/or batches. Sometimes, all disks from a single batch die at once.

Having said all this, I am a big fan of RAID10. Disk capacity is cheap these days and is getting cheaper.

By the way, the best thing about the RAID10 setup is that you may start with just two drives for your pool. Then just add pairs of drives as the need arises, or replace the existing drives with larger models.
 
danbi said:
RAID10 [...]
highest reliability

I disagree. You lose all your data if the wrong two disks fail (and the failure of the 2nd disk of a mirror during a resilver is quite likely). raidz2 with a hot spare is much more reliable.
 
ZFS mirror vdevs can hold any number of drives; they are not limited to 2. :) Thus, if you want to be absolutely paranoid about your data, yet have the best performance, you can use mirror vdevs with 3, 4, or more drives each. :)
 
Updated the OP with the suggested setups.
I did simplify Vermaden's suggestion, because adding a stripe on top of two raidz vdevs sounds overly complicated to deal with. :)
I'm currently leaning towards E & F, depending on how likely it is to end up 'forever resilvering' a 9-drive raidz2 vdev.
 
Savagedlight said:
I did simplify Vermaden's suggestion, because adding a stripe on top of two raidz vdevs sounds overly complicated to deal with. :)


I will try to simplify ;)

Code:
      +--- STRIPE/RAID0 ---+
      |                    |
   (ICH10)            (HighPoint)
 RAIDZ/RAID5          RAIDZ/RAID5
      |                    |
  +-+-+-+-+            +-+-+-+-+
  | | | | |            | | | | |
  D D D D D            D D D D D  
  R R R R R            R R R R R
  I I I I I            I I I I I
  V V V V V            V V V V V
  E E E E E            E E E E E
  0 1 2 3 4            5 6 7 8 9
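In zpool terms the diagram is a single command; the device names here are only an assumption about how the drives would appear (ICH10 as ada0-ada4, HighPoint as da0-da4):

Code:
# two 5-disk RAIDZ vdevs in one pool = automatic stripe across them
zpool create tank raidz ada0 ada1 ada2 ada3 ada4 raidz da0 da1 da2 da3 da4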
 
Keep in mind that 2x 5-disk RAIDZ has a much greater risk of failure than 1x 10-disk RAIDZ-2. I.e., RAIDZ-2 will *always* survive two failed disks; RAIDZ *can* survive two failing disks only if they happen to be in different raidgroups.
 
Savagedlight said:
Updated the OP with the suggested setups.
I did simplify Vermaden's suggestion, because adding a stripe on top of two raidz vdevs sounds overly complicated to deal with. :)

Adding two (or more) vdevs to a single pool automatically creates a stripe across all of the vdevs. IOW, [b]zpool create mypool raidz da0 da1 da2 raidz da3 da4 da5[/b] creates a stripe (RAID0) across the two raidz (RAID5) vdevs, thus giving you (in effect) a RAID50.
 
That's good to know, thank you. :)
I initially thought ZFS would place a single file on a single vdev, unless copies were set to more than one, in which case the copies could end up on different vdevs... Am I assuming wrong, or is it true?
 
I think it works like this:
If you have a 5-disk (1TB) raidz1 pool you will have 4TB of total storage.
So if you have copies set to 2 you will have 2TB of usable storage.
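For what it's worth, copies is a per-dataset property, so you can double up only the data you care most about; it only affects data written after the property is set, and the dataset name here is just an example:

Code:
# keep two copies of every block for this dataset, on top of the vdev redundancy
zfs set copies=2 tank/important
zfs get copies tank/important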
 
Savagedlight said:
That's good to know, thank you. :)
I initially thought ZFS would place a single file on a single vdev, unless copies were set to more than one, in which case the copies could end up on different vdevs... Am I assuming wrong, or is it true?

Totally, utterly, and completely false. :)
 
Comparing ZFS with 'traditional' RAID setups is not always valid. When we talk about RAID5 vs. RAIDZ, they only look similar; the term RAID5 is used with regard to ZFS only because the language is familiar.
The thing is that ZFS always stores checksums with data, while traditional RAID does not.
For example, in a traditional RAID mirror (RAID1), if you have a disk fail, the RAID will happily rebuild the array as long as the other disk does not give read errors. Which does not mean it does not read garbage. With ZFS, if you only have a two-disk mirror, you can at least know which data is bad. If you have a mirror of three or more disks, chances are the copy on the third (etc.) disk will be valid and you will have reliable storage. I remember comments that more than a three-way mirror does not give much greater reliability.

The performance of mirror (RAID1) compared to anything else is undisputed.

As for security, I have a colleague who always specifies hardware for new (small) servers with 6 disks and 'hardware raid' in the configuration: RAID1 + 4 spares :)
If they used one of the better operating systems, they could just use 3-way mirrors in a striped config, no spares, and get more out of the disks that spin anyway.

By the way, on your F setup you will have 60% capacity.

Management-wise, the single raidz(2) setup is the worst. If you need to upgrade drives for more capacity, you will need to replace all of them, one by one. I have no idea how long a 9-drive raidz will take to resilver -- I myself gave up on such configs long ago. This will give you 1x the write performance of a single drive.
In this regard the striped mirrors setup is best, as mentioned before. This will give you 5x the write performance of a single drive.
2x 5-drive raidz will let you upgrade drives in batches of 5. This setup gives you 2x the write performance of a single drive.
You may also try 3x 3-drive raidz1 + one spare. Then you can upgrade drives in batches of 3. This setup also gives you 3x the write performance :)

The reason I am pushing you to look at pool management is that, as always happens with storage, you may end up with some 10TB of data on the pool. You will need more space. In the event you need reconfiguration, you have a few choices:
- backup to external storage (tape? other server?), reconfigure the pool, restore;
- build an entirely new server with new disks, copy the data over;
- expand/upgrade the ZFS pool by adding/replacing drives (as sketched below). This is one of the best selling points for ZFS. No need for downtime --- disks grow in capacity all the time.
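A rough sketch of the last option for one mirror vdev, with hypothetical device names; each old disk is replaced and resilvered in turn, and once every disk in the vdev is larger the extra space becomes available (on older ZFS versions after an export/import):

Code:
# swap the 1.5TB disks of one mirror for larger ones, one at a time
zpool replace tank da0 da6     # wait for the resilver to finish
zpool replace tank da1 da7     # wait again

# make the pool pick up the larger vdev size
zpool export tank
zpool import tank
zpool list tank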
 
danbi said:
[raidz2 with 9 disks] This will give you 1x the write performance of a single drive.

Why exactly 1x?

I also wonder why resilvering of a 9-disk raidz2 should take very long and therefore be a showstopper for such a config. I always thought the time of resilvering is mainly dependent on the amount of data stored in the pool. And remember: the pool is working fine during resilvering, so it's not like sitting in front of a foreground fsck and waiting.

This is why I'll never again use gmirror. It has no idea about the amount of data stored on the disks, so rebuilding always takes ages, and at the end you still don't know if the array is okay, because there was no checksumming done at all. If you do this on your pen drives, you'll kill them soon. :)

Knarf
 
I haven't yet killed a pen drive with gmirror, although almost all my configs are such :)

For the record, I managed to kill a pen drive (two actually) with ZFS. No idea how..

Indeed, if you have a 9-drive raidz pool and use, say, 5% of its capacity, the resilver will be reasonably fast. But that then begs the question of why you wasted so many disks. Storage arrays usually get full. With valuable data. And when one of the disks dies, you need to have it resilvered, fast. That is, before any other disk dies.

One annoying thing about ZFS is that it gets slower as you fill up the pool beyond a certain percentage. Maybe someone using raidz more recently can share how this reflects on resilvering performance. It sure does affect scrub.

There are good general and raidz-specific ideas about what ZFS can and cannot do.

In the end, it all depends on the intended usage.
 
danbi said:
Management-wise, the single raidz(2) setup is the worst. If you need to upgrade drives for more capacity, you will need to replace all of them, one by one.

Except it would of course be plain silly to do an upgrade in that way.
Create a new pool of higher-capacity drives and 'zfs send' your filesystems over.
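A minimal sketch of that approach; 'newtank' and the dataset names are made up, and -R (replicate all descendant filesystems and snapshots in one stream) is optional:

Code:
# snapshot the old pool, then replicate it to the new pool
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F newtank

# or per filesystem:
zfs send tank/data@migrate | zfs receive newtank/data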
 
danbi said:
Indeed, if you have a 9-drive raidz pool and use, say, 5% of its capacity, the resilver will be reasonably fast. But that then begs the question of why you wasted so many disks. Storage arrays usually get full. With valuable data. And when one of the disks dies, you need to have it resilvered, fast. That is, before any other disk dies.

One annoying thing about ZFS is that it gets slower as you fill up the pool beyond a certain percentage. Maybe someone using raidz more recently can share how this reflects on resilvering performance. It sure does affect scrub.

Something like this? :)

Code:
# zpool list zdata
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zdata  8.12T  3.98T  4.15T    48%  ONLINE  -
# zpool status zdata
  pool: zdata
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 6h32m with 0 errors on Tue Aug 10 19:53:15 2010
config:

        NAME            STATE     READ WRITE CKSUM
        zdata           ONLINE       0     0     0
          raidz2        ONLINE       0     0     0
            da0p1.eli   ONLINE       0     0     0
            da1p1.eli   ONLINE       0     0     0
            da2p1.eli   ONLINE       0     0     0  628K repaired
            da3p1.eli   ONLINE       0     0     0  182K repaired
            da4p1.eli   ONLINE       0     0     0
            da5p1.eli   ONLINE       0     0     0
            da6p1.eli   ONLINE       0     0     0
            ada2p1.eli  ONLINE       0     0     0
            ada3p1.eli  ONLINE       0     0     0
        spares
          da7p1.eli     AVAIL   

errors: No known data errors

Ok, using geli is the bottleneck here, even on an i7. This is 8.1-R with (still) zfs v13.

Oh, and next time I'll use gpt labels. Do you know how to change that without touching the data?
 
knarf said:
Why exactly 1x?

Because of the way raidz works algorithmically, you are limited to the write IOps of a single drive. In the most simplistic terms: every write to a raidz vdev touches every drive in the vdev (in actuality, it depends on the size of the write). The wider (the more drives in) a raidz vdev, the slower things get.

Regardless of the RAID technology used, a single RAID array should be narrow, and a RAID system should be made up of multiple RAID arrays. RAID10, RAID50, RAID60, etc. IOW, lots of small (4-6 disk) raidz vdevs in a single pool.

I also wonder why resilvering of a 9-disk raidz2 should take very long and therefore be a showstopper for such a config.

Since every write to a raidz vdev touches every disk, rebuilding a disk requires reading from every other disk. If the raidz vdev is over 50% full, or if a lot of snapshots have been created and destroyed, then the data will be fragmented, leading to a lot of disk thrashing. Resilvering a disk in a 9-disk raidz vdev won't be too bad. But try resilvering a disk in a 24-disk raidz vdev to see just how bad it can be. :) It's actually impossible to do, as we found out.
 
danbi said:
One annoying thing about ZFS is that it gets slower as you fill up the pool beyond a certain percentage. Maybe someone using raidz more recently can share how this reflects on resilvering performance. It sure does affect scrub.

ZFS is copy-on-write at the block level (not the file level). Which means, if you write out a 100 MB file, then change 10 MB of it, those 10 MB of new data will be written to new areas of the disk and the file will no longer be contiguous on disk (i.e. fragmented). Now create a bunch of snapshots, change a bunch of data, create more snapshots, change data, destroy snapshots, etc. You now have a very fragmented disk layout where blocks belonging to a single file will be spread over the entire disk. ZFS tries to write new data out contiguously in free areas of the disk using a "best-fit" model. However, when the pool nears 80% full, it gets harder and harder to find large, contiguous chunks of free disk space to write data into. As it nears 90%+, it gets even harder.

Which means, resilver/scrub has to thrash the disk heads to read data, as scrub/resilver reads data from oldest to newest.
 
jalla said:
Except it would of course be plain silly to do an upgrade in that way.
Create a new pool of higher-capacity drives and 'zfs send' your filesystems over.

Actually, it's one way recommended by pretty much everyone on the zfs-discuss mailing lists.

There are 2 ways to increase the size of a storage pool without resorting to the nuke-and-pave method you listed:
  1. Add more vdevs to the pool, thus increasing the total size of the pool, and also increasing the available IOps, or
  2. Replace all the drives in a single vdev, then export/import the pool. This increases the total size of the vdev, thus increasing the total size of the pool.

For those using mirror vdevs, it's even possible to do this without losing any redundancy:
  • zpool create storage mirror da0 da1 mirror da2 da3
  • zpool attach storage da0 da4 (this creates a 3-way mirror out of da0 da1 da4)
  • zpool detach storage da0 (this drops it back to a 2-way mirror)
  • zpool attach storage da1 da5 (this creates a 3-way mirror out of da1 da4 da5)
  • zpool detach storage da1 (you now have a 2-way mirror of larger disks)
 
phoenix said:
Actually, it's one way recommended by pretty much everyone on the zfs-discuss mailing lists.

Yeah well, I guess my line of thinking is coloured by my experience with other storage systems. Where I come from, upgrades and migration between systems are almost exclusively done by snapshot mirroring. But that's systems with more than a handful of disks, though; I'd rather not move 50 or 100 disks by failing them one after another, thank you :)
 
It's also a matter of convenience and cost. It's a lot less expensive to buy X new hard drives than it is to buy an entire new enclosure, especially if you don't have any more rack space available for a temp system. :)
 
After seeing what can happen once the zpool reaches maximum capacity, I wonder...
Is there any way of defragging a ZFS pool?
 