[Solved] Pool with 15 disks

Hi,
I have a physical server with 15 SATA disks, 500 GB each, and one 60 GB SSD.

I need to deploy a mail server with many users (> 15000) and about 3 TB of data.
I will use FreeBSD 11.1-RELEASE, dovecot, postfix, mysql backend.

I would like to withstand a 2-disk failure.
What is the best way to deploy the storage?

Is it better to use the SSD only for caching (read or write)?
I read that it is better to use 9 disks max. How would you arrange 15 disks?
Thank you very much for your suggestion
 
Is it better to use the SSD only for caching (read or write)?
That will depend on the usage. Adding L2ARC or ZIL isn't guaranteed to improve things. In my case, my home 11TB RAID-Z used around 1% of the L2ARC capacity and I still got a cache miss rate of about 99%, which means it was totally useless and probably even affected performance negatively.

I read that it is better to use 9 disks max. How would you arrange 15 disks?
Split it up into two RAID-Z2 vdevs with roughly half the disks in each, say 8 and 7; those two vdevs can be combined into one large pool.
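For concreteness, a hedged sketch of what that pool creation could look like (the pool name and the da0..da14 device names are placeholders; adjust them to your system):
Code:
# One pool built from an 8-disk raidz2 vdev and a 7-disk raidz2 vdev
zpool create mail \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14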
 
Do you want to be able to withstand a 2-disk failure in a single vdev (meaning you'd use raidz2)? Or do you want to be able to withstand the loss of any 2 disks in the pool (meaning you'd use raidz1 or mirror)?

There are a few options, depending on whether you need more IOPS or more storage space (fastest option at the top, going down to slowest):
  1. Create 7x mirror vdevs in a single pool, giving you 3.5 TB of very fast storage and 1 spare disk. Use the SSD for the OS. (A sketch of this layout is at the end of this post.)
  2. Create 5x mirror vdevs using 3 disks each, giving you 2.5 TB of very fast and resilient storage. Use the SSD for the OS.
  3. Create 3x raidz1 vdevs using 5 disks in each, giving you 6.0 TB of fairly fast storage. Use the SSD for the OS, and possibly as a cache.
  4. Create 2x raidz2 vdevs using 6 disks in each, giving you 4.0 TB of fast-ish storage. Use another 2 disks to create a separate pool for the OS. Use the SSD as a cache vdev. That leaves you with 1 disk as a spare.
  5. Create 2x raidz2 vdevs using 7 disks in each, giving you 5.0 TB of storage, and 1 spare disk. Use the SSD for the OS and as a cache.
  6. Create 2x raidz2 vdevs using 7 disks in one and 8 disks in the other, giving you 5.5 TB of unbalanced storage. Use the SSD for the OS and as a cache.
  7. Create 1x raidz3 vdev using 13 disks, giving you 5.0 TB of slow storage. Use the other two for a separate pool for the OS. Use the SSD for cache.
  8. Create 1x raidz3 vdev using all 15 disks, giving you 6.0 TB of very slow storage. Use the SSD for the OS and as a cache.
Options 7 and 8 allow you to lose any 3 disks without losing the pool, but are the slowest options.

Options 2, 4, 5, and 6 allow you to lose 2 disks per vdev without losing the pool.

Options 1 and 3 only allow you to lose 1 disk per vdev without losing the pool, but these are some of the fastest options.

Personally, I like 6-disk raidz2 vdevs as that gives you a nice balance between storage space, speed, and redundancy. But, I deal with 24- and 45-bay storage chassis where you can get a lot of 6-disk vdevs. :)

For 500 GB drives, raidz1 should be okay, and the rebuild time won't be that long and shouldn't stress the other drives too much (which would lead to another drive failing before the first is finished, causing the pool to die). For anything over 1 TB, you definitely want to use raidz2. If you plan on replacing the drives with larger ones in the future, you should use raidz2 from the beginning.

I wouldn't recommend a giant raidz3 vdev, especially for a mail server dealing with lots of small files. But I listed it to show you the options. :D
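For example, option 1 could be laid out roughly like this (device names are placeholders; it creates seven 2-way mirrors plus a hot spare):
Code:
zpool create mail \
    mirror da0 da1   mirror da2 da3   mirror da4 da5   mirror da6 da7 \
    mirror da8 da9   mirror da10 da11 mirror da12 da13 \
    spare da14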
 
Split it up into two RAID-Z2 vdevs with roughly half the disks in each, say 8 and 7; those two vdevs can be combined into one large pool.

Hi, did you mean:
1 vdev with 8 disks (RAID-Z2) = 3.6 TB +
1 vdev with 7 disks (RAID-Z2) = 3 TB

Can I suffer 2 hard disk failures in EACH vdev?
Can I create a single mount point (let's say, /var/spool/mail) with a total size of 6.6 TB?

Thank you!
 
Hi, did you mean:
1 vdev with 8 disks (RAID-Z2) = 3.6 TB +
1 vdev with 7 disks (RAID-Z2) = 3 TB

Correct (although your math was wrong). A single pool with 2 vdevs. Data/access is striped across both vdevs to improve performance. Because it's an "unbalanced" pool, meaning one vdev is larger than the other, initial performance will be "poor" as data is written to the vdev with the most free space first. Once the free space in both is roughly equal, then all new writes will go to both.

Can I suffer 2 hard disk failures in EACH vdev?
Correct.

Can I create a single mount point (let's say, /var/spool/mail) with a total size of 6.6 TB?

Correct (although your math is wrong). You can create as many (or as few) datasets (aka filesystems) as you like in that pool, mount them wherever you want, and they will all share the same pool of storage.

A raidz2 vdev uses 2 disks for parity, so an 8-disk raidz2 vdev would only have 6 data disks, for 3.0 TB of usable space.

Similarly, a 7-disk raidz2 vdev would only have 5 data disks, for 2.5 TB of usable space.
 
Do you want to be able to withstand a 2-disk failure in a single vdev (meaning you'd use raidz2)? Or do you want to be able to withstand the loss of any 2 disks in the pool (meaning you'd use raidz1 or mirror)?

Hi, thank you for your very detailed reply.
I want to be able to withstand the loss of any 2 disks in the pool. I mean: with Linux, I would choose RAID 6.
I have very old disks, so I guess I will need to replace them sooner or later.
In that case, will I be able to use larger hard disks?

I mean: I have 1 vdev with 3 disks x 500 GB.
One day, one disk dies. Can I replace it with a 1 TB disk?


Most important: how can I be aware of any disk issue? Can ZFS send an email in case of a failure?
Thank you again!

Regards
 
I want to be able to withstand the loss of any 2 disks in the pool. I mean: with Linux, I would choose RAID 6.
RAID-Z2 is fairly similar to RAID6. There are some minor differences though.

In that case, will I be able to use larger hard disks?
You can build a RAID-Z from 1, 2 and 3 TB disks, for example. The pool will be based on the smallest one, so you get 3x1TB (effectively 2TB usable).

I mean: I have 1 vdev with 3 disks x 500 GB.
One day, one disk dies. Can I replace it with a 1 TB disk?
Yes, that's not a problem. When eventually all disks have been replaced with 1TB ones you can enlarge the pool on-the-fly. My home server started as a 4x1TB RAIDZ; it's now 4x3TB. I replaced the disks one by one: replace one disk, resilver, replace the next, resilver, etc. Once all the disks have been replaced you can turn on the autoexpand property.
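A rough sketch of that cycle, with placeholder pool and device names:
Code:
zpool replace tank ada1 ada5    # swap one old disk for a new, larger one
zpool status tank               # wait for the resilver to finish
# ...repeat for each remaining disk, one at a time...
zpool set autoexpand=on tank    # let the pool grow into the extra capacity
zpool online -e tank ada5       # or expand a replaced disk explicitly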
 
Most important: how can I be aware of any disk issue? Can ZFS send an email in case of a failure?
Thank you again!

If you edit /etc/periodic.conf with the following you will get an e-mail everyday with the status of the pool:
Code:
# ZFS
daily_status_zfs_enable="YES"
daily_status_zfs_stats_enable="YES"

The first line will include the output of zpool status while the second line will provide you with all kinds of stats about the pool, the ARC, the L2ARC, etc.

If you want to be alerted sooner, you can whip up a script to run via cron every 15 minutes or so that looks similar to this:
Code:
#!/bin/sh

send=0
host=$( hostname | cut -d . -f 1 )
emailto="someemail@somehost.com"
msgsubj="Filesystem issues on ${host}"
gmirrormsg=""
spoolmsg=""
zpoolmsg=""
syspool="somepoolname"
storpool="someothername"


# Check zpool status
## Storage pool
pstatus=$( zpool list -H -o health ${storpool} )
if [ "${pstatus}" != "ONLINE" ]; then
        zpoolmsg="Problems with ZFS $( zpool status -x ${storpool} )"
        send=1
fi

## System pool
sstatus=$( zpool list -H -o health ${syspool} )
if [ "${sstatus}" != "ONLINE" ]; then
        spoolmsg="Problems with ZFS $( zpool status -x ${syspool} )"
        send=1
fi

# Check gmirror status
#if $( gmirror status | grep DEGRADED > /dev/null ); then
#        status=$( gmirror status )
#        gmirrormsg="Problems with gmirror: ${status}"
#        send=1
#fi

# Send status e-mail if needed
if [ "${send}" -eq 1 ]; then
        echo "${spoolmsg} ${zpoolmsg} ${gmirrormsg}" | mail -s "${msgsubj}" ${emailto}
fi

exit 0
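Assuming the script is saved as something like /usr/local/sbin/zpoolcheck.sh (a made-up path) and made executable, an /etc/crontab entry along these lines would run it every 15 minutes:
Code:
# minute hour mday month wday who  command
*/15     *    *    *     *    root /usr/local/sbin/zpoolcheck.sh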

There are also messages emitted by devd(8) that you can search for in /var/log/messages, and you can even edit /etc/devd.conf to have devd run scripts when certain messages are received. PC-BSD included a bunch of those by default, and I think some of them were included in FreeBSD 11.x (they're not in 10.x).
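As a rough illustration of the devd.conf(5) mechanism only (the exact ZFS event names differ between releases, so check /etc/devd/zfs.conf and devd.conf(5) on your system before relying on this):
Code:
# Log a notice when devd sees a ZFS vdev state-change event
notify 10 {
    match "system" "ZFS";
    match "type"   "resource.fs.zfs.statechange";
    action "logger -p kern.notice 'ZFS vdev state change detected'";
};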
 
FreeBSD 11.0 and higher have zfsd(8):
Code:
     zfsd attempts to resolve ZFS faults that the kernel can't resolve by
     itself.  It listens to devctl(4) events, which are how the kernel
     notifies userland of events such as I/O errors and disk removals.  zfsd
     attempts to resolve these faults by activating or deactivating hot spares
     and onlining offline vdevs.
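If you want to use it, it's enabled like any other rc(8) service:
Code:
sysrc zfsd_enable="YES"
service zfsd start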
 
Hi,
if my server supports hot-swap, can I assume that I can replace disks "on the fly" with ZFS? I mean: without rebooting the server?
Will ZFS be able to recognize the new disk?
 
if my server supports hot-swap, can I assume that I can replace disks "on the fly" with ZFS? I mean: without rebooting the server?
Yes. I've replaced numerous drives while the system itself kept running.
Will ZFS be able to recognize the new disk?
It depends on how you've set it up. The new disks were detected immediately, but I did have to specifically use zpool replace to put them into the existing pool.
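For example (the pool name is a placeholder, and da7 stands in for whatever device name the new disk gets):
Code:
camcontrol devlist        # confirm the freshly inserted disk was detected
zpool replace tank da7    # resilver onto the new disk in the same slot
zpool status tank         # watch the rebuild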
 
  1. Create 3x raidz1 vdevs using 5 disks in each, giving you 6.0 TB of fairly fast storage. Use the SSD for the OS, and possibly as a cache.
Hi phoenix, I will use your solution above. For the SSD, should I use two separate partitions, one for the OS and one for the cache? What size do you recommend for the cache?
Thank you very much for your patience.
 
Hi phoenix, I will use your solution above. For the SSD, should I use two separate partitions, one for the OS and one for the cache? What size do you recommend for the cache?
Thank you very much for your patience.

Yes, create separate partitions on the SSD for the OS, for swap, and for the cache. Look at other systems to see what you need for the OS install and swap, and use the rest for the cache. You can probably use something like 32 GB for the OS, 4 GB for swap, and the rest for the cache.
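A possible layout for the 60 GB SSD, purely as a sketch (ada0 and the GPT labels are placeholders, and boot partitions/bootcode are left out for brevity):
Code:
gpart create -s gpt ada0
gpart add -t freebsd-zfs  -s 32g -l ssd-os    ada0   # OS install target
gpart add -t freebsd-swap -s 4g  -l ssd-swap  ada0   # swap
gpart add -t freebsd-zfs         -l ssd-cache ada0   # remainder for L2ARC
zpool add mail cache gpt/ssd-cache                   # attach the cache to the data pool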
 
There are a few options, depending on whether you need more IOPS or more storage space (fastest option at the top, going down to slowest):
  1. Create 2x raidz2 vdevs using 6 disks in each, giving you 4.0 TB of fast-ish storage. Use another 2 disks to create a separate pool for the OS. Use the SSD as a cache vdev. That leaves you with 1 disk as a spare.
  2. Create 2x raidz2 vdevs using 7 disks in each, giving you 5.0 TB of storage, and 1 spare disk. Use the SSD for the OS and as a cache.
..
..


For 500 GB drives, raidz1 should be okay, and the rebuild time won't be that long and shouldn't stress the other drives too much (which would lead to another drive failing before the first is finished, causing the pool to die). For anything over 1 TB, you definitely want to use raidz2. If you plan on replacing the drives with larger ones in the future, you should use raidz2 from the beginning.

Sorry Phoenix, according to your last quote, I think I will not be able to find new 500 GB disks, so it is better to start thinking about disks > 1 TB.
So I think I will use 2x raidz2 vdevs with 6 disks each, or 2x raidz2 vdevs with 7 disks each, as you suggested.
I'd prefer the last solution because it's bigger :)
One doubt: suppose one day I lose the SSD.
After replacing it and reinstalling the OS, will I be able to mount the ZFS volume?
Do I need to back up any metadata to avoid problems?

Thank you again.
Regards
 
All the ZFS metadata is stored on the drives themselves. If the SSD dies, you just need to replace it, reinstall the OS, and run a manual "zpool import <poolname>". That will query the drives for the ZFS metadata and import the pool. It will be marked as DEGRADED, as the cache vdev will be missing, but that's easy to fix. Just "zpool remove" the cache, and "zpool add <poolname> cache..." to add a new one.
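In command form, assuming the pool is called mail and the cache partition is labelled gpt/ssd-cache, that recovery would look something like:
Code:
zpool import mail                    # reads the ZFS metadata from the data disks
zpool remove mail gpt/ssd-cache      # drop the now-missing cache device
zpool add mail cache gpt/ssd-cache   # re-add the cache on the new SSD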
 
For 500 GB drives, raidz1 should be okay, and the rebuild time won't be that long and shouldn't stress the other drives too much (which would lead to another drive failing before the first is finished, causing the pool to die). For anything over 1 TB, you definitely want to use raidz2. If you plan on replacing the drives with larger ones in the future, you should use raidz2 from the beginning.

Hi phoenix, let's say in the future I would like to replace the 500 GB disks with new 1 TB disks.
Should I first replace the disks one by one in only ONE vdev, and then grow it?
I will end up with two vdevs of different capacity; I read that this is not optimal, right?
Or should I replace ALL the disks, and only when I have 15 x 1 TB disks, grow both vdevs at the same time?
Thank you very much
 
Hi phoenix, let's say in the future I would like to replace the 500 GB disks with new 1 TB disks.
Should I first replace the disks one by one in only ONE vdev, and then grow it?

For best performance of the pool during the replace, and for best redundancy, yes. Replace 1 drive in 1 vdev at a time.

I will end up with two vdevs of different capacity; I read that this is not optimal, right?

It's not 100% optimal, but it shouldn't affect performance too much, especially if the wait until the other vdev is upgraded isn't too long (days vs weeks vs months). The disk allocator in ZFS will favour the vdev with the most free space, so more writes will go to the larger vdev. But it will still use both vdevs.
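If you want to watch how full and how busy each vdev is while the pool is unbalanced (pool name is a placeholder):
Code:
zpool list -v mail        # capacity and free space per vdev
zpool iostat -v mail 5    # ongoing I/O distribution across the vdevs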

Or should I replace ALL the disks, and only when I have 15 x 1 TB disks, grow both vdevs at the same time?

This is a policy issue that you will need to decide. :) Do you want 100% optimal performance at all times with perfectly balanced vdevs? Or are you okay with a few days of 95% optimal performance (for example)? It also depends on how long you wait to start replacing drives. If the pool is almost full, you may not be able to wait for all 15 drives to be replaced. It's up to you, and your workload.
 
I have to say something about performance here, which has been overlooked in this conversation, in favor of volume.

Serving > 15000 users from 2x raidz2 vdevs is, performance-wise, like hooking 15000 users up to two hard drives (about 300 IOPS). With ZFS and this fixed set of disks you won't get a do-over: once you've created those vdevs they are set in stone, and it's not good to mix different types of vdevs in the same pool. If I were you, I would treat this as a high-performance workload, like virtualization or OLTP, and configure the disks for maximum performance, since you really can't change the configuration later. My choice would be option 1 from phoenix, which gives as much performance as possible (about 1050 IOPS), and then resilver in bigger drives as volume needs crop up over time. I would also get at least one SLOG to speed up synchronous writes, if the need arises. If you go into production and notice writes aren't all that fast, try setting sync=disabled just for a while to see if performance is boosted, then set it back to standard. If it is, buy a really fast SSD (or two, and mirror them) and add a small piece of it/them as "log" device(s), as sketched below.
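A hedged sketch of that test and the follow-up, with placeholder pool, dataset, and partition names:
Code:
zfs set sync=disabled mail/spool    # temporarily skip sync writes to gauge the impact
# ...run the mail workload for a while and measure...
zfs set sync=standard mail/spool    # put the safety back

# If the test showed a big win, add a small mirrored log on fast SSDs
zpool add mail log mirror gpt/slog0 gpt/slog1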

Just my thoughts.

/Sebulon
 