Best practices for very large (32TB) fileserver?

Background: I've never used ZFS and I am much more comfortable with hardware-based RAID, based on years of practice.

I'm setting up some new fileservers. The primary use of these will be as Samba servers for backups of a few dozen systems (PCs with ShadowProtect and other FreeBSD boxes with rdump). That data will be pretty much write-only. I'll also be using them as general-purpose fileservers, so this part will be mostly reads. All of these systems are connected via Gigabit Ethernet.

I'll be using these to replace my existing units, which have 16 400GB drives on 2 8-port 3-Ware controllers (each controller has 7 drives in a RAID 5 set and a hot spare). I currently have 3 of these systems: a production server here, an off-site replication server connected via Gigabit Ethernet and using rdiff-backup for synchronization, and a hot spare here. Details here if anyone is interested.

The new systems will have 16 2TB drives on a single 3Ware 9650SE-16ML controller (with battery backup), 2 E5520 CPUs, and 48GB of RAM. There will also be an Ultra 320 SCSI card connecting the system to a 16-slot DLT-S4 autoloader (12.8TB uncompressed). They'll run FreeBSD 8.x, tracking 8-STABLE.

Obviously, I'll need to use a filesystem other than UFS, since chopping the array up into 2TB-sized chunks is impractical. It seems the answer for this is ZFS. But there are a huge number of configuration choices. Given my comfort level with the 3Ware products, my instinctive solution is to set up 15 of the drives as a single RAID 6 logical unit with the 16th drive being a hot spare, and then allocate something like 4 8TB partitions and format them with ZFS. I like the ability to do RAID rebuilds at the controller level - this is something that has been solid for many years. ZFS on FreeBSD is much newer with less of a track record.

However, ZFS apparently has knowledge of multiple spindles and may be able to do a better job of optimizing performance if I simply export all 16 drives individually and use ZFS to manage them.

The current fileservers have a smallish (60GB) slice which contains FreeBSD. I can continue to do this with the new servers, or I can install separate storage for the OS. This could be a pair of 2.5" SATA drives (cabled to the motherboard controller) in RAID 1, or a compact flash card (though I'd be concerned about card lifetime with the amount of writing being done for log files, etc.). I can stick with UFS2 for the system partitions or use ZFS. Given that ZFS boot support is quite new, and from what I've read the 8.0 distribution disk was cut before this feature was added, it seems that I should stay with UFS2 for the FreeBSD partitions.

The idea is to get the fastest I/O possible in day-to-day use. Other factors are the performance of the weekly tape backup job (a 2TB UFS2 filesystem takes a while to snapshot) and the nightly rdiff-backup job. I don't have to use rdiff-backup; however it has been doing a very good job. Fast, simple file access on the backup server is a must (so no special backup container formats).

Of course, I can try different configurations and benchmark them to try to find out what the best solution is, but I'd appreciate any advice from users here as to things to try or things to avoid.
 
@Terry_Kennedy

They'll run FreeBSD 8.x, tracking 8-STABLE.
8.1 will be released around June 2010, by the way.

Given my comfort level with the 3Ware products, my instinctive solution is to set up 15 of the drives as a single RAID 6 logical unit with the 16th drive being a hot spare, and then allocate something like 4 8TB partitions and format them with ZFS. I like the ability to do RAID rebuilds at the controller level - this is something that has been solid for many years. ZFS on FreeBSD is much newer with less of a track record.
Bad idea. You will lose a lot of performance; it's a lot better to create a RAID50/RAID60 setup (RAID0 across RAID5/RAID6 arrays) for much better performance, with ZFS on top of course.

The current fileservers have a smallish (60GB) slice which contains FreeBSD. I can continue to do this with the new servers, or I can install separate storage for the OS. This could be a pair of 2.5" SATA drives (cabled to the motherboard controller) in RAID 1, or a compact flash card (though I'd be concerned about card lifetime with the amount of writing being done for log files, etc.).
This seems like the ideal 'problem' for my 'solution'; I have written a HOWTO for a similar system here:
http://forums.freebsd.org/showthread.php?t=10334

In short, 512MB for a read-only / (root) filesystem, so you can even get 4 x pendrives and put RAID1 on top of them (I also use a mirror for / in this setup) for even better redundancy.
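
For what it's worth, a minimal sketch of that pendrive mirror, assuming the four sticks show up as da1 through da4 and using "root" as the mirror name (both are assumptions; the HOWTO above has the full procedure, including bootcode):
Code:
# gmirror load
# gmirror label -v -b round-robin root da1 da2 da3 da4   # da1-da4 = assumed pendrive device names
# newfs /dev/mirror/root
# echo 'geom_mirror_load="YES"' >> /boot/loader.conf
The mirror then gets an ordinary /dev/mirror/root entry in /etc/fstab, mounted read-only.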

All the rest (including /usr, /var, and so on) is on ZFS.

/tmp is mounted on SWAP in my HOWTO, but it can be 'moved' to ZFS without a problem (just create another dataset).
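
For example, a /tmp dataset could look roughly like this (the pool name "storage" is just an assumption):
Code:
# zfs create -o mountpoint=/tmp -o setuid=off storage/tmp   # "storage" = assumed pool name
# chmod 1777 /tmp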

The idea is to get the fastest I/O possible in day-to-day use.
As you have 16 drives, I would do RAID0 on top of 4 x RAID5 (each with 4 drives), so 12/16 of the total space is usable, or RAID0 on top of 3 x RAID5/RAID6 volumes of 5 drives each, with 1 drive left as a HOT SPARE. That gives 12/16 of the space for the RAID50 variant and 9/16 with RAID60.
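
Done with ZFS raidz instead of controller RAID5, the 3 x 5-drive + hot spare variant would look something like this sketch, assuming the 16 drives are exported individually and appear as da0 through da15 (device and pool names are assumptions; use raidz2 instead of raidz for the RAID60-style variant):
Code:
# zpool create storage \
    raidz da0  da1  da2  da3  da4  \
    raidz da5  da6  da7  da8  da9  \
    raidz da10 da11 da12 da13 da14 \
    spare da15                       # da0-da15 and "storage" are assumed names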
 
Terry_Kennedy said:
Given my comfort level with the 3Ware products, my instinctive solution is to set up 15 of the drives as a single RAID 6 logical unit with the 16th drive being a hot spare,

Even if you were to stick with hardware RAID, creating a single array across all 16 drives would be foolhardy and would only lead to performance issues. RAID 5/6 arrays should be narrow (4-8 disks) and then joined together via RAID0 (RAID50/RAID60, etc.). Remember: every write to a RAID 5/6 array touches every disk, and doing a read-modify-write cycle across 16 disks would take ages.

However, ZFS apparently has knowledge of multiple spindles and may be able to do a better job of optimizing performance if I simply export all 16 drives individually and use ZFS to manage them.

Yes, this is the way to do it. Create 16 "Single Disk" arrays. Do not use JBOD. JBOD disables all the fancy management features of the card, disables the BBU, disables the cache, basically turning it into a generic SATA controller. "Single Disk" allows you to use 3dm2 to manage the card, use the onboard cache, use the BBU, etc.

Configure the individual disks to use the Performance profile, enable Write Cache, enable Queueing, disable Auto-Verify (ZFS scrub will handle that).

In 3dm2, set the Controller Settings to allow for Fastest I/O. And disable the schedules for auto-rebuild and auto-verify.
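
If you prefer the command line over 3dm2, roughly the same setup can be scripted with tw_cli. A sketch, assuming the controller is /c0 and that the option names match your firmware (check the tw_cli documentation for your card):
Code:
# tw_cli /c0 add type=single disk=0        # /c0 = assumed controller ID
# tw_cli /c0/u0 set cache=on
# tw_cli /c0/u0 set qpolicy=on
# tw_cli /c0/u0 set autoverify=off
# tw_cli /c0/u0 set storsave=perform
Repeat for the remaining ports/units (disk=1 through disk=15, units u1 through u15).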

Then, in FreeBSD, use ZFS to create multiple raidz2 vdevs of around 6 disks each, and to join them all into a single storage pool. ZFS will then stripe writes across the vdevs. (Think of vdevs as RAID arrays, and the pool as a RAID0 stripeset across all the arrays.) Since ZFS does Copy-on-Write, it doesn't have the "RAID5 write hole"; it just writes out new data all the time, never modifying any existing data on disk.
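
A rough sketch of that with 16 drives, here as two 8-disk raidz2 vdevs (device and pool names are assumptions, and the single-disk units may appear as daX rather than adX):
Code:
# zpool create storage \
    raidz2 da0 da1 da2  da3  da4  da5  da6  da7  \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15  # assumed device/pool names
# zpool status storage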

The current fileservers have a smallish (60GB) slice which contains FreeBSD. I can continue to do this with the new servers, or I can install separate storage for the OS. This could be a pair of 2.5" SATA drives (cabled to the motherboard controller) in RAID 1, or a compact flash card (though I'd be concerned about card lifetime with the amount of writing being done for log files, etc.).

My recommendation is to use separate storage for the OS, whether that's USB sticks, CompactFlash, 2.5" drives, 3.5" drives, whatever. Use gmirror(8) for them. Leave / and /usr on there. But put /usr/local, /usr/ports, /usr/src, /usr/obj, /var, /tmp, and /home onto ZFS filesystems. Basically, the OS storage is just to boot the system and for trouble-shooting. Everything else runs off ZFS.
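
As a rough sketch of carving those out as datasets (the pool name "storage" and the dataset names are assumptions; extend the list to taste):
Code:
# zfs create -o mountpoint=/usr/local storage/usr-local   # "storage" = assumed pool name
# zfs create -o mountpoint=/var storage/var
# zfs create -o mountpoint=/home storage/home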

I can stick with UFS2 for the system partitions or use ZFS. Given that ZFS boot support is quite new, and from what I've read the 8.0 distribution disk was cut before this feature was added, it seems that I should stay with UFS2 for the FreeBSD partitions.

Personally, I prefer to keep the OS on UFS, as it makes troubleshooting a lot simpler. No need for rescue CDs and whatnot. All the tools needed to manage ZFS are under /bin and /usr/bin.

The idea is to get the fastest I/O possible in day-to-day use. Other factors are the performance of the weekly tape backup job (a 2TB UFS2 filesystem takes a while to snapshot) and the nightly rdiff-backup job. I don't have to use rdiff-backup; however it has been doing a very good job. Fast, simple file access on the backup server is a must (so no special backup container formats).

If you put two of these boxes together at the same time, you can use the built-in send/recv features in ZFS. ZFS snapshot creation is virtually instantaneous. And you can send snapshots from one ZFS pool to another, basically cloning them.
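
A sketch of what that replication might look like, with a full send followed by a nightly incremental (the hostname "backup2", the dataset name, and the snapshot names are assumptions):
Code:
# zfs snapshot storage/backups@2010-06-01                                   # assumed dataset/snapshot names
# zfs send storage/backups@2010-06-01 | ssh backup2 zfs recv -d storage     # "backup2" = assumed hostname
# zfs snapshot storage/backups@2010-06-02
# zfs send -i 2010-06-01 storage/backups@2010-06-02 | ssh backup2 zfs recv -d storage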

Alternatively, if you don't mind running -CURRENT, there's the new High-Availability Storage (HAST) feature. This does block-level (ie, below the ZFS pool level) mirroring between 2 remote hosts. We're playing around with this right now, with the hope to replace our current rsync-based sync process for our 2 backup servers.
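
For reference, a HAST resource definition is roughly this shape, with one resource per underlying disk (hostnames, addresses, and device paths are assumptions; see hast.conf(5) and hastctl(8) for the details):
Code:
# /etc/hast.conf - hostnames, addresses, and devices below are assumptions
resource disk01 {
        on filer-a {
                local /dev/label/disk01
                remote 10.0.0.2
        }
        on filer-b {
                local /dev/label/disk01
                remote 10.0.0.1
        }
}
After hastctl create disk01 and hastctl role primary disk01 on the active node, the mirrored provider shows up as /dev/hast/disk01, which is what you would hand to ZFS.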

You should also consider removing the tape-backup setup completely. ZFS snapshots allow you to keep multiple "backups" online simultaneously, and having it replicated to an off-site server completely obviates the need for tape.

Before creating the ZFS pool, use glabel(8) to label the disks. Then use the labels to create the ZFS vdevs. That way, you don't have to worry about the device nodes being renumbered if you boot with a missing drive, and your pool getting confused. :)

For example:
Code:
# glabel label disk01 da0
# glabel label disk02 da1
# glabel label disk03 da2
# zpool create storage raidz1 label/disk01 label/disk02 label/disk03
That will label 3 drives, create a raidz1 vdev using the label devices, and create a ZFS pool called "storage". Using labels, you can re-arrange the drives any way you want in the server, and the pool will "just work".
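
Replacing a failed, labelled drive is then straightforward. A sketch, assuming label/disk02 died and its replacement shows up as da1 (device name is an assumption):
Code:
# glabel label disk02 da1          # da1 = assumed name of the replacement drive
# zpool replace storage label/disk02
Since the new disk carries the same label, zpool replace just resilvers it into the old disk's slot.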
 
Matty said:
I thought ZFS uses the guid to identify the disks. So switching connectors wouldn't be a problem.
It does, but that won't help you when you have a dead/dying drive you have to replace. If you have a 10+ drive system, with each drive changing the underlying device name several times over the lifetime of the system, how are you going to know what physical drive in the machine you are supposed to replace?

guid=2203261993905846015 isn't helpful
gpt/shelf2-disk5 is
 
It also requires an export/import of the pool in order for ZFS to rescan the labels and sort things out internally. Using labels (whether GEOM labels, GPT labels, UFS labels, doesn't matter) sets everything up at a layer below ZFS, so ZFS "Just Works". :)
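
If you prefer GPT labels like the gpt/shelf2-disk5 example above, a sketch (the da15 device name is an assumption, and depending on your gpart version you may need to give -b/-s explicitly):
Code:
# gpart create -s gpt da15         # da15 = assumed device name
# gpart add -t freebsd-zfs -l shelf2-disk5 da15
The labelled partition then shows up as /dev/gpt/shelf2-disk5, ready for the zpool.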
 
Terry_Kennedy said:
I'll need to use a filesystem other than UFS, since chopping the array up into 2TB-sized chunks is impractical.

Your task is well suited to ZFS, but I want to point out that the maximum UFS2 volume size is 8 ZiB.
 
Jago said:
It does, but that won't help you when you have a dead/dying drive you have to replace. If you have a 10+ drive system, with each drive changing the underlying device name several times over the lifetime of the system, how are you going to know what physical drive in the machine you are supposed to replace?

guid=2203261993905846015 isn't helpful
gpt/shelf2-disk5 is

Good point. And how do you know which disk is shelf2-disk5 in FreeBSD when you start to label? At that point the disk is just /dev/ad20.
 
Matty said:
Good point. And how do you know which disk is shelf2-disk5 in FreeBSD when you start to label? At that point the disk is just /dev/ad20.
When you put the drives into the system, you write down the serial numbers. Then inside the OS, you probe each /dev/adXYZ with smartctl and voila, you can tell which /dev/adXYZ represents which physical disk.
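
Something like this, for example (requires sysutils/smartmontools; device names are assumptions):
Code:
# for d in /dev/ad4 /dev/ad6 /dev/ad8 /dev/ad10; do echo -n "${d}: "; smartctl -i ${d} | grep -i 'serial number'; done   # assumed device names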
 
Or, if they are hot-swap capable, you leave them unplugged, boot, then plug them in one at a time and label them as they appear to the OS.

It's not rocket science. :)

You can even use /boot/loader.conf settings to configure which SCSI devices appear in which order, including SATA devices (via ahci(4), ataahci(4), or options ATA_CAM).

However, the simple method is to manually connect your cables in a known order, and to use that order for the device names.
 
Came up with my own method yesterday :p

dd if=/dev/ad{4,6,8,10} of=/dev/null and then just check the HDD activity LED on the swap bay.
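
Or as a loop, one drive at a time (device names are assumptions):
Code:
# for d in ad4 ad6 ad8 ad10; do echo "reading ${d} - watch the LEDs"; dd if=/dev/${d} of=/dev/null bs=1m count=1000; done   # assumed device names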
 