ZFS on raw disks

I'm concerned about using raw disks for a pool due to the future possibility of having to replace a broken disk with a different brand/model than the original disks, thus risking that the replacement has slightly less disk space than the originals.

Is the best mitigation to use disk partitions instead, or is there a more correct way to handle the above situation?
 
If you're dead set against using partitions (I'll note here that you didn't say you were), consider buying a spare or two in advance & then explicitly planning to replace all further failed drives with definitely larger units. Of course, if you buy N spares you'll most likely have N+1 failures before you're ready to "move up". I don't know how to trick Murphy (or Murphy's Wife), though. Good luck.
 
I'm not against partitions. I'm trying to stick to the ZFS best practices, but frankly I don't understand their argument that raw disks are much simpler to set up -- so much simpler that it's worth potentially sacrificing the freedom to switch drive models/vendors.

I'm going to go with partitions... 4k aligned ones at that. :)
 

According to http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
For production systems, use whole disks rather than slices for storage pools for the following reasons:
  • Allows ZFS to enable the disk's write cache for those disks that have write caches. If you are using a RAID array with a non-volatile write cache, then this is less of an issue and slices as vdevs should still gain the benefit of the array's write cache.
  • For JBOD attached storage, having an enabled disk cache allows some synchronous writes to be issued as multiple disk writes followed by a single cache flush, allowing the disk controller to optimize I/O scheduling. Separately, for systems that lack proper support for SATA NCQ or SCSI TCQ, having an enabled write cache allows the host to issue single I/O operations asynchronously from physical I/O.
  • The recovery process of replacing a failed disk is more complex when disks contain both ZFS and UFS file systems on slices.
  • ZFS pools (and underlying disks) that also contain UFS file systems on slices cannot be easily migrated to other systems by using zpool import and export features.
  • In general, maintaining slices increases administration time and cost. Lower your administration costs by simplifying your storage pool configuration model.
Number 1 makes some sense, number 2 sounds like a reiteration of number 1, & the rest are justifications rather than reasons, & I suspect that they don't necessarily even apply to non-point'n'drool (& -Solaris, for that matter) setups.
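
To put the "much simpler" claim in perspective, the difference at pool creation time is basically just the device names (the disk names & pool name below are only placeholders):

  # whole disks, as the guide recommends
  zpool create tank raidz ada0 ada1 ada2 ada3

  # the same pool on labelled partitions -- one extra gpart step per disk
  # up front, but the zpool command itself barely changes
  zpool create tank raidz gpt/disk0 gpt/disk1 gpt/disk2 gpt/disk3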
 
The disk cache issues are not relevant to FreeBSD systems. The way GEOM works, it enables disk caches regardless of the disk partitioning/layout. It's one of the benefits of using FreeBSD over Solaris. So long as you don't disable the wc sysctls, the cache stays enabled. Straight from the horse's mouth, see Pawel's reply about 3 messages in.
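
If you want to double-check that on your own box, something along these lines will do it (which sysctl applies depends on your FreeBSD version and which driver the disks attach through, so treat these as examples):

  # old ata(4) driver: write caching is controlled by the hw.ata.wc tunable
  sysctl hw.ata.wc

  # CAM-attached SATA disks (ada(4)) have their own knob
  sysctl kern.cam.ada.write_cache

See ata(4)/ada(4) for the exact meaning of the values; the point is simply that the cache is on by default, partitions or not.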

If you go the way of using partitions, consider starting the first partition at the 1 MB boundary, so that it is aligned to every power-of-2 block size up to 128 KB. :)

And consider leaving 1 MB at the end of the disk, in case you want to use glabel(8), or hastd(8), or any other GEOM class that needs to write metadata to the end of the disk.
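
A rough sketch of what that layout looks like in practice (ada0, the label, and the size are made-up examples for a nominal 2 TB drive -- get the real byte count from diskinfo -v and do the subtraction yourself):

  # find out how big the disk really is
  diskinfo -v ada0

  gpart create -s gpt ada0
  # start at sector 2048 (1 MB with 512-byte sectors) and stop short of the
  # end of the disk; the -s value is illustrative, roughly the capacity
  # minus 1 MB at each end
  gpart add -t freebsd-zfs -b 2048 -s 1907727m -l disk0 ada0
  gpart show ada0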
 
Simplification of the admin processes is not something that can be explained to the inexperienced. Keeping your setup as 'generic' as possible always pays off in the long run. Spending a lot of time finely tuning and crafting a system pays good benefits in the short to medium term, but it is a nightmare to deal with when it comes to system transformation/recovery/upgrade, etc.

Otherwise yes, do just as phoenix mentioned. If you decide to use partitions, create one partition in the 'middle' of the drive and document it well. Many RAID controllers do just that -- if you give them, say, a 200 GB drive, they will round it down to a megabyte boundary or so and use that as the 'usable' disk size. This allows slightly smaller '200 GB' drives, from the same make or another, to be used as replacements. Make sure you have good math there -- instead of 'leave that much space unused', do the reverse -- allocate that much space for the slice.
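
A sketch of what I mean, assuming a pool named 'tank' built on GPT labels (the names and the size are placeholders -- the point is only that every member gets the same, slightly rounded-down size):

  # give the replacement drive the exact same layout as the original members
  gpart create -s gpt ada5
  gpart add -t freebsd-zfs -b 2048 -s 1907600m -l disk5 ada5

  # the new partition is the same size as the old one, so the pool
  # will not complain that the replacement vdev is too small
  zpool replace tank gpt/disk1 gpt/disk5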

Oh, and another advantage of using the raw drive is that the ZFS pool will be usable on other operating systems, which is in fact what ZFS is designed for. Using a FreeBSD-specific partition format, encryption, etc. will definitely prevent you from, say, attaching the pool to a Solaris or Linux system (even if only to recover your data).
 
phoenix said:
If you go the way of using partitions, consider starting the first partition at the 1 MB boundary, so that it is aligned to every power-of-2 block size up to 128 KB. :)
Why do you think that's important? I was planning to align on 4k and on a cylinder boundary. Aligning on a power of 2 and a cylinder boundary would be... tricky.

danbi said:
Otherwise yes, do just as phoenix mentioned. If you decide to use partitions, create one partition in the 'middle' of the drive and document it well. Many RAID controllers do just that -- if you give them, say, a 200 GB drive, they will round it down to a megabyte boundary or so and use that as the 'usable' disk size. This allows slightly smaller '200 GB' drives, from the same make or another, to be used as replacements. Make sure you have good math there -- instead of 'leave that much space unused', do the reverse -- allocate that much space for the slice.
Planning to leave 100 MB free at the end, just in case. It's less than 1% of 2TB. :)

phoenix said:
Oh, and another advantage of using the raw drive is that the ZFS pool will be usable on other operating systems, which is in fact what ZFS is designed for. Using a FreeBSD-specific partition format, encryption, etc. will definitely prevent you from, say, attaching the pool to a Solaris or Linux system (even if only to recover your data).
Will use standard BIOS partitions, so I guess that'll be fine on other OSes. In any case, not planning to switch OS. The NAS is for me, and I only do FreeBSD. :)
 
Why bother with cylinder boundaries? What uses it? Everything is LBA now.

Search the freebsd mailing list archives. DES (or maybe Matt Dillon?) posted a long message about why 1 MB was the simplest, and most future-proof, offset to use for the first partition.
 
aragon said:
I'm concerned about using raw disks for a pool due to the future possibility of having to replace a broken disk with a different brand/model as the original disks, hence risking the replacement having slightly less disk space than the original disks.
I would think that as time goes on, disk capacities are likely to increase rather than decrease, so that should not be an issue. Or get some spares of the right size. Fooling around with partitions is a major PITA, and the time you'd spend on them wouldn't justify whatever cost savings you might get for the same reliability.
 
phoenix said:
Why bother with cylinder boundaries? What uses it? Everything is LBA now.

Search the freebsd mailing list archives. DES (or maybe Matt Dillon?) posted a long message about why 1 MB was the simplest, and most future-proof, offset to use for the first partition.
I think I found the thread, thanks. Well DES thinks FreeBSD's partitioning tools don't care about cylinder boundaries anymore, but I suspect that hasn't trickled down to STABLE yet. I'll need to test...

Matt's suggestion for 1 MB alignment seems like a good idea.
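
For my own sanity I'll verify the result with something like this afterwards (device name is just an example):

  # 1 MB = 2048 sectors of 512 bytes; 2048 is a multiple of 8 (4 KB),
  # 32 (16 KB), ... 256 (128 KB), so a start of 2048 lines up with every
  # power-of-2 block size up to 128 KB
  gpart show ada0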

carlton_draught said:
I would think that as time goes on, disk capacities are likely to increase rather than decrease, so that should not be an issue.
I considered that too. I will be keeping at least one cold spare handy, and if I run out I guess >2TB drives will be cheap by then. Maybe I'll go with raw disks after all. Decisions decisions...

Thanks all!
 
I would only worry about disk size if my disks were somewhat unique -- either already the largest available size, or some very special make/model. In either case, such installations typically have enough budget to allow for a few spares (if not already built into the pool).

In any case, 'wasting' a 2 TB disk as a replacement for a 1.5 TB disk in a pool is no big deal, as by that time it will already be cheaper, faster, etc.

Disk hardware has advanced so much in recent years that we now throw out perfectly good drives just because they no longer compare in capacity and performance with what is cheaply available on the market. Not to forget power consumption. :)

In the context of ZFS, the filesystem is designed in such a way that you can mix whatever storage media you have, of whatever size, and it will still work well. No more artificial constraints. With new ZFS versions to come, things will get even better, once the long-dreamed-of 'block pointer rewrite' feature becomes reality.
 