RAID controller cache and ZFS

Hi,

I have set up a 12x1TB array in RAID6, giving me an approx 10TB virtual disk, and configured ZFS with a single volume on top of it. I know lots of people will argue I should have used the disks as JBOD with raw disks and RAIDZ2, but I personally feel more comfortable with a hardware solution than with ZFS, since ZFS is kinda new and the FreeBSD implementation is not really tested at large (please don't argue, I won't change my mind on this).

My question is regarding the BBU cache of the RAID controller. I'm using a PowerEdge R510 and an H700 RAID card with 512 MB of cache. The driver for it is mfi(4).

There are 14 disks in the server: the 12x1TB as a RAID6 + ZFS, and 2 SAS drives running in RAID1 for the operating system...

Code:
# mfiutil cache 0
mfi0 volume mfid1 cache settings:
      I/O caching: disabled
    write caching: write-back
       read ahead: adaptive
drive write cache: default

Both of my volumes (RAID1 and RAID6) have the same cache settings...

Is that optimal, or should I use something else?

I thought of using "enable" (enable caching for both read and write I/O operations) for the RAID1, and "writes" (enable caching only for write I/O operations) for the RAID6+ZFS.
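If I read the mfiutil man page right, that would be something like this (assuming mfid0 is the RAID1 and mfid1 is the RAID6 volume; I'd double-check which is which first):

Code:
mfiutil cache mfid0 enable
mfiutil cache mfid1 writes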

I want absolutely no compromise on data integrity, but I would like to optimize performance as much as possible. Are my proposed settings better, or should I stick with how it is currently?

Any comments would be more than welcome.

Thanks a lot in advance
 
By the way, here is a section of the mfiutil(8) man page.

Code:
cache volume [setting [value]]
             If no setting argument is supplied, then the current cache policy
             for volume is displayed; otherwise, the cache policy for volume
             is modified.  The optional setting argument can be one of the
             following values:

             enable  Enable caching for both read and write I/O operations.

             disable
                     Disable caching for both read and write I/O operations.

             reads   Enable caching only for read I/O operations.

             writes  Enable caching only for write I/O operations.

             write-back
                     Use write-back policy for cached writes.

             write-through
                     Use write-through policy for cached writes.

             read-ahead [value]
                     Set the read ahead policy for cached reads.  The value
                     argument can be set to either ``none'', ``adaptive'', or
                     ``always''.

             write-cache [value]
                     Control the write caches on the physical drives backing
                     volume.  The value argument can be set to either
                     ``disable'', ``enable'', or ``default''.

                     In general this setting should be left disabled to avoid
                     data loss when the physical drives lose power.  The bat-
                     tery backup of the RAID controller does not save data in
                     the write caches of the physical drives.
 
"I want absolutely no compromise on data integrity"

Using write-back caching is dangerous and puts your data at risk. Out-of-order execution of I/O may also cause corruption in the case of a reset/crash: some newer I/O requests may have made it to disk while some older I/O requests did not.

To use a controller safely with ZFS, it needs to support BIO_FLUSH; a write-back cache likely ignores these requests. Basically you're playing with fire. You also lose most of the ZFS benefits, such as self-healing and protection against BER/corruption. For all intents and purposes, ZFS treats your array as being non-redundant.
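If you do keep the controller in the write path, the least risky setting for the ZFS volume would be write-through, which with the same mfiutil syntax quoted above would look something like this (mfid1 assumed to be the RAID6 volume):

Code:
mfiutil cache mfid1 write-through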

I'd say ZFS is one good example of how software RAID can be superior to hardware RAID on a fundamental level.
 
Herrick said:
(please don't argue, I won't change my mind on this)
...
Is that optimal, or should I use something else?
...
I want absolutely no compromise on data integrity

By using ZFS on a non-redundant block device, and by disabling all of the data integrity features of ZFS, you are running a non-optimal setup that will compromise your data integrity. But, since you won't change your mind, there's little reason to continue this discussion. :)
 
Hi, thanks for the links, but I've been reading documentation, mailing lists, and forum posts for a couple of days...

I came to the conclusion that ZFS is not tested enough and I don't have enough confidence in it for critical stuff.

Comments from people all around saying that I should not run it on a RAID controller confirm (imo) that ZFS is really not stable and tested. I agree that by doing so I won't have the nice features of ZFS, but why would it be apocalyptic? People are saying I'll run into problems. I understand I might get data corruption if the RAID hardware makes some errors, but why would it be worse than UFS? I've been running UFS on Compaq and Dell servers for 8-9 years and never had a single corruption problem (managing about 50 servers)... If this ever happens, ZFS will tell me there is a problem with a particular file, I'll restore it from backup, big deal... People are talking as if the whole file system will die. If that's the case, then gee, ZFS sucks ;)

Forget the RAID card for one second. Are you telling me it's nonsense to use ZFS unless you have more than one disk? If it's that unstable, I'll start looking into having more, smaller UFS drives...

The discussion was supposed to be about optimal RAID settings; I guess we are far from it ;)

Thanks for your comment by the way, I appreciate seeing other people's opinions.

Regards
 
Herrick said:
Forget the RAID card for one second. Are you telling me it's nonsense to use ZFS unless you have more than one disk? If it's that unstable, I'll start looking into having more, smaller UFS drives...

Nobody said it will be unstable with just 1 disk, but you will miss the self-healing ability.

I'm really wondering how you made up your mind that ZFS is still pre-alpha and can't be used in production, when there are plenty of examples of working setups.
 
Ok, so if it's not unstable with one disk, it won't be on a single "virtual disk" which is handled by a RAID controller... For my particular setup, I need to store documents for legal purposes and keep them available for 6 years. So basically, they are written to disk, maybe read 2-3 times in the first week. Then nobody will touch them (except in very rare cases) before they get deleted after 6 years... I was looking at ZFS for the scrub feature, and since the array is 10TB, I'm scared of a UFS fsck on the drive.

I posted a message on the FreeBSD mailing lists about ZFS, and I got a lot of off-list replies from people telling me that they love FreeBSD, but for such a scenario they wouldn't trust ZFS, as it is not as mature as the OpenSolaris implementation. I don't feel confident right now about ZFS to store 10TB of legal documents. Sure, I could restore from backups, but I'm hoping I'll never need to.

Yesterday, I was doing read/write tests while unplugging two drives, and FreeBSD never complained about anything. I trust hardware RAID cards, they never let me down (in fact, once, but it was my fault ;)). Right now I don't trust ZFS enough.

I feel the best compromise is to have a hardware RAID card for reliability and ZFS for the file system. Sure, I'll lose some cool features, but I don't care about them in my particular scenario. That's why my post was about optimal RAID settings, but the discussion has turned toward why I should use ZFS :)

In five years, if we get back to this post, maybe I'll say: you were right, ZFS was stable, I could have used it. On the other hand, maybe I'll say: I'm so glad I didn't, I lost all my data and had to restore from backup... Time will tell, but right now I don't trust ZFS enough to put my career on it (exaggeration).

Regards
 
Herrick said:
I came to the conclusion that ZFS is not tested enough and I don't have enough confidence in it for critical stuff.

ZFS has been used and beaten on in Solaris for over 5 years. ZFS has been used and beaten on in FreeBSD for over 2 years (across 4 separate releases of FreeBSD). It's definitely not alpha or even beta quality anymore.

Comments from people all around saying that I should not run it on a RAID controller confirm (imo) that ZFS is really not stable and tested.

ZFS is designed to replace RAID controllers. The whole point of using ZFS is to run it on normal, prone-to-errors disks. ZFS on a single block device has no redundancy, and is unable to correct any of the problems that it detects. ZFS on multiple disks can do a lot of fancy things that RAID controllers can't: detect and correct corrupt files, detect and correct dead disks, update existing files without hitting the RAID5/6 write-hole, etc.

Using ZFS on top of a RAID array completely defeats the point of ZFS.

I agree that by doing so I won't have the nice features of ZFS, but why would it be apocalyptic? People are saying I'll run into problems. I understand I might get data corruption if the RAID hardware makes some errors, but why would it be worse than UFS?

It will be slower than UFS, and if any files are detected as corrupt, there is no way to fix them. In which case, you may as well not have been using RAID, since the hardware RAID controller can't fix the corrupt files either.

You also won't get any of the volume management features of ZFS (easy storage pool creation, easy storage pool expansion, easy drive replacement with larger disks, etc).

I've been running UFS on Compaq and Dell servers for 8-9 years and never had a single corruption problem (managing about 50 servers)... If this ever happens, ZFS will tell me there is a problem with a particular file, I'll restore it from backup, big deal... People are talking as if the whole file system will die. If that's the case, then gee, ZFS sucks ;)

Ah, but if you run ZFS on "Single Disk" arrays or JBOD, then ZFS will detect the corrupt file, and repair it, without you having to touch your backups or having any downtime.

Forget the RAID card for one second. Are you telling me it's nonsense to use ZFS unless you have more than one disk? If it's that unstable, I'll start looking into having more, smaller UFS drives...

It has nothing to do with stability. Using ZFS with a single block device defeats the whole purpose of using ZFS. ZFS is designed for multiple disks, as it does all the RAID stuff itself (mirror, raidz1, raidz2, raidz3, striping across vdevs, etc).

If you are hell-bent on using a single hardware RAID6 array, then just use UFS on top.

If you are concerned about data integrity, then create a bunch of "Single Disk" arrays or switch the controller to JBOD, and use ZFS to create the RAID arrays (and use multiple, smaller raid arrays, as ZFS will then stripe across them, in effect creating RAID50 or RAID60).
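On the H700/mfi(4) side, creating the single-disk arrays would look roughly like this; the enclosure:slot names below are only examples, so check the output of the first command for the real ones (and note that some controller firmware only offers single-drive RAID0 volumes rather than a true JBOD mode):

Code:
# list the physical drives and their enclosure:slot locations
mfiutil show drives
# create one single-drive volume per data disk; extend the list to cover all 12
mfiutil create jbod e1:s2 e1:s3 e1:s4 e1:s5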
 
Herrick said:
For my particular setup, I need to store documents for legal purposes and keep them available for 6 years.

ZFS with snapshots works wonderfully for this. Our backup servers have over 10 TB of disk each, managed solely by ZFS.

So basically, they are written to disk, maybe read 2-3 times in the first week. Then nobody will touch them (except in very rare cases) before they get deleted after 6 years... I was looking at ZFS for the scrub feature, and since the array is 10TB, I'm scared of a UFS fsck on the drive.

ZFS scrub is useless without redundancy in the ZFS pool.

I posted a message on the FreeBSD mailing lists about ZFS, and I got a lot of off-list replies from people telling me that they love FreeBSD, but for such a scenario they wouldn't trust ZFS, as it is not as mature as the OpenSolaris implementation.

When did you post? A year ago? When ZFSv6 was in FreeBSD? ZFSv14 is now in FreeBSD 8-STABLE, with plans for ZFSv15 to be available in 8.1-RELEASE in July. This is the same version of ZFS as in Solaris 10. If it's good enough for Solaris, why wouldn't it be good enough for FreeBSD? :)

I don't feel confident right now about ZFS to store 10TB of legal documents. Sure, I could restore from backups, but I'm hoping I'll never need to.

I feel confident enough in FreeBSD + ZFS to store a year's worth of daily backups for over 120 servers for an entire school district. Including e-mail for archiving/legal purposes. We're only keeping 1 year, as that's all the storage space we have right now for keeping daily snapshots. Taking snapshots less frequently would allow for longer retention, and replacing the 500 GB drives (24 of them) with larger ones would allow for more storage, but we don't have the money for that right now. :)

Yesterday, I was doing read/write tests while unplugging two drives, and FreeBSD never complained about anything. I trust hardware RAID cards, they never let me down (in fact, once, but it was my fault ;)). Right now I don't trust ZFS enough.

To each their own. :)
 
phoenix said:
It will be slower than UFS, and if any files are detected as corrupt, there is no way to fix them. In which case, you may as well not have been using RAID, since the hardware RAID controller can't fix the corrupt files either.

If you are hell-bent on using a single hardware RAID6 array, then just use UFS on top.

My idea of using ZFS was to be able to scrub the disk. Sure, it cannot fix the error, but at least it can detect it (so I can restore the file from backup). The metadata for each file keeps a checksum, so you can easily validate that a file is still valid.
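That is, something like this, if I understand the tools correctly (pool name is just an example):

Code:
zpool scrub tank
zpool status -v tank    # lists any files with unrecoverable errors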
 
phoenix said:
ZFS with snapshots works wonderfully for this. Our backup servers have over 10 TB of disk each, managed solely by ZFS.

ZFS scrub is useless without redundancy in the ZFS pool.

When did you post? A year ago? When ZFSv6 was in FreeBSD? ZFSv14 is now in FreeBSD 8-STABLE, with plans for ZFSv15 to be available in 8.1-RELEASE in July. This is the same version of ZFS as in Solaris 10. If it's good enough for Solaris, why wouldn't it be good enough for FreeBSD? :)

I feel confident enough in FreeBSD + ZFS to store a year's worth of daily backups for over 120 servers for an entire school district. Including e-mail for archiving/legal purposes. We're only keeping 1 year, as that's all the storage space we have right now for keeping daily snapshots. Taking snapshots less frequently would allow for longer retention, and replacing the 500 GB drives (24 of them) with larger ones would allow for more storage, but we don't have the money for that right now. :)

To each their own. :)

The scrub, as I just replied in the previous message, was in my understanding useful to detect problems even if I cannot repair them automatically.

Regarding the post, it was last week... That's why I'm relying on it for my perception of ZFS. I must say that with your scenario you seem to be really confident about it, and considering your status on the forum, you seriously put a doubt in my mind...

I'm starting to think I might give ZFS a try :) If so, I have 12 disks and the ZFS recommendation is 9 disks max per vdev... How would you set up the zpools?

Thanks
 
12x 1TB disks can be done in multiple ways:
  • 12-disk raidz2 vdev (10 TB, can lose any 2 disks, horribly slow, don't do it)
  • 2x 6-disk raidz2 (8 TB, can lose 2 disks in each vdev, faster than above)
  • 3x 4-disk raidz2 (6 TB, can lose 2 disks in each vdev, faster than above)
  • 3x 4-disk raidz1 (9 TB, can lose 1 disk in each vdev, faster than above)
  • 4x 3-disk raidz1 (8 TB, can lose 1 disk in each vdev, faster than above)
  • 6x 2-disk mirror (6 TB, can lose 1 disk in each vdev, fastest)

It all depends on the level of redundancy, amount of usable storage, and random access speed that you want.
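As a rough sketch, the "2x 6-disk raidz2" layout above would be created like this (device names are placeholders; use whatever your 12 single-disk volumes show up as, e.g. mfid2 through mfid13):

Code:
zpool create tank \
    raidz2 mfid2 mfid3 mfid4  mfid5  mfid6  mfid7 \
    raidz2 mfid8 mfid9 mfid10 mfid11 mfid12 mfid13
zpool status tank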

raidz2 (double-parity RAID, similar to RAID6) provides the best redundancy/usable storage ratio. But it's the slowest method, due to the double-parity calculations. You want to use lots of small raidz2 vdevs to compensate (although going below 5 disks in a raidz2 vdev is pretty pointless).

raidz1 (single-parity RAID, similar to RAID5) provides decent redundancy and more usable storage. Not too useful with large drives (over 500 GB), though. If you lose a second drive while resilvering a dead drive, the whole pool is lost.

mirroring (similar to RAID1) provides the best performance, but uses up the most disk space for redundancy. Mirroring in ZFS is N-way, meaning you can use as many disks in the mirror as you want. If you have tonnes of disks to play with, you can create a pool with 3, 4, 5 or more disks in each mirror. Maximum redundancy ... minimum usable storage. :)

raidz vdevs are limited to the IOPS of the slowest disk in the vdev. IOW, no matter how many disks you add to the vdev, the IOPS will be the same as for a single disk. To get around that, you make lots of small raidz vdevs and add them all to the pool, thus aggregating the IOPS. ZFS will add them to the same pool and stripe data access across all the vdevs, in effect creating one large RAID0 stripeset across all the raidz vdevs.

ZFS also supports spare disks, although the FreeBSD implementation currently (I believe) only does cold spares, meaning you have to manually start the replace/resilver using the spare disk. But there were messages on the freebsd-fs list saying someone was working on enabling hot spares (they're part of ZFSv13+).
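The manual (cold-spare) dance looks something like this (device names are just examples):

Code:
zpool add tank spare mfid14        # register the spare with the pool
zpool replace tank mfid5 mfid14    # after mfid5 dies, start the resilver onto the spare by hand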

The really fun stuff will come in FreeBSD 9.0 (hopefully), when ZFSv22+ is included. That's when deduplication support is enabled. :)
 
Herrick said:
My idea of using ZFS was to be able to scrub the disk. Sure, it cannot fix the error, but at least it can detect it (so I can restore the file from backup). The metadata for each file keeps a checksum, so you can easily validate that a file is still valid.

If you have metadata with a checksum, separate from the data files, you may as well do the verification without using ANY filesystem tools: just write a small program to check each and every one of your files. If you keep the metadata on separate storage, then you have all that you need.

ZFS is something different. It is the dreamed-of "store and forget" filesystem, and it pretty much works that way.

Last year, I had a 3ware RAID fail on me. It just lost its array configuration (luckily, there were several arrays, and some of them persisted). That was my first such system, created years ago, and I had believed it was something rock solid. Just as you mentioned, I believed the 'hardware RAID' would give me peace of mind and data integrity.
A false sense of security! It was all wrong.

In my case, because of this false sense of security, that particular system (and ironically, only that one) had no recent enough backup. So: urgent help from 3ware to sort of recover the arrays, two sleepless nights, and two to three days of downtime... that is, the time it took to migrate everything to ZFS.
Since that incident, I have been migrating many 3ware-based RAID arrays to ZFS.

The irony is, with modern hardware, if you don't need many disk drives, you get much better performance from the on-board SATA ports than from a 3ware controller. That is performance, of course, not reliability.

Phoenix has commented on most of the rest.

The point is: forget about the hardware RAID. The 3ware controllers are still good as smart SATA adapters, although not as fast as the on-board interfaces.

In your application, as you describe it, the important thing is backups. RAID is no replacement for backup. But with ZFS you have the benefit that, in the unlikely event of file corruption, you will know which file is dead and can restore just that file from backup.

Another point for storage applications like yours: tomorrow you may want to expand your storage. With the RAID controller, although it may be theoretically possible, it usually takes an awful lot of time, during which (I believe) your data is vulnerable.

With ZFS, you just replace your drives with larger drives and are done. You need to replace all the drives in a vdev, of course, so here the 'wasteful' nature of mirrors comes to help. If today you have 12 1TB drives with 6TB of usable storage (6x 2-way mirrors) and you need another TB or two, you just replace two of the drives (one mirror) with 2TB drives, one by one, and so on. You never have to stop the running system, never have to re-partition the RAID volume or grow it, none of those risky tasks. With ZFS over a single RAID volume this is unrealistic.


Anyway, I fail to understand your position on ZFS. If ZFS is unreliable, why would you want to use it on top of your perfect RAID?!?!
 
That is quite a trainwreck you have going on here. If you don't trust ZFS, that's fine, just don't use it. But if you want to use ZFS, at least use it in a proper configuration. Configure your hardware RAID card to serve the drives as 12 x single-disk arrays; this will allow you to keep using all the hardware features of the RAID card (true JBOD mode tends to turn hardware RAID cards into dumb SATA controllers, disabling most features).

Configure a ZFS pool out of these 12 "drives". What kind of pool layout to use is up to your own taste, but the important thing is to never, ever use raidz/raidz2 vdevs that are wider than 9 disks each (i.e. DO NOT make a pool consisting of a single 12-drive raidz2 vdev). Break the pool up into 2 x 6-disk raidz or 2 x 6-disk raidz2 vdevs. Or, as danbi described, go for a RAID10-like configuration of 6 x 2-disk mirrors. This will only give you 6 drives' worth of usable space for obvious reasons, but you would be able to sustain 6 dead drives without losing data, as long as each of the dead drives belongs to a different mirror. You will also have an easier time growing your pool by growing the underlying mirrors one at a time.
 
Hmmm, I'm using a 12 x 2 TB raidz2 setup as it's really convenient. With the money saved compared to multiple e.g. RAID10 setups, I have two backup systems (both also RAID6-like), but with a different OS and different filesystems.

So the chance of hardware failure is covered... but much more important are software implementation errors in the filesystem/OS/network... that might destroy all vdevs, or the data, or the pool, or...

Hmm, raidz2 is not that slow even on "low-end" hardware... at least if you are limited to a 1-gigabit network...
 
You really shouldn't use more than 10 drives in a raidz vdev. To see why, pull one drive, wait a bit, plug it back in, and see what happens when it tries to resilver the drive. :)
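For reference, bringing the re-inserted drive back and watching the resilver looks roughly like this (device name is a placeholder):

Code:
zpool online tank mfid5    # re-attach the disk; ZFS starts resilvering the missed writes
zpool status tank          # shows "resilver in progress" and how far along it is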

Resilvering a single 500 GB drive in a 24-drive raidz2 vdev was impossible. :) The drives just thrashed and thrashed and thrashed for over a week before we gave up and redid the pool with smaller vdevs.

Resilvering a 1.5 TB drive in an 8-drive raidz2 vdev where only 500 GB is in use takes 35+ hours.

If I could build our storage servers again right now, I would use 6-drive raidz2 vdevs, with 2 on each 12-port controller. :)
 
As I have backup systems, I prefer the advantage of a "big pool" and accept the fact that resilvering will take some hours more.

On my "low-end system" a resilver needs ~39 hours (12 x 2 TB, 70% full, raidz2).

If I really had to consider resilver/rebuild time on a (RAID) server, then I would probably use a different system like a DFS or cloud storage... as rebuild/resilver should not happen that often :D

On my Linux servers with up to 24 drives per pool, the need for a rebuild came up about every two to three years; on some I never had to do a rebuild at all...
 
Hi there, I hope it's OK to reply to old threads? :)

I have kinda the same issue, that is, a very good RAID controller, and I want to combine it with ZFS's nice snapshots and its nice cache feature with an SSD disk...
The reason I have gone with the RAID controller is that, from what I was able to find out, ZFS isn't that great at expanding existing raidz groups with additional disks...
It can expand a pool with another raidz, but if you had a raidz1 with, say, 6 disks and wanted to expand it with 6 more disks, you would have to make another raidz and end up with 2 parity disks. With my RAID controller you are able to add the disks to an existing RAID5 or RAID6 (yes, it has to rebuild), then you can expand the volume/LUN, and ZFS will be able to autoexpand the pool for you... nice and easy. :)
I would be happy to turn my RAID controller into an expensive JBOD controller if someone can sell me a better way to expand my pool...

The RAID5/6 write-hole is not something I am afraid of, mostly because I have been working with enterprise storage systems for over 10 years, and most of them were running RAID5/6 and I have never had problems with it... I have been working mostly with NetApp for the past 5 years, and if only someone would copy their technology with RAID-DP (RAID4 with two parity disks), and WAFL on top for great and fast, no performance degradation snapshots, I would be truly happy :)

Just my 5 cents on the subject ;-)
 
By all means, avoid putting the SSD drive under the control of your RAID controller. The whole point of using an SSD is the low latency, and even a 'very good RAID controller' does not match an SSD's latency. If possible, connect the SSD to a motherboard port or to an HBA.

It has been pointed out many times that there is no value in using ZFS over a 'RAID volume', other than perhaps to utilize an older setup, or a SAN, or whatever. In any case, you will be much better off performance-wise, and much safer (i.e. taking a lower risk of losing your data), if you give ZFS direct access to each disk. If you can't do this, at least create two 'RAID' volumes and give these to ZFS for redundancy.

Since ZFS appeared, many have reconsidered their belief in the reliability of RAID controllers: sometimes ZFS shows bad data where the RAID controller is quite happy.

About volume expansion: ZFS takes a different approach to this issue. From your statements, I understand performance is not your primary concern. Therefore, you could be happy to have, say, a 6-disk raidz vdev, then add a mirror vdev. ZFS will spread the data over both vdevs, in effect increasing both your available storage and performance.
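Roughly like this (names are only examples; zpool will warn about mixing raidz and mirror vdevs in one pool, and -f overrides that warning):

Code:
zpool add -f tank mirror da6 da7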

Have you ever increased the size of a RAID5/6 volume? With what size of disks? How many days or weeks did it take? :) What if something goes wrong with any of the disks during this process?
 
hwalther said:
Hi there, I hope it's OK to reply to old threads? :)
So long as the reply is on-topic to the thread, there's no problem. Replying to an old thread with a new/unrelated topic is frowned upon.

I have kinda the same issue, that is, a very good RAID controller, and I want to combine it with ZFS's nice snapshots and its nice cache feature with an SSD disk...

The reason I have gone with the RAID controller is that, from what I was able to find out, ZFS isn't that great at expanding existing raidz groups with additional disks...

ZFS is very good at expanding the available storage space in a pool. Just add another vdev to the pool, and ZFS will stripe data across all the vdevs in the pool.

For example, create a storage pool using one 6-disk raidz2 vdev (similar to a 6-disk RAID6 array). Later, add another 6-disk raidz2 vdev to the pool. You now have a pool with 2 vdevs, with writes striped across them (similar to a RAID60 array), so you get more storage space *AND* increased performance. Later, add another 6-disk raidz2 vdev. And continue until you run out of storage bays.

Also, you can replace each of the drives in a vdev with larger ones (one at a time, allowing each to resilver). Once that process has completed, you will get access to all the new storage space. For example, create a pool using one 6-disk raidz2 vdev with 500 GB drives: you have 2 TB of usable storage. Later, you replace all those drives with 2 TB drives: you now have 8 TB of usable storage.
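Both of those boil down to one-liners (pool and device names are just examples):

Code:
zpool add tank raidz2 da6 da7 da8 da9 da10 da11   # grow the pool by adding a second vdev
zpool replace tank da0 da12                       # or swap in a bigger drive; repeat for each disk in the vdev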

The only thing you can't do (yet) with ZFS is to change the size of a raidz vdev. For example, you can't start with a 6-drive raidz2 vdev and then later add 2 disks to it to create an 8-drive raidz2 vdev. Nor can you go from raidz1 to raidz2, or from raidz to mirror, or anything like that.

With my RAID controller you are able to add the disks to an existing RAID5 or RAID6 (yes, it has to rebuild),

How long does it take to rebuild, and what happens if a drive dies during the rebuild? With ZFS, adding another vdev takes a few seconds, the storage space is available instantly, and you never lose redundancy.

then you can expand the volume/LUN, and ZFS will be able to autoexpand the pool for you... nice and easy. :)

And completely unsafe.

I would be happy to turn my RAID controller into an expensive JBOD controller if someone can sell me a better way to expand my pool...

See above. Stop thinking in terms of a single RAID array, and start thinking in terms of a "storage pool comprised of lots of arrays". A single 24-disk RAID6 array is going to suck in terms of performance, but a 24-drive RAID60 array using 6 drives per RAID6 array will scream in comparison.

Same for ZFS. Stop thinking "I need 1 huge array that I just keep expanding" and start thinking "I need 1 huge pool made up of lots of little arrays that I keep adding". :)

I have been working mostly with NetApp for the past 5 years, and if only someone would copy their technology with RAID-DP (RAID4 with two parity disks), and WAFL on top for great and fast, no performance degradation snapshots, I would be truly happy :)

And that's different from raidz in ZFS how?
 
Although it is obvious that ZFS itself is at least no worse a solution for managing disk devices than any hardware RAID controller, I'd like to discuss some drawbacks of the ZFS-only approach deployed over JBOD.

For a start, it's silly not to use the write performance gain achieved by the battery-backed cache memory of the controllers. But in a ZFS configuration with mirrors, that kind of performance gain is effectively halved, because the written data is cached twice (once for each side of the mirror).

Secondly, there are two problems with fail-safe booting. If the first disk drive fails, there will likely be a boot failure. And in any case, one needs some fairly advanced skills to properly do the partitioning and boot preparation.

All these drawbacks simply don't exist in the case of an array managed by HW RAID, so it's not quite clear to me whether it is better to rely on ZFS or on HW RAID.
 
Failsafe booting with mirrored ZFS drives is not hard. It's just a matter of putting the bootblock/bootloader onto both disks and configuring OBP/BIOS to boot from one first and the other second.
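On FreeBSD with GPT partitioning, that amounts to roughly the following (assuming a freebsd-boot partition at index 1 on each of the two mirrored disks, here called ada0 and ada1):

Code:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1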

Also, it may be interesting to note that Sun/Oracle recommends using Solaris ZFS instead of HW RAID on their own boxes.
 
RAID controller in JBOD mode vs. regular SATA connector?

ctengel said:
Also, it may be interesting to note that Sun/Oracle recommends using Solaris ZFS instead of HW RAID on their own boxes.

That Sun/Oracle recommends ZFS over HW RAID is understandable; I would be shocked otherwise ;)

Would ZFS work better with a RAID controller in IT/JBOD mode with a BBU than with regular SATA connections on the motherboard, since a RAID controller has a cache and does not need to flush to the disks? That should be safe because of the BBU, right?
 
andrej said:
That Sun/Oracle recommends ZFS over HW RAID is understandable; I would be shocked otherwise ;)

True, but I guess what surprised me is that the box in question was a Sun SPARC server, and the full HW RAID was one of the selling points of the box!

andrej said:
Would ZFS work better with a RAID controller in IT/JBOD mode with a BBU than with regular SATA connections on the motherboard, since a RAID controller has a cache and does not need to flush to the disks? That should be safe because of the BBU, right?

Again, I guess they're biased, but the Sun engineers are pretty consistent in saying that the closer ZFS is to the drives, the better. On raidz, as has been mentioned, all writes are designed to be atomic across the disks. If you've got a cache in between that's not being flushed, all of a sudden they are not. I guess having it on a BBU vs. not is certainly safer, but the cache can still fail for some other reason, and it is a single point of failure. Then we're back to the same RAID5 write-hole issue that ZFS's atomic writes were supposed to fix.

As has been said by others in this thread, for ZFS to work as intended (not that you can't get some cool functionality out of it either way, I'm speaking about the reliability), it needs to have a full understanding of what is going on on the disks, and any cache must be honest with it when it comes to flushing.
 