ZFS A petabyte zpool - striped vs. non-striped JBODs?

Let's say we're going to make a single, 1PB zpool. Maybe that is or is not a good idea, but let's just pretend that's what we're doing.

Let's also assume raidz3 with 15-disk vdevs, and we use 45-disk JBOD enclosures.

There are two ways we can lay this out - we can put three 15-disk vdevs into each JBOD, and just add JBODs until we have a petabyte. This is very simple.

However, there is a second, more complicated thing we could do - we could buy 15 JBODs and stripe each vdev across all 15, so that each vdev has exactly one disk inside each JBOD. When we build the vdevs, we just tell ZFS to use non-consecutive disks for the vdev members (#1, #16, #31, and so on ...)
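
To make the two layouts concrete, here is roughly what the first vdev of each option would look like. The pool name and daN device names are made up, and I'm assuming the disks enumerate sequentially per 45-disk shelf (da0-da44 in JBOD 1, da45-da89 in JBOD 2, and so on):

Code:
# Option 1: the first 15-disk raidz3 vdev lives entirely inside JBOD 1
zpool create tank \
    raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14

# Option 2 (alternative layout): the first raidz3 vdev takes one disk
# from each of the 15 JBODs
zpool create tank \
    raidz3 da0 da45 da90 da135 da180 da225 da270 da315 da360 da405 da450 da495 da540 da585 da630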

It *seems* like option two is a better one - we can actually lose three *entire* JBODs and the pool keeps running. With option #1, we cannot lose any JBOD at all - if we lose the connection to any one of the JBODs, the zpool breaks immediately.

BUT, I think I would prefer option 1 - please comment on my reasoning:

If I lose an entire JBOD with the "normal" organization, the zpool is broken and stops. But all I need to do is repair the bad connection, bring the JBOD back online, and then roll back ZFS transactions until the zpool is coherent again. This is downtime, but when the downtime is over, I have a normal, healthy zpool.
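
As far as I understand, the recovery would look roughly like this (pool name is just an example):

Code:
# after fixing the connection and powering the JBOD back up
zpool clear tank        # if the pool is still imported but suspended
zpool import -F tank    # or, after a reboot: recovery-mode import, rolling
                        # back the last few transactions if necessary
zpool status -x         # confirm the pool reports healthy again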

With option 2, the zpool keeps running, which is very nice, but when I reconnect the JBOD, *all* of my vdevs - every single one - needs to resilver, and with that much storage it could take a long, long time ... it could be in a very bad performance state.

So my conclusion is:

- If you want high availability, striping vdevs across JBODs is a good solution, but it will be ugly if you lose a JBOD

- If you can stand some downtime, a much simpler solution is to keep vdevs in their own JBODs; if a JBOD becomes detached, just roll back transactions in the pool until it is coherent again and restart - at full performance and health.


Comments?
 
I think the second case may not be as bad as you describe. AFAIK, ZFS may resilver only the data written while that JBOD was away. I don't remember whether I tried that with RAIDZ3, but at least for mirrors I think it worked that way. Try to experiment first.
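
Something like the following on a small scratch pool should show it - offline one disk, write some data, bring it back, and watch what the resilver actually touches (pool and device names are just examples):

Code:
zpool offline tank da5     # simulate the disk (or its JBOD) going away
# ... write some data to the pool while the disk is offline ...
zpool online tank da5      # bring the disk back
zpool status tank          # the resilver should only cover data written while it was gone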
 
You have just found a typical tradeoff in designing storage systems. It's sort of similar to the latency <-> throughput tradeoff; in your case, the tradeoff is higher availability at the price of a longer and more painful recovery after an outage. (*)

But it's excellent that you are thinking through these things; most users just go by the seat of their pants. What really matters here is the needs and desires of your end user. Are they the kind of customer where total unavailability of the data is a disaster, but slower access can be tolerated? Or are they the kind of customer where the system has to run at least at 90% of rated performance, or else it might as well be dead?

It *seems* like option two is a better one - we can actually lose three *entire* JBODs and the pool keeps running.

Two comments.

First, make sure the system can actually reliably handle failure of an entire JBOD without side effects. Concrete example: recently, I was working on a system that has roughly 360 disks in a half dozen external JBODs, each of which has dual power supplies and dual controller modules (everything is dual-pathed). Service staff was performing power supply maintenance on a running system (which should have been OK), when by mistake they powered the whole JBOD down (probably pulled the wrong power cable, leaving the JBOD with one dead power supply, and a live but unplugged one). In theory, the RAID system above should have survived that, because the remaining JBODs should have given enough redundancy. In practice, the system went down hard. For some inexplicable reason, the kernel (not BSD) wedged itself so hard, it couldn't respond to pings, and console and keyboard also stopped working. So having dual redundant power supplies and an intelligent data layout didn't help at all, since the OS croaked.

Second: You think you might be able to run a PB-size RAIDZ3 file system without any redundancy. But in practice, that won't work for very long. If you look at the error rate of modern drives, it is actually quite likely that during this highly degraded operation, you'll find a single sector error in a disk that's still powered up. And now, without any redundancy, that will cause "game over". Moral of that story: In large systems and with modern large disks, you need more redundancy just to cover the expected number of read errors.
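
A quick back-of-envelope illustration (assuming a typical vendor spec of one unrecoverable read error per 10^15 bits read, and that getting the pool healthy again means reading on the order of 1 PB):

Code:
# expected unrecoverable read errors while reading 1 PB
# = (10^15 bytes * 8 bits/byte) / (10^15 bits per URE)
echo 'scale=1; (10^15 * 8) / 10^15' | bc
# prints 8.0 - several read errors are expected, so with zero remaining
# redundancy some of them turn into data loss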

With option 2, the zpool keeps running, which is very nice, but when I reconnect the JBOD, *all* of my vdevs - every single one - needs to resilver, and with that much storage it could take a long, long time ... it could be in a very bad performance state.

Does ZFS have some tuning parameter to adjust the speed of resilvering? As long as only one disk from every vdev was lost, you still have 2-way redundancy, so resilvering to full 3-way redundancy is not urgent at all. Perhaps you can choose to deliberately slow down resilvering, if that gives your customer's workload better performance.

(*) Footnote: In designing storage systems, you have to make tradeoffs between three conflicting desires. One is data reliability: make sure you don't lose the customer's data. The second is availability: make sure the customer can get to their data when they want to, not a day or a week later. The third is performance: the customer wants a certain throughput or latency. What you are doing here is actually called performability: you are looking at variable performance during different phases of system operation (perfect state, degraded operation with missing JBODs, recovery after the JBODs are restored). Looking only at availability is like looking at performability in black and white, ignoring that there may be times where availability means lower performance. Once you also make the deliberate choice of slowing down resilvering, you enter another tradeoff with reliability: the slower you resilver, the higher the chance that further failures, while you are still less redundant, cause data loss. Making these tradeoffs is difficult, particularly without a good understanding of the customer's cost/benefit tradeoffs and without analytical models of the impact of your decisions.
 
Does ZFS have some tuning parameter to adjust the speed of resilvering? As long as only one disk from every vdev was lost, you still have 2-way redundancy, so resilvering to full 3-way redundancy is not urgent at all. Perhaps you can choose to deliberately slow down resilvering, if that gives your customer's workload better performance.

Yes, there are several tunables you can set to modify how resilvering works. They can be used to either prioritise resilvering/scrub operations at the expense of throughput to the pool, or to put them into the background.

Code:
# sysctl -d vfs.zfs | egrep -e "scrub|silver" | sort
vfs.zfs.no_scrub_io: Disable scrub I/O
vfs.zfs.no_scrub_prefetch: Disable scrub prefetching
vfs.zfs.resilver_delay: Number of ticks to delay resilver
vfs.zfs.resilver_min_time_ms: Min millisecs to resilver per txg
vfs.zfs.scan_min_time_ms: Min millisecs to scrub per txg
vfs.zfs.scrub_delay: Number of ticks to delay scrub
vfs.zfs.vdev.scrub_max_active: Maximum number of I/O requests of type scrub active for each device
vfs.zfs.vdev.scrub_min_active: Initial number of I/O requests of type scrub active for each device
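
For example, to push a resilver into the background on a busy pool, you could raise the delay per resilver I/O and shrink the time slice it gets per txg (the values below are only illustrative, not recommendations):

Code:
# de-prioritise resilver I/O in favour of normal pool traffic
sysctl vfs.zfs.resilver_delay=10
sysctl vfs.zfs.resilver_min_time_ms=500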
 
Second: You think you might be able to run a PB-size RAIDZ3 file system without any redundancy. But in practice, that won't work for very long. If you look at the error rate of modern drives, it is actually quite likely that during this highly degraded operation, you'll find a single sector error in a disk that's still powered up. And now, without any redundancy, that will cause "game over". Moral of that story: In large systems and with modern large disks, you need more redundancy just to cover the expected number of read errors.


No, it won't be a PB pool with a single PB-sized vdev. Regardless of which way we do it, the max vdev size will be 15 drives - so there will be 21 vdevs, each its own 15-disk raidz3 ...
 