Let's say we're going to build a single 1PB zpool. Maybe that's a good idea and maybe it isn't, but let's just pretend that's what we're doing.
Let's also assume raidz3 with 15-disk vdevs, and that we're using 45-disk JBOD enclosures.
There are two ways we can lay this out. The simple way: put three 15-disk vdevs into each JBOD, and just keep adding JBODs until we have a petabyte.
However, there is a second, more complicated thing we could do - we could buy 15 JBODs and stripe each vdev across all 15, so that every vdev has exactly one disk in each JBOD. When we build the vdevs, we just tell ZFS to use non-consecutive disks for the vdev members (#1, #16, #31, and so on ...)
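For concreteness, here's roughly what the two layouts would look like as zpool commands. The device names are made up for illustration - a real system would use whatever names the enclosures actually present:

    # Option 1: each raidz3 vdev lives entirely inside one JBOD
    # (jbod1 holds disks d01-d45, giving three 15-disk vdevs)
    zpool create tank \
        raidz3 jbod1-d01 jbod1-d02 ... jbod1-d15 \
        raidz3 jbod1-d16 jbod1-d17 ... jbod1-d30 \
        raidz3 jbod1-d31 jbod1-d32 ... jbod1-d45
    # ... then keep adding JBODs the same way until we reach 1PB

    # Option 2: each raidz3 vdev takes exactly one disk from each of the 15 JBODs
    zpool create tank \
        raidz3 jbod1-d01 jbod2-d01 jbod3-d01 ... jbod15-d01 \
        raidz3 jbod1-d02 jbod2-d02 jbod3-d02 ... jbod15-d02 \
        ...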
It *seems* like option two is the better one - we can lose three *entire* JBODs and the pool keeps running. With option #1 we cannot lose any JBODs at all - if we lose the connection to any one of them, the zpool is broken immediately.
BUT, I think I would prefer option 1 - please comment on my reasoning:
If I lose an entire JBOD with the "normal" organization, the zpool is broken and stops. But all I need to do is repair the bad connection, bring the JBOD back online, and then roll ZFS transactions back until the zpool is coherent again. This is downtime, but when the downtime is over, I have a normal, healthy zpool.
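(Mechanically, I assume the recovery would look something like this - either clearing the errors if the pool merely suspended, or using ZFS's rollback-on-import if the pool has to be re-imported:

    # after the bad connection is fixed and the JBOD's disks are visible again:
    zpool clear tank       # clear errors if the pool only suspended
    # or, if the pool must be re-imported and won't import cleanly:
    zpool import -F tank   # recovery mode: discards the last few transactions
                           # to get back to a coherent state

I *think* that's the "roll transactions back" step I'm describing - correct me if the failure mode is uglier than that.)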
With option 2, the zpool keeps running, which is very nice, but when I reconnect the JBOD, *all* of my vdevs - every single one - need to resilver, and with that much storage it could take a long, long time ... the pool could sit in a very bad performance state for ages.
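(Very rough numbers, just to scale the problem: 15 JBODs x 45 disks gives 45 vdevs, i.e. 540 data disks behind the raidz3, so holding 1PB means roughly 2TB drives. Even at a generous 100 MB/s effective resilver rate, that is

    2 TB / 100 MB/s = 20,000 s, or about 5.5 hours per disk, best case

and a raidz resilver on a busy pool usually runs far slower than that, since it's dominated by random reads across the whole vdev - and here all 45 vdevs would be resilvering at once, on top of production load.)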
So my conclusion is:
- If you want high availability, striping vdevs across JBODs is a good solution, but it will be ugly if you lose a JBOD
- If you can stand some downtime, a much simpler solution is to keep each vdev inside its own JBOD; if a JBOD becomes detached, just go backwards with transactions in the pool until you are coherent again and restart - at full performance and health.
Comments?