ZFS Mirrors: How to prevent "schizophrenia"

Eric A. Borisch Thanks, that was a good experiment. The transaction id stuff is interesting and makes sense; working mirrors should have the same "this is the last txg_id" because that's what mirrors do? :)
Degraded mirrors. If one starts from "I have a degraded thing, and the degraded part is going to be replaced with a new thing-part," then allowing writes to a degraded mirror makes sense, because when the thing-part gets replaced with a new one, the resilver will pick up everything written while the mirror was degraded.
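To make the txg idea concrete, here is a rough Python sketch (purely illustrative, not actual ZFS code; the names and fields are made up) of why comparing the last committed txg recorded on each side tells you which one is current: an in-sync mirror has equal txgs, and after a degraded period the surviving side carries the higher txg, so the returning or replaced side gets resilvered from it.

```
# Illustrative only -- a toy model of "highest last txg wins", not ZFS internals.
from dataclasses import dataclass

@dataclass
class MirrorSide:
    name: str
    last_txg: int   # hypothetical: last transaction group this device committed

def pick_current_side(a: MirrorSide, b: MirrorSide) -> MirrorSide:
    """Return the side whose data is newer; equal txgs mean the mirror is in sync."""
    if a.last_txg == b.last_txg:
        return a                    # in sync: either side is authoritative
    return a if a.last_txg > b.last_txg else b

# Example: ada1 missed some writes while detached, so ada0 is current
# and ada1 would be resilvered from it.
disk_a = MirrorSide("ada0", last_txg=1842)
disk_b = MirrorSide("ada1", last_txg=1790)
print(pick_current_side(disk_a, disk_b).name)   # -> ada0
```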

Is that a desirable position? Sixes and threes. For some yes, for some no. Take away the mirror and replace it with any other RAID-Z configuration in a degraded state. The only difference to me is that each device in a mirror has all the information standalone; all the other configurations rebuild the missing data from parity (I think this is true).

I don't know what the correct answer to ralphbsz OP is, but I think actual understanding of the behavior is a good thing.
 
There is no "correct" answer. My question was really: what does ZFS do in the real world? There is a tension here, between being maximally correct and safe (no data is lost, corrupted, or made up), maximally available (mirrors should work when one copy is still functioning), and tolerating partitions (if you first write A, then write B, the end result should be a reasonable compromise). A system can choose which of those conflicting goals to come closest to. I think Eric did a great job answering my question, and I hope we all have learned something from the discussion; I certainly have.

Obnoxious side remark: A lot of these questions can be avoided by throwing more hardware at the problem. For example, about a decade ago I worked on a storage system that required at least 11 disks; given that our target market was systems with several hundred to the low hundreds of thousands of disks, that minimum was reasonable. For a home or small business system, which is expected to work on a 1U rackmount or compact server with just two disks, such a minimum would be impractical.
 
Very interesting topic, although the outcome was quite predictable. I would like to comment on the following fragment:
The easiest way to ensure that the controller, when it is running, always knows the previous state of the disks is to set a rule that guarantees an overlap between the set of disks written when the controller was last up and the set of disks that are up when it is starting. To guarantee that overlap, the controller simply runs only when MORE THAN half of the disks are up, because that guarantees that at least one disk was in both the set of disks running before the shutdown/crash/power outage and the set running now.

From this, we can do a simple counting exercise to determine how many disks one can lose and still be running. If there is only one disk, then that one disk has to be present, or else nothing matters; that case is trivial. If there are three disks, we can lose any one disk at a time, and record on the two survivors which disk has gone missing. Interestingly, the same is true for four disks: we can lose only one disk, because if we lose two, we might record the fact that disks C and D are gone on disks A and B, but then restart on disks C and D later. In general, the number of disks you can lose is (N-1)/2 rounded down. This is a real-world application of the CAP theorem: if you want consistency and partition tolerance, you need to give up some availability, and refuse to operate the system when too many disks are down.

That leaves the painful case of 2 disks. By the logic above, it can NOT tolerate the loss of even a single disk if you want consistency in the case of partitioning (and re-partitioning later).
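As a quick sanity check on the quoted counting exercise, here is a tiny Python sketch (just the arithmetic, nothing ZFS-specific) of the "run only with MORE THAN half the disks up" rule:

```
def max_losable(n_disks: int) -> int:
    """Disks that can be lost while a strict majority remains up."""
    return (n_disks - 1) // 2

for n in (1, 2, 3, 4, 5, 6, 7):
    print(f"{n} disks: majority needs {n // 2 + 1} up, can lose {max_losable(n)}")
```

The output matches the text: 3 and 4 disks both tolerate a single loss, and 2 disks tolerate none if you insist on a strict majority.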

Your calculations and the logic behind them are correct. This is visible in Sun's previous RAID solution, Solstice DiskSuite. I still use it on one machine. In SDS, selected slices are combined into a virtual block device with the desired redundancy (RAID-0/1/5/01/10) for UFS mkfs-ing. Information about all arrays is stored in state database replicas. It's up to you where you place them and how many copies you create. The key part is that you need a majority of the database replicas to boot Solaris. If you have created 4 replicas, you can only lose one.
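The same majority arithmetic applies to the replicas. A minimal Python sketch, assuming the rule as described above (a strict majority of state database replicas is needed to boot; the function name is made up, not an SDS interface):

```
def can_boot(total_replicas: int, available_replicas: int) -> bool:
    """True if a strict majority of state database replicas is still available."""
    return available_replicas > total_replicas / 2

# With 4 replicas, losing one still leaves a majority (3 of 4),
# but losing two does not (2 of 4 is only half).
print(can_boot(4, 3))   # True  -> can lose one replica
print(can_boot(4, 2))   # False -> cannot lose two
```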
 