ZFS Mirrors: How to prevent "schizophrenia"

Question: How does ZFS prevent the "schizophrenic brain" problem when using a mirror vdev and the disks fail one at a time?

Let me explain with an example. A computer has exactly two disks A and B, which are set up as a mirror in ZFS. In normal operation, all data is written to both A and B simultaneously and transactionally. Now disk B fails (for example the SATA connector is loose): no problem, the system continues running in degraded mode on disk A only. Most data is still mirrored, but new data is only on A. No problem so far. Now we do a clean shutdown and reboot, and by coincidence when the system comes up, disk A has become invisible, but B is back online (for example both SATA connectors are loose). The system will start writing even newer data on B in degraded mode. If there is one particular file that has been modified in both periods (when only one disk each was available), we now have two conflicting sets of changes. Given the design of a system that has only two disks and no other form of storage, this seems unavoidable.

But at some point, both drives are suddenly available again (for example the human sys admin noticed that something was fishy, and reseated both connectors). What happens now? The resilvering process will try to apply the changes (transactions) from one disk to the other, but once it gets to the file that was modified twice and has two inconsistent copies on the two disks, what will it do?

By the way, I'm not asking "what should ZFS do", nor "what do other storage systems do". I'm just trying to find out what would happen in a situation with just two drives in ZFS, if you get to the schizophrenia point. If disks were fail-fast (once they become disconnected, they never come back to life, which can always be forced by re-formatting them before admitting them back), this problem would not even arise. But in the real world, disks are not fail-fast, and there is value in re-admitting a previously failed disk, which makes this "schizophrenia" unavoidable. The standard technique for dealing with this is to require at least 3 disks or two disks + a tiebreaker device, and use a quorum majority to allow operation, but ZFS can run with just two disks.
 
When reading volume configs from disks, the one with the higher last transaction ID is kept (IIRC). Maybe something similar is used in case of split brain.
 
Interesting question. So you're looking for something like the Jepsen Test for ZFS mirrors. I was not able to find anyone that has done that work in a cursory Google search.

I'm no distributed systems expert, but my feeling is there is no way to avoid losing the committed writes on drive A in the scenario you describe. My reading of the Oracle ZFS documentation suggests that drive A would not be automatically added back to the mirror. Operator intervention would be required, and hopefully the operator would be notified of the impending data loss.

The only way I can think of to recover those writes would be to bring up the pool with just drive A (and hopefully in read-only mode) and copy the data off somewhere.

Strikes me this would not be hard to test, but I'm far too lazy to do it.

Sorry I can't directly answer your question, and I hope I haven't just added noise to your thread.
 
Thank you for the pointer to Jepsen! I didn't know that there is a consulting company that explicitly has turned testing distributed systems (particularly distributed databases) for consistency into a business, but that's a wonderful thing. I had heard about "the guy at database conferences that wears all leather clothing and skewers their consistency claims", but didn't know who he is.

As to the question "there is no way to avoid losing", it's complicated. Ultimately, Eric Brewer's CAP theorem tells you that if you want Availability (meaning the system comes back up after both disks get reconnected), you need to give up Consistency or Partition tolerance. This is particularly painful with just two disks (or two nodes in a distributed system), because failure of any one forces you to make the painful choice: do I want to continue running, or do I want to endanger my users' data? In large storage systems, that problem is solved by always having way more than 2 things, so in the case of the (common) single failure, the bulk of the system still functions correctly. On a commodity PC with just two disks, that's hard to do; the only way I know of finding a third entity (a.k.a. tiebreaker or witness disk) is to use the motherboard's BIOS NVRAM, but that is super hacky and causes all manner of other problems.

But the question here is this: What does ZFS do in the real world when this happens? Covecat sketched one scenario: ZFS completely rejects the disk that has the older updates. Honestly, I don't even know how it can define "older" without relying on a clock (Lamport shows his ugly face here). Rejecting the whole disk seems heavy handed: Since in a file system updates are to an individual file or directory, it would often be possible to keep most or all updates, as long as they don't conflict within a single file/directory. Is that what users want (tough question), and is that what ZFS implements (my question)? Note that mixing updates from two copies can lead to inconsistencies which users might not expect; on the other hand, rejecting whole disks (parts of history) leads to loss of perhaps valuable updates. It's a tough choice.

And, as you hinted at: What is the user interface when this happens? Are the error messages clear? How can one dig oneself out when it happens? And I agree with you: It would not be hard to test, if one had (a) a spare computer with at least 2 disks (better 3, so the boot disk can stay in), and (b) a spare weekend. I have neither. At work, I get paid to reason about these things, but not in the context of commodity systems with just two disks, nor for ZFS.
 
[...] Now disk B fails (for example the SATA connector is loose): no problem, the system continues running in degraded mode on disk A only. Most data is still mirrored, but new data is only on A. No problem so far.
Agreed.
Now we do a clean shutdown and reboot, and by coincidence when the system comes up, disk A has become invisible, but B is back online (for example both SATA connectors are loose). The system will start writing even newer data on B in degraded mode.
AFAIK, disk B (now on its own, having been part of a degraded two-way mirror before the reboot) will not simply come back online after the reboot; it'll be "marked". I'm not sure, but I doubt that with only disk B powered (disk A unpowered in your example) the pool will be imported and you'll be able to write to B. I'll see if I can find anything more conclusive.
[...] But in the real world, disks are not fail-fast, and there is value in re-admitting a previously failed disk, which makes this "schizophrenia" unavoidable. The standard technique for dealing with this is to require at least 3 disks or two disks + a tiebreaker device, and use a quorum majority to allow operation, but ZFS can run with just two disks.
In most "normal two-way mirrors" one disk failing a bit (bad block) is a big problem for data integrity; one disk failing completely can be safely detected and overcome. ZFS notices when data isn't complete and fully accompanied with the correct checksums. As I understand it your question revolves about the decision to "choose" between equivalent data where (perhaps different blocks on different disks of the mirror) could/should be reconstructed to its original value as it was sent from the system to the mirror; AFAIK this is a kind of reconstruction that ZFS doesn't do (at least not automatically by itself): the pool containing the ZFS mirror fails at an earlier stage.
 
It sounds like you're speaking from a hypothetical, and since the question is about prevention: Fix the issue before you shut down. Either remove the failed drive completely, or fix the problem (plug the SATA connector back in, fully, zpool online, possibly scrub) before shutting down.

Once you have a pool diverge with two histories on the two sides of the mirror, there's no good way to rectify it. You'll mostly just have to pick one version of history, destroy the other side, and resilver from scratch; there's no shortcut, you'll have to rebuild the full mirror again. You can maybe manually salvage some data by importing the other side as a new pool name, as sketched below.
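A hedged sketch of that salvage path, assuming the pool is currently exported, only the diverged disk is attached, and "tank"/"tank_old" are placeholder names:
Code:
# List what ZFS can see without actually importing anything.
zpool import

# Import the diverged side read-only under a new name so nothing on it
# can change while you look around; -f may be needed since the pool was
# not cleanly exported from this disk's point of view.
zpool import -f -o readonly=on tank tank_old

# Copy off whatever you need, then export it again.
zpool export tank_old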

If this is a major risk factor in your view for building out a system, consider something better than a two-disk mirror. A three-disk mirror or raidz would do better. raidz[123] also won't allow this scenario to happen at all.
 
(Disk A failed, but during a reboot disk B vanishes and disk A comes back ...)
AFAIK, disk B (now on its own, having been part of a degraded two-way mirror before the reboot) will not simply come back online after the reboot; it'll be "marked".
That would be the ideal solution. But to implement this, ZFS has to be able to remember that disk A is now "stale". Where can it store that information? It has to be on either of the two disks A or B, since there is no other place it can write to. And it can't be stored on disk A, since that disk was unreachable while disk B was in use. Storing it on disk B is pointless, since that one is absent when A comes back. So I don't see a way to implement this in general (without using a third storage location, such as a tie-breaker or witness disk).

It sounds like you're speaking from a hypothetical, and since the question is about prevention: Fix the issue before you shut down. Either remove the failed drive completely, or fix the problem (plug the SATA connector back in, fully, zpool online, possibly scrub) before shutting down.
That works fine for a human-attended system, such as a desktop. Not so much for a server that sits in an unattended basement, and reboots whenever the power goes up and down. (Yes, we have a UPS, but that only gives you 10 minutes for a clean shutdown.)

Once you have a pool diverge with two histories on both sides of the mirror, there's no good way to rectify it. You'll mostly just have to pick one version of history and destroy the other side and resilver from scratch; there's no shortcut, you'll have to perform the full mirror again.
And that is exactly my question: How does ZFS do this when left unattended? As far as I can see, it won't even notice that something went wrong, until suddenly both disks are back online. What does it do then? One good answer would be: Refuse to use the pool completely, wait for human intervention. There are many other possible good answers.

If this is a major risk factor in your view for building out a system, consider something better than a two-disk mirror.
Absolutely agree. If this is for business or professional storage, the suitable number of disks is measured with fingers on multiple hands. What brought on this question is that I'm going to soon re-install my home server, and that will have (a) one large ZFS pool for home, with just two spinning rust drives, and (b) a system/boot ZFS pool, most likely with a single SSD, perhaps using fast async backups using snapshots. Given power consumption and the size of my server box, buying a third spinning drive seems silly.

This should be easy enough to test out with a file-backed pool, and moving the files around between various export/import cycles.
Great idea, didn't think of it. I'll try to find some time for it ASAP.
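In case it helps, a rough sketch of such a file-backed test; the paths and sizes are arbitrary placeholders:
Code:
# Create two sparse backing files and mirror them into a throwaway pool.
mkdir -p /root/testing
truncate -s 512m /root/testing/A /root/testing/B
zpool create testpool mirror /root/testing/A /root/testing/B

# Simulate losing one side: export, hide a backing file, re-import.
zpool export testpool
mv /root/testing/B /root/testing/B.hidden
zpool import -d /root/testing testpool    # imports DEGRADED, only A present
touch /testpool/a_only                    # write something while degraded

# Swap which side is visible and repeat, then finally reveal both.
zpool export testpool
mv /root/testing/B.hidden /root/testing/B
mv /root/testing/A /root/testing/A.hidden
zpool import -d /root/testing testpool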
 
That works fine for a human-attended system, such as a desktop. Not so much for a server that sits in an unattended basement, and reboots whenever the power goes up and down. (Yes, we have a UPS, but that only gives you 10 minutes for a clean shutdown.)

You can use BIOS or UEFI settings to stay powered off after power comes back. If this hypothetical is a real concern to you, use this and manually intervene to bring it back up. Also, set up the ZFS event daemon to send you an email when things are going awry. It should immediately notify you when one side of the mirror disappears.
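For the email part, a minimal sketch using OpenZFS's zed(8); the config path and address are assumptions (on some systems the file lives elsewhere, and the daemon itself still has to be enabled and running):
Code:
# /etc/zfs/zed.d/zed.rc  (location may differ on your platform)
ZED_EMAIL_ADDR="admin@example.com"   # hypothetical address to notify
ZED_NOTIFY_INTERVAL_SECS=3600        # rate-limit repeated notifications
ZED_NOTIFY_VERBOSE=1                 # also mail on "informational" events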

And that is exactly my question: How does ZFS do this when left unattended? As far as I can see, it won't even notice that something went wrong, until suddenly both disks are back online. What does it do then? One good answer would be: Refuse to use the pool completely, wait for human intervention. There are many other possible good answers.

If both drives are visible, ZFS will import the pool using the uberblock that claims the newest transaction ID (and possibly only if the MOS checksum is good, too). In this hypothetical scenario, you're concerned that the previously-disconnected drive is now the only one online. It won't know about the other disk with newer transactions. You're just SOL (feel free to test this with file-backed vdevs to be certain).
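If you want to see which side would win before importing, the vdev labels can be inspected while the pool is exported; a hedged sketch using the file-backed paths from the test later in this thread (exact flags and output fields vary between ZFS versions):
Code:
# Dump the labels (which record the pool config and a txg) from each side.
zdb -l /root/testing/A | grep -E 'txg|guid|state'
zdb -l /root/testing/B | grep -E 'txg|guid|state'
# The side showing the higher txg generally carries the newer history.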

What brought on this question is that I'm going to soon re-install my home server [..] buying a third spinning drive seems silly.

I'll say that this hypothetical is sufficiently rare to not worry about it. If you're still worried, buy a third disk and use raidz. It'll be superior to saving $100 in the immediate term.
 
Thank you for the pointer to Jepsen! I didn't know that there is a consulting company that explicitly has turned testing distributed systems (particularly distributed databases) for consistency into a business, but that's a wonderful thing. I had heard about "the guy at database conferences that wears all leather clothing and skewers their consistency claims", but didn't know who he is.
I'm glad you like it! I'm a big fan. I found him years ago when the startup where I was working was considering using Mesos. His thorough and convincing skewering of that system helped me avert that disaster before it happened.

He's definitely a character. The name "Jepsen" comes from a hit song from the time when he started doing his tests, "Call Me Maybe" by Carly Rae Jepsen. The idea was that these flaky distributed systems would not call you back in the presence of failures.

As to the question "there is no way to avoid losing", it's complicated. Ultimately, Eric Brewer's CAP theorem tells you that if you want Availability (meaning the system comes back up after both disks get reconnected), you need to give up Consistency or Partition tolerance. This is particularly painful with just two disks (or two nodes in a distributed system), because failure of any one forces you to make the painful choice: do I want to continue running, or do I want to endanger my users' data? In large storage systems, that problem is solved by always having way more than 2 things, so in the case of the (common) single failure, the bulk of the system still functions correctly. On a commodity PC with just two disks, that's hard to do; the only way I know of finding a third entity (a.k.a. tiebreaker or witness disk) is to use the motherboard's BIOS NVRAM, but that is super hacky and causes all manner of other problems.
Yeah, some systems go into a degraded mode where writes are not allowed once they fall below a minimum number of nodes. I think that's the only way to be sure. It's not super useful for the two-way mirror case.

But the question here is this: What does ZFS do in the real world when this happens? Covecat sketched one scenario: ZFS completely rejects the disk that has the older updates. Honestly, I don't even know how it can define "older" without relying on a clock (Lamport shows his ugly face here).
Well the disk would not have gone through a clean shutdown, and therefore would be marked dirty. I'm hoping ZFS wouldn't just resuscitate the pool with a single dirty disk.

Rejecting the whole disk seems heavy handed: Since in a file system updates are to an individual file or directory, it would often be possible to keep most or all updates, as long as they don't conflict within a single file/directory. Is that what users want (tough question), and is that what ZFS implements (my question)? Note that mixing updates from two copies can lead to inconsistencies which users might not expect; on the other hand, rejecting whole disks (parts of history) leads to loss of perhaps valuable updates. It's a tough choice.
My feeling is merging the non-conflicting updates would be too complicated and error prone, but I've written exactly zero filesystems.
 
Sure, 3-way mirror works. Until the hypothetical expands to two drives being detached at once. :)
You just dramatically decreased the chances of such a failure, though. You multiply the probabilities of the individual events with each other, and that product shrinks rapidly when each probability is less than 1, as is the case for disks. Say the probability of one disk failing is 1/3 (which would be ridiculously high); then the probability of two disks failing is 1/9.

Also, with a 3-way system you can do things like reject writes when you have less than 2 healthy disks. That doesn't work as well in the 2-disk case, cause any failure leaves you with a read-only system.
 
Also, with a 3-way system you can do things like reject writes when you have less than 2 healthy disks.
In that observation lies the core of the best solution. Let me explain by introducing some formalism. You have a system with N disks. Each of these disks can be up or down; taken together, you have an N-bit number or string describing the state of the disks. Whenever a disk leaves or comes back, that string changes. Say you have exactly one process that controls the operation of the system (in a single-node system that is trivially the case; in a multi-node system you use group services to make it true, but that's very complicated by itself). The controlling process knows the number N, and it knows the current N-bit string of which disks are up right now. The controller wants to take action whenever the string changes, for example to allow the file system to be read and/or written. If the controller is up and running when the string changes (when a disk goes up or down), that is easy to implement: it can keep the previous value of the string in memory, done.

Where it gets hard is: The string of up/down flags can change while the controller is not running. For example, because the whole computer is powered down. Now to notice a change in the string, the controller needs to store the old copy of the string, which includes the knowledge of how many disks were there to begin with (the length of the bit string). Where will it store it? It has to be on the disks themselves, since we don't know of any other storage mechanism. But the disks can themselves come and go! The easiest way to ensure that the controller, when it is running, always knows the previous state of the disks, is to set a rule that guarantees an overlap between the set of disks written when the controller was last up, and the set of disks that are up when it is starting. To guarantee that overlap, the controller simply only runs when MORE THAN half of the disks are up. Because that guarantees that at least one disk was in both the set of disks running before the shutdown/crash/power outage, and running now.

From this, we can do a simple counting exercise to determine how many disks one can lose and still be running. If there is only one disk, then that one disk has to be present, or else nothing matters; that case is trivial. If there are three disks, we can lose any one disk at a time, and record on the two survivors which disk has gone missing. Interestingly, the same is true for four disks: we can lose only one disk, because if we lose two, we might record the fact that disks C and D are gone on disks A and B, but then restart on disks C and D later. In general, the number of disks you can lose is (N-1)/2 rounded down. This is a real-world application of the CAP theorem: if you want consistency and partition tolerance, you need to give up some availability, and refuse to operate the system when too many disks are down.
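A minimal sketch of that rule; nothing ZFS-specific here, just the counting argument above:
Code:
#!/bin/sh
# Quorum rule: operate only if strictly more than half of the N member
# disks are visible; that guarantees overlap with the last set of disks
# that recorded the previous state.
N=5        # total disks (example value)
UP=3       # disks currently visible (example value)
if [ $((2 * UP)) -gt "$N" ]; then
    echo "quorum held: safe to continue"
else
    echo "no quorum: refuse writes, wait for an operator"
fi
# Maximum losses that still leave quorum: (N-1)/2 rounded down.
echo "tolerated failures: $(( (N - 1) / 2 ))"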

That leaves the painful case of 2 disks. By the logic above, it can NOT tolerate loss of a single disk, if you want consistency in the case of partitioning (and re-partitioning later). That's correct, but so impractical that it is usually not accepted by users. After all, they bought the second disk to get redundancy (safe operation when a disk goes down), and it is rude to tell them that if a disk goes down, all they can do is read the old one (and perhaps not even that) until redundancy has been restored. Perhaps that is what ZFS does though: If it sees an unclean shutdown (the state of the 2-way mirrored pool was not updated on disk), and one of the mirror copies is missing, it refuses to import the pool. My question is that ZFS might be trying to be too friendly/useful/forgiving/... and allow operation (including writes) on a degraded mirror.

chungy said: "You can use BIOS or UEFI settings to stay powered off after power comes back." Yes, that would work, but it is throwing out the baby with the bathwater. Nearly all the time when the system boots, both disks come back up. Sadly, at our house we have lots of experience with this: We live very near Silicon Valley in California, but in a slightly rural and mountainous area near the coast. Last winter we had a lot of rain and even several snow storms, and our power infrastructure kept failing (as did a lot of other stuff, such as roads; we were cut off from the world several times for a day at a time). By our neighbor's count, between January and March we had 38 days without power. Now, my home server has a UPS (which keeps it up for 10-15 minutes and allows clean shutdowns), and our house has two generators: one that starts automatically and uses propane, and a second manual gasoline one that we use to relieve the propane generator. Plus, when the power is out, we turn off the generators at night (long enough to save propane/gasoline, not so long that stuff in the refrigerator goes bad). I think my server had at least 100 reboots during the winter months. We got lucky to not completely run out of propane; several neighbors did (because of road damage, the propane delivery trucks couldn't make it for many weeks). And frequently, when the power goes out, there is also no phone or internet service for several days; one of the projects I need to work on before next winter is to set up an emergency router/controller that uses a cell phone modem and can at least be used to turn the generator on or off (because if the power goes down when we're not at home, we have no way to control the generator remotely).

And to be clear: I'm not terribly worried about this case. The scenario which causes it (disk A down, then simultaneous B down and A back up during an outage) is vanishingly rare in practice. I'm definitely going to continue to use ZFS (it is the best file system available to amateurs), but I was just curious what might happen in this case, and how I might recognize and cure the problem if it happens. And with that, I'm off for the day to work on widening our road, so it is more passable to fire trucks in the summer and propane delivery in the winter.
 
ZFS won't refuse to import a pool missing some, but not all, components of a mirror. I suppose you could script your way to checking the output of zpool import before doing a real import, but this also depends on your pool not being the root file system (since the loader has no concept of avoiding this either)
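A hedged sketch of such a pre-import check; "tank" is a placeholder pool name and the DEGRADED grep is deliberately crude:
Code:
#!/bin/sh
# Look before you leap: list importable pools without importing anything.
STATUS=$(zpool import 2>&1)
echo "$STATUS"
if echo "$STATUS" | grep -q DEGRADED; then
    # One side of the mirror is missing: import read-only so the
    # surviving half cannot diverge any further.
    zpool import -o readonly=on tank
else
    zpool import tank
fi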

ZFS mirrors are designed so that all disks but one may fail and you suffer no data loss. If you'd like something more robust, just buy a third disk and use raidz. It'll save a whole lot of headache.
 
And to be clear: I'm not terribly worried about this case. The scenario which causes it (disk A down, then simultaneous B down and A back up during an outage) is vanishingly rare in practice. I'm definitely going to continue to use ZFS (it is the best file system available to amateurs), but I was just curious what might happen in this case, and how I might recognize and cure the problem if it happens. And with that, I'm off for the day to work on widening our road, so it is more passable to fire trucks in the summer and propane delivery in the winter.
That actually sounds like fun to me. I need to get out more.

Have you considered Starlink for Internet service?
 
Have you considered Starlink for Internet service?
Yes. Two neighbors have it. I have DSL (delivered over phone wires). Personally, I don't like the concept of putting data flows into the air, because the air is a limited resource, which means you have to multiplex many flows in the time or frequency domain, using up spectrum. With wires, I can have 25 flows in parallel, using just a cable that is half an inch thick and already installed underneath the road or on phone poles. That seems more efficient to me. As far as bandwidth, latency and cost are concerned, they are similar; in our area, DSL is somewhat cheaper for a plan that delivers about 50 Mbit/second, and that is more than enough for a household with 3 adults. Their reliability profiles are different: DSL is totally reliable (nearly never out, and when it is out, only for a few minutes), except when there is a widespread power outage, and then DSL is dead as a doornail. In contrast, Starlink seems to go out for half a minute or a minute at a time, frequently or even regularly, but its overall availability is still good.

For me the deciding factor is that I don't like Elon, and his brand of populist hype and dishonesty.
 
Yes. Two neighbors have it. I have DSL (delivered over phone wires). Personally, I don't like the concept of putting data flows into the air, because the air is a limited resource...
I think more people would know this if they'd been around in the bad old days of 10Base2. You could see the single piece of coax everybody was on.

For me the deciding factor is that I don't like Elon, and his brand of populist hype and dishonesty.
It has been sad to see him descend into madness. I'm hoping he'll be too busy fighting culture wars to have time to screw up SpaceX/Starlink.
 
It has been sad to see him descend into madness.
I agree. My parents live in bumfcsk Florida, where there are no providers: satellite only, with Hughes and others.
Starlink was a given. But now, only 18 months later, the rate is up to $110/month and it is not issue-free.
Speed has diminished by half, and they are being charged extra because they live in a congested area.
Horrible that people in rural areas are paying more. It was supposed to be an equalizer.
 
And to be clear: I'm not terribly worried about this case. The scenario which causes it (disk A down, then simultaneous B down and A back up during an outage) is vanishingly rare in practice. I'm definitely going to continue to use ZFS (it is the best file system available to amateurs), but I was just curious what might happen in this case, and how I might recognize and cure the problem if it happens. And with that, I'm off for the day to work on widening our road, so it is more passable to fire trucks in the summer and propane delivery in the winter.
So here is what happens, when tested with files that I could move to hide them while the pool is exported...
  • Created a pool from files A and B
  • wrote a file to the pool (a_and_b)
  • exported the pool
  • 'hid' B
  • Asked zpool to import from the directory holding now only A; got this back:
Code:
~/testing # zpool import -d /root/testing/
   pool: testpool
     id: 9721517078564191623
  state: DEGRADED
status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
 config:

        testpool             DEGRADED
          mirror-0           DEGRADED
            /root/testing/A  ONLINE
            /root/testing/B  UNAVAIL  cannot open

To ZFS's credit, it does point out that "The fault tolerance of the pool may be compromised if imported." If you want access to your data at this point (after one drive has failed) you can choose to import it and live with being in a degraded state. Note you could import with -o readonly=on to avoid potential split-braining.
  • Imported the pool ( zpool import -d /root/testing testpool). It will happily do this as you still have sufficient replicas, and you are asking to get to your data.
  • Write a new file a_only
  • exported; hid A, revealed B; imported. zpool status testpool now reports:
Code:
  pool: testpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
config:

        NAME                      STATE     READ WRITE CKSUM
        testpool                  DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            13684143910699073401  UNAVAIL      0     0     0  was /root/testing/A
            /root/testing/B       ONLINE       0     0     0

  • There is no a_only file (how could there be?)
  • Write to a b_only file
  • Export; reveal A (both now visible) and import one last time.
  • zpool status testpool now shows:
Code:
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Wed Aug 16 20:48:52 2023
config:

        NAME                 STATE     READ WRITE CKSUM
        testpool             ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            /root/testing/A  ONLINE       0     0     1
            /root/testing/B  ONLINE       0     0     0

errors: No known data errors

  • Only b_only and a_and_b (original; while a happy mirror) files exist.
My take-aways:
  • Of course it will let you import with only one device -- you (the user) would be upset if your mirror refused to function in this state -- but it will point out that things are not normal. It also points out (so long as you list potential imports before importing) that redundancy may be degraded by importing when both devices are not available.
  • If you write to one (as I did initially above with A), and then the available device (on a mirror of 2) "swaps" before your next import, it will happily import the other device (here B) as the pool, as it has no way of knowing what you did last summer (last import).
  • When importing with both available again, 'B' (the one "visible" later in time during the partial-blindness phase) won this time (see next bullet). It scrubs up and says "hey, things are wrong here" (note the cksum error above on 'A'). You can clear this and carry on, but data written to only A (a_only, while B was hidden) is lost at this point. As you've pointed out, there is really no way around this for any configuration of two devices where one disappears, and then the visible state swaps between the two. (Assuming you want to be able to write during the partial-visibility states.)
  • Looking at the uberblock format, and looking at the code, B "won" either because it has a higher txg_id or (if the txg_ids match) because B's latest uberblock is more recent (.ub_timestamp). I tried to do only the same number of commands in each state, so the txg_ids might have lined up. This says that in general the device that has had more (separate) work performed (as measured by transactions) will win on import.
    • If you manage to exactly match transactions, the newer one (timestamp) wins the tie, and failing that, it looks to see if you have multihost protection on, which stores some additional sequencing data.
    • For my own curiosity, I repeated the entire test with more work in the only A state (verified higher txg number on A than B before final import via zdb), and, even though B was modified later, A won on import. (The code doesn't lie.)
  • I was a bit surprised that ZFS didn't say "hey, I can tell these two devices are from the same pool, but their contents have diverged; what would you like to do?" -- because until the import, both a_only and b_only are still recoverable from the available media. OTOH, if this is your boot pool, would you rather it hang, or say "well, this one you've used more, let's go with it" and get back up and running? (Especially for this academically interesting, but likely rare in practice, situation.)
  • I guess my real preference would be option C -- import read-only and continue boot, all while yelling loudly that I need to decide what to do. ralphbsz alluded to this above with "My question is that ZFS might be trying to be too friendly/useful/forgiving/... and allow operation (including writes) on a degraded mirror."
Well that was a fun diversion. Hopefully this scratches the curiosity itch you had.
 
Thank you so much! Very thorough, and it exactly answers my question.

And my summary would be, quite similar to what you wrote: The ZFS developers managed to find a way to build a compromise between correctness and availability. They allow a user to mount a pool with half the disks missing, which gives better availability. They tell the user that things are iffy. When the divergence occurs, they don't rely on magic (which doesn't exist), but report errors. Doing the "absolute best thing" when merging two sets of (potentially conflicting) changes is super hard, so they do something simple and adequate.

What do I learn from this, as a human administrator of a mirrored pool? When I lose one disk in a mirror and have to restart, I need to either make the pool readonly (if I expect the second disk to come back soon), or I need to make sure I never let the second disk be modified and then admitted back in. This is a very good use case for storing out-of-band information, namely putting a yellow sticky note on the console which says "disk B was missing last Tuesday, don't reattach it".
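For that first option, a hedged sketch of two ways to pin the surviving half read-only until the mirror is whole again ("tank" is a placeholder pool name):
Code:
# If the pool is not yet imported, import it read-only.
zpool import -o readonly=on tank

# If it is already imported, make the datasets read-only (a dataset
# property, inherited by children; it does not stop ZFS's own
# pool-level bookkeeping).
zfs set readonly=on tank
# ...and undo it once the mirror has been resilvered:
zfs inherit readonly tank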

Again, thank you for putting effort into this! I'm planning to upgrade/reinstall my only FreeBSD system in the next few weeks, and after that I could repeat this experiment myself (right now, the system is in continuous use with an older ZFS version).
 
So here is what happens, when tested with files that I could move to hide them while the pool is exported...
Wow! Thanks for the thorough experiment!

  • Looking at the uberblock format, and looking at the code, B "won" either because it has a higher txg_id or (if the txg_ids match) because B's latest uberblock is more recent (.ub_timestamp). I tried to do only the same number of commands in each state, so the txg_ids might have lined up. This says that in general the device that has had more (separate) work performed (as measured by transactions) will win on import.
    • If you manage to exactly match transactions, the newer one (timestamp) wins the tie, and failing that, it looks to see if you have multihost protection on, which stores some additional sequencing data.
    • For my own curiosity, I repeated the entire test with more work in the only A state (verified higher txg number on A than B before final import via zdb), and, even though B was modified later, A won on import. (The code doesn't lie.)
  • I was a bit surprised that ZFS didn't say "hey, I can tell these two devices are from the same pool, but their contents have diverged; what would you like to do?" -- because until the import, both a_only and b_only are still recoverable from the available media. OTOH, if this is your boot pool, would you rather it hang, or say "well, this one you've used more, let's go with it" and get back up and running? (Especially for this academically interesting, but likely rare in practice, situation.)
I super don't like this. ZFS can detect that the contents in the two supposedly identical mirrors have diverged. Ideally it should refuse to import A at this point, or to do anything with it other than demand manual recovery. Seems to me it will lose data without yelling sufficiently loudly about this fact.

  • I guess my real preference would be option C -- import read-only and continue boot, all while yelling loudly that I need to decide what to do. ralphbsz alluded to this above with "My question is that ZFS might be trying to be too friendly/useful/forgiving/... and allow operation (including writes) on a degraded mirror."
Agreed!

Well that was a fun diversion. Hopefully this scratches the curiosity itch you had.
Thanks again.
 
I guess this can also be tested with a real mirror, by manually disconnecting the SATA cable of disk B, then (after a reboot) disk A, and finally bringing both back online.
 