ZFS Weird situation with zpool: long-time desync of mirror?...

Hello

My case may look weird to you; indeed, it looks weird to me as well.

I have a server with a zpool configured as a mirror (RAID1). It worked for years, and yesterday smartmontools showed me that one of the disks is failing.

I ordered a replacement, started the rebuild, and everything went fine... until I saw that the newest file on the filesystem dates back to 2019.

The pool looks pretty healthy, the data resilvered just fine, and I have the whole list of expected snapshots - but they are from 2018-2019.

I've reattached the damaged disk and am now struggling with two zpools with one GUID (it looks like the data on the failing disk was not overwritten, and that's my only hope for now).

Of course, I've closed that console and cannot be 100% sure that yesterday's zpool status showed OK; but I'm pretty sure I could not have overlooked an error message. And there were two disks in the pool.

The only idea, besides a temporal portal, is that my RAID1 got desynced back in 2019 and everything kept running until I replaced the only working disk; but I have no idea how to check for that error. Maybe something could be done with zdb to make sure?

I repeat, the situation looks weird, but that's how it is.
 
until I saw that the newest file on the filesystem dates back to 2019.
You probably mean that you checked the ctime or mtime of the files, and that none of them were created or modified after 2019, and you are sure that this file system should contain newer files. Did I understand you correctly?
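If it helps, a quick way to verify that would be something like this (the mountpoint is only a guess on my part):

Code:
# Print anything on the pool that was modified after the suspicious date;
# /data is just an example mountpoint.
find /data -newermt "2020-01-01" -print | head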

I've reattached the damaged disk and am now struggling with two zpools with one GUID
That is weird ... I don't know how you can have two different zpools with the same GUID. I would have thought that the GUID defines the identity of the zpool. But it seems I'm wrong on that.

The only idea, besides a temporal portal, is that my RAID1 got desynced back in 2019 and everything kept running until I replaced the only working disk; but I have no idea how to check for that error.
That is a plausible explanation: in 2019, you disconnected (logically, not necessarily physically) disk B from the zpool, and then for ~2 years didn't notice that you were not running with a RAID-1 zpool but only with a single disk. Disk B therefore contains a frozen-in-time copy of what the world looked like in 2019. Now disk A has started failing, B has become reattached in some manner, and suddenly you are looking at the frozen copy.

Maybe something could be done with zdb to make sure?
Yes, that's exactly the right tool. Except that I use it so rarely that I don't have pre-cooked instructions ready for you (to serve and eat; as you can see, I'm hungry). So start reading the zdb man page yourself. In the meantime, I propose a very simple experiment: disconnect both disks. Connect one of them, boot, and see what you get. Disconnect it again, connect the other one, and see what you get. Each time, write down exactly the ID numbers of the disks and pools you see. Maybe this simple experiment will help clarify things.
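For a first, read-only look, something along these lines should be enough (the device name here is only an example; use your actual ones):

Code:
# Dump the ZFS labels on a vdev: pool name, pool GUID, txg and the last
# configuration that device saw. Read-only and safe to run.
zdb -l /dev/gpt/data0

# List pools visible for import, with their GUIDs, without importing anything.
zpool import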
 
All 24 monthly snapshots, exactly as configured in /etc/periodic.conf, existed on the zpool. But they were from 2018-2019! So no accidental rollback was possible.
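For reference, this is roughly the check I mean (data is the pool name):

Code:
# List all snapshots with their creation dates, oldest first.
zfs list -r -t snapshot -o name,creation -s creation data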

I can't remember exactly what zpool status showed me before the disaster, but I should have seen that the mirror was in a degraded state. I really should have. In the past 25 years I don't recall anything close to overlooking an error in zpool status.

And the broken disk seems to be really heavily damaged, leaving too little chance that the system was running on it.

The thought that I missed a broken zpool configuration doesn't feel too good, but that seems to be the only not-totally-weird possibility.

Right now I've had to destroy the 2019 pool in a last attempt to recover anything. Looks like I'll be checking the txg numbers on my mirrors weekly for some time.
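Something like this is what I have in mind for the weekly check (a rough sketch; device names are the ones from my pool):

Code:
# Compare the txg values recorded on both sides of the mirror; a side that
# lags far behind for weeks is no longer being kept in sync.
# (On newer zdb versions, adding -u also dumps the uberblocks, whose txg is
# the more direct indicator.)
for dev in /dev/gpt/data0 /dev/gpt/data1; do
    echo "== ${dev}"
    zdb -l "${dev}" | grep -w txg
done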
 
After several reboots (!) the zpool on the old drive suddenly changed from FAULTED to DEGRADED, imported, and is now resilvering.

Next time, I'll screenshot every damned thing. Every. Every command. Every output.
 
Code:
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 622G in 0 days 10:49:15 with 1563 errors on Tue Jan 12 07:44:16 2021
config:

        NAME              STATE     READ WRITE CKSUM
        data              DEGRADED 1.53K     0     0
          mirror-0        DEGRADED 1.53K     0 4.54K
            replacing-0   UNAVAIL      0     0     0
              679321529   FAULTED      0     0     0  was /dev/gpt/data0
              gpt/data1   ONLINE       0     0     0
            gpt/olddata1  ONLINE   1.53K     0 4.54K

What's going on with it NOW? Is it degraded, unavailable, faulted, or successfully resilvered?...
 
You can use script(1) for that task. Just start it with script some_log_file.txt, then do whatever you need; finally, exit terminates the logging.
script is of very little help when working with ephemeral tools like mfsBSD: on reboot, everything will disappear. I'm content with copy-paste for now. But things are really strange with it.
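To be fair, pointing script at something persistent would probably survive the reboot; a sketch, assuming some writable filesystem is mounted at /mnt:

Code:
# Log the whole rescue session to persistent storage instead of the RAM disk.
script /mnt/zfs-rescue-session.txt
# ... zpool / zdb commands ...
exit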
 