ZFS problem, need help recovering

MortyPied

New Member


Messages: 5

Hi All,

Really hope someone can help me here.

FreeBSD 12.0-RELEASE r341666 GENERIC amd64

I've got a 4TB WD SATA drive in the machine. At the time the problem occurred I was copying files onto the drive, so it was busy. The disk is /dev/ada0. There are other ZFS pools on the server, but they are fine, so I've not included any of their details below.

I've pasted as much info as I can think of below, please let me know if there is anything else that would help with analysis.

Thank you!

Code:
# zpool status
  pool: data4
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Recovery is possible, but will result in some data loss.
    Returning the pool to its state as of Tue Jun 25 22:34:41 2019
    should correct the problem.  Approximately 5 seconds of data
    must be discarded, irreversibly.  Recovery can be attempted
    by executing 'zpool clear -F data4'.  A scrub of the pool
    is strongly recommended after recovery.
   see: http://illumos.org/msg/ZFS-8000-72
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    data4       FAULTED      0     0     1
      ada0      ONLINE       0     0     6
Code:
# zpool clear -F data4
cannot clear errors for data4: I/O error
Code:
# zdb -l /dev/ada0
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'data4'
    state: 0
    txg: 4
    pool_guid: 8419719978496350932
    hostid: 979813928
    hostname: 'quark'
    top_guid: 1185416255203146580
    guid: 1185416255203146580
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 1185416255203146580
        path: '/dev/ada0'
        whole_disk: 1
        metaslab_array: 67
        metaslab_shift: 35
        ashift: 12
        asize: 4000782221312
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
(LABEL 1, 2 and 3 show the same)

Code:
# zdb data4
zdb: can't open 'data4': Input/output error

ZFS_DBGMSG(zdb):
spa_open_common: opening data4
spa_load(data4, config trusted): LOADING
disk vdev '/dev/ada0': best uberblock found for spa data4. txg 35185
spa_load(data4, config untrusted): using uberblock with txg=35185
spa_load(data4, config untrusted): FAILED: unable to open rootbp in dsl_pool_init [error=5]
spa_load(data4, config untrusted): UNLOADING
Code:
# zdb -k data4
zdb: Tried to read config of pool "data4" but spa_get_stats() failed with error 5

ZFS_DBGMSG(zdb):
spa_open_common: opening data4
spa_load(data4, config trusted): LOADING
disk vdev '/dev/ada0': best uberblock found for spa data4. txg 35185
spa_load(data4, config untrusted): using uberblock with txg=35185
spa_load(data4, config untrusted): FAILED: unable to open rootbp in dsl_pool_init [error=5]
spa_load(data4, config untrusted): UNLOADING
spa_load(data4, config untrusted): spa_load_retry: rewind, max txg: 35184
spa_load(data4, config untrusted): LOADING
disk vdev '/dev/ada0': best uberblock found for spa data4. txg 35184
spa_load(data4, config untrusted): using uberblock with txg=35184
spa_load(data4, config trusted): spa_load_verify found 1 metadata errors and 19 data errors
spa_load(data4, config trusted): LOADED
spa_load(data4, config trusted): UNLOADING
 

ralphbsz

Daemon

Reaction score: 1,302
Messages: 2,103

You say the disk is /dev/ada0, but you also say that there are other ZFS on the machine. Is ZFS really using the whole disk without partitions, and the other ZFS file systems or pools are on other physical disks? Or is the disk partitioned? That will matter for how to diagnose this.

It says you have an IO error. That error should show up in dmesg and/or /var/log/messages. Unless the IO error is simply that the disk vanished or is no longer present (which you can verify).

To start diagnosing the root cause of the IO error, if it is not something obvious (loose connector?), look at the output from smartctl, to see what the disk health is.
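A minimal check sequence along those lines might look like this on FreeBSD (a sketch, assuming the disk really is /dev/ada0 and that smartmontools is installed from ports or packages):

```shell
# Look for transport or media errors the kernel has logged.
dmesg | grep ada0
grep ada0 /var/log/messages

# Confirm the disk is still attached and enumerated.
camcontrol devlist

# Overall SMART health, attributes and the device error log.
smartctl -a /dev/ada0

# Kick off a short self-test; results show up in later smartctl -a output.
smartctl -t short /dev/ada0
```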
 

MortyPied

New Member


Messages: 5

Thanks for the reply ralphbsz.

Yes, ZFS has the whole of ada0. ada1p2 is zroot and there's also disk ada2 with a couple of ufs partitions. ada0 is on a PCI SIL3114 4 port SATA raid controller hence the strange ordering.

ada0 is only mentioned in /var/log/messages or dmesg in relation to boot messages, no other mentions of the disk at all.

Code:
Jun 26 21:01:53 blah kernel: ada0 at ata2 bus 0 scbus0 target 0 lun 0
Jun 26 21:01:53 blah kernel: ada0: <WDC WD40PURZ-85TTDY0 80.00A80> ACS-3 ATA SATA 3.x device
Jun 26 21:01:53 blah kernel: ada0: Serial Number WD-WCC7K2XADXC6
Jun 26 21:01:53 blah kernel: ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
Jun 26 21:01:53 blah kernel: ada0: 3815447MB (7814037168 512 byte sectors)
I've taken the machine down and reseated sata cable at both ends for that disk.

I've run a smartctl short offline self test - "Completed without error". I've got a long test running now.

More smartctl output:

Code:
# smartctl -a /dev/ada0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40PURZ-85TTDY0
Serial Number:    WD-WCC7K2XADXC6
LU WWN Device Id: 5 0014ee 210244c8e
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Jun 26 21:28:20 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249)    Self-test routine in progress...
                    90% of test remaining.
Total time to complete Offline 
data collection:         (43680) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 463) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   169   021    Pre-fail  Always       -       6533
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       53
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   112   111   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        52         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Thanks.
 

ralphbsz

Daemon

Reaction score: 1,302
Messages: 2,103

Using a whole disk for a file system is bad style, but will function perfectly well, and it is simple. Why is it bad style? First, using a partition allows you to give the partition a symbolic name, so if the discovery order changes (and /dev/ada0 becomes /dev/ada3), you don't get confused. By the way, ZFS doesn't become confused. Also, you can make the partition a little smaller than the whole disk (perhaps 1% smaller). Then if your disk becomes ill, you can buy a replacement disk, and copy the data online, even if the replacement disk is a slightly different model with slightly smaller capacity. But this really is just a style issue, not a functionality issue, and doesn't explain your problem.
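As a sketch of that style: creating a GPT partition slightly smaller than the disk, with a stable label, and building the pool on the label might look like this (hypothetical sizes and label name; this destroys any data on the disk):

```shell
# WARNING: wipes ada0. Illustration only.
gpart create -s gpt ada0

# Ask for a bit less than the full 4 TB to leave replacement slack,
# and attach a stable GPT label.
gpart add -t freebsd-zfs -l data4 -s 3960G ada0

# Build the pool on the label, not the raw device, so a change in
# probe order (ada0 becoming ada3) is harmless.
zpool create data4 gpt/data4
```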

The SMART data shows that your disk does not know of any problems. That probably means that any IO errors were not internal errors of the disk, but communication errors. But you already reseated the connector, which is where 90% of communication errors happen.

The thing which is disturbing is this. The various ZFS commands above clearly said IO error. And AFAIK, all IO errors are logged in /var/log/messages (with the default kernel and syslog configuration, and I assume you haven't messed with those). But there were no IO errors in /var/log/messages. These facts contradict each other.

How to go forward? You already did the first step, reseating the cable, and demonstrating (with smartctl) that communication with the disk is possible. Here's what I would do next: read the disk, just to prove that it works. Try this command: dd if=/dev/ada0 of=/dev/null bs=1m (on FreeBSD the size suffix is lowercase, see dd(1)). This would run for several hours if you let it, so after half an hour interrupt it with Ctrl-C. If it runs for at least that long, it pretty clearly demonstrates that the disk is readable without errors. If it fails with errors, you have a diagnosable (and perhaps fixable) problem.

If it succeeds ... then it gets ugly. ZFS says you have IO errors, while doing IO to the disk clearly says that you don't. In that case, all I can suggest is retrying the zdb and zpool clear commands, and starting a scrub if possible.
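Those two steps could be sketched as a small script (assuming the pool is still named data4 and the disk is /dev/ada0):

```shell
#!/bin/sh
# Sequentially read the whole device. A transport or media problem
# will show up as a non-zero exit status and a kernel log message.
# (On FreeBSD, press Ctrl-T while it runs to see progress.)
if dd if=/dev/ada0 of=/dev/null bs=1m; then
    echo "disk read cleanly; retrying pool recovery"
    zpool clear -F data4 && zpool scrub data4
else
    echo "read error: check dmesg and /var/log/messages" >&2
fi
```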
 
Last edited by a moderator:

MortyPied

New Member


Messages: 5

Hi,

Long time with no updates on my part, mostly due to trying to get additional RAM for the machine, as the zpool commands sometimes failed with "out of memory". I could not get a second RAM stick that worked with the original RAM and ended up buying all new RAM, but that's another story...

/var/log/messages continues to report only ada0 boot messages, nothing else, e.g.:

Code:
Sep 11 22:21:54 quark kernel: ada0 at ata2 bus 0 scbus0 target 0 lun 0
Sep 11 22:21:54 quark kernel: ada0: <WDC WD40PURZ-85TTDY0 80.00A80> ACS-3 ATA SATA 3.x device
Sep 11 22:21:54 quark kernel: ada0: Serial Number WD-WCC7K2XADXC6
Sep 11 22:21:54 quark kernel: ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
Sep 11 22:21:54 quark kernel: ada0: 3815447MB (7814037168 512 byte sectors)
The suggested "dd if=/dev/ada0 of=/dev/null bs=1M" works fine; in fact I had already dd'ed the entire disk to an external drive just so I had a backup (and yes, it took many, many hours).

Other than that, all the z-commands run in earlier posts still report the same now as they did then. A scrub returns:

Code:
# zpool scrub data4
cannot scrub 'data4': pool is currently unavailable
Any further advice gratefully received!
 

PMc

Aspiring Daemon

Reaction score: 193
Messages: 540

Interesting story. I largely agree with ralphbsz. I think the dd command gives the most substantial information beyond what SMART might tell us (which is just something the disk figures out internally, and which is therefore to be taken with a grain of salt).

So what we might have here is an image of a pool which is complete and readable, but which contains inconsistent data in such a way that ZFS considers it an I/O error. I vaguely remember having seen this before, i.e. there can be errors that are not due to a flawed read operation, but to flawed data coming from a successful read.

We can consider this pool as just an image of data, in fact nothing other than a file (albeit quite a big one). And this file appears to contain data inconsistent in such a way that ZFS is not able to continue.
We could now try to create a copy of the disk and read that (to make certain the problem has nothing to do with the disk), and/or attach the disk to a different computer (to make certain the problem has nothing to do with the computer).
One might also try to debug the thing with different versions of ZFS.

Then, once we are certain that the problem is indeed in the data, the next approach would be to surgically repair that data. I have never had to do that, so I do not know how difficult it might get; it certainly requires learning a bit about the internals of ZFS. I would start by looking closely into the features provided by zdb, and into the code of the functions reported by zdb, to figure out what actually happens there.
If that is too difficult, there are also companies offering data-rescue services, which should have experience with such things.
 

Ordoban

Member

Reaction score: 20
Messages: 92

After I had to recover files from a damaged ZFS pool too, I wrote this tool: https://github.com/Stefan311/ZfsSpy. It's very limited, but it worked for me.
Someone on GitHub pointed me to this interesting post: https://groups.google.com/forum/#!msg/mailing.freebsd.hackers/QzQA32BmaqQ/AoLEZrfkElMJ
It describes how to recover data with built-in ZFS commands.
You can also try experimenting with zpool's hidden flags. Look at this command:

Code:
zpool import -N -o readonly=on -f -R /pool <pool>
It will try to import the pool in read-only mode, so no data is written to it. It also doesn't mount anything on import, so if any filesystem is damaged you have less chance of triggering a coredump. zpool import also has a hidden -T switch that gives you the ability to select the transaction you want to try to restore. You'll need a list of the available transactions first:

Code:
zdb -ul <vdev>
Given a vdev, this lists all uberblocks with their respective transaction ids. You can take the highest one (it's not necessarily the last one listed) and try to import the pool with:

Code:
zpool import -N -o readonly=on -f -R /pool -F -T <transaction_id> <pool>
Then check the available filesystems. You should not try only the highest transaction id; also try older transactions (the 2nd, 3rd, 4th... highest).
 

MortyPied

New Member


Messages: 5

OK, fixed!

I carefully read the links which Ordoban posted and did this:

Code:
zpool export data4
zpool import -F data4
That's all there was to it. All the files were available. I still have no idea what originally caused the problem.

Thanks to everyone who contributed to this post!
 