ZFS data errors on seemingly working disk

For the past few months I have been collecting data on my external ZFS disk connected via USB 3.0. The pool was working as expected and zpool status wasn't showing any errors. I'm not sure whether I ever ran zpool scrub on it, but I believe it was run and came back clean.
A few nights ago I started copying (rsync-ing) the data to another disk. Everything was fine until errors started to appear:

I thought that I had accidentally moved the USB cable and disconnected the disk while it was being read, but moving the disk to an internal SATA bay didn't help.

After running zpool clear and scrubbing it (again):
Code:
  pool: bckp-ext
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:49:08 with 708391 errors on Thu Sep 12 18:45:40 2019
config:

    NAME        STATE     READ WRITE CKSUM
    bckp-ext    ONLINE       0     0  699K
      ada1p1    ONLINE       0     0 1.38M

errors: 708357 data errors, use '-v' for a list

rsync errors:
Code:
slike svedska - sortirano/Linköping Gamla/17030104.jpg
 read errors mapping "/mnt/bckp-ext/slike/slike svedska - sortirano/Linköping Gamla/17030104.jpg": Input/output error (5)
slike svedska - sortirano/Linköping Gamla/17030105.jpg
 read errors mapping "/mnt/bckp-ext/slike/slike svedska - sortirano/Linköping Gamla/17030105.jpg": Input/output error (5)
slike svedska - sortirano/Linköping Gamla/17030106.jpg
...
WARNING: slike svedska - sortirano/Linköping Gamla/17030104.jpg failed verification -- update discarded (will try again).
WARNING: slike svedska - sortirano/Linköping Gamla/17030105.jpg failed verification -- update discarded (will try again).
WARNING: slike svedska - sortirano/Linköping Gamla/17030106.jpg failed verification -- update discarded (will try again).

Code:
zpool status -xv | grep 1703010
...
        /mnt/bckp-ext/slike/slike svedska - sortirano/Linköping Gamla/17030104.jpg
        /mnt/bckp-ext/slike/slike svedska - sortirano/Linköping Gamla/17030105.jpg
        /mnt/bckp-ext/slike/slike svedska - sortirano/Linköping Gamla/17030106.jpg
...

Code:
# zdb -l /dev/ada1p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'bckp-ext'
    state: 0
    txg: 15108279
    pool_guid: 7930193435851463028
    hostid: 1061846452
    hostname: 'ProjectBSD'
    top_guid: 3352315782749581932
    guid: 3352315782749581932
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 3352315782749581932
        path: '/dev/ada1p1'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 31
        ashift: 12
        asize: 246955900928
        is_log: 0
        DTL: 102
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'bckp-ext'
    state: 0
    txg: 15108279
    pool_guid: 7930193435851463028
    hostid: 1061846452
    hostname: 'ProjectBSD'
    top_guid: 3352315782749581932
    guid: 3352315782749581932
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 3352315782749581932
        path: '/dev/ada1p1'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 31
        ashift: 12
        asize: 246955900928
        is_log: 0
        DTL: 102
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 2
------------------------------------
    version: 5000
    name: 'bckp-ext'
    state: 0
    txg: 15108279
    pool_guid: 7930193435851463028
    hostid: 1061846452
    hostname: 'ProjectBSD'
    top_guid: 3352315782749581932
    guid: 3352315782749581932
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 3352315782749581932
        path: '/dev/ada1p1'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 31
        ashift: 12
        asize: 246955900928
        is_log: 0
        DTL: 102
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 3
------------------------------------
    version: 5000
    name: 'bckp-ext'
    state: 0
    txg: 15108279
    pool_guid: 7930193435851463028
    hostid: 1061846452
    hostname: 'ProjectBSD'
    top_guid: 3352315782749581932
    guid: 3352315782749581932
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 3352315782749581932
        path: '/dev/ada1p1'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 31
        ashift: 12
        asize: 246955900928
        is_log: 0
        DTL: 102
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

smartctl -a /dev/ada0
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Scorpio Blue Serial ATA (AF)
Device Model:     WDC WD2500BPVT-22JJ5T0
Serial Number:    WD-WX31A13F3617
LU WWN Device Id: 5 0014ee 6587db803
Firmware Version: 01.01A01
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Sep 12 18:36:11 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         ( 7260) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  75) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x7035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   144   141   021    Pre-fail  Always       -       1800
  4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15175
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1759
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       210
191 G-Sense_Error_Rate      0x0032   016   016   000    Old_age   Always       -       84
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66
193 Load_Cycle_Count        0x0032   131   131   000    Old_age   Always       -       208731
194 Temperature_Celsius     0x0022   105   100   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The smartctl output didn't change much over the days:
Code:
16c16
< Local Time is:    Wed Sep 11 07:47:27 2019 CEST
---
> Local Time is:    Thu Sep 12 17:57:27 2019 CEST
61c61
<   4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15139
---
>   4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15175
64c64
<   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1725
---
>   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1759
70,71c70,71
< 193 Load_Cycle_Count        0x0032   131   131   000    Old_age   Always       -       208689
< 194 Temperature_Celsius     0x0022   116   100   000    Old_age   Always       -       27
---
> 193 Load_Cycle_Count        0x0032   131   131   000    Old_age   Always       -       208731
> 194 Temperature_Celsius     0x0022   108   100   000    Old_age   Always       -       35

There are some valuable things on that disk (vacation photos from past years and the like) which were copied onto that temporary disk (not the best backup strategy, I know), so any help will be greatly appreciated :'‑(
 
Your ZFS disk is ada1, but you show the SMART output for ada0. Please add the output for the correct disk.

Before messing around too much with the disk in question, make an image of it with a tool like dd_rescue.
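A minimal sketch of such an imaging run, assuming GNU ddrescue (sysutils/ddrescue) is installed; the device node is taken from the thread, but the target paths are placeholders for wherever you have space on a healthy disk:

```sh
# First pass: grab everything that reads cleanly, skip problem areas quickly.
# The mapfile records progress so later runs can resume and retry.
ddrescue -f -n /dev/ada1p1 /mnt/healthy/bckp-ext.img /mnt/healthy/bckp-ext.map

# Second pass: go back and retry only the bad areas a few times.
ddrescue -f -r3 /dev/ada1p1 /mnt/healthy/bckp-ext.img /mnt/healthy/bckp-ext.map
```

From then on, all recovery attempts can be made against the image instead of stressing the failing hardware further.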
 
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
The action shown in zpool status is what one should do. Sadly, without a backup, neither of those two options is available to you.

You may have applied force/vibration or the like to your disk, which caused the data corruption.
See SMART attribute 191 (G-Sense_Error_Rate).
Errors caused by unplugging the USB cable should be clearable with zpool clear, as long as no data was actually corrupted.

Note that in a single-disk pool, ZFS is as prone to data corruption as any other filesystem.
The only real difference is that it knows about the corruption.

Without backup, those files are lost.
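For the transient-cable case, the sequence would be roughly this (pool name taken from the thread):

```sh
zpool clear bckp-ext       # reset the READ/WRITE/CKSUM error counters
zpool scrub bckp-ext       # re-read and verify every block in the pool
zpool status -v bckp-ext   # after the scrub: persistent errors list the affected files
```

If the counters stay at zero after the scrub, the errors were transient; if they come back, as happened here, the corruption is real.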
 
Have you tried running a short or long test with smartctl(8)? Start with the short test; if that doesn't show any errors, try the long test.
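The invocations would look something like this (ada1 assumed to be the suspect disk):

```sh
smartctl -t short /dev/ada1     # short self-test, ~2 minutes, runs in the background
smartctl -t long /dev/ada1      # extended self-test, full surface read, can take hours
smartctl -l selftest /dev/ada1  # view the self-test log once a test has finished
```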
 
If you need to back up to an external disk, which is what it sounds like you're doing, and you only have one external disk, consider using copies=2. It won't help if the drive dies entirely, but I'd say about two thirds of the HDD failures I see are partial failures; i.e. one or more sectors fail to read, but the disk is otherwise still usable.

There's no substitute for multiple backups, though. Perhaps look at cloud backup options if your internet speed is fast enough and multiple disks are too much hassle?
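Setting this is a one-liner; note that copies= only affects blocks written after the property is set, so it should be done before the first backup run (the dataset name here is just the pool from this thread):

```sh
zfs set copies=2 bckp-ext   # store two copies of every new block
zfs get copies bckp-ext     # confirm the property took effect
```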
 
Without backup, those files are lost.
That is the scary truth and the expected chain of events :eek:
It was meant as a temporary solution until I got a proper backup disk, but it seems the temporary solution started dying before I got around to copying it. It happens :oops:

Have you tried to run a short or long test with smartctl(8)? Start with the short test first, if that doesn't give any errors try the long test.

10 hours after starting the long test:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%      1761         473690216
# 2  Short offline       Completed without error       00%      1760         -
Yup, it seems the disk is dying.
How do I prevent this next time? Run smartctl every night/weekend at 4 AM? Use another disk for extra copies?

If you need to backup to an external disk, which is what it sounds like you're doing, and you only have one external, consider using copies=2. It won't help you if the drive entirely dies, but I'd say about 2/3rds of the HDD failures I see are partial failures; i.e. one or more sectors fail to read, but the disk is still otherwise usable.
This is the same tactic I used on a disk I already knew was bad. It worked that time :cool:

There's no substitute for multiple backups though. Perhaps look at cloud backup options if your internet speeds are fast enough and the multi-disk is too much hassle?
I am in the process of building a home "server" to which I will connect all the drives.
After this situation maybe I should buy another disk for backup (1 TB drive, 930 GB ZFS partition) and put them in RAIDZ1?
The server will be a laptop with USB-connected drives. The data is mostly vacation photos, music, OS (desktop and cellphone) backups, Android images and other stuff I don't need speedy access to.
Putting at least the photos online would prevent (at least part of) this situation.

Thanks for the support guys!
Now I'll try to hunt for my files across other disks, SD cards and cellphones, and recover at least some of them.
 
It was meant as temporary solution until I got this proper backup disk ...
Old saying: There are two kinds of people. Those that religiously do backups, and those that haven't lost data yet.
Congratulations: you just graduated from the second group to the first one. I'm sorry about that.

How to prevent this next time? Run smartctl every night/weekend at 4AM? Use another disk for extra copies?
Yes, yes, and yes to the third question you didn't ask.
  1. Use RAID: multiple disks, with redundant storage of your data. The minimal version of RAID uses two disks and is known as mirroring (with just two disks you don't use RAIDZ1, you just mirror them). It has reasonable (but not great) reliability, it is relatively cheap (you only need a second disk), but its space efficiency is pretty awful (a factor of 2). If you want great reliability, you need to move to a version of RAID that can handle dual faults, because with modern disks being so large, the probability of getting a second read error while rebuilding after the complete failure of the first disk is not small. The problem is that the smallest version (3 disks with 3-way redundancy) has really awful space efficiency. Multi-fault reliability makes more sense when your data is bigger and you can use more disks (like 2 or 3 extra disks in a group of 5 or 10 disks). That's pretty impractical for home use. But any form of redundancy is the single most important defense against the fact that disks are inherently unreliable.
  2. Yes, running smartctl regularly (every hour, every day) is good. At the minimum, run the smartd daemon, and configure it so you get warnings if the disk reports that it is failing, or if certain error counters start increasing. Running a full smartctl test once a day or week is not generally accepted as a good solution.
  3. The thing you didn't mention: Use ZFS scrub. It has the great advantage that the whole disk is exercised, and all the data is read. With ZFS having checksums on all the data, this both forces the disk to determine that the tracks are still readable (physically in hardware), and it checks that the data is actually good. Scrubbing really helps with data reliability.
Finally, starting some good backup system should be obvious. Consider what will happen if you have a small site disaster (like a house fire); if that worries you, your backup should be off-site. You might want to think about using a cloud storage provider for holding your backups; the big cloud companies and the specialized cloud storage companies all have excellent data reliability. As a matter of fact, here is a strategy you might be able to use: if you know very quickly when new data is added to your filesystem, if you have enough bandwidth to back it up to a remote cloud provider very quickly, and if you can adjust your workflows so you don't have to rely on the data being well protected against disk failure for an hour or two, then you might not need RAID at all and could survive with a single disk, using the cloud as your redundancy and backup.
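As a rough sketch of points 2 and 3 above on FreeBSD (the device node and schedules are examples, not a recommendation for your exact setup):

```sh
# /etc/rc.conf: start the SMART monitoring daemon at boot
smartd_enable="YES"

# /usr/local/etc/smartd.conf: monitor all attributes, mail root on
# trouble, and run a short self-test every night at 02:00
/dev/ada1 -a -m root -s S/../.././02

# /etc/periodic.conf: have the daily periodic(8) run scrub any pool
# whose last scrub is older than 7 days
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="7"
```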
 