Other Offline uncorrectable sectors

Del.Mar · Nov 30, 2020

Hi there!
I see several errors on the console during the day:

Bash:

Nov 30 21:15:58 beast smartd[625]: Device: /dev/ada1, 996 Currently unreadable (pending) sectors
Nov 30 21:15:58 beast smartd[625]: Device: /dev/ada1, 737 Offline uncorrectable sectors
Nov 30 21:15:58 beast smartd[625]: Device: /dev/ada3, 1742 Currently unreadable (pending) sectors
Nov 30 21:15:58 beast smartd[625]: Device: /dev/ada3, 139 Offline uncorrectable sectors

I can't find any solution to fix these errors.
My system is FreeBSD 11.4-RELEASE-p2 amd64
smartctl -l selftest /dev/ada1

Bash:

smartctl 7.1 2019-12-30 r5022 [FreeBSD 11.4-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     47724         3511462393
# 2  Extended offline    Completed: read failure       60%     46363         613541240
# 3  Extended offline    Completed: read failure       10%     44203         4010471768
# 4  Short offline       Completed: read failure       10%     43881         1565558144
# 5  Extended offline    Completed without error       00%      1201         -
# 6  Conveyance offline  Completed without error       00%         5         -

smartctl -l selftest /dev/ada3

Bash:

smartctl 7.1 2019-12-30 r5022 [FreeBSD 11.4-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%       442         1878828456
# 2  Extended offline    Interrupted (host reset)      40%     64616         -
# 3  Extended offline    Completed without error       00%      1214         -
# 4  Conveyance offline  Completed without error       00%         0         -

My ZFS pool works normally:

Bash:

  pool: storage
 state: ONLINE
  scan: scrub repaired 35.5M in 0 days 10:25:55 with 0 errors on Fri Oct  2 20:47:59 2020
config:

        NAME           STATE     READ WRITE CKSUM
        storage        ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0

errors: No known data errors

SirDice · Nov 30, 2020

Vovas said:
I can't find any solution to fix these errors.

Replace the disk(s); the errors are only going to multiply. It means the spare area of the disk is now full and it cannot remap bad sectors any more. Because two of the three disks in your RAID-Z are showing these errors, you are balancing on the edge of a cliff.

ralphbsz · Nov 30, 2020

The good news: Your zpool can tolerate 1 fault (it is using the RAID-Z encoding). It has not been overwhelmed by disk faults yet.

The bad news: Your disks are having some faults; ada1 reports 996 pending sectors and ada3 reports 1742. This might mean that they are failing.

The messages on the console are not so much errors as information: There is a daemon called "smartd" running on your machine, which regularly contacts the disks and asks them for their internal health monitoring data, which is called SMART data. The disks are reporting that they have a certain number of sectors that are no longer readable (let's not waste time on the difference between pending and offline).

The confirming news: the self-test logs agree with those counters. The most recent extended self-tests on both disks ended with "Completed: read failure".

Here is where I would start: What is the current error count? You get this by running smartctl -a /dev/adaX, and then searching the output for attributes 196, 197 and 198. The easiest way to read that: smartctl -a /dev/adaX | egrep "^19[678]".

Once we have that, we have to figure out whether the disks are really failing. When did this start? How far back do you have data? Has the count been slowly increasing over the last few months or weeks? How old are the disks? Are they the same model and age? I find it terribly suspicious that two disks are suddenly failing at about the same time. That would indicate either a common cause (like large mechanical vibration, or extremely bad air quality, or temperature effects), or disks that are exactly the same age and from the same manufacturing batch (and even then, it is very unlikely), or something else is wrong.

In the meantime: Begin obtaining replacement disks. It is 99% likely that you will have to replace these. Also start a zpool scrub, to see whether you really don't have any errors, and make sure your backups are up-to-date and in a known location.

Del.Mar · Dec 1, 2020

ralphbsz said:
Here is where I would start: What is the current error count? You get this by running smartctl -a /dev/adaX, and then searching the output for attributes 196, 197 and 198. The easiest way to read that: smartctl -a /dev/adaX | egrep "^19[678]".

Thanks for the reply! Here is the output:

Code:

root@beast:/home/vovas # smartctl -a /dev/ada1 | egrep "^19[678]"
196 Reallocated_Event_Count 0x0032   083   083   000    Old_age   Always       -       117
197 Current_Pending_Sector  0x0032   198   198   000    Old_age   Always       -       996
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       737
root@beast:/home/vovas # smartctl -a /dev/ada3 | egrep "^19[678]"
196 Reallocated_Event_Count 0x0032   090   090   000    Old_age   Always       -       110
197 Current_Pending_Sector  0x0032   197   197   000    Old_age   Always       -       1742
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       139
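To tell whether these counters are still climbing, the raw value (the last column of each row) can be pulled out and logged from time to time. A minimal sketch in Python, assuming the column layout shown above; the names `ROW`, `WATCHED` and `raw_counts` are made up for illustration, and the sample text is the ada1 output pasted from this post rather than a live call to smartctl:

```python
import re

# One row of the smartctl attribute table, as pasted above:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Attribute IDs discussed in this thread.
WATCHED = {196, 197, 198}


def raw_counts(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    counts = {}
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCHED:
            counts[m.group(2)] = int(m.group(3))
    return counts


# The ada1 rows from the output above.
ADA1 = """\
196 Reallocated_Event_Count 0x0032   083   083   000    Old_age   Always       -       117
197 Current_Pending_Sector  0x0032   198   198   000    Old_age   Always       -       996
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       737
"""

print(raw_counts(ADA1))
# {'Reallocated_Event_Count': 117, 'Current_Pending_Sector': 996, 'Offline_Uncorrectable': 737}
```

Saving that dictionary with a timestamp once a day would be enough to see whether the counts are rising.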

ralphbsz said:
When did this start?

About a month ago, maybe more. I found this error after the transmission daemon crashed with a core dump.

ralphbsz said:
How far back do you have data?

It's not important. Just movies and TV shows, nothing more. It goes back 7 years.

ralphbsz said:
Has the count been slowly increasing over the last few months or weeks?

Maybe; I can't say for sure.

ralphbsz said:
How old are the disks? Are they the same model and age?

More than 7 years. All of them are the same model:

Code:

ada1: <WDC WD30EZRX-00MMMB0 80.00A80> ATA8-ACS SATA 3.x device
ada1: Serial Number WD-WCAWZ2542724
ada2: <WDC WD30EZRX-00MMMB0 80.00A80> ATA8-ACS SATA 3.x device
ada2: Serial Number WD-WCAWZ2490581
ada3: <WDC WD30EZRX-00DC0B0 80.00A80> ACS-2 ATA SATA 3.x device
ada3: Serial Number WD-WMC1T0439039

ralphbsz said:
Begin obtaining replacement disks.

Already started.

ralphbsz said:
And make sure your backups are up-to-date and in a known location.

All important data is saved to other storage. Thanks!

ralphbsz · Dec 1, 2020

In theory, counter 196 (reallocated event count) should have started going up earlier, before 197 and 198 started showing up. An increase in 196 means that a disk sector went bad, but the drive was able to store the data in a reallocation area (for example because the next operation was a write). The point at which counter 196 starts increasing is typically a good time to start replacing the disk.

What is interesting is: Two disks out of 3 failing so close to each other in time, after 7 years of use. Even with correlation from the same batch this is surprising.

SirDice · Dec 1, 2020

ralphbsz said:
What is interesting is: Two disks out of 3 failing so close to each other in time, after 7 years of use.

The timing doesn't seem to match up. Assuming both disks are the same age and have been in the same system, the timings of the short and long tests don't line up. At some point tests were run, perhaps with a previous installation of smartd. On ada3 the first listed (most recent) failed test happened at 442 hours, but time seems to have gone backwards: the test before that happened at 64616 hours. I don't know the size of the field, but it looks like it wraps around at 16 bits (65535). Without looking at 240 (Head_Flying_Hours), that would make ada3 much older than ada1.

As the rest of the short and long tests don't line up either, there's certainly a difference in age. As there's also a large gap between the first and the more recent tests, it stands to reason the disks were not tested for a long period. It's therefore possible ada3 has been having uncorrectable errors for a while before they started showing up on ada1.
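The wraparound arithmetic can be checked with a few lines of Python. This is only a sketch under the assumption stated above, namely that the LifeTime(hours) column is a 16-bit counter that restarts at 0 after 65535; `unwrap_hours` is a made-up helper name, and the input lists are the hour stamps from the two self-test logs in the first post, newest entry first:

```python
WRAP = 1 << 16  # 65536, assuming a 16-bit hour counter


def unwrap_hours(entries_newest_first):
    """Undo 16-bit wraparounds in a self-test log's hour stamps.

    Input and output are both ordered newest first, like the log itself.
    """
    unwrapped = []
    offset = 0
    prev = None
    # Walk from the oldest entry to the newest; a drop in the stamp
    # is taken to mean the counter wrapped once more.
    for hours in reversed(entries_newest_first):
        if prev is not None and hours < prev:
            offset += WRAP
        unwrapped.append(hours + offset)
        prev = hours
    return list(reversed(unwrapped))


ada1 = [47724, 46363, 44203, 43881, 1201, 5]
ada3 = [442, 64616, 1214, 0]

print(unwrap_hours(ada1))  # [47724, 46363, 44203, 43881, 1201, 5]
print(unwrap_hours(ada3))  # [65978, 64616, 1214, 0]
```

Under that assumption the latest ada3 test ran at 65978 power-on hours against 47724 for ada1, a gap of roughly two years of continuous running.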

Del.Mar · Dec 1, 2020

I ran a zpool scrub; here are the results:

Bash:

  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 6.66M in 0 days 07:04:07 with 0 errors on Tue Dec  1 23:13:40 2020
config:

        NAME           STATE     READ WRITE CKSUM
        storage        ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     1
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0    42

errors: No known data errors

SirDice · Dec 1, 2020

Looks like you've been lucky so far and nothing has been permanently damaged yet. I'd replace ada3 first; that appears to be the oldest drive.

Del.Mar · Dec 3, 2020

SirDice said:
Looks like you've been lucky so far and nothing has been permanently damaged yet.

Yea

Thanks for info and help, guys!

One more question: what is the correct way to swap an HDD in my pool? Should I use the replace command?

SirDice · Dec 3, 2020

Vovas said:
Should I use the replace command?

Yes.
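
As a rough illustration of the general form, `zpool replace pool old_device [new_device]`, here is a small Python sketch that only builds and prints the command rather than running it. The device names are assumptions: `gpt/disk3` is guessed to be the label on ada3, and `gpt/disk3-new` is a hypothetical label for the new disk; check `zpool status` and your own labels before running anything.

```python
import shlex


def zpool_replace_cmd(pool, old_dev, new_dev=None):
    """Build the argv for `zpool replace pool old_dev [new_dev]`.

    new_dev is left out when the new disk takes over the old device name.
    """
    argv = ["zpool", "replace", pool, old_dev]
    if new_dev is not None:
        argv.append(new_dev)
    return argv


# Hypothetical labels, for illustration only (see the note above).
cmd = zpool_replace_cmd("storage", "gpt/disk3", "gpt/disk3-new")
print(shlex.join(cmd))  # zpool replace storage gpt/disk3 gpt/disk3-new
```

Since RAID-Z here tolerates only one fault, it makes sense to replace one disk, wait until `zpool status` shows the resilver has finished, and only then move on to the second one.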
