Other Smartctl usage

balanga · Oct 5, 2019

What do I need to look for when running smartctl to identify how healthy a disk is, i.e. whether it is likely to fail in the near future? Are there any key fields to look out for?

Here is what I get from one of my disks when running smartctl -a /dev/ada01

Code:

smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Travelstar 5K250
Device Model:     Hitachi HTS542512K9SA00
Serial Number:    071102BB0200WBGW1RAC
LU WWN Device Id: 5 000cca 530cc4c7a
Firmware Version: BB2OC32P
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 3f
SATA Version is:  SATA 2.5, 1.5 Gb/s
Local Time is:    Sat Oct  5 22:20:52 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  645) seconds.
Offline data collection
capabilities:              (0x51) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  56) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   095   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0023   253   100   033    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       4344
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       131 (0 19)
  7 Seek_Error_Rate         0x000f   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       11319
 10 Spin_Retry_Count        0x0033   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       4079
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   086   086   000    Old_age   Always       -       75956238
188 Command_Timeout         0x0032   100   073   000    Old_age   Always       -       30067261474
190 Airflow_Temperature_Cel 0x0022   065   038   000    Old_age   Always       -       35 (Min/Max 30/41)
191 G-Sense_Error_Rate      0x000a   100   092   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       29688261
193 Load_Cycle_Count        0x0032   052   052   000    Old_age   Always       -       481304
194 Temperature_Celsius     0x0022   157   088   000    Old_age   Always       -       35 (Min/Max 7/62)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x002a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x002a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 8e 51 d6 e4  Error: UNC 1 sectors at LBA = 0x04d6518e = 81154446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 87 51 d6 e4 00      00:48:52.900  READ DMA
  ca 00 08 67 26 11 e0 00      00:48:52.900  WRITE DMA
  c8 00 08 7f 51 d6 e4 00      00:48:52.900  READ DMA
  c8 00 08 0f 73 0d e3 00      00:48:52.900  READ DMA
  c8 00 20 df 3d 1e e2 00      00:48:52.900  READ DMA

Error 7 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 11 8e 51 d6 e4  Error: UNC 17 sectors at LBA = 0x04d6518e = 81154446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 7f 51 d6 e4 00      00:48:48.800  READ DMA
  c8 00 08 2f 73 0d e3 00      00:48:48.800  READ DMA
  c8 00 08 1f 73 0d e3 00      00:48:48.800  READ DMA
  c8 00 08 f7 f3 5b e0 00      00:48:48.800  READ DMA
  c8 00 08 4f 00 00 e0 00      00:48:48.800  READ DMA

Error 6 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 8e 51 d6 e4  Error: UNC 1 sectors at LBA = 0x04d6518e = 81154446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 87 51 d6 e4 00      00:47:30.800  READ DMA
  c8 00 08 7f 51 d6 e4 00      00:47:30.800  READ DMA
  c8 00 20 7f 51 d6 e4 00      00:47:26.900  READ DMA
  ca 00 20 7f 0f 00 e0 00      00:47:26.900  WRITE DMA
  ca 00 20 5f 4d 4c e4 00      00:47:26.900  WRITE DMA

Error 5 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 11 8e 51 d6 e4  Error: UNC 17 sectors at LBA = 0x04d6518e = 81154446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 7f 51 d6 e4 00      00:47:26.900  READ DMA
  ca 00 20 7f 0f 00 e0 00      00:47:26.900  WRITE DMA
  ca 00 20 5f 4d 4c e4 00      00:47:26.900  WRITE DMA
  c8 00 20 7f 0f 00 e0 00      00:47:26.900  READ DMA
  ca 00 20 1f 55 4c e4 00      00:47:26.900  WRITE DMA

Error 4 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 8e 51 d6 e4  Error: UNC 1 sectors at LBA = 0x04d6518e = 81154446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 87 51 d6 e4 00      00:43:49.500  READ DMA
  ca 00 20 bf 72 0d e3 00      00:43:49.500  WRITE DMA
  c8 00 80 b7 2c 75 e1 00      00:43:49.500  READ DMA
  c8 00 38 51 49 0b e0 00      00:43:49.500  READ DMA
  c8 00 08 7f 51 d6 e4 00      00:43:49.500  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11310         -
# 2  Extended offline    Completed without error       00%     11283         -
# 3  Short offline       Completed without error       00%      7461         -
# 4  Short offline       Completed without error       00%      7460         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

CyberCr33p · Oct 5, 2019

The disk is dying. Better backup your data and replace it.

obsigna · Oct 5, 2019

CyberCr33p said:
The disk is dying. Better backup your data and replace it.

Have a closer look. The error messages are all from a single incident at a power-on lifetime of 5003 hours, perhaps due to a power disruption. The most recent extended offline test completed without error at a power-on lifetime of 11310 hours. The Load_Cycle_Count is already quite high, though, so I suggest disabling the disk's APM.
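The "single incident" reading can be checked mechanically. Below is a minimal Python sketch, assuming the smartctl -a text has been saved to a string; the helper names error_hours and hours_since_last_error are made up for illustration, and the sample lines are copied from the output above.

```python
import re

# Lines copied from the smartctl -a output posted earlier in this thread.
SAMPLE = """\
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       11319
Error 8 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
Error 7 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
Error 6 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
Error 5 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
Error 4 occurred at disk power-on lifetime: 5003 hours (208 days + 11 hours)
"""

ERROR_RE = re.compile(
    r"^Error (\d+) occurred at disk power-on lifetime: (\d+) hours", re.M)
POH_RE = re.compile(r"^\s*9\s+Power_On_Hours\s.*\s(\d+)\s*$", re.M)


def error_hours(text):
    """Return {error_number: power_on_hour} for each logged ATA error."""
    return {int(n): int(h) for n, h in ERROR_RE.findall(text)}


def hours_since_last_error(text):
    """Power-on hours between the newest logged error and now, or None."""
    errors = error_hours(text)
    poh = POH_RE.search(text)
    if not errors or poh is None:
        return None
    return int(poh.group(1)) - max(errors.values())


if __name__ == "__main__":
    print(sorted(set(error_hours(SAMPLE).values())))  # [5003]
    print(hours_since_last_error(SAMPLE))             # 6316
```

All five logged errors share the same hour, and the drive has run 6316 power-on hours since then. Note that the log keeps only the five most recent of the eight errors, so the three older ones cannot be checked this way.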

CyberCr33p · Oct 6, 2019

Check Reallocated_Sector_Ct

obsigna · Oct 6, 2019

OK, it is not zero, but the normalized value (100) is far away from the threshold (5). If the number keeps growing, then yes, replace the disk. If not, this might have been caused by a single incident, and is not necessarily an indication of a forthcoming failure.
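To see how far each attribute is from its threshold, compare the normalized VALUE column with THRESH; smartctl flags an attribute as failed once VALUE drops to or below THRESH. Here is a rough Python sketch, assuming the usual ten-column attribute table shown above. The function name threshold_margins is hypothetical, and the rows are copied from the posted output.

```python
# Attribute rows copied from the smartctl output earlier in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   100   095   062    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       131 (0 19)
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       11319
10 Spin_Retry_Count        0x0033   100   100   060    Pre-fail  Always       -       0
"""


def threshold_margins(text):
    """Map attribute name -> (VALUE - THRESH) for every Pre-fail attribute."""
    margins = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if fields[6] != "Pre-fail":
            continue  # Old_age attributes are informational only
        margins[fields[1]] = int(fields[3]) - int(fields[5])
    return margins


if __name__ == "__main__":
    for name, margin in threshold_margins(SAMPLE).items():
        print(f"{name}: {margin} above threshold")
```

For this disk Reallocated_Sector_Ct comes out 95 points above its threshold, even though the raw count is 131. How a vendor maps raw counts to the normalized value is not specified, so a large margin is not a guarantee of health.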

PMc · Oct 6, 2019

CyberCr33p said:
Check Reallocated_Sector_Ct

No issue unless it is growing. It is always possible for a sector to go bad, which is why one needs a backup anyway (a drive can also die at any time, no matter what smartctl reports).

What I don't like is attribute 192, Power-Off_Retract_Count. On all my Hitachi drives, Power-Off_Retract_Count is equal to Load_Cycle_Count, so something seems wrong here. It does not necessarily kill the drive, but these should be monitored and the cause fixed; it might be related to bad power quality.
Attribute 187, Reported_Uncorrect, is also something to watch.
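One possible reason the raw values of attributes 187, 188 and 192 look so large is that the drive packs several 16-bit counters into the one 48-bit raw field. That is an assumption, not something confirmed in this thread or by the vendor. A short Python sketch that splits the posted raw values into 16-bit words (split_words16 is a made-up helper name):

```python
def split_words16(raw, count=3):
    """Split a 48-bit raw value into `count` 16-bit words, lowest word first."""
    return [(raw >> (16 * i)) & 0xFFFF for i in range(count)]


# Raw values copied from the smartctl output earlier in this thread.
RAW_VALUES = {
    "Reported_Uncorrect": 75956238,
    "Command_Timeout": 30067261474,
    "Power-Off_Retract_Count": 29688261,
}

if __name__ == "__main__":
    for name, raw in RAW_VALUES.items():
        print(name, hex(raw), split_words16(raw))
    # Reported_Uncorrect 0x487000e [14, 1159, 0]
    # Command_Timeout 0x700260022 [34, 38, 7]
    # Power-Off_Retract_Count 0x1c501c5 [453, 453, 0]
```

If that reading is right, the actual counts are small numbers (for example 453 rather than 29688261), though what each word means is still vendor-specific. smartctl can display a raw value this way with its -v option, e.g. -v 188,raw16.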

ralphbsz · Oct 6, 2019

Careful: With SATA SMART, there is very little standardization of what the various counters mean. The Travelstar is an old series of 2.5" laptop hard drives (it started out life under IBM, was Hitachi-branded for many years, and is now a WD product). It is quite possible that Power-Off_Retract_Count on this drive has something to do with various sleep states; if you retract the head completely from the platters, it reduces the power consumption of the drive (much less air drag), but the drive can quickly be re-enabled without spinning the platters back up (which causes wear and tear and consumes a lot of power).

Honestly, I don't know whether this drive is good or bad. Counters 5 and 196 (the reallocated stuff) mean that it had some media damage, and the affected sectors had to be reallocated. That's not really supposed to happen, but it does. If this was one catastrophic event, perhaps caused by mechanical shock or by a foreign object entering the drive and settling on the platter, and it doesn't repeat, then it doesn't mean that the drive will have more errors in the future. On the other hand, if it slowly grows, that's bad. The good news is that counter 197 (pending errors) is zero, so all problems have been corrected.

Also: using SMART for failure prediction is roughly 50% accurate: 50% of the drives that predict failure actually continue working; and 50% of all drives that fail show no failure prediction beforehand. So whether to replace this drive or not depends a lot on what the RAID and backup situation is.

SirDice · Oct 7, 2019

CyberCr33p said:
Check Reallocated_Sector_Ct

Current_pending_sectors and offline_uncorrectable are still 0. The reallocated_sector_ct just indicates some sectors have been mapped to a "spare" bit of the drive, which is fine, it's supposed to do that.

rowan194 · Oct 17, 2019

Trying to predict the future based on SMART data is a bit like trying to predict the expected lifetime of a car that sometimes makes a funny noise when you start it: nothing serious could happen for 10 years, or the engine could seize tomorrow.

In general it's more important to look at the trend over time, rather than absolute figures.
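Watching the trend means keeping snapshots and comparing them. The Python sketch below assumes two saved attribute tables. The TODAY rows are copied from the output above, while the LATER rows are invented purely for illustration; the names raw_values, growth and WATCHED are hypothetical as well.

```python
WATCHED = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
           "Current_Pending_Sector", "Offline_Uncorrectable")

# Rows copied from the smartctl output earlier in this thread.
TODAY = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       131 (0 19)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# HYPOTHETICAL later snapshot: these raw values are made up for the example.
LATER = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       140 (0 20)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       20
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def raw_values(text):
    """Map attribute name -> leading integer of RAW_VALUE for each table row."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def growth(old_text, new_text, names=WATCHED):
    """Return {name: increase} for watched attributes whose raw value grew."""
    old, new = raw_values(old_text), raw_values(new_text)
    return {n: new[n] - old[n] for n in names
            if n in old and n in new and new[n] > old[n]}


if __name__ == "__main__":
    print(growth(TODAY, LATER))
    print(growth(TODAY, TODAY))  # {} means nothing grew
```

An empty result between snapshots taken weeks apart supports the "single incident" view; steady growth in any of the watched counters is the signal to replace the disk.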

I do recall that Google released a paper which found the risk of drive failure was higher once any bad sectors were reported, but that risk is still a measure of probability, rather than "this drive will die in X days", and they have a huge number of drives.

So those bad sectors are of some concern, and Reported_Uncorrect is unusually high.

Temperature is also something that sticks out: minimum of 7, maximum of 62? The former seems rather cold, and the latter is way too hot for an HDD. (It also indicates the possibility of the raw values being bogus.) Back when I used Seagates, their software considered anything higher than 55C potentially voiding warranty.

Drives can fail at any time. Back up your data.
