Interpreting SMART data for hard disk

Can someone interpret this? Is it dangerous? My zpool contains this disk, and I don't want to lose it for lack of enough "copies".
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   170   021    Pre-fail  Always       -       2258
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always       -       9648
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   009   009   000    Old_age   Always       -       66541
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   091   091   000    Old_age   Always       -       9505
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       8590065670
190 Airflow_Temperature_Cel 0x0022   062   057   040    Old_age   Always       -       38 (Min/Max 24/38)
192 Power-Off_Retract_Count 0x0032   192   192   000    Old_age   Always       -       6742
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2905
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
Do a web search on SMART attribute 188 and see what the general consensus is. Other than a massive number of command timeouts, I see nothing else that raises an eyebrow.

Is the attribute 188 value changing over time, or does it represent a burst condition that occurred all at once?
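One way to answer that question is to log the raw value periodically and diff consecutive readings. A minimal sketch, assuming the standard `smartctl -A` table layout (column positions are smartmontools' convention; device names and scheduling are yours to fill in):

```python
# Pull one attribute's raw value out of `smartctl -A` text output so it can
# be logged over time. RAW_VALUE is the 10th column of the attribute table.
def raw_value(attr_id: int, smart_output: str) -> int:
    """Return the raw value of one SMART attribute from `smartctl -A` text."""
    for line in smart_output.splitlines():
        fields = line.split()
        if fields and fields[0] == str(attr_id):
            return int(fields[9])  # trailing notes like "(Min/Max 24/38)" are ignored
    raise KeyError(f"attribute {attr_id} not found")

# Run this from cron against `smartctl -A /dev/ada0` (device name is an
# example) and compare values: a steadily climbing count points at
# cabling/interface trouble; a constant one suggests a single past burst.
```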
 
The important part is that 196/7/8 are all zero: no uncorrectable read errors.

The temperature of 38 degrees is excellent. Probably the sweet spot for disks.

The only concerning thing is the age of the disk: 7.6 years, with about 10K power cycles and start/stops. Age alone doesn't kill disks, and they can go 10 to 20 years, but with every year the probability of mechanical failure (often related to lubrication) gets higher.
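The 7.6-year figure comes straight from attribute 9:

```python
# Convert SMART attribute 9 (Power_On_Hours) to years.
power_on_hours = 66541                    # raw value from the table above
years = power_on_hours / (24 * 365.25)    # 365.25 accounts for leap years
print(round(years, 1))                    # 7.6
```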
 
I was able to remove this old disk's vdev from the zpool.
For what it's worth, I can't remember ever having a spinning disk die on me.
My last disk to die was an SSD, a Western Digital, and it died just after six months.
All data was lost.
 
The command_timeout value is the total number of interface commands (IOs) that were aborted due to a timeout. This is pretty much industry standard. The mechanism for timeouts is very different between SATA and SAS though. Getting some timeouts is normal, 8 billion (!) of them is surprising. But even if they continue growing, that does not indicate that the disk itself is at risk of data loss, only that the interface (for example wiring) is flaky.
 
The only concerning thing is the age of the disk: 7.6 years, with about 10K power cycles and start/stops. Age alone doesn't kill disks, and they can go 10 to 20 years, but with every year the probability of mechanical failure (often related to lubrication) gets higher.
Beware of bitrot though. A 10 year old drive may present OK mechanically and you should be able to write data, but existing data can get corrupted as the magnetic fields gradually degrade. I've had that happen on a UFS production system, where a base script got changed and caused booting to fail.
 
Generally speaking we need to know the make and model of the drive to interpret all the SMART data.

The "188 Command_Timeout" value of 8590065670 can probably only be made sense of with that knowledge.

The TrueNAS Hard Drive Troubleshooting Guide is worth reading and bookmarking.

As indicated, that is almost certainly several numbers packed into a single raw value.

Generally speaking the only way to really make sense of such SMART attributes is to boot Windows and use the manufacturer's tool (same tool as used for firmware updates). This is way worse with SSDs.
 
Beware of bitrot though. A 10 year old drive may present OK mechanically and you should be able to write data, but existing data can get corrupted as the magnetic fields gradually degrade.
Most of the time, the internal error correction and detection will catch that. These undetected read errors should be at a level of 10^-17 per bit read. A more likely cause for data corruption is (a) software bug, and (b) memory error. The best way to help is to use a file system that CRC-checks as much as possible; for example ZFS.
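To put that 10^-17 figure in perspective, here is the expected number of undetected read errors for a hypothetical workload (the 10 TB figure is an illustration, not from the thread):

```python
# Expected undetected read errors at a 1e-17 per-bit rate.
ber = 1e-17                     # undetected bit error rate per bit read
bytes_read = 10 * 10**12        # say, 10 TB read over the drive's life
bits_read = bytes_read * 8
expected_errors = bits_read * ber
print(f"{expected_errors:.4f}") # 0.0008 -> vanishingly unlikely
```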

I don't know if this is correct.

8590065670 is 0x000200020006.
Split into three 16-bit words: 6 timeouts, 2 five-second timeouts, and 2 seven-second timeouts.
That seems very plausible! Thank you for looking this up.
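The decode is easy to verify by splitting the 48-bit raw value into three 16-bit words, low word first (the three-counter layout is a common vendor convention, not guaranteed for every model):

```python
# Split attribute 188's raw value into three 16-bit counters.
raw = 8590065670
words = [(raw >> shift) & 0xFFFF for shift in (0, 16, 32)]
print(hex(raw))   # 0x200020006
print(words)      # [6, 2, 2]
```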

Generally speaking the only way to really make sense of such SMART attributes is to boot Windows and use the manufacturer's tool (same tool as used for firmware updates).
If you have access to the detailed technical manuals for the drives, they will explain how to decode the SMART data. Those manuals are typically not published to consumers, and only available under NDA. Or worse; I remember dealing with this about a decade ago, and getting this information out of <X> and <Y> (big manufacturers) required setting up direct engineer <-> engineer contact. Only to find that the two manufacturers handled it totally differently, and it changes from disk model to disk model, causing our code to be chock full of if statements.
 
Most of the time, the internal error correction and detection will catch that. These undetected read errors should be at a level of 10^-17 per bit read. A more likely cause for data corruption is (a) software bug, and (b) memory error. The best way to help is to use a file system that CRC-checks as much as possible; for example ZFS.
Which is why I never use UFS anymore (or SATA, for that matter). I had a read-only mounted root file system where a cd in one of the /etc/rc.d scripts suddenly turned into a ce command (which doesn't exist).
No disk errors, nothing in SMART, just invisible corruption and a broken boot one day.
My conclusion was: don't trust SMART and cycle enterprise disks at about the 7 year mark.
 
I don't know if this is correct.

8590065670 is 0x000200020006.
Split into three 16-bit words: 6 timeouts, 2 five-second timeouts, and 2 seven-second timeouts.
Assuming the respondent parsed the data correctly, I think that number is no more than typical for a disk that has been hot-swapped on a SATA/NAS channel. Other than the OP's confidence in "old drives": in the words of Officer Barbrady, "nothing to see here, people, move along".
 