UFS SMART false positive?

Today I got an e-mail from smartd:

Code:
This message was generated by the smartd daemon running on:

  host name:  server
  DNS domain: example.com

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, 16 Offline uncorrectable sectors

Device info:
ST4000NM0245-1Z2107, S/N:ZC1129RT, WWN:5-000c50-0a1dfca1c, FW:SS03, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

When I woke up I ran smartctl -a /dev/ada1 and got:

Code:
...
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16
...

When I was ready to replace the disk, I ran the same command again and got:

Code:
...
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
...

Also, I don't see anything else in the logs about the system being unable to read from or write to the disk, which I do see in other cases when smartd reports "uncorrectable sectors". The only log entries are:

Code:
Mar 11 04:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 04:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 04:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 04:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 05:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 05:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 05:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 05:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 06:11:12 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 06:11:12 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 06:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 06:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 07:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 07:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 07:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 07:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 08:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 08:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 08:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 08:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 09:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 09:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 09:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 09:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 10:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 10:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 10:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 10:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 11:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 11:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
Mar 11 11:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Mar 11 11:41:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors
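
One way to confirm that the count stayed flat across all of these half-hourly checks is to pull the numbers out of the log. This is only a minimal Python sketch, assuming the syslog line format shown above; the name sector_counts is made up for illustration.

```python
import re

# Matches smartd warnings such as:
#   "... smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors"
LINE_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>[^,\s]+), (?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)

def sector_counts(lines):
    """Return {(device, kind): [counts in log order]} for smartd sector warnings."""
    history = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            key = (m.group("dev"), m.group("kind"))
            history.setdefault(key, []).append(int(m.group("count")))
    return history

sample = [
    "Mar 11 04:11:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors",
    "Mar 11 04:11:11 server smartd[96106]: Device: /dev/ada1, 16 Offline uncorrectable sectors",
    "Mar 11 04:41:11 server smartd[96106]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors",
]
for key, counts in sector_counts(sample).items():
    print(key, counts)
```

A list that never changes, like here, means the drive kept reporting the same 16 sectors rather than finding new ones.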

Do you think the disk needs replacing, or were the bad sectors rewritten successfully so that they "just disappeared"?
 
I suspect that those sectors were bad and were reallocated from the spare area. How many reallocated sectors does this hard disk have? In any case, if the data on this disk matters, make a backup and replace the disk.
 
It's strange because it says:

Code:
 5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

I'm now running a smartctl "long" test, and so far there are no issues:

Code:
# 1  Extended offline    Self-test routine in progress 10%     25613         -

I think the last 10% will take longer than the first 90%.
 
In another forum I found:

A pending sector can also disappear if the next attempt to write to it succeeds. It only gets reallocated if that also fails.

Maybe this is what happened.
 
A few months ago I had the same warnings on two of my disks, but they did not disappear. Instead, the value slowly increased.
So I moved all important data to other disk drives, just in case :), and a few months after the initial warning both disks died at the same time. I could not read or write any data anymore...
 
Both parameters can be false positives, for example after a power fluctuation. To check, you can dd /dev/zero over the entire disk. That forces the disk to write to the pending sectors: either the write succeeds and the values return to 0, or the sectors are marked as bad and show up as reallocated sectors.
 
In another forum I found:
A pending sector can also disappear if the next attempt to write to it succeeds. It only gets reallocated if that also fails.
Maybe this is what happened.
That's the most likely explanation. What happens if a disk drive can't read a sector? It counts how many such sectors it has, and if someone tries to read one, it returns a read error. But if someone overwrites that sector (or, more likely, the whole track that the sector is on), then the sector is good again, assuming that the disk surface is not physically damaged or that the disk still has enough spare space to revector it.

Read the error message carefully. It just says that there are sectors that are unreadable and uncorrectable: the number of read errors has overwhelmed the redundancy that's built into the on-disk data. Overwriting them with good data (with fresh error correction redundancy) will fix that.
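
The bookkeeping described here can be written down as a toy model. This is a deliberately simplified sketch in Python, not how any particular firmware works; the Counters class and rewrite_pending_sector function are hypothetical names, and real drives differ between vendors.

```python
from dataclasses import dataclass

@dataclass
class Counters:
    pending: int = 0       # attribute 197, Current_Pending_Sector
    reallocated: int = 0   # attribute 5, Reallocated_Sector_Ct

def rewrite_pending_sector(c: Counters, write_succeeds: bool) -> Counters:
    """Model one overwrite of a sector that is currently counted as pending."""
    if c.pending == 0:
        return c  # nothing pending, nothing changes
    if write_succeeds:
        # The sector holds good data again and simply leaves the pending count.
        return Counters(c.pending - 1, c.reallocated)
    # The write failed in place, so the drive remaps the sector to a spare.
    return Counters(c.pending - 1, c.reallocated + 1)

# Sixteen pending sectors, all overwritten successfully.
state = Counters(pending=16)
for _ in range(16):
    state = rewrite_pending_sector(state, write_succeeds=True)
print(state)
```

In this model, sixteen successful overwrites end with both counters at 0, which matches what the smartctl output in this thread shows (197 back to 0 while 5 stays at 0).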

Still, an increase in the number of unreadable sectors is a bad sign, and it is correlated with disk failure. But correlation is not a guarantee; this disk might work great for the next 5 years, or it might die in 2 days. It's impossible to be certain. I would start making really good backups.
 
Thank you all for the replies.

These log messages were written while I was taking a new backup, so the new backup may have overwritten these sectors and fixed the issue.

I have "daily", "weekly" and "monthly" backups on these disks, and also "daily", "yesterday" and 12 "weekly" backups on 3 remote backup servers, so we are good. These disks are also in RAID-1.
Of course I will keep an eye on it.
 
I doubt it's a false positive.

"Pending" means the sector is marginal (e.g. it needed the redundancy of error correction to be read successfully), and will be remapped the next time it's written to.

I've also been surprised that supposedly uncorrectable sectors quietly disappear from SMART once they're remapped. To me that's not really a good indication of long term drive health - I'd really prefer a permanent, cumulative account of bad sectors - but that's how some manufacturers do it.

I have one server with a HGST drive that periodically throws up a pending sector, but a zero fill clears all indications of a fault, as if it never existed... for another 6 months, anyway...
 
Well, the definition of parameter 197 is *CURRENT* pending sectors: how many sectors are right now unreadable. Clearly, it is a bad sign if any of those have ever existed in the past. But given its definition, it really does need to go back down when the sector is no longer unreadable.

And the presence of currently unreadable sectors doesn't have to mean that the sector itself (the media on the platter) is marginal; it can also be caused by vibration, or gunk on the head that can sometimes clear itself. On the other hand, usually unreadable sectors do come from either the platter or the head being in bad shape, which is why the presence of non-zero 197 values is correlated with likely disk failure.

One of the big problems with SMART, as implemented in SATA, is that the meaning of the parameters is both badly designed and not standardized among vendors. The SCSI version of SMART is a little better (it allows the disk to self-diagnose and report that it feels it might fail, known as PFA); but then, the SCSI version has other problems (inconsistent way the information is handled, between vendors).

I would be very careful about using that drive. If this is a commercial setting with maintenance and valuable data, I would replace the drive; in an amateur or low-cost setting, where the cost of the disk drive matters, but the time of the sys admin is not that valuable, I would be sure to have good backups.
 
I've also been surprised that supposedly uncorrectable sectors quietly disappear from SMART once they're remapped. To me that's not really a good indication of long term drive health - I'd really prefer a permanent, cumulative account of bad sectors - but that's how some manufacturers do it.

You're looking at the rightmost field, the internal representation of the data. Those values are manufacturer-dependent; they are probably not meant to be understood. If they make sense to you, that's nice; if they don't, that's probably not too worrisome.
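
To make the difference between the normalized columns and the rightmost raw field concrete, here is a small Python sketch that splits one row of the attribute table. It assumes the usual ten-column smartctl -A layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); parse_attribute_row is a made-up helper name.

```python
def parse_attribute_row(row):
    """Split one row of the smartctl -A attribute table into named fields."""
    fields = row.split(None, 9)  # RAW_VALUE is last and may contain spaces
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "flag": fields[2],
        "value": int(fields[3]),   # normalized, vendor-scaled
        "worst": int(fields[4]),   # lowest normalized value seen so far
        "thresh": int(fields[5]),  # normalized failure threshold
        "type": fields[6],
        "updated": fields[7],
        "when_failed": fields[8],
        "raw": fields[9].strip(),  # vendor-specific raw field, kept as text
    }

row = "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16"
attr = parse_attribute_row(row)
print(attr["name"], "normalized:", attr["value"], "raw:", attr["raw"])
```

For the row quoted earlier in the thread, the normalized value is still 100 while the raw field is 16, so the two columns are clearly not telling the same story.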
And since the SMART stuff cannot reliably predict a drive failure (there are many ways a drive can fail which cannot be monitored in advance), I recommend this strategy:
  • have a mirroring of the data you want to keep (ideally different brands of disks)
  • in addition, have a tested backup of the data you want to keep (ideally off-site)
  • log the smartctl -x output every week or month
  • occasionally check these logs for any value that keeps moving in one direction
  • ignore the rest.
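
The "look for a value that keeps moving" step from the list above could be sketched roughly like this in Python. The weekly readings are invented for illustration, and steadily_increasing is a hypothetical helper, not part of smartmontools.

```python
def steadily_increasing(samples, min_points=3):
    """True if the readings never drop and end higher than they started."""
    if len(samples) < min_points:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]

# Hypothetical weekly raw values of attribute 197, made up for illustration.
weekly_pending = {
    "ada0": [0, 0, 0, 0],
    "ada1": [0, 16, 0, 0],   # appeared once, then cleared
    "ada2": [0, 8, 24, 40],  # keeps growing
}
for disk, samples in weekly_pending.items():
    verdict = "watch closely" if steadily_increasing(samples) else "no steady trend"
    print(disk, verdict)
```

Only the disk whose count keeps climbing gets flagged; a one-off spike that clears again does not, although it is still worth remembering that it happened.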
I have one server with a HGST drive that periodically throws up a pending sector, but a zero fill clears all indications of a fault, as if it never existed... for another 6 months, anyway...

That's how it should work. The 197 "current pending sector" count should disappear after the whole disk has been written. On HGST drives the 196 "reallocated event" count may then increase.
 