SMART Attribute 197: Current_Pending_Sector

Hi,

I have an old Samsung SpinPoint F3 disk, and smartd is reporting a problem in its daily email.

When I check it with smartctl, it passes the "self-assessment", and the only significant difference between the problem disk and its mirror pair (which has no problems) is ID# 197:
Code:
# smartctl -H -a /dev/da0
...
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
...
SMART Error Log Version: 1
No Errors Logged
I can dd everything off the disk just fine.
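(Nothing fancy about the read test; it was along these lines, with the block size an arbitrary choice:)
Code:
# read the entire raw device and discard the data
dd if=/dev/da0 of=/dev/null bs=1m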

Hoping somebody familiar with SMART diagnostics can tell me how significant ID# 197 is.

Why are "unreadable sectors" not fatal? Why doesn't bad block forwarding make it go away? Why does it persist? Can it be reset?

Cheers,
 
Whenever I suspect a drive has bad sectors I use UltimateBootCD to test it, using the brand-specific diagnostic tool if possible, or Vivard. After a full scan, sectors are either good or marked as unreadable and remapped. In either case the Current_Pending_Sector count should be reduced.
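If booting a diagnostic CD isn't convenient, the drive's own extended self-test started from smartctl gives a roughly comparable surface scan (read-only, so it will flag weak sectors but won't remap them until they are rewritten); device name taken from the post above:
Code:
# start the drive's built-in extended (long) self-test
smartctl -t long /dev/da0
# after the estimated runtime, check the result and the LBA of any failure
smartctl -l selftest /dev/da0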

From the S.M.A.R.T. Wikipedia page:

Current Pending Sector Count (ideal value: low; lower is better; considered critical):
Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.[58]
However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.
 
There appears to be an unreadable sector on the disk which is not touched when I read the whole disk with dd.

That suggests that the unreadable sector is located in an area accessible to the firmware, but not (currently) accessible to the operating system.

So, yes, it probably needs some low-level destructive diagnostic tests...
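Something like badblocks in write mode would presumably qualify (I believe it comes with the sysutils/e2fsprogs port; destructive, so only once the data is safely elsewhere):
Code:
# four-pattern destructive write/read-back test over the whole device
badblocks -wsv /dev/da0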
 
For SATA disks, counter 197 is one of the most important signals that the disk is ill. Unfortunately that does not mean the disk will fail; it only increases the probability of it. My standard joke is this: Of all disks that get a SMART error, only about 50% actually do fail; the rest continue working just fine. Of all disks that actually fail, only 50% get a SMART error beforehand.

You have two choices. You can continue using the disk, but you have a higher risk that it will fail. How much higher? I don't know. You should probably be particularly good about doing backups, or make sure the disk is used in a redundant system where its failure will not cause hardship (like data loss, great inconvenience, loss of money, ...). The other choice is to replace the disk: copy all the data to a new disk, then throw the old one away (or use it as a paperweight, or as a scratch disk for unimportant stuff).

Which choice to take depends on how valuable the data is, how valuable your time is, whether you already have good backups or RAID, whether this computer is business-critical and always has to be up, and so on and so on. I can't give you a one-size-fits-all answer.
 
There appears to be an unreadable sector on the disk which is not touched when I read the whole disk with dd.

That suggests that the unreadable sector is located in an area accessible to the firmware, but not (currently) accessible to the operating system.
Not necessarily. It is possible that the drive detected an error while reading a sector, but later it was able to reconstruct the contents using error recovery data. In other words, the contents could be read, but the sector is considered “weak” by the firmware and is marked for remapping.

In this particular situation (just one sector affected, and the drive is part of a mirror) I wouldn't replace it yet. But it might be a good idea to have a spare drive available, just in case.
 
For SATA disks, counter 197 is one of the most important signals that the disk is ill. Unfortunately that does not mean the disk will fail; it only increases the probability of it. My standard joke is this: Of all disks that get a SMART error, only about 50% actually do fail; the rest continue working just fine. Of all disks that actually fail, only 50% get a SMART error beforehand.
That sounds about right. My client has one machine with one disk that has a single "pending" sector. It has had it for the past 2 or 3 years and the machine is still humming along nicely. It's part of a mirror so I'm not too worried about this one. I've also had other machines that had disks that simply went "poof" and pretty much died on the spot. No warning, no nothing. Just a whole bunch of DMA errors and time-outs all of a sudden.
 
Not necessarily. It is possible that the drive detected an error while reading a sector, but later it was able to reconstruct the contents using error recovery data. In other words, the contents could be read, but the sector is considered “weak” by the firmware and is marked for remapping.
I'm guessing here, but I would have thought that:
  1. if a bad sector could be read (with the help of ECC) then it would be forwarded immediately; and
  2. if a bad sector could not be read, the firmware would raise the "current_pending_sector" flag and forward the sector the next time it was written.
I had not considered that a sector that needed ECC to be read would be left by the firmware to deal with later.

That's why I was surprised when dd read the whole disk OK.

So clearly my conceptual model is flawed.

I can also report that the "current_pending_sector" flag has now been set for quite a few days. My impression is that it is not going to go away.
 
Hard disk firmware is extremely complex. I've heard rumors that it's over a million lines of code. A very large fraction of it deals with error handling. Quite a bit of it is very old ... for example, there are rumors that most of the firmware on today's WD drives is still stuff written by old IBM engineers, and IBM sold its hard drive division to Hitachi and then to WD about 15 or 20 years ago. Similarly, I've heard that there is still code in Seagate drives that was written by CDC people in Colorado Springs or Oklahoma City.

And SMART has only been added in the last couple of decades, and sometimes I get the feeling that disk drive vendors are not taking it very seriously. Certainly, SMART implementations are full of interesting mis-features (one could call them bugs if one weren't charitable).
 
I'm guessing here, but I would have thought that:
  1. if a bad sector could be read (with the help of ECC) then it would be forwarded immediately; and
  2. if a bad sector could not be read, the firmware would raise the "current_pending_sector" flag and forward the sector the next time it was written.
I had not considered that a sector that needed ECC to be read would be left by the firmware to deal with later.
That's not what I meant. This is the situation that I meant: a sector could not be read, and error correction did not help. The firmware sets the “current_pending_sector” flag for that sector. However, the next time the operating system tries to read the sector, it succeeds (probably with the help of error correction). So the drive can return valid data for the sector, but the “pending” flag is kept, so the sector will be replaced the next time it is written.
I can also report that the "current_pending_sector" flag has now been set for quite a few days. My impression is that it is not going to go away.
I assume it will go away as soon as the sector gets written to. If it's part of a mirror, you can force a resync of the mirror (by removing the suspicious drive from the mirror and re-adding it), which rewrites the whole disk. Then the “current_pending_sector” count should be gone. However, as someone else mentioned, drive firmware is a very complex thing, and sometimes it doesn't behave as you'd expect it to.
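With gmirror(8), for example, that would look roughly like this (assuming a mirror named gm0 with the suspicious disk as da0; I don't know what mirroring setup is actually in use here):
Code:
gmirror remove gm0 da0     # detach the suspicious component from the mirror
gmirror insert gm0 da0     # re-add it; the resync rewrites every sector of da0
gmirror status gm0         # watch the synchronization progress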
 
Hard disk firmware is extremely complex. I've heard rumors that it's over a million lines of code. A very large fraction of it deals with error handling. Quite a bit of it is very old ... for example, there are rumors that most of the firmware that's on today's WD drives is still stuff written by old IBM engineers, and IBM sold its hard drive division to Hitachi and then to WD about 15 or 20 years ago. Similarly, I've heard that there is still code in Seagate drives that was written by CDC people in Colo Springs or OK City.
That doesn't necessarily have to be a bad thing. There are still pieces of code in FreeBSD that are more than 30 years old.

I found this in an e-mail signature some time ago:

Instead of asking why a piece of software is using “1970s technology”,
start asking why software is ignoring 30 years of accumulated wisdom.
 
Something I might do in such a case:
1. Get the data mirrored away (to enough other disks to maintain the desired redundancy), and take the disk out of the pool.
2. Overwrite the whole disk one or two times and read it back.
3. See what comes of this, and whether any of the SMART numbers increased. If it looks good, reassign the disk to duty; otherwise dump it.

This should not only fix up those "pending sectors", it should also catch other weak sectors (which might currently be unused).
There are self-test routines in SMART which should do basically the same thing. But as I don't know precisely what those tests do, I would prefer to use dd, where I know what it does.
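Roughly like this with dd (destructive; device name taken from the thread, block size arbitrary):
Code:
# write pass over the whole disk (repeat once more if desired)
dd if=/dev/zero of=/dev/da0 bs=1m
# read the whole disk back
dd if=/dev/da0 of=/dev/null bs=1m
# then compare the SMART attributes against a copy saved beforehand
smartctl -A /dev/da0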
 
This disk was well backed up, so I broke the mirror and ran a write pass over 100% of the disk using dd(1). #197 went back to zero during that first write pass.

I'm still playing with it, but the problem appears cured, at least for the time being:
Code:
# smartctl -H -a /dev/da0 | diff -b /tmp/smartctl.da0.20190809 -
17c17
< Local Time is:    Fri Aug  9 10:19:04 2019 AEST
---
> Local Time is:    Fri Aug  9 20:44:29 2019 AEST
66c66
<   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       67560
---
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       67570
70c70
< 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
---
> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       4
75c75
< 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
---
> 197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
78c78
< 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       22
---
> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       23
As an aside, there was an interesting performance gradient in the write speed across the disk: 155 MB/s at the outset (outer cylinders) descending monotonically to 80 MB/s at the end (inner cylinders), with an average of 117 MB/s. So the radial position (outer versus inner tracks) has a very marked impact on throughput.
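The same outer-versus-inner gradient shows up with plain reads at different offsets, for example (skip value assumed for a 1 TB disk, adjust to the actual capacity; dd prints the transfer rate when it finishes):
Code:
# outer cylinders (start of the disk)
dd if=/dev/da0 of=/dev/null bs=1m count=1024
# inner cylinders (near the end of the disk)
dd if=/dev/da0 of=/dev/null bs=1m count=1024 skip=900000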
 
The outer versus inner effect is perfectly logical. Modern disks have a fixed bit density per linear distance along the track, and change the data rate to match. A factor of 2 is normal.

The question is whether your disk is actually completely cured and can be expected to have normal reliability. I don't know, and it is virtually impossible to know. What was the cause of that error? Was it a rare vibration event while writing, which caused the track to be wiggly (wandering left and right) or not written well (a lot of bad bits, but within the capability of the ECC), so that afterwards reads were difficult? Was it a small defect in the surface? Has the sector been remapped to handle the surface defect? Was it a temporary problem, for example a head crash which caused lubricant from the platter to transfer onto the head, but now the lubricant has come off again? All good questions.
Personally, I would punch my data onto cards made from metal foil. Seriously, I would use RAID, like a mirror pair. Oh wait, you already are. Good. And then do good backups.
 