Bizarre Disk failure / "Periph destroyed" | Lost

# uname -a
Code:
FreeBSD MACH1 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r286666: Wed Aug 12 19:31:38 UTC 2015  root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  i386

All disks on the system are new (less than a couple of months old), except one [am1...];
all are labelled.

# df
==>
Code:
Filesystem  Capacity  Mounted on
/dev/gpt/gptrootfs  4%  /
devfs  100%  /dev
/dev/gpt/am1usrfs  0%  /disk_01
/dev/gpt/am2usrfs  79%  /disk_02
/dev/gpt/am6usrfs  0%  /disk_06
fdescfs  100%  /dev/fd

# gpart show -l
provided the following '/dev/ada' assignments:
Code:
ada0   am2usrfs   i.e. /disk_02
ada1   am6usrfs   i.e. /disk_06
ada2   am1usrfs   i.e. /disk_01
ada3   gptrootfs   i.e. /

After a few minutes, I received several messages on the console and in
/var/log/messages:
Code:
May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
...
...
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, 1891 Currently unreadable (pending) sectors
May  6 08:00:59 MACH1 smartd[586]: Device: /dev/ada3, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.


This would indicate that failure of am1usrfs and gptrootfs, i.e. /disk_01 and / (the system disk), could be imminent.
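
A minimal sketch of how the smartd lines above can be tallied per device and matched to the GPT labels, assuming the log format shown in this post and the device-to-label table from gpart show -l. The names LABELS and summarize are illustrative only, not part of any tool.

```python
import re
from collections import defaultdict

# Device-to-label mapping copied from the gpart show -l summary in this post.
LABELS = {
    "ada0": "am2usrfs",
    "ada1": "am6usrfs",
    "ada2": "am1usrfs",
    "ada3": "gptrootfs",
}

# Matches smartd lines of the form quoted above, e.g.
# "... smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors"
LINE_RE = re.compile(r"smartd\[\d+\]: Device: /dev/(\w+), (.+)$")
# Splits a message that starts with a sector count from its description.
COUNT_RE = re.compile(r"^(\d+) (.+)$")


def summarize(lines):
    """Return {device: {message: latest count, or None if the message has no count}}."""
    summary = defaultdict(dict)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # not a smartd device line
        device, message = match.groups()
        counted = COUNT_RE.match(message)
        if counted:
            summary[device][counted.group(2)] = int(counted.group(1))
        else:
            summary[device][message] = None
    return dict(summary)


if __name__ == "__main__":
    sample = [
        "May  5 22:00:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors",
        "May  5 22:01:00 MACH1 smartd[586]: Device: /dev/ada3, FAILED SMART self-check. BACK UP DATA NOW!",
    ]
    for device, messages in sorted(summarize(sample).items()):
        print(device, LABELS.get(device, "unknown label"), messages)
```

Fed the full /var/log/messages excerpt, this lists only ada2 and ada3; ada1 never appears, which matches the observation that the disk that drops out logs no SMART complaints.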

Oddly enough, it is ada1, am6usrfs (/disk_06), that functionally disappears from the system; it takes a shutdown and restart to bring it back up again. The disk remains visible in df -k, but an ls reports a "not configured" state. At this stage dmesg shows a "periph destroyed" message.

The whole sequence seems bizarre: ada3 reports failure messages yet continues to function, while ada1 is the one that gets dropped or destroyed, with an apparent clicking sound too.

Can anyone explain?

Thanks!
 
I have received those messages (but without the "FAILED SMART self-check" message) earlier, when I accidentally hit my PC while making a backup with rsync.
The messages only lasted a second or so and then disappeared. There was no data corruption, as I checked and rsynced once more after the first run finished.
It's only a small chance, but you may have too much vibration from the disks or something else. I'm running a simple ZFS mirror with 2 disks, so there is not a lot of vibration, unless I hit my PC.
I thought that if the vibration had lasted a little longer and the read errors had continued for longer, a disk (or both) might have gone offline...
 
Code:
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Currently unreadable (pending) sectors
May  5 22:30:59 MACH1 smartd[586]: Device: /dev/ada2, 6307 Offline uncorrectable sectors
These are not temporary errors. And judging by the amount, the disk should have been replaced a long time ago. Same for ada3. Your posted logs don't show anything for ada1; did it have similar errors? And are these SSDs or plain old HDDs? I've noticed SSDs tend to go offline suddenly after a number of uncorrectable errors.
 
No mechanical injury, vibration etc!
 
All are plain old HDD
Yes! ada1 just disappears. It's a new disk, just a week or two old, and produces no error messages.
The other disks, ada2 and ada3, are relatively new too (about 2 and 4 months respectively) and do produce error messages.
ada3 is the system disk, with a clean installation.
Ironically, ada0 is old, real old (several years), with no error messages; apparently stable as hell.
 
I'd return ada1, 2 and 3. At least 2 and 3 have offline uncorrectable errors and 1 is pretty dodgy. If they're this new they should never have been sold. Maybe it's a bad batch, maybe it's refurbished. In any case this should be covered by warranty.
 
Always suspect the cables (Pournelle's law).
A variation on that, which I rather like: Always wiggle the wires first.

I agree with previous comments that there's no way you should be seeing that sort of error count on drives less than 6 months old, unless they have been physically abused. I'd certainly try swapping out the cables, possibly check the voltage being supplied to the drives with a meter or scope (only do this if you are competent to safely work on live electronics). It certainly seems like the drives are just plain bad and should be replaced ASAP, but I wouldn't want to entirely rule out a false positive due to bad cables/power/cooling/etc.
 
Wiggle, I've certainly done, more like a samba flounce. I don't have access to a scope, but my meter gives +12V and +5V respectively, and a null on the +3.3V lead (which is normal, I think, on many SATA connectors). My SATA cables are the longer type, more than 18", and I have wondered whether there could be a significant voltage drop. Similarly, I have wondered about the SATA channels/ports on the controller; I have switched these around, but to no avail.
What is disturbing: would a MOBO, cable or controller shortcoming result in a permanent rather than a temporary disk failure, i.e. loss of partitions and failure to lay down a new file system?
 
Well, a problem outside the drive might just be generating a false positive from the drive's SMART reporting. If you have another system you could test in, trying the drives one at a time in that system might be worth a shot. If you get the same general problems and SMART error counts on a known good system+controller+port+cable+etc, then it's probably the drives themselves that are bad. Honestly, I think the problem is the drives themselves: either a bad manufacturing batch, or something bad happened to them at some point between the factory and now. It just never hurts to rule out the rest.

One last resort thing that could be tried is a low level format with camcontrol(8) (total data loss, obviously). I can't say I've ever LL formatted a SATA drive (have done it plenty of times on SCSI drives), so actually not sure if it's expected to work. If something bad happened to the drives in terms of a magnetic blip corrupting the formatting (e.g. they sat next to a giant electric motor for a while), instead of a physical type of defect, that might cure it. At your own risk. I recommend being quite careful if it seems to fix them, in case there's still an underlying issue, so extended testing post-format is strongly recommended.
 
You can't low-level format a modern hard drive. That's done at the factory and its geometry is fixed from then on. The best you can do is overwrite the entire drive with zeros (dd if=/dev/zero of=/dev/...) and rely on the disk's bad sector reallocation to take out the bad sectors.
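
As a rough illustration of the zero-overwrite idea (the same thing the dd command does), here is a sketch in Python. It assumes the target's size in bytes is already known and is passed in explicitly, and that wiping the target is really what you want: it destroys everything on it. zero_fill and its parameters are hypothetical names, and a raw disk device may impose alignment rules that this sketch does not check.

```python
import os


def zero_fill(path, total_bytes, block_size=1024 * 1024):
    """Overwrite the first total_bytes of path with zeros; return bytes written."""
    zeros = bytes(block_size)  # one reusable block of zero bytes
    written = 0
    # "r+b" opens an existing target without truncating it; buffering=0
    # hands each write straight to the operating system.
    with open(path, "r+b", buffering=0) as target:
        while written < total_bytes:
            chunk = min(block_size, total_bytes - written)
            count = target.write(zeros[:chunk])
            if not count:
                raise OSError("write made no progress at offset %d" % written)
            written += count
        os.fsync(target.fileno())  # ask the OS to flush to the medium
    return written
```

Whether the pending and uncorrectable counts drop afterwards is up to the drive's own reallocation, so re-reading the SMART attributes after the pass is the actual test.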
 
Yeah, that is what I was uncertain about. I've done real low-level formats on fast+wide SCSI and older many times, even on some LVD Ultra-SCSI, where it certainly took long enough (i.e. hours) to give the appearance of laying down new format markers. I've just never had the need or inclination to do it in more recent years, so uncertain about expected behaviour on current SATA and SAS.
 
Even the very old IDE drives from early '90s were the same, you couldn't low level format them. The "last mohicans" were indeed those SCSI drives, after them all drives in use have the low level format completely hidden from the user after factory initialization.
 