Disk errors

I have 3x 2TB disks set up with raidz1. All disks are less than 6 months old.

A week or so ago, I started getting errors saying that one of the disks was offline; I think this was due to a faulty Molex connector. I now seem to have some bad blocks on two of the disks. I was about 90% through copying a 10GB file to my NAS when the disks went offline, causing the file to become corrupt.

Code:
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada0, 8 Offline uncorrectable sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada1, 16 Offline uncorrectable sectors

zpool status:
Code:
  pool: Media
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 0h6m with 44 errors on Sat Oct  8 10:35:41 2011
config:

        NAME                                            STATE     READ WRITE CKSUM
        Media                                           ONLINE       3     0   130
          raidz1                                        ONLINE       3     0   343
            gptid/8c56d190-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0  83.6M resilvered
            gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       3     0     0  64K resilvered
            gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Media@auto-20110929.0400-2w:/new/file.mkv

I have deleted the above file and scrubbed the pool, but zpool status continues to list that file...

I am also getting a lot of checksum errors:
Code:
Oct  8 10:34:58 NAS root: ZFS: zpool I/O failure, zpool=Media error=86
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95 offset=913825554432 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95 offset=913825558528 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95 offset=913825488896 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95 offset=913825492992 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95 offset=913825488896 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95 offset=913825492992 size=65536
Oct  8 10:34:58 NAS root: ZFS: zpool I/O failure, zpool=Media error=86
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95 offset=913825554432 size=65536
Oct  8 10:34:58 NAS root: ZFS: checksum mismatch, zpool=Media path=/dev/gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95 offset=913825558528 size=65536
Oct  8 10:34:58 NAS root: ZFS: zpool I/O failure, zpool=Media error=86


Looking at the above, I would say I need to replace two disks. Is that right?

Why does zpool status -v continue to show the permanent error even after the file has been deleted?

I have tried to take one of the disks offline so I can replace it, but it comes back with the error below. Why could this be?
Code:
[...alex@NAS] /mnt/Media# zpool offline Media gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95
cannot offline gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95: no valid replicas

Also, it seems weird that the disks don't appear as ada0, ada1 and ada2.


Thanks for your help!
 
nzalexnz said:
I have tried to take one of the disks offline so I can replace it

Is that because you cannot physically connect more than 3 disks at a time? You don't need to "offline" the disk to replace it if you can connect the new drive at the same time.

Given the issues you are seeing, it could equally be a controller problem. What disk controller are you using? Assuming it's SATA, you could try, if possible, disabling AHCI and/or running at a slower bus speed and seeing whether the drives behave reliably then...
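
For example (a rough sketch, assuming a FreeBSD-based NAS and the ada0-ada2 device names from your smartd log), the boot messages will show which driver each drive attached with and the negotiated link speed:
Code:
dmesg | grep -E 'ahci|ada[0-9]'
# look for lines like "ada1: 300.000MB/s transfers (SATA 2.x ...)" and for
# any timeout/retry messages against the ahcich* channels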

Thanks Andy.
 
You can clear the error with "zpool clear". See manpage for more:

http://www.freebsd.org/cgi/man.cgi?query=zpool&sektion=8
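
For example (pool and device names taken from your zpool status output; this resets the error counters but doesn't fix whatever caused them):
Code:
zpool clear Media
# or clear the counters for a single device:
zpool clear Media gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95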

I do tech support for a major server OEM, and one problem we see occasionally is where you get a bad block on one disk and have another drive go offline for whatever reason. When this happens, no RAID system worth having will trust the data on the drive that failed, because it will be out of date. However, it will not be able to rebuild that one block on the returning drive, because to calculate the missing block it needs the other n-1 blocks in the stripe to XOR together.

It's a problem that is easily mistaken for having two bad drives, but more likely you have zero or one bad drives and just got unlucky with a bad block forming while the other drive fell offline. Determining why the drive went offline is another problem and something you should investigate (look up SMART tools to start). This diagnosis might change as new evidence surfaces, of course, but unless you already have spare drives, replacements may not be needed.

This error will often carry over to any new drives introduced into the array, because no matter how many blank replacement drives you add, you'll never put the original contents of the bad block back.

I can't find any info on what error=86 is, but an I/O error is consistent with the phenomenon I'm talking about.
 
Wow! Thanks for the replies!

I think the reason the drive went offline in the first place was a faulty Molex -> SATA power connector. When I was looking at one of these, a pin had slipped out.

I have run a zpool clear. Since writing the post I have learned that you cannot clear "errors: Permanent errors have been detected in the following files:", as it is a permanent error.

All the disks are SATA. I am using the on-board RAID controller; my motherboard is an Asus P55 LX. I can plug in more disks; however, I will be replacing these under warranty, which means I will not be able to put a new disk in until I take the bad one back. For this reason I am scared to just unplug the drive, get it replaced, and then possibly run into more issues with the new drive.

What would your suggestion be? Is there a way to determine if one drive is actually bad? I could copy all of the data off, do a low-level format and rebuild the array? Would this get rid of the "bad blocks"?

Thanks again for your help!
 
It looks like you already have SMART tools installed; read the manpage for instructions on how to test a disk.
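
Roughly something like this (device names taken from your smartd log; adjust as needed):
Code:
smartctl -t long /dev/ada1   # start a long self-test; it runs in the background
smartctl -a /dev/ada1        # when it finishes, check the self-test log and the
                             # Current_Pending_Sector / Reallocated_Sector_Ct attributes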

Fully backing up and then restoring the backup to a freshly recreated zpool should fix the permanent errors it's talking about. Another thing you may be able to do is overwrite the affected file with a good version from backup. I'm not sure if this second option would repair it, but it should be easier and worth a shot if you have backups.
 
I have already done a long test with SMART tools; it has come back with the disk errors posted above.

Code:
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada0, 8 Offline uncorrectable sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada1, 16 Currently unreadable (pending) sectors
Oct  8 10:48:15 NAS smartd[57469]: Device: /dev/ada1, 16 Offline uncorrectable sectors
You had a theory that these may not actually be physical errors? If I do a low-level format and re-create the zpool, will it, according to your theory, stop reporting these uncorrectable errors?

Is there another way to offline the drives? Alternatively, if I just unplug the drive and replace it with a new one, will the zpool be able to resilver it and add it back into the pool correctly?

I am worried that if I replace the drive and there are issues, I won't be able to plug the original one back in.
 
What make and model are these disks?

Certain Samsung disks had a firmware bug that caused false i/o errors. Installing a firmware update fixed the issue.

# camcontrol devlist -v
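
If that's easier to read per disk, smartctl will also print the model number and firmware revision (a quick sketch):
Code:
smartctl -i /dev/ada0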
 
nzalexnz said:
You had a theory that these may not actually be physical errors? If I do a low-level format and re-create the zpool, will it, according to your theory, stop reporting these uncorrectable errors?

Is there another way to offline the drives? Alternatively, if I just unplug the drive and replace it with a new one, will the zpool be able to resilver it and add it back into the pool correctly?

I am worried that if I replace the drive and there are issues, I won't be able to plug the original one back in.

With three disks in a RAID-5 or RAID-Z, you need at least two disks' worth of data to recreate the third disk perfectly if it should fail. If disk 2 is fine, disk 1 has some bad blocks that it hasn't found yet (normally a few bad blocks will just be remapped by the drive and don't warrant replacement), and disk 0 flat out dies (including if it only loses power), then even if you fix the problem with disk 0, you will not be able to recover the parts of disk 0 that were in the same stripe as the bad blocks on disk 1. There is no way to recover them, thus it is a permanent error. So the fact that this happened means that at some point there was a bad block, probably on one of the disks that is in the array now; however, often a bad block just needs to be written over again and it will be fine.
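
As an aside, a pending/uncorrectable sector normally clears once that LBA is written again: the drive either rewrites it in place or remaps it to a spare. A ZFS scrub does that rewrite for you whenever it still has redundancy; forcing it by hand only makes sense on a disk you have already removed from the pool. Purely as a hypothetical sketch (the LBA below is made up; the real one comes from the self-test log in smartctl -a, and this overwrites that sector):
Code:
# DESTRUCTIVE: zeroes one 512-byte sector, only on a disk no longer in the pool.
# 123456789 is a placeholder for the LBA_of_first_error value reported by smartctl -a.
dd if=/dev/zero of=/dev/ada1 bs=512 count=1 seek=123456789
smartctl -A /dev/ada1   # Current_Pending_Sector should drop after the rewrite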

So to answer the questions:

  1. Sort of. This condition means that there are hardware problems, but it can persist even after fixing the hardware problems.
  2. If you back your data up and recreate the pool, yes, it should stop reporting permanent errors. It's up to you to determine if it is worth it; a permanent error doesn't do additional harm once it has been detected.
  3. See the zpool manpage for instructions on removing and re-adding disks and setting hot spares.
  4. If you replace the disk, it will resilver the 99.9999% of the disk that doesn't have permanent errors. It will not be possible to clear up a permanent error by doing this, though; it is as permanent as the zpool. (A rough sketch of the replace workflow follows below.)
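
A rough sketch of that replace workflow, using the names from your output (ada3p2 is just a placeholder for wherever the new disk's data partition ends up):
Code:
zpool replace Media gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95 ada3p2
zpool status Media   # watch the resilver progress
# the old device should drop out on its own once the resilver completes;
# if it stays listed, try detaching it:
zpool detach Media gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95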
 
jem said:
What make and model are these disks?

Certain Samsung disks had a firmware bug that caused false i/o errors. Installing a firmware update fixed the issue.

# camcontrol devlist -v

Code:
[...alex@NAS] /> sudo camcontrol devlist -v
sudo: Command not found.
[...alex@NAS] />

They are Seagate 2TB Barracuda Green SATA3 5900.1 64MB Internal HDD ST2000DL003.
I did a bit of a Google search and found a few people having issues. Is this one of the models that gives false I/O errors?

I just ran another scrub.

Code:
  pool: Media
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 0h6m with 44 errors on Fri Oct 14 18:23:57 2011
config:

	NAME                                            STATE     READ WRITE CKSUM
	Media                                           ONLINE       1     0    44
	  raidz1                                        ONLINE       1     0   116
	    gptid/8c56d190-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0  69.1M resilvered
	    gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       1     0     0  64K resilvered
	    gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

Thanks again for your help Zhwazi! Not sure what I am going to do now... I think maybe replace disk 1?

Also, any idea why zpool status doesn't show the disks as ada0,1,2?
 
Whoops, I didn't run the command with the correct permissions.

Code:
[...alex@NAS] /# camcontrol devlist -v
scbus0 on ata2 bus 0:
<SAMSUNG CDRW/DVD SM-352B T807>    at scbus0 target 0 lun 0 (cd0,pass0)
<>                                 at scbus0 target -1 lun -1 ()
scbus1 on ata3 bus 0:
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ahcich0 bus 0:
<ST2000DL003-9VT166 CC32>          at scbus2 target 0 lun 0 (ada0,pass1)
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on ahcich1 bus 0:
<ST2000DL003-9VT166 CC32>          at scbus3 target 0 lun 0 (ada1,pass2)
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on ahcich2 bus 0:
<ST2000DL003-9VT166 CC32>          at scbus4 target 0 lun 0 (ada2,pass3)
<>                                 at scbus4 target -1 lun -1 ()
scbus5 on ahcich3 bus 0:
<>                                 at scbus5 target -1 lun -1 ()
scbus6 on ahcich4 bus 0:
<>                                 at scbus6 target -1 lun -1 ()
scbus7 on ahcich5 bus 0:
<>                                 at scbus7 target -1 lun -1 ()
scbus8 on umass-sim0 bus 0:
<USB 2.0 Flash  Disk 1100>         at scbus8 target 0 lun 0 (da0,pass4)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)
 
Things seem to have gone from bad to worse.

So, I got the pool to this state after replacing one of the drives.
Code:
...alex@NAS] /mnt# zpool status
  pool: Media
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	Media                                           DEGRADED     0     0     0
	  raidz1                                        DEGRADED     0     0     0
	    gptid/8c56d190-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0
	    replacing                                   DEGRADED     0     0     0
	      11075215194828456675                      UNAVAIL      0     0     0  was /dev/gptid/8cd2bb1c-db54-11e0-b3ad-e0cb4eb75f95
	      ada1p2                                    ONLINE       0     0     0
	    gptid/8d56b574-db54-11e0-b3ad-e0cb4eb75f95  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

If I try to detach or offline the "replacing" drive I get:
Code:
cannot detach 11075215194828456675: no valid replicas
cannot offline 11075215194828456675: no valid replicas

I tried to export the pool and re-import it. It seems to export OK, but when I try to import it again I get:
Code:
cannot mount '/Media': failed to create mountpoint

Not sure what to try next ;<
 
How strange. When I replace a disk, it goes away, only saying "replacing" while the resilvering is going on... unless it is a hot spare. Is that a hot spare? It looks like it is not, or it would list spares below, with "INUSE" shown for the ada1p2 disk.

And a mistake I see is that you didn't label your disk. You should always label disks. It looks like you have a partition on there... if it is a GPT partition, instead of using the ada1p2 name, use gpt/...
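
A sketch of how to check what is actually on the new disk (names are examples, not necessarily your exact layout):
Code:
gpart show -l ada1   # list the GPT partitions on ada1 along with their labels
glabel status        # shows which gptid/... names map to which adaXpY devices
# if the data partition carries a GPT label, you can refer to it in zpool
# commands as gpt/<label> instead of ada1p2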

Did you try "remove" instead of "detach"? But detach sounds like the right command to use.

"failed to create mountpoint" sounds vaguely familiar. I think I had that once and just did a simple rmdir to remove the old mountpoint.

Perhaps the simplest solution is to get 3 USB/eSATA/firewire disks and back up the pool, and then recreate it. If I had this problem, the first thing I would do is back it up, even if I plan on playing around to fix it (just to learn something) rather than recreate.
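
If you go that route, zfs send/receive is the tidiest way to move everything, snapshots included. A sketch, where "backup" is a hypothetical pool created on the external disks:
Code:
zfs snapshot -r Media@migrate
zfs send -R Media@migrate | zfs receive -F backup/Media
# ...destroy and recreate the Media pool, then reverse the direction:
zfs send -R backup/Media@migrate | zfs receive -F Media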
 
Just wanted to follow up on this, in case it is useful to someone.

Now I believe the "no labels" is not necessarily a mistake. The only time it seems to cause trouble is with 8.2-RELEASE, and not with newer versions.

And, something I did not know back in December: the solution to a very similar problem (cannot detach <disk>: no valid replicas) is to scrub, and then, when the scrub is complete, clear.

Code:
zpool scrub Media
zpool clear Media
zpool detach Media 11075215194828456675

(if detach doesn't work, try remove)

Then clean up your system:

# zpool status -v Media

for each file listed:

# rm <file>

and then if you have a backup, replace the file.
 