RAID-Z1 can't recover from one missing drive

I have a four-disk RAID-Z1 pool set up on my server (FreeBSD 10-STABLE, updated from source a couple of weeks ago; the pool has had zpool upgrade run on it). I want to replace it with a simple mirror, but I've used up all my SATA ports.

So, I pulled a drive, and popped in one of my new shiny 4 TB drives. I mean, it's RAID-Z1, right? It can still read the pool, at least well enough for me to copy the data over.

This is what I get:
Code:
root@marge:/home/spauldo# zpool status jails
  pool: jails
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
	replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

	NAME                    STATE     READ WRITE CKSUM
	jails                   FAULTED      0     0     1
	  raidz1-0              DEGRADED     0     0     8
	    ada3p2              ONLINE       0     0     0
	    ada4p2              ONLINE       0     0     0
	    ada5p2              ONLINE       0     0     0
	    237032139794884934  UNAVAIL      0     0     0  was /dev/ada6p2
root@marge:/home/spauldo#

Shouldn't the status be DEGRADED? I've lost drives before, but I've never had this issue.

Searching has turned up nothing on this - everyone else I found with this issue had lost more than one drive. I'm hesitant to try anything drastic because (I assume) I can simply plug the original drive back in and run with it.

I'm not too worried about copying the data - there are a couple other ways I can do this - but I'm worried that either I'm missing something or that something on my machine is screwed. RAID-Z1 should be able to do what I'm planning, right?
 
Is that the full zpool status output? It looks strange going straight back to the prompt after that last drive; there's usually an 'errors:' line showing whether any data errors were found (unless that's changed in newer releases).

The only thing that stands out as a possible cause of the FAULTED pool is that checksum error on the pool itself. Since you just pulled the disk, can you plug it back in and see if the pool recovers? If it does, run a scrub before removing anything again. In fact, it's a good idea to run a scrub before any planned operation like this: a scrub repairs the errors it can and should leave the pool 100% intact (at least until the next disk error), whereas if ZFS finds errors after you've removed a disk, that data is simply unusable.
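For reference, kicking off a scrub is just something like this (pool name taken from your output; the scrub runs in the background and its progress shows up in zpool status):
Code:
zpool scrub jails     # start the scrub; it runs in the background
zpool status jails    # check progress and any repaired or unrepairable errors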

You don't show the exact process you used to 'pull' the disk, but I would recommend running zpool offline jails <device> before physically removing a disk.
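Something along these lines, using ada6p2 from your output as the example device (substitute whichever disk you're actually pulling):
Code:
zpool offline jails ada6p2    # tell ZFS the disk is going away
# ...shut down, swap the hardware, boot...
zpool online jails ada6p2     # only if you put the same disk back in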
 
That's the full output of zpool status jails.

My procedure was pretty much:
  1. Shut down the server.
  2. Open the case, remove the drive and insert the new drive.
  3. Turn on the server.
  4. Run zpool status jails and start cursing, reading docs, searching, etc.

A scrub probably would have been a good idea; I didn't really think it was necessary since that zpool is only about two weeks old. I didn't consider offlining it because, well, you don't get the chance to offline your disks when one dies unexpectedly, and that's what RAID is for. This isn't the first time I've done this (although the other times were with FreeBSD 9), so not being able to run with the missing drive was unexpected.

I'll plug in the other drive after I finish the copying I'm doing and see what's up. If I lose everything on that zpool, it's not the end of the world - I've got backups of the data (not the jails though). I was mostly concerned that either I stumbled onto some kind of bug or (more likely) I had some kind of misunderstanding going on.
 
OK, I think I figured this out.

The whole point of this exercise was to replace a drive that was having SMART errors (which is not the drive I pulled, BTW). When I took the system down, I didn't realize there were checksum errors on that one drive.

After putting the drive I removed back in, the pool came up, but zpool status now reports an unrecoverable error on the drive that has the SMART errors, with two checksum errors listed against that entry. Strange that it didn't mention any problem with that drive while I was trying to figure out why the pool was faulted.

Anyway, I'm going to scrub and see what happens. It doesn't matter that much, since I don't need to pull that stunt again in order to copy the things I need, but it'll satisfy my curiosity.
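In case it helps anyone else who hits this, what I'm planning to run is roughly the following (the -v just lists any files affected by the unrecoverable error):
Code:
zpool status -v jails    # show which files, if any, are affected
zpool scrub jails        # scrub and see whether the checksum errors clear
zpool clear jails        # reset the error counters once the scrub comes back clean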

Thanks for the help.
 
I don't see any mention of you creating partitions on the new disk and running zpool replace jails /dev/ada6p2. Autoreplace is off by default. See the zpool(8) man page.
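Done by hand, that would look roughly like this, assuming the new disk shows up as ada6 and uses the same single ZFS partition layout as the others (add boot/swap partitions first if your other disks carry them):
Code:
gpart create -s gpt ada6              # new GPT partition table on the replacement disk
gpart add -t freebsd-zfs -a 1m ada6   # ZFS partition, 1 MB aligned
zpool replace jails ada6p2            # resilver onto the new partition sitting in the old one's place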
 
Next time, add the new drive first, let the pool resilver onto it, and only then remove the faulty one. That's the proper way to do it without putting data at risk.
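Sketched out, assuming you have a spare port and the new disk comes up as a hypothetical ada7 (partitioned like the others):
Code:
zpool replace jails ada6p2 ada7p2   # bring the new partition in alongside the old one and start the resilver
zpool status jails                  # wait until the resilver reports as complete
# only then power down and physically remove the old disk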
 