Zpool add mistake

Hi all,

I've run into a problem which is really scaring me.
I have a pool configured as RAIDZ with 12 x 3TB disks in a 24-bay Supermicro chassis. I was asked to upgrade this server by adding another 12 x 3TB disks, doubling the capacity. But when I ran the zpool add command, I simply forgot to include raidz on the command line.

I ran:

Code:
zpool add Storage dev1 dev2 dev3 dev4 dev5 dev6 dev7 dev8 dev9 dev10 dev11 dev12

Instead of:

Code:
zpool add Storage raidz dev1 dev2 dev3 dev4 dev5 dev6 dev7 dev8 dev9 dev10 dev11 dev12

So I ended up adding 12 standalone 3TB disks to my RAIDZ pool, with no redundancy.

The question is: is there a way to remove those disks from the pool? Or is there a way to fix this without destroying the entire pool? I don't have anywhere else to back up 30TB and then copy it back again.

Thanks in advance,

Danilo
 
This is why that command errors out, telling you that you are doing something crazy and forcing you to re-run the command with -f (which should send up big red flags).

If you actually forced the add command to complete, then you are screwed. You cannot remove vdevs from a pool. You have to destroy the pool and start from scratch.

Or, find another 12 drives and attach one to each of the 12 you just added, creating 12 new mirror vdevs to at least keep redundancy in the pool (see the sketch below). But that will lead to very mismatched performance.
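Something along these lines, where dev1..dev12 are the disks you just added and newdev1..newdev12 are placeholders for the extra drives (substitute your real device names):

Code:
# Attach a new disk to each single-disk vdev, turning it into a two-way mirror
zpool attach Storage dev1 newdev1
zpool attach Storage dev2 newdev2
# ...and so on through dev12/newdev12, then check the layout
# and wait for the resilvers to finish
zpool status Storage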

The only real, proper fix is to destroy the pool and start from scratch.

And read the error/warning messages that zpool emits. :)
 
Hi Phoenix,

Thanks for the reply.
I was afraid someone would tell me this. Actually, the only warning I remember was a partition issue, because one disk had been part of an old pool.

Anyway, I'll try to find some way to back up the files so I can rebuild the pool from scratch.


Thanks again,

Danilo
 
zpool would have emitted "mismatched vdevs; use -f to force" and exited without doing anything. You would then have had to manually run the command a second time, adding -f to the command line, in order to add the drives as single-disk vdevs.

I know it's not what you wanted to hear. :( But that's the reason that error message exists.

Good luck! I've had to rebuild multi-TB pools from scratch and recover the data from backups, so, no, it's not a pleasant experience. But, live and learn. :)
 
Hi @Mussolini!

Since you're going to have to redo everything from scratch, I strongly advise you to use raidz2 instead of single-parity raidz with any disks over 500 GB.

/Sebulon
 
Sebulon said:
Hi @Mussolini!

Since you're going to have to redo everything from scratch, I strongly advise you to use raidz2 instead of single-parity raidz with any disks over 500 GB.

/Sebulon

Hi Sebulon,
Thanks for the advice.

Actually, in cases like this (24 disks) I usually create two raidz vdevs of 12 disks and put them together in one pool, so I get one disk of redundancy for each 12 disks.
Don't you think this is a good idea?

Thanks,

Danilo
 
General advice is that in an array over about 2 TB in size, use dual redundancy (RAID6/RAID-Z2) instead of single. I would definitely not put 12 3TB disks in RAID-Z1. The fact that you have a second group of disks with the same level of redundancy is fairly irrelevant. If you lose a disk in one vdev, ZFS will need to read all the data from the remaining 11 disks to rebuild the missing disk. The chance of read/checksum errors happening during the rebuild is actually pretty high with this many disks (especially if it's a few years down the line), and these will show up as the dreaded 'Permanent errors have been detected in the following files...' messages in zpool status.

With RAID-Z2, for an error to cause actual data loss during a disk rebuild, 2 disks need to both have errors that affect the same stripe, which is much less likely than just getting a read/checksum error in the first place. It's a lot easier to lose a bit more space and have a robust RAID than to have to deal with permanent data errors when they happen.

If performance was at all important I would actually go for 4x6 RAID-Z2 in your case, but you lose a hell of a lot of disk space that way: 24TB of raw space given over to parity vs 6TB with your suggested setup. It may also be worth looking at 2x11 RAID-Z3 with 2 cold spares if the data being stored is important. Both these options (4x6, 2x11) are also optimal in terms of 'disks per vdev', whereas 12 disks in RAID-Z1/2 isn't, although that's a moot point if you're not overly concerned about performance. (If you're accessing the data over <= 1Gb ethernet it probably won't make much difference whether the vdevs are optimised or not.)
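For illustration, a rough sketch of what that 4x6 RAID-Z2 layout looks like at creation time (disk1..disk24 are placeholders; you'd normally use the /dev/disk/by-id names):

Code:
zpool create Storage \
    raidz2 disk1  disk2  disk3  disk4  disk5  disk6  \
    raidz2 disk7  disk8  disk9  disk10 disk11 disk12 \
    raidz2 disk13 disk14 disk15 disk16 disk17 disk18 \
    raidz2 disk19 disk20 disk21 disk22 disk23 disk24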
 
Mussolini said:
Actually, in cases like this (24 disks) I usually create two raidz vdevs of 12 disks and put them together in one pool, so I get one disk of redundancy for each 12 disks.
Don't you think this is a good idea?

We use 6-disk raidz2 vdevs in our storage servers. With 2 TB drives, and 24 drive bays, we get 4 vdevs of 8 TB each (4 data disks * 2 TB/disk) for a total of 32 TB storage in the pool. Each vdev can lose 2 disks without losing the pool. And spreading the I/O across 4 vdevs improves overall performance of the pool.

Our largest pool has 57 disks in it, also using 6-disk raidz2 vdevs (and a couple of spare disks), with available space to hold 90 disks.

When using raidz vdevs, you really, really, really want to avoid using 10+ disks per vdev. Optimal sizing is 6 or 8 disks for raidz2 vdevs. Why? Because trying to scrub or resilver a vdev with that many disks in it will drag your box into the gutter.

My original ZFS storage box had a single raidz2 vdev using 24 disks. Ran fine, until the first disk died. After 3 weeks of trying, it was still resilvering and crashing.

Just don't do it. Keep the number of drives in a single vdev under 10!
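And to tie that back to the original mistake: when you grow a pool, each new batch of disks should go in as its own redundant vdev, roughly like this (placeholder device names):

Code:
# Adds a new 6-disk raidz2 vdev alongside the existing ones
zpool add Storage raidz2 disk25 disk26 disk27 disk28 disk29 disk30
zpool status Storage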
 
Hi Guys,

I really appreciate the information you guys are sharing here. For sure, I'll consider all of this in my next builds.
I've been using ZFS since 2009 and, I don't know why, I had never heard of the recommendation not to create vdevs with more than 10 disks. I have to say that space has always been my priority, of course with a minimum of safety. Maybe that's because of the field I work in, a post-production house: available space always was, and is, in short supply.
Also, during this time I've never had problems resilvering or scrubbing this kind of pool. Several times I had failed disks in the pool and they resilvered without problems at a reasonable speed. Maybe that's why I've never had big concerns regarding redundancy.

Considering the chassis I usually use (Supermicro 16, 24 and 36 bays), I guess a good principle would be to create 8-disk vdevs.

I also didn't know that resilvering was faster in RAIDZ2 pools; what's the explanation for that?


Thanks a lot.

Best,
Danilo
 
@phoenix

That's mostly true for you since you're a big dedup user, right? I mean, in our SuperMicro chassis (24's & 36's), I usually set up 8+8+8, or 10+10+10+6 raidz2, no spares. A resilver takes a couple of hours, with I/O peaking over 1 GB/s. So I'm guessing that your recommendation holds true if you're using dedup, and otherwise one may be fine going up to even 12.


I also didn't know that resilvering was faster in RAIDZ2 pools; what's the explanation for that?
How did you get that idea? If anything it's quite the opposite. :)

/Sebulon
 
Hi guys,

Regarding the first post, I did a restart on the server today. After this restart I got this:

Code:
[root@ZBoox003LX mdotti]# zpool status
  pool: Storage
 state: UNAVAIL
status: One or more devices could not be used because the label is missing 
	or invalid.  There are insufficient replicas for the pool to continue
	functioning.
action: Destroy and re-create the pool from
	a backup source.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
  scan: none requested
config:

	NAME                              STATE     READ WRITE CKSUM
	Storage                           UNAVAIL      0     0     0  insufficient replicas
	  raidz1-0                        ONLINE       0     0     0
	    scsi-35000c500608a6800        ONLINE       0     0     0
	    scsi-35000c500608abf2d        ONLINE       0     0     0
	    scsi-35000c500608acc31        ONLINE       0     0     0
	    scsi-35000c500608acccf        ONLINE       0     0     0
	    scsi-35000c500608a7310        ONLINE       0     0     0
	    scsi-35000c500608ad1b8        ONLINE       0     0     0
	    scsi-35000c500505fd38f        ONLINE       0     0     0
	    scsi-35000c500608aa29e        ONLINE       0     0     0
	    scsi-35000c500608a8484        ONLINE       0     0     0
	    scsi-35000c500608a6073        ONLINE       0     0     0
	    scsi-35000c500608a448f        ONLINE       0     0     0
	    scsi-35000c500608a6783        ONLINE       0     0     0
	  pci-0000:05:00.0-scsi-0:0:22:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:23:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:21:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:24:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:25:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:26:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:27:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:28:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:29:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:30:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:31:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:32:0  UNAVAIL      0     0     0

Does this mean that after the restart the disks are out of order??? Please tell me I can fix this!!!

Thanks.
 
Sebulon said:
@phoenix
That's mostly true for you since you're a big dedup user, right? I mean, in our SuperMicro chassis (24's & 36's), I usually set up 8+8+8, or 10+10+10+6 raidz2, no spares. A resilver takes a couple of hours, with I/O peaking over 1 GB/s. So I'm guessing that your recommendation holds true if you're using dedup, and otherwise one may be fine going up to even 12.

Only 2 of the 4 storage servers use dedupe now. They all use 6-disk raidz2 vdevs. Resilver on the dedupe pools takes several days for a 2 TB drive, and scrub takes almost a month; on the non-dedupe pools it takes several hours to resilver a 2 TB drive and several days to scrub the pool.

I've seen I/O throughput (according to "zpool iostat") top 600 MB/s, but that's limited by the gigabit link between servers. Haven't watched resilver rates.
 
Mussolini said:
Hi guys,

Regarding the first post, I did a restart on the server today. After this restart I got this:

Code:
[root@ZBoox003LX mdotti]# zpool status
  pool: Storage
 state: UNAVAIL
status: One or more devices could not be used because the label is missing 
	or invalid.  There are insufficient replicas for the pool to continue
	functioning.
action: Destroy and re-create the pool from
	a backup source.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
  scan: none requested
config:

	NAME                              STATE     READ WRITE CKSUM
	Storage                           UNAVAIL      0     0     0  insufficient replicas
	  raidz1-0                        ONLINE       0     0     0
	    scsi-35000c500608a6800        ONLINE       0     0     0
	    scsi-35000c500608abf2d        ONLINE       0     0     0
	    scsi-35000c500608acc31        ONLINE       0     0     0
	    scsi-35000c500608acccf        ONLINE       0     0     0
	    scsi-35000c500608a7310        ONLINE       0     0     0
	    scsi-35000c500608ad1b8        ONLINE       0     0     0
	    scsi-35000c500505fd38f        ONLINE       0     0     0
	    scsi-35000c500608aa29e        ONLINE       0     0     0
	    scsi-35000c500608a8484        ONLINE       0     0     0
	    scsi-35000c500608a6073        ONLINE       0     0     0
	    scsi-35000c500608a448f        ONLINE       0     0     0
	    scsi-35000c500608a6783        ONLINE       0     0     0
	  pci-0000:05:00.0-scsi-0:0:22:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:23:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:21:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:24:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:25:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:26:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:27:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:28:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:29:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:30:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:31:0  UNAVAIL      0     0     0
	  pci-0000:05:00.0-scsi-0:0:32:0  UNAVAIL      0     0     0

Does this mean that after the restart the disks are out of order??? Please tell me I can fix this!!!

Thanks.

Well, well....
Found the problem... I did this restart to install a 10Gb card, to get higher throughput for this backup operation. Because of that, the symlinks changed from pci-0000:05:00.0-scsi to pci-0000:07:00.0-scsi. I thought that when referencing disks by path it wouldn't matter how many cards were in the box, as long as I didn't change the controller's slot. After taking the 10Gb card out, the pool mounted again. :e
But, this was a complete hour of horror!!!

Best,

Danilo
 
Labels, labels, labels...

Is that expected behaviour, though?

I thought ZFS scanned the on-disk metadata to find the correct disks again anyway, but that might be tricky when they were added like that...
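If it happens again, a possible workaround on ZFS on Linux (assuming the pool can still be exported cleanly) is to re-import it using the stable by-id links instead of the by-path ones:

Code:
zpool export Storage
zpool import -d /dev/disk/by-id Storage
# The devices are then recorded under their by-id names,
# which don't change when cards move between PCI slots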

/Sebulon
 
Note: you are using ZFS on a Linux server. This is a FreeBSD forum. You would get better answers to zfsonlinux issues (like why PCI devices are being renumbered) on Linux-oriented forums.
 
Yes, you are right.
I'm only using Linux for a short time, for specific reasons. I just forgot this is a BSD forum and not a ZFS forum.

Thanks
 