Is ZFS able to detect hotswap?

Hello

I have a raidz pool with two SATA disks. I use FreeBSD 9.0 and the disks are shown as /dev/adaX, and as far as I know, that means that hotswap should work. If I pull a disk out, one of the /dev/adaX devices disappears and I get some information in dmesg. And if I put it back, a new /dev/adaX appears, so hot swapping works pretty nicely.

However, ZFS does not detect this. The command [cmd=]zpool status[/cmd] happily says that all disks are online. Is this normal? Should ZFS detect when /dev/adaX becomes unavailable?
 
@Henu

It will notice something is up when it tries to do something with that pool; before that, it lives on happily unaware. If your pool is mounted somewhere with constant activity, it notices right away. However, once it has noticed and the status changes to e.g. OFFLINE or FAULTED, you have to zpool online or zpool replace it manually.
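
For example (just a sketch; "tank" and "ada1" here are placeholder pool and device names, adjust them to your setup):
Code:
# If the same physical disk has come back, bring it back into the pool:
zpool online tank ada1

# If the disk has been swapped for a new one, resilver onto the replacement:
zpool replace tank ada1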

/Sebulon
 
I did some testing and was able to define my problem more clearly. Here is a short version of it:

If I pull out the disk, the corresponding device disappears nicely, but if I put the disk back, the device does not show up. I think this is because ZFS is still somehow "using" it. This can be concluded from the fact that [cmd=]zpool status[/cmd] still says all disks are online, even when one has been physically removed and FreeBSD has removed its device from /dev/. Another fact is that the device appears immediately when I take the removed disk offline in ZFS (remember, I had already put the disk back earlier and was waiting for it to appear!)

I'm able to get the disk back online without any major data loss if I just reboot the machine while the disk is back in its slot. This is a good thing.

But the bad thing is that I'm unable to get it back online without a reboot. This kind of scenario might happen if somebody accidentally pulls out the wrong disk in a system that must not be rebooted.

If I take the disk offline (the one that was "accidentally removed" and then put back), /dev/adaX appears, but how am I supposed to get it back online? If I run [cmd=]zpool online <pool> adaX[/cmd] I get this error:
Code:
warning: device 'adaX' onlined, but remains in faulted state
and after this [cmd=]zpool status[/cmd] says it's not online, but faulted.

So the case and the final question is this: a user accidentally removes a physical disk from a zpool. How do you fix this without a reboot and without a rebuild (which will take hours)?

And by the way, I do have hardware that supports hotswapping, so that should not be an issue.
 
Henu said:
So the case and the final question is this: a user accidentally removes a physical disk from a zpool. How do you fix this without a reboot and without a rebuild (which will take hours)?

I don't think you'll be able to avoid an array rebuild. ZFS needs to verify that the data is OK, and the only way to do that is via a rebuild. There's no way for ZFS to know what has happened to the disk while it has been out of the array, so it needs to rebuild it.
 
It depends on how long the drive has been out of the pool, and what commands you used to "remove" it.

If you zpool offline a drive, wait 5-10 minutes, and then zpool online the drive, ZFS is smart enough to realise it's the same disk and only resilvers the data that was written while the drive was offline. That can take as little as a few minutes, as it's only copying over new data.

If you zpool offline a drive, wait 5-10 minutes, and then zpool replace the drive, ZFS treats it as a new drive, and the resilver has to rebuild *all* the data on the drive. This can take as little as an hour, and as much as a week or two, depending on the size of the vdev, the type of vdev, the number of drives in the vdev, and the amount of data in the vdev.

If you pull the drive, wait for ZFS to notice it's gone and mark it offline, then plug it back in, how long it takes to resilver depends on the next command you issue (online vs replace).
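
As a rough sketch of those two paths (pool and device names below are just placeholders):
Code:
# Same disk back in, resilver only the data written while it was out:
zpool offline tank ada1
# (pull the disk, then re-insert the same disk)
zpool online tank ada1

# Brand new disk, rebuild everything onto it:
zpool offline tank ada1
# (pull the disk, insert the replacement)
zpool replace tank ada1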

In other words, ZFS is smart; but it does exactly what you tell it to, so be sure to tell it to do the right thing. :)

(It also depends on the type of drive, SATA vs SAS, the type of controller, whether it's a true hotswap setup or just a hotplug setup, etc, etc, etc.)
 
phoenix said:
If you pull the drive, wait for ZFS to notice it's gone and mark it offline, then plug it back in, how long it takes to resilver depends on the next command you issue (online vs replace).

But how do I know when ZFS has noticed it? As I mentioned before, zpool status still says all disks are online, even when one has been physically removed.

And if I just take it offline manually after a while, I can only do a replace, not an online (if I want to avoid rebooting). This is because zpool online just says:
Code:
warning: device 'adaX' onlined, but remains in faulted state
and marks the disk faulted.
 
I think this is because ZFS is still somehow "using" it
No, it just hasn't noticed it's gone yet. If it had, it would have marked the drive OFFLINE by itself.

Another fact is that the device appears immediately when I take the removed disk offline in ZFS (remember, I had already put the disk back earlier and was waiting for it to appear!)
Let me see if I understood that correctly:

1. Pull out adaX; /dev/adaX gone.
2. Put disk back in; but no /dev/adaX.
3. Execute zpool offline poolname adaX; then /dev/adaX appears?

Because that's definitely not how it's supposed to happen.

/Sebulon
 
Sebulon said:
1. Pull out adaX; /dev/adaX gone.
2. Put disk back in; but no /dev/adaX.
3. Execute zpool offline poolname adaX; then /dev/adaX appears?

It happens exactly like that. By the way, do you know how quickly ZFS should notice that a disk has been pulled out?

Maybe I should pull one of the disks out and report here when it notices it...
 
Henu said:
Maybe I should pull one of the disks out and report here when it notices it...

Well, it has now been almost 24 hours with the physical disk removed, and # zpool status still says all disks are online. I suppose it should notice it faster than this?
 
I created one 200 MB file and then copied it. And it still says every disk is online.
 
Henu said:
Well, it has now been almost 24 hours with the physical disk removed, and # zpool status still says all disks are online. I suppose it should notice it faster than this?
At least on FreeBSD, there's nothing "auto" about the "autoreplace=on" zpool property. There's been a bunch of discussion about this in the past (on freebsd-fs@, I think) and some people suggested using devd to detect the change and force the replacement to be used.
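
A very rough, untested sketch of what such a devd hook might look like (this rule is my own guess; the match strings and the script path are assumptions, so check devd.conf(5) before relying on anything like this):
Code:
# Hypothetical /usr/local/etc/devd/zfs-reattach.conf
# When a new adaX device node is created (e.g. a disk is re-inserted),
# hand the device name to a script that decides whether to
# 'zpool online' or 'zpool replace' it.
notify 100 {
    match "system"    "DEVFS";
    match "subsystem" "CDEV";
    match "type"      "CREATE";
    match "cdev"      "ada[0-9]+";
    action "/usr/local/sbin/zfs-reattach.sh $cdev";
};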
 
Henu said:
Well, it has now been almost 24 hours with the physical disk removed, and # zpool status still says all disks are online. I suppose it should notice it faster than this?

Are the disks connected directly or is there a RAID card in between? Because maybe the RAID card could be faking that the disk is still connected and absorbing any writes in its NVRAM buffer, as some fancy 'you can only do this with HW RAID' feature...

It's a long shot, but you never know with hardware RAID...

Have you tried removing the disk and then doing a zpool scrub? If it doesn't detect the removed disk then you have some truly weird stuff going on :/
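
Something like this (with your own pool name substituted in):
Code:
zpool scrub <pool>
zpool status <pool>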
 
hopla said:
Are the disks connected directly or is there a RAID card in between?

The disks are connected directly to the motherboard.

hopla said:
Have you tried removing the disk and then doing a zpool scrub? If it doesn't detect the removed disk then you have some truly weird stuff going on :/

I ran zpool scrub, and now it has marked the missing disk as "UNAVAIL". Before that, it still showed both disks as "ONLINE", even though I removed the disk days ago, as you can see from the older posts.
 
Henu said:
It happens exactly like that. By the way, do you know how quickly ZFS should notice that a disk has been pulled out?

Maybe I should pull one of the disks out and report here when it notices it...

That's what's so weird; it's not supposed to act like that.

This is the correct behaviour:
1. Pull out adaX; /dev/adaX gone.
2. Put disk back in; /dev/adaX reappears.
3. Execute zpool online/replace poolname adaX manually.

Could you please:
# ls -lah /dev/adaX
# tail -f /var/log/messages
Then pull out the "adaX" drive, wait 30s then push it back in and wait another 30s. After that you abort tail, then:
# ls -lah /dev/adaX
again, and paste the output of that sequence here.

You have also stated that your hard drives are connected directly to the motherboard, so it would be nice to know which motherboard it is. And could you paste the output of:
# grep ata /var/run/dmesg.boot
# kldstat
# uname -a

/Sebulon
 
Sebulon said:
Could you please:
# ls -lah /dev/adaX
# tail -f /var/log/messages
Then pull out the "adaX" drive, wait 30s then push it back in and wait another 30s. After that you abort tail, then:
[CMD=""]ls -lah /dev/adaX[/CMD]
again, and paste the output of that sequence here.

I have a slightly different configuration, but I'm sure this is what you wanted. The same message also appeared in dmesg during the sequence. I also cleaned everything irrelevant out of messages.log.
[root@machine ~]# ls -lah /dev/ada*
crw-r----- 1 root operator 0, 75 Jul 11 12:14 /dev/ada0
crw-r----- 1 root operator 0, 76 Jul 11 12:14 /dev/ada1

[root@machine ~]# tail -f /var/log/messages.log
Jul 11 13:26:55 machine (ada0:ahcich0:0:0:0): lost device

[root@machine ~]# ls -lah /dev/ada*
crw-r----- 1 root operator 0, 76 Jul 11 12:14 /dev/ada1


Sebulon said:
You have also stated that your hard drives are connected directly to the motherboard, so it would be nice to know which motherboard it is. And could you paste the output of:
# grep ata /var/run/dmesg.boot
# kldstat
# uname -a

The motherboard should be Intel D525MW.

[root@machine ~]# grep -i -E "(ada|ata)" /var/run/dmesg.boot
ahci0: <Intel ICH7 AHCI SATA controller> port 0x20b8-0x20bf,0x20cc-0x20cf,0x20b0-0x20b7,0x20c8-0x20cb,0x20a0-0x20af mem 0xf0284000-0xf02843ff irq 18 at device 31.2 on pci0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD20EARX-00PASB0 51.0AB51> ATA-8 SATA 3.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD20EARX-00PASB0 51.0AB51> ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)


[root@machine ~]# kldstat
Id Refs Address Size Name
1 29 0xffffffff80200000 9a1940 kernel
2 1 0xffffffff80ba2000 25a598 zfs.ko
3 2 0xffffffff80dfd000 25d80 krpc.ko
4 2 0xffffffff80e23000 5e20 opensolaris.ko
5 1 0xffffffff80e29000 225f8 geom_eli.ko
6 2 0xffffffff80e4c000 3c8a0 crypto.ko
7 2 0xffffffff80e89000 13798 zlib.ko
8 1 0xffffffff80e9d000 9d8 accf_data.ko
9 1 0xffffffff80e9e000 18e8 accf_http.ko
10 1 0xffffffff80ea0000 3688 speaker.ko
11 1 0xffffffff80ea4000 2c08 coretemp.ko
12 1 0xffffffff80ea7000 61b8 tpm.ko
13 1 0xffffffff81012000 a96b fuse.ko


[root@machine ~]# uname -a
FreeBSD machine 9.0-RELEASE FreeBSD 9.0-RELEASE #0 r229307+f19379b: Mon May 28 21:53:48 UTC 2012 root@machine2:/usr/obj/usr/src/sys/COMPANY amd64
 
Very strange. I find it amazing that the system can apparently write to the pool with the disk out and not FAULT/UNAVAIL the device or clock up read/write errors. Can we see the zpool status output after writing to the phantom device, and maybe the mount output as well? (An obvious mistake that I wouldn't expect anyone to make, but we have to rule out the possibility of the pool not being mounted and reads/writes not going to it. I really just can't believe ZFS isn't noticing at all...)

The disks are also on a hot-swap-capable backplane, I assume.

As you say, the fact that ZFS still has the device online may give it a 'hold' on the device, stopping the system from re-creating it, which would tie in with the fact that it appears as soon as you offline the device in the zpool.

Also the output of zpool online is confusing. I'm not sure why it would online the drive successfully (suggesting it's found the ZFS metadata and it ties up with the zpool), but bring it online faulted. If it's actually faulted then it shouldn't say it's onlined it (as far as I'm aware you can't have an online faulted device, makes no sense...). If the disk isn't faulted then it should resilver any changes to bring it in sync.

Interestingly, looking at the ZFS source, just after printing "warning: device '%s' onlined, but remains in faulted state\n", it should also print "use 'zpool clear' to restore a faulted device\n" if the device is in FAULTED state. What state does the disk actually show in zpool status after you get that message?

The idea of clearing a faulted device to restore it doesn't really make much sense to me, but if the source says so, it may be worth trying the following after re-inserting the disk. (It appears you're just testing at the moment, so I assume you're not worried about data loss, and it should just error out if it doesn't want to let you do it.)

Code:
zpool offline pool adaX
zpool online pool adaX
zpool clear pool adaX
 
@Henu

OK, sorry to keep asking, but this is some seriously weird behaviour. Could you do that one more time, except like this:
# ls -lah /dev/adaX
# tail -f /var/log/messages
Then pull out the "adaX" drive, wait 30s then push it back in and wait another 30s. Then while still having tail active, from another terminal:
# zpool offline pool adaX
After that you abort tail, then:
# ls -lah /dev/adaX
again, and paste the output.

This is probably something fs@freebsd.org could be interested in seeing. You're likely to have to account for the shape of your kernconf, how the sources were obtained, any make tweaks, and so on.

My best advice would be to redo this sequence from 9.0 live media and see if the behaviour is the same, to rule out anything weird with your installation.

/Sebulon
 
usdmatt said:
Can we see the zpool status output after writing to the phantom device, and maybe the mount output as well?

Here is a write of 200 MB while one of the disks was pulled out the whole time.
[root@machine ~]# zpool status storage
pool: storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: [url]http://www.sun.com/msg/ZFS-8000-9P[/url]
scan: resilvered 291G in 1h19m with 0 errors on Thu Jul 12 12:11:30 2012
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada1    ONLINE      91 9.31K     0
            ada0    ONLINE       0     0     0

errors: No known data errors

That resilvering message is from a previous resilver; ignore it.

[root@machine ~]# mount|grep storage
storage on /storage (zfs, local, noatime, nfsv4acls)
storage/backup on /storage/backup (zfs, local, noatime, nfsv4acls)
storage/system on /storage/system (zfs, local, noatime, nfsv4acls)


usdmatt said:
The disks are also on a hot-swap-capable backplane, I assume.

The case is a Chenbro ES34069, and I believe it has hot-swap capability.

usdmatt said:
Also the output of zpool online is confusing. I'm not sure why it would online the drive successfully (suggesting it's found the ZFS metadata and it ties up with the zpool), but bring it online faulted. If it's actually faulted then it shouldn't say it's onlined it (as far as I'm aware you can't have an online faulted device, makes no sense...). If the disk isn't faulted then it should resilver any changes to bring it in sync.

Interestingly, looking at the ZFS source, just after printing "warning: device '%s' onlined, but remains in faulted state\n", it should also print "use 'zpool clear' to restore a faulted device\n" if the device is in FAULTED state. What state does the disk actually show in zpool status after you get that message?

Oh, I'm sorry, it does print "use 'zpool clear' to restore a faulted device"; I just thought that was not important, so I removed that line from the output I pasted here.

zpool status after 1) removing the disk, 2) running "zpool offline storage ada0", 3) putting the disk back and 4) running "zpool online storage ada0" says:
[root@machine ~]# zpool status storage
pool: storage
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 291G in 1h19m with 0 errors on Thu Jul 12 08:57:51 2012
config:

        NAME        STATE     READ WRITE CKSUM
        storage     DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            ada1    ONLINE       0     0     0
            ada0    FAULTED      3     0     0  too many errors

errors: No known data errors

The resilvering message is again from a previous resilver.

usdmatt said:
The idea of clearing a faulted device to restore it doesn't really make much sense to me, but if the source says so, it may be worth trying the following after re-inserting the disk. (It appears you're just testing at the moment, so I assume you're not worried about data loss, and it should just error out if it doesn't want to let you do it.)

Code:
zpool offline pool adaX
zpool online pool adaX
zpool clear pool adaX

I do not have any real data on these disks. I'm just trying to make hot swapping work as smoothly as possible and be as error tolerant as possible in case the wrong disk is accidentally removed; both the software and hardware are currently for this purpose only. So I can do whatever you/I want to those disks :) But anyway, running "zpool clear storage ada0", "zpool clear storage", etc. does not do anything in this case.

Sorry for the slow replies, but every time I need to resilver, it takes more than an hour. I also have other things to do, and I only work on this during office hours :/
 
OK, well, I've just tested the same thing with a FreeBSD 9.0 live CD on an HP DL120 that happens to be in my office at the moment.

Offlining the disk with zpool, then removing the disk, reinserting and onlining works perfectly as expected. The following messages are printed the instant the disk is removed:

Code:
lost device
removing device entry

If I pull the disk out while it's online, only the first message, 'lost device', is printed. ZFS does clock up write errors though. I'm not sure whether ZFS on FreeBSD will fault the disk after enough write errors.

Putting the disk back in does not recreate the device. As soon as the disk is offlined in the pool, the 'removing device entry' message is printed (which should have come up right after 'lost device'), immediately followed by all the new device messages. At this point there's no way to get the disk back online in the pool. I suspect it should be possible to 'zpool clear' the device as in the message, but as reported, this does nothing.

It appears the only answer for this at the moment is a reboot (I didn't try this, but the first post mentioned it brought everything back correctly). Further investigation would probably need to be done on the fs mailing list.

The main two issues I see are:

1) An online disk in a zpool affects the ability of FreeBSD to remove/recreate the device on hot-swap. This could be deemed expected behaviour, as the device is in use, and it's easy enough to get around: you just have to offline the device in the pool when you realise you've pulled out the wrong disk, or pulled out the right disk without offlining it first.

2) If a disk in ZFS is pulled live, you can't bring it online even when it's back as a valid, working device in the OS. It would be interesting to see what happens in Solaris/OpenIndiana when this is done. This is probably one for the ZFS devs to look at, although they could just come to the conclusion that you need to reboot in this case (if it fixes it) as you shouldn't really pull an online disk (although you'd have an argument if it works fine on other OS's).
 
One other thing: exporting the pool allows FreeBSD to complete the device removal and reattach it. You can then import the pool and the disk is back online. At least, this is what just happened for me. Obviously you'd want to run a scrub, which will probably find (and hopefully fix) errors, but this may be an acceptable way of getting it back online without a reboot.
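
For the record, the rough sequence was something like this ("tank" is a placeholder pool name):
Code:
# With the disk already re-inserted but its /dev/adaX node still missing:
zpool export tank
# At this point the device removal completes and /dev/adaX reappears.
zpool import tank
# Then check for (and hopefully repair) any damage:
zpool scrub tank
zpool status -v tank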

It won't help if you're running root on your pool, though.
 
usdmatt, are you saying that you see the same behaviour? I.e. a removed drive does not put your zpool into a DEGRADED/FAULTED state? (Because that is the issue Henu is seeing, right?)

I don't quite understand what your other points are, but let me just add this: when I add/replace a drive, I always have to run # camcontrol rescan all to have it (re)detected by FreeBSD. That might also work for your 2nd issue?
 
Yes, if you pull out a hot-swap disk while it's still ONLINE in ZFS, it stays ONLINE. Writing to the pool just clocks up write errors. I didn't test for long enough to see if ZFS finally decides it's had enough errors and faults it, but Henu suggests it never does. It may do on Solaris, but unfortunately the integration between the system/devices and ZFS doesn't seem as good on FreeBSD.

I don't think a rescan would help, although I can try. With the drives not used for anything, I can pull/insert them as much as I like and the devices disappear/reappear automatically as expected. It does seem that ZFS has a hold on the device. As I said, it's easy to get around this by offlining the disk in ZFS, although ideally ZFS on FreeBSD should fault or unavail the disk.

The question is whether it should be possible to get this drive back online in ZFS without having to export/import or reboot. It would be good to know if this is even possible on Solaris or the other OpenSolaris-based OSes.
 