ZFS crashes on scrub or device removal

I have a ZFS pool which I want to use for testing purposes; my aim is to learn more about the pool and how to recover from disk failures etc. before I place it in a semi-productive environment. The pool is based on files created with [cmd=""]mkfile[/cmd], and configured as below.

Code:
hostname# zpool status
pool: myzfs
state: UNAVAIL
scrub: none requested
config:

        NAME                              STATE     READ WRITE CKSUM
        myzfs                             UNAVAIL      0     0     0  insufficient replicas
          raidz2                          UNAVAIL      0     0     0  corrupted data
            /home/mix_room/ZFSTEST/file1  ONLINE       0     0     0
            /home/mix_room/ZFSTEST/file2  ONLINE       0     0     0
            /home/mix_room/ZFSTEST/file3  ONLINE       0     0     0
            /home/mix_room/ZFSTEST/file4  ONLINE       0     0     0
            /home/mix_room/ZFSTEST/file5  ONLINE       0     0     0
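
For reference, I created the pool roughly like this (reconstructed from memory, so the exact size may be slightly off):

Code:
mkfile 512m /home/mix_room/ZFSTEST/file{1,2,3,4,5}
zpool create myzfs raidz2 /home/mix_room/ZFSTEST/file{1,2,3,4,5}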

The creation was fine, but whenever I try to either a) scrub the pool, b) offline a device, or c) destroy the entire pool, my computer crashes. I would like to determine why this is; perhaps someone could point me in the direction of how to proceed to determine the cause of these problems. Fortunately my machine, which is located away from me, restarts itself most of the time.

Code:
uname -a 
FreeBSD host.domain.tld 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #0: Tue Jan  5 21:11:58 UTC 2010     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
 
Seems like it might be a similar problem.

I tried disabling prefetch, but the problem remains. When I scrub the pool the machine reboots.

[cmd="/boot/loader.conf"]
snd_hda_load="YES"
vfs.zfs.prefetch_disable=1
[/cmd]
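Whether the tunable actually took effect can be verified after boot with sysctl(8):

Code:
# sysctl vfs.zfs.prefetch_disable
vfs.zfs.prefetch_disable: 1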
 
Last time I tried to scrub my ZFS pool the machine died and would not restart automagically, so that caused a slight delay before I could get access to it.

Console logs show the following error:

Code:
Apr 26 14:30:48 hostname root: ZFS: vdev failure, zpool=myzfs type=vdev.open_failed
Could this be the cause of the crash?
 
I think [cmd=""]bgfsck[/cmd] might very well be running. But as the machine died hard again, I have no way of checking at the moment. I will have to visit it again to reboot it.
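Once it comes back up I'll check for it with something along these lines (background_fsck being the rc.conf knob that controls it, as far as I know):

Code:
ps ax | grep -i fsck
grep background_fsck /etc/defaults/rc.conf /etc/rc.conf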

I don't see why the filesystem being dirty should cause ZFS to crash the machine.
 
Check /var/log/messages and /var/log/console.log (you enabled that in /etc/syslog.conf, right?) above that error message to see if there's anything about harddrive timeouts or errors. It sounds like one of the drives couldn't be accessed.
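Something along these lines (the grep patterns are just suggestions, adjust as needed):

Code:
# /etc/syslog.conf -- uncomment (or add) this line, then
# touch /var/log/console.log and restart syslogd:
console.info                                    /var/log/console.log

# then look for drive trouble around the time of the crash:
grep -iE 'timeout|error|detached' /var/log/messages /var/log/console.log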
 
phoenix said:
Check /var/log/messages and /var/log/console.log (you enabled that in /etc/syslog.conf, right?) above that error message to see if there's anything about harddrive timeouts or errors. It sounds like one of the drives couldn't be accessed.

Yes, /var/log/console.log was enabled in /etc/syslog.conf. That the device cannot be accessed makes perfect sense, as I deleted it (# rm FILE) in order to simulate a hard-drive failure. I'm not using physical drives as I don't have enough of them.

However, it still doesn't make sense to me that ZFS can kill the entire machine just by running a scrub. I have a hot spare which should take over, but it doesn't; here I am assuming some mistaken setting in the hot-spare configuration. But why does it crash instead of returning the status of the pool as completely defective?
 
My understanding is that FreeBSD doesn't support hot-spares in ZFS, only cold-spares where you have to manually call "zpool replace" using the spare vdev. I've seen messages on the mailing lists about patches to enable hot-spares, but don't know for sure if they've been committed yet.
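I.e., something along these lines ("spare1" is just an example name for the spare vdev, not from your setup):

Code:
# manually rebuild onto the spare vdev ("spare1" is a made-up name here)
zpool replace myzfs /home/mix_room/ZFSTEST/file1 /home/mix_room/ZFSTEST/spare1
zpool status myzfs    # watch the resilver progress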
 
After your earlier comment about the file being missing, I checked. I arranged another machine with FreeBSD 8.0 and ZFS. Scrubbing with all files present works fine and returns without errors. When I remove one of the files, scrub no longer works: the machine dies and reboots. The file that I previously deleted is also still shown. Seems like peculiar behaviour to me.
 
Since you are deliberately trying to cause a filesystem error, perhaps scrub is finding the problem and a reboot is required to repair?
 
You might be right, but if that is true, I would consider it a bug as it is unexpected behavior. I don't want to submit a PR before I know what is causing it, so I need to find out more.
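I guess the first step is to capture a kernel crash dump so there is something concrete to attach to a PR. Roughly (assuming the swap device is large enough to hold the dump):

Code:
# /etc/rc.conf
dumpdev="AUTO"    # dump to the configured swap device on panic

# after the next crash savecore(8) should leave a vmcore in /var/crash,
# and the backtrace can be pulled out with:
kgdb /boot/kernel/kernel /var/crash/vmcore.0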
 
I am pretty sure you aren't supposed to run scrub on a pool with missing/faulted devices. Reattach or replace said devices, wait for the resilver to finish, THEN run scrub.
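I.e., roughly (using the file names from your first post; <old> and <new> are placeholders):

Code:
zpool online myzfs /home/mix_room/ZFSTEST/file1   # or: zpool replace myzfs <old> <new> if the original is gone
zpool status myzfs                                # wait here until the resilver completes
zpool scrub myzfs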
 
Jago said:
I am pretty sure you aren't supposed to run scrub on a pool with missing/faulted devices. Reattach or replace said devices, wait for the resilver to finish, THEN run scrub.

Can't do that as offlining the device is one of the things that causes the machine to hang.

And in any case: IF there is a problem with running a scrub when a device is missing, then this should be written into the documentation. I was not able to find any indication that this is the case.
 
There's no problem with running a scrub on a pool with a missing device. I've done that multiple times over the past 10 days trying to fix an error with a faulted replacement drive in a raidz2 vdev ("unable to replace a replacing drive", nasty software bug that's not fixed until ZFSv20-something -- luckily the original drive still worked and I could use that).

You also don't need to run a scrub after a re-silver. A re-silver is a scrub. The only difference is that a scrub is read-only while a re-silver is read/write.

One other thing to try is to configure it with real disks and not files for the vdevs. If that works, then there's an issue with the file-backed vdev support.

Are you using ZFSv13 or v14? I think 8.0 only includes v13. You may want to upgrade to 8-STABLE, update the pool to v14, and then try to scrub/offline/etc.
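The pool-version bump itself should just be a matter of:

Code:
zpool upgrade -v       # shows which pool versions the running kernel supports
zpool upgrade myzfs    # one-way upgrade of the pool, so only do it after the OS upgrade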
 
phoenix said:
One other thing to try is to configure it with real disks and not files for the vdevs. If that works, then there's an issue with the file-backed vdev support.
Hmm, that might be difficult without actually owning the disks. Should it work with memory disks as well? Or do I have to go out and buy disks?

Are you using ZFSv13 or v14? I think 8.0 only includes v13. You may want to upgrade to 8-STABLE, update the pool to v14, and then try to scrub/offline/etc.
Will try to do that.
 
To answer Phoenix's question, I was running ZFSv13.

Using # mdconfig -a did not help; it still crashes when I remove a drive and then try to scrub.
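Roughly what I did (the pool name and the swap-backed flavour are just how I reproduced it, details from memory):

Code:
mdconfig -a -t swap -s 512m    # run five times -> md0 .. md4 (512 MB swap-backed memory disks)
zpool create mdtest raidz2 md0 md1 md2 md3 md4
zpool scrub mdtest             # fine
mdconfig -d -u 0               # detach md0 to simulate a failed disk (-o force if it refuses)
zpool scrub mdtest             # machine dies, same as with plain files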

Is there anyone who has some drives that they could test this on?

Which mailing-list shall I submit the problem to, or shall I submit a PR?
 
I know it works on real disks. :) I just spent the last 10 days doing this. Scrub with 1 disk missing from a raidz2 vdev works without issues.

I was trying to recover from a situation where a replacement disk died during the resilver process, and I could not replace it due to a software bug "can't replace a replacing drive". One of the steps was to scrub the pool with the replacement drive in (not seen by pool) and without the drive plugged in.

The scrub worked perfectly in both situations.

(The eventual solution was to plug in the old drive which put the pool back into the original state. The bug is a real one in OpenSolaris but supposedly fixed in ZFSv20-something.)
 
phoenix said:
I know it works on real disks. :) I just spent the last 10 days doing this. Scrub with 1 disk missing from a raidz2 vdev works without issues.
Ok, then I'll trust you on that one.

Just found out that the same error seems to appear on both i386 and amd64 architectures when running 8.0-RELEASE-p2, both using ZFSv13.

I don't have a machine that I feel I can update to 8.0-STABLE; the only spare machine I have was mistakenly updated to -CURRENT, so that is the route I'm going to go down.

This bug is leading me to consider buying a telnet power-strip so that I can remotely reboot my computer when it isn't responding. Very annoying to have to travel by car to power-cycle the machine.
 
Finally got my laptop upgraded to -STABLE.

Am running ZFS pool version 14.

Code:
uname -rs
FreeBSD 8.0-STABLE-201004

The problem remains when using memory or file-backed devices, but not when using disks.

Test 1)
Code:
mkfile 512m file{1,2,3,4,5}
zpool create tank raidz2 file{1,2,3,4,5}
zpool scrub tank ->> ok
rm file1
zpool scrub tank ->> crash

Test 2), where da0a is a 512MB partition on a USB drive
Code:
mkfile 512m file{1,2,3,4,5}
zpool create -f tank raidz2 da0a file{2,3,4,5}
zpool scrub tank ->> ok
REMOVE da0 (physically unplug the USB drive)
zpool scrub tank ->> ok, but faulty array as expected

Seems as though there is a problem with the file-backed vdev.
 
"zpool offline" hangs w/ file-backed storage?

I am also having a "hard hang" when trying to do a "zpool offline" of a file-backed device from a raidz2 pool created from 4 disks and 1 sparse file, using 8.1-BETA1.

I was trying to create a 5-drive raidz2 pool, initially in a 4-drive + 1-sparse-file configuration, where I'd immediately degrade the pool to just the 4 drives by offlining the sparse file. This is because the 5th drive already contains data that I want to migrate to the raidz2 pool, after which I'd repurpose the 5th drive as the last drive in the pool.
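Roughly, the sequence I'm attempting looks like this (the /root path, the 1T size and the disk names are placeholders, not my actual layout; zpool wants an absolute path for file vdevs):

Code:
# sparse placeholder file, nominally the same size as the real disks
truncate -s 1T /root/sparsefile
# 4 real disks plus the sparse file
zpool create -f mypool raidz2 disk1 disk2 disk3 disk4 /root/sparsefile
# immediately degrade the pool by taking the placeholder offline  <-- this is what hangs
zpool offline mypool /root/sparsefile
# ...copy the data from the 5th disk (call it disk5) into mypool...
# then recycle the 5th disk into the pool in place of the placeholder
zpool replace mypool /root/sparsefile disk5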

The "zpool create -f mypool raidz2 disk1 disk2 disk3 disk4 sparsefile" works OK.
The "zpool offline mypool sparsefile" immediately hangs the system hard, requiring a power cycle.

After the first reboot, I deleted the sparsefile by hand, causing the pool to degrade when it next came up on its own. However, it still hangs when I try to offline the sparsefile.

I'm leaning toward the theory that it has something to do with file-backed storage at this point, but don't know what I can do to help diagnose the problem.

 
whyde said:
I am also having a "hard hang" when trying to do a "zpool offline" of a file-backed device from a raidz2 pool created from 4 disks and 1 sparse file, using 8.1-BETA1.

Whoops, wrong forum.

OK, I'll go ahead and censure myself here, since this was technically happening under PC-BSD 8.1-BETA1. However, I believe the problems are related to what's at issue under FreeBSD also.
 