ZFS kernel panic

I am having serious ZFS kernel panics, and I don't know how to fix them.

I bought a used server to set up my "big box" home file server, intending to run a pretty basic setup with 8.2R and using ZFS to manage the storage pool. I'm new to ZFS (I didn't get much exposure to FreeBSD 5 through 7), so I wanted to take some time to familiarize myself: set up some dummy file-backed pools and poke them with a stick until I understood what I'd need to do in the event of a real disk failure.

So I made some empty backing files of 100MB each using dd, and created a raidz pool from them (this is from memory, because the history evaporates when the box panics, but I've done it a few times by now):
Code:
cd /usr/z
dd if=/dev/zero of=disk0 bs=1m count=100
dd if=/dev/zero of=disk1 bs=1m count=100
dd if=/dev/zero of=disk2 bs=1m count=100
dd if=/dev/zero of=disk3 bs=1m count=100
zpool create tank raidz /usr/z/disk0 /usr/z/disk1 /usr/z/disk2
echo hellohowareyou > /tank/hello
zpool add tank spare /usr/z/disk3
zpool status tank
So far, so good: a "zpool status" shows me nothing unexpected. Now I simulate a disk failure and try to make ZFS aware of it:
Code:
dd if=/dev/zero of=disk1 bs=1m count=100
zpool scrub tank
And that's when the kernel panic happens. After this, I can reboot over and over again, and anytime I bring ZFS online by doing a zpool command (it's *not* enabled in rc.conf) to list my pools, or show their status, or whatever...boom, another kernel panic.

This happens on at least two FreeBSD builds. I started with 8.2R-i386 and fought to make it work (because my box can only run 32-bit VMs on top of ESXi, which is a separate challenge, but for this test I was running on the metal), then gave up and tried 8.2R-amd64, and got the same kernel panic under the same circumstances. It is always a page fault 12, supervisor read, page not present. The instruction, stack, and frame pointer values are all (respectively) the same from crash to crash.

As a bonus, my box seems incapable of taking a crash dump. It looks like the dump starts, but then the box locks up and neither finishes the dump nor automatically reboots; after I bounce it, crashinfo says there's nothing there.
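
The dump setup itself is nothing exotic, just the usual rc.conf knobs with swap on the local disk as the dump device, something like:
Code:
# /etc/rc.conf
dumpdev="AUTO"          # dump to the first suitable swap device
dumpdir="/var/crash"    # where savecore(8) puts the core for crashinfo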

My hardware is a Dell PowerEdge 1800, dual 3GHz Xeons (Nocona, I think; HTT yes, VT no), 2Gb RAM. It also has a Dell CERC SATA 1.5/6ch RAID controller, but for these exercises I wasn't using that, rather a separate disk on the onboard SATA port.

I'm not sure where to go from here. I ran Memtest86+ for a while, and got a passing grade. I (basically) tried two different OSes, on different HDDs, and got the same failure. That points to hardware, but the failure mode points to software. What else can I try?
 
Use mdconfig to create file-backed md devices, and use those md devices to create your pool. Do yourself a favor and don't use ZFS on i386 or amd64 with less than 4GiB of RAM. (2Gb = 2Gigabit = 256MiB?)
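
Something like this; the file names and md unit numbers are just an example, and mdconfig prints the unit it actually attaches:
Code:
# sparse 100MB backing files
truncate -s 100m /usr/z/disk0 /usr/z/disk1 /usr/z/disk2 /usr/z/disk3
# attach each file as an md(4) vnode-backed device (prints md0, md1, ...)
mdconfig -a -t vnode -f /usr/z/disk0
mdconfig -a -t vnode -f /usr/z/disk1
mdconfig -a -t vnode -f /usr/z/disk2
mdconfig -a -t vnode -f /usr/z/disk3
# build the pool from the md devices instead of the raw files
zpool create tank raidz md0 md1 md2
zpool add tank spare md3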
 
Thank you Crest, using md devices worked well. I'm still curious about the kernel panics when using files directly, as this is explicitly supported by ZFS -- I wonder if it's worth making a bug report?
 
mcgee said:
I'm still curious about the kernel panics when using files directly, as this is explicitly supported by ZFS -- I wonder if it's worth making a bug report?
If it isn't too much trouble, you might want to try it on 8-STABLE - either by updating your tree and re-building, or from a recent snapshot. There have been a lot of changes in the ZFS code since 8.2-RELEASE - in particular, ZFS v28 was MFC'd. Needless to say, don't upgrade any pools to v28 if you want to be able to use them on older releases.
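
For the "updating your tree" route, the usual procedure is roughly this (assuming you track stable/8 with svn; csup with a stable-supfile works just as well - see the handbook and /usr/src/UPDATING for the full mergemaster/single-user details):
Code:
svn checkout svn://svn.freebsd.org/base/stable/8 /usr/src
cd /usr/src
make buildworld
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
# reboot onto the new kernel, then:
make installworld
mergemaster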
 
Yes, ZFS supports using files for the backing store of a vdev. However, it rarely works in practice. :) It's really only meant for testing and prototyping, and should not be used for any decision-making.

Use real block devices (like md(4)-backed files) if you want to do real testing with real failure modes.
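
With md(4) devices you can also simulate the failure more realistically by yanking the unit out from under the pool instead of overwriting the backing file. Something along these lines (off the top of my head, and the unit numbers assume the example above):
Code:
# forcibly detach "disk1" while the pool is using it
mdconfig -d -u 1 -o force
# the pool should go DEGRADED rather than panic the box
zpool scrub tank
zpool status tank
# swap the hot spare in for the lost device
zpool replace tank md1 md3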
 
phoenix, that's exactly what I thought I was doing: setting up a sandbox where I could play with ZFS to learn the ropes before trying to use it in production. If file-backed vdevs were just a bad idea, didn't work well, or didn't work at all, that would be one thing...but the consistent, full-bore, show-stopping kernel panics are something else, and IMO something that must be fixed. Per Terry's suggestion, I am spinning up some VMs to try to elicit the same bug on -STABLE, both i386 and amd64...in between other tasks, so it may take a few days.
 
Terry_Kennedy said:
If it isn't too much trouble, you might want to try it on 8-STABLE - either by updating your tree and re-building, or from a recent snapshot. There have been a lot of changes in the ZFS code since 8.2-RELEASE - in particular, ZFS v28 was MFC'd. Needless to say, don't upgrade any pools to v28 if you want to be able to use them on older releases.

A bit delayed, but no matter. So I spun up a new ESXi VM with 8 CPUs and 12GB of RAM, installed 8.2-STABLE amd64 from source, and repeated my file-backed experiment on ZFS v28. I got the same kernel panic.
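
For anyone repeating this, a quick way to confirm the pool really is at v28 before poking it:
Code:
zpool get version tank   # version the pool was created at
zpool upgrade -v         # versions this kernel's ZFS supports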

This time I caught the crash dump, though I don't really know how to read it. One possibly useful thing I spotted, in the segment
Code:
Loaded symbols for /boot/kernel/opensolaris.ko
is this:

Code:
#5  0xffffffff808c6d4f in trap (frame=0xffffff83642386f0)
    at /usr/src/sys/amd64/amd64/trap.c:477
#6  0xffffffff808aec24 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:228
#7  0xffffffff8107d6ed in vdev_file_io_start (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:157
#8  0xffffffff81098137 in zio_vdev_io_start (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2374
#9  0xffffffff81097d63 in zio_execute (zio=0xffffff010af24a50)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1196

What I think I see is that everything was fine until vdev_file_io_start, which is what generated the page fault that cascaded into a panic. So if I were debugging, I guess I'd start there. But I'm not a kernel hacker.
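
In case anyone wants to dig further into the dump, kgdb from the base system will open it (the vmcore number will vary), roughly:
Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt        # full backtrace; the frames above came from here
(kgdb) frame 7   # select the vdev_file_io_start frame
(kgdb) list      # show the source around vdev_file.c:157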
 