ZFS raidz2 keeps freezing 8.0 64bit system

Hi all,

First, I initially posted about this at http://forums.freebsd.org/showthread.php?t=4623, but the problem apparently isn't related to kvm_getenvv after all; it seems related to ZFS, so I hope it belongs here.

I created a zpool approximately 3 months ago on 8.0 x86_64: a six-disk (1TB WD1001FALS drives) raidz2 pool using ZFS v13, sitting on top of geli encrypted volumes. I also use GNOME. Starting a couple of weeks ago, typically during very long rsync transfers to computers across a local switch, I began to experience system hangs where the whole machine froze partway through the transfer (the other machine was fine). This has gotten worse to the point where the system will totally freeze within seconds of mounting the ZFS, even without doing anything with it. I ran a scrub yesterday but the system froze about 3/4 of the way through, after approximately 6 hours.
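If it's useful, the pool was originally created along these lines (the device names and key path here are placeholders from memory, not the exact commands I ran):

Code:
# attach each of the six geli providers, then build the raidz2 pool on them
geli attach -k /path/to/keyfile /dev/da0    # repeated for da1 through da5
zpool create tank.zfs raidz2 da0.eli da1.eli da2.eli da3.eli da4.eli da5.eli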

I just did a clean install on a separate disk with 8.0, and got the same problem. It also happened with 8.1 on an SSD.

At the crash in the fresh install of 8.0, after attaching the geli volumes and typing
Code:
zpool import {raidz2.volume}
the system hangs and the error message I see is:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x28
fault code = supervisor read data, page not present
instruction pointer = (didn't copy here)
frame pointer = ""
code segment = ""
processor eflags = interrupt enabled, resume, IOPL=0
current process = 15 (txg_thread_enter)
trap number = 12
panic: page fault
cpuid = 3
Uptime: 9m4s
Cannot dump. Device not defined or unavailable.
Automatic reboot....


Then she freezes..

I also ran memtest but got no errors. The machine also contains a SCSI card and an LSI 8-port SAS controller.

As Galactic_Dominator suggested in the other thread, I examined the drives for physical damage. I used WD's 'Data Lifeguard Diagnostic Tools' and ran the 'quick test' (chosen because it doesn't make any changes); no errors were found. I could run the more extensive test, but I'd rather not since the drives are all fairly new (< 1 year) and it potentially makes changes if it finds errors. I'm afraid such changes might make the volumes unusable; then again, with raidz2 two of them could be lost and the pool could still be recovered.

Any help on how to import this ZFS volume would be MUCH obliged!!

Thanks in advance,
-bg
 
Update - removing the LSI card and booting the system disk from USB via a SATA-to-USB adapter also froze within seconds after geli-attaching the six volumes and typing
Code:
zpool status -x
The funny thing is, it quickly returned 'all pools are healthy' before it froze. I had worked with the system for 20 minutes or so before attaching those volumes (ran fsck, etc.), but as soon as that ZFS volume was touched, the whole thing froze.
 
I specifically didn't suggest memtest because it can only show that memory is bad; it cannot prove it is good. There are certain memory errors memtest cannot detect. Please use the procedure I gave earlier to validate your memory. It's possible the ZFS/GELI combo is stressing your memory in ways normal operations don't. As I said earlier, it looks like a hardware issue. If there is any way you can move the drives to another system, I suggest you try them there. If you have another computer, you could boot from an mfsbsd CD, import the pool, and see what happens. If it works, you've at least ruled out a large set of possibilities.
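Roughly, from the mfsbsd shell that would be something like the following (your key location and device names will differ, so treat this as a sketch):

Code:
# attach each geli provider, then force the import (the pool was never
# cleanly exported from the old system)
geli attach -k /mnt/keys/tank.key /dev/da0    # repeat for all six members
zpool import -f tank.zfs
zpool status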

Also, an extended hard drive test isn't usually destructive. I can't remember the details of all of them, though, so I can't say for certain about your setup.

It may come down to having the crash dump analyzed.
 
Thanks for following up -- I swapped out the memory into two sets and got the same freeze.

There were some reported issues with the WD1001FALS drives requiring a utility to lengthen a timeout, but I never did this, and I didn't find anything recommending it either way for FreeBSD + ZFS.

I'll move the pool over to another system and see what happens.. it will be a few hours before I can attempt this..

Thanks again,
-bg
 
big_girl said:
Thanks for following up -- I swapped out the memory into two sets and got the same freeze.

Well, that pretty much rules out RAM then, but it would still be nice to see what happens on another system. The CPU, L1/L2 cache, and memory controller are still in play.

big_girl said:
There were some issues with the WD1001FALS drives and needing to use a program to adjust the timeout (to make it longer) but I never did this and I didn't find anything recommending this either way for freeBSD + ZFS.
I believe you're talking about the wdidle utility, and that in theory should not have anything to do with the issue you're seeing. All that does is prevent the drive from parking its heads so often. Maybe you mean something else.

If all that fails, try the STABLE mailing list. There are some good ZFS/SMART people there who may have better ideas.
 
I wonder if there's any utility in swapping out one of the six disks in the RAIDZ2 and seeing if I get the freeze? Very little extra work besides the reboots, and it would rule out a damaged HD?

Or a waste of time?

Thanks again,
-bg
 
I think that would work. You can try it before you try the pool elsewhere. Worth a shot, I guess. I don't think I've ever seen a bad HD cause a panic, but then again I've never used your type of setup either.

EDIT: I should have said I've never seen a bad HD cause a panic when it's part of an abstracted redundant device like gmirror or raidz. A single-drive setup with a disappearing device of course can and does cause a panic.
 
Nope, no luck. I omitted two drives at a time, in three different combinations, and got the freeze each time. Omitting all the drives gave no freeze.

Next to another system..
 
Don't know if this is helpful or not, but on the generic 8.0 R4 64bit kernel, I get the same instruction pointer for each freeze, which is

Code:
0x20:0xffffffff80e662d3
The attributes of the code segment are also the same each time, while the stack and frame pointer addresses vary.

According to http://www.freebsd.org/doc/en/books/faq/advanced.html one can use 'nm' to get more info about the function at that address, but I am unsure which kernel file to point it at, so I haven't succeeded with 'nm' yet.
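If I'm reading the FAQ right, the lookup would go something like this (assuming the running kernel is the stock one at /boot/kernel/kernel):

Code:
# dump the kernel symbol table sorted by address
nm /boot/kernel/kernel | sort > /tmp/kernel.syms
# the function containing the fault is the last symbol at or below
# the instruction pointer 0xffffffff80e662d3
grep '^ffffffff80e6' /tmp/kernel.syms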

Thanks,
-bg
 
Unfortunately on another system the exact same problem occurs shortly after importing the volume.

The last thing I did before trying the disks on a different system was attempt a scrub a couple of nights ago, which crashed when it was nearly finished. When I imported the RAIDZ2 ZFS volume on the new system just now and typed
Code:
zpool status
it informed me that the scrub had resumed, but then, within a minute of importing, I got the same freeze and error message on the screen with the same instruction pointer address.

This was also on 8.0 R4 64bit.

Thanks,
-bg
 
Pure excitement -- my laptop, a Lenovo G530, is also running 8.0 64bit and appears to be showing the early signs of the same problems. I've got a geli-encrypted USB disk I connect to it, and also a geli-encrypted partition, both with ZFS volumes.

Either ZFS v13 on 8.0 doesn't really work, or I'm making some systematic error that causes all of my volumes to eventually produce unrecoverable system hangs.

Please help.

Love always,
-bg
 
There have been a lot of ZFS improvements since 8.0. I didn't suggest it earlier because you indicated you are worried about data loss, but upgrading to STABLE would be a logical step to see if any of those improvements resolve the issue. I seriously doubt such an upgrade would eat your data, but you never know. Otherwise, take it to the stable or fs mailing lists; people there are far more expert in this area.
 
Word. I can easily dd the disks to duplicates before doing something risky, but the second thing I tried previously was installing 8.1 and then importing the RAIDZ2 volume. Same freeze.
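(For the duplication I'd do something like this per disk, with hypothetical source/target device names:)

Code:
# clone one pool member onto a same-size spare disk
dd if=/dev/da0 of=/dev/da6 bs=1m conv=noerror,sync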

I feel like the data is there and is OK (plus I have a recent backup), but there's definitely something I'm missing..

Thanks,
-bg
 
Thanks for this - I always put
Code:
geli_autodetach="NO"
in /etc/rc.conf from the beginning, at install time, so that's likely/hopefully not it (this includes the laptop and the other system I used for testing the other day).

But I'm pretty convinced I'm bunging something up, so here's more info:

Since the main system (with the 6x1TB WD1001FALS RAIDZ2 volume - this is the one that usually panics within a minute of issuing a zpool command; however, if I run [cmd=]zpool scrub[/cmd] right away after geli-attaching the six volumes, the scrub has run for approximately 6 hours, almost to completion) has 8GB RAM, I didn't tune ZFS (v13) at all.

I never decrypt/mount at boot time. I boot the computer into GNOME, start a root terminal session, decrypt the key (kept on a UFS2 USB stick), use the key to attach the volumes, unmount/destroy the key, and finally mount the ZFS.
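In commands, the routine is roughly this (device names, mount point and key path are placeholders):

Code:
mount /dev/da6 /mnt/keys                      # UFS2 USB stick holding the keyfile
geli attach -k /mnt/keys/tank.key /dev/ad4    # repeated for each of the six disks
umount /mnt/keys                              # get rid of the key again
zfs mount -a                                  # finally mount the ZFS filesystems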

I don't have any entries in fstab for any of these volumes.

I'm trying to brainstorm anything else that's typical of how I've been setting up these geli/ZFS volumes.

Does anything here stand out as a potential point of failure? Like I said, I did try once with an 8.1 install to import the pool, but appeared to get the same freeze requiring a hard restart.

Thanks again,
-bg


EDIT - a couple of other thoughts: the freezes on the big volume also coincided with my switch from scp to rsync for fairly large (50GB-800GB) file transfers over a local network switch (if you google rsync, ZFS and FreeBSD, there are apparently some issues). I think the last crash came during the biggest transfer so far, which was supposed to be 800GB and crashed after about 350GB. That transfer also went to a ZFS filesystem within the zpool; everything else in the zpool is in plain folders (not separate filesystems). I usually set ownership to root and permissions to 400 within the zpool (which should include the ZFS filesystem in it, as well as all the folders created directly under the zpool's mount point), set ownership of the ZFS filesystem to me with permissions of something like 755, and do the transfer as me (not root).
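(In commands, that permission scheme amounts to roughly the following, with 'backups' standing in for the child filesystem and 'bg' for my user:)

Code:
# pool root and plain folders: root-owned, mode 400
chown -R root:wheel /tank.zfs
chmod -R 400 /tank.zfs
# the child ZFS filesystem: owned by my user, mode 755
chown -R bg /tank.zfs/backups
chmod -R 755 /tank.zfs/backups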

The other thing I can think of is that, also during a large file transfer maybe a week or so before the problems started, my stupid cousin came to visit and started pushing buttons on my computers; she rebooted my FreeBSD box with the RAIDZ2 array while it was transferring files to a backup zpool (a 4 x 1TB RAIDZ) via either rsync or scp.
 
I don't think there's anything wrong with the setup; it seems logical and definitely shouldn't cause a panic. I use GELI/UFS volumes in a similar manner.

Perhaps export the decrypted GELI providers over iSCSI, then connect to them from an OSOL host and import the pool there. See what happens.

The other system you connected the pool to earlier - did it have the identical SATA controller?
 
Yes, for the other system I used the same LSI card; on the main system where the volume lives I've used either the onboard (Intel G35) or the LSI card.

For the card I added
Code:
mpt_load="YES"
to /boot/loader.conf recently, after the problems started, but had forgotten to do it previously.

It will take a while, but decrypting and exporting the volumes is a good idea. That way I can see whether the issue stems from geli or from ZFS.

Thanks,
-bg
 
dnode.c

geli-attaching the disks with verbosity showed nothing awry. Then (still as root) [cmd=]zdb -v tank.zfs[/cmd] ran for a while, printed out a bunch of file names and attributes, and finally produced this error (but no freeze!):

Code:
Assertion failed: size <= (1ULL << 17) (0x2c0000 <= 0x20000), file
/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c, line 264.

Abort

Trying to reproduce the error now; then I'll boot 8.1-RELEASE and run [cmd=]zdb[/cmd] again.

-bg
 
Incidentally, it appears that *maybe* the issue with the ZFS volume on the USB disk had to do with its 'path=' often being incorrect; i.e. when I ran zdb it was revealed that the 'path=' parameter for this volume was set to 'da0.eli' when the disk was actually at 'da1.eli'. Typing [cmd=]zdb <usb zfs disk name>[/cmd] threw an error about not finding it. Unplugging and replugging so it was back on 'da0.eli' allowed [cmd=]zdb <usb zfs disk name>[/cmd] to run to completion and report no errors.

But I can't help thinking that, in the case of the USB drive, unclean unmounts and the recent hangs might be related.

The problematic RAIDZ2 volume had 'path=' set correctly.
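For reference, the 'path=' values can be checked without importing, via the cached pool configuration (pool name here as an example):

Code:
# print the cached pool configuration, including each vdev's path=
zdb -C tank.zfs | grep path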

-bg
 
Alright: since running [cmd=]zdb -v tank.zfs[/cmd] previously produced the error I posted above, I ran [cmd=]zdb tank.zfs[/cmd] on the same RAIDZ2 volume again (omitting the verbose flag this time) in an attempt to reproduce the error.

However, instead of running for a short while (minutes) as before and then crashing, this time it ran for approximately 36 hours before finally bonking with this error:

Code:
Assertion failed: fsize <= (1ULL << 17) (0x15ce800 <= 0x20000), file
/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c, line 422.
Abort

For the heck of it, I then tried [cmd=]zpool export tank.zfs[/cmd] (without mounting the ZFS volume, as I cannot do that without generating a panic and system freeze).

But after a few seconds I got the
Code:
Fatal trap 12: page fault while in kernel mode
error again. This time it didn't freeze up but rather rebooted.

I wonder if the next step might be trying 8.1 Release and the 'zdb' command?

Thanks again,
-bg
 
Like I said earlier, 8-STABLE is a better choice. You can then try upgrading to zpool version 15 and pick up all the other ZFS fixes/improvements since 8.0. Although I'm not sure about the wisdom of advising the upgrade when a scrub will not successfully complete; but then again, given where it's at now...
 
Ahh, I see -- before your post I mistakenly assumed 8.1-RELEASE was what I wanted, but clearly I want STABLE. I'll use this month's 8.1-STABLE amd64 snapshot from ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/201010/ and give it a shot.

And as I said earlier, I do have a recent backup so it's not a huge issue if the RAIDZ2 gets mangled.

After booting into 8-STABLE, do I have to [cmd=]zpool import[/cmd] the pool BEFORE upgrading it to v15? Also, is there any wisdom in trying [cmd=]zdb[/cmd] or any other troubleshooting on this pool first (before import/upgrade) after booting 8-STABLE?

And as I mentioned previously, I have one zfs filesystem within the zpool; it was my understanding that the zpool should be upgraded first, and then any separate zfs filesystems within the pool also need to be upgraded separately.
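So if I have it right, the sequence would be something like this (a sketch, using the pool name from above):

Code:
zpool import tank.zfs     # the pool has to be imported first
zpool upgrade tank.zfs    # bump the pool itself to the new on-disk version (v15)
zfs upgrade -a            # then upgrade the filesystem(s) within it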

DD - thanks for your comment about the [cmd] and [code] tags -- I'll be more careful.

Thanks and all the best,
-bg
 
Getting killed here... The exact same issue (Fatal trap 12), unfortunately, after running 8-STABLE (ZFS v15) and attempting to import the zpool.

Will run [cmd=]zdb[/cmd] again on this zpool and report how it crashes..

-bg
 