ZFS Panic Galore

I'm running a 32-bit version of FreeBSD with ZFS. There is 4 GB of RAM in this machine.

Lately it has started crashing every month or so. I had more RAM added and upgraded to 8.3. I tried the PAE kernel, but it would crash after an hour or so.

So I backed down to the regular GENERIC kernel, but now it stays up for an hour if I'm lucky.

[attachment: crash.jpg, screenshot of the kernel panic]


Any insight into what might be wrong?
 

Looks like an HDD problem to me...
Are you sure that the system is not having HDD time-outs, meaning it is losing the connection with the HDD? Look through all the messages in the system's logs for any hint of what's going on.

Install and run sysutils/smartmontools.
# smartctl -a /dev/ada0
will show how many and what type of errors the HDD has had.
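If smartmontools is not installed already, building it from ports should be as simple as something like:
Code:
cd /usr/ports/sysutils/smartmontools && make install clean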
 
There are no hints in the system logs.

Here is some of the output:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD5000AAKS-00V1A0
Serial Number:    WD-WCAWF3101822
LU WWN Device Id: 5 0014ee 1027d88ad
Firmware Version: 05.01D05
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May  4 14:36:31 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

snip...

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART overall-health self-assessment test result: PASSED

To me it would appear that the drive is in good health.
 
Here is another dump.

[attachment: crash1.jpg, screenshot of the kernel panic]


Hardware problem? I just had the RAM replaced with fresh stuff and it happened again.
 

Have you done any tuning of the vm.kmem_size* tunables? On i386 the recommendation is to raise both vm.kmem_size and vm.kmem_size_max to at least 512M for stable operation of ZFS:

http://wiki.freebsd.org/ZFSTuningGuide#i386

As the page states, if you need more than 512 MB of kmem you'll have to compile your own custom kernel with an increased KVA_PAGES setting.
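If I remember the guide correctly, that boils down to adding something like the following to a custom kernel configuration and rebuilding (512 pages roughly doubles the default kernel virtual address space on i386):
Code:
options         KVA_PAGES=512
after which vm.kmem_size and vm.kmem_size_max can be raised further in /boot/loader.conf.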
 
The kernel is a GENERIC 8.3-RELEASE-p1 as of May 4th 2012.

The only startup options are
Code:
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="160M"

These were taken from the ZFSTuningGuide. I also tried recompiling the kernel with the KVA_PAGES option, but it would panic immediately. My assumption is that a GENERIC kernel has the best chance of being consistent with the rest of the world.

I am having serious second thoughts about continuing to run ZFS on a production machine.
 
You would probably have better luck with the amd64 version of FreeBSD; all of this tuning would be mostly unnecessary except for the vfs.zfs.arc_max setting.

I would run both short and long self-tests on the disk drive with smartctl(8) to make sure the drive is OK; sometimes clean SMART stats do not tell the whole story about the drive's health.
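For reference, the self-tests can be started and the results read back roughly like this (assuming the drive is still /dev/ada0; the long test can take an hour or more):
# smartctl -t short /dev/ada0
# smartctl -t long /dev/ada0
# smartctl -l selftest /dev/ada0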
 
Those messages look too much like a hardware error, and I propose you first eliminate any such possibility from your system.
If you are sure that the HDD is fine, try memtest, preferably from a Linux CD, or else sysutils/memtest.
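The port can be installed the usual way, something along the lines of:
Code:
cd /usr/ports/sysutils/memtest && make install clean
and then left running against as much free RAM as possible while the machine is otherwise idle (its manual page describes the exact invocation).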
 
All of the SMART tests passed. I have arranged to have a fresh hard drive installed with a fresh copy of 64-bit FreeBSD on it. I'll see if I can use that to read the ZFS partitions.
 
We completely replaced the hardware. Same issue. I suspect it may be a bug in ZFS where it's unable to deal with some sort of corruption in the pool. There are now two drives in the system: one with 64-bit FreeBSD, and the original. Any suggestions on how I might try to mount this filesystem from within the 64-bit FreeBSD system? Of course the fresh install doesn't know how to find the drive with the old ZFS pools.
 
Sorry to have wasted your time on the hardware side; better to be safe than sorry, though, I think...

Can you tell us where your swap is? Is swap part of the ZFS pool or on its own slice?

Any suggestions on how I might try to mount this filesystem from within the 64-bit FreeBSD system?
# zpool import -f -R /media/rescue <poolname>
-f to force the import, -R to specify where you want it mounted (altroot).
You can also add -o canmount=noauto to the above command to prevent automatic mounting of the datasets, then mount the datasets by hand using
# mount -t zfs pool/dataset <mountpoint>
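Put together for your case it might look something like this (the old pool name is assumed to be email here and the mount point is arbitrary); the zfs list is just to confirm which datasets came in before copying anything off:
# zpool import -f -R /media/rescue email
# zfs list -r email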
 
The swap was on a slice. There was only 1 ZFS partition.

Code:
FreeBSD cl-t153-284cl 8.3-RELEASE FreeBSD 8.3-RELEASE #0: Mon Apr  9 21:23:18 UTC 2012     
root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

Code:
zpool import -f -R /dev/ad16s1g email
Unexpected XML: name=stripesize data="0"
Unexpected XML: name=stripeoffset data="977305088"
---SNIP---
Unexpected XML: name=stripesize data="0"
Unexpected XML: name=stripeoffset data="32256"
cannot import 'email': pool is formatted using a newer ZFS version

The 32-bit and 64-bit versions of the OS were cvsup'd within minutes of each other, so I have no idea why they show different versions. Is it possible that the on-disk format has changed with the 64-bit version? I understood this was one of the strengths of ZFS: a uniform format across all platforms.
 
# zpool upgrade -v
Will give you the version you are running. For details, see zpool(8). The mountpoint in import needs to be a directory name, not a device.
# zpool import -f -R /media email
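If it helps compare the two, the on-disk version of a pool that is already imported can also be read from its version property, e.g.:
# zpool get version <poolname>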
 
It would appear that I'm running version 14. I thought I read somewhere that 8.3 was supposed to be version 28. I was planning to upgrade to 9 anyway, so I'll try that and see if it's able to read the 32-bit filesystem that way.

Just for kicks I tried to rebuild the kernel (dual-booted to 32-bit) so maybe I could get it to run long enough to query what ZFS version it is. If I load the zfs module on that, I get about 5-10 seconds before a kernel panic...
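If I can catch it inside that window, I'm hoping something like this will report the supported pool version without touching the pool at all (assuming that sysctl is present on this build):
# sysctl vfs.zfs.version.spa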
 
I'm still having problems with this ZFS thing. I upgraded to 9.0-RELEASE.

The replacement filesystem has a ZFS pool called email. I need to find a way to access the email ZFS pool on the old disk, which is installed in the system. It's on /dev/ad16s1g, whereas the "good" ZFS system is on /dev/ad6s1g and is up and running properly.

The closest I've gotten is this:
Code:
zpool import email ad16s1g
cannot import 'email': pool may be in use from other system, it was last accessed by mail (hostid: 0xb486f72a) on Sat May  5 22:14:07 2012
use '-f' to import anyway

I'm a little afraid at this moment that it's going to do bad things to my existing pool that has the same name. How will I know which pool is which? I want to mount it, copy the data off, and unmount it.
 
OK, I bit the bullet. I backed up my /email and tried it. Looks like there must be a bug in the kernel.

# zpool import -f email ad16s1g

[attachment: crash2.jpg, screenshot of the kernel panic]
 

Leave out the ad16s1g from the import command; the pool is detected by on-disk metadata, and you cannot give device names as part of the import command like that.
 
How do you tell ZFS to look on that disk for the partition? If I do a zfs list, here is what I get:
Code:
zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
email   118M   356G   118M  /email

That's my new email ZFS filesystem living on /dev/ad0s1g.

As you can see from my feeble attempts above, I really want to work with the one on /dev/ad16s1g so I can copy the data over to this one.
 
The detection is based solely on on-disk metadata; you cannot tell zpool(8) to use a specific device for the import.

If you run just this as root, it should list all pools that are available for import:

# zpool import
 
Running import lists only 1 pool.
Code:
 pool: email
    id: 10433152746165646153
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        email       ONLINE
          ada1s1g   ONLINE

I'm not sure anymore which pool this is...
 
Would you suggest I run the following?
# zpool import -f 10433152746165646153

Nearly every command I've tried so far on this thing has resulted in a kernel panic. So far I've tried 8.3 32-bit, 8.3 64-bit, and now 9.0 64-bit, with very consistent panics.
 
Try this; it will import the pool forcibly and mount it under the temporary mount point /altroot so you can take a look at what the pool contains (and so you avoid a situation where the pool gets mounted over the system directories):

# zpool import -f -R /altroot 10433152746165646153
 
Code:
zpool import -f -R /mnt 10433152746165646153
cannot import 'email': pool already exists

Code:
zpool import -f -R /altroot 10433152746165646153
cannot import 'email': pool already exists

Wasn't sure if the altroot was an option or a path...

Code:
zpool import -f -R /altroot 10433152746165646153 olddata

Did some googling... As soon as I ran this, I got the all-too-familiar kernel panic.

I'll see if I can bring the image home with me.

Code:
dd if=/dev/ad16s1g > zfsimage.dat
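
If the raw file turns out to be too unwieldy to carry around, I may pipe it through gzip instead, something like:
Code:
dd if=/dev/ad16s1g bs=1m | gzip > zfsimage.dat.gz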

I don't think I have time to go to bsdcon but if this problem persists I might have to go over there and see if any of the kernel experts can help.
 