Fatal trap 12 when trying to mount ZFS filesystem

Hi all,

I initialized a scrub on my home NAS, and directly afterwards initialized a sleep function on myself. When I woke up the next morning, the NAS had rebooted and hung after starting ZFS with Fatal trap 12. If I started it up without ZFS, it was OK, but as soon as I tried to mount the ZFS filesystems, it panicked again. That system ran FreeBSD 8.0 i386 compiled from cvsup. I consulted mister Google and found some information where the tip was to upgrade FreeBSD, since 8-STABLE has many ZFS-oriented fixes and improvements. So after upgrading, I started it up with 8.1-STABLE without ZFS - still OK so far. The next step was to try to import the pool, since the new system didn't know it had a pool imported before:
Code:
#zpool import -f zpool

Nov  7 20:59:48 main kernel: ZFS NOTICE: Prefetch is disabled by default on i386 -- to enable,
Nov  7 20:59:48 main kernel: add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
Nov  7 20:59:48 main kernel: ZFS filesystem version 13
Nov  7 20:59:48 main kernel: ZFS storage pool version 13
Nov  7 21:01:25 main syslogd: kernel boot file is /boot/kernel/kernel
Nov  7 21:01:25 main kernel: 
Nov  7 21:01:25 main kernel: 
Nov  7 21:01:25 main kernel: Fatal trap 12: page fault while in kernel mode
Nov  7 21:01:25 main kernel: cpuid = 1; apic id = 01
Nov  7 21:01:25 main kernel: fault virtual address	= 0x201f1f4
Nov  7 21:01:25 main kernel: fault code		= supervisor read, page not present
Nov  7 21:01:25 main kernel: instruction pointer	= 0x20:0x80662d54
Nov  7 21:01:25 main kernel: stack pointer	        = 0x28:0xfe90f12c
Nov  7 21:01:25 main kernel: frame pointer	        = 0x28:0xfe90f174
Nov  7 21:01:25 main kernel: code segment		= base 0x0, limit 0xfffff, type 0x1b
Nov  7 21:01:25 main kernel: = DPL 0, pres 1, def32 1, gran 1
Nov  7 21:01:25 main kernel: processor eflags	= interrupt enabled, resume, IOPL = 0
Nov  7 21:01:25 main kernel: current process		= 937 (thread_enter)
Nov  7 21:01:25 main kernel: trap number		= 12
Nov  7 21:01:25 main kernel: panic: page fault
Nov  7 21:01:25 main kernel: cpuid = 0
Nov  7 21:01:25 main kernel: KDB: stack backtrace:
Nov  7 21:01:25 main kernel: #0 0x8069019f at kdb_backtrace+0x4a
Nov  7 21:01:25 main kernel: #1 0x806582f2 at panic+0x150
Nov  7 21:01:25 main kernel: #2 0x8099ff36 at trap_fatal+0x427
Nov  7 21:01:25 main kernel: #3 0x8099faad at trap_pfault+0x306
Nov  7 21:01:25 main kernel: #4 0x8099f412 at trap+0x59a
Nov  7 21:01:25 main kernel: #5 0x8097a0ec at calltrap+0x6
Nov  7 21:01:25 main kernel: #6 0x80662699 at __sx_xlock+0x5b
Nov  7 21:01:25 main kernel: #7 0x80662a43 at _sx_xlock+0x38
Nov  7 21:01:25 main kernel: #8 0x875683af at dsl_pool_scrub_clean_cb+0x145
Nov  7 21:01:25 main kernel: #9 0x87567113 at scrub_visitbp+0x729
Nov  7 21:01:25 main kernel: #10 0x87566f12 at scrub_visitbp+0x528
Nov  7 21:01:25 main kernel: #11 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #12 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #13 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #14 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #15 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #16 0x87566d10 at scrub_visitbp+0x326
Nov  7 21:01:25 main kernel: #17 0x875670da at scrub_visitbp+0x6f0
Nov  7 21:01:25 main kernel: Uptime: 10m42s
Nov  7 21:01:25 main kernel: Cannot dump. Device not defined or unavailable.
Nov  7 21:01:25 main kernel: Automatic reboot in 15 seconds - press a key on the console to abort
So I thought: could this be caused by me running a custom-built, non-GENERIC FreeBSD system?

Since I run my root disk on a CF card, I just popped the card out, put in an empty one, installed 8.1-RELEASE, and booted the GENERIC kernel. Like last time, everything works fine until I try to import the pool, and then it panics.
The pool itself, however, seems to be OK, 'cause if I just run
#zpool import
I can see the pool, and there are no errors reported.

This absolutely sucks an elephant's *something*! Ugh, I thought I had taken every precaution - building a storage system with redundancy up the wazoo:
Boots from a USB key (and I have a spare lying around just in case)
Mounts / from a CF drive (have a spare for that as well)
ZFS then mounts /home /tmp /usr /var from the pool
The pool is built with two parity disks and also one hot spare, to be certain I could rely on it.
Damn you, Murphy's Law!

Stored on this system is about 2TB of personal data that is extremely dear to me, especially all the images of my daughter. I need help.

/Sebulon
 
Not sure if it's related, but I get the same trap 12 panic randomly on a 32-bit 8.1 VirtualBox guest during boot, just as it's about to mount the (UFS) root file system...

James
 
Stardate 2010.317

Right now, I do wish I could have been on board the USS Make Sh*t Up, bounced a graviton particle beam off the main deflector dish and saved my data from the nastiest ZFS black hole there is. I mean, what good is a completely redundant disk management system when there is a real chance that the pool itself gets corrupted?

Pool layout:
Code:
zpool
  raidz
    ad5   PATA 500GB
    ad6   PATA 500GB
    ad7   PATA 500GB
  raidz
    ad8   SATA 1TB
    ad10  SATA 1TB
    ad12  SATA 1TB
  spare
    ad14  SATA 1TB
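
For reference (roughly from memory, so take the exact syntax with a grain of salt), a pool laid out like that would have been created with something like:
Code:
# zpool create zpool raidz ad5 ad6 ad7 raidz ad8 ad10 ad12 spare ad14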

The thing is that FreeBSD thinks my pool is a-okay, no errors at all, on any system I've tried. But as soon as you try to import the pool, it panics. OpenSolaris, however, thinks that three out of seven disks are faulted and couldn't see the spare either, so the entire pool is considered to be shit.

I have tried the following:

From my machine, a Pentium 4:
1. Booting and trying to import the pool from my self-compiled 8.1-STABLE i386 system; crap
2. Booting into a fresh 8.1-RELEASE i386 system and importing; crap
3. I then found another processor, a Pentium 4 HT, newer model, and repeated steps 1 and 2; crap x2
4. Booting OpenSolaris 2009.06 i386 didn't work at all because too many disks were believed to be faulted, so again; crap

Then,
5. I mounted all of the drives into a completely different system, a 64-bit Core2 Duo, took an empty disk, installed a FreeBSD 8.1-RELEASE amd64 system and tried to import; the crapness continues.

6. Booting OpenSolaris with my drives in the 64-bit system was... wait for it... also crap, because my Promise ATA 133 controller card wasn't recognized, and that mobo, of course, is only equipped with SATA.

The only thing I have left is to find an ATA controller card that will get recognized by OpenSolaris, but even then, my chances are slim to none.
Please, if anyone has a suggestion, let me know. I feel that I have done almost all I can to troubleshoot this.
I will update this once I have found a different controller to try with.
 
I bet you now wish you had backups, right?
Anyway, here is a question: where do you keep your swap partition? Is it also in the ZFS pool?
If so, try to put a swap partition on a disk somewhere else, and see if that helps.
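
For example, a swap entry in /etc/fstab pointing at a slice on a separate disk would look something like this (the device name here is just a placeholder for whatever spare disk you use):
Code:
# Device      Mountpoint  FStype  Options  Dump  Pass#
/dev/ad4s1b   none        swap    sw       0     0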
 
I've never tried it (<-- disclaimer) but since the pool was created on 8.0 (zpool v13?) and you're now on 8.1 (zpool v14?), have you tried upgrading the pool (zpool upgrade)?
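
Something along these lines, I would guess (again, untested by me) - the first command just lists the pool versions your running system supports, the second would do the actual upgrade:
Code:
# zpool upgrade -v
# zpool upgrade zpool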
 
tingo:
Yeah, I feel so stupid right now, it's crazy. In hindsight, it's easy to see, but I guess I just got overconfident. Apparently, you can never be paranoid enough=S Regardless of how this goes, I'll sure as hell keep separate backups from now on=)
The CF disk has two partitions, one for swap and one for / (root).

Jamz:
I'm pretty sure you have to have the pool imported before you can upgrade, but it's a good idea, I'll try that tomorrow.

A contact at Sun (or, more accurately, Oracle nowadays) also said they have built in some cool features for emergency pool mending/rescuing in the latest released build of OpenSolaris, which I'll try as well. But it doesn't do me much good if OpenSolaris still thinks the entire pool is faulted.

Fingers crossed!

/Sebulon
 
Update:

I can now at least confirm my suspicion: upgrading the ZFS version of an exported pool does not work. You have to successfully import the pool before you can upgrade, and that is exactly where everything goes wrong for me.

I have tried different versions of FreeBSD and OpenSolaris, different processors, different hardware and architectures altogether - nothing changes the fact that it always panics when trying to import my pool, no matter what options I use for the command.

Trying to use zdb does not work on FreeBSD; it feels as if something is missing for it to work correctly, which was to be expected anyway. It works better on OpenSolaris, though, but trying to do anything proactive runs out of memory after a while. My rig has 3GB RAM, where 4GB is the max, so I doubt that one extra gig of RAM would make that big a difference. What I wanted to do was take a bigger USB key or something and add it as an L2ARC device, but again, you have to have the pool imported first. Guh...
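
For reference, adding a cache (L2ARC) device would just have been a one-liner - da0 here is only a placeholder for whatever the USB key would show up as - but as said, the pool has to be imported first:
Code:
# zpool add zpool cache da0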

The tricky part is that when I installed the drives into the other system, that mobo didn't have ordinary ATA, only SATA, and I tried two different ATA controller cards but none of them were recognized by OpenSolaris. My last possible chance is to find a 64-bit system with lots of RAM (8GB or more) and an ATA controller card that is supported; then perhaps zdb would be able to do the work necessary to clean the pool and make it importable afterwards. But it sure doesn't look good for those 3TB of data right now, I'll tell you that.

Let this be a reminder for everyone out there that you can NEVER have enough backups, EVER. Ever, ever, ever!

The next system I build is going to have two pools: one big primary pool with high redundancy, and one smaller pool with lower or no redundancy, just for total disaster recovery. Or perhaps I'll just buy an external hard drive, connect it to the NAS, and rsync now and then - at least it's better than nothing.
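
Even something as simple as this, run from cron against the external drive (paths are only examples), would be a world better than what I had:
Code:
# rsync -a --delete /home/ /mnt/backup/home/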

/Sebulon
 
Big update:

I have access to my pool again, yeay!!!

First of all, credits are in order. Thanks to a tip from Oracle engineering, I downloaded the Solaris 11 Express live CD (the newest OpenSolaris - without the "Open" part), booted it and tried to import the pool. Succeeeeeeeeeess! Sweet, sweet success!

I have also bought an external 2TB USB drive that I have started to copy everything onto, and it was dirt cheap, around $130 or 900 SEK. Almost unbelievable to get that much storage for such a low price.

It turned out I had lots of other issues. The mobo was going dead, cables were less than 100%, the PSU was failing, two ATA133 controllers were giving read errors, and the hot-spare disk had a loose power cable. Heck, basically the only good thing was the processor=)
Therefore, it's very difficult to say exactly what started all of this, but my best guess is that the hot-spare disk was going on and off because of the loose power cable. That did not sit well with my pool as a whole, and caused a panic every time it tried to scrub areas on the spare disk, which it couldn't, and got very upset.

Anyway, for all of those who experience scrub-related errors of this kind, try starting up Solaris 11 Express, import the pool and let it scrub itself clean from there.
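
Roughly, that boils down to something like this from the live environment (exact options may of course differ on your system):
Code:
# zpool import -f zpool
# zpool scrub zpool
# zpool status -v zpool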

I'll update once it's all done.

Last of all, I would like to say thank you again to Oracle engineering for their help; I was about to send 3TB of personal data down the drain.

/Sebulon
 
FWIW, I run my fileserver with two pools: one for the system, and one for the data. (The system pool is a zfs mirror, and the data pool is raidz1).
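
For the curious, setting up pools with that layout is basically just the following (leaving out the boot/loader parts; pool and disk names here are only examples):
Code:
# zpool create syspool mirror ad4 ad6
# zpool create datapool raidz1 ad8 ad10 ad12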
 
As promised, I am now putting in this last entry in this thread, to update on what has happened since then. My guess is that if anyone else experiences issues like this in the future, this thread might be on their reading list.

I have now started using two ZFS pools in the NAS: one with parity, where disks can break during operation without a problem, and a secondary pool without any parity, just for complete disaster recovery.

I have also tinkered together a script that automatically replicates the data from the primary pool to the secondary. It is inserted as a cron job in /etc/crontab that fires at 04:00 every night:

Code:
0	4	*	*	*	root	/usr/local/bin/replicate run

# /usr/local/bin/replicate

Code:
#!/bin/sh
#
#
# Initial and continuous ZFS pool replication
#
#

case "$1" in

init)
#
# Makes initial replication
#

/bin/echo `date` Beginning initial replication sequence >> /var/log/replicate.log 2>&1;

# Take initial snapshots:
/sbin/zfs snapshot -r pool1/root@replicate.old >> /var/log/replicate.log 2>&1;
/bin/echo `date` Initial snapshots created >> /var/log/replicate.log 2>&1;

# Replicate data:
/sbin/zfs send -R pool1/root@replicate.old | /sbin/zfs recv -F -d pool2 >> /var/log/replicate.log 2>&1;
/bin/echo `date` Data replicated >> /var/log/replicate.log 2>&1;

/bin/echo `date` Initial replication sequence finished >> /var/log/replicate.log 2>&1

;;

run)
#
# Makes incremental replication
#

/bin/echo `date` Beginning incremental replication sequence >> /var/log/replicate.log 2>&1;

# Take new source snapshots:
/sbin/zfs snapshot -r pool1/root@replicate.new >> /var/log/replicate.log 2>&1;
/bin/echo `date` New snapshots created >> /var/log/replicate.log 2>&1;

# Replicate data:
/sbin/zfs send -R -i pool1/root@replicate.old pool1/root@replicate.new | /sbin/zfs recv -F -d pool2 >> /var/log/replicate.log 2>&1;
/bin/echo `date` Data replicated >> /var/log/replicate.log 2>&1;

# Destroy target .old snapshots:
/sbin/zfs destroy -r pool2/root@replicate.old >> /var/log/replicate.log 2>&1;
/bin/echo `date` Target .old snapshots destroyed >> /var/log/replicate.log 2>&1;

# Rename target .new snapshots .old:
/sbin/zfs rename -r pool2/root@replicate.new pool2/root@replicate.old >> /var/log/replicate.log 2>&1;
/bin/echo `date` Target .new snapshots renamed .old >> /var/log/replicate.log 2>&1;

# Destroy source .old snapshots:
/sbin/zfs destroy -r pool1/root@replicate.old >> /var/log/replicate.log 2>&1;
/bin/echo `date` Source .old snapshots destroyed >> /var/log/replicate.log 2>&1;

# Rename source .new snapshots .old:
/sbin/zfs rename -r pool1/root@replicate.new pool1/root@replicate.old >> /var/log/replicate.log 2>&1;
/bin/echo `date` Source .new snapshots renamed .old >> /var/log/replicate.log 2>&1;

/bin/echo `date` Incremental replication sequence finished >> /var/log/replicate.log 2>&1

;;

clean)
#
# Start over again
#

/bin/echo `date` Beginning cleaning process >> /var/log/replicate.log 2>&1;
/sbin/zfs destroy -r pool1/root@replicate.old >> /var/log/replicate.log 2>&1;
/sbin/zfs destroy -r pool2/root >> /var/log/replicate.log 2>&1;
/bin/echo `date` Cleanup complete >> /var/log/replicate.log 2>&1

;;

esac

It's not smart or anything, but it gets the job done. Simple and effective. If anyone has any pointers, opinions, whatever, you are welcome to share.
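
To get it going: run it once by hand with init (that does the full send of pool1/root over to pool2), after that the cron job calls run for the nightly incrementals, and clean wipes the snapshots and the copy on pool2 if you ever want to start over:
Code:
# /usr/local/bin/replicate init
# /usr/local/bin/replicate run
# /usr/local/bin/replicate clean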

If something like this ever happens again, I can just switch over to using the secondary pool, trash the primary and be back to normal in a day or two, instead of spending weeks trying to manually back up and restore. Viva la send/recv!

/Sebulon
 
What are your reasons for having the second pool without parity? Also, what do you mean by without parity? Without redundancy, or without checksums?
 
Hi @danbi,

What I meant was without redundancy. As experience has taught me, it matters squat how much redundancy you have (mirror, RAID-Z, Z2, Z3) if the pool itself becomes corrupt. Redundancy protects against hard drive failure, not pool failure.

Why without? Because it's only there as a secondary copy for complete disaster recovery, and having redundancy means more drives taking up space. I mean, the chances of two pools getting corrupted at the same time are astronomically small. I'll take my chances :)

/Sebulon
 
Sorry for the necropost here, but @Sebulon's issue describes what has just happened to me. It seems that my pool has become corrupt in the past few days due to some bad RAM (I'm an idiot, this is my first ZFS setup, and I didn't think to put ECC RAM in the machine... never again).

However, I'm currently downloading OpenIndiana, in the hope that whatever "magic" happened in Sebulon's experience with Solaris 11 can be replicated there. If not, I'll be using Solaris 11.

@Sebulon: did you have to do anything specific? Or did a simple zpool import of the pool get you up and running?
 
OpenIndiana didn't even show anything when I typed zpool import. I'm downloading Solaris 11 to try after work. Now that I'm thinking about this, it's not quite the same issue as @Sebulon's. I had bad memory in the machine, unknowingly, and I'm sure that caused some major corruption. It may not be very widespread, though, and I should think that if I can at least import the pool, some data might be salvageable.
 