Panic on pool import

I have a pool containing a filesystem that appears to have become corrupt and now causes the kernel to panic on import.

The host system is AMD64/9.1-RELEASE (same issue with the 9.1 RCs).

The initial problem was a filesystem that became corrupt during an interrupted rsync. The problem filesystem was renamed and restored from backup. Accessing the corrupt entry in the renamed filesystem causes a kernel panic. I no longer have a backtrace of this panic.

Upon attempting to destroy the problem filesystem (and descendant snapshots), the kernel panicked, and the pool can no longer be imported without causing a further panic.

Backtrace (screen cap):
kwDMZ.gif


The pool contains 28 disks attached via two Fibre Channel shelves and a QLE2462, divided into 4 RAIDZ2 vdevs, with both log and cache devices.

I attempted to import the pool via a Solaris 11 live CD, which failed with an error indicating the pool is corrupt (I assume diverged metadata between FreeBSD and Solaris?).

The root pool is also ZFS. No dump device is presently configured.

The kicker is that the backups and live data are both on this one pool, so I need to recover it.

Any suggestions on how to proceed are greatly appreciated.
 
Update:

Attempted to import the pool with 10-CURRENT, with the same result as 9.1.

Also attempted to import the pool with OI 151a5 and, much to my surprise, got (essentially) the same result: a kernel panic at a very similar stack frame (function bp_get_dsize).
 
@grant

Try at your own risk; you may lose data.

Here are some things you can try that I got from a developer.
  • Import the pool in verbatim mode; even if there are faulted vdevs, this option treats the pool configuration as complete:
    # zpool import -V pool
  • Import a non-importable pool using an extreme rewind:
    # zpool import -F -X pool
  • Import the pool in read-only mode:
    # zpool import -o readonly=on pool
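
A minimal sketch of one possible order for the attempts above, least invasive first: read-only, then verbatim, then extreme rewind, since a rewind can discard recent transactions. The function name import_commands is hypothetical, and the script only prints the commands for review; it does not run them.

```sh
#!/bin/sh
# Print the three import attempts from the list above, least invasive first.
# Nothing is executed here; the commands are only echoed so they can be
# reviewed and run by hand, one at a time.
import_commands() {
    pool="$1"   # pool name; "pool" is the placeholder used in this thread

    # 1. Read-only: no transaction processing, intent log is not replayed.
    echo "zpool import -o readonly=on $pool"
    # 2. Verbatim: accept the pool configuration as-is.
    echo "zpool import -V $pool"
    # 3. Extreme rewind: last resort, may discard recent writes.
    echo "zpool import -F -X $pool"
}

import_commands pool
```

Stopping at the first attempt that succeeds avoids the rewind unless nothing else works.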

/Sebulon
 
@Sebulon, thank you for the suggestions. Unfortunately, the result is the same.

The sequence of events that led to the current situation appears to be:
* Power was interrupted during heavy I/O,
* A dataset became corrupt,
* When I attempted to remove the corrupt dataset, the kernel panicked,
* Subsequent attempts to import the pool result in the above panic when ZFS attempts to replay the intent log.

Is there some means via zdb, or other, to destroy the intent log, or preferably destroy the corrupt dataset? Most operations I try on the corrupt dataset result in zdb exiting with 'error 16', or it asserts...

Is there a way to dump the valid datasets without importing the pool?
 
Is there a way to dump the valid datasets without importing the pool?
No way that I know of and not logical by design.

# zpool import -o readonly=on pool
Should have worked, because with read-only import:
Pool transaction processing is disabled. This also means that any pending synchronous writes in the intent log are not played until the pool is imported read-write.
Why did that not work? What was the output? You can read more details here:
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbbwl.html
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbchy.html#gazug
 
Beeblebrox said:
No way that I know of and not logical by design.

# zpool import -o readonly=on pool
Should have worked, because with read-only import:

Why did that not work? What was the output? You can read more details here:
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbbwl.html
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbchy.html#gazug

The result of attempting to import the pool read-only is as per the screen capture in the first post.

Reading through the call tree, it appears that it is attempting to finish an interrupted snapshot destroy, but is encountering corrupt metadata, which leads to an assertion being triggered... I could try fudging the values via the kernel debugger, but I have no means of determining sane values.

I've been attempting to locate critical files via zdb so I can attempt to recover them via raw block dump, but unfortunately zdb aborts before the required files are located (the specific assertion detail was lost in the frame buffer; I'll trigger it again and post later).
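
For reference, a rough sketch of the kind of zdb invocations involved in that raw-block approach, assuming the pool stays exported (hence -e). The function name, dataset name, object number and block address are placeholders, not values from this pool. The script only prints the commands, and zdb may of course still abort on the damaged metadata.

```sh
#!/bin/sh
# Print zdb commands for locating an object and reading one of its blocks
# on a pool that is not imported. Nothing is executed here.
zdb_recovery_commands() {
    pool="$1"      # pool name
    dataset="$2"   # dataset within the pool (placeholder)
    object="$3"    # object number of the file (placeholder)
    block="$4"     # block address as vdev:offset:size (placeholder)

    # List the datasets in the exported pool.
    echo "zdb -e -d $pool"
    # Dump one object in detail; repeating -d raises verbosity.
    echo "zdb -e -ddddd $pool/$dataset $object"
    # Read a single block by its vdev:offset:size address.
    echo "zdb -e -R $pool $block"
}

zdb_recovery_commands pool data 1234 0:400000:20000
```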
 
Way above my paygrade but if you can trace the call to being related to a destroy, could you rebuild the kernel with the destroy bypassed (just putting a return at the top of dsl_dataset_destroy or dsl_destroy_inconsistent by the look of it)? No idea if it'll work and you'd still want to import read-only but it appears trying to get the data is paramount.
 
usdmatt said:
Way above my paygrade but if you can trace the call to being related to a destroy, could you rebuild the kernel with the destroy bypassed (just putting a return at the top of dsl_dataset_destroy or dsl_destroy_inconsistent by the look of it)? No idea if it'll work and you'd still want to import read-only but it appears trying to get the data is paramount.

Thanks for the suggestion. I made dsl_dataset_destroy() return immediately and the pool imported read-only, minus the one damaged dataset. This is much better than no data at all!
 