ZFS Page fault when opening ZFS pool

Hey all,

I upgraded my server to 10.1-RELEASE-p26 a couple of days ago with no problems whatsoever until today, when a drive failed in my ZFS pool. Doing what I usually do, I swapped in a replacement disk and issued the command to replace the drive in the pool.
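(For reference, the replacement was the standard procedure; the pool and device names below are placeholders, not the exact ones on my system:)
Code:
# take the failed disk out of service, swap the hardware, then rebuild onto the new disk
zpool offline tank da3
zpool replace tank da3
zpool status tank   # watch the resilver progress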

The replacement happened fine, and the pool started resilvering. About 5 minutes in I got a page fault, and the server rebooted. Now every time the ZFS pool is imported at boot I get another page fault, so the server is essentially stuck in a reboot loop. Even pulling out the new disk or putting the old one back in doesn't help.

As I have replaced disks before on this machine with earlier versions of FreeBSD, and nothing else has changed on the system except the upgrade and new disk, I am at a loss as to what is going on.

Has anyone had a similar experience with RELEASE-p26 and ZFS? Anyone got any ideas for what I can do? (Short of blowing away the OS and reinstalling).

Thanks


EDIT:

Tried booting with the original FreeBSD live CD, but unfortunately it cannot import the pool, saying that it doesn't support some of the pool's features.
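If anyone is curious which feature flags are in play, you can list the ones a given release's ZFS supports with zpool upgrade -v, and a bare zpool import will show the pool along with any unsupported-feature warnings; presumably the old live CD predates a feature this pool has enabled. Something like:
Code:
zpool upgrade -v   # feature flags this ZFS version knows about
zpool import       # list importable pools and any feature warnings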

EDIT2:

Tried booting with the latest 10.2 Live USB stick, but that just page faults as well.
 
Tried replacing the hardware now, and I'm still getting a page fault. I then set zfs_enable to off in rc.conf, enabled core dumps, and the machine booted fine. Loading the ZFS module and running common commands such as zpool status all works fine (though the pool does not show up, since it isn't imported). As soon as I try a zpool import, it page faults. At least now I have some output:
Code:
Sat Mar 19 17:15:12 GMT 2016

FreeBSD Mnemosyne 10.1-RELEASE-p26 FreeBSD 10.1-RELEASE-p26 #0: Wed Jan 13 20:59:29 UTC 2016  root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

panic: page fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
fault virtual address  = 0x10
fault code  = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff809dc9db
stack pointer  = 0x28:0xfffffe0458553470
frame pointer  = 0x28:0xfffffe0458553480
code segment  = base 0x0, limit 0xfffff, type 0x1b
  = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags  = interrupt enabled, resume, IOPL = 0
current process  = 0 (zio_read_intr_1)
trap number  = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff80962e70 at kdb_backtrace+0x60
#1 0xffffffff80927f95 at panic+0x155
#2 0xffffffff80d2574f at trap_fatal+0x38f
#3 0xffffffff80d25a68 at trap_pfault+0x308
#4 0xffffffff80d250ca at trap+0x47a
#5 0xffffffff80d0afa2 at calltrap+0x8
#6 0xffffffff81c1646f at i_get_value_size+0xcf
#7 0xffffffff81c12b50 at nvlist_add_common+0x60
#8 0xffffffff81c2014f at i_fm_payload_set+0x39f
#9 0xffffffff81c205ab at fm_payload_set+0x6b
#10 0xffffffff81c9ba5a at annotate_ecksum+0xba
#11 0xffffffff81c9b926 at zfs_ereport_finish_checksum+0x26
#12 0xffffffff81cab469 at zio_done+0x729
#13 0xffffffff81ca7492 at zio_execute+0x162
#14 0xffffffff81cab5ef at zio_done+0x8af
#15 0xffffffff81ca7492 at zio_execute+0x162
#16 0xffffffff81cab5ef at zio_done+0x8af
#17 0xffffffff81ca7492 at zio_execute+0x162
Uptime: 1m55s
Dumping 748 out of 16085 MB:..3%..11%..22%..33%..41%..52%..62%..71%..82%..92%
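
For completeness, the boot-time changes and the sequence that triggers the panic are roughly the following (the pool name is a placeholder; exact values from memory):
Code:
# /etc/rc.conf - stop ZFS auto-importing at boot, and save kernel crash dumps
zfs_enable="NO"
dumpdev="AUTO"

# after booting cleanly:
kldload zfs
zpool status        # works, but the pool isn't listed (not imported)
zpool import tank   # this is the command that panics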

Anyone got any idea as to what the problem is?
 
Hard to tell; you haven't given all the information needed to help you.
For example:
How much memory is in your machine?
How large is the ZFS pool / pools?
Are you using any memory-hungry ZFS features (dedup, ...)?
and so on.
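The output of a few standard commands would cover most of that, e.g.:
Code:
sysctl hw.physmem hw.ncpu            # installed RAM and CPU count
zpool list                           # pool sizes
zpool status                         # vdev layout and errors
zfs get -r compression,dedup tank    # per-dataset features (use your pool's name)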
 
Sorry, I wasn't sure what information needed to be given. Plus the system worked fine for a year or more before the upgrade, so it isn't as though I have added any new hardware that might have affected the pool.

16 GB of RAM in the machine.
8-drive pool, 2 x raidz1 vdevs, 1 TB SATA disks.
2 x SSDs for L2ARC and ZIL.
Using only lz4 compression, no dedup. No other features in use; everything set to defaults.
Machine has 6 cores.
 
Well, I was able to import the zpool in the end, using "zpool import -F". Unfortunately another drive failed during the resilver, and I lost the pool. So I have now wiped it and restored what I could from backup.
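For anyone who ends up here with the same panic: -F asks ZFS to rewind to an earlier transaction group, so anything written in the discarded transactions is lost; treat it as a last resort. Roughly what I ran (pool name is a placeholder):
Code:
zpool import -F tank
zpool status -v tank
In hindsight, trying a read-only import first (zpool import -o readonly=on tank) to check whether the data was reachable would have been sensible.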

I still have no idea what caused this to occur. I suspect some sort of corruption in the pool itself, because after discarding the last few transactions the system stopped page faulting on the pool, and I could import, mount and generally use the array.
 
Every time I read a forum post where someone has had an issue with ZFS (can't import the pool, pool corruption etc.), I tremble just a little bit. For such a highly resilient file system, there sure are a lot of cases where POOF, you can just lose everything. While I agree that the features and ease of management that come with using ZFS are very attractive, is it worth these potential headaches that I seem to see on at least a weekly basis here?

My apologies for changing the subject somewhat with this post. I do hope that you are able to recover, Unixnut, as expediently as possible.
 
Well, all I can say is that since this happened, my (then new) ZFS array has been rock solid, and I have not lost a single byte of information. That is 5 years of reliability.

ZFS is very resilient, but like every system it can have failures. A resilient file system is not a replacement for backups, and never will be. Hence I do a backup to an external hard drive. I only do it every 6 months, though, because it takes a few weeks to perform said backup.
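For anyone wanting to do similar, one straightforward approach is a recursive snapshot sent to a pool on the external drive; the pool and snapshot names here are just placeholders:
Code:
zfs snapshot -r tank@backup-2021-06
zpool import backuppool                                   # pool on the external disk
zfs send -R tank@backup-2021-06 | zfs receive -Fu backuppool/tank
zpool export backuppool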

I have lost more data due to bit rot (even on HW raid arrays) than I have using ZFS, and I have no qualms using it as my "go to" filesystem for serious data storage.
 