8.2p6 updated to 9.0, zpool core dump

ZFSZealot · Jan 21, 2012

The system was installed from amd64 8.0-CURRENT before 8.0-RELEASE and has been upgraded numerous times via the buildkernel/buildworld method in /usr/src/Makefile. All was fine up to and including the 8.2p6 build I did a couple of days ago to prepare for the 9.0 upgrade. After the attempted upgrade to 9.0-RELEASE, zpool aborts with a core dump when I ask for a status under the new kernel.

Core2Quad Q9300 and 8GB RAM on a SuperMicro X7SBE, with one 80GB SATA boot disk and one 16GB SSD attached to the motherboard SATA controllers. Eight 1TB SATA disks attached to an AOC-SAT2-MV8 controller (Marvell 88SX6081) make up the data0 pool. Three more 34GB SCSI disks attached to an Adaptec 29160 make up the scsi15k pool, with the 16GB SSD serving as L2ARC for the scsi15k pool.

Root is UFS on the 80GB SATA disk; I moved all of the other base filesystems (/var, /usr) over to the data0 pool.

The only kernel config change for both 8.2 and 9.0 was to comment out hptrr. I am not sure whether this is still necessary in 9.0 with mvs, but in 8.x hptrr would pick up the Marvell controller instead of atapci.

The source was updated with cvsup, and then I ran make clean; make kernel-toolchain; make buildworld; make -DALWAYS_CHECK_MAKE buildkernel KERNCONF=CADENCE; make -DALWAYS_CHECK_MAKE installkernel KERNCONF=CADENCE. All steps appeared successful after I commented out everything in /etc/make.conf except PERL_VERSION. Then I rebooted to single user mode to continue with mergemaster and installworld, and got this from zpool status:

Code:

  pool: data0
 state: ONLINE
 scrub: ...message I didn't capture about "scrub stopped" with some absurdly large number...
config:

        NAME        STATE     READ WRITE CKSUM
        data0       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad18    ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0

errors: No known data errors

  pool: scsi15k
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        scsi15k     ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
          da2       ONLINE       0     0     0
        cache
Assertion failed: (nvlist_lookup_uint64_array(nv, ZPOOL_CONFIG_STATS, (uint64_t**)&vs, &c) == 0), file
 /usr/src/cddl/sbin/zpool/../../../cddl/contrib/opensolaris/cmd/zpool/zpool_main.c, line 1045.
pid 28 (zpool), uid 0: exited on signal 6 (core dumped)
Abort trap (core dumped)

Moving kernel.old back into place on the UFS "/" filesystem and rebooting got me back in business, but obviously there's a problem.
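For reference, the build and install order described at the top of this post can be sketched as a small script. This is only an illustration, not the exact session: the `step` helper, the `DRY_RUN` switch and the `SRCDIR` variable are assumptions added here, and by default the script only prints each command. Note that with this order the new kernel runs against the old userland tools (zpool included) until installworld has completed.

```shell
#!/bin/sh
# Sketch of the build order from the post. Prints each step by default;
# set DRY_RUN=0 to actually run the commands.
SRCDIR="${SRCDIR:-/usr/src}"        # source tree location
KERNCONF="${KERNCONF:-CADENCE}"     # custom kernel config named in the post
DRY_RUN="${DRY_RUN:-1}"             # hypothetical switch, not a make feature

# Print one make invocation, and run it unless in dry-run mode.
step() {
    echo "+ (cd $SRCDIR && make $*)"
    if [ "$DRY_RUN" != "1" ]; then
        (cd "$SRCDIR" && make "$@")
    fi
}

# Everything up to the reboot into single user mode.
build_and_install_kernel() {
    step clean &&
    step kernel-toolchain &&
    step buildworld &&
    step -DALWAYS_CHECK_MAKE buildkernel "KERNCONF=$KERNCONF" &&
    step -DALWAYS_CHECK_MAKE installkernel "KERNCONF=$KERNCONF"
}
```

Calling `build_and_install_kernel` with the defaults just lists the five make invocations in order.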

Here's the normal output of zpool status in 8.2p6, same before and after the attempt to use the 9.0 kernel:

Code:

  pool: data0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data0       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad18    ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0

errors: No known data errors

  pool: scsi15k
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        scsi15k     ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
          da2       ONLINE       0     0     0
        cache
          ada1      ONLINE       0     0     0

errors: No known data errors

I think this may be related to the SSD I am using as L2ARC for my small, fast SCSI pool. The output stops right after printing "cache", although I am not sure whether it was buffered.
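If the output is captured to a file, a small filter can point at where it stopped. The sketch below is an assumption-laden helper (the function name `cache_without_devices` is made up here): it reads zpool status text on stdin and prints each pool whose "cache" header is not followed by any device line. Output to a terminal is normally line-buffered, so a complete "cache" line on the console suggests the abort came after that line was written.

```shell
#!/bin/sh
# Read `zpool status` text on stdin and print the name of every pool whose
# "cache" header is not followed by at least one device line.
cache_without_devices() {
    awk '
        # Print the current pool if its cache section had no devices.
        function report() { if (in_cache && ndev == 0) print pool }

        $1 == "pool:"              { report(); pool = $2; in_cache = 0; ndev = 0; next }
        $1 == "cache" && NF == 1   { in_cache = 1; ndev = 0; next }
        # A device line looks like: NAME STATE READ WRITE CKSUM
        in_cache && NF == 5 && $2 ~ /^[A-Z]+$/ { ndev++; next }
        in_cache && $1 == "errors:" { report(); in_cache = 0 }
        END                        { report() }
    '
}
```

For example, `cache_without_devices < status.txt`, where status.txt is a hypothetical file holding the captured output.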

Things I have not tried:

1. Export the pools, boot from a 9.0 DVD and see if I can import and do a zpool status (rule out broken kernel build)
2. Drop the L2ARC ada1 device and try the 9.0 kernel again. (Rule out SSD)

Again, system has been rock solid for a couple of years now, never with errors, including on the L2ARC SSD.

My confidence in the v28 code is a little shaken now, though, and I don't want to break my pools. Is either of these a "safe" action to try?
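The two untried checks listed above could look roughly like the following. This is a sketch, not a tested procedure: the `run` wrapper and `DRY_RUN` switch are assumptions added so the script only prints commands by default, and `/mnt` as an alternate root is likewise an assumption (data0 holds /usr and /var, so importing it from install media without an altroot would mount over the live system). Pool and device names come from the zpool status output earlier in the post. No `zpool upgrade` is issued, so the pools would stay at their current version.

```shell
#!/bin/sh
# Sketch only: prints the commands for the two untried checks instead of
# running them. Set DRY_RUN=0 to execute for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Check 1a: on the 8.2 system, export both pools before booting other media.
export_pools() {
    run zpool export scsi15k &&
    run zpool export data0
}

# Check 1b: after booting the 9.0 media, import under an alternate root
# (/mnt is an assumed mount point) and ask for the status.
import_and_check() {
    run zpool import -R /mnt data0 &&
    run zpool import -R /mnt scsi15k &&
    run zpool status
}

# Check 2: drop the L2ARC device, and add it back afterwards.
drop_cache()  { run zpool remove scsi15k ada1; }
readd_cache() { run zpool add scsi15k cache ada1; }
```

Running `export_pools` or `drop_cache` with the defaults shows the exact commands before anything is committed to.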

I have posted the core dump and photo of the console when this happened here:

http://www.dognet.org/zpool.core.bz2
http://www.dognet.org/zpool.core.photo.jpg

pruik · Feb 10, 2012

Not sure if it will help your debugging process, but I've just encountered the same (sort of) error on an 8.2p6 system. The original install was done with MFSBSD 8.2-release in order to achieve root on ZFS. Yesterday I updated to 8.2p6 and zpool status gives:

Code:

Assertion failed: (nvlist_lookup_uint64_array(nvroot, ZPOOL_CONFIG_STATS, (uint6
Abort (core dumped)

The pool contains two disks in a mirror and an SSD for L2ARC. I will do some testing this weekend, including removing the L2ARC.

ZFSZealot · Feb 14, 2012

Unfortunately I have not had a chance to work with the system again because it's in production; it is still running 8.2p6.

I would try removing the L2ARC device myself to see if it makes a difference, if that's of any interest to the community, but I can't guarantee when I can take the system out of production long enough to try it.

Otherwise I think I'm going to go with building a fresh 9.0 system and migrating the pool over to that.

pruik · Feb 15, 2012

I've got another server running 8.2p6 with the exact same problem. This server does not use L2ARC, so taking your server out of production may not be worth your while.

ZFSZealot · Feb 17, 2012

@pruik, that information is extremely valuable to me. Thank you so much!

I do need to try installing a new boot disk with a fresh copy of 9.0 from DVD. If that doesn't work I will try disconnecting disks from one pool or the other and see if I can isolate it to one pool.

Another detail, though I'm not sure it matters: these were originally v14 pools because I built this from 8.0-STABLE before the 8.0 release. Obviously I did the zpool upgrade to v15 quite some time ago.

ZFSZealot · Feb 17, 2012

Correction: The data0 pool was v14 upgraded to v15. The scsi15k pool was created much more recently, after 8.2-RELEASE and was always v15.

ZFSZealot · Feb 21, 2012

So today I installed 9.0-RELEASE on a separate boot disk in a similar spare server, then briefly booted the production server (the one with 8.2-RELEASE-p6 and the pools) from that disk. I successfully imported the scsi15k pool with its L2ARC device, but could not import the data0 pool because it is on my Marvell 88SX6081 controller and the hptrr driver picked it up, just like in prior releases. With the newly built boot disk back in the spare server, I compiled and installed the custom kernel without hptrr using the normal procedure in the handbook (buildworld, buildkernel, installkernel, reboot, installworld, etc.) and moved the disk to the production server to try again. Both pools imported perfectly.

I have not tried updating the source with cvsup yet; this is all based on the /usr/src that comes with the 9.0-RELEASE DVD.

I guess this is solved, although I am not sure what went wrong with my build on the production server. I will continue to build and configure the necessary services and packages using the newly built system disk in the spare server, and swap in the system disk when that is done.

Hopefully this helps someone.
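Since the hptrr driver grabbing the Marvell controller was the obstacle in the first import attempt, a quick check of the kernel config before building may save a reboot. The helper below is a hypothetical sketch (the name `hptrr_enabled` is invented here): it returns success if the given config file still has an uncommented `device hptrr` line. It only inspects that one file, so a config that pulls in GENERIC via an include would need GENERIC checked separately.

```shell
#!/bin/sh
# Return 0 if the kernel config file in $1 contains an active
# (uncommented) "device hptrr" line, non-zero otherwise.
hptrr_enabled() {
    grep -Eq '^[[:space:]]*device[[:space:]]+hptrr([[:space:]]|#|$)' "$1"
}

# Print a short verdict for a config file.
check_kernconf() {
    if hptrr_enabled "$1"; then
        echo "$1: hptrr is still enabled"
    else
        echo "$1: hptrr is not enabled"
    fi
}
```

For example, `check_kernconf /usr/src/sys/amd64/conf/CADENCE`, assuming the custom config lives in the usual location for amd64.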

pruik · Feb 27, 2012

I did a freebsd-update rollback yesterday from 8.2p6 back to 8.2p2 and zpool status is working correctly again. Both servers were installed using the 8.2 MFSBSD image in order to get an easy full-ZFS system. I don't want to invest more time than I already have in this confusing zpool status behaviour, because in a couple of months both servers will be upgraded to 9.0 anyway.

Looking at /usr/src/UPDATING, I don't understand what could have broken zpool status:

Code:

20120104:       p6      FreeBSD-EN-12:01.freebsd-update
        Extend the character set accepted by freebsd-update(8) in file
        names in order to allow upgrades to FreeBSD 9.0-RELEASE.

20111223:       p5      FreeBSD-SA-11:06.bind, FreeBSD-SA-11:07.chroot
                        FreeBSD-SA-11:08.telnetd, FreeBSD-SA-11:09.pam_ssh
                        FreeBSD-SA-11:10.pam
        Fix a problem whereby a corrupt DNS record can cause named to crash.
        [11]

        Add an API for alerting internal libc routines to the presence of
        "unsafe" paths post-chroot, and use it in ftpd. [11]

        Fix a buffer overflow in telnetd. [11]

        Make pam_ssh ignore unpassphrased keys unless the "nullok" option is
        specified. [11]

        Add sanity checking of service names in pam_start. [11]

20111004:       p4      FreeBSD-SA-11:05.unix (revised)
        Fix a bug in UNIX socket handling in the linux emulator which was
        exposed by the security fix in FreeBSD-SA-11:05.unix.

20110928:       p3      FreeBSD-SA-11:04.compress, FreeBSD-SA-11:05.unix
        Fix handling of corrupt compress(1)ed data. [11]

        Add missing length checks on unix socket addresses. [11]

20110528:       p2      FreeBSD-SA-11:02.bind
        Fix BIND remote DoS with large RRSIG RRsets and negative
        caching.

20110420:       p1      FreeBSD-SA-11:01.mountd
        Fix CIDR parsing bug in mountd ACLs.

20110221:
        8.2-RELEASE.
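
To scan an UPDATING excerpt like the one above for a particular subsystem, the patch-level header lines can be pulled out with a short filter. This is a sketch with an invented function name (`list_patch_levels`); it prints one line per patch level with the date and whatever advisories appear on the header line itself, so advisories continued on following lines are not included.

```shell
#!/bin/sh
# Read UPDATING text on stdin and print "pN DATE ADVISORIES..." for each
# patch-level header line (e.g. "20120104:  p6  FreeBSD-EN-12:01...").
list_patch_levels() {
    awk '$1 ~ /^[0-9]+:$/ && $2 ~ /^p[0-9]+$/ {
        date = substr($1, 1, length($1) - 1)   # strip the trailing colon
        printf "%s %s", $2, date
        for (i = 3; i <= NF; i++) printf " %s", $i
        printf "\n"
    }'
}
```

Something like `list_patch_levels < /usr/src/UPDATING | grep -i zfs` would then show whether any patch-level header names a ZFS advisory; for the entries quoted above it would print nothing.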