9.0R ZFS Panic

I have a pair of servers that I recently converted from OpenIndiana to FreeBSD 9.0-RC3 (export data pool (raidz), replace boot drives, install FreeBSD, import pool). FreeBSD was installed in both cases on a ZFS root mirror (using http://www.aisecure.net/2011/11/28/root-zfs-freebsd9/). Server #1 (2x dual-core Xeon) has functioned flawlessly, but #2 (dual Opteron--Tyan K8S MB) has had intermittent kernel panics ("avl_find() succeeded inside avl_add()") since the conversion. The only references to this error I've been able to find concern OpenSolaris, and none of them has a resolution.

The two installs are nearly identical -- I created the root drive for server #2 by adding it to the mirror of server #1 after setup and then removing it. I've recently upgraded both systems to 9.0-RELEASE, but server #2 is still experiencing the same issue. Thinking it might be an issue with the imported filesystem, I attempted a scrub. Unfortunately, the scrub cannot complete without the server panicking. I reinstalled the OpenIndiana boot drives and ran the scrub from there--it fixed a few checksum errors, but the panics still occur after switching back to FreeBSD.

This is compounded by a possibly unrelated issue causing the server to hang on panic instead of rebooting automatically. I disabled debug in sysctl.conf and it now displays the 15-second countdown but appears to hang instead of rebooting. Rebooting via the shutdown command works so I don't think reboot functionality is otherwise broken. The kernel panics wouldn't be a huge issue if the autoreboot worked (it's a remote server). Anyone have any ideas on the panic or at least how to ensure the server reboots if it occurs -- do I need to disable tracing as well?
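For reference, the two panic-related settings mentioned above would look roughly like this in /etc/sysctl.conf. This is only a sketch: both OIDs exist only on kernels built with "options KDB", and whether turning off the trace has any effect on the hang is an open question, not a confirmed fix.

```
# /etc/sysctl.conf
# Assumption: the running kernel was built with "options KDB".
# Do not drop into the kernel debugger on panic.
debug.debugger_on_panic=0
# Do not print a stack trace on panic.
debug.trace_on_panic=0
```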

Thanks,
Will M.
 
Please post more technical information about the faulty machine, especially: zpool version, RAM size, zpool size, disk count, disk type, etc. Are there any errors in syslog related to this issue? If so, please post them too.
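Something along these lines should collect most of that in one go. It is a rough sketch, not a polished tool: the function names are made up for illustration, and the pool name "tank" is a placeholder for the real pool name. Commands that are missing on the system are skipped rather than treated as errors.

```sh
# Run a command if it exists on this system; otherwise note that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 || echo "(command exited with an error)"
    else
        echo "== $1: not available, skipped"
    fi
}

# Print the details asked for above. The pool name is a placeholder argument.
collect_report() {
    pool="${1:-tank}"
    run_if_present zpool get version "$pool"   # on-disk pool version
    run_if_present zpool list "$pool"          # pool size and usage
    run_if_present zpool status -v "$pool"     # vdev layout, disk count, errors
    run_if_present sysctl -n hw.physmem        # RAM size in bytes
    run_if_present camcontrol devlist          # attached disk models
}
```

Run it as root with the pool name, e.g. `collect_report tank > report.txt`, and post the resulting file.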
 
Sorry, I'll try to be more verbose. Unfortunately, I have switched the faulty server back to the OpenIndiana boot disks, so I won't be able to gather debug info easily--I had taken a picture of the trace earlier, but I seem to have lost it.

Both pools are v28 RAIDZ, originally created by OpenSolaris. Server #1 is 5x1TB + 5x1.5TB (striped) w/ 8GB Adata DDR2 SSD ZIL, and Server #2 is 5x2TB (Samsung HD204 drives)--no ZIL. Both servers are using LSI 1068 HBAs (though #1 is PCI-E and #2 is PCI-X). I forgot to mention in the first post that both servers have kern.maxvnodes=250000 set in loader.conf.

Other Server2 specs:
Tyan S2882 MB w/2xOpteron 246
10GB RAM

I didn't notice anything in the logs leading up to the panic, and the servers have swap on ZFS, which I understand breaks crash dumps. It does appear to be related to disk activity (a scrub will trigger it, and my unison sync scripts manage to do so as well).
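If crash dumps are wanted on a later attempt, one option would be to point the dump device at a plain swap partition outside the pool. A sketch of the /etc/rc.conf lines follows; the device name is hypothetical and would need to be replaced with a real partition large enough to hold the dump.

```
# /etc/rc.conf
# Hypothetical dedicated swap partition outside the pool; substitute the real device.
dumpdev="/dev/da0p3"
# Where savecore(8) writes the dump on the next boot (this is the default).
dumpdir="/var/crash"
```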

One thing I did find interesting: if I force a panic manually on #2 (kill -6 1), the automatic reboot works correctly. #2 is also reachable via ping while stuck in the panicked state--not sure if this is normal.

I'll likely reattempt the FreeBSD conversion of #2 in a few weeks, but this will likely involve a new MB, CPUs, RAM, HBA, and a fresh pool, so I don't expect the issue to recur.
 
Would it be possible to swap the drives between the two servers? As in, physically move them between servers?

If the "bad" pool works correctly on the Intel system, and the "good" pool also fails on the AMD system, then it would point toward a hardware issue with the AMD system. Check cables, reseat controllers, reseat RAM, check heatsinks/fans, clean out dust, run memtest86+, run a CPU checker, etc.

If the "bad" pool also fails on the Intel system, and the "good" pool works correctly on the AMD system, then it would point toward "something" wrong with the pool.
 
Unfortunately, the two servers are mirrors of each other--can't really take them down at the same time.

FreeBSD doesn't do any post-install CPU optimizations/tuning does it? The two installs were cloned from each other--if it were trying to use SSE2/3 instructions it would explain the panics. I've never run across any documentation that this is the case, though, so I haven't really investigated it.

When I rebuild the server, the new board (S2997) will be AMD as well. I'll try to follow up in this thread on whether it works.
 
If you set CPUTYPE in /etc/make.conf on the Intel box, then went through a buildworld cycle, and then cloned that to the AMD box ... bad things will happen. :)

If you did not set CPUTYPE, then it will default to i486 (for 32-bit CPUs) or amd64 (with MMX, SSE, and SSE2 enabled; for 64-bit CPUs). Those should be usable on both Intel and AMD versions of 64-bit x86.
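A quick way to rule this out is to look for an uncommented CPUTYPE assignment in /etc/make.conf on the box the install was cloned from. The helper below is just an illustrative sketch (the function name is made up); it takes the file path as an optional argument and matches the `=`, `?=`, `:=` and `+=` assignment forms.

```sh
# Report whether a make.conf-style file sets CPUTYPE.
# Usage: check_cputype [path]   (defaults to /etc/make.conf)
check_cputype() {
    conf="${1:-/etc/make.conf}"
    if [ ! -f "$conf" ]; then
        echo "no file"
        return 0
    fi
    # Match uncommented assignments such as CPUTYPE=... or CPUTYPE?=...
    if grep -Eq '^[[:space:]]*CPUTYPE[[:space:]]*[?:+]?=' "$conf"; then
        echo "CPUTYPE set"
    else
        echo "CPUTYPE not set"
    fi
}
```

If it reports "CPUTYPE not set" (or there is no make.conf at all), the defaults described above apply and the cloned install should run on both machines.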
 