[Solved] Kernel panic when scrubbing ZFS pool

Hi!

My FreeBSD HTPC started crashing regularly, about once a day, but much more often (roughly once per hour) when it's scrubbing my 6x2TB RAID-Z2 pool.
To verify my theory, I downloaded and flashed the minimalusbstick.img onto a USB stick, booted into Live mode, imported the pool and scrubbed it - no problems whatsoever.
The computer passes Memtest86(+) for some 12 hours, and the same with Prime95.

Hardware:
• CPU: Intel Core 2 Quad Q6600
  • Mobo: P5N32-E SLI
  • RAM: 8GB DDR2 800MHz
  • GPU: NVIDIA GT 520
  • IBM ServeRAID M1015 for SATA port expansion

Here is the error code that gets printed onto the screen before it reboots:
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0, apic id = 00
If I do cat /var/log/messages | grep panic (or grep for fault or trap instead of panic), I don't get anything. So I actually waited and took a picture of the screen when it crashed. Please tell me if there's a better way.
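Incidentally, the three separate pipelines can be collapsed into one grep -E pass. A minimal sketch, demonstrated on a throwaway sample file rather than the real /var/log/messages (as this thread shows, the panic text often never reaches the log on disk, which is why the searches come back empty):

```shell
# Sketch: one grep -E pass instead of three separate
# "cat /var/log/messages | grep ..." pipelines.
# /tmp/messages.sample stands in for the real /var/log/messages here,
# seeded with the panic lines quoted in this post.
cat > /tmp/messages.sample <<'EOF'
Oct  1 10:43:42 htpc kernel: Fatal trap 12: page fault while in kernel mode
Oct  1 10:43:42 htpc kernel: cpuid = 0; apic id = 00
EOF
grep -E 'panic|trap|fault' /tmp/messages.sample
```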

Anyhow, I enabled crash dumps and attached two links to two dumps, which I ask you to please take a look at. (I'm unable to get kgdb working.)

https://filetki.si/owncloud/public.php? ... 7370b03121
https://filetki.si/owncloud/public.php? ... 6ad6e784f1
EDIT: Links taken down. I got it working by running the following line, thanks to nakal.
Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.0
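For completeness, since the post only says crash dumps were enabled but not how: on FreeBSD this is typically a small /etc/rc.conf fragment like the following sketch (dumpdir is shown with its usual default).

```shell
# /etc/rc.conf fragment (sketch): tell the kernel where to dump on panic
# and where savecore(8) should place the vmcore files after reboot.
dumpdev="AUTO"         # dump to the configured swap device
dumpdir="/var/crash"   # the default location anyway; shown for clarity
```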
 
You should check your hardware thoroughly. I've had a similar issue with random crashes that were caused by Apache handling requests. No tools like memtest could find anything. Then I got the idea to start 4 processes of "openssl speed", and with that I could reliably crash the system. I also wrote a small program that filled memory with byte patterns while "openssl speed" was running; it found byte patterns that it had not written. After a hint from a FreeBSD developer that it was hardware, I began to search and ordered a replacement CPU. And indeed it was the cache in the CPU that was faulty: it flipped some bytes and caused Apache to crash, and sometimes the entire system. I describe this here so you know that finding faulty hardware is not trivial.

From my experience, you have to have very bad luck if it is really FreeBSD that causes the crashes (I have never run into this situation, except with some special kernel features and modules, which clearly do not fail randomly but crash quite reliably and reproducibly). Mostly you need to hunt for hardware problems in this case.

One question remains... are you running the GENERIC kernel, or does it differ from the one on the USB stick?
 
Migelo said:
My FreeBSD HTPC started crashing regularly, about once a day, but much more often (roughly once per hour) when it's scrubbing my 6x2TB RAID-Z2 pool.
To verify my theory, I downloaded and flashed the minimalusbstick.img onto a USB stick, booted into Live mode, imported the pool and scrubbed it - no problems whatsoever.
The computer passes Memtest86(+) for some 12 hours, and the same with Prime95.

Anyhow, I enabled crash dumps and attached two links to two dumps, which I ask you to please take a look at. (I'm unable to get kgdb working.)
The first dump file is 88 MB, so I'm not going to download it just to look at it.

In general, if the backtrace is the same in each case (including source line numbers), you have hit a reproducible software bug. In theory, a filesystem error (either ZFS or the older UFS) should not cause a kernel panic. But if corrupted / bad data is read and the kernel tries to act on it without checking it for validity, you can get a panic. Many of these operations are guarded with checks which either prevent the crash or provide a more useful crash reason (like "panic: assert: foo_pointer is NULL"), but not all of them are (all those checks introduce additional overhead for dealing with a [presumably] very rare situation).

If you get the panic from different places in the kernel (even if triggered by the same command), you probably have a hardware problem such as bad CPU, memory, power supply, etc. There is at least one case where this can still be a kernel bug - if a prior kernel operation caused data to be written to the wrong memory location, then random code later on may detect that corruption and panic the system.
 
nakal said:
I ordered a replacement CPU. And indeed it was the cache in the CPU which was faulty, flipped some bytes and caused Apache to crash and sometimes the entire system. I just described it here so you know that finding faulty hardware is not very trivial.
On any relatively recent CPU*, it should log an MCA message like this (copied from here) rather than just dying:

Code:
Oct  1 10:43:42 marvin kernel: MCA: Bank 4, Status 0xf41b210030080a13
Oct  1 10:43:42 marvin kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
Oct  1 10:43:42 marvin kernel: MCA: Vendor "AuthenticAMD", ID 0x40fb2, APIC ID 0
Oct  1 10:43:42 marvin kernel: MCA: CPU 0 UNCOR OVER BUSLG Responder RD Memory
Oct  1 10:43:42 marvin kernel: MCA: Address 0xbfa478a0
Intel implemented MCA starting with the Pentium 4. I'm not sure if AMD implemented it before or after that.
 
nakal said:
One question remains... are you running the GENERIC kernel, or does it differ from the one on the USB stick?
GENERIC, yes.

I don't think it's a hardware problem, since if I subject the CPU to the same load (zpool scrubbing) under the same FreeBSD 9.3 GENERIC system, it does not crash.

Terry_Kennedy said:
The first dump file is 88 MB, so I'm not going to download it just to look at it.
It's actually about 800 MB; I just compressed it well. How do I get the backtrace, since the debugger is not working?

If you get the panic from different places in the kernel (even if triggered by the same command)
How can I see this?

Terry_Kennedy said:
On any relatively recent CPU*, it should log an MCA message like this.
cat /var/log/messages | grep marvin turns up empty.

Can someone please take a look at the dumps to see if they have the same backtrace?
 
Migelo said:
Terry_Kennedy said:
On any relatively recent CPU*, it should log an MCA message like this.
cat /var/log/messages | grep marvin turns up empty.

grep for "MCA" instead. "marvin" is the hostname of the machine from which this example comes.
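The suggested search, sketched on a throwaway copy of the example lines from earlier in the thread; on the real machine you would grep /var/log/messages directly:

```shell
# Sketch of the suggested "grep for MCA" search. /tmp/mca.sample stands
# in for /var/log/messages, seeded with the MCA example lines plus one
# unrelated line to show that only the MCA entries are printed.
cat > /tmp/mca.sample <<'EOF'
Oct  1 10:43:42 marvin kernel: MCA: Bank 4, Status 0xf41b210030080a13
Oct  1 10:43:42 marvin kernel: MCA: Address 0xbfa478a0
Oct  1 10:43:43 marvin sshd[123]: unrelated log line
EOF
grep MCA /tmp/mca.sample
```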
 
Migelo said:
Terry_Kennedy said:
If you get the panic from different places in the kernel (even if triggered by the same command)
How can I see this?
It should be in the panic message itself. Take a look at this image from another, unrelated trap 12 panic. Look at the instruction pointer if you don't have a full backtrace. Note that checking for the "same place" via the instruction pointer requires identical kernels and identical hardware. If the panic includes a backtrace, like the one shown in this image, you can simply compare the offsets into each module (foo+0x123) to determine whether you are seeing the same cause. Note that the top few entries in the backtrace will generally be the same, since they are part of the panic reporting.

If you post pictures of the screen with the full panic message displayed, other people can help you see what's relevant.
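Once you have the backtraces as text, the offset comparison described above can be mechanized. A sketch, using hypothetical frame contents made up for illustration (not taken from the poster's actual dumps):

```shell
# Sketch: extract the "function+offset" frames from two saved panic
# backtraces and diff them. The frames below are illustrative examples.
cat > /tmp/panic1.txt <<'EOF'
#3 0xffffffff808f1c2d at vm_page_remove+0x6d
#4 0xffffffff808f2a11 at vm_object_terminate+0xa1
EOF
cat > /tmp/panic2.txt <<'EOF'
#3 0xffffffff808f1c2d at vm_page_remove+0x6d
#4 0xffffffff808f2a11 at vm_object_terminate+0xa1
EOF
grep -o '[a-z_]*+0x[0-9a-f]*' /tmp/panic1.txt > /tmp/frames1.txt
grep -o '[a-z_]*+0x[0-9a-f]*' /tmp/panic2.txt > /tmp/frames2.txt
# identical frame lists mean both panics took the same code path
diff /tmp/frames1.txt /tmp/frames2.txt && echo "same backtrace"
```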

cat /var/log/messages | grep marvin turns up empty.
As was pointed out, marvin is the hostname of the system that example came from. But that whole reply of mine was directed at the other user, who said their CPU went bad with no indication. Sorry for confusing you into thinking it was related to your issue.
 
ag74 said:
Migelo said:
Terry_Kennedy said:
On any relatively recent CPU*, it should log an MCA message like this.
cat /var/log/messages | grep marvin turns up empty.

grep for "MCA" instead. "marvin" is the hostname of the machine from which this example comes.
Thank you, I'll post that later when I can get access to the machine.


Is there any better way than filming the machine's screen while it's scrubbing the pool until it panics?
I also provided the two crash dumps, which should give sufficient evidence of what is going on; I imagine we can figure out from the dumps themselves whether the panics have the same backtrace.
 
Migelo said:
Here are two pictures I managed to get from the crashes. I think that the traces are the same. One picture is very bad, I'm sorry.
OK, it looks like it is being triggered by ZFS, and it seems the fault is actually in vm_page_remove (which was probably passed bad data by one of the earlier routines).

I don't think you have a hardware problem - at least, a hardware problem is not the direct cause of these seemingly-identical panics. An earlier problem may have corrupted the pool in a manner that the kernel can't handle without a panic.

One of the mods will probably chime in and suggest which FreeBSD mailing list will be best to ask this question on (most developers don't read these forums, instead they follow the mailing lists in their areas of expertise).
 
Terry_Kennedy said:
On any relatively recent CPU*, it should log a MCA message, [...] rather than just dying:

Believe me, there was nothing like this in the logs. It was a silent memory error that appeared when the CPU was under load for a few minutes. Changing the CPU (an Intel Xeon) was the solution.

If the panic is not random (like in my case), locate it in the crash dump. Did you compare /boot with the /boot on USB? Maybe one of the kernel components is faulty. Did you use the same set of sysctl during normal boot and the USB boot?
 
Just one not-totally-related question: when I try to debug the kernel using kgdb I get:
Code:
kgdb: could not find a suitable kernel image
I googled but did not find anything. Also, the directory described in the Handbook is empty:
Code:
/usr/obj/


Did you compare /boot with the /boot on USB? Maybe one of the kernel components is faulty. Did you use the same set of sysctl during normal boot and the USB boot?
You mean the
Code:
/boot/loader.conf
?

Did you use the same set of sysctl during normal boot and the USB boot?
Please explain a bit better, I'm a noob. =)
 
Migelo said:
Just one not-totally-related-question: When I try to debug the kernel using kgdb I get:
Code:
kgdb: could not find a suitable kernel image

Did you try it like this (provided you have a crash dump in vmcore.0):
Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.0


Also, the directory described in the Handbook is empty:
Code:
/usr/obj/

It just means that you have never built world or a kernel yourself, I guess. Which is totally OK.

Did you compare /boot with the /boot on USB? Maybe one of the kernel components is faulty. Did you use the same set of sysctl during normal boot and the USB boot?
You mean the
Code:
/boot/loader.conf
?

No, I'm just asking myself whether the kernel and modules are OK on the filesystem. These are all the files under the /boot/kernel/ directory. You can compare them quickly by calculating checksums, or use cmp against the copies that work on the USB drive. It's not very probable that there is a difference, but I want to exclude the obvious errors.
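A minimal sketch of that comparison, demonstrated on two throwaway directories (on the real machine the two trees would be the installed /boot/kernel and the USB stick's /boot/kernel):

```shell
# Sketch: checksum every file in two directory trees and diff the lists.
# /tmp/kern-a and /tmp/kern-b stand in for the two /boot/kernel copies.
mkdir -p /tmp/kern-a /tmp/kern-b
echo "pretend kernel bits" > /tmp/kern-a/kernel
echo "pretend kernel bits" > /tmp/kern-b/kernel
( cd /tmp/kern-a && find . -type f -exec cksum {} \; | sort ) > /tmp/a.sums
( cd /tmp/kern-b && find . -type f -exec cksum {} \; | sort ) > /tmp/b.sums
# no diff output (exit status 0) means every file matches
diff /tmp/a.sums /tmp/b.sums && echo "kernel files match"
```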

Did you use the same set of sysctl during normal boot and the USB boot?
Please explain a bit better, I'm a noob. =)

If you don't know what sysctl is, you probably didn't change anything, so it's fine.
 
OK, that worked, and I got to debug the two crash dumps I posted links to in the first post. So it really seems that it's ZFS.

http://imgur.com/x3Asb6u
http://imgur.com/uCEpofl

Also, thank you for the explanation of why that folder is empty; there was no explanation in the Handbook.

Now things have taken a turn for the worse. If I import/export/scrub the pool, the whole system will just freeze. It doesn't matter if I'm running the already-installed system or a fresh live version from the USB stick. So I'm not going to compare the /boot/ directory, as it seems that it's a hardware fault. Do you agree?

Now I'm trying to get all the disks attached to another system, boot the live USB there, and see whether I get the same freeze.
 
The scrub completed fine on another system, so it's definitely a hardware problem. The motherboard? The SATA controller?
 
Thanks, I tested the tool and it seems cool. I'm used to running the usual Memtest86+ and the like; it's good to have one extra!

Strange as it seems, my mobo's memory controller is dying. At first I ran Memtest86+ and it worked, BUT the progress bar wouldn't move at all even when left for hours, and that got me thinking. Every live CD of any kind of system would just freeze after n minutes of use; no error, just a freeze. The same happened with Inquisitor: the screen would just freeze. So I replaced the 4x2GB of RAM with 4 sticks of another DDR2 kit, thinking it was the memory's fault. After a few freezes during various tests, the PC refused to boot, displaying the error BIOS ROM CORRUPT. I spent an hour researching how to flash a new BIOS without a floppy drive and then, just out of curiosity, pulled out all but one stick of RAM. The PC booted! It also booted with 2 sticks, but would freeze at POST with 3; I didn't even bother putting the 4th in.

So it seems that the mobo is slowly dying. Currently I'm scrubbing the pool to test my "will work with 2 sticks of RAM" theory. Will report back.
 
It has been working for well over a week now, so it seems that it truly was a HW problem. I'll report back if I get another crash. (Marked as SOLVED.) Thank you all!
 
This could also be a power supply problem, including failing capacitors on the motherboard. A scrub involves lots of disk activity that draws more power than normal.
 