Help: General Protection Fault on NAS (12.1)

Note to mods: If this isn't the right sub-forum, then please by all means move it. I put it in "System Hardware" because I'm suspecting a hardware issue. But I'm not positive. :)

Since around Christmas, and perhaps even further back, my NAS has started crashing randomly. By crashing I mean it dumps a core in /var/crash and reboots. The info files from the crashes all look the same:

Code:
# more info.7
Dump header from device: /dev/ada2p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 4114960384
  Blocksize: 512
  Compression: none
  Dumptime: Tue Jan 21 03:09:30 2020
  Hostname: [REDACTED]
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 12.1-RELEASE-p1 GENERIC
  Panic String: general protection fault
  Dump Parity: 2696097602
  Bounds: 7
  Dump Status: good

The core.txt file:
Code:
# cat core.txt.7
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
Unable to find matching kernel for /var/crash/vmcore.7

That "unable to find matching kernel for..." line concerns me. The core file is about 4GB in size; I didn't want to upload it and blow out any storage limits. If anyone's interested in seeing it, let me know and I can try to figure out another place to upload it.

This did not start because I upgraded to 12.1; it was happening on 12.0 as well. I was half-hoping I'd somehow bungled the upgrade to 12.0, so I pushed through an upgrade to 12.1. That didn't seem to matter in the least.

Helllllp? :-)

Hardware, for the record:
  • Supermicro MBD-X10SAT-O motherboard
  • Intel Core i7 4790 CPU
  • 16GB DDR3
Thanks
 
Since around Christmas, and perhaps even further back, my NAS has started crashing randomly.
If you haven't done any software updates or installed anything (hardware or software), then it's very likely a hardware problem. I'd start by checking for memory errors, then check your disks.

  • Supermicro MBD-X10SAT-O motherboard
  • Intel Core i7 4790 CPU
  • 16GB DDR3
The mainboard seems to support ECC memory; do you have that, or did you use 'regular' non-ECC memory?
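If you're not sure what's fitted, sysutils/dmidecode should be able to show what the board reports; just a quick sketch:

Code:
# pkg install dmidecode
# dmidecode -t memory | grep -i 'error correction'   # "None" means no ECC; something like "Single-bit ECC" means it's active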
 
I'd start by checking for memory errors, then check your disks.

Thanks. Wouldn't disk errors pop up in logs? The zpool command shows no concerns. And as for memory, the only accepted way to test that is with memtest86, correct?
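For the record, this is about the extent of what I've been checking on the pool and the logs:

Code:
# zpool status -x     # only complains if a pool actually has problems
# zpool status -v     # per-device read/write/checksum error counters
# tail /var/log/messages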

The mainboard seems to support ECC memory; do you have that, or did you use 'regular' non-ECC memory?

Non-ECC. The CPU in it is a desktop one, not a Xeon. ECC wouldn't work with that CPU. The motherboard supports both types of CPUs, too.
 
Wouldn't disk errors pop up in logs? The zpool command shows no concerns.
Not necessarily. Check your disks with sysutils/smartmontools and look at the SMART data of each disk. This won't provide conclusive evidence that a disk is broken, but there are certainly a lot of indicators to look for. Disks tend not to last forever.
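Something along these lines; /dev/ada0 is only a placeholder, repeat for each drive:

Code:
# pkg install smartmontools
# smartctl -H /dev/ada0        # quick overall health verdict
# smartctl -a /dev/ada0        # full attribute table plus the drive's error log
# smartctl -t long /dev/ada0   # kick off a long offline self-test, read the result later with -a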

Non-ECC. The CPU in it is a desktop one, not a Xeon. ECC wouldn't work with that CPU. The motherboard supports both types of CPUs, too.
That's a shame, as ECC would complain if a memory error happens. A lot of the time the error can be corrected, so you don't get "real" memory errors while you wait for the spare parts to arrive.
 
Not necessarily. Check your disks with sysutils/smartmontools and look at the SMART data of each disk. This won't provide conclusive evidence that a disk is broken, but there are certainly a lot of indicators to look for. Disks tend not to last forever.

Fair call but bad disks shouldn't (...) cause GPFs. Note the emphasis. Anyway, no SMART errors or bad indicators of any sort on all nine drives.

That's a shame, as ECC would complain if a memory error happens.

Sure, and increase the cost of the rig, because I'd have needed to buy ECC RAM and a Xeon to go with it. ;-)
 
Fair call but bad disks shouldn't (...) cause GPFs.
True for normal operations, yes. Except when the bad part is in your swap partition. Then it's possible some bad data gets swapped back into memory which could cause all sorts of havoc.
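If you want to rule that out, you can take swap offline for a while; roughly like this, assuming the swap device is the one listed in /etc/fstab:

Code:
# swapinfo -h    # show which devices are currently used for swap
# swapoff -a     # release every swap device listed in /etc/fstab
                 # (comment the swap line out of /etc/fstab if it should stay off across reboots)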

Sure, and increase the cost of the rig, because I'd have needed to buy ECC RAM and a Xeon to go with it. ;-)
Yeah, I know. I have one machine with a dual Xeon and 96GB ECC memory. It was a surplus machine and it was donated to me. But yeah, I'm spoiled.
 
True for normal operations, yes. Except when the bad part is in your swap partition. Then it's possible some bad data gets swapped back into memory which could cause all sorts of havoc.

Having already thought of that, I disabled the swap partition and let the rig run along with "only" 16GB. It still tanked in the same fashion at one point not too long ago. So while that's a great suggestion, I don't think that's the snag here.
 
Do you have any non-ZFS filesystems on that machine? If so, have you run a thorough fsck(8) on them? (I can't remember the details; they're in the man page.)
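From memory it's something along these lines, run from single-user mode with the filesystem unmounted; the device name is only an example:

Code:
# fsck -f -y /dev/ada0p2   # force a full check even if the filesystem is marked clean, answer yes to repairs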
 
Do you have any non-ZFS filesystems on that machine?

All filesystems are ZFS.

I've since replaced the RAM with new DDR3 sticks to see if that helps. So far, the rig has been up for four days and counting, which is saying something. If I make it a month without a crash, I'll call the problem licked.
 
So far, the rig has been up for four days and counting, which is saying something.
Nice. So it sounds like the old memory was probably bad.

One way to really push a machine is to do a full build(7) (make -j8 buildworld, for example, on an 8-core machine). Just the build, no need to install anything from it. The build process is quite CPU-, I/O- and memory-heavy, so any bad hardware typically shows up quite quickly that way.
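Roughly like this, assuming the source tree for your release is already sitting in /usr/src:

Code:
# cd /usr/src
# make -j8 buildworld    # match -j to the number of cores/threads in the box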
 