Server Reboots

Hello,
I have a Supermicro FatTwin server running pfSense.

About every 10 to 15 days it suddenly reboots. I have been going through all the logs for days without any luck; the only message I found is:

Fatal trap 12: page fault while in kernel mode
cpuid = 38; apic id = 32
fault virtual address = 0x1
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e91aca

stack pointer = 0x28:0xfffffe085af6c8f0

Fatal trap 12: page fault while in kernel mode
cpuid = 46; apic id = 3a
fault virtual address = 0x1
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f8f3e5
stack pointer = 0x28:0xfffffe085b817920
frame pointer = 0x28:0xfffffe085b817b20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1

processor eflags = interrupt enabled, resume, IOPL = 0

Fatal trap 12: page fault while in kernel mode
current process = 12 (irq333: igb5:que 6)
cpuid = 2; frame pointer = 0x28:0xfffffe085af6c970
code segment = base 0x0, limit 0xfffff, type 0x1b

trap number = 12

= DPL 0, pres 1, long 1, def32 0, gran 1
panic: page fault
cpuid = 46
KDB: enter: panic
panic.txt: page fault
version.txt: FreeBSD 11.3-STABLE #243 abf8cba50ce(RELENG_2_4_5): Tue Jun 2 17:53:37 EDT 2020
root@buildbot1-nyi.netgate.com:/build/ce-crossbuild-245/obj/amd64/YNx4Qq3j/build/ce-crossbuild-245/sources/FreeBSD-src/sys/pfSense


Any idea what is happening?

I googled the "page not present" error and some results point to memory, but the memory check came back OK.
 
Any idea what is happening?

Yes, this is called a kernel panic. It means that the OS itself got into a situation that should never happen, and had to terminate. (You are lucky, because this means that there is a traceable reason for the reboots.)
The possible causes are wide-ranging: any kind of hardware malfunction, but also misconfigurations and bugs.

Normally the system will then write a dump of the memory contents, and this dump can be analyzed to get a clue about what actually happened. The dump should go into the swap space, and will be copied to /var/crash during the next startup.

So there are two things to do:
1. make sure the dump gets written (not very difficult), and
2. get a clue from the dump (not very easy).

To get the dump, you need swap space configured (or more specifically, the dumpdev parameter in rc.conf must point to a disk partition at least as big as your memory), and enough free space in /var/crash. Maybe the dumps are already there?
For further details, please see the FreeBSD Handbook.
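
If you want to check those two preconditions with a script, here is a minimal Python sketch. It assumes a stock FreeBSD-style /etc/rc.conf; pfSense may manage this setting in its own way, so treat the file location as an assumption. The function names are mine, just for illustration.

```python
import re
import shutil


def dumpdev_setting(rc_conf_text):
    """Return the last dumpdev value found in rc.conf-style text, or None."""
    value = None
    for line in rc_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r'^dumpdev="?([^"]*)"?$', line)
        if match:
            value = match.group(1)  # later assignments override earlier ones
    return value


def dumps_enabled(value):
    """dumpdev must be set, and set to something other than NO."""
    return value is not None and value.upper() != "NO"


def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

You would feed `dumpdev_setting` the contents of rc.conf, and call `has_room("/var/crash", ...)` with the size of your RAM in bytes as a rough upper bound for a full dump.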

BTW, there is already something one can see from the data you provided:
current process = 12 (irq333: igb5:que 6)

(Check whether it is always this kind of process, and always the same device.)
This is an interrupt thread of igb, the driver for Intel PRO/1000 NICs. So look into these cards, or the PCI bus, or the memory. (A passed memory test doesn't mean much; it only catches obvious flaws and cannot test all conditions.)
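
To compare several crash reports quickly, something like the following Python sketch could tally the "current process" lines. It only assumes the line format shown in your panic message; how you load the report texts (for example from the files in /var/crash) is up to you, and the function name is just illustrative.

```python
import re
from collections import Counter

# Matches lines such as: current process = 12 (irq333: igb5:que 6)
PROC_RE = re.compile(r"current process\s*=\s*\d+\s*\(([^)]*)\)")


def tally_processes(texts):
    """Count how often each process name appears across crash report texts."""
    counts = Counter()
    for text in texts:
        counts.update(name.strip() for name in PROC_RE.findall(text))
    return counts
```

If the same igb interrupt thread dominates the tally, that points toward that NIC or its slot; if the names are all over the place, memory or the bus becomes more likely.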
 
Yes, I know it is a kernel panic, but I'm not able to find out why. I already checked the crash reports from the three times the server rebooted, but I didn't see anything specific. I will check the older crash reports to see whether the Intel PRO driver device shows up there too. This is a good start.

Thanks a lot!
 