Kernel Panics in 11.2-STABLE

networkninja · Jul 24, 2019

Hi,

I've been having an issue with one of my servers that runs rclone all day. It kernel panics after about 13-14 days of uptime, and the only difference on this host is that it runs a lot of rclone processes inside of a jail. Here is the console message:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address   = 0x9b
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff803f3714
stack pointer           = 0x28:0xfffffe2016267a60
frame pointer           = 0x28:0xfffffe2016267a90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 49134 (rclone)

Any help is greatly appreciated. Can anyone point me in the right direction to troubleshoot this?

Thanks

T-Daemon · Jul 24, 2019

Have a look at the handbook, Chapter 10, Kernel Debugging.

If you don't know how to debug it yourself, you can at least pass on more detailed information than console messages.

Edit:
Before kernel debugging you should update to 11.3-STABLE.

SirDice · Jul 24, 2019

networkninja said:
Any help is greatly appreciated. Can anyone point me in the right direction to troubleshoot this?

You're running a -STABLE branch, which is essentially a development version. Update to the latest -STABLE first.

networkninja · Jul 24, 2019

I wonder if I’m hitting this 10-year-old bug? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=138126

The reason it’s -STABLE is because this is a FreeNAS box, so in theory it has been thoroughly tested. I’m going to have to move rclone to a different host since I don’t want one of my main storage boxes panicking. Anything I can do to help squash this bug?

Could this be hardware related? The last 5 kernel panics all look exactly the same: trap 12, and the running process is always rclone. This is also inside a jail.

Thanks in advance.

SirDice · Jul 24, 2019

networkninja said:
The reason it’s -STABLE is because this is a FreeNAS box,

PC-BSD, FreeNAS, XigmaNAS, and all other FreeBSD Derivatives

networkninja · Jul 24, 2019

SirDice said:
PC-BSD, FreeNAS, XigmaNAS, and all other FreeBSD Derivatives

Understood, I didn't know that. I am going to build a FreeBSD VM and run rclone there to see whether it panics too.

Thank you for the feedback.

_martin · Jul 27, 2019

networkninja said:
Could this be hardware related?

The virtual address (same as in the PR you mentioned) seems bogus. That would explain the trap (page fault). There's not enough information here or in the PR to tell you more. Though nothing can be ruled out, I'd say this is a software-related problem.

If you find a way to reproduce this in a vanilla FreeBSD installation do share.

Terry_Kennedy · Jul 31, 2019

_martin said:
The virtual address (same as in the PR you mentioned) seems bogus. That would explain the trap (page fault). There's not enough information here or in the PR to tell you more. Though nothing can be ruled out, I'd say this is a software-related problem.

Indeed. Note that "you" in the following text is directed at the original poster, not the post I'm replying to. It is a semi-canned response to this sort of bug report.

Normally, if this happens at a high, repeatable address, it is a software problem, such as a missing check for a "can never happen" condition. If it happens at a high, varying address, it could be a hardware problem, but we won't go into that here. If it happens at a low, repeatable address, it is likely the same sort of software problem, generally a pointer to an object holding bogus data, or the object itself containing a bogus reference. If it happens at a low, varying address, things are more complicated, as it is possible that some data structure was corrupted earlier and left a "ticking bomb" for the system to trip over much later.
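
As a rough illustration of the four cases above, here is a minimal Python sketch that pulls the fault virtual address out of saved console messages and sorts a set of panics into high/low and repeatable/varying. The cutoff `LOW_LIMIT` and the function names are assumptions made up for this example, not anything defined by FreeBSD, and the result is only a hint about where to look.

```python
import re

# Hypothetical cutoff between "low" and "high" fault addresses; the post
# gives no number, so this value is only an assumption for illustration.
LOW_LIMIT = 0x10000

FAULT_RE = re.compile(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)")


def fault_address(panic_text):
    """Extract the fault virtual address from a trap 12 console message."""
    match = FAULT_RE.search(panic_text)
    return int(match.group(1), 16) if match else None


def classify(addresses):
    """Return (height, repeatability) for fault addresses from several panics."""
    if not addresses:
        raise ValueError("need at least one fault address")
    height = "low" if all(a < LOW_LIMIT for a in addresses) else "high"
    if len(addresses) < 2:
        repeatability = "unknown"  # one panic cannot show a pattern
    elif len(set(addresses)) == 1:
        repeatability = "repeatable"
    else:
        repeatability = "varying"
    return height, repeatability


# The one address posted in this thread.
sample = "fault virtual address   = 0x9b"
print(classify([fault_address(sample)]))
```

With only the single address posted so far, the sketch can say the address is low but not whether it repeats; feeding it the address from each of the five panics would settle that.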

The first step to getting help on this would be to get a backtrace from the panic, for multiple panics. If you are lucky, they will all be at the same place in the same backtrace path.
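
To check whether several backtraces really are "at the same place", one simple approach is to compare them frame by frame, starting from the innermost frame. The sketch below assumes each backtrace has already been reduced by hand to a list of function names, innermost first; the names used here are placeholders, not frames from any real panic.

```python
def common_path(backtraces):
    """Return the leading frames shared by every backtrace (innermost first)."""
    shared = []
    for frames in zip(*backtraces):
        if len(set(frames)) != 1:
            break  # the backtraces diverge at this depth
        shared.append(frames[0])
    return shared


# Placeholder frame names, purely hypothetical.
traces = [
    ["func_c", "func_b", "func_a"],
    ["func_c", "func_b", "other_caller"],
]
print(common_path(traces))
```

If the shared part is empty, the panics are not happening in the same place, which points toward the more complicated cases described earlier.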

If that doesn't help track it down, the next step would be to build a kernel that contains more sanity checks (offhand, the options are INVARIANTS and WITNESS, but check the Handbook to make sure) and see if that kernel detects and reports an issue when running your workload. Note that these kernels are slower due to the added checks (which is why they're not in the releases users normally run). That alone may change things enough that you won't see the problem.

In any event, once you have gathered this additional data (preferably on an officially-supported release), you can open a bug with enough information that a developer can look at it, or get some "me, too" responses from other users who are having the same problem.
