Kernel panic several times a day

Deleted member 15063 · Sep 4, 2021

CLimbingKid said:
current process = 60306 (smbd)

Is it always this process?
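
One rough way to answer that is to collect the "current process" lines from every saved panic message and count them per process name. A minimal Python sketch, assuming the panic text has already been copied into a string; the function name and the sample lines (apart from the smbd one quoted above) are made up for illustration:

```python
import re
from collections import Counter

# Matches the panic line quoted above, e.g. "current process = 60306 (smbd)".
# Whitespace around "=" is left flexible because real panic output may use tabs.
PANIC_PROC = re.compile(r"current process\s*=\s*(\d+)\s*\(([^)]*)\)")


def tally_panic_processes(text):
    """Count how often each process name appears in 'current process' lines."""
    return Counter(match.group(2) for match in PANIC_PROC.finditer(text))


if __name__ == "__main__":
    # Hypothetical sample: only the smbd line comes from this thread.
    sample = (
        "current process = 60306 (smbd)\n"
        "current process = 1234 (zfskern)\n"
        "current process = 777 (smbd)\n"
    )
    for name, count in tally_panic_processes(sample).most_common():
        print(f"{name}: {count}")
```

If one name dominates, that process is worth a closer look; if the names are spread out, the cause is more likely underneath them (hardware, driver, memory).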

free-and-bsd · Sep 6, 2021

garry said:
It was a GA-Z77M-D3H, a "refurb" I got direct from China. No wifi (ethernet everywhere!). I also have an ATX GA-Z77X-D3H purchased new, and it runs like a champ with all four memory slots occupied with the fastest supported XMP memory.

My second guess (in my case) is my GPU card that's just too old (nVidia G100). Looks like the fan rotation is very slow, I guess it overheats & makes the computer reset. Is that possible, do you think? Computer itself has 16G RAM + Intel Xeon E31270 @ 3.40GHz, 500W PSU (quite enough I guess).
Had no such problems before I replaced the CPU (it was 2 core lousy Celeron) and GPU card (was GT640).

free-and-bsd · Sep 6, 2021

garry said:
No wifi (ethernet everywhere!).

Make no mistake about it: Z77n-wifi has 2 LAN ports into the bargain

That'll be why I chose it.
Made no use of wifi until recently when I started using it as AP with Radius security (just for the fun of it).

garry · Sep 7, 2021

free-and-bsd said:
Had no such problems before I replaced the CPU and GPU

Nice board but I prefer to ignore the onboard ethernet ports and always add a dual-port Intel Pro 1G ethernet card.
So now your system runs faster on the same memory ... maybe the memory has hit its performance limit. Try turning off XMP (Extreme Memory Profile) and Turbo Mode memory in the BIOS. It's a guess!

CLimbingKid · Sep 26, 2021

I wanted to report back on my painful TrueNAS experience - this one was a real bugger to track down. The many page-fault panics really did not point me in the right direction - or probably I just don't understand them well enough. In any case, they did not page-fault in the same process each time, from what I could see.

The most useful thing I did was to keep a detailed log of all the changes over my 5 months, which covered everything from swapping cables to recovering failed ZFS pools, optimising ESXi, rebuilding TrueNAS completely, etc. Sometimes I would have to wait a week before seeing issues - not only was I seeing these occasional restarts, but also checksum errors on the pool with no errors or checksum failures reported on the underlying drives.

Creating a separate VM with the FreeBSD distro and running stress-ng really gave me confidence back in the setup and was a great suggestion. It ran for a week continuously with 4 VMs and loads of RAM.

I eventually split my TrueNAS install across two separate instances/VMs, passing my HBA through to one and the onboard SATA controller to the other - then splitting my two ZFS pools across the two instances. Here I finally had the fault following the HBA, my LSI 9300. Despite direct fan cooling, it's getting to an incredible temperature just at idle. Looks like it's failing.

In hindsight I'm not sure how I could have used the panics to find this fault faster, but it looks like a failing LSI, which would make sense given the checksum failures found at pool level.
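
For anyone wanting to spot that pattern (checksum counts at pool or vdev level while the individual drives show zeros), the NAME/STATE/READ/WRITE/CKSUM table that `zpool status` prints can be scanned for non-zero counters. A minimal Python sketch, assuming the command output has been saved to a string; the function name is made up, and the sample output below is hypothetical, not taken from my system:

```python
# Hypothetical `zpool status` output, for illustration only.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0    12
\t  mirror-0  ONLINE       0     0    24
\t    da0     ONLINE       0     0     0
\t    da1     ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if in_config and not fields:
            # A blank line ends the device table for this pool.
            in_config = False
            continue
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            # Skip everything outside the table and short rows such as "logs".
            continue
        name, counters = fields[0], tuple(fields[2:5])
        # Counters are compared as text so abbreviated values also count as non-zero.
        if any(value != "0" for value in counters):
            flagged[name] = counters
    return flagged


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE_STATUS).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

In this made-up sample only the pool and the mirror are flagged while da0 and da1 stay clean, which is the kind of picture that points away from the disks and towards whatever sits between them and the host.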

Thanks to garry, _martin and everyone who came back with ideas and suggestions, and I hope this helps anyone else trawling the forums as I did - a page fault in my case was not RAM or CPU, but the HBA.

Thanks

CC

jbo@ · Sep 28, 2021

CLimbingKid, I'd like to express my respect for digging into this and reporting back usable results. As mentioned previously, these types of topics get abandoned rather frequently, which is a real shame as they can be extremely interesting and insightful. I always learn something when I read these types of threads, as extremely knowledgeable people start to get involved when the problem-observing party reports back with valuable information.
