I wanted to report back on my painful TrueNAS experience - this one was a real bugger to track down. The many page fault panics really did not point me in the right direction - or perhaps I just don't understand them well enough. In any case, from what I could see, they did not page fault in the same process each time.
The most useful thing I did was to keep a detailed log of all the changes over my 5 months, which covered everything from swapping cables to recovering failed ZFS pools, optimising ESXi, rebuilding TrueNAS completely, etc. Sometimes I would have to wait a week before seeing issues - not only was I seeing these occasional restarts, but also checksum errors on the pool with no errors or checksum failures reported by the underlying drives.
Creating a separate VM with the FreeBSD distro and running stress-ng really gave me confidence back in the setup and was a great suggestion. It ran for a week continuously with 4 VMs and loads of RAM.
I eventually split my TrueNAS install across two separate instances/VMs, passing my HBA through to one and the onboard SATA controller to the other - then splitting my two ZFS pools across the two instances. Here I finally had the fault following the HBA, my LSI 9300. Despite direct fan cooling it's getting to an incredible temperature just at idle. Looks like it's failing.
In hindsight I'm not sure how I could have used the panics to find this fault faster, but it looks like a failing LSI, which would make sense given the checksum failures found at pool level.
Thanks to garry, _martin and everyone who came back with ideas and suggestions, and I hope this helps anyone else trawling the forums as I did - a page fault in my case was not RAM or CPU, but the HBA.
Thanks
CC