pfSense crashing after 12-27 hours or so: ping responds, routing persists, but ssh, logging, etc. die.

I have a support request I have been updating on the pfSense forums, but thought that the BSD forum may have better expertise in enabling differential diagnostic tools to help isolate the cause of my crashing issue. I know these problems are difficult to debug, but perhaps someone has dealt with something similar and can offer a quick solution or suggest further diagnostics.

I recently migrated my pfSense firewall from an ancient IBM x336 (hundreds of watts) to a Protectli FW4B box (units of watts) and, aside from the crashing, it is performant and cute and quiet, all good things. The crashing, however, not so much. It is running coreboot, which seems like a poorly considered choice at this point, as it adds an unwanted additional variable.

The symptoms are that the system becomes unresponsive via web, ssh, and console after some number of hours of operation. I believe it may be related to a pfBlockerNG update, as I had a week-plus of uptime before an update to that package and haven't had more than 27 hours since, but that could be a chimera. I have run a scrub and checked for disk errors (none reported). The system is live and remote, so running sysutils/memtest86+ properly is difficult, but I did build sysutils/memtester and ran it on 4G (half the installed memory) and then 5G successfully with no errors reported; not conclusive, but indicative of a good and compatible DIMM.

When the system faults, VPN connections hang, unbound becomes unresponsive, nginx stops responding, and cron jobs don't seem to run. An ssh or console connection gets a login prompt, but the console never follows with a password prompt, and ssh prompts for a password but doesn't proceed after entry. The system responds to ping requests normally, and routing and 1:1 NAT continue. Logging (apparently) stops at the moment of the hang. No crashes or system errors are logged; the logs just stop updating until reboot.

These little boxes don't come with IPMI-style interfaces, alas, though I do have remote console via an Avocent. The problem is that sending a ctrl-alt-del to the hung device yields only:

Code:
init 1 - - timeout expired for /etc/rc.shutdown: Interrupted system call; going to single user mode
init 1 - - some processes would not die; ps axl advised

I'm not getting any core dumps. I've modified the sysctl options to be a bit more vocal if possible (changes that haven't been tested yet).
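For reference, the crash-dump plumbing I'm relying on is roughly the stock FreeBSD arrangement (a sketch; pfSense may handle some of this itself, and a hard hang rather than a panic won't produce a dump at all):

Code:
# /etc/rc.conf -- use swap as the dump device; savecore writes to /var/crash
dumpdev="AUTO"
dumpdir="/var/crash"

# confirm a dump device is actually configured (FreeBSD 12+)
dumpon -l

# to sanity-check the panic path, and only on a box you can afford to reboot:
# sysctl debug.kdb.panic=1 forces an immediate panic and (ideally) a dump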

I'd be very grateful for any hints (aside from "never run critical infrastructure on consumer hardware") or additional diagnostic advice.

-David
 
Update: crashed again. This time an SSH attempt correctly rejected a bad password, but on entry of a good password the login process stalled. It seems like some part of the kernel is still running, just not enough to function.
 
Thanks covacat, when I get it back up I'll stress test the disk IO. A disk IO timeout is very definitely possible; not great news given it is remote, but at least a very helpful diagnosis. It acted a little differently this crash: usually the Avocent console is filled with usbus0 disconnect messages from the KVM dongle, but this time it was filled with:

Screenshot from 2023-03-06 19-05-44.png


Maybe this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229745 but also maybe a flaky controller (which would be a bummer) or a flaky drive (which is also a bummer, but definitely a cheaper one).
 
Given that those small Protectli boxes are basically just (much) older-generation Topton devices, my guess would be either the cheap disk they were often equipped with in China, or just thermal problems, especially if this is a variant without a fan and/or a perforated bottom cover.

Not only because of the forum rules (this is a FreeBSD forum), but also because pfSense does some really weird and non-standard stuff with the underlying OS (like bundling unsupported driver versions/variants), you should install vanilla FreeBSD. If the problem still persists, you can start troubleshooting without all the pfSense stuff getting in the way (like constantly overwriting config files...).
 
it was filled with:
At least, these messages aren't pfSense-specific. It's certainly possible they are triggered by some "dodgy" driver (and maybe it's indeed something pfSense modified ... this uncertainty is the reason only original/vanilla FreeBSD is "supported" here). More likely, though: some hardware issue. I've seen things like that caused by nothing more than bad cabling....
 
My first step would be to run sudo smartctl -a /dev/ada0.

This will show a whole bunch of details that should help identify a failing disk, and also temperature issues.
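If the output is overwhelming, the interesting attributes can be narrowed down; something along these lines (assuming an ATA/SATA device at ada0):

Code:
# temperature, reallocated/pending sectors, and interface CRC errors
smartctl -a /dev/ada0 | egrep -i 'temperature|realloc|pending|crc'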
 
Re: Running smartctl: The CAM errors in the photo (why do people paste photos instead of cut-and-pasting the text?) look much more like disk communication errors. If the system were a normal and accessible desktop, I would start by checking SATA cables, power wiring, and power supply quality. On a remotely installed system, and with it being the somewhat bizarre Protectli hardware, that's more difficult to debug.
 
Hi all, thanks for the support on a very unsupported configuration. I do appreciate it.

TL;DR: thanks to the fortuitous capture of the console output, synchronized with covacat's comment, I've been led to the conclusion that it is a drive issue. It could very well be some minor cabling issue, a subtle drive incompatibility, a flaky driver, or some other hardware or driver subsystem fault. The bugzilla thread I found had some advice:

Code:
camcontrol tags ada0 -N 25
vfs.zfs.cache_flush_disable="1" -> /boot/loader.conf
hint.ahcich.0.sata_rev=2 -> /boot/device.hints

It's been 5 hours 34 minutes since setting those parameters and the system is still up. If/when it crashes, maybe there will be some useful diagnostic info. Even if this fixes the problem, it'll still be an annoyance after any update, as the fixes will be overwritten.
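For anyone following along, this is roughly how I'm checking that the workarounds actually took (ada0 assumed; vfs.zfs.cache_flush_disable is a legacy-ZFS tunable, so on an OpenZFS-based build it won't exist):

Code:
# queue depth should now report 25 openings
camcontrol tags ada0 -v

# loader tunable should be visible as a sysctl after reboot
sysctl vfs.zfs.cache_flush_disable

# the SATA-2 hint should show up as 300.000MB/s transfers in the probe messages
dmesg | grep -i ada0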


Some quick answers:
My screenshot is of an Avocent KVM via VNC from 12,000 km away. I don't have a serial dongle attached, so I can't remote the serial feed (though that is clearly a desirable feature for issues like this).

The drive is a Fortasa/Apacer/Transcend industrial SLC FMS-MSAN128GB-ITM.

Temperatures are quite mild (CPU 47°C).

There's a bit of a back story to the CAM status errors: because I don't have a serial console attached, console output only has a VGA screen's worth of history, and, as a consequence of whatever failure afflicts the system, the console gets spammed with USB disconnect errors, which pushed the CAM errors off screen and into the aether until I got lucky and caught the console today. Crashes are 3-27 hours apart, I'm not watching the console full time, and the messages are not written to disk (in hindsight, the reason for this may be obvious).

As for testing, I suspected disk or memory first and thought I'd ruled them out (I suspect now that, whatever the nature of the failure, my tests were unsuccessful at ruling out the disk). I ran a long SMART test, which on this disk takes 2 minutes, and it passed; full log attached in the event someone sees more in it than I do, but aside from the esoteric device not having a MIB, all the keywords are positive.

I initiated a zfs scrub and then checked zpool status; no errors at all.

I ran the diskinfo -cti tests; all quite performant, and no errors popped up. Running the tests didn't trigger a failure.

While not conclusive, I took this to at least indicate basic compatibility and function for the disk subsystem.
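For completeness, those checks were along these lines (device and pool names assumed; a pfSense ZFS install may name the pool something other than zroot):

Code:
# long SMART self-test, then read back the results
smartctl -t long /dev/ada0
smartctl -a /dev/ada0

# scrub the pool and look for read/write/checksum errors
zpool scrub zroot
zpool status -v zroot

# non-destructive seek, transfer-rate, and command-overhead tests
diskinfo -cti /dev/ada0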

As this is a live and remote system, rebooting to memtest86 wasn't practical. I compiled and "side loaded" memtester and ran a few passes on 5 of 8 GB of RAM; no errors at all. Again, while not conclusive, at least indicative of good RAM.
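The memtester run was essentially this (the size and pass count here are just what I could afford to lock while the system stayed live):

Code:
# test 5 GB of RAM for 3 passes; memtester mlock()s the region and walks its test patterns over it
memtester 5G 3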

I wrote a little script that gets called by cron every 5 minutes to record the status of the system in some detail, in the hope that I might capture some anomaly a few minutes before a failure. I don't see anything amiss, but perhaps more expert eyes than mine would. The log files don't seem to have anything of import in the last minute of consciousness.
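The script is nothing clever; a sketch of its shape (the exact commands and paths are my choices):

Code:
#!/bin/sh
# /root/status_snapshot.sh -- append a periodic health snapshot to a log
LOG=/var/log/status_snapshot.log
{
  echo "===== $(date)"
  uptime
  vmstat
  iostat -x
  zpool status -x
  ps auxww | head -40
} >> "$LOG" 2>&1

It is called from root's crontab with */5 * * * * /root/status_snapshot.sh.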
 

Attachments

  • ps_status.txt (31.8 KB)
  • smartctl-diskinfo.txt (5.8 KB)
 
What's the operational environment of the system? Is there a UPS? Is it in the vicinity of high-current electric motors such as used in lifts (elevators)?

Can you get somebody capable to remove, clean, and re-seat all the cables, connectors, and DIMMs?
 
Good questions. There's no cable, as the drive is mSATA. Reseating connectors is a very good idea. It is possible as a remotely guided operation: just 4 screws for the back plate, 2 for the mSATA retention, and the clips on the DIMM.

We're at 15 hours uptime now after the interface-moderating changes I noted above (which, as sko correctly noted, get blown away on update). This, while heartening, is hardly conclusive: it ran 27 hours before crashing after a different modification, which resulted in much, sadly premature, celebration.

The power environment is as stable as I could make it. I work in Iraq and our power goes out 4-8 times a day. The national grid provides 180-245 VAC at 43-52 Hz, and while most of the generators I am on are pretty well regulated, many are pretty much throttle-controlled, load-variable voltage/frequency open loop. And, having lived through Enron's attempt to blackmail California, I have 2x APC SURTA3000XL UPSes. As the Protectli doesn't have dual inputs (unlike the IBM x336 and all my switches, DC or AC), it is powered by a pair of ABB CP-E 12/10.0 power supplies connected via a TDK-Lambda DRM40 redundancy module (the output of each supply tweaked to deliver 12.00 V after the 200 mV drop the DRM40 introduces). This DC subsystem also powers the AT&T BGW210-700, and the fiber interface gets 12 V from the same DIN blocks.

Current uptime 18 hours...
 
Those things don't care about fluctuating grid voltages/frequencies as long as they get a halfway decent ~12 V DC input (or whatever voltage they are rated for). Any cheap switching power supply can manage that even with the crappiest generator supply. If you have PoE switches, try using a PoE splitter (but only if you can set the PoE output to x2 or a fixed value, because the appliance will need >15.4 W at boot).

What brand is that mSATA drive? As said: those Topton systems are often equipped with the cheapest no-name drives (and RAM) that can be found on this planet.
I/we run 6 of those N5105-based boards on Free- and OpenBSD as (VPN-)routers and firewalls without any issues - but all of them are using WD blue or Micron NVMes and Kingston RAM.

But there's also still the huge variable of pfSense doing very weird and non-standard stuff with their drivers and default configurations, so get rid of it and use vanilla FreeBSD so we can actually help you (IF the problem still persists). Writing pf.conf and any other config manually is far more efficient, faster, and easier than dealing with that inflexible GUI anyway...
 
I have a bit of direct experience with suboptimal power and could tell some stories, but to your point, I agree the supply as configured is sufficient to provide stable and reliable power to the device, especially compared to a US-spec grid.

I agree about generic ODM drives; the drive in this particular box is a Fortasa (OEM: Apacer mSATA A1) FMS-MSAN128GB-ITM, rated for 85°C ambient and running at 30°C according to SMART, and the bad cluster and bad block counts are both at zero. It is 2016 vintage (NOS), so it isn't entirely unreasonable that driver expectations were a little high, despite it being an SLC military-market device.

At 22 hours uptime now, for those keeping score at home. 5 hours to go to tie the record so far.
 
Well, that ran 35 hours without a crash, but then, a breakthrough: I caught the console again before the USB errors flushed it, and once again, TRIM.
Screenshot from 2023-03-08 09-29-18.png


Since autotrim is enabled by default by pfSense for ZFS installs, and since it is likely to be activated on time scales consistent with the failures, I'm quite suspicious. The firmware on the device is SFPS925A, and the update notice for SFPS928A mentions, depending on the device, TRIM errors resulting in hangs and a hang after an update to the VT table. I can't find a firmware patch, but I can disable autotrim once I get the FW remotely rebooted. Will update...
 
I've disabled autotrim with vfs.zfs.trim.enabled="0" added to /boot/loader.conf.local and verified it took on reboot. Time will tell. If not this, then perhaps a remote drive reseat to break up any oxide on the drive lands; otherwise it's back to the x336 until I can get my hands on that annoying little box. Will update with results.
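For the record, the change and the post-reboot check look like this (the tunable applies to the legacy ZFS in this pfSense release; on OpenZFS-based systems the equivalent knob is the pool's autotrim property):

Code:
# /boot/loader.conf.local
vfs.zfs.trim.enabled="0"

# after reboot, confirm it took
sysctl vfs.zfs.trim.enabled

# OpenZFS equivalent (pool name assumed)
zpool get autotrim zroot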
 
Nooo... 2 days, 15 hours, and then:
Screenshot from 2023-03-11 12-01-33.png

I will try a firm reseat of the mSATA to see if the interface might be bad and, if that fails, revert to the Big Blue x336 box for now. So much for the CO₂ savings.
 
Check the temperature of the SATA controller. If you can't hold your finger on top of the chip, then it's too hot. You can glue on a small heatsink or add an active cooler.
 
Interesting, I hadn't considered the SATA controller as a likely failure point. In a Protectli box there's not much room (physically) for dealing with such an issue, and it's hard to justify the investment in fixing it, but while the drive itself isn't reporting temperature problems (via SMART), there's no sensor for the SATA controller. It might explain why degrading performance in software appears to increase the time between failures.

Thanks for the very good (if a little depressing) thought.
 
Uh, this sounds exactly like what I encountered years ago with a product from a different vendor. If you can, try to keep that thing running on a different platform; maybe OPNsense is OK, or otherwise try OpenBSD or Linux. I have been down that rabbit hole, and we switched the OS for our product only to realize that the problem persisted. In the end we bought new servers for all our customers from a different manufacturer.
 
There is a pretty well-established existence proof that the hardware/software combo is stable, so I'm not worried about that, and, like many, I have a decade or two of good experience with pfSense and am happy there.

It is just one of those debugging problems: identifying the source of a fault that fails in a way that frustrates debugging. Based on the console logs, and the fact that messing with the drive parameters changes the MTBF, it is clearly a disk-subsystem error. There are a few components involved:
  • Host SW/drivers and config
  • Drive Firmware
  • Drive hardware
  • Physical Interconnect
  • Host interface hardware (thanks VladiBG for that somewhat unhappy reminder).
Given that the base hardware (Protectli FW4B) is a pretty standard device and pfSense is well tested on it, I consider discovering a heretofore unreported bug that crashes pfSense every few days on that hardware unlikely.

The drive I installed, a Fortasa FMS-MSAN128GB-ITM, /should/ be particularly robust in this application. The "I" means industrial, rated to 85°C; the "T" means SLC cells. 128GB is overkill, but should ensure decades of operation under normal loads (at least with SLC). The "M" means military for some reason, though the drive is clearly neither armed nor wearing IBA.

This drive, being atypical, is a reasonable target for suspicion. It booted fine, reports no errors, yields excellent performance numbers, and passes all tests. Weird crashes from either command queuing or TRIM are known, and this drive's firmware update history addresses both sorts of issues and cites OS lockup as a consequence: thus a likely target to test. Sadly, I've applied all the OS-level fixes I can find, to no avail. That doesn't mean those issues aren't a problem, but they are clearly not the whole problem.

The next easiest thing to fix, though it is a hands-on open-box fix, is to rule out the interconnect by re-seating the drive in the mSATA connector. There are reports of similar issues being caused by faulty SATA cables and being resolved by replacement. mSATA doesn't have a cable, but a corroded or dirty pin/pad interface is certainly possible. That's the next thing to try. I'm not, if I am honest with myself, optimistic about this, as such connections generally do a good job of self-cleaning on install, and when there is a problem it tends to crop up years later as oxide layers build. But it is far easier than guiding a remote reinstall of the firewall that gatekeeps network access.

If that fails, I may run out of remote diagnosis capability and will revert to the recently retired, power-hungry IBM x336 this little Protectli box is meant to replace in an environmentally friendly way. But if I can guide remote hands through the process, I'll replace the drive with a Kingston drive that's on the known-compatible list. If the problems go away, it was either a faulty drive or a subtle incompatibility with the esoteric drive's hardware or firmware.

If a drive swap and reinstall doesn't resolve the issues, I'm left with the host interface hardware as the last element in the chain, and while I could boot from USB, it is hard to trust a box like that in such a critical role going forward. Captain Planet will be mad, but until someone releases a fanless V1500B-based box with ECC and IPMI, I will accept my lesson and stick with power-hungry, data-center-grade hardware.
 