pfSense crashing after 12-27 hours or so: ping responds, routing persists, but ssh, logging, etc die.

Nooo... 2 days, 15 hours and then.
View attachment 15788
Will try a firm reseat of the mSATA to see if the interface might be bad; if that fails, I'll revert to the Big Blue x336 box for now. So much for the CO₂ savings.
What you see here is a complete communication failure, both at the SATA level (that's the CAM messages), and between the CPU's PCI bus and the SATA controller (that's the AHCI timeout). So the "source" of the problem is that the SATA controller is "broken".

I put the words "source" and "broken" in quotes because it's possible that the SATA controller is the (not exactly innocent) victim of the mSATA device: if the drive throws bizarre errors (particularly in corner cases of the protocol, like TRIM), it could confuse the SATA controller so badly that it stops working. Error handling is often the weakest part of an implementation.
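One cheap way to test that hypothesis before swapping hardware is to stop TRIM/DSM commands from ever reaching the drive and see whether the hangs stop. A minimal sketch, assuming a FreeBSD/pfSense box with the mSATA device showing up as ada0 (the device number and sysctl node are assumptions; match them to your dmesg):

Code:
# Does the drive advertise TRIM (Data Set Management) at all?
camcontrol identify ada0 | grep -i "data set management"

# Which delete method is CAM currently using for this device?
sysctl kern.cam.ada.0.delete_method

# Stop issuing TRIM/DSM to the drive (resets on reboot; put it in
# /etc/sysctl.conf if the test needs to span a reboot)
sysctl kern.cam.ada.0.delete_method=DISABLE

If the box then sails past the usual 12-27 hour window, that points at the drive's TRIM handling rather than at the controller.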

I'll say something politically incorrect: In general, products from small vendors (such as Protectli and Fortasa) tend to have less testing and less quality control; that is a simple and logical consequence of having fewer staff.
 
I've got three Protectli machines running 24x7 for more than a year and haven't had any issues with them (all lightly loaded: one OpenBSD firewall, one OpenBSD desktop, and one Alpine Linux box). I've tended to opt for branded components for the RAM and storage.

Obviously YMMV and all that but my experience with them has been mostly positive.
 
Faulty cables or a bad connection produce ECC errors in the SMART data.
That is true for the data wires. Faulty power wires can produce exactly what we see here: the drive firmware goes astray, and then we get the 31000ms timeout until a power cycle.
I've seen this happen a lot with cabled SATA, but it is hard to explain how it might happen on a slot-connected drive (unless the power source itself is unstable - but then milspec /should/ have a broader tolerance than anything else there).
 
One simple thing you could do is find out if the case is uncomfortably hot to the touch, and if so, point a fan at it for further testing.
 
None of the sensors (CPU or disk) indicate any thermal problems (system temp 43C). The drive temperature SMART attribute is always 30C, which seems sketchy, but all the other SMART variables are incrementing as one would expect. Still no reported errors. Load is quite light (2% right now, up to maybe 15%). My "hands on" support is technically limited, but a touch test should be within scope. I'll ask for that.
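For the record, these readings come from roughly the following, in case anyone wants to sanity-check the method. A sketch assuming smartmontools from packages and the drive at ada0 (the device name is an assumption, and the temperature sysctls only exist if coretemp(4) / the ACPI thermal zone are available):

Code:
# full SMART dump: attributes, error log, self-test log, phy event counters
smartctl -x /dev/ada0

# CPU die temperature (needs coretemp.ko loaded)
sysctl dev.cpu.0.temperature

# board/ACPI thermal zone, if the firmware exposes one
sysctl hw.acpi.thermal.tz0.temperature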

ralphbsz, you're making me sad with your insight. I'm moderately confident about the power situation, not just because it is a pair of ABB (CP-E 12/10.0) power supplies on an industrial-grade (TDK-Lambda) redundancy module (DRM40), but also because the same output drives the AT&T fiber interface and CPE terminal, and those have zero errors. Protectli seems to have a pretty solid user base, and the forums are not filled with these sorts of errors, so I doubt it is systematic on their end. I may just have a bad SATA controller part (I doubt these go through extensive burn-in testing, and my initial uptime was a week, so I'd have passed it myself), which would be a bummer for sure.

Reading the history of the Apacer firmware updates, there are some pretty clear red flags: system halts from a problem with the "VT table" implementation and fixes to the TRIM implementation are explicitly mentioned. I wrote both Fortasa and Apacer to see if I could get the SFPS928A firmware; Apacer was nice but not helpful. I have despaired of mitigating any possible firmware faults with host configuration, as I've run out of plausible suggestions on that end, but I still hold out hope that the Protectli box's SATA controller is OK.

My next step will be to try to get the box out of the critical path and revert to the tried-and-true old blue x336 box, then see if, thus derisked, my remote support will indulge me in experimenting to isolate the cause. The amended plan is (assuming no firmware update is available):

Verify temperature is indeed "reasonable" (not ouchy)
Wiggle/reseat the existing drive to ensure it isn't an oxide-layer problem (faulty wires don't seem too likely, but it would be nice if it were that easy)
Swap the Fortasa drive (pic attached for posterity) for a modern high-volume branded one (a Kingston SUV500MS/120G should rule out the drive as the problem; quick acceptance checks are sketched right after this list)
Cure the on-board SATA controller with thermite
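Before trusting the replacement (or the reseated original), a quick acceptance pass is cheap insurance. A rough sketch, again assuming the drive shows up as ada0:

Code:
# kick off the drive's own long self-test (runs on the drive in the background)
smartctl -t long /dev/ada0

# a couple of hours later, confirm it completed without error
smartctl -l selftest /dev/ada0

# crude read-only sequential transfer test to exercise the interface
diskinfo -tv /dev/ada0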
 

Attachments

  • Fortasa FMS-MSAN128GA-15MB.jpg (184.1 KB)
Cure the on-board SATA controller with thermite
Curing is any of various food preservation and flavoring processes of foods such as meat, fish and vegetables, by the addition of salt, with the aim of drawing moisture out of the food by the process of osmosis. (from Wikipedia)
 
Selected SMART logs:

Code:
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       473         -
# 2  Extended offline    Completed without error       00%       354         -
# 3  Extended offline    Completed without error       00%       306         -

Code:
SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

Code:
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  4            4  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            5  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

Oh, an update: also, the heat sink on the box is just normally warm to the touch, not hot, so the temp sensors are probably not too far off.

Thermite curing definitely removes all moisture, though not so much by osmosis.
 
The amended plan is (assuming no firmware update is available):

Verify temperature is indeed "reasonable" (not ouchy)
Wiggle/reseat the existing drive to ensure it isn't an oxide-layer problem (faulty wires don't seem too likely, but it would be nice if it were that easy)
Swap the Fortasa drive (pic attached for posterity) for a modern high-volume branded one (a Kingston SUV500MS/120G should rule out the drive as the problem)
Cure the on-board SATA controller with thermite
That's sensible (though I have no knowledge of "curing" anything other than pork belly).
You can clean gold contacts with a high-quality pencil eraser. Get the most expensive, like Faber-Castell or Staedtler.
Use an anti-static strap (disposable is fine). Lay the SSD on an anti-static bag, rub the contacts gently, avoid direct finger contact, and take care to brush off any residue from the eraser.
 
Oh, that's a good idea... it has been a while since I had to do that. I have an antistatic workstation and the special fiberglass pencil brushes all set up and unused for quite a few years. Time to roll that out. I was thinking of tinned pads, which form highly resistive oxide layers; breaking through them is a key design parameter of contact pins. But hard gold doesn't deform (much), and while it is oxidation resistant, it could very well be dirty.

The protocol that comes back to me is:
Scrub the pads with a non-woven swab dipped in anhydrous alcohol
Use a fiberglass brush (or eraser) until the pads are bright
Blow off any dust/eraser bits with Clean Dry Air (CDA)
Scrub again with an alcohol-dipped non-woven swab
Let dry fully (CDA to accelerate drying)
Reassemble.

I'm quite dubious of the drive itself at this point and not too optimistic that good contacts will resolve the issue, but still holding out hope it isn't the SATA controller itself.
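One thing worth doing before the swap, since local logging dies with the hang: ship the kernel messages off-box so the next CAM/AHCI timeout actually gets captured, then watch whether the timeouts recur with the known-good Kingston in place. A rough sketch using stock FreeBSD syslogd (pfSense exposes an equivalent remote-logging option in its GUI); the log host address below is just a placeholder:

Code:
# forward syslog (including kernel messages) to a host that stays up;
# add this line to /etc/syslog.conf -- 192.0.2.10 is a placeholder address:
#     *.*        @192.0.2.10
service syslogd restart

# between hangs, grep the live kernel buffer for controller-level noise
dmesg | grep -iE 'ahcich|CAM'

If the AHCI timeouts come back with the Kingston drive installed, the controller looks guilty; if they vanish, the Fortasa drive was the culprit.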

To everyone supporting this little hardware drama so generously: my apologies, but it will take a few days for my remote hands to allocate time.

As for curing ailing SATA controllers with thermite, it should be as effective as this:

[image: The_Radium_Cure.jpg]
 
BTW, sorry for the long delay, but the interface got remotely cleaned and reinstalled; testing now.
 