Other Catastrophic NAS failure after many years of excellent performance

I have been incredibly happy with the three custom FreeBSD NAS machines that were built for me over many years.
Although somewhat computer savvy, and able to perform basic maintenance by reading the online manual and procedures, I have only a passing familiarity with Linux-based operating systems and the command line.
That said...

Several events occurred through the recent Covid years that have now placed me in the position of very quickly having to become a FreeBSD expert, with no one to assist (see below):
-Drive failures were previously easy to recover from by simply replacing the bad drive and allowing ZFS to rebuild the array; this is no longer the case.
-Leaking capacitors on the system's MSI MB compromised its stability (all NAS are kept in a temperature-controlled area, so probably just age related).
-My FreeBSD guru (a very talented TV Broadcast IT manager) who graciously built and maintained all my NAS, passed away in early 2023 at far too early an age.

The current situation is:
-FreeBSD v13.0 releng/13.0-n244733-ea31abc261f
-6 HDD drive array plus M.2 NVMe.
-One HDD will not complete initialization and shuts down after about 30 seconds.
-System boot halts after identifying five working HDDs and identifying peripheral hardware (keyboard, mouse).
-The system typically attempts to reboot several times, but eventually the screen goes black.

I replaced the compromised MB with a new one, with no change in results.
Several replacement HDDs are on the way (just in case more drives in the array are ready to tip over).

I suspect the NVMe might be damaged, but have no experience to test its integrity or bypass it with something like an external USB device.


I am looking for any advice or assistance to resolve this.

Thank you.
 
You said you replaced the motherboard, but it still fails to get past the boot. Do you have a bootable rescue disk/CD/USB stick with which you can start a session?
The file /var/log/messages should give you some information on the problem(s). A rescue session will also give you a chance to remove the faulty disk from /etc/fstab, because that may block a normal boot too.
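A rough sketch of that, assuming the installed root is UFS and gets mounted under /mnt (the device name is only an example and will likely differ on your system; if the root is on ZFS, the pool has to be imported instead):

mount /dev/nvd0p2 /mnt
tail -n 100 /mnt/var/log/messages
ee /mnt/etc/fstab

The last command opens fstab in the ee editor so you can put a # in front of the faulty disk's line.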
 
Those are very important clues, thank you.

I will read up on how to create a rescue drive, check the message log, figure out how to clear the faulty disk data and see if the system will complete a "normal" boot cycle.
Should I wait until I have a replacement drive installed before I perform this procedure, or should the RAID come back up in a degraded state from an external boot until I receive and install the replacement drive?

I have spent a while digging through the online docs, but have not (as yet) found any section that explains how the FreeBSD hardware is, or can be, configured. I expect the OS lives on a separate drive (SSD, NVMe) and the data is confined to the builder's choice of RAID array.

If so, will a plain vanilla install of the OS to the NVMe (basic default configuration) allow ZFS to discover and rebuild the existing data array with a replacement drive in place, or is there a section (or sections) of the online docs I should study to understand the configuration essentials?
Hopefully I should be able to handle the remaining OS configuration afterward (network protocols and so on).
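(For reference, the sort of sequence I imagine is involved, pieced together from what I have skimmed so far; the pool and disk names below are pure placeholders and I am very happy to be corrected:)

zpool import
zpool import <poolname>
zpool status <poolname>
zpool replace <poolname> <old-disk> <new-disk>

i.e. list any pools ZFS can find on the attached disks, import the one it reports, check which member is missing, and resilver onto the replacement.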

Sorry for more questions, but please correct me if any of the above is wrong (or simply too optimistic... lol), and if possible, let me know which sections of the online docs I should read so I can get my feet a bit more than wet (and try to graduate myself away from FreeBSD "noob" status).

Thank you very much for your assistance.
 
-6 HDD drive array plus M.2 NVMe.
How is the ZFS pool containing the 6 disks configured? RAID-Z, RAID-Z2, mirroring, or not redundant?

-One HDD will not complete initialization and shuts down after about 30 seconds.
As long as your ZFS pool containing that sick disk has redundancy against at least one failure, I would temporarily remove (disconnect) that disk. It might be causing ...

-System boot halts after identifying five working HDDs and identifying peripheral hardware (keyboard, mouse).
... might be causing all of the computer to go down. I've had that at home, a SATA disk that causes NOTHING to work (not even BIOS or boot) when it is plugged in.

I suspect the NVMe might be damaged, but have no experience to test its integrity or bypass it with something like an external USB device.
As tieks already said: Get a USB stick or similar, write a bootable copy of FreeBSD on it, and boot from it. That will help debug things.
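For example, from another FreeBSD machine, something like this writes the memstick image (the da0 target is only an example; double-check which device is the USB stick before writing to it):

dd if=FreeBSD-13.0-RELEASE-amd64-memstick.img of=/dev/da0 bs=1m conv=sync

From another OS, an imaging tool such as Etcher does the same job.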
 
Leaking capacitors may be unrecoverable, even after replacing the MB. I had a machine with a bad DIMM. Not leaking capacitors, but the result is the same, so read on. The data written to disk eventually became garbage: even though ZFS maintains checksums, data that had been corrupted after the checksums were calculated meant the on-disk corruption, like a cancer, grew. And that disk corruption resulted in FreeBSD ZFS panics. After reviewing dumps and methodically checking hardware, removing DIMMs, switching them around, I finally found the bad DIMM.

Then when those same blocks were read, they failed checksum. Luckily it was my sandbox machine and the pool was recreated from backups and copies from my primary build machine.

That machine has been fine ever since.

The lesson is, take hardware problems very seriously.
 
Is the system booting from the NVMe, and does it contain the FreeBSD installation?
If so, disconnect all HDDs (including power), leaving only the NVMe, to sort out the boot issue on its own. But take note of the order of the cable connections (which HDD goes to which SATA channel). Take note of your old MB BIOS settings if you can (e.g. you can select the drive boot order in some motherboards).
Do a BIOS CMOS clear.

Then do as ralphbsz said. It is indeed important to know how your array is configured (RAID-Z, RAID-Z2, mirroring, etc.).
When it boots, start adding the HDDs one by one and check what 'zpool status' says.
It is not usual (not likely, really), but it may also be worth considering that the HDD is so bad (if so) that it could have damaged the MB somewhat.
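For reference, the kind of check that shows the layout (run from the booted system or a live environment):

zpool status -v

The output lists the pool name, its state (ONLINE, DEGRADED, FAULTED) and the vdev type (mirror-0, raidz1-0, raidz2-0 ...), which answers the redundancy question. If the pool is not imported yet, a bare 'zpool import' will at least list what it finds on the connected disks.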

... might be causing all of the computer to go down. I've had that at home, a SATA disk that causes NOTHING to work (not even BIOS or boot) when it is plugged in.
My home backup 'server' is a ZFS mirror pool of 4 HDDs. One of the HDDs works alone on any SATA channel, but makes the MB freeze if it is not connected to channel 2 when all the HDDs are connected. ¯\_(ツ)_/¯
 
Thank you all for the helpful advice.
I made my first post brief so as not to ask too many questions at once.
The number of issues that happened at the same time was extremely unusual (MB, one HDD and potential OS corruption), but I tried to test/eliminate and replace as many hardware issues as I could before asking about software I was unfamiliar with.
How is the ZFS pool containing the 6 disks configured? RAID-Z, RAID-Z2, mirroring, or not redundant?
I am not certain which RAID config was used; I hope to be able to determine that when I can look at the logs or at least get the OS to complete a boot cycle. Previous single-drive faults recovered quickly after installing a same-capacity HDD, so the level of redundancy must be quite high.

As long as your ZFS pool containing that sick disk has redundancy against at least one failure, I would temporarily remove (disconnect) that disk. It might be causing all of the computer to go down. I've had that at home, a SATA disk that causes NOTHING to work (not even BIOS or boot) when it is plugged in.
I neglected to mention that I disconnected the failed drive after several failed boots to eliminate the possibility of it hanging the SATA bus.

The mode of drive failure is sadly a familiar one (from decades of dealing with early MFM, SCSI and IDE failures) in which the motor spins up correctly and the drive electronics can be identified using an external drive case, but the voice coil assembly (the read/write head arm) is unable to correctly access the platters. The possibility of actual damage to the MB from that type of fault is incredibly remote.
I have also been able to plug another SATA drive (unfortunately too small to be used for recovery) into the SATA port previously occupied by the failed drive, and the BIOS shows that temporary drive's type and capacity, so the port's integrity is also not in question.

I have seen no indication of MB or memory damage, and the boot behavior is too consistent to indicate an intermittent fault.
The new MB consistently gets partway through the FreeBSD boot from the NVMe and halts at exactly the same point, after correctly identifying the five remaining drives, performing the file system checks (all marked as clean) and identifying/initializing the external hardware.

After reviewing dumps and methodically checking hardware, removing DIMMs, switching them around, I finally found the bad DIMM.
Again, memory faults are likely not the cause, as I have a second machine with the same MB and DIMMs (different OS and purpose) that I have swapped parts with to confirm the hardware was not the issue.

I will try disconnecting the SATA drive pool to see how the NVMe OS drive reacts. If it still cannot complete a boot, I will create a USB boot drive.
I may also be able to borrow an M.2 drive enclosure to test and re-build the NVMe.
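If the NVMe is still recognized by the machine itself, I gather (please correct me if I have misread the man pages) that its health can be checked with something like:

nvmecontrol devlist
nvmecontrol logpage -p 2 nvme0
smartctl -a /dev/nvme0

where the second command should dump the SMART/health log page and the last one needs sysutils/smartmontools installed; the nvme0 name is just my assumption for the first controller.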
 
Again, memory faults are likely not the cause, as I have a second machine with the same MB and DIMMs (different OS and purpose) that I have swapped parts with to confirm the hardware was not the issue.
Read my words again. I didn't say your problem was a memory problem. I did say that hardware problems can cause memory corruption. You can have memory corruption due to DIMM errors, bad caps, a wonky NIC card that overwrites RAM through DMA, etc. You have a hardware problem and all bets are off.
 
Read my words again. I didn't say your problem was a memory problem. I did say that hardware problems can cause memory corruption. You can have memory corruption due to DIMM errors, bad caps, a wonky NIC card that overwrites RAM through DMA, etc. You have a hardware problem and all bets are off.
Yes, I missed the intention of your post. I will keep an eye out for data damage that could have been caused by transient hardware glitches.


I was finally able to burn an amd64-bootonly.iso image to a USB stick (using Etcher software) and managed to get my machine to boot from it in Live-CD mode.

The current road-blocks are:
-I cannot find any mention in the online docs of the default username and password needed to actually enter the user interface, query the logs and determine the zpool status for a full recovery.
-I was only able to boot from the USB disc after completely disabling the NVMe boot drive. Placing the NVMe drive later in the boot order did not work.
I am not sure why this is happening (an MB hardware setting, or perhaps the USB drive simply took too long to convince the BIOS to boot from it...??), but I have to resolve this as well or I will not be able to access any logs to determine the reason for the original crash.

(edit) more data:
I was able to get a rescue boot with the NVMe present and shell to a # prompt, but I have no idea how to mount/access the NVMe or the RAID volumes (again my noob status raises its head, but I suppose if I bang my skull against this enough I will figure it out).
I reviewed the FAQ for ideas and attempted to use mount -urw / followed by mount -a, ... but no joy. I expect there are some large holes to fill in my knowledge base, so any suggestions are welcome for mounting and inspecting the condition of the drives.
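In case it helps anyone point me in the right direction, these are the commands that look relevant from what I have read so far (treat the pool name as a placeholder; I have used read-only on purpose so I cannot make things worse):

zpool import
zpool import -o readonly=on -R /mnt <poolname>
zpool status -v
zfs list

The first command should just list any pools found on the attached disks; the second should bring one in under /mnt without writing to it. Please correct me if this is off base.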

As always, thank you in advance for your kind assistance.
 
I was able to get a rescue boot with the NVMe present and shell to a # prompt, but I have no idea how to mount/access the NVMe or the RAID volumes
Replacing your MB may have changed the drive designations/numbering. That may invalidate the contents of /etc/fstab, which causes mount -a to fail. If mount -urw does work, it will allow you to edit and save /etc/fstab using /usr/bin/ee. Try commenting out all references to the NVMe by placing a # sign in front of those lines. Save and reboot. Then scroll through the output of dmesg to find your NVMe drives, see if they have matching entries under /dev, and correct /etc/fstab accordingly. See fstab(5) and mount(8) for more info.
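For example, something along these lines (the device names are illustrative, yours may differ):

mount -u -o rw /
dmesg | grep -i nvme
ls -l /dev/nvd* /dev/nda*
ee /etc/fstab

The grep shows how the kernel named the NVMe controller and namespace, the ls shows which device nodes actually exist, and ee lets you comment out or correct the relevant fstab lines.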
 
When you boot, issue a
dmesg | more
and read the info; there you can see which disks are identified, like
ada0: Serial Number S7EWNJ0W478457N
ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 512bytes)
ada0: 476940MB (976773168 512 byte sectors)

After locating the disk (in my example ada0), look in /dev/ for the partitions discovered, like
ls -la /dev/ada0*
If you see 's1' entries like /dev/ada0s1 and so on, then perhaps you are using disklabels on FFS, so just issue a similar command with the following
disklabel /dev/ada0s1
and then you can continue with fsck and mount.
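Continuing the example (the trailing 'a' partition is only a guess at where the root filesystem lives):

fsck -y /dev/ada0s1a
mount /dev/ada0s1a /mnt
ls /mnt

This applies to UFS/FFS filesystems; if the disks are members of a ZFS pool, then zpool import / zpool status is the way to look at them instead.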
 