ZFS zpool keeps rebooting the server

I have a server with a mix of pools on SSD and SATA disks. I started to notice that the server began to reboot frequently. I managed to boot into single-user mode and noticed there were no more reboots; then I imported the pool with the oldest SATA disks and, after a while, it restarted (no logs on the console, and no crash dump either). So my first question is: how can I catch/trap the logs and find out what the exact issue is?

The pool that seems to have the problem uses two mirrors: mirror-0 (ada2/ada3) and mirror-1 (ada4/ada5). If I try to run a scrub, the server reboots after a few seconds. Currently I am in single-user mode with the pool exported, running: smartctl -t long /dev/ada[2,3,4,5]

How can I prevent the reboots, find the faulty disk, and fix the pool?

Just in case, I upgraded to the latest stable and am using the GENERIC kernel:
13.2-STABLE FreeBSD 13.2-STABLE stable/13-f6488428308 GENERIC amd64

In /boot/loader.conf I limit the ARC memory to:

Code:
# Set Max size = 4GB
vfs.zfs.arc_max="4294967296"
# Min size = 2GB
vfs.zfs.arc_min="2147483648"
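
To double-check that the limits were actually applied at runtime, something like this should work (these are standard FreeBSD sysctls):

Code:
# Confirm the loader.conf limits were picked up:
sysctl vfs.zfs.arc_max vfs.zfs.arc_min
# Current ARC size in bytes:
sysctl kstat.zfs.misc.arcstats.size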
 
after a while, it restarted (no logs on the console, and no crash dump either)
So we have exactly zero information about why the crash happens. With nothing to go on, we can't help debug.

Suggestion: have you looked in /var/log/messages to see whether errors are being logged? It's possible they are not, if the crash takes out the file system so that recent log entries don't end up on disk.

Here's a pretty heavyweight suggestion: get another device, format it with a different file system (in this case UFS), and attach it. Write the logs onto that device.
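
Roughly like this, assuming the spare device shows up as ada6 (adjust to your system; this wipes it):

Code:
# ada6 is a placeholder device name -- this destroys its contents
gpart create -s gpt ada6
gpart add -t freebsd-ufs ada6
newfs -U /dev/ada6p1
mkdir -p /mnt/logdisk && mount /dev/ada6p1 /mnt/logdisk
# then point syslogd at it in /etc/syslog.conf:
# *.*    /mnt/logdisk/all.log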

The problem could also be a serious hardware issue. Two examples: first, it's possible that your power supply is undersized, and doing a scrub (or other heavy disk activity) drops the voltages so far that the CPU reboots. Second, I once had a SATA disk that, when plugged in, completely froze the motherboard, so much so that not even booting or the BIOS worked.
 
It can be anything, from a SW to a HW issue. You need to eliminate one or the other and try to narrow it down. On the SW side, sudden reboots could be caused by a triple fault.

If you have a spare PSU or free disks, you could try replacing those. There's always the possibility of a faulty CPU or the board itself. Unfortunately, with a board lacking firmware that monitors FRUs or other parts of the board, it's always a guessing game.

For a SW issue, I'd verify you can get a dump (trigger a crash with sysctl). Do you have physical access to the box? Did you observe whether anything is written to the console at all? If yes, trying to record that with a cell phone might be worth a try (in case something does get logged on screen but there's no time to write it to disk).
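
Something along these lines, assuming a dump device is already configured (this reboots the box, obviously):

Code:
dumpon -l                  # confirm a valid dump device is configured first
sysctl debug.kdb.panic=1   # forces a panic
# after the reboot, savecore should have left the dump in:
ls /var/crash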
 
If it doesn't panic and doesn't print anything, it is almost certainly hardware, with the mainboard being the main suspect.
In my experience RAM is usually more suspect than the motherboard or CPU. ZFS works RAM hard.
 
The pool that seems to have the problem uses two mirrors: mirror-0 (ada2/ada3) and mirror-1 (ada4/ada5). If I try to run a scrub, the server reboots after a few seconds. Currently I am in single-user mode with the pool exported, running: smartctl -t long /dev/ada[2,3,4,5]
Smartmontools doesn't write to the disk; you can run smartctl -t long at any time while the system is running.
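
Once the long test completes, the result shows up in the drive's self-test log:

Code:
smartctl -l selftest /dev/ada4   # self-test history for one disk
smartctl -a /dev/ada4            # or the full SMART picture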

If it's a spinning-rust disk, you can back up your data and write over the bad blocks so the drive reallocates and remaps them (and order a new disk in the meantime).
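
A rough sketch of that, assuming the suspect disk is ada4, it has been detached from the pool, and the data is safely backed up (this wipes the disk):

Code:
# DESTRUCTIVE: overwrites the entire disk so the firmware can remap bad sectors
dd if=/dev/zero of=/dev/ada4 bs=1m
# afterwards, check whether sectors were actually reallocated:
smartctl -A /dev/ada4 | grep -i reallocated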

Faulty disks never cause ZFS failures or crashes. Bad RAM, however, can cause bad data to be written to disk. I had a bad DIMM in one of my machines many years ago; it corrupted the zpool and the UFS filesystems. If RAM is bad, nothing, including ZFS, will know.

(And anticipating someone quipping "what about ZFS integrity?": no amount of software integrity checking running on faulty RAM can make up for faulty RAM.)
 
I have had the system up and running (single-user mode) since yesterday, but the zpool that causes the reboots is not imported; so far, no reboots. I ran smartctl -t long /dev/adaN on all the disks, and from all of them I got:
No Errors Logged

The server has been up for more than a year. I suspect it is something with a disk, but I have no idea how to capture logs when it panics. I already have this in syslog.conf:

Code:
console.*                                       /var/log/console.log
*.*                                             /var/log/all.log


In the logs, I see:
Code:
 kernel: No suitable dump device was found.

I only have dumpdev="AUTO" in rc.conf, but root is on a ZFS pool and swap is on:
Code:
/dev/zvol/ssd/swap

Any more ideas on how to find what could be the root cause?
 
Is the zpool that is currently exported a "stripe of mirrors"?
If so, you could try powering down and physically unplugging one of the mirrored devices, say ada2 or ada4. Don't just unplug the data cable; unplug power too. Then power up to single-user mode and try to import. Yes, it will be degraded, but it may give a data point to work with.

That may also help indicate power supply issues.
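
If the import works with one member pulled, you'd see something like:

Code:
zpool import tank
zpool status tank   # expect state DEGRADED, with the pulled disk shown as REMOVED/UNAVAIL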
 
swap is in
I think swapping to a zvol is only really *supported* on Solarish systems; there were numerous issues reported against swap-on-zvol on FreeBSD and Linux (and that could be your issue as well).
More so, dumping to a zvol is not supported at all; it's a no-op in the FreeBSD-specific ZFS code. You'll need a real separate partition for swap/dump to get a kernel dump (there's also netdump, but it's more involved). You could try adding a spare SATA disk for this, to check whether there's really a panic.
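
A sketch of that, assuming the spare disk attaches as ada6 (adjust as needed; this wipes the disk):

Code:
gpart create -s gpt ada6
gpart add -t freebsd-swap -s 16G ada6   # size it generously; a full dump can need up to RAM size
dumpon /dev/ada6p1
# and in /etc/rc.conf to make it persistent:
# dumpdev="/dev/ada6p1"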
 
but root is on a ZFS pool, swap is in:
I have my doubts too: I don't think you can dump to a ZFS zvol. Did you try it with the sysctl? This does panic your system: sysctl debug.kdb.panic=1. Then check whether you have a crash dump available.
 
Faulty disks never cause ZFS failures or crashes.
Sadly, they can. SATA/consumer-grade drives especially often refuse to just die: they e.g. vanish for a few seconds and reappear randomly, or lie about being healthy but stall as soon as they receive any I/O, which can bring a whole system to a crawl and even cause crashes. Been there, done that, got my free shirt...

We recently had a failing NVMe that also just reset the host one night. I put that exact drive (because of ZFS checksum errors and a suspected failure) into another host to collect some data for an RMA, and directly observed a reset of that host, after which the drive was finally completely dead.
A swap provider disappearing may also cause a hard crash, especially if swap is not mirrored.


Re faulty RAM:
The OP is talking about a server, so I'd take ECC RAM for granted. *Usually* this catches any errors due to a faulty module and, depending on the firmware, either just crashes or triggers an alert and disables and/or logs an error at the BMC about the defective module, freezing the host in the process (IIRC the Sun SPARC servers did exactly this, which is IMHO the cleanest way of dealing with memory failures).
 
I have my doubts too: I don't think you can dump to a ZFS zvol. Did you try it with the sysctl? This does panic your system: sysctl debug.kdb.panic=1. Then check whether you have a crash dump available.
It panics, but no crash dump is available in /var/crash. Where should it be written?

I have the pool exported and booted in normal mode; so far, all good. I'm just trying to see what to add/tune so that when I bring the pool back I can catch the errors, if any.
 
Sadly, they can. SATA/consumer-grade drives especially often refuse to just die: they e.g. vanish for a few seconds and reappear randomly, or lie about being healthy but stall as soon as they receive any I/O, which can bring a whole system to a crawl and even cause crashes. Been there, done that, got my free shirt...

We recently had a failing NVMe that also just reset the host one night. I put that exact drive (because of ZFS checksum errors and a suspected failure) into another host to collect some data for an RMA, and directly observed a reset of that host, after which the drive was finally completely dead.

Well, those are very different things. NVMe is directly on PCIe; obviously that has an easier time sabotaging the system than disks behind a SATA HBA.

I maintain my suspicion of the mainboard.
 
When I've run into weird spontaneous reboots/shutdowns in the past, power issues were often the culprit, especially in typical consumer-grade PCs: take a barely adequate power supply, add a video card, and everything is unstable.
If the drives are all still plugged in but the pool is not imported, what kind of load are they under? They're likely sitting mostly idle, perhaps even spun down.
 
It panics but no crash is available, in /var/crash, where should it be written?
That's the thing: it can't dump to a zvol. You need to create a separate partition (freebsd-swap) so the kernel can save the dump there. dumpon -l has to give you something valid where the dump would be stored.
Just to make it clear: once the system crashes, it saves itself into this partition. On the next boot, the savecore process searches this partition and saves the dump to a filesystem. By default that's /var/crash (it can be overridden in rc.conf with the dumpdir variable).
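
So the checklist looks roughly like this, with /dev/ada6p1 as a placeholder for your freebsd-swap partition:

Code:
dumpon -l                # must list a real dump device, not a zvol
# /etc/rc.conf:
# dumpdev="/dev/ada6p1"  # placeholder -- your freebsd-swap partition
# dumpdir="/var/crash"   # the default; where savecore writes on next boot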
 
I did the import (it rebooted minutes later), but this is what I managed to get:

Code:
# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon May  1 04:17:39 2023
        255G scanned at 1.83G/s, 12K issued at 88B/s, 5.33T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/samba/timemachine:<0xc9ee>
        tank/vms/poudriere/disk0:<0x1>

I am not sure it is the power supply, because when the pool is not imported everything works fine and I can still use smartctl on ada[2,3,4,5]. I will open the case and remove the disks as mer suggests.
 
I think the state of the pool is due to the sudden reboots; it's not why the crash/reboot is occurring. Permanent errors on pools have been discussed on these forums a few times.

From a SW point of view: you might be hitting a bug in ZFS (or related to ZFS), as the trigger is always the same and you can reproduce it. To narrow down the issue, having working dumps may help a lot. If you have a spare disk, you can use it as a temporary dump device (just create a freebsd-swap partition on it and set it up to be used).
If you have the luxury of a second FreeBSD machine, you could import this pool there and observe the behavior.
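
On the second box that could be as simple as the following; importing read-only first is a safer sketch, since it avoids triggering writes:

Code:
zpool import -o readonly=on tank   # read-only import; nothing gets written to the pool
zpool status -v tank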
 
I removed disks from the pool one at a time (trial and error), and the reboots stopped after removing disk ada4. In the logs and on the console I now see:

Code:
Solaris: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.

On the console:

Code:
May  5 13:29:12 home kernel: May  5 13:29:12 home ZFS[8893]: catastrophic pool I/O failure, zpool=tank
May  5 13:29:23 home kernel: May  5 13:29:23 home ZFS[26197]: catastrophic pool I/O failure, zpool=tank

But earlier, when running smartctl -t long /dev/ada4, I had no errors.

I will try to find a spare disk and create a freebsd-swap partition. Any idea whether a USB stick works for this purpose?

The machine is currently up and running, but zpool status takes ages to show any output. I can list the data but can't really access it; copying anything only gets 256K across.
 
Do you actually have a scrub running on it, or did you start one after getting the pool imported?
I don't know whether a scrub is persistent across exports/imports.
 
I haven't started one; I just found that the reboots stopped, and I haven't rebooted since then. But zfs/zpool commands take a long time. I may reboot and start zpool scrub tank. Any more ideas?
 
My reason for asking is that your output up in #16 implies a scrub is in progress. This may be due to the zpool import resuming it; if so, there's not much you can do about it besides cancelling it (see below).
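
If you do want to cancel it, that's just:

Code:
zpool scrub -s tank   # -s stops an in-progress scrub
zpool status tank     # confirm the scrub is no longer running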
Up in #18 you talked about removing ada4; try removing ada5 and leaving ada4 in.
Why?
"I removed ada4 and got these unrecoverable errors." The way mirrors work, that implies to me "maybe an issue with ada5".
 
If it doesn't panic and doesn't print anything, it is almost certainly hardware, with the mainboard being the main suspect.

I attached disk ada4 again, but changed it to another SATA port on the motherboard:

Code:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 4.89G in 00:07:10 with 0 errors on Fri May  5 14:40:28 2023
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada2    ONLINE       0     0     0

errors: No known data errors

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank   7.25T  4.77T  2.48T        -         -    23%    65%  1.00x    ONLINE  -

So this seems to match a hardware failure on the motherboard. How should I test from now on? I managed to take a video of this behavior:
 
I attached disk ada4 again, but changed it to another SATA port on the motherboard:
Excellent debugging data point.
Did you change data cables, or was it "same cable, different port on the motherboard"? If it was the same cable, that does imply the motherboard. If different cables, it implies "could be the motherboard or could be the cable".
One of my rules is "make sure cables are fully seated": power, data, Ethernet; unplug and replug. Things heat and cool, and sometimes they walk.
I have had drives that worked fine for years fail, but if I change the cable, "it's magically better". I'm never sure if it's the cable or the act of replacing the cable that fixes it.

Looks like you swapped ada2 and ada4?
 
I used the same cable but moved ada4 to a free (unused) port and moved ada5 to where ada3 was; in the ZFS output that looks like a swap between ada2 and ada4. Fortunately, there was no data loss, and I can copy/move files across without problems (so far it looks good).

I recently moved the machine to a closed rack that is reaching ~40°C (probably too hot).
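
To keep an eye on that, smartctl can report each drive's temperature; a small loop over the four suspect disks:

Code:
# check drive temperatures (disk names per this thread)
for d in ada2 ada3 ada4 ada5; do
    echo -n "$d: "
    smartctl -A /dev/$d | grep -i temperature
done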
 
We recently had a failing NVMe that also just reset the host one night. I put that exact drive (because of ZFS checksum errors and a suspected failure) into another host to collect some data for an RMA, and directly observed a reset of that host, after which the drive was finally completely dead.
A swap provider disappearing may also cause a hard crash, especially if swap is not mirrored.
Unmirrored swap disappearing results in hard hangs while the kernel waits for the device to reappear. This makes sense. But a crash (a kernel panic, without a register dump) is due to a kernel bug where the developer did not anticipate an error condition, like dereferencing freed or non-existent virtual memory.

I've seen on my systems that a disk failure in half of a gmirrored pair will hard-hang the machine. This is because the disk locked the SATA bus, resulting in no access to any SATA disk. I saw this back in the day with SCSI disks too, on FreeBSD and SPARC Solaris, where a faulty SCSI CD-ROM device locked up the SCSI controller of a Sun 2000 so badly that the machine just froze. These are not panics or crashes. They are hard hangs caused by faulty hardware, and they affect any O/S.

BTW, disconnecting the faulty SCSI CD-ROM device from the bus resolved the Sun 2000 problem.

We tend to use the term "system crash" imprecisely. The term was first used in the mainframe days. There was no such thing as panic() on the mainframe; it simply either re-IPLed (rebooted) or went into a hard spin loop with interrupts disabled (a hard hang). Which is a good segue into what a hard hang is.

A hard hang is when the O/S has disabled interrupts while in critical sections of code in the kernel. Then something unexpected (not anticipated by the developer) happens. The system may go into a hard loop with no ability to recover because it cannot be interrupted by a device (like a disk or a keyboard keystroke), as happens when drm-51[05]-kmod suffers a buffer overflow. You will notice the fan on the machine speed up, the keyboard is unresponsive, and you're left with one option: power-cycle the laptop. In this case, interrupts are disabled and the machine is in a hard spin loop, probably a deadlock or a lock order reversal; these bugs can happen to any O/S. The only option is to press the reset switch or power button.

The reason these bugs are so difficult to track down is that after the reset switch or power button is pressed, all diagnostic information in memory is lost. (On the mainframe we had a stand-alone dump O/S, which we would boot using a reserved area of memory not used by the O/S for that purpose. It would dump all memory and swap, allowing one to do post-mortem analysis. Not so on any other O/S today.)

Hard hangs and crashes (panics) are different.

BTW, on Intel, only the NMI (non-maskable interrupt) cannot be disabled. NMI is triggered when there is a RAM fault, and some third-party diagnostic cards use it to invoke a debug mode in drivers that support this function. The reset switch briefly cuts power on some MBs, or simply resets the keyboard controller on the MB, which in turn resets the rest of the MB. This clears RAM as if the machine were power-cycled.

Sorry for taking this in a tangential direction. I wanted to differentiate between a crash (panic) and a hard hang.
 