Random Reboot and possible cause

Currently running 13.2-RELEASE-p2, and a few weeks back I started seeing random reboots overnight. I did some troubleshooting and managed to track it down to the daily periodic jobs that run at 3am. FYI, this system does not send outgoing email, yet I found 1300+ files in /var/spool/clientmqueue, which I removed.
Removing those files corrected the random reboot. There were no shutdown messages or errors of any kind to indicate a problem; nothing in the messages, daemon, or console logs. The machine would just reboot and start a normal bootup. Should that many files cause a system reboot? It just seems strange that there were no errors or any indication that it was even shutting down. When I issue reboot or shutdown I see info in the logs indicating that the system is coming down and who initiated the reboot/shutdown.
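In case anyone hits the same thing, here's a sketch of how that queue can be inspected and cleared. The QUEUE_DIR override is only there for safe dry runs; on a real FreeBSD system the path is /var/spool/clientmqueue and the cleanup needs root:

```shell
#!/bin/sh
# Inspect and clear a stale sendmail client queue.
# QUEUE_DIR is /var/spool/clientmqueue on FreeBSD; overridable for a dry run.
QUEUE_DIR="${QUEUE_DIR:-/var/spool/clientmqueue}"

# Count queued files before touching anything
COUNT=$(ls -1 "$QUEUE_DIR" 2>/dev/null | wc -l | tr -d ' ')
echo "queued files: $COUNT"

# Remove the stale queue files (run as root on the real system)
if [ "$COUNT" -gt 0 ]; then
    rm -f "$QUEUE_DIR"/*
fi
echo "after cleanup: $(ls -1 "$QUEUE_DIR" 2>/dev/null | wc -l | tr -d ' ')"
```

`mailq -Ac` will also show what's sitting in the client queue before you delete anything.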

Here's my rc.conf:

Code:
hostname="XXXXXXXXX"
ifconfig_em0="inet XXXXXXXX netmask 0xffffff00"
defaultrouter="XXXXXXX"
sshd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"
zfs_enable="YES"

sendmail_cert_create="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"
sendmail_enable="NO"

nmbd_enable="NO"
smbd_enable="YES"
winbindd_enable="NO"

named_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_enable="YES"
mountd_flags="-r -n"
ntpd_enable="YES"

This is my periodic.conf:

Code:
daily_output="/var/log/periodic.daily.log"       # user or /file
weekly_output="/var/log/periodic.weekly.log"     # user or /file
monthly_output="/var/log/periodic.monthly.log"   # user or /file

Mostly this machine is just a NAS for my home network, but I may be adding some functions to it soon. Hopefully creating periodic.conf will stop files from showing up in /var/spool/clientmqueue, since the system doesn't send outgoing mail.

CPU is AMD Ryzen 5 5600G with 64 gig DDR4. The OS drive is an SSD and is formatted with ZFS.

Thanks
 
Should that many files cause a system reboot?
No. Never. I think these are just coincidental. You happened to stumble upon a bunch of periodic email stuck in the queue. The filesystem wasn't full, right? A full filesystem could certainly result in a bunch of errors, but no panics or reboots.

It just seems strange that there were no errors or any indication that it was even shutting down.
Hardware errors and/or issues with certain drivers could potentially cause it. Graphics driver, perhaps? Can you pick out from the logs the date/time it started again, and how much time passed between the periodic jobs and the reboots? Periodic is known to hammer the filesystem for a couple of minutes; it generates a lot of I/O. That could have triggered an issue as well. And if storage is gone, there's nothing to log to anymore either.
 
Unfortunately, with many desktops and even (sub)entry-level server HW there's not much that can help you troubleshoot this. Firmware (or actually the lack of it) is usually too simple.

At this stage it can be anything; we can only speculate: from HW to SW to external sources (e.g. power grid fluctuations).

You said:
few weeks back started seeing random reboots overnight.

Does that mean you see them regularly (so not so random), always at 3am-ish? If so, you could try running those jobs manually to see what happens. Increased power demand from the PSU could trigger an issue. But that's just a wild guess, one example of many possible reasons.
 
No. Never. I think these are just coincidental. You happened to stumble upon a bunch of periodic email stuck in the queue. The filesystem wasn't full, right? A full filesystem could certainly result in a bunch of errors, but no panics or reboots.
Filesystem was not full. Not even close. No panics or reboots either. Here's a bit from the messages log:

Code:
Aug 22 07:48:44 alexandria nmbd[3234]:   Samba name server XXXXXXXXX is now a local master browser for workgroup HACKNET on subnet 192.168.1.9
Aug 22 07:48:44 alexandria nmbd[3234]:
Aug 22 07:48:44 alexandria nmbd[3234]:   *****
Aug 23 15:18:21 alexandria su[6301]: derwood to root on /dev/pts/0
Aug 25 03:01:55 alexandria syslogd: kernel boot file is /boot/kernel/kernel
Aug 25 03:01:55 alexandria kernel: ---<<BOOT>>---
Aug 25 03:01:55 alexandria kernel: Copyright (c) 1992-2021 The FreeBSD Project.
Aug 25 03:01:55 alexandria kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

No indication that the system was rebooting or coming down or having a kernel panic.
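As an aside, the reboot timeline can be pulled straight from those boot markers. A small sketch (the LOG override is just for testing; on the box itself it's /var/log/messages):

```shell
#!/bin/sh
# Build a reboot timeline from the kernel boot markers in the messages log.
LOG="${LOG:-/var/log/messages}"
grep -h -- '---<<BOOT>>---' "$LOG" 2>/dev/null |
    awk '{ print $1, $2, $3 }'   # keep only the "Mon DD HH:MM:SS" timestamp
```

Comparing those timestamps against the periodic start time (3:01) shows how tightly the reboots track the jobs.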

Hardware errors and/or issues with certain drivers could potentially cause it. Graphics driver, perhaps? Can you pick out from the logs the date/time it started again, and how much time passed between the periodic jobs and the reboots? Periodic is known to hammer the filesystem for a couple of minutes; it generates a lot of I/O. That could have triggered an issue as well. And if storage is gone, there's nothing to log to anymore either.
Not using any graphics drivers at all. I tried tinkering with Nvidia support but it didn't go well; that card is now in a Linux system doing hardware transcoding. Here's kldstat:


Code:
Id Refs Address                Size Name
 1   51 0xffffffff80200000  1f3e2d0 kernel
 2    1 0xffffffff8213f000   59dfa8 zfs.ko
 3    1 0xffffffff826dd000     a4a0 cryptodev.ko
 4    1 0xffffffff82b20000     3378 acpi_wmi.ko
 5    1 0xffffffff82b24000     5ecc ig4.ko
 6    1 0xffffffff82b2a000     3218 intpm.ko
 7    1 0xffffffff82b2e000     2180 smbus.ko
 8    1 0xffffffff82b31000     3340 uhid.ko
 9    1 0xffffffff82b35000     3380 usbhid.ko
10    1 0xffffffff82b39000     31f8 hidbus.ko
11    1 0xffffffff82b3d000     3320 wmt.ko
12    1 0xffffffff82b41000     2a08 mac_ntpd.ko
13    1 0xffffffff82b44000     3530 fdescfs.ko

And daemon.log:

Code:
Aug 22 07:48:44 alexandria nmbd[3234]:   *****
Aug 22 07:48:44 alexandria nmbd[3234]:
Aug 22 07:48:44 alexandria nmbd[3234]:   Samba name server XXXXXXX is now a local master browser for workgroup XXXXXX on subnet XXXXX
Aug 22 07:48:44 alexandria nmbd[3234]:
Aug 22 07:48:44 alexandria nmbd[3234]:   *****
Aug 25 03:01:56 alexandria named[864]: starting BIND 9.16.42 (Extended Support Version) <id:a62d1bd>
Aug 25 03:01:56 alexandria named[864]: running on FreeBSD amd64 13.2-RELEASE-p2 FreeBSD 13.2-RELEASE-p2 GENERIC
Aug 25 03:01:56 alexandria named[864]: built with '--disable-linux-caps' '--localstatedir=/var' '--sysconfdir=/usr/local/etc/namedb' '--with-dlopen=yes' '--witho
ut-python' '--with-libxml2' '--with-openssl=/usr' '--with-readline=-L/usr/local/lib -ledit' '--with-dlz-filesystem=yes' '--enable-dnstap' '--disable-fixed-rrset'
 '--disable-geoip' '--without-maxminddb' '--without-gssapi' '--with-libidn2=/usr/local' '--with-json-c' '--disable-largefile' '--with-lmdb=/usr/local' '--disable
-native-pkcs11' '--disable-querytrace' '--enable-tcp-fastopen' '--disable-symtable' '--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/share/i
nfo/' '--build=amd64-portbld-freebsd13.1' 'build_alias=amd64-portbld-freebsd13.1' 'CC=cc' 'CFLAGS=-O2 -pipe -DLIBICONV_PLUG -fstack-protector-strong -isystem /us
r/local/include -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -ljson-c -fstack-protector-strong ' 'LIBS=-L/usr/local/lib' 'CPPFLAGS=-DLIBICONV_PLUG -isystem
/usr/local/include' 'CPP=cpp' 'PKG_CONFIG=pkgconf' 'PKG_CONFIG_LIBDIR=/wrkdirs/usr/ports/dns/bind916/work/.pkgconfig:/usr/local/libdata/pkgconfig:/usr/local/shar
e/pkgconfig:/usr/libdata/pkgconfig' 'PYTHON=/usr/local/bin/python3.9'
Aug 25 03:01:56 alexandria named[864]: running as: named -u bind -c /usr/local/etc/namedb/named.conf
Aug 25 03:01:56 alexandria named[864]: compiled by CLANG FreeBSD Clang 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
Aug 25 03:01:56 alexandria named[864]: compiled with OpenSSL version: OpenSSL 1.1.1o-freebsd  3 May 2022
Aug 25 03:01:56 alexandria named[864]: linked to OpenSSL version: OpenSSL 1.1.1t-freebsd  7 Feb 2023

The system is kept current and has all updates and patches.

I realize that correlation is not causation, but in this case I'm wondering. Since those files were removed, it has stopped. That's mostly why I'm asking here. I did memory tests with Memtest86 and I replaced the power supply with a spare that I had. Nothing changed until I removed those files.

Thanks for responding
 
Unfortunately, with many desktops and even (sub)entry-level server HW there's not much that can help you troubleshoot this. Firmware (or actually the lack of it) is usually too simple.

At this stage it can be anything; we can only speculate: from HW to SW to external sources (e.g. power grid fluctuations).

You said:


Does that mean you see them regularly (so not so random), always at 3am-ish? If so, you could try running those jobs manually to see what happens. Increased power demand from the PSU could trigger an issue. But that's just a wild guess, one example of many possible reasons.
They were happening pretty regularly just after 3am, usually between 3:01 and 3:04. The power supply is a new Corsair 750 watt, which is well beyond what the system needs. There are 6 hard drives attached, the Ryzen 5 processor pulls 65 watts, and there's 64 gig of DDR4. But that's it. The OS drive is an SSD and there's no GPU. I build systems that way on purpose, with an eye toward keeping consumption under control.

Thanks for the ideas.
 
They were happening pretty regularly just after 3am, usually between 3:01 and 3:04. The power supply is a new Corsair 750 watt, which is well beyond what the system needs. There are 6 hard drives attached, the Ryzen 5 processor pulls 65 watts, and there's 64 gig of DDR4. But that's it. The OS drive is an SSD and there's no GPU. I build systems that way on purpose, with an eye toward keeping consumption under control.

Thanks for the ideas.

Hmmm I've always found log-less reboots without warning to be hardware-related... that may or may not be the case here but given that it seemed to correspond with scheduled moments of increased activity, here are a few questions I had:
  • What do your long S.M.A.R.T. scans say about disk health?
  • Are your HDDs connected via RAID controller or directly? Any BIOS warnings?
  • How are your thermals? Do you monitor temperature and are your machine internals dust-free?
  • Do your exposed boards look relatively sane? No dying capacitors, residual magic smoke smell, etc?
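For the first question, a sketch of kicking off long self-tests across all the disks. The da0..da5 device names are assumed from the zpool layout later in the thread; adjust to your system. It prints the commands by default and only runs them with RUN=1 (as root):

```shell
#!/bin/sh
# Kick off long S.M.A.R.T. self-tests on every data disk.
# Prints the commands by default; set RUN=1 (as root) to execute them.
for d in da0 da1 da2 da3 da4 da5; do
    cmd="smartctl -t long /dev/$d"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
done
# Inspect the results later, e.g.: smartctl -l selftest /dev/da0
```

A long test on 12TB drives can take many hours; the drives stay usable while it runs.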
 
Agree with the above: A reboot is probably a hardware problem.

Look in the crontab for root and all users: What does your system do at 3am?
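A quick way to audit that is to dump the system crontab plus every user's crontab in one go (reading other users' tabs needs root); a sketch:

```shell
#!/bin/sh
# Audit everything scheduled: the system crontab plus every user crontab.
grep -v '^#' /etc/crontab 2>/dev/null
for u in $(cut -d: -f1 /etc/passwd 2>/dev/null); do
    crontab -u "$u" -l 2>/dev/null | sed "s/^/$u: /"
done
```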
 
Hmmm I've always found log-less reboots without warning to be hardware-related... that may or may not be the case here but given that it seemed to correspond with scheduled moments of increased activity,
Agreed.. That's why I'm asking on this one. I replaced the memory from another system, plus I ran Memtest86 against the system.

here are a few questions I had:
  • What do your long S.M.A.R.T. scans say about disk health?
  • Are your HDDs connected via RAID controller or directly? Any BIOS warnings?
  • How are your thermals? Do you monitor temperature and are your machine internals dust-free?
  • Do your exposed boards look relatively sane? No dying capacitors, residual magic smoke smell, etc?
The SMART scans show no failures, not even pre-failures. The 6 SATA drives are Seagate EXOS X14 drives that were all purchased within the last 6 months. I'm using both the onboard SATA controller and a PCI-E 1x SATA card; RAID is disabled in the motherboard BIOS and the card is a non-RAID ASMedia ASM116x AHCI SATA. The motherboard is an ASUS B550 Pro with solid capacitors.

Thermals:
Code:
dev.cpu.11.temperature: 26.6C
dev.cpu.10.temperature: 26.6C
dev.cpu.9.temperature: 26.6C
dev.cpu.8.temperature: 26.6C
dev.cpu.7.temperature: 26.6C
dev.cpu.6.temperature: 26.6C
dev.cpu.5.temperature: 26.6C
dev.cpu.4.temperature: 26.6C
dev.cpu.3.temperature: 26.6C
dev.cpu.2.temperature: 26.6C
dev.cpu.1.temperature: 26.6C
dev.cpu.0.temperature: 26.6C

I have a large heatpipe on the CPU with a low speed fan.

Exposed boards are all clean and look brand new.
 
Agree with the above: A reboot is probably a hardware problem.

Look in the crontab for root and all users: What does your system do at 3am?
The same thing most FreeBSD systems do at 3am. The Periodic jobs:




Code:
# Perform daily/weekly/monthly maintenance.
1       3       *       *       *       root    periodic daily


 
UPDATE -- I started a scrub on my ZFS pool this morning and it caused a reboot. I've ordered a new SATA controller, an LSI 9207-8i with appropriate cabling. It will be here in a few days; I'll move the Seagate drives to it and put the SSD on the motherboard. Hopefully that will do the job.
 
Currently running 13.2-RELEASE-p2 and a few weeks back started seeing random reboots overnight.
The issue is similar to this:

Hopefully the creation of periodic.conf will stop files from showing up in /var/spool/clientmqueue since it doesn't send outgoing mail.
Try this:
 
The issue is similar to this:


Try this:
I started a scrub on my RaidZ2 and that caused a reboot. I didn't get a backtrace though. I've got a new LSI controller arriving today with cables. I'm going to try and see if it makes a difference.

--Edit--
Controller is installed and scrub is running.

Code:
  pool: media
 state: ONLINE
  scan: scrub in progress since Sun Sep  3 16:47:59 2023
        4.66T scanned at 20.3G/s, 220G issued at 957M/s, 26.4T total
        0B repaired, 0.81% done, 07:57:18 to go
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
 
Another update: the new controller did not help. I also did a full wipe and reinstall of 13.2, and that did not help either.
Starting a scrub causes a reboot. Still no indication of why.
 
Please enable dumps -- that way the system would (in most cases) not just silently reboot but would produce a crash dump we can examine. Don't forget to install gdb too.
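For reference, a sketch of the knobs involved; the device name below is an example, so substitute your actual swap partition (check swapinfo):

```shell
# /etc/rc.conf -- let a panic leave a crash dump behind
dumpdev="AUTO"           # was "NO" in the rc.conf posted above

# Apply immediately without a reboot (example device; use your swap partition):
# dumpon /dev/gpt/swap0
# mkdir -p /var/crash    # savecore(8) writes the dump here on the next boot

# After the next panic, examine the dump with kgdb from the debug tools:
# kgdb /boot/kernel/kernel /var/crash/vmcore.0
```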
 
If this is desktop hardware with that many drives: most likely a dying/overwhelmed PSU. Desktop/gaming PSUs are usually cheap multi-rail designs that aren't well suited to handling a lot of peripherals (disks), especially under load.
There are server-grade PSUs in ATX form factor available, but those are usually more expensive (i.e. $500+) than some used "real" server systems, e.g. from Supermicro.
 
If this is desktop hardware with that many drives: most likely a dying/overwhelmed PSU. Desktop/gaming PSUs are usually cheap multi-rail designs that aren't well suited to handling a lot of peripherals (disks), especially under load.
There are server-grade PSUs in ATX form factor available, but those are usually more expensive (i.e. $500+) than some used "real" server systems, e.g. from Supermicro.
I replaced the power supply with a new, unused power supply of the same model, a Corsair CX750M.
The CPU is 65 watts, 64 GB DDR4 (25 watts total), six Seagate Exos X14 12TB drives (60 watts total), no GPU, an ASUS TUF B550 Pro motherboard (200 watts max), and an LSI SAS controller (15 watts).
So, power consumption is 365 watts on a brand-new 750 watt power supply.
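Those numbers can be sanity-checked quickly (figures as quoted above, all worst-case steady-state watts). The caveat the repliers raise still applies, though: spinning drives draw far more at spin-up than at idle, and that surge hits the peripheral 12V rail specifically, so a steady-state total alone doesn't prove the rail can take it:

```shell
#!/bin/sh
# Rough steady-state power budget for the box (watts, as quoted above).
CPU=65; RAM=25; DISKS=60; BOARD=200; HBA=15; PSU=750
TOTAL=$((CPU + RAM + DISKS + BOARD + HBA))
echo "estimated draw: ${TOTAL}W of ${PSU}W"
```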
 
Did you try to trigger the panic with the scrub?
Yes, it panicked, and then everything died. I've been working with the hardware for about the last 3 hours and it will not boot, even with one DIMM and the CPU. So I have a motherboard on the way, and a CPU just in case. So far everything has been swapped out except for the motherboard and CPU. What a PITA.
 
Sorry to hear that. As I mentioned above, this could be anything at this point.
But if it panicked, hopefully you'd be able to dig out the stack trace to see what happened, or possibly collect more stack traces to rule out an (unlikely) SW issue.
 
I replaced the power supply with a new, unused power supply of the same model, a Corsair CX750M.
The CPU is 65 watts, 64 GB DDR4 (25 watts total), six Seagate Exos X14 12TB drives (60 watts total), no GPU, an ASUS TUF B550 Pro motherboard (200 watts max), and an LSI SAS controller (15 watts).
So, power consumption is 365 watts on a brand-new 750 watt power supply.

There's still the problem with those 'gaming' PSUs: most of the rated power can only be drawn on the dedicated CPU and GPU rails. Since the dawn of SSDs, the rails for peripherals, i.e. those that power the SATA and 4-pin Molex connectors, have been downgraded to a bare minimum and can't handle many spinning drives and their relatively high peak loads.
Also, those Corsair PSUs don't seem to be the most reliable. I have a defective HX850 in my scrap box at work that died after ~2 years of light use; the replacement RM550x died after 2 weeks. The warranty replacement unit is now 5 months old, and I hope it will last a while longer until the system is decommissioned. (The modular cabling has been tightly woven and strapped into the tiny case of that desktop; replacing it with a different PSU, let alone a non-modular one, would be a job for a maniac. Otherwise I'd have ordered a different brand after the second failure.)

But given that system now won't boot anymore, a hardware defect (mainboard) seems more likely now. Also (sadly) relatively "normal" for gaming hardware that isn't suited for 24/7 operation.
 
Wait, don't we have a similar story here recently?

Same footprint: standard (i.e. gaming) hardware with a server-grade disk array - and strange errors that seem difficult to pinpoint.
 
System is up and running now and a scrub has been started.
Waiting to see if it panics.

--Update-- The scrub is 1/2 complete. No issues. I've got zpool iostat running to track throughput, and I've got other things hitting the filesystem during all of this. No hiccups so far. Looks like it was the motherboard.
 
Wait, don't we have a similar story here recently?

Same footprint: standard (i.e. gaming) hardware with a server-grade disk array - and strange errors that seem difficult to pinpoint.
I read it and it looks similar, however in my case it turned out to be a dodgy motherboard.
 
There's still the problem with those 'gaming' PSUs: most of the rated power can only be drawn on the dedicated CPU and GPU rails. Since the dawn of SSDs, the rails for peripherals, i.e. those that power the SATA and 4-pin Molex connectors, have been downgraded to a bare minimum and can't handle many spinning drives and their relatively high peak loads.
Also, those Corsair PSUs don't seem to be the most reliable. I have a defective HX850 in my scrap box at work that died after ~2 years of light use; the replacement RM550x died after 2 weeks. The warranty replacement unit is now 5 months old, and I hope it will last a while longer until the system is decommissioned. (The modular cabling has been tightly woven and strapped into the tiny case of that desktop; replacing it with a different PSU, let alone a non-modular one, would be a job for a maniac. Otherwise I'd have ordered a different brand after the second failure.)

But given that system now won't boot anymore, a hardware defect (mainboard) seems more likely now. Also (sadly) relatively "normal" for gaming hardware that isn't suited for 24/7 operation.
As I found out today, it was the motherboard being unsteady on power. Replacing the motherboard has fixed the problem. FWIW, I've had COTS hardware running 24/7 for close to 20 years. From time to time I've also had some SuperMicro hardware, and while it is nice to have, "gaming" hardware as you put it does the job just fine and lasts almost as long as the enterprise-level hardware. I'm willing to accept the "gaming" hardware differences. It's worked out just fine for me.
 