FreeBSD server stops responding - need help to solve the issue

Hi, I am new to this forum but not to FreeBSD. I have successfully run a FreeBSD backup server (via rsync) for years, and now it has stopped responding three or four times during the last two weeks. The machine has a server motherboard, ECC memory and so on, all according to the hardware recommendations for FreeBSD, and there have been no hardware changes.

The FreeBSD version is 13.1-RELEASE-p5, a scrub of the zpools ran without fault on 12 January 2023, and all packages are up to date as well. The install is plain vanilla, with two server SSDs in a mirror as the backup filesystem. The only third-party software installed and compiled by me is netdata.

I have searched in /var/log, but there is nothing there that I am able to use for solving the issue (mostly due to my inexperience with FreeBSD).

I don't know if it is the ssh server that is causing the issue, but when the server is not responding there is no data from netdata either. Access to the server is rsync on the default port, tunnelled over SSH on the default port, plus ssh from a terminal.

Where do I start looking for what is causing the issue? /var/crash is empty as well.
Thomas
 
Welcome to the forums.


[...] run a server [...] for years

how many years?
Any chance that hardware is now ancient and is just starting to die?
Wild guess in that direction: one of the disks is dying and instead of just staying dead, it locks up the whole system until it decides to come back for a while... Been there several times - especially SATA disks are pesky liars when it comes to health/SMART data and refuse to shut up and just die when their time has come.


If this is a server system, you should have some kind of OOB-management (IPMI) with a remote console - what do you see there when the system goes unresponsive? Does it just reboot or are there any errors/dmesg output on the screen?
 
You can have a look at :
Code:
less /var/log/netdata/access.log
less /var/log/netdata/debug.log
less /var/log/netdata/error.log
less /var/log/netdata/health.log
grep -i ssh /var/log/messages | less
 
If the disks stop responding, the entire OS partially locks up. Running applications tend to keep running, as most of their state is in memory and doesn't require disk access. When you log in with SSH it needs to access the disk(s), and it will stall indefinitely until the disks start responding again. Any process that needs to load something from disk will simply stall until then; I presume netdata needs disk access and therefore also locks up.

I definitely agree with sko: check the disks. On several occasions I've had one disk in a 4-disk RAID-Z hang up the entire bus, causing the whole pool to seize up, which in turn caused everything else that required access to that pool to freeze too.
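When that happens, a quick way to see whether the pool or a single device is wedged is something like this (the device names are just the usual adaN examples):

Code:
zpool status -v      # pool health and per-vdev read/write/checksum error counters
gstat -p             # live per-disk I/O; a wedged disk tends to sit near 100% busy with no throughput
camcontrol devlist   # confirm all SATA devices are still attached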
 
Welcome to the forums.

how many years?
Any chance that hardware is now ancient and is just starting to die?
Wild guess in that direction: one of the disks is dying and instead of just staying dead, it locks up the whole system until it decides to come back for a while... Been there several times - especially SATA disks are pesky liars when it comes to health/SMART data and refuse to shut up and just die when their time has come.


If this is a server system, you should have some kind of OOB-management (IPMI) with a remote console - what do you see there when the system goes unresponsive? Does it just reboot or are there any errors/dmesg output on the screen?
Thx :-) The HW is about 4-5 years old. I also installed FreeBSD 13.x when it was released, and it has, until recently, run with no problems. The issues suddenly appeared 2-3 weeks ago. I only have an ssh terminal connection, so I cannot see whether the server has died or only the ssh connection is dead. I have to dig into the logs to check what is going on.. I know I should have a remote console on it :-(

For now I have not started the netdata service again. I will let the server run for some weeks without netdata and see whether it is causing the issues..
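For anyone following along, keeping it off looks roughly like this (assuming the rc script installed by the port is named netdata):

Code:
service netdata stop          # stop it now if it is running
sysrc netdata_enable=NO       # keep it from starting again at boot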
 
If the disks stop responding, the entire OS partially locks up. Running applications tend to keep running, as most of their state is in memory and doesn't require disk access. When you log in with SSH it needs to access the disk(s), and it will stall indefinitely until the disks start responding again. Any process that needs to load something from disk will simply stall until then; I presume netdata needs disk access and therefore also locks up.

I definitely agree with sko: check the disks. On several occasions I've had one disk in a 4-disk RAID-Z hang up the entire bus, causing the whole pool to seize up, which in turn caused everything else that required access to that pool to freeze too.
The disks are SATA: ada0 is zroot (a single disk), and the mirrored disks are WD Red SA500 NAS SATA SSDs (2.5"/7mm, cased). I will check the logs to see if there are any errors from the disks..

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <KINGSTON SUV400S37120G 0C3J96R9> ACS-4 ATA SATA 3.x device
ada0: Serial Number 50026B77690038A2
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 114473MB (234441648 512 byte sectors)
ses0: ada0 in 'Slot 00', SATA Slot: scbus0 target 0
ada1 at ahcich3 bus 0 scbus3 target 0 lun 0
ada1: <WDC WDS500G1R0A-68A4W0 411000WR> ACS-4 ATA SATA 3.x device
ada1: Serial Number 2140HU446905
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 476940MB (976773168 512 byte sectors)
ses0: ada1 in 'Slot 03', SATA Slot: scbus3 target 0
ada2 at ahcich4 bus 0 scbus4 target 0 lun 0
ada2: <WDC WDS500G1R0A-68A4W0 411000WR> ACS-4 ATA SATA 3.x device
ada2: Serial Number 2140HU444315
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada2: Command Queueing enabled
ada2: 476940MB (976773168 512 byte sectors)
 
grep -i ssh /var/log/messages | less

I did not find anything suspicious in the netdata logs, but found this in /var/log/messages. Might it be an sshd error?

Jan 8 08:28:26 backup pkg[43720]: libssh2 upgraded: 1.10.0,3 -> 1.10.0_1,3
Jan 9 08:54:21 backup sshd[7746]: error: Fssh_kex_exchange_identification: read: Connection reset by peer
Jan 15 18:31:45 backup sshd[5387]: error: Fssh_kex_exchange_identification: read: Connection reset by peer
 
I did not find anything suspicious in the netdata logs, but found this in /var/log/messages. Might it be an sshd error?

Jan 8 08:28:26 backup pkg[43720]: libssh2 upgraded: 1.10.0,3 -> 1.10.0_1,3
Jan 9 08:54:21 backup sshd[7746]: error: Fssh_kex_exchange_identification: read: Connection reset by peer
Jan 15 18:31:45 backup sshd[5387]: error: Fssh_kex_exchange_identification: read: Connection reset by peer
That only means that someone (either a valid user or an attacker) started to log in and disconnected during one of the security setup phases (namely key exchange). If you have lots and lots of those, it might be a symptom of attackers probing your ssh port. That can be worrisome, or (sadly) just part of doing business. But it should not have anything to do with the machine fully crashing.

I agree with the other posters who said that disk failure is the most likely culprit. And if it is the boot disk, you might not even see error messages from it: While an IO error on the disk might make it to the in-memory log (which you can inspect with the dmesg command, if only you could log in), it might not make it to /var/log/messages, because (duh) the system disk just failed.

Even worse: A few years ago I had a disk drive that failed so thoroughly that the motherboard completely stopped functioning. No BIOS screen, no booting, no sign of life at all.

I have two suggestions. First, as you already said, get remote console access organized ASAP. When the crash occurs, there might be something interesting on the console (like the most recent disk error, or a kernel stack trace if a hardware problem has caused the OS to crash). Second, install the smartmontools package, and use smartctl -a on all the disk drives, to see whether there is any indication of something suspicious or outright broken. Neither of those two suggestions is a silver bullet, but we have to start somewhere.
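For the second suggestion, roughly this (substitute your real device names for the adaN examples):

Code:
pkg install smartmontools
smartctl -a /dev/ada0            # full SMART report; repeat for ada1 and ada2
smartctl -t short /dev/ada0      # start a short self-test
smartctl -l selftest /dev/ada0   # read the self-test log a few minutes later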
 
Second, install the smartmontools package, and use smartctl -a on all the disk drives, to see whether there is any indication of something suspicious or outright broken. Neither of those two suggestions is a silver bullet, but we have to start somewhere.
Thx, the server is still alive this morning :) I installed smartmontools and ran smartctl -a and a self-test on all disks - no errors reported. I don't think the server has been hacked, or that anyone is trying to hack it. I don't have a fixed public IP address, and the server's address is a local 10.x.x.x address only. And the only access to the server is by ssh, with passwordless login using ssh keys...
 
Not sure if its IP address is assigned to something else too, which may turn the connection on and off.
The server's IP address is a static DHCP reservation (by MAC address) handed out by the router. As far as I can tell, nothing else is trying to get the same IP address.
 
I would hammer the hard drive that holds the root filesystem with benchmarks (e.g. bonnie) to see whether that triggers hangs.

If yes, it is also possible that a bad mainboard is responsible. I had bad mainboards produce hangs on heavy disk I/O multiple times.
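A rough way to do that, using the bonnie++ port (a successor of the bonnie mentioned above; the scratch directory is just an example, and -u root is needed because bonnie++ otherwise refuses to run as root):

Code:
pkg install bonnie++
mkdir -p /root/bench
# loop so the disk gets hammered for a while, not just a single pass
while true; do bonnie++ -d /root/bench -u root; done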
 
I would hammer the hard drive that holds the root filesystem with benchmarks (e.g. bonnie) to see whether that triggers hangs.
OK, I will try this if and when the server dies again :-) The only changes for now are that the netdata service is not started and that the smartd daemon is installed and running for the zroot disk and the mirrored backup disks. If the server keeps running for a week or two I might suspect netdata as the cause; I did find some errors in the netdata log. But it might also be a hardware error. I will try something like bonnie if the server dies again :-)
 
The issue is most likely found, but not yet fixed. The server went offline again today, and I thought: why not pull the network cable and try the second interface? When I finally put the network cable back into the default network interface, the server came online again. So for some reason my default network interface (igb0) goes down after some time; I don't know why. I have also installed monit for monitoring, and every time the network interface goes down, monit will throw an alert and log it. I am not a network expert; the network interfaces are activated with default values by FreeBSD at startup.

So if anyone has ideas about what might cause the network interface to go down, please let me know. The MB is an ASRock E3C226D2I server board with an Intel chipset. The MB was bought back in 2017 and has been running without any problems since then...
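The monit check I added is roughly this, in /usr/local/etc/monitrc (just a sketch; the interface name is the igb0 from above, and the alert address is configured elsewhere):

Code:
check network igb0 with interface igb0
    if failed link then alert

After editing the file I reload monit and it starts watching the link state.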
 
Hmm. Maybe some (unusual? bizarre?) network misconfiguration that normally has no effect, but sometimes leaves the network stack blocked? And re-initializing it (by switching interfaces, perhaps causing things like DHCP to restart things) unjams it?

Try diagnosing the network interface at various levels, for example both ping and an ssh probe. Or try running a script that records the output of ifconfig every second, and see whether it changes. Or various netstat outputs.
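A throwaway version of that recorder could look like this (plain sh; the log path and the igb0 name are just examples, adjust to your setup):

Code:
#!/bin/sh
# Record a timestamped snapshot of the interface state every second.
LOG=/var/log/igb0-watch.log
while true; do
    date >> "$LOG"
    ifconfig igb0 >> "$LOG"
    netstat -i -n -I igb0 >> "$LOG"   # per-interface packet and error counters
    sleep 1
done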
 
If the interface goes down there has to be something in /var/log/dmesg and possibly in /var/log/messages.
It might also be an address collision - is any DHCP server giving out addresses in the same range and not ping-checking before handing out leases?
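For the log side, something like this should show any link flaps on that interface (the grep pattern is just an example):

Code:
grep -i igb0 /var/log/messages
dmesg | grep -i 'link state'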

Assuming you run a switch with (at least basic) management capabilities: what does the switch have to say about that interface? Increased errors/packet drops? Decreased port speed? (e.g. "sh int giN/N/N" on Cisco)

If you run something like zabbix, it might be worth monitoring the switch via SNMP to get some more insight into such issues...


Given the (luckily) few encounters I have had with ASRock (desktop) hardware, I still wouldn't say a hardware defect is unlikely after just ~5 years. From what I experienced with two mainboards and some of their NUC variants, I have a pretty low opinion of ASRock and avoid them like the plague...
 
I had GbE interface hangs on em earlier this year. Nothing in dmesg, just had to reboot to fix it. Went away with a FreeBSD update.
 