System keeps going unresponsive, no idea why

So my FreeBSD 12.1-RELEASE-p8 system keeps going unresponsive, with no specific error I can see in the logs (unless I'm missing something). Even on the console I get nothing. If a USB keyboard is already plugged in, it doesn't respond to input; if I plug one in after the system goes unresponsive, it never gets connected/powered up (it's a spare keyboard with backlit keys, and the keys don't light up).

I'm on a Ryzen 1700X with 32GB RAM and an Intel PRO/1000 NIC, and I run bhyve (via churchers' vm-bhyve) with a few VMs such as a Ubiquiti Controller, Pi-hole, Home Assistant, Samba for file shares, and so on. When I notice it has gone unresponsive (SSH dropping, can't connect to shares, etc.), there are often errors similar to the ones below, although sometimes it's just some random sendmail lines. I have a zroot pool and an 8-disk RAIDZ2 array, and there's still plenty of available space on zroot and on the storage pool.

sshd[17021]: error: maximum authentication attempts exceeded for root from 132.145.127.59 port 55872 ssh2 [preauth]

or

sshd[17257]: error: PAM: Authentication error for root from 45.227.255.4

or

sshd[17897]: error: PAM: Authentication error for illegal user no from 185.107.80.218

Anyone have any thoughts on things to check, or any debug/logging I could enable that might help? Any places to check other than /var/log? I'm at my wit's end; I just don't understand what might be happening. I can't think of anything in particular that has changed, and it's definitely random: sometimes weeks will go by, sometimes less than 24 hours.
 
Yes, the ZFS pools are healthy (although zroot is not on the latest ZFS version, it's otherwise fine).
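
For reference, this is roughly how I've been checking pool health - a minimal sketch using only the standard tools (the pool name below is a placeholder):

Code:
# prints only pools with problems; "all pools are healthy" is the good case
zpool status -x
# kick off a scrub if one hasn't run recently ("storage" is a placeholder name)
zpool scrub storage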

Yes, SSH is port forwarded to this box from the Internet, and no, I don't have sshguard. I'll install and configure it; it's certainly possible that excessive SSH connections/requests are the problem, although I'd think that would also seriously hinder my regular browsing, and I don't see any signs of that.

I don't have any specific firewall configured that I know of; none of pf, ipf, or ipfw is enabled in rc.conf. Which is easiest for a simple config?

Could it be the NIC? Some other hardware?
 
My list for trouble-shooting: humans (myself included), DNS, RAM, power. I doubt it's the ssh connections, but something like sshguard is a good idea (or even (dons asbestos pants) move the port, and look at the AllowUsers option in sshd_config as an initial step if you want to do something without installing a port).

These sporadic ones are a complete pain - it could be a certain part of RAM that only occasionally gets tickled (have you tried memtest?), or just enough power draw under some loads to cause power issues.

Once I was having odd issues (aborted connections rather than hard hangs) with scp'ing files; it happened in different places and at different times, with no obvious pattern. I replaced the RAM and it was solid after that (but maybe in the process of changing the RAM I re-seated some other component or something, so I tricked myself into thinking it was the RAM?).

To eliminate the NIC - have you got another one you could try?

Any changes recently?

You could try running top and logging every minute to see if anything shows up there. E.g. I've got this running every minute to watch a memory issue (the issue might well be mine rather than the system's!) that I'm trying to understand:

Code:
@every_minute date >> /tmp/swap_usage.txt && top -n 100 >> /tmp/swap_usage.txt && pstat -sh >> /tmp/swap_usage.txt

Good luck!
 
Being continuously hack-attacked over ssh does not explain your system going down. But it is bad:
45.227.255.4 is a machine in Panama.
132.145.127.59 is on the Oracle public cloud.
I'm 99% sure those machines are not supposed to ssh into your machine, in particular not as root. If hackers are continuously probing you, they will get in sooner or later, or DoS you. You need to do something. My personal suggestion (doesn't work for everyone) is quite effective and much easier than sshguard and a firewall: simply move the ssh port to a bizarre port number. No, not 2222 (that's old, everyone does that). Personally, I use the last four digits of Pi <- ha ha, old joke. Now, for real security you want something better than obscurity, but this reduces the frequency of ssh attacks to "tolerable".
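
As a sketch of the port move, assuming a stock /etc/ssh/sshd_config (the port number here is just an example, pick your own):

Code:
# /etc/ssh/sshd_config - replace the default "Port 22"
Port 54321

After editing, service sshd restart picks up the change; keep your current session open until you've confirmed you can log in on the new port.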

Monitoring: I would connect a monitor to the console or via serial port, log in *WITHOUT* X-windows, and right after booting start top on it. Most likely top will keep going and you can see it. If your hardware doesn't work that way, then write a little script that runs top once (with only one screen of output) and appends it to a file, then calls sync to make sure the file is on disk, and repeats every 10 seconds. Next time you log in, check what top was doing right before the machine went down. My hunch is that something is exploding and you have a massive resource problem: your system is probably actually up and running, but so slow that it is indistinguishable from dead.
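
Something along these lines would do it - a minimal sketch (log path, interval and process count are arbitrary choices):

Code:
#!/bin/sh
# Append a snapshot of top to a log file every 10 seconds, syncing after
# each write so the last entry survives a hang or crash.
LOG=/var/log/top_snapshots.log
while true; do
    date >> "$LOG"
    top -n 30 >> "$LOG"   # -n: non-interactive (one display), 30 processes
    sync
    sleep 10
done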
 
OK, decided to start with an open ipfw setup with sshguard. Up and running!
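
For anyone following along, this is roughly what it looks like in rc.conf - a minimal sketch assuming sshguard came from packages and is using its ipfw backend:

Code:
# /etc/rc.conf
firewall_enable="YES"
firewall_type="open"      # ipfw with the permissive "open" ruleset
sshguard_enable="YES"     # security/sshguard blocking offenders via ipfw

(The blocking backend itself is selected in sshguard's own config file, if I remember the package layout right.) Then service ipfw start and service sshguard start, or just reboot.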

I'll wait a bit, do my usual, see if that seems to help.

Doubt it's DNS; this is all over the local LAN, 192.168.1.0/24, with a DHCP reservation for the server.

Could be RAM, I suppose. I did run a full memtest86+ pass on the RAM when I first got it. We'll see; I may need to run another couple of passes and see if that turns something up.

There is a NIC on the motherboard, Realtek I think, but I got this Intel one since Intel generally makes much better NICs. I'll wait on sshguard and see if that has an effect, then try the rest.

I did add your monitoring cron job writing to a log file; maybe it'll pick something up if it's that.


ralphbsz You presume I have a serial port... this is a modern consumer motherboard. I'd have to get a serial header-to-back-panel bracket, and I don't even know if it's a real serial port or something sitting on the USB bus. Still, I could see if that gives me something, maybe.
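
If that header does turn out to be a real UART that uart(4) attaches to, getting console output onto it is mostly a /boot/loader.conf change - a sketch, assuming the default baud rate:

Code:
# /boot/loader.conf - mirror the console to both serial and video
boot_multicons="YES"
boot_serial="YES"
console="comconsole,vidconsole"
comconsole_speed="115200"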

And this is a home NAS server, headless (although I do have an HDMI cable run to one of my monitors for now, to try to troubleshoot this), so no X-windows, nothing like that.

You're right, it's probably a resource issue of some kind, or something blocking the kernel from responding at all, even on the TTY console (USB keyboard, standard cheap PCIe video card output). And that's kind of strange, I'd think; it has to be something truly crazy.
 
The DNS bit was just on my list of usual suspects - not for your case here!

Do also have a look at AllowUsers in sshd_config - it's an extra layer of protection. No good if you have LOTS of user accounts, but if there are only one or two, it's worth doing.
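
For example (usernames here are placeholders, and you can optionally restrict by source address too):

Code:
# /etc/ssh/sshd_config - only these accounts may log in over ssh
AllowUsers alice bob
# or, limiting a user to the local subnet:
# AllowUsers alice@192.168.1.*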

Lots of people HATE the high-port idea (and there are some good reasons for the dislike - see e.g. https://lobste.rs/s/jj8cx9/why_putting_ssh_on_another_port_than_22_is) but it works for me.

I think we're all agreeing it's unlikely to be the ssh connection attempts.

Intel NICs are definitely good; it's just that if yours has developed a fault, trying another one (if you have a spare) will eliminate that possibility.
 
Put AllowUsers in, just because. And now all of a sudden it restarted itself. I was SSHed in, monitoring with htop, while watching a movie via SMB. Suddenly I lost the connection, switched over to that monitor input, and it was restarting. Weiiird. Looking more like memory, or something wonky with the NIC/driver.
 
Well, this is interesting. I haven't always been checking the console for messages, since it never seemed to show anything. Tonight it did the same thing, and I had a lovely kernel stack trace on the console! It definitely looks like something network related; let me try switching network ports and see if that makes a difference.

Full-size phone pic of the trace: https://i.imgur.com/DcnYFgS.jpg
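
Side note for anyone who lands here later: with an actual panic involved, it may be worth enabling kernel crash dumps so the trace ends up in /var/crash rather than only in a phone photo - a minimal sketch, assuming the dump device has enough space:

Code:
# /etc/rc.conf - dump the kernel to the swap/dump device on panic;
# savecore(8) copies it into /var/crash at the next boot
dumpdev="AUTO"
dumpdir="/var/crash"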
 
One more vote for moving SSH to a random port. I've been doing this for years and it literally reduces brute-force attempts to zero for me, and that's while I'm still on a port below 1024. Not that it really matters with SSH, but I like the short number. Of course there's also sshguard running, but it's just sitting there being lonely.
 
Hm, it hung sometime last night. I'll keep an eye on it. It is quite hot today where I am; I'll wait a bit more and get a memtest86+ bootable thumbdrive ready to go.
 
Temperature/thermal cut-outs are another thing to watch out for, yes. That should be on my list!

If it's consumer-grade hardware it might not have much in the way of sensors, so something like ipmitool might not be of use. But if there's anything you can use to monitor the CPU and ambient temperatures, that's another thing to look at (or just try to improve the cooling and see if it helps, but you still need a way of measuring any cooling you try).
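
On FreeBSD, a minimal way to get CPU temperature readings without IPMI, assuming amdtemp(4) supports this Ryzen generation on 12.1:

Code:
# load the AMD on-die temperature driver (add amdtemp_load="YES"
# to /boot/loader.conf to load it at every boot)
kldload amdtemp
# if the driver attaches, readings show up as sysctls:
sysctl -a | grep temperature
sysctl dev.cpu.0.temperature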
 
It's a 1700X with a Noctua NH-U12S, a high-quality 120mm tower cooler. I'm not sure it's thermal, at least not the CPU; I'm not sure it's thermal at all, since it does this seemingly at random, even when it was much cooler (day or night).

It's 50% done with pass #2 of memtest86+, no errors so far; time to let it run overnight. It's doing the single-core version by default, and I'm just going to let it go. If it hasn't found anything by morning, I'll reboot and try multi-core; maybe it's a particular CCX/core that hits errors.
 
So it was about 2/3 of the way through pass 4 this morning, no errors. I rebooted and started it with SMP, and within minutes got a bunch of errors on CPU #4 (0-based). Does this mean my CPU is bad? :( *sigh*

EDIT: Someone on the Ars Technica forum suggested it might be related to the marginality issue that was seen way back when: https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response. Seems quite plausible, as I got a mid-summer/late-summer 1700X.
 
OK, update from me: it's now been 14 full days of uptime. For me it turned out to be the Ryzen C-state halt issue. Do some searching; there are lots of complaints out there, mostly from Linux users, but it affects *nix in general. I probably didn't need to change the mobo/RAM/PSU, but I already did and don't feel like changing things back.

Anyway, for me the following UEFI tweaks/mitigations fixed it (an OS-side alternative is sketched after the list):
  • Set "Power Supply Idle Current" to Typical (under AMD CBS, I think)
  • Disabled C-state C6, left the rest enabled
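
For anyone who can't or doesn't want to touch UEFI settings, FreeBSD can also be told not to request the deeper C-states. I can't promise this alone avoids the Ryzen C6 hang, but it's an easy thing to try:

Code:
# /etc/rc.conf - cap the deepest ACPI C-state the OS will request
performance_cx_lowest="C1"
economy_cx_lowest="C1"
# or change it on the fly:
# sysctl hw.acpi.cpu.cx_lowest=C1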
 
Your advice on changing ports was very useful to me. This simple trick works: server monitoring shows that the "uninvited guests" have stopped trying to log into the server.
 
But note that the "ssh on a strange port" trick is not perfect. It reduces the number of attacks by a huge factor: I have two internet-facing machines; on one I've never had anyone attack the ssh port, and on the other I get one attack every 2-4 weeks. That's orders of magnitude better than what an open server sees (an attack every few minutes or even seconds), but it should not be your only security barrier.
 