Problem with runaway sshd and sendmail on FreeBSD 7.4

Running a FreeBSD 7.4 server, haven't made OS config changes to it for months, hadn't rebooted in a couple of months. No hardware changes in a couple of years. Just today it started exhibiting a new weird behavior.

I try to ssh in, the client side hangs, and the server-side process runs away, consuming many CPU cycles without doing anything. Setting LogLevel to DEBUG3 in /etc/ssh/sshd_config doesn't even log an attempted client connection. It's like it goes off into Never Never Land before it does anything. A similar thing happens with connections to the sendmail daemon: CPU use by the process goes through the roof, but it doesn't get any work done. Apache running on the same box is not affected, it works fine. BIND on the same box works just fine as well.

When the problems started, it was running 7.4p6. I rebuilt the kernel and did a new install and rebuilt world and did an install and rebooted. Now running 7.4p11 with updated binaries and libraries. Same problems. X on the machine and client side outbound connections work just fine. Nothing in the logs to indicate a problem.

The box has two NICs and is running as a gateway/server/firewall between the Internet and my home network. In addition to above mentioned services, it's running natd via pf and dhcpd, with both seeming to work fine.

What the heck is going on? Anyone have a clue? Anyone know what I can even check? I'm at a complete loss, and have scoured the 'Net and can't find any information on similar problems.
 
npc,

Since the box is in service at your home I am assuming that you have physical access to it, correct? Throw a KVM on it and see what is going on in /var/log for auth and messages.

Given the age of your box and given how insane it has been lately with zero days, you could be looking at systemic ownageship. I personally would not spend too much time. Get your config files off and roll a new one.

Good luck.
 
If you notice from my initial message, I do have console access and there's nothing unusual in the logs. I rebuilt sshd and sendmail from source, put them in place, ran them, and they experienced the same problem, then I compared the md5 checksum before and after, they're the same. So, it appears unlikely to be problems with the daemons themselves. Gotta be kernel or hardware.
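
The before-and-after checksum comparison described above could be scripted roughly like this. It is only a minimal sketch: the function names `md5_of` and `same_binary` are made up for illustration, and on the box itself the stock md5 utility does the same job.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b):
    """True if both files have the same MD5 digest (hypothetical helper)."""
    return md5_of(path_a) == md5_of(path_b)
```

Calling `same_binary()` on a saved copy of a daemon and on the installed one would show whether the file changed between runs. Matching digests only say the file on disk is unchanged; they say nothing about the libraries or kernel it runs against.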

Doing an upgrade would be fine, but what versions are safe? What should I upgrade it to? What prevents the new one from having similar issues, whether "owned" or not? Where is there a good summary of these "ownageship" issues?
 
npc said:
If you notice from my initial message, I do have console access and there's nothing unusual in the logs.
Yeah, sorry about that. I read where ya couldn't ssh and didn't see anything about log files so I did not spend too much time processing the rest.
.. but what versions are safe?
"Safe", as I am sure that you are aware, is subjective. Sometimes too much energy is expended getting wrapped around an axle on the perfect setup. I now subscribe to the theory of diminishing returns, and being able to say "boy howdy we sure figured out that problem after 50 hours!!" is not where I am at anymore.

For what I do at my house (similar to your config), 9.1 has been dandy and I think that you will be happy with it too. Plus you will pick up PF version 4.5!

:D

But if ya want to slog it out on your box bro, I would recommend that you run tcpdump and see what is being put on the wire for ssh/sendmail as opposed to just looking at a log file. Perhaps you could build up an identical 7.x box, making note of the status of ssh/sendmail in a change management style of control as you add software and see where it breaks. If it does not break and everything is the same software wise, you clearly have something sideways on your production box.
 
Okay, upgraded to 8.3 from CD. Doesn't fix anything, but now I can't get IP forwarding to work either, so *everything* on my internal network is now effectively a brick. Also, after upgrades and reboots additional programs exhibit the same runaway CPU problem: the fluxbox window manager, and hald being the most obvious.

So, with all due appreciation to johnblue, I've wasted a day upgrading and then fixing the resulting problems, and I'm actually worse off than I was before.

Anyone else want to try, because I think my next step has to be moving to Linux, 'cause it's cheaper than buying another box?
 
npc said:
Anyone else want to try, because I think my next step has to be moving to Linux, 'cause it's cheaper than buying another box?
There is a high degree of probability that if you install any Linux distro .. it will work! LOL. Doing that is very similar to what I initially suggested that you do, namely copy your config files off and start over.

:e

Nothing is worse than sinking time into a project and not having it work, but the results do not surprise me in the least. At the very least you should have imaged your current install onto a spare hard drive and then slipped that spare hard drive into the box so as to do your "upgrade" testing. Personally, as a sysadmin, having a DR plan is something intuitively obvious. I apologize for not mentioning it as an option; I assumed most people know they are responsible for their own data in whatever form that may be in.

But back to your box! It is whacked, and surely already was when the version level was 7.x. Back up your config files and roll a new one.

Good luck.
 
I suspect the problem is not on the border machine, but on the internal network. Have you tried to ssh in from the border machine to itself to see if ssh is logging any activity?
Just a guess, but I suspect the problem cannot be in ssh itself if you have rebuilt everything...
 
First, I'm the same poster that started this thread. I went to log in to this site, and it told me my account didn't survive an upgrade. So I went through the password reset process, it sent me a new password, I tried to log in to the site, and it still told me my account didn't survive an upgrade, even using the reset password. So, the password recovery system isn't working for folks whose accounts didn't survive the upgrade. Someone might want to look into that.

So, it's a year and a half later. I reinstalled FreeBSD 8.3 on the box and things worked. Later, I updated several times, currently at 8.4-RELEASE-p14 and haven't had any problems ... until today.

I went to SSH in to the same box, and experienced exactly the same problem I had a year and a half ago. Exactly. sshd(8) and sendmail(8) are using 100% CPU, the ssh(1) session hangs before providing the password prompt, and the box is otherwise acting as a network gateway just fine. I don't know if it would still run X okay, as I didn't install it this time around. Before today, the box had been up for 55 days without an obvious problem, although I rarely need to log into it, doing so mostly to perform periodic updates. It has probably been in this state for some time, but the problem did not occur at all before the reboot 55 days ago, which was probably to do a -RELEASE kernel update. A reboot today did not fix things.

ssh -vvv output:

Code:
2$ ssh -vvv athena
OpenSSH_5.9p1, OpenSSL 0.9.8y 5 Feb 2013
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to athena [192.168.192.1] port 22.
debug1: Connection established.
debug3: Incorrect RSA1 identifier
debug3: Could not load "/Users/npc/.ssh/id_rsa" as a RSA1 public key
debug1: identity file /Users/npc/.ssh/id_rsa type 1
debug1: identity file /Users/npc/.ssh/id_rsa-cert type -1
debug1: identity file /Users/npc/.ssh/id_dsa type -1
debug1: identity file /Users/npc/.ssh/id_dsa-cert type -1

I can confidently say that there are no local networking problems. I do not believe the box has been hacked. Whatever is causing the problem existed with FreeBSD 7, didn't occur for a while after upgrade, but does occur, at least under some circumstances, with FreeBSD 8.4-RELEASE-p14.

sendmail(8) and sshd(8) (and telnetd(8)) are trying to do *something* that, on occasion, gives them fits. It is something that other processes, including ntpd(8), isc-dhcpd, named, apache, and natd(8), don't do. Each of the offending processes is churning CPU, but a truss of the PID shows them doing nothing, presumably waiting for something. It does not appear to be a DNS problem; host(1) and nslookup(1) running on the host return correct results (from named running on localhost) immediately. I can ssh out from the box with no problems, both to the inner network and the Internet. If I truss sshd -D, the last thing it does before it seems to seize is a successful seteuid(). There is no indication of any problem in any log. The offending processes are all killable.
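
The ssh -vvv output above appears to stop before the point where the client would normally report the remote protocol version, which suggests the TCP connection is accepted but the server never sends its greeting. One way to check that from any client, independent of the daemon's own logging, is a small banner probe. This is a rough sketch only: `read_banner` is a made-up name, and the 5 second timeout is an arbitrary assumption.

```python
import socket


def read_banner(host, port, timeout=5.0):
    """Connect to host:port and return the first line the server sends.

    Returns None if the TCP connection succeeds but no greeting arrives
    within `timeout` seconds (or the server closes without sending one).
    Raises OSError if the connection itself cannot be established.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            return None  # connected, but the daemon stayed silent
    if not data:
        return None  # server closed the connection without a greeting
    first_line = data.split(b"\n", 1)[0]
    return first_line.decode("ascii", errors="replace").strip()
```

Against a healthy sshd, `read_banner("athena", 22)` should return a line starting with `SSH-`, and port 25 on a healthy sendmail should answer with a `220` greeting. A result of None on both would match the symptom described here: the connection is established, then nothing.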

If it's hardware, it's not always a problem. It's not confined to one BSD version. The problems seem to be confined to a small number of networking applications. It may be an IP stack interaction with hardware. I don't know what the trigger is, why I could SSH in 60 days ago but can't now, and I can't imagine what it might be. Anyone have any ideas? In the absence of anything good, I could try backing out to 8.4-RELEASE-p{<14}, but I don't want to run on a version with known security issues, and if that doesn't work, I think my next step has to be to install Linux on it.

Any ideas? What could reasonably explain the time gap in issues other than an intermittent hardware problem or a regression in the kernel?
 
When you say it happens on "multiple BSD versions", do you mean multiple versions of FreeBSD, or other versions like OpenBSD?

Why are you running telnetd(8)?

That it goes bad and then stays bad after a reboot suggests something permanent, and I'd be worried about running telnetd(8) or other security concerns.

Do you have a backup from before it went bad?
 
wblock@ said:
When you say it happens on "multiple BSD versions", do you mean multiple versions of FreeBSD, or other versions like OpenBSD?

I was imprecise. I figured this would be clear from the context. I know that this problem will occur on FreeBSD 7.4 and FreeBSD 8.4-RELEASE-p14. I don't know for a fact that it occurs on any other kernel version, but this seems oddly specific, unless someone can point to something in 8.4-RELEASE-p14 that didn't occur in earlier 8.4 and 8.3 versions that could be a regression.

Why are you running telnetd(8)?

Because sshd isn't working, and I wanted to see if I could get into the damn box. It's not something I do normally. I am 100% certain that this is not the problem.

That it goes bad and then stays bad after a reboot suggests something permanent, and I'd be worried about running telnetd(8) or other security concerns.

It went bad before I started telnetd under inetd.

Do you have a backup from before it went bad?

Yes, but that would require reverting to an unsafe kernel version. I could do that as a test, but I'm not sure what it would reveal. Seems like a lot of work for little new information.
 
Restoring from a backup to another drive might allow comparisons to figure out what has changed. Or install something like security/tripwire before anything changes.
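
The comparison suggested above, a restored backup on a second drive against the live system, could be sketched like this. It is an illustration under assumptions, not a replacement for tripwire: `diff_trees` is a made-up name, both trees are assumed to be mounted and readable, and only regular files are compared, by SHA-256.

```python
import hashlib
import os


def _digest(path):
    """SHA-256 hex digest of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def _regular_files(root):
    """Map each regular file's path relative to `root` to its full path."""
    found = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            # Skip symlinks, sockets, devices and other special files.
            if os.path.isfile(full) and not os.path.islink(full):
                found[os.path.relpath(full, root)] = full
    return found


def diff_trees(old_root, new_root):
    """Report files that differ or exist on only one side."""
    old, new = _regular_files(old_root), _regular_files(new_root)
    common = old.keys() & new.keys()
    return {
        "changed": sorted(p for p in common if _digest(old[p]) != _digest(new[p])),
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
    }
```

Pointing it at something small first, such as the restored /etc against the live /etc, keeps the output readable. Log and spool directories will always differ and are best left out.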

Or if it really seems like a bug, a restored backup could be immediately upgraded to FreeBSD 9 or 10 while still running the unchanged ports.
 
wblock@ said:
Restoring from a backup to another drive might allow comparisons to figure out what has changed. Or install something like security/tripwire before anything changes.

I'm familiar with tripwire, I've been using it since the 90s. I'm the only one with access to the box. What has changed since a known good state? I've done a buildworld/installworld upgrading from p13 to p14 and updated a few ports, none of which are exhibiting problems. That's it. If a software change caused this, it's in the FreeBSD kernel, which is why I mentioned the possibility, or it's the freakiest hacker in history who is world class at covering his tracks, but makes sloppy changes to my machine that are unique on the entire Internet.

Or if it really seems like a bug, a restored backup could be immediately upgraded to FreeBSD 9 or 10 while still running the unchanged ports.

It's entirely possible that I could upgrade to 9 right now, although I don't know that for sure. However, we know that an upgrade from 7.4 to 8.4 was a temporary fix. It doesn't seem like a good investment to upgrade to 9 or 10 without having some reason to think this won't recur, right?

I appreciate the attempt to help, but this set of advice is not advancing my position. If you have some suggestions as to what the cause might be, or if you have some suggestions on how to diagnose or repair the problem, I'd like to hear them. I don't really need anyone to tell me what my upgrade options are. Thanks, but I don't need any assistance evaluating those.
 