No reply to SSH login or SSH stops responding (deadlock?)

I was wondering if I should start a new thread, or revive a 2 years old thread that has a similar issue, but not exactly the same… Here's to a new thread based on :
https://forums.freebsd.org/threads/49411/

I have two PCs with each a static IP4 connected to the Internet via a dynamic IP router. The local PC (FreeBSD 10.1-RELEASE) knows the IP4 of the remote PC (FreeBSD 10.3-RELEASE) through dynamic DNS scripts (that I wrote, and will gladly share with you when I take the time to clean them up).

The local PC can connect to the remote PC via ssh(1) without any problem, even for days/weeks without requiring the need for the remote PC from being rebooted. SSH connections can be up to a few hours, then I close them using CTRL-D when I'm done (so less files are opened should a power outage occur). I do kill the processes that tunnel TCP ports via SSH that I created because they no longer work when I come back at a later time. My Internet connection is not permanent.

The problem I encounter happened twice already, and I would like fix it or work around it. In the SSH shell, I click the up-arrow to recall an older command that I typed, then SSH becomes unresponsive after I type ENTER to execute my selected command. No SSH logon is possible after that. Surprisingly, pinging the remote PC on its own LAN works.

Code:
% ssh -vvv -C -4 ruser@1.2.3.4
OpenSSH_6.6.1p1, OpenSSL 1.0.1j-freebsd 15 Oct 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug3: ciphers ok: [snip]
debug2: ssh_connect: needpriv 0
debug1: Connecting to 1.2.3.4 [1.2.3.4] port 2000.
debug1: Connection established.
debug1: identity file /home/luser/.ssh/id_rsa type -1
debug1: identity file /home/luser/.ssh/id_rsa-cert type -1
debug1: identity file /home/luser/.ssh/id_dsa type -1
debug1: identity file /home/luser/.ssh/id_dsa-cert type -1
debug1: identity file /home/luser/.ssh/id_ecdsa type -1
debug1: identity file /home/luser/.ssh/id_ecdsa-cert type -1
debug1: identity file /home/luser/.ssh/id_ed25519 type -1
debug1: identity file /home/luser/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.6.1_hpn13v11 FreeBSD-20140420

Code:
% telnet -4 1.2.3.4 2000
Trying 1.2.3.4...
Connected to abc.def.com.
Escape character is '^]'.

To my knowledge, /var/log/messages and /var/log/security didn't have anything out of the ordinary on the first occurrence of the problem, after I rebooted the remote PC. I'll also check /var/log/debug.log and /var/log/auth.log this time just in case.

I'm considering setting LogLevel to DEBUG3 temporarily (for a few days until the problem occurs again) in sshd_config. This setting is not recommended for user privacy. What does that mean? Too much info in log files? If so, I can always delete the log files after resetting this option to default.

A long time ago:
The host may be having an issue. I've had this happen on a few heavily loaded servers. You'd get the banner but everything else just stalls. It could be disk related, if one drive has failed for example, the system would block everything and waits indefinitely for the disk to respond to I/O. It's also possible the server has run out of memory, and I mean really run out, no memory free and a full swap. That would also result in connections seemingly hanging.

Perhaps I could check for some memory leaks by SSH itself and do a preventing stop/start SSH or a reboot of the PC when the problem is about to happen? What should I check for such issues? ps(1)?

Anything else?

Dominique.
 
I fear that this problem is going to be very hard to isolate, as I don't have a sure-bet procedure to reproduce the problem easily. It seems more and more probable to be a deadlock problem, perhaps between OpenSSH, the TCP/IP stack and the kernel. It could even be hardware related. Who knows at this point?

Once on location, ssh(1) would still not work, but ping(8) requests are answered. I plugged in a VGA monitor, USB mouse and USB keyboard into the remote PC. I didn't find any out-of-the-ordinary status messages on the console that I know of. The status of newly connected USB devices are also displayed, so I know I can investigate further.

ttyv0 lets me login to ruser, good! I then switch to root to check the log files. Again, I didn't find any out-of-the-ordinary messages in /var/log/auth.log, /var/log/debug.log, /var/log/messages and /var/log/security. However, I noticed a huge gap in the logs, no message during the time ssh(1) would not execute, as if the PC had stopped running (except for ping(8) still able to run). I do have a crontab(1) script scheduled as ruser, accessing the web, and it too never ran during the time ssh(1) would not execute or it tried but got a fetch(1) error message.

I notice that the geli(8) partition is still present in the /dev tree, and mounted, so I gracefully terminate the app that uses it, a downloader having seemingly failed tasks midway. I umount(8) the partition afterward.

Then, I try ssh(1), AND IT WORKS!!! What the??? This sure looks like a deadlock to me. The next time that the problem occurs, I'm going to check ssh(1) again at every step that I do to see when the deadlock is eliminated. That should be a good clue. I also set the sshd(8) LogLevel to DEBUG3 for the next time the problem happens again. This little one is so log-hungry. It's unreal! Hopefully it will help as well.

Anything else to help isolate the problem? Seems like a bug report is needed in this case, but I'm not even sure how to describe it.

Dominique.
 
I got the same deadlock problem on September 18. This time, I couldn't log in on the console. The ALT-Fnum keys worked to switch to a different virtual terminal, but no keyboard input and neither visual echo of it when trying to log in. ping(8) on a different computer did get replies. How strange! All I could do is reboot the computer.

I saved all my log files for further reference, but I don't notice anything out of the ordinary except the fact that all FreeBSD activities seemed to have frozen between the moment that the deadlock occurred and the system reboot (all, except ping(8) and ALT-Fnum). I have a cron task scheduled hourly under root (fetchweb), and it didn't run either. I added the *** REBOOT *** separator in the logs below for clarity. Port 2000 is SSH and port 5000 is VNC.

/var/log/auth.log
Code:
Sep 17 22:32:18 BSD1234 sshd[16796]: reverse mapping checking getaddrinfo for BSD8765.my.isp.com [8.7.6.5] failed - POSSIBLE BREAK-IN ATTEMPT!
Sep 17 22:32:18 BSD1234 sshd[16796]: Postponed keyboard-interactive for ruser from 8.7.6.5 port 63829 ssh2 [preauth]
Sep 17 22:32:21 BSD1234 sshd[16796]: Postponed keyboard-interactive/pam for ruser from 8.7.6.5 port 63829 ssh2 [preauth]
Sep 17 22:32:21 BSD1234 sshd[16796]: Accepted keyboard-interactive/pam for ruser from 8.7.6.5 port 63829 ssh2
Sep 17 22:32:21 BSD1234 sshd[16796]: User child is on pid 16799
Sep 17 22:32:24 BSD1234 sshd[16800]: Connection from 8.7.6.5 port 63830 on 192.168.1.10 port 2000
Sep 17 22:32:25 BSD1234 sshd[16800]: reverse mapping checking getaddrinfo for BSD8765.my.isp.com [8.7.6.5] failed - POSSIBLE BREAK-IN ATTEMPT!
Sep 17 22:32:25 BSD1234 sshd[16800]: Postponed keyboard-interactive for ruser from 8.7.6.5 port 63830 ssh2 [preauth]
Sep 17 22:32:28 BSD1234 sshd[16800]: Postponed keyboard-interactive/pam for ruser from 8.7.6.5 port 63830 ssh2 [preauth]
Sep 17 22:32:28 BSD1234 sshd[16800]: Accepted keyboard-interactive/pam for ruser from 8.7.6.5 port 63830 ssh2
Sep 17 22:32:28 BSD1234 sshd[16800]: User child is on pid 16803
Sep 17 22:32:28 BSD1234 sshd[16803]: Starting session: shell on pts/1 for ruser from 8.7.6.5 port 63830 id 0

*** REBOOT ***

Sep 18 23:34:31 BSD1234 sshd[640]: Server listening on :: port 2000.
Sep 18 23:34:31 BSD1234 sshd[640]: Server listening on 0.0.0.0 port 2000.
Sep 18 23:34:44 BSD1234 login: login on ttyv0 as root
Sep 18 23:34:44 BSD1234 login: ROOT LOGIN (root) ON ttyv0

/var/log/cron
Code:
Sep 18 01:50:00 BSD1234 /usr/sbin/cron[17367]: (root) CMD (/usr/libexec/atrun)
Sep 18 01:55:00 BSD1234 /usr/sbin/cron[17372]: (operator) CMD (/usr/libexec/save-entropy)
Sep 18 01:55:00 BSD1234 /usr/sbin/cron[17373]: (root) CMD (/usr/libexec/atrun)
Sep 18 02:00:00 BSD1234 /usr/sbin/cron[17387]: (root) CMD (newsyslog)
Sep 18 02:00:00 BSD1234 /usr/sbin/cron[17389]: (root) CMD (/usr/libexec/atrun)
Sep 18 02:00:00 BSD1234 /usr/sbin/cron[17388]: (operator) CMD (/usr/libexec/save-entropy)

*** REBOOT ***

Sep 18 23:34:32 BSD1234 /usr/sbin/cron[654]: (root) CMD (fetchweb)
Sep 18 23:38:44 BSD1234 /usr/sbin/cron[635]: (root) CMD (fetchweb)
Sep 18 23:40:00 BSD1234 /usr/sbin/cron[718]: (root) CMD (/usr/libexec/atrun)
Sep 18 23:44:00 BSD1234 /usr/sbin/cron[725]: (operator) CMD (/usr/libexec/save-entropy)
Sep 18 23:45:00 BSD1234 /usr/sbin/cron[729]: (root) CMD (/usr/libexec/atrun)

/var/log/debug.log
Code:
Sep 18 00:55:02 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 100643
Sep 18 00:55:10 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 99686
Sep 18 00:55:14 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 99458
Sep 18 00:55:31 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 110010
Sep 18 00:55:34 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 99157
Sep 18 00:55:51 BSD1234 sshd[16799]: debug2: channel 0: rcvd adjust 100254
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: receive packet: type 96
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: rcvd eof
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: output open -> drain
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: obuf empty
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: close_write
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: output drain -> closed
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: read<=0 rfd 7 len 0
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: read failed
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: close_read
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: input open -> drain
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: ibuf empty
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: send eof
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: send packet: type 96
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: input drain -> closed
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: send close
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: send packet: type 97
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: channel 0: will not send data after close
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: receive packet: type 97
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: rcvd close
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: channel 0: will not send data after close
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: is dead
Sep 18 00:55:56 BSD1234 sshd[16799]: debug2: channel 0: garbage collecting
Sep 18 00:55:56 BSD1234 sshd[16799]: debug1: channel 0: free: direct-tcpip, nchannels 1
Sep 18 00:55:56 BSD1234 sshd[16799]: debug3: channel 0: status: The following connections are open:\r\n  #0 direct-tcpip (t4 r1 i3/0 o3/0 fd 7/7 cc -1)\r\n
Sep 18 01:11:31 BSD1234 sshd[16799]: debug3: receive packet: type 90
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: server_input_channel_open: ctype direct-tcpip rchan 1 win 2097152 max 32768
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: server_request_direct_tcpip: originator 127.0.0.1 port 33524, target 192.168.1.10 port 5000
Sep 18 01:11:31 BSD1234 sshd[16799]: debug2: fd 7 setting O_NONBLOCK
Sep 18 01:11:31 BSD1234 sshd[16799]: debug2: fd 7 setting TCP_NODELAY
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: connect_next: host 192.168.1.10 ([192.168.1.10]:5000) in progress, fd=7
Sep 18 01:11:31 BSD1234 sshd[16799]: debug3: fd 7 is O_NONBLOCK
Sep 18 01:11:31 BSD1234 sshd[16799]: debug3: fd 7 is O_NONBLOCK
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: channel 0: new [direct-tcpip]
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: server_input_channel_open: confirm direct-tcpip
Sep 18 01:11:31 BSD1234 sshd[16799]: debug3: channel 0: waiting for connection
Sep 18 01:11:31 BSD1234 sshd[16799]: debug1: channel 0: connected to 192.168.1.10 port 5000
Sep 18 01:11:31 BSD1234 sshd[16799]: debug3: send packet: type 91
Sep 18 01:11:50 BSD1234 sshd[16799]: debug3: receive packet: type 96
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: rcvd eof
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: output open -> drain
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: obuf empty
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: close_write
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: output drain -> closed
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: read<=0 rfd 7 len 0
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: read failed
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: close_read
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: input open -> drain
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: ibuf empty
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: send eof
Sep 18 01:11:50 BSD1234 sshd[16799]: debug3: send packet: type 96
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: input drain -> closed
Sep 18 01:11:50 BSD1234 sshd[16799]: debug2: channel 0: send close
Sep 18 01:11:50 BSD1234 sshd[16799]: debug3: send packet: type 97
Sep 18 01:11:50 BSD1234 sshd[16799]: debug3: channel 0: will not send data after close
Sep 18 01:11:51 BSD1234 sshd[16799]: debug3: receive packet: type 97
Sep 18 01:11:51 BSD1234 sshd[16799]: debug2: channel 0: rcvd close
Sep 18 01:11:51 BSD1234 sshd[16799]: debug3: channel 0: will not send data after close
Sep 18 01:11:51 BSD1234 sshd[16799]: debug2: channel 0: is dead
Sep 18 01:11:51 BSD1234 sshd[16799]: debug2: channel 0: garbage collecting
Sep 18 01:11:51 BSD1234 sshd[16799]: debug1: channel 0: free: direct-tcpip, nchannels 1
Sep 18 01:11:51 BSD1234 sshd[16799]: debug3: channel 0: status: The following connections are open:\r\n  #0 direct-tcpip (t4 r1 i3/0 o3/0 fd 7/7 cc -1)\r\n

*** REBOOT ***

Sep 18 23:34:31 BSD1234 sshd[640]: debug2: fd 3 setting O_NONBLOCK
Sep 18 23:34:31 BSD1234 sshd[640]: debug3: Fssh_sock_set_v6only: set socket 3 IPV6_V6ONLY
Sep 18 23:34:31 BSD1234 sshd[640]: debug1: Bind to port 2000 on ::.
Sep 18 23:34:31 BSD1234 sshd[640]: debug1: Server TCP RWIN socket size: 65536
Sep 18 23:34:31 BSD1234 sshd[640]: debug2: fd 4 setting O_NONBLOCK
Sep 18 23:34:31 BSD1234 sshd[640]: debug1: Bind to port 2000 on 0.0.0.0.
Sep 18 23:34:31 BSD1234 sshd[640]: debug1: Server TCP RWIN socket size: 65536

/var/log/maillog.0
Code:
Sep 18 01:05:08 BSD1234 sendmail[17291]: u8I558kA017291: from=root, size=249, class=0, nrcpts=1, msgid=<201609180505.u8I558kA017291@BSD1234>, relay=root@localhost
Sep 18 01:05:08 BSD1234 sm-mta[17292]: STARTTLS=server, relay=localhost [127.0.0.1], version=TLSv1.2, verify=NO, cipher=DHE-RSA-AES256-GCM-SHA384, bits=256/256
Sep 18 01:05:08 BSD1234 sendmail[17291]: STARTTLS=client, relay=[127.0.0.1], version=TLSv1.2, verify=FAIL, cipher=DHE-RSA-AES256-GCM-SHA384, bits=256/256
Sep 18 01:05:09 BSD1234 sm-mta[17292]: u8I558Z7017292: from=<root@BSD1234>, size=490, class=0, nrcpts=1, msgid=<201609180505.u8I558kA017291@BSD1234>, proto=ESMTPS, daemon=Daemon0, relay=localhost [127.0.0.1]
Sep 18 01:05:09 BSD1234 sendmail[17291]: u8I558kA017291: to=root, ctladdr=root (0/0), delay=00:00:01, xdelay=00:00:01, mailer=relay, pri=30249, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u8I558Z7017292 Message accepted for delivery)
Sep 18 01:05:09 BSD1234 sm-mta[17293]: u8I558Z7017292: to=<root@BSD1234>, ctladdr=<root@BSD1234> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=local, pri=30768, relay=local, dsn=2.0.0, stat=Sent

*** REBOOT ***

Sep 18 23:34:32 BSD1234 sm-mta[643]: starting daemon (8.15.2): SMTP+queueing@00:30:00
Sep 18 23:34:32 BSD1234 sm-mta[643]: STARTTLS=server: file /etc/mail/certs/dh.param unsafe: No such file or directory
Sep 18 23:34:32 BSD1234 sm-msp-queue[646]: starting daemon (8.15.2): queueing@00:30:00

/var/log/messages
Code:
Sep 17 18:58:59 BSD1234 kernel: GEOM_ELI: Device ada0p4.eli created.
Sep 17 18:58:59 BSD1234 kernel: GEOM_ELI: Encryption: XOR
Sep 17 18:58:59 BSD1234 kernel: GEOM_ELI:     Crypto: software

*** REBOOT ***

Sep 18 23:34:31 BSD1234 syslogd: kernel boot file is /boot/kernel/kernel
Sep 18 23:34:31 BSD1234 kernel: Copyright (c) 1992-2016 The FreeBSD Project.
Sep 18 23:34:31 BSD1234 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Sep 18 23:34:31 BSD1234 kernel: The Regents of the University of California. All rights reserved.
Sep 18 23:34:31 BSD1234 kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Sep 18 23:34:31 BSD1234 kernel: FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016

/var/log/security
Code:
Nothing since July 30th

How can I further isolate the problem to get it fixed? :confused:

Dominique.
 
Back
Top