How to clear TCP sockets that the operating system owns or that are orphaned?

eepete · Feb 20, 2024

I upgraded my production server from V 12.x to 13.2-RELEASE-p10, and from PHP 7.x to php81-8.1.27.
This was to match my development server. I've got new server hardware, and will take that and a new development server to V14 in about 2 months.

I had the usual issues with PHP extensions. A server I wrote in PHP that has a TCP port open would create the port, hit a PHP error, and terminate. I got to the point where nothing in the system was causing PHP errors, but then the server would not take connections. The output of sockstat looked like this (lots of lines cut out for clarity and space). Addresses changed to protect the innocent.

Code:

root@mymachine # sockstat -P tcp -s
USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS       PATH STATE   CONN STATE
www      httpd      37880 3  tcp6   *:80                  *:*                                LISTEN
www      httpd      37880 4  tcp4   *:80                  *:*                                LISTEN
?        ?          ?     ?  tcp4   10.55.55.55:111       192.55.55.55:22333                 TIME_WAIT

There were 28 of these; the last one's CONN STATE was CLOSING. There were also a number of active web sockets that were up and running. I could not figure out how to close these down with sockstat, so I decided to reboot with a shutdown -r now.

The ssh connection closed, but the server never went down. Apache was still up. I'm guessing an rc.d script or other shutdown mechanism also could not figure out how to close the TCP connections. I've called the hosting site, and they will power cycle the machine. It's an older machine, hence the new hardware that is in the queue. During a reboot to do the major update on the OS, it tossed a disk, so I'm down to two. Getting the new machine up and installed just went to the top of the priority list.

Any ideas on how to deal with this should it happen again? Any insight into what happened would be appreciated. The server I wrote has been up and running for 3 years with no issues, but it's clear I need to figure out how to handle this situation going forward. Note also that the server file is owned by root; is that why there are the ???? everywhere? Would it be better to have something else own it?

thanks in advance.

VladiBG · Feb 21, 2024

It's a normal condition in the TCP connection termination handshake.
The default wait time before the socket is closed is twice the value of sysctl net.inet.tcp.msl (30000 ms * 2 = 60 sec). Reducing this timeout to 30 sec is only worth trying if you have thousands of TIME_WAIT connections (around 60000) and run into problems where no more connections can be established. In your case, 28 TIME_WAIT connections are nothing.
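
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function names and the 1000 closes per second figure are made up for the example; only the 30000 ms default and the 2 * MSL rule come from the post. The second function is a back-of-the-envelope estimate (close rate multiplied by TIME_WAIT duration) for connections that the server closes first, not a measurement.

```python
def time_wait_seconds(msl_ms: int) -> float:
    """TIME_WAIT lasts twice the MSL; net.inet.tcp.msl is given in milliseconds."""
    return 2 * msl_ms / 1000


def steady_state_time_wait(closes_per_second: float, msl_ms: int) -> float:
    """Rough count of sockets sitting in TIME_WAIT at a constant close rate."""
    return closes_per_second * time_wait_seconds(msl_ms)


if __name__ == "__main__":
    print(time_wait_seconds(30000))             # default MSL -> 60.0
    print(steady_state_time_wait(1000, 30000))  # hypothetical busy server -> 60000.0
    print(time_wait_seconds(15000))             # halved MSL -> 30.0
```

By this estimate it takes something like a thousand closes per second to reach the ~60000 range, which is why 28 entries are not a concern.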

eepete · Feb 21, 2024

VladiBG said:
It's a normal condition in the TCP connection termination handshake.
The default wait time before the socket is closed is twice the value of sysctl net.inet.tcp.msl (30000 ms * 2 = 60 sec). Reducing this timeout to 30 sec is only worth trying if you have thousands of TIME_WAIT connections (around 60000) and run into problems where no more connections can be established. In your case, 28 TIME_WAIT connections are nothing.

Ok, thank you for that information.
The question marks for user, command, PID and FD were confusing to me.
Do you have any ideas on why the shutdown could not occur? Since ssh went down but Apache was still up, there must have been some part of the shutdown process that was unhappy.

VladiBG · Feb 21, 2024

I don't think that those TIME_WAIT connections are causing the issue with your shutdown. The best way is to use IPMI to log in on the console of the server and see at what stage of the reboot it hangs. Most likely it is waiting for some process to close or for the hard disks to sync. Do you have any IPMI access to the server from which you can turn it on/off, reset it, and so on?

SirDice · Feb 21, 2024

Port 111 is rpcbind(8), is there an unresponsive NFS mount?

eepete · Feb 21, 2024

SirDice said:
Port 111 is rpcbind(8), is there an unresponsive NFS mount?

There are no network file systems in use.
The two issues after doing the upgrade to 13.2 and PHP 8 are:

The PEAR Mail extension Mail_mimeDecode is unhappy because I'm calling it statically. PHP 8 makes that illegal now; sometimes it seems that PHP likes to pedantically deprecate functionality that is a common programming idiom.

The only new functionality that kicked in was depositing files via SFTP to a remote site. I saw a good ssh2_connect(), but I did see an error when I did an ssh2_auth_password().
I see that I did not close the connection after I got that error. This was before I did an ssh2_sftp($conn), but I don't know if that is significant or not. This is my first time making any connections to an SFTP server.
Is there a way to see if that connection is still open ?
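
For the "did not close the connection after the error" part, the usual pattern is to put the close on every exit path. Below is a minimal sketch in Python rather than PHP, using a plain TCP socket as a stand-in for the SSH session. `deposit`, `AuthError`, `authenticate` and `upload` are hypothetical names invented for the example. In PHP the cleanup call would presumably be ssh2_disconnect() on the session, but that is an assumption about code not shown here.

```python
import socket


class AuthError(Exception):
    """Hypothetical error raised when the login step fails."""


def deposit(host, port, authenticate, upload):
    """Connect, log in, upload; the socket is closed on every exit path."""
    conn = socket.create_connection((host, port), timeout=10)
    try:
        authenticate(conn)  # may raise AuthError
        upload(conn)        # only reached if the login worked
    finally:
        conn.close()        # runs on success and on any exception
```

With this shape, a failed login still raises to the caller, but the descriptor is released first, so no socket stays behind in the process.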

eepete · Feb 21, 2024

VladiBG said:
I don't think that those TIME_WAIT connections are causing the issue with your shutdown. The best way is to use IPMI to log in on the console of the server and see at what stage of the reboot it hangs. Most likely it is waiting for some process to close or for the hard disks to sync. Do you have any IPMI access to the server from which you can turn it on/off, reset it, and so on?

I do not have IPMI access. I also have an SNPP server I wrote that has been up for 4+ years with no problem, but it keeps giving me errors on creating a new socket. I've turned that server off so on the next boot that won't be a complicating issue. The usual "try to get things down to just one problem at a time" issues.

eepete · Feb 21, 2024

SirDice said:
Port 111 is rpcbind(8), is there an unresponsive NFS mount?

I should have been clearer: addresses and ports have been changed, so the ports 111 and 22333 were bogus.
What seems clear now is that I had both an SNPP server that did not always close a port, and a subroutine that deposits a file onto a 3rd-party SFTP server and did not close a port.
Expanding the logging also showed random sites poking the server. They are logged, and the frequency seems to be increasing.
All this seems to have created a bunch of "orphaned sockets", if there is such a thing, and might explain why the server didn't want to shut down properly. In my SNPP server, I do catch a SIGTERM and shut down any active connections, but that did not take care of the "orphaned sockets".
The "sockstat" output makes more sense now, thanks for the comments on that.
Networking fun....

eepete · Feb 28, 2024

All is working again. Thanks for the heads up on what 'sockstat' does. A lot of clean-up on the various servers I wrote has things working again.
Core lesson: things that behave like servers need to catch signals and shut their connections down. I incorrectly assumed that when you take the system down, connections are just shut down automatically.
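
A minimal sketch of that lesson, in Python rather than the PHP the real servers are written in: the signal handler only sets a flag, and the accept loop closes every client and the listening socket on its way out. `TinyServer` and `install_handlers` are hypothetical names for the example, and this is a single-threaded toy, not the actual SNPP server. A PHP version would presumably go through pcntl_signal(), which is an assumption about how the original code is structured.

```python
import signal
import socket
import threading


class TinyServer:
    """Toy TCP server that tracks its clients so it can close them on exit."""

    def __init__(self, host="127.0.0.1", port=0):
        self.listener = socket.create_server((host, port))
        self.listener.settimeout(0.5)  # wake up regularly to check the flag
        self.clients = set()
        self.stopping = threading.Event()

    def request_stop(self, signum=None, frame=None):
        """Signal handler: only set a flag, the serve loop does the cleanup."""
        self.stopping.set()

    def serve(self):
        try:
            while not self.stopping.is_set():
                try:
                    conn, _addr = self.listener.accept()
                except socket.timeout:
                    continue
                self.clients.add(conn)
        finally:
            self.close_all()

    def close_all(self):
        """Shut down and close every client, then the listening socket."""
        for conn in list(self.clients):
            try:
                conn.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # peer may already be gone
            conn.close()
        self.clients.clear()
        self.listener.close()


def install_handlers(server):
    """Route SIGTERM and SIGINT to the server (call from the main thread)."""
    signal.signal(signal.SIGTERM, server.request_stop)
    signal.signal(signal.SIGINT, server.request_stop)
```

Typical use would be to create the server, call install_handlers(server), then server.serve(). Keeping the handler down to setting a flag avoids doing socket work inside the signal handler itself.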
