TIME_WAIT Assassination in FreeBSD?

Hi All,

In my day-to-day work I have just run into a particular issue. Given that it sits at the foundations of the TCP stack, I am obviously out of my depth here. I would like to know if anyone can tell me:

1. The best or most correct solution to my problem (this is not a FreeBSD-related question)? Do I need TIME_WAIT assassination? Or a better port randomisation algorithm? Is it a problem with the Windows OS?
2. Is this even a problem in the FTP stack? Is it just bad design in active mode FTP? Will other protocols be impacted (a big concern of mine)?
3. Does FreeBSD handle this situation? If so, how?

MY PROBLEM
-------------------------------------------------
We have an AIX system which performs a number of FTP (active mode) mget & mdelete operations against a Windows 2003 R2 SP2 x64 FTP server (IIS). This all worked fine until we upgraded the AIX system from 5.3 TL8 to 6.1 TL6.

After the upgrade, FTP would hang forever for no apparent reason. We found that two things had combined to cause this, and that the first had exposed the second (a bug).

1. As of AIX 6.1 TL6 & AIX 7.1, ephemeral port allocation by the TCP/IP stack is random instead of linear (for security). See ---> IBM IZ73313: TCP/IP VULNERABILITY PROTECTION PHASE 1

2. The AIX FTP implementation was busted. IBM provided fixes so that, on receipt of FTP response 425, the client retries the operation for a default of 90 seconds and then continues on if it still gets no incoming connection on the PORT it gave the server.

So now that we have the FTP fixes for item 2 (above), everything works, you ask? Well, kind of. FTP no longer hangs, which is good. But we still see problems when AIX decides to use the same ephemeral port twice in close succession.

For example.
Step 1. AIX client sends FTP server request for data connection on port 38672.
Step 2. Windows FTP server opens a data connection to 38672, does its business & closes the connection. This leaves the socket/port on AIX in the CLOSED state. However, on the Windows server the port is held in the TIME_WAIT state for 2MSL (default 120 seconds, minimum configurable 30 seconds).
Step 3. AIX client sends FTP server another request for data connection on port 38672 (reusable as it was CLOSED). Windows server responds with 425 because the port cannot be used. It is still in TIME_WAIT and is likely to stay in that state longer than 90 seconds (the default period FTP is retrying for).
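
To make the three steps above concrete, here is a rough toy model in Python. It is only a sketch, not real socket code: the class and function names are made up for illustration, the IP addresses are placeholders, and it assumes the server always opens the data connection from port 20 and refuses to reuse a 4-tuple until 2MSL has passed. The 5 second retry interval is also an assumption; only the 90 second retry window comes from the IBM fix described above.

```python
# Toy model of the server-side TIME_WAIT table (illustration only).
SERVER_DATA_PORT = 20  # assumed source port for active mode data connections


class ToyFtpServer:
    def __init__(self, time_wait_secs):
        self.time_wait_secs = time_wait_secs
        self.time_wait_until = {}  # 4-tuple -> time at which TIME_WAIT expires

    def open_data_connection(self, now, server_ip, client_ip, client_port):
        """Return 150 if the data connection can be opened, else 425."""
        key = (server_ip, SERVER_DATA_PORT, client_ip, client_port)
        if now < self.time_wait_until.get(key, float("-inf")):
            return 425  # same 4-tuple is still in TIME_WAIT on the server
        # Transfer runs and the server closes first, so it holds TIME_WAIT.
        self.time_wait_until[key] = now + self.time_wait_secs
        return 150


def client_transfer(server, start, client_port, retry_secs=90, step=5):
    """Retry on 425 every `step` seconds for up to `retry_secs` seconds.

    Returns (reply_code, seconds_waited).
    """
    waited = 0
    while True:
        code = server.open_data_connection(
            start + waited, "10.0.0.2", "10.0.0.1", client_port
        )
        if code != 425 or waited >= retry_secs:
            return code, waited
        waited += step


if __name__ == "__main__":
    server = ToyFtpServer(time_wait_secs=120)
    print(client_transfer(server, start=0, client_port=38672))   # (150, 0)
    print(client_transfer(server, start=10, client_port=38672))  # (425, 90)
```

With time_wait_secs=30 instead of 120, the second call in this model returns (150, 20): the transfer goes through, but only after the client has sat waiting for the old TIME_WAIT entry to expire.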

The only thing I can find that would seem to address this circumstance is
RFC1122 TCP TIME_WAIT Assassination

But it seems to have reported issues
RFC1337 TIME-WAIT Assassination Hazards in TCP
and here
IETF Internet Draft - Problems with TCP Connections Terminated by RSTs or Timer
draft-heavens-problems-rsts-00.txt - Ian Heavens - July 1995


I'm not sure how keen IBM is to change their TCP stack for a small fry like me. They seem to want me to just reduce the 2MSL on the Windows side. But I don't like this because
1. I will still have a delay sometimes, from 0-30 seconds (minimum 2MSL for Windows OS).
2. I don't know if other protocols will be or are being impacted.

**EDIT**
This looks interesting too.
RFC 6056 - Recommendations for Transport-Protocol Port Randomization
I was initially thinking along the same lines: why go to the trouble of assassination if you can just avoid the collision in the first place? Perhaps the randomisation algorithm IBM is using could be better. I have no idea what they do, though. I do know that our server is not busy (heaps of free ports) and yet the ports seem to get re-used quickly!
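
For a rough feel of how often a repeat should happen with purely uniform random selection, here is a small back-of-the-envelope calculation in Python. This is a sketch under assumptions: I don't know what algorithm IBM actually uses, the function name is made up, and the range of 32768 ports (32768-65535) is assumed rather than checked on our system.

```python
def any_repeat_probability(transfers, port_range_size):
    """Chance that at least two of `transfers` uniformly random port picks
    coincide (the classic birthday calculation)."""
    p_all_distinct = 1.0
    for i in range(transfers):
        p_all_distinct *= (port_range_size - i) / port_range_size
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    ports = 65535 - 32768 + 1  # assumed ephemeral range size
    for transfers in (10, 50, 100, 200):
        p = any_repeat_probability(transfers, ports)
        print(f"{transfers:4d} transfers within one 2MSL window: {p:.1%}")
```

Under these assumptions a repeat within one TIME_WAIT window is not rare once a batch job does on the order of a hundred transfers, whereas linear allocation would not repeat a port until the range wrapped. If repeats show up far more often than this estimate suggests, that would point at the selection algorithm rather than plain bad luck.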

Mailing List post here
 
I'm curious what you find out about this. I know this is off topic, but we have run into a number of strange behaviors in AIX 6.1 TL6 specifically SP4. We just migrated to FreeBSD for our SFTP service and luckily we have not run into the issues you are describing, but most of the client machines are still on AIX 5.3.

Is it possible for you to switch over to passive FTP on your systems? That might avoid the problem instead of getting IBM to deal with it.
 
Hi Sylgeist,

We have only had experience with 6.1 TL6 SP5 & 6.1 TL4 (not sure of the SP, I think it was 1). We have had only three issues with AIX 6.1:

1. AIX print subsystem (qdaemon) was crashing on busy systems & not getting respawned. This was around your TL (not sure on SP, probably 1). We have since moved all our 6.1 installs to TL06 SP05, which fixed the issue.

2. System V print subsystem seems to get hung in 6.1 TL6 SP5. The lpNet log starts writing errors
"ERROR: class=NonFatal, type=Internal, trace=(listenBSDEvent), Cannot map address to remote machine name."
Simply restarting the print system 'lpsched' fixes it. Only one of our systems uses the System V print subsystem. However, it is annoying and we have a support call logged with IBM to look at it.

3. This issue with FTP, in which active FTP hangs due to port randomization. IBM provided an efix for this; they say it will be included in the next release, AIX 6.1 TL6 SP6. As I stated, though, it only fixes FTP hanging. The FTP connectivity issue remains. However, this would seem to be the norm for any active FTP client on an OS with ephemeral port randomization (as my FreeBSD test showed). I don't know that this can be fixed, as it would seem to be normal TCP/IP behavior & I am not sure TIME_WAIT assassination applies or could help. I'm starting to think this is just bad design in active FTP & the only fix is to move to passive FTP. But this is easier said than done. We have a number of existing systems which do active FTP. Updating all our batch scripts is painful & time consuming. Not to mention we have a large perl script acting as an FTP transport system which relies on an old/deprecated library that does not support passive FTP. Fixing that alone will be a pain. However, this sounds like my only avenue.

So I think I am going to try compiling a new FTP client for AIX like the one I tested on FreeBSD (see mailing list), that is to say an FTP client that will try passive mode first & fail over to active. This means I won't need to update heaps of scripts. However, it won't fix my perl script problem; I'll have to rewrite that script to depend on the newer standard perl FTP library, Net::FTP.
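
Just to illustrate the passive-first, active-fallback idea (in Python, not the C client or the perl script mentioned above), a minimal sketch could look like the following. The helper name nlst_with_fallback is made up; set_pasv, nlst and all_errors are from Python's standard ftplib. It assumes the control connection is still usable after a failed passive attempt, which may not hold for every kind of failure.

```python
import ftplib


def nlst_with_fallback(ftp, pattern=""):
    """Try a name listing in passive mode first, then fall back to active.

    `ftp` is an already logged-in ftplib.FTP object (or anything with the
    same set_pasv/nlst methods).
    """
    last_error = None
    for passive in (True, False):
        ftp.set_pasv(passive)
        try:
            return ftp.nlst(pattern) if pattern else ftp.nlst()
        except ftplib.all_errors as exc:
            last_error = exc  # remember the failure and try the next mode
    raise last_error
```

Usage would be along the lines of creating ftplib.FTP("ftp.example.com") (a placeholder host), calling login(), and then calling nlst_with_fallback(ftp, "*.dat"). The same try-passive-then-active loop could wrap retrbinary or delete for the mget & mdelete style jobs.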

I am still concerned we may hit other issues though with port randomization. I would be extremely happy if someone with better knowledge of TCP/IP and application protocols could tell me whether port randomization will cause any other application issues like the one I am seeing with active ftp.

Regards Jarrod
 