CLI hanging over SSH-tunnel

mix_room · Oct 24, 2012

I have two machines connected via an IPsec VPN. They can connect to each other just fine.

The problem is the following: when I run commands, the shell sometimes hangs. It reliably hangs when running tmux a and top; it also hangs when running ls -l /usr/ports/devel.

I can kill the SSH session with <ENTER> ~. and then reconnect; I can also open other connections while one session is hung.

I have been able to find other threads on the internet suggesting things like adjusting MTU-size and increasing time-outs, but those threads don't have a reply.

Does anyone know what the cause of this problem is? It seems to me that all programs which take a long time to return something hang. Can this be the case?

EDIT: Also dies on

Code:

/usr/ports/ftp/wget % make
===>  Switching to root credentials to create /var/db/ports/wget
Password:
===>  Returning to user credentials

and

Code:

% cd /usr/ports/ftp/wget
<USER>@<HOST>:/usr/ports/ftp/wget % su root
Password:
root@<HOST>:/usr/ports/ftp/wget # make

m6tt · Oct 25, 2012

What does a straight file transfer over 10MB look like in wireshark?
Any IDS/IPS inline with the traffic that you're aware of?
Is wireless involved?
ssh configs fairly stock?

Thoughts
------------------------------
MTU is broken somewhere
Link is unreliable
IDS is seeing a certain encrypted output as a shellcode or something
SSH has keepalive options that might behave this way misconfigured

mix_room · Oct 25, 2012

m6tt said:
What does a straight file transfer over 10MB look like in wireshark?

Would need to be set up, but can be done.

Any IDS/IPS inline with the traffic that you're aware of?

No. There are none.

Is wireless involved?

No, wired. The link on one side is quite slow but that shouldn't really matter.

ssh configs fairly stock?

Yes.

m6tt · Oct 26, 2012

Sorry for the double vision there...

Here are some additional thoughts...I was hoping there was a malignant snort.conf attacking "NOOPs" or something.

Is there any possibility you have something with Java that can run the ICSI Netalyzr from the same network segment/switch?
It's pretty good at finding some types of issues.

Another thought might be to try a NIC from a different vendor to rule out hardware issues. You may also want to consider memtest, although people tend to jump at memory a little quickly anymore (memory has gotten a lot better, in my opinion...)

I've seen bad integrated nics do the craziest things, like respond to pings but lock up on file transfers etc.

If the file transfer doesn't go badly, try csup or copying lots of little files over nfs. I've seen bad hardware and bad drivers that needed a certain number of packets before it choked.

Obviously basic networking troubleshooting is worth it too:
1. reboot switches/routers
2. replace all easily replaceable cabling
3. test all hard to replace cabling
4. set all nics to manual speeds and not auto (also less of a problem now, but...)

mix_room · Oct 26, 2012

While I don't want to rule it out, I doubt that there are issues with the hardware, since both hosts run fine when connecting locally or via SSH from the local network. They only seem to go crazy when connecting over the VPN.

I reckon I should try to connect via the VPN from both sides; perhaps that will help me locate the problem.

mix_room · Oct 26, 2012

SCP transfers stall, so I'm guessing the same problem occurs here as well.

I am getting a lot of checksum errors in a packet dump:

Code:

17:38:15.066616 (authentic,confidential): SPI 0x0effe2f3: (tos 0x8, ttl 63, id 38251, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 859a (->869a)!)
...
17:38:16.964631 (authentic,confidential): SPI 0x0effe2f3: (tos 0x8, ttl 63, id 38256, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 8595 (->8695)!)
...
17:38:18.481794 (authentic,confidential): SPI 0x0effe2f3: (tos 0x8, ttl 63, id 38277, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 8580 (->8680)!)
...
17:38:19.366527 (authentic,confidential): SPI 0x0effe2f3: (tos 0x8, ttl 63, id 38281, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 857c (->867c)!)

Could the checksum errors be indicative of something, or do they represent another problem?
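
A note on the checksum values in that dump: in every line the reported checksum is exactly 0x0100 lower than the expected one. In an IPv4 header the TTL occupies the high byte of a 16-bit word, so decrementing the TTL by one without recomputing the checksum produces precisely that offset. That would be consistent with the capture point seeing packets after a TTL decrement but before the checksum is rewritten, though nothing in the thread confirms this. A minimal Python sketch of the arithmetic follows; the source and destination addresses are made up for illustration, the starting TTL of 64 is an assumption, and the other fields are taken from the first line of the dump.

```python
import struct

# (reported, expected) checksum pairs copied from the tcpdump output above.
OBSERVED = [(0x859A, 0x869A), (0x8595, 0x8695), (0x8580, 0x8680), (0x857C, 0x867C)]


def ip_checksum(header: bytes) -> int:
    """IPv4 header checksum: one's complement of the one's complement sum of 16-bit words."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(ttl: int, checksum: int = 0) -> bytes:
    """20-byte IPv4 header: tos 0x8, length 1500, id 38251, DF set, proto TCP.

    The two addresses are hypothetical placeholders, not from the thread.
    """
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0x08, 1500, 38251, 0x4000, ttl, 6, checksum,
        bytes([10, 0, 0, 1]), bytes([10, 0, 1, 1]),
    )


before = ip_checksum(build_header(ttl=64))  # checksum as sent (assumed TTL 64)
after = ip_checksum(build_header(ttl=63))   # checksum that matches TTL 63
print(hex(after - before))
print([hex(expected - reported) for reported, expected in OBSERVED])
```

If this reading is right, the "bad cksum" lines are a side effect of where the capture is taken rather than the cause of the hangs; the `length 1500` and `[DF]` fields in the same lines are the more relevant detail.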

mix_room · Oct 29, 2012

I have now solved the problem.

One of the VPN end-points was on a PPPoE interface with an MTU of 1492. Forcing the WAN interface of the gateway to use this MTU solved the problem; SSH transfers now work.
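
For anyone wondering where 1492 comes from: PPPoE adds a 6-byte PPPoE header plus a 2-byte PPP protocol ID inside each Ethernet frame, so a standard 1500-byte Ethernet MTU leaves 1492 bytes for IP. The 1500-byte packets with DF set seen in the dump earlier cannot fit through such a link. A small Python sketch of that arithmetic (the function names are only illustrative, and IPsec overhead is deliberately left out because it depends on the cipher and mode):

```python
ETHERNET_MTU = 1500
PPPOE_HEADER = 6      # PPPoE session header
PPP_PROTOCOL_ID = 2   # PPP protocol field
IPV4_HEADER = 20      # without options
TCP_HEADER = 20       # without options


def pppoe_mtu(ethernet_mtu: int = ETHERNET_MTU) -> int:
    """IP MTU left on a PPPoE link carried over Ethernet."""
    return ethernet_mtu - PPPOE_HEADER - PPP_PROTOCOL_ID


def fits(packet_len: int, link_mtu: int) -> bool:
    """True if an IP packet of packet_len bytes passes without fragmentation."""
    return packet_len <= link_mtu


def tcp_mss(link_mtu: int) -> int:
    """Largest TCP payload per segment for a given IP MTU (no IP or TCP options)."""
    return link_mtu - IPV4_HEADER - TCP_HEADER


mtu = pppoe_mtu()
print(mtu, fits(1500, mtu), tcp_mss(mtu))
```

Any IPsec encapsulation on top of this lowers the usable size further, which is why small interactive traffic worked while full-size segments from tmux, top or scp stalled.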

throAU · Oct 30, 2012

Check your network is not set to disallow packet fragmentation. I had a similar issue with fragments with Windows machines behind an IPSEC tunnel over ADSL.

Windows was setting the "don't fragment" bit in the TCP packets. Once encrypted with IPsec, the packets became too big to fit down the DSL pipe and were dropped. Smaller packets such as pings worked because they weren't large enough to require fragmenting.

You can perhaps force a smaller MTU on the devices behind your tunnel to ensure they will not send "don't fragment" packets that are too large to fit, or set your router to clear the "df" bit on incoming packets.

edit:
Duh, I see you have fixed the issue already (I really should read the entire thread...). However, looks like the above is the cause. Rather than messing with MTU, you can clear the don't fragment bits, or as you have done, adjust MTU.

m6tt · Oct 31, 2012

throAU said:
You can perhaps force a smaller MTU on the devices behind your tunnel to ensure they will not send "don't fragment" packets that are too large to fit, or set your router to clear the "df" bit on incoming packets.

There are two issues at play. One is the packet size overhead created by IPsec, which might be magnified by PPPoE for DSL, usually yielding an MTU in the 1400s. This can be reduced even further with L2TP/IPsec over DSL.

The other issue is as you said that some hosts have "different" ip stacks, which sometimes is a result of non-compliance (Windows DF) or over-compliance (Windows timestamps...try all the scrub options in PF upstream of a Windows host some time).

So you are correct that you could solve both issues by either reducing the MTU or stripping the DF flag, which allows oversize packets to fragment. I would argue the correct way is to do both: reduce the MTU to the correct value for your connection (there are whitepapers online explaining IPsec overhead and MTU, or you can test with ping), and also strip DF over IPsec tunnels if that doesn't affect your traffic negatively. Fragmented packets of course put more load on firewalls and hosts and, depending on the firewall setup, can do surprising things.
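
The "test with ping" approach can be sketched as a binary search over the ping payload size with the DF bit set. The Python below is only a sketch: `probe` is a hypothetical callable that should return True when a DF ping with that payload gets a reply, and here it is replaced by a simulation so the example runs without a network. It also assumes that if a given size passes, every smaller size passes too.

```python
from typing import Callable

IPV4_HEADER = 20  # without options
ICMP_HEADER = 8   # ICMP echo header


def largest_passing_payload(probe: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Binary search for the largest payload size for which probe() succeeds."""
    if not probe(low):
        raise ValueError("even the smallest probe failed; check basic connectivity")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu_from_payload(payload: int) -> int:
    """Convert an ICMP payload size into the size of the whole IP packet."""
    return payload + ICMP_HEADER + IPV4_HEADER


def simulated_probe(path_mtu: int) -> Callable[[int], bool]:
    """Stand-in for a real DF ping: succeeds only if the whole packet fits."""
    return lambda payload: path_mtu_from_payload(payload) <= path_mtu


payload = largest_passing_payload(simulated_probe(1492))
print(payload, path_mtu_from_payload(payload))
```

On a real link the probe would wrap something like `ping -D -s <payload> -c 1 <host>` on FreeBSD or `ping -M do -s <payload> -c 1 <host>` on Linux and check the exit status. Through an IPsec tunnel the result reflects the usable size after encapsulation, which is the number the inner hosts need.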