Solved: SSH does not work with DHCP (and more fun)

I switched the WLAN interface on my laptop to DHCP (instead of a static address), and since then SSH no longer works.
It gets this far:
Code:
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
and there it hangs forever.

After the WLAN is up and configured, I start a VPN tunnel, and then I open the SSH connections through the VPN tunnel. When I switch the WLAN back to a static IP, everything works again.

This is the difference in the ifconfig output between the two setups:
Code:
<       deftxkey UNDEF AES-CCM 4:128-bit txpower 30 bmiss 7 scanvalid 60
---
>       deftxkey UNDEF AES-CCM 3:128-bit txpower 30 bmiss 7 scanvalid 60

Looking closer...
SSH sends packets into the tunnel but gets no answer:
Code:
13:59:37.628988 IP6 fd00::8101.22 > fd00::1206.52465: Flags [.], ack 39, win 1035, options [nop,nop,TS val 2549388454 ecr 1397418994,nop,nop,sack 1 {1467:2823}], length 0
13:59:37.816797 IP6 fd00::1206.52465 > fd00::8101.22: Flags [.], seq 39:1467, ack 1167, win 1035, options [nop,nop,TS val 1397419694 ecr 2549388454], length 1428
13:59:38.586798 IP6 fd00::1206.52465 > fd00::8101.22: Flags [.], seq 39:1467, ack 1167, win 1035, options [nop,nop,TS val 1397420464 ecr 2549388454], length 1428
13:59:39.926798 IP6 fd00::1206.52465 > fd00::8101.22: Flags [.], seq 39:1467, ack 1167, win 1035, options [nop,nop,TS val 1397421804 ecr 2549388454], length 1428
13:59:42.406794 IP6 fd00::1206.52465 > fd00::8101.22: Flags [.], seq 39:1467, ack 1167, win 1035, options [nop,nop,TS val 1397424284 ecr 2549388454], length 1428
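
A quick size check on the retransmitted segments shows that they exactly fill the 1500-byte MTU of tun0. This is a minimal sketch that assumes no IPv6 extension headers and only the TCP options tcpdump shows (two NOPs plus the 10-byte timestamp option):

```python
IPV6_HEADER = 40         # fixed IPv6 header, assuming no extension headers
TCP_BASE_HEADER = 20     # TCP header without options
NOP_OPTION = 1           # each "nop" in tcpdump's option list is 1 byte
TIMESTAMP_OPTION = 10    # "TS val ... ecr ..." option: kind, length, two 32-bit values

def ipv6_tcp_packet_size(payload_len: int, options_len: int) -> int:
    """Total IPv6 packet size for one TCP segment."""
    return IPV6_HEADER + TCP_BASE_HEADER + options_len + payload_len

# From the capture: "seq 39:1467 ... options [nop,nop,TS ...] ... length 1428"
payload = 1467 - 39
options = 2 * NOP_OPTION + TIMESTAMP_OPTION
print(payload, options, ipv6_tcp_packet_size(payload, options))  # 1428 12 1500
```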

The VPN sends packets onto the WLAN but gets no answer (the WLAN has mtu 1500, so these packets are oversized and get fragmented by the kernel):
Code:
13:59:37.628865 IP 89.163.152.223.5006 > 192.168.96.100.8211: UDP, length 124
13:59:37.816913 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:37.816925 IP 192.168.96.100 > 89.163.152.223: ip-proto-17
13:59:38.587478 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:38.587495 IP 192.168.96.100 > 89.163.152.223: ip-proto-17
13:59:39.927526 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:39.927541 IP 192.168.96.100 > 89.163.152.223: ip-proto-17
13:59:42.407551 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:42.407567 IP 192.168.96.100 > 89.163.152.223: ip-proto-17
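
The "bad length 1540 > 1472" lines can be decoded with a little arithmetic. tcpdump reports the UDP payload length taken from the UDP header (1540), but the first fragment on a 1500-byte link only has room for 1500 - 20 - 8 = 1472 bytes of UDP payload. The second fragment carries no UDP header at all, which is why it only shows up as ip-proto-17 without ports. A minimal sketch, assuming a 20-byte IPv4 header without options:

```python
IPV4_HEADER = 20  # assumption: no IPv4 options
UDP_HEADER = 8

def decode_bad_length(udp_payload_len: int, link_mtu: int) -> tuple[int, int]:
    """Return (full IP datagram size, UDP payload bytes that fit in the first fragment)."""
    datagram = IPV4_HEADER + UDP_HEADER + udp_payload_len
    first_fragment_payload = link_mtu - IPV4_HEADER - UDP_HEADER
    return datagram, first_fragment_payload

# "UDP, bad length 1540 > 1472" on wlan0 with mtu 1500
print(decode_bad_length(1540, 1500))  # (1568, 1472)
```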

The packets do come out at the other end of the WLAN, but the second fragments have disappeared:
Code:
13:59:37.901638 IP 89.163.152.223.5006 > 192.168.96.100.8211: UDP, length 124
13:59:38.093897 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:38.865314 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:40.204777 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472
13:59:42.684859 IP 192.168.96.100.8211 > 89.163.152.223.5006: UDP, bad length 1540 > 1472

It looks as if the WLAN router swallows the fragments.
But when I stop the VPN, run service netif wlan0 stop, start the interface again without the DHCP option (manually configuring the same IP address), and start the VPN again, the fragments do come out of the WLAN and the connection works. This is reproducible.
 
The VPN tunnel should have an MTU of 1492 or lower. That way the encapsulated packets won't exceed 1500. Gigabit networks can support jumbo frames (MTU larger than 1500), but as far as I know Wi-Fi is stuck at 1500. Most consumer-grade (cheap, unmanaged) switches don't support jumbo frames either.
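
Applying that rule to the numbers posted in this thread: a 1500-byte packet from tun0 left the laptop as a 1568-byte IPv4 datagram, i.e. 68 bytes of encapsulation overhead, so on a 1500-byte WLAN the tunnel MTU would have to be 1432 or less to avoid fragmentation. A sketch, assuming the overhead is the same for every packet (with some cipher modes, padding can make it vary by a few bytes):

```python
def max_tunnel_mtu(link_mtu: int, inner_len: int, outer_len: int) -> int:
    """Largest tunnel MTU whose encapsulated packets still fit the link MTU,
    assuming a constant per-packet overhead of outer_len - inner_len."""
    overhead = outer_len - inner_len
    return link_mtu - overhead

# A 1500-byte packet on tun0 became a 1568-byte datagram on wlan0
print(max_tunnel_mtu(1500, 1500, 1568))  # 1432
```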
 
I am not using jumbo frames, only IP fragmentation.

1. SSH sends into tun0, which has mtu=1500, so the packets are 1500 bytes long.
2. OpenVPN encapsulates each packet, and the resulting UDP datagram (including the IP header) is 1568 bytes long.
3. This datagram is now fed into the IP stack again. That is fine; the IP stack can handle sizes up to 16k.
4. Routing directs the datagram to wlan0, which has mtu 1500, so the kernel must fragment it according to RFC 791.
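
Step 4 can be sketched in Python. This follows the RFC 791 rule that every fragment except the last carries a multiple of 8 data bytes (the offset field counts 8-byte units). It assumes a 20-byte IPv4 header without options and is only an illustration, not the FreeBSD kernel's code:

```python
from dataclasses import dataclass

IPV4_HEADER = 20  # assumption: no IPv4 options, so every fragment has a 20-byte header

@dataclass
class Fragment:
    total_length: int     # IPv4 total length of this fragment (header + data)
    offset_units: int     # fragment offset field, counted in 8-byte units
    more_fragments: bool  # MF flag: set on every fragment except the last

def fragment(datagram_len: int, mtu: int, header_len: int = IPV4_HEADER) -> list[Fragment]:
    """Split one IPv4 datagram into fragments that fit the given MTU (RFC 791)."""
    data_left = datagram_len - header_len
    # Non-final fragments must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    offset = 0
    while data_left > 0:
        chunk = min(max_data, data_left)
        data_left -= chunk
        fragments.append(Fragment(header_len + chunk, offset // 8, data_left > 0))
        offset += chunk
    return fragments

# A 1568-byte UDP datagram routed to wlan0 with mtu 1500
for frag in fragment(1568, 1500):
    print(frag)
```

With these numbers the sketch yields a 1500-byte first fragment (offset 0, more-fragments set) and an 88-byte second fragment (offset 185, i.e. byte 1480). The second fragments are the ones that never show up at the far end of the WLAN in the DHCP setup.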

And that does work. Only when wlan0 uses DHCP does it fail: the kernel still fragments, but the subsequent fragments do not get through the WLAN.
But when I send oversized packets with ping directly out of wlan0, they also get fragmented, and these fragments do traverse the WLAN and appear at the other end! So it is not a problem with the WLAN router. Only for packets sent into tun0 do the fragments get lost afterwards in the WLAN.
 
Another difficulty in debugging this is that the VPN can only be activated about twice in a row. The next time it is started, the NAT/port forward at the other end no longer translates the address, for some 15 minutes or until a reboot.

ipfw nat 2 config same_ports unreg_only ip 89.163.152.223 redirect_port udp 192.168.97.17:5005 5006

It seems libalias has a limited use-count. Or a bug.
 
I think I found the problem.
With DHCP, the router not only provides an IP address for the client and itself as the DNS server (which does not work, because it does not even answer DNS requests), but also itself as the default router. And that does not really work either.

The router is configured as a hub: a single subnet altogether. So the correct default route is not the router itself but another host on that subnet. The router itself has the correct default route configured, so things should normally work (and result in an ICMP redirect). In this case, however, the behaviour is slightly different (and not in conformance with any standard): the routing "just works" for everything except fragmented packets, where it just does not work.
 
The libalias NAT problem described above was already recorded earlier as
(and nobody gives a f.d.)
 