Jail fails to connect to main host

Hi,

Ever since I upgraded my server from 10.0 to 10.1, my (one and only) jail has had problems accessing the main host. After some time (it can be hours or days) it fails to establish a connection to the main host. E.g. Roundcube webmail can't access the IMAP server, and an SSH login to the main host hangs or times out. All I've done is migrate from 10.0 to 10.1 and move the jail configuration from rc.conf to jail.conf.

My setup: 192.168.1.1 (server), 192.168.1.5 (jail)
Code:
[jail] ~> fetch -o /dev/null http://www.google.ch
fetch: http://www.google.ch: size of remote file is not known
/dev/null                                               18 kB 1932 kBps 00m00s
[jail] ~> ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=57 time=7.252 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.252/7.252/7.252/0.000 ms
[jail] ~> ssh user@server
ssh_exchange_identification: read: Operation timed out
[jail] ~> truss -af ssh ssh@192.168.1.1
--snip--
47260: socket(PF_INET,SOCK_STREAM,6)             = 3 (0x3)
--snip--
47260: sigaction(SIGPIPE,0x0,{ SIG_DFL 0x0 ss_t }) = 0 (0x0)
47260: sigaction(SIGPIPE,{ SIG_IGN 0x0 ss_t },0x0) = 0 (0x0)
47260: sigaction(SIGCHLD,0x0,{ SIG_DFL 0x0 ss_t }) = 0 (0x0)
47260: sigaction(SIGCHLD,{ 0x40b480 0x0 ss_t },0x0) = 0 (0x0)
47260: write(3,"SSH-2.0-OpenSSH_6.4_hpn13v11 Fre"...,47) = 47 (0x2f)
--hangs until "Operation timed out"--
[jail] ~> uname -a
FreeBSD jail.lan 10.1-STABLE FreeBSD 10.1-STABLE #29: Sun Mar  8 19:57:45 CET 2015     root@server.local.lan:/usr/obj/usr/src/sys/Kernel  amd64
I was thinking that maybe there are too many open connections or that the system has run out of sockets, but both netstat -an and sysctl kern.ipc.numopensockets show only a moderate number of open connections/sockets.

jail.conf:
Code:
allow.raw_sockets = 1;
exec.clean;
exec.system_user = "root";
exec.jail_user = "root";
exec.start += "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.consolelog = "/var/log/jail_${name}_console.log";
mount.devfs;
#mount.procfs;
allow.mount;
allow.set_hostname = 1;
allow.sysvipc = 1;
path = "/jails/${name}";

inet {
        host.hostname = "jail";
        path = "/usr/jails/inet";
        interface = "em0";
        ip4.addr += "em0|192.168.1.5/32";
}
Does anyone have any idea what the problem might be? I'm guessing that something must be wrong with my configuration, as I haven't found anything similar so far... And it must only affect the internal networking between the host and the jail, as the jail can communicate with any other machine just fine.
 
Seems odd. When the connections start to fail, do they ever start to work again without intervention? What do you see if you run tcpdump -i lo0 port 22 to monitor traffic on the local loopback interface while the connections are failing? Are you using a firewall and, if so, are you filtering loopback traffic in any way?
 
Yes, it really is very odd behaviour :( The connections do not work again until the jail is restarted (which isn't always successful; sometimes the restart command hangs). I do not filter my loopback in any way. I've switched back from /etc/jail.conf to /etc/rc.conf for the jail configuration, and the jail has been stable ever since. So I suspect that it may have something to do with the configuration... Will report back with further tests and/or an update. Thanks for the tcpdump command, I didn't know that one can capture traffic between host and jail on the loopback interface!
 
OK, so the jail is showing the strange behaviour again: after some hours or days of running, any connections to the main host are rejected. So it can't be due to the move from rc.conf to jail.conf. I've upgraded to head (aka 11) but alas the problem persists. Another interesting thing I've noticed: some machines on my network suddenly weren't able to connect to the main host (the server, 192.168.1.1) any more. It seemed as if the server somehow thought that those machines (e.g. 192.168.1.32) weren't on the same network any more and tried to reach them via the router. I don't think it was a network problem, as other hosts (or even the same host using a different interface (wireless)) were able to connect just fine. Also, I didn't change anything in my network's layout (the only thing I did was the initial update from 10.0 to 10.1, which is when the problems started). I captured the network traffic between server and jail as suggested, but I honestly can't make much of it: there are repeated SYN packets from the jail to the server which it doesn't answer (no SYN-ACK). Do you have any idea what might be causing the issue? Both tcpdumps are online: http://www.filedropper.com/tcpdumphost and http://www.filedropper.com/tcpdumpjail
 
Ah yes, and what's very strange about it: certain services like DNS work just fine; DNS queries from the jail are readily answered by the server.
 
Might as well post it here to save the download. That does seem odd. Are you using a firewall and, if so, is there an option to skip filtering on the loopback?
Code:
00:00:00.000000 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88212458 ecr 0], length 0
00:00:02.999992 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88215458 ecr 0], length 0
00:00:03.201235 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88218659 ecr 0], length 0
00:00:03.200999 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88221860 ecr 0], length 0
00:00:03.200938 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88225061 ecr 0], length 0
00:00:03.201439 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88228263 ecr 0], length 0
00:00:06.200001 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88234463 ecr 0], length 0
00:00:12.199565 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88246662 ecr 0], length 0
00:00:24.201056 AF IPv4 (2), length 64: 192.168.1.5.15557 > 192.168.1.1.22: Flags [S], seq 2806737257, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 88270863 ecr 0], length 0
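For longer captures, the unanswered SYNs in a dump like the one above can be picked out mechanically. The following is only a minimal sketch, assuming tcpdump's default text output is fed in on stdin; the function name unanswered_syns is made up for this example. It prints each flow that sent SYNs without ever seeing a SYN-ACK come back, followed by the number of attempts.

```shell
#!/bin/sh
# Sketch: list TCP flows whose SYNs never got a SYN-ACK.
# Reads tcpdump text output on stdin. unanswered_syns is a hypothetical helper name.
unanswered_syns() {
    awk '
        # Find the "src > dst:" pair on the current line.
        {
            src = ""; dst = ""
            for (i = 2; i < NF; i++)
                if ($i == ">") { src = $(i - 1); dst = $(i + 1); sub(/:$/, "", dst) }
        }
        src == "" { next }                                # not a packet line
        / Flags \[S\],/   { syn[src " " dst]++ }          # plain SYN
        / Flags \[S\.\],/ { answered[dst " " src] = 1 }   # SYN-ACK for the reverse flow
        END {
            for (flow in syn)
                if (!(flow in answered))
                    print flow, syn[flow]
        }
    '
}

# Possible use on a saved capture (the file name is an assumption):
#   tcpdump -n -r jail.pcap | unanswered_syns
```

Note that this only looks at one capture at a time, so a SYN-ACK that shows up on a different interface than the SYN will still be reported as missing.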
 
Yes, you're right, I could have posted the output directly. I did use pf for NAT; however, the first line of /etc/pf.conf was:
Code:
set skip on lo0
I disabled pf a few days ago, and it didn't change anything... I'm not using any other firewall (well, at least none that I'm aware of :). My current list of enabled services on the main host:
Code:
# service -e
/etc/rc.d/hostid
/etc/rc.d/zvol
/etc/rc.d/hostid_save
/etc/rc.d/zfs
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/netif
/etc/rc.d/devd
/etc/rc.d/pf
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/usr/local/etc/rc.d/slapd
/usr/local/etc/rc.d/named
/usr/local/etc/rc.d/vboxnet
/etc/rc.d/rpcbind
/etc/rc.d/nfsclient
/etc/rc.d/casperd
/etc/rc.d/cleartmp
/etc/rc.d/dmesg
/etc/rc.d/mountd
/etc/rc.d/nfsd
/etc/rc.d/virecover
/etc/rc.d/hcsecd
/etc/rc.d/motd
/etc/rc.d/ntpd
/etc/rc.d/powerd
/usr/local/etc/rc.d/samba
/usr/local/etc/rc.d/vboxwebsrv
/usr/local/etc/rc.d/vboxheadless
/usr/local/etc/rc.d/sa-spamd
/usr/local/etc/rc.d/rsyncd
/usr/local/etc/rc.d/courier-authdaemond
/usr/local/etc/rc.d/postfix
/usr/local/etc/rc.d/munin-node
/usr/local/etc/rc.d/mmserver
/usr/local/etc/rc.d/fetchmail
/usr/local/etc/rc.d/courier-imap-imapd-ssl
/usr/local/etc/rc.d/courier-imap-imapd
/usr/local/etc/rc.d/avgFreq
/etc/rc.d/sshd
/etc/rc.d/cron
/etc/rc.d/jail
/etc/rc.d/mixer
/etc/rc.d/gptboot
/etc/rc.d/bgfsck
What I don't understand is why I only see the answer of the main host (192.168.1.1) to the jail (192.168.1.5) in the tcpdump on lo0 but not the initial request. I did a tcpdump in parallel on my network interface (em0) and was able to capture the requests from the jail to the main host. Is that the way it should be?

Tcpdump from em0:
Code:
cat tcpdump_em0.txt
No. Time       Source      Destination Protocol Length Info
61 6.452788   192.168.1.5 192.168.1.1 TCP      74     28848→22 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=88555059 TSecr=0
62 6.453238   192.168.1.5 192.168.1.1 TCP      74     [TCP Out-Of-Order] 28848→22 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=88555059 TSecr=0
63 6.453294   192.168.1.5 192.168.1.1 TCP      66     28848→22 [ACK] Seq=1 Ack=1 Win=81600 Len=0 TSval=88555060 TSecr=3448578374
64 6.455411   192.168.1.5 192.168.1.1 TCP      115    28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88555062 TSecr=3448578374
69 6.791944   192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] [TCP segment of a reassembled PDU]
70 7.267947   192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88555875 TSecr=3448578374
71 8.015937   192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88556623 TSecr=3448578374
75 9.311935   192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88557919 TSecr=3448578374
77 9.452967   192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 75#1] 28848→22 [ACK] Seq=50 Ack=1 Win=81600 Len=0
116 11.703936  192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88560311 TSecr=3448578374
122 12.452958  192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 116#1] 28848→22 [ACK] Seq=50 Ack=1 Win=81600 Len=0
126 15.452953  192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 116#2] 28848→22 [ACK] Seq=50 Ack=1 Win=81600 Len=0
127 15.807928  192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88564415 TSecr=3448578374
129 16.031573  192.168.1.5 192.168.1.1 TCP      66     28848→22 [FIN, ACK] Seq=50 Ack=1 Win=81600 Len=0 TSval=88564638 TSecr=3448578374
184 23.816955  192.168.1.5 192.168.1.1 TCP      115    [TCP Retransmission] 28848→22 [FIN, PSH, ACK] Seq=1 Ack=1 Win=81600 Len=49 TSval=88572424 TSecr=3448578374
192 25.400955  192.168.1.5 192.168.1.1 TCP      74     60732→22 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=88574008 TSecr=0
194 25.401367  192.168.1.5 192.168.1.1 TCP      74     [TCP Out-Of-Order] 60732→22 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=88574008 TSecr=0
195 25.401425  192.168.1.5 192.168.1.1 TCP      66     60732→22 [ACK] Seq=1 Ack=1 Win=81600 Len=0 TSval=88574008 TSecr=727534457
204 28.403972  192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 195#1] 60732→22 [ACK] Seq=1 Ack=1 Win=81600 Len=0
217 31.342500  192.168.1.5 192.168.1.1 SSH      71     Client: Encrypted packet (len=5)
218 31.405971  192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 217#1] 60732→22 [ACK] Seq=6 Ack=1 Win=81600 Len=0
223 31.679938  192.168.1.5 192.168.1.1 SSH      71     Client: [TCP Retransmission] , Encrypted packet (len=5)
224 32.155945  192.168.1.5 192.168.1.1 SSH      71     Client: [TCP Retransmission] , Encrypted packet (len=5)
227 32.907958  192.168.1.5 192.168.1.1 SSH      71     Client: [TCP Retransmission] , Encrypted packet (len=5)
231 34.211935  192.168.1.5 192.168.1.1 SSH      71     Client: [TCP Retransmission] , Encrypted packet (len=5)
234 34.405972  192.168.1.5 192.168.1.1 TCP      54     [TCP Dup ACK 231#1] 60732→22 [ACK] Seq=6 Ack=1 Win=81600 Len=0
243 36.619943  192.168.1.5 192.168.1.1 SSH      71     Client: [TCP Retransmission] , Encrypted packet (len=5)
Tcpdump from lo0:
Code:
cat tcpdump_lo0.txt
No. Time      Source      Destination  Protocol Length Info
1   0.000000  192.168.1.1 192.168.1.5  TCP      64     22→28848 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=3448578374 TSecr=88555059
2   2.999661  192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→28848 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=3448578374 TSecr=88555059
3   5.999653  192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→28848 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=3448578374 TSecr=88555059
4   8.999647  192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→28848 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=3448578374 TSecr=88555059
5   18.948129 192.168.1.1 192.168.1.5  TCP      64     22→60732 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=727534457 TSecr=88574008
6   21.950659 192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→60732 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=727534457 TSecr=88574008
7   24.952661 192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→60732 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=727534457 TSecr=88574008
8   27.952658 192.168.1.1 192.168.1.5  TCP      64     [TCP Retransmission] 22→60732 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 SACK_PERM=1 TSval=727534457 TSecr=88574008
This is what I did while the tcpdumps were taken:
ssh 192.168.1.1 [CTRL-C after maybe 10 seconds]
telnet 192.168.1.1 22
 
It seemed as if the server somehow thought that those machines (e.g. 192.168.1.32) weren't on the same network any more and tried to reach them via the router.

What I don't understand is why I only see the answer of the main host (192.168.1.1) to the jail (192.168.1.5) in the tcpdump on lo0 but not the initial request. I did a tcpdump in parallel on my network interface (em0) and was able to capture the requests from the jail to the main host. Is that the way it should be?

So both of these seem interesting.

Ah yes, and what's very strange about it: certain services like DNS work just fine; DNS queries from the jail are readily answered by the server.

I'm not sure how to make sense of this comment in the context of the first two I mentioned. However, the first two almost make it look like the route for 192.168.1.0/24, which should be in the routing table as a directly connected subnet, has disappeared. I still don't see the big picture of what the cause is. However, let's either rule this out or confirm it. What does netstat -nr show when the networking is not responsive?
 
Hi junovitch, thank you very much for your answer and for sticking with my problem :) As I can connect to my server via ssh etc. from my jail at the moment, I can't provide you with a netstat -rn. However, I had trouble connecting to the server from certain hosts today (as I had a few days ago): when I pinged an "unavailable" host, the ping was redirected by the router:
Code:
# ping -c 1 192.168.1.61
PING 192.168.1.61 (192.168.1.61): 56 data bytes
36 bytes from router.local.lan (192.168.1.254): Redirect Host(New addr: 192.168.1.61)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
4  5  00 0054 a734   0 0000  40  01 4fe6 192.168.1.1  192.168.1.61

64 bytes from 192.168.1.61: icmp_seq=0 ttl=64 time=1.009 ms

--- 192.168.1.61 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.009/1.009/1.009/0.000 ms
Other hosts on the same network were able to reach 192.168.1.61 directly, so neither 192.168.1.61 nor the router can be causing the problem (I do not have any static routes for 192.168.1.61/32 on my router). Status on the server:
Code:
# arp -a
server.local.Lan (192.168.1.1) at 00:1c:c0:6f:c2:60 on em0 permanent [ethernet]
laptop.local.Lan (192.168.1.32) at 5c:26:0a:2a:37:10 on em0 expires in 1197 seconds [ethernet]
jail.local.Lan (192.168.1.5) at 00:1c:c0:6f:c2:60 on em0 permanent [ethernet]
WinXP.local.Lan (192.168.1.4) at 08:00:27:80:b8:10 on em0 expires in 1200 seconds [ethernet]
? (192.168.1.61) at 00:04:20:05:31:38 on em0 expires in 1150 seconds [ethernet]
? (192.168.1.255) at (incomplete) on em0 expired [ethernet]
Router.local.Lan (192.168.1.254) at 00:0d:b9:00:11:68 on em0 expires in 1150 seconds [ethernet]
Phone.local.Lan (192.168.1.21) at 00:0e:08:bc:ed:94 on em0 expires in 799 seconds [ethernet]
# netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags      Netif Expire
default            192.168.1.254      UGS        em0
10.8.0.0/24        10.8.0.2           UGS       tun0
10.8.0.1           link#4             UHS        lo0
10.8.0.2           link#4             UH        tun0
127.0.0.1          link#2             UH         lo0
192.168.1.0/24     link#1             U          em0
192.168.1.1        link#1             UHS        lo0
192.168.1.5        link#1             UHS        lo0
192.168.1.5/32     link#1             U          em0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::/96                             ::1                           UGRS       lo0
::1                               link#2                        UH         lo0
::ffff:0.0.0.0/96                 ::1                           UGRS       lo0
fe80::/10                         ::1                           UGRS       lo0
fe80::%lo0/64                     link#2                        U          lo0
fe80::1%lo0                       link#2                        UHS        lo0
fe80::%tun0/64                    link#4                        U         tun0
fe80::21c:c0ff:fe6f:c260%tun0     link#4                        UHS        lo0
ff02::/16                         ::1                           UGRS       lo0
#
Nothing stands out to me. (10.8.0.0/24 is a network for OpenVPN. I haven't changed the OpenVPN config in ages, so it shouldn't be related to my problem; I stopped the daemon anyway, but the server was still unable to reach 192.168.1.61.) The server "knows" host .61 (at least its MAC address). This is what tcpdump says:
Code:
No. Time      Source        Destination  Protocol Length Info
186 1.028186  192.168.1.1   192.168.1.61 ICMP     98     Echo (ping) request  id=0x2220, seq=0/0, ttl=64 (reply in 190)

Ethernet II, Src: IntelCor_6f:c2:60 (00:1c:c0:6f:c2:60), Dst: PcEngine_00:11:68 (00:0d:b9:00:11:68)

No. Time      Source        Destination Protocol Length Info
189 1.029008  192.168.1.254 192.168.1.1 ICMP     70     Redirect             (Redirect for host)

Ethernet II, Src: PcEngine_00:11:68 (00:0d:b9:00:11:68), Dst: IntelCor_6f:c2:60 (00:1c:c0:6f:c2:60)

No. Time      Source        Destination Protocol Length Info
190 1.029392  192.168.1.61  192.168.1.1 ICMP     98     Echo (ping) reply    id=0x2220, seq=0/0, ttl=64 (request in 186)

Ethernet II, Src: SlimDevi_05:31:38 (00:04:20:05:31:38), Dst: IntelCor_6f:c2:60 (00:1c:c0:6f:c2:60)
This is where things get interesting: the server is sending an echo request to .61, but it is using the router's MAC address (*:68) as the destination! Naturally the router responds with a redirect, and .61 (*:38) answers.
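The comparison made by eye here (destination MAC in the captured frame versus what the ARP table holds) could be scripted for repeated checks. This is just a sketch under assumptions: it expects arp -an style output on stdin, and classify_dst_mac is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: decide whether a frame's destination MAC belongs to the target host
# itself or to the gateway, based on "arp -an" output read from stdin.
# classify_dst_mac is a hypothetical helper name.
#   $1 = destination IP, $2 = destination MAC seen on the wire, $3 = gateway IP
classify_dst_mac() {
    awk -v ip="($1)" -v seen="$2" -v gw="($3)" '
        $2 == ip { host_mac = $4 }    # ARP entry for the target host
        $2 == gw { gw_mac = $4 }      # ARP entry for the gateway
        END {
            if (seen == host_mac)    print "direct"
            else if (seen == gw_mac) print "via-gateway"
            else                     print "unknown"
        }
    '
}

# Possible use on the server, with the addresses from this thread:
#   arp -an | classify_dst_mac 192.168.1.61 00:0d:b9:00:11:68 192.168.1.254
```

With the ARP entries posted above, the MAC ending in :68 would be classified as via-gateway and the one ending in :38 as direct.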

A few hours later the host .61 was reachable again without any intervention (well, I've switched it off and turned it back on in the meantime but this shouldn't change anything from the network's perspective):
Code:
ping -c 1 192.168.1.61
PING 192.168.1.61 (192.168.1.61): 56 data bytes
64 bytes from 192.168.1.61: icmp_seq=0 ttl=64 time=0.721 ms

--- 192.168.1.61 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.721/0.721/0.721/0.000 ms
arp and netstat haven't changed (except for 10.8.0.0/24 which has been taken down as described above):
Code:
# arp -a
Server.local.Lan (192.168.1.1) at 00:1c:c0:6f:c2:60 on em0 permanent [ethernet]
Laptop.local.Lan (192.168.1.32) at 5c:26:0a:2a:37:10 on em0 expires in 1152 seconds [ethernet]
InetServer.local.Lan (192.168.1.5) at 00:1c:c0:6f:c2:60 on em0 permanent [ethernet]
WinXP.local.Lan (192.168.1.4) at 08:00:27:80:b8:10 on em0 expires in 1106 seconds [ethernet]
? (192.168.1.61) at 00:04:20:05:31:38 on em0 expires in 1167 seconds [ethernet]
? (192.168.1.255) at (incomplete) on em0 expired [ethernet]
Router.local.Lan (192.168.1.254) at 00:0d:b9:00:11:68 on em0 expires in 503 seconds [ethernet]
Phone.local.Lan (192.168.1.21) at 00:0e:08:bc:ed:94 on em0 expires in 1198 seconds [ethernet]
[Server] ~# netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags      Netif Expire
default            192.168.1.254      UGS        em0
127.0.0.1          link#2             UH         lo0
192.168.1.0/24     link#1             U          em0
192.168.1.1        link#1             UHS        lo0
192.168.1.5        link#1             UHS        lo0
192.168.1.5/32     link#1             U          em0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::/96                             ::1                           UGRS       lo0
::1                               link#2                        UH         lo0
::ffff:0.0.0.0/96                 ::1                           UGRS       lo0
fe80::/10                         ::1                           UGRS       lo0
fe80::%lo0/64                     link#2                        U          lo0
fe80::1%lo0                       link#2                        UHS        lo0
ff02::/16                         ::1                           UGRS       lo0
[Server] ~#
This time the server is using the correct MAC address (*:38), that of .61 itself and not the router's:
Code:
No. Time     Source      Destination  Protocol Length Info
111 0.810198 192.168.1.1 192.168.1.61 ICMP     98     Echo (ping) request  id=0xa33b, seq=0/0, ttl=64 (reply in 113)

Ethernet II, Src: IntelCor_6f:c2:60 (00:1c:c0:6f:c2:60), Dst: SlimDevi_05:31:38 (00:04:20:05:31:38)

No. Time     Source       Destination Protocol Length Info
113 0.810878 192.168.1.61 192.168.1.1 ICMP     98     Echo (ping) reply    id=0xa33b, seq=0/0, ttl=64 (request in 111)

Ethernet II, Src: SlimDevi_05:31:38 (00:04:20:05:31:38), Dst: IntelCor_6f:c2:60 (00:1c:c0:6f:c2:60)
No wonder .61 answers directly... This is getting weirder and weirder (and slowly driving me insane :-/ ). Does anyone know what's going on?
 
Pretty strange. Does running arp -da to delete all ARP entries help when that weirdness is happening? Maybe forcing a fresh ARP request would be helpful there.
 
Ok, so the server had problems connecting to .61 again (tried to reach it via the router, see above for details)... I've shut down every service and unloaded all kernel modules (if possible) until I ended up with just the sshd and some very basic processes like init. The server still sent echo requests for .61 using the router's MAC address as the destination. So I issued a route flush and voilà, suddenly the server tried to reach .61 directly! This is what the routing tables looked like before and after the flush:
Code:
Routing tables before the flush

Internet:
Destination        Gateway            Flags      Netif Expire
default            192.168.1.254      UGS        em0
127.0.0.1          link#2             UH         lo0
192.168.1.0/24     link#1             U          em0
192.168.1.1        link#1             UHS        lo0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::/96                             ::1                           UGRS       lo0
::1                               link#2                        UH         lo0
::ffff:0.0.0.0/96                 ::1                           UGRS       lo0
fe80::/10                         ::1                           UGRS       lo0
fe80::%lo0/64                     link#2                        U          lo0
fe80::1%lo0                       link#2                        UHS        lo0
ff02::/16                         ::1                           UGRS       lo0

Routing tables after the flush

Internet:
Destination        Gateway            Flags      Netif Expire
127.0.0.1          link#2             UH         lo0
192.168.1.0/24     link#1             U          em0
192.168.1.1        link#1             UHS        lo0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               link#2                        UH         lo0
fe80::%lo0/64                     link#2                        U          lo0
fe80::1%lo0                       link#2                        UHS        lo0
The default gateway is missing in the IPv4 section (and some other stuff in IPv6 but I do not use it at the moment plus it's for lo0 only anyway). I've issued a /etc/rc.d/routing start which added 192.168.1.254 back as the default gateway, .61 was still reachable directly.
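Spotting what a flush removed is easier with a small comparison than by reading two tables side by side. The sketch below assumes each table was saved separately with netstat -rn > file; the function name missing_routes and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: print routes that appear in one saved "netstat -rn" dump but not in a later one.
# missing_routes is a hypothetical helper name.
#   $1 = dump taken before the change, $2 = dump taken after it
missing_routes() {
    awk '
        # Skip headings and blank lines; route lines have at least four columns.
        NF < 4 || $1 == "Destination" || $1 == "Routing" { next }
        FILENAME == ARGV[1] { after[$1] = 1; next }   # first file read: the later dump
        !($1 in after)      { print $1, $2 }          # second file: entries that vanished
    ' "$2" "$1"
}

# Possible use (file names are assumptions):
#   netstat -rn > /var/tmp/routes.before
#   route flush
#   netstat -rn > /var/tmp/routes.after
#   missing_routes /var/tmp/routes.before /var/tmp/routes.after
```

It compares destinations only, so an entry whose gateway or flags changed while its destination stayed the same would not be reported.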

My conclusion: I'm guessing that this is strictly a problem on my FreeBSD server and that it has nothing to do with my network. Somehow the routing table seems to be messed up internally, but it doesn't show in netstat -rn. Could this be due to faulty hardware (RAM, Ethernet card)? Any other ideas?
 
I would suggest dropping your results on the FreeBSD-net mailing list. Keep it short; specifically, these three things are what I see as most valuable: the initial tcpdump -i em0 results that show packets on em0 when they should have been on lo0; the second set of tcpdump -i em0 results, which show the gateway's MAC address as the destination even though arp -an doesn't agree; and finally the fact that kicking the routing tables with a route flush; service routing start fixes it. This is all good troubleshooting that is pointing at something. I'm just not sure what it is yet, and the wider audience would be helpful.
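Because the flush also wipes out the faulty state, it may be worth saving that state to a file first so it can go into the mailing-list report. The following is only a rough sketch: the snapshot helper and the output path are made up, and route -n get (which asks the kernel which route it would pick for a destination) is suggested here as an extra data point, not something that was run in this thread.

```shell
#!/bin/sh
# Sketch: record the output of several diagnostic commands in one file.
# snapshot is a hypothetical helper name.
#   $1 = output file; every further argument is one command line to record
snapshot() {
    out=$1
    shift
    date >> "$out"
    for cmd in "$@"; do
        printf '### %s\n' "$cmd" >> "$out"
        # Keep going if a command fails, but note its exit status.
        sh -c "$cmd" >> "$out" 2>&1 || printf '(exit status %s)\n' "$?" >> "$out"
    done
}

# Possible use on the server while it misbehaves, before flushing the routes
# (left commented out so that sourcing this file has no side effects):
#   snapshot "/var/tmp/net-state-$(date +%Y%m%d-%H%M%S).txt" \
#       'netstat -rn' 'arp -an' 'route -n get 192.168.1.61'
```

Each command is passed as a single quoted argument so that its exact command line ends up as a heading above its output.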
 
Yes, you're right, I'll do that as soon as the system misbehaves. Right now it is stable but this is a state which only lasts for a few days max... Thank you ever so much for your help!
 
Actually the three things I pointed out are what you already have above. There's something going on there... I just don't know what it is. There's no need to wait.
 
Hmmm, 2 weeks have passed now and the problem hasn't reoccurred. This sucks ... I didn't really change anything except for the flushing of the routing table. I have rebooted the system since and was expecting the problem to reappear. So far I am out of "luck"... In any case, will post again as soon as the problem happens again.
 
Aaaaannnd it has happened again :( Flushing the route instantly brought everything back to normal. I'm posting this on the FreeBSD-Net mailing list as you've suggested.
 