Problem with "random" socket resets

Hello,

I am having a problem where my code will occasionally get to a point where every new TCP socket connection is sent a FIN immediately after the TCP handshake. I am not exactly sure what is causing this, as I have wrappers to catch errors on all socket operations. As an experiment, to see whether some limit was being hit, I took out the close() calls for the socket fds. That did make the socket resets go away, but I eventually hit a point where no new connections were completed at all. At that point, I grep'd for some sysctl values, which are as follows:

Code:
[jon@] /usr/lib# sysctl -a | grep sock
kern.ipc.maxsockbuf: 2097152
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.maxsockets: 25600
kern.ipc.numopensockets: 1363
net.inet.ip.mcast.maxsocksrc: 128
net.inet.tcp.syncache.rst_on_sock_fail: 1
net.inet6.ip6.mcast.maxsocksrc: 128
security.jail.param.allow.socket_af: 0
security.jail.param.allow.raw_sockets: 0
security.jail.allow_raw_sockets: 0
security.jail.socket_unixiproute_only: 1
[jon@] /usr/lib# sysctl -a | grep thread
kern.cam.ctl.block.num_threads: 14
kern.geom.eli.threads: 0
kern.threads.max_threads_hits: 0
kern.threads.max_threads_per_proc: 2000
vm.stats.vm.v_kthreadpages: 0
vm.stats.vm.v_kthreads: 24
vfs.nfsd.minthreads: 1
vfs.nfsd.maxthreads: 1
vfs.nfsd.threads: 0
net.isr.numthreads: 1
net.isr.bindthreads: 0
net.isr.maxthreads: 1
[jon@] /usr/lib# sysctl -a | grep file
kern.maxfiles: 49312
kern.bootfile: /boot/kernel/kernel
kern.maxfilesperproc: 18000
kern.openfiles: 7758
kern.corefile: %N.core
kern.filedelay: 30
debug.softdep.jwait_filepage: 1255
debug.softdep.write.freefile: 0
debug.softdep.current.freefile: 0
debug.softdep.total.freefile: 116205
hw.snd.latency_profile: 1
p1003_1b.mapped_files: 200112
[jon@] /usr/lib# sysctl -a | grep proc
kern.maxproc: 10000
kern.maxfilesperproc: 18000
kern.maxprocperuid: 9000
kern.shutdown.kproc_shutdown_wait: 60
kern.sigqueue.max_pending_per_proc: 128
kern.threads.max_threads_per_proc: 2000


Does anyone see anything I'm missing in the number of open files/sockets compared to the max values here? I thought maybe it was a memory issue, but when I ran top, I had plenty. Is there anything else I can check to find out what might be causing the sockets to send the reset messages? It's kind of hard to debug because there is no segfault or anything else to trip my debugger.
 
kern.ipc.maxsockets - maximum number of open (unclosed) sockets
kern.maxfiles - maximum number of open file descriptors system-wide (including sockets)
kern.ipc.somaxconn - maximum length of a listen queue (connections waiting to go through accept())
kern.ipc.nmbclusters - maximum number of network mbuf clusters (the actual physical memory this represents depends on the architecture)
kern.ipc.maxsockbuf - maximum combined socket buffer size (in kernel space)
net.inet.tcp.recvspace - default TCP receive buffer size
net.inet.tcp.sendspace - default TCP send buffer size

If any of these global limits is exceeded, new connections and/or data will be dropped.
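
If it helps, here is a minimal sketch of how you could read those counters from inside your program rather than from the shell, using FreeBSD's sysctlbyname(3). The helper function is just an illustration; the counter names are the same ones from your output above.

Code:
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

/* Read a single integer sysctl value by name (sketch only). */
int read_sysctl_int(const char *name)
{
    int value = 0;
    size_t len = sizeof(value);

    if (sysctlbyname(name, &value, &len, NULL, 0) == -1) {
        perror(name);
        return -1;
    }
    return value;
}

int main(void)
{
    printf("sockets: %d of %d\n",
           read_sysctl_int("kern.ipc.numopensockets"),
           read_sysctl_int("kern.ipc.maxsockets"));
    printf("files:   %d of %d\n",
           read_sysctl_int("kern.openfiles"),
           read_sysctl_int("kern.maxfiles"));
    return 0;
}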
 
What is the appropriate way to check whether you're exceeding these buffers? Do these errors get logged anywhere automatically? I found this article:
http://rerepi.wordpress.com/2008/04/19/tuning-freebsd-sysoev-rit/

It talks a little bit about nmbclusters and I was going through it some, but it is mostly about adjusting the values and doesn't really talk about debugging an existing problem. The one thing I did notice is that my tcp.sendspace and tcp.recvspace were set to different values. I don't know if my code makes any distinction there.
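
In case it matters, here is a rough sketch of what I am planning to add to log what a given socket actually ends up with, since the sysctls should only be defaults (the helper name is made up):

Code:
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

/* Rough sketch: print the buffer sizes one particular socket ended up with.
 * getsockopt() shows what this fd actually has, regardless of the sysctl
 * defaults. */
void print_bufsizes(int fd)
{
    int sndbuf = 0, rcvbuf = 0;
    socklen_t len;

    len = sizeof(sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == -1)
        perror("getsockopt(SO_SNDBUF)");

    len = sizeof(rcvbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) == -1)
        perror("getsockopt(SO_RCVBUF)");

    printf("fd %d: SO_SNDBUF=%d SO_RCVBUF=%d\n", fd, sndbuf, rcvbuf);
}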

For my particular problem, I currently have my code running, and it has gotten into the state that is causing the TCP connections to be dropped. Is there anything I can check while it is running that might indicate why those connections are being dropped?
 
Well, I don't see any errors in dmesg, so I am kind of at a loss for what else to do. I did bump up the number of nmbclusters, but I still haven't found a great explanation of what that value actually means.
 
It seems like much of this should not be a problem, by the way. This server is only handling tens of connections at a time. We're talking maybe 20 page loads, with 10 TCP connections per load. It doesn't seem like the sort of thing that should need any kind of tweaking, but maybe I am wrong. I have checked and double-checked the socket accept() and close() calls and can't seem to find anything going wrong there.
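
For what it's worth, the shape I am trying to keep every code path in looks roughly like this; it's simplified, and handle_client() is just a stand-in for the real request handling:

Code:
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

/* Stand-in for the real request handler. */
static void handle_client(int fd)
{
    (void)fd;
}

/* Simplified accept loop: the goal is that every path out of the loop body
 * ends with a close() on the fd returned by accept(). */
void serve(int listen_fd)
{
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd == -1) {
            if (errno == EINTR || errno == ECONNABORTED)
                continue;              /* transient; keep accepting */
            perror("accept");
            break;
        }

        handle_client(client_fd);

        if (close(client_fd) == -1)
            perror("close");
    }
}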

Is there any kind of automated tool that will let you find leaking socket file descriptors, etc.? Sort of a valgrind-ish type tool for sockets?
 
dcole said:
Well, I don't see any errors in dmesg, so I am kind of at a loss for what else to do. I did bump up the number of nmbclusters, but I still haven't found a great explanation of what that value actually means.

It is the memory buffer for storing raw packets that are waiting to be processed (incoming) or have already been processed and are waiting to be pushed out to the network interface (outgoing). If you only have tens of connections, tuning the default value up is very unlikely to make any difference.

dcole said:
Is there any kind of automated tool that will let you find leaking socket file descriptors, etc.? Sort of a valgrind-ish type tool for sockets?

Well, valgrind is capable of doing this: there is an option --track-fds=<yes|no> [default: no] that prints unclosed (leaked) file descriptors on termination. Since sockets are just fds, this should pick them up.
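
For example, a deliberately broken toy program like this shows up in the report (the ./a.out name in the comment is just the compiler default):

Code:
#include <sys/types.h>
#include <sys/socket.h>

/* Deliberately leaked socket, just to show what the report looks like.
 * Run under: valgrind --track-fds=yes ./a.out
 * At exit, valgrind lists the still-open descriptor and where it was created. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    (void)fd;                 /* never close()d, on purpose */
    return 0;
}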
 
Back from the dead...

I seem to have been able to track down a few small bugs that resolved some of my issues, but not others. I have noticed something interesting about the "hanging" open sockets. I have a set of PF rules so that traffic from certain clients gets redirected to a loopback interface on a particular port. My server's program has a socket bound to this port on the loopback interface and is using listen/accept to accept incoming connections. Do you think having that PF rule in there is somehow adversely affecting my ability to close those socket connections? Not all, but some of the connections established through that loopback interface are apparently not getting closed, even though most are. I have tried to make sure all execution paths include a close() on the socket fd that is returned from accept(), and as far as I can tell they are all closed.
 
dcole said:
Do you think having that PF rule in there is somehow adversely affecting my ability to close those socket connections?

No, it should make no difference to the network stack, as PF just translates addresses/ports in packets.

dcole said:
I have tried to make sure all execution paths include a close() on the socket fd that is returned from accept(), and as far as I can tell they are all closed.

And how did you check that? You are either overwriting the descriptors (and losing your reference to them that way) or simply not closing them. A closed descriptor cannot leak on a well-designed OS.
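
The overwrite case looks something like this; purely illustrative, not taken from your code:

Code:
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Purely illustrative: the descriptor from the first accept() is lost when
 * the variable is reused, so it can never be close()d. */
void overwrite_leak(int listen_fd)
{
    int client_fd;

    client_fd = accept(listen_fd, NULL, NULL);   /* connection A           */
    client_fd = accept(listen_fd, NULL, NULL);   /* B overwrites A: A leaks */
    close(client_fd);                            /* only B is closed       */
}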
 