Network timeouts and disconnects under load

Q: What's the best way to discover the cause of network timeouts?

Background: Under regular load my network connection appears to work fine. However, after ~15 minutes of heavy load I get network timeouts and disconnections, and can't even ping local addresses. After 4-5 minutes of complete inactivity things start up again, but only for a further minute or so. This pattern repeats (1 minute of connectivity, 4-5 minutes of timeouts) until I reboot (ifconfig down/up doesn't clear things up). During this time only this machine's connection seems to be affected - I can connect to the outside and maintain a good connection from other machines (so long as I bypass this machine).

How I'm stressing the connection: 32 simultaneous SSL connections to my NNTP provider, maxing out at ~1.0MB/s total. There are no ISP or news provider caps, and my machine / network interface should comfortably sustain this without issue (as it has under other operating systems). I've attempted the same load with two different NNTP apps - same result. top shows CPU, memory and swap are all fine before and during these periods, and the router shows no dropped packets or network errors.

Things I've tried:

pf enabled / disabled - no effect.
dnsmasq enabled / disabled - no effect.
different apps - no effect.
sysctl tuning - no effect.

I tried tuning network sysctl options as described in the Handbook, tuning(7), and some of the more sensible online howtos. Same behaviour.
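For the record, the kind of tunables I was adjusting looked roughly like this. The values below are illustrative only, not a recommendation - the exact numbers I tried varied from run to run:

```shell
# Illustrative network tunables on FreeBSD 7.x (run as root).
# Values are examples only; none of them changed the behaviour for me.
sysctl kern.ipc.nmbclusters=32768       # cap on mbuf clusters
sysctl kern.ipc.maxsockbuf=2097152      # max socket buffer size
sysctl net.inet.tcp.sendspace=65536     # default TCP send buffer
sysctl net.inet.tcp.recvspace=65536     # default TCP receive buffer
```

To make any of these survive a reboot they'd go in /etc/sysctl.conf, but since none of them helped I reverted to the defaults.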

I've been trying to get to the bottom of this for a few days, and am now out of ideas. I'm currently having to reboot my machine every 30 mins or so to clear these connections.

It feels like I'm filling buffers faster than I'm clearing them, or exceeding some hard limit, but that isn't supported by the sysctl tuning I've attempted. It seems I might need to learn some lower level tools to determine the cause.

7.0-RELEASE-p7, all packages fully up-to-date.

Ideas anyone?
r-c-e said:
Perhaps you're maxing out mbuf clusters? What does "netstat -m" show you?

Thanks r-c-e, netstat -m is exactly what I was looking for.

I should have posted again yesterday as I believe I discovered a workaround, but I wanted to test things further to be sure. Anyway, here's an update:

After all kinds of sysctl tuning, I went back to my default ifconfig setup and disabled jumbo frames.
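Concretely, "disabling jumbo frames" here just meant dropping the MTU back to the Ethernet default (sk0 is my interface; adjust to taste):

```shell
# Disable jumbo frames: return the interface to the standard
# 1500-byte Ethernet MTU (sk0 is my 3C940; substitute your interface).
ifconfig sk0 mtu 1500

# What I'd had before, for NAS transfers:
# ifconfig sk0 mtu 9000
```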

I'd enabled jumbo frames (jf), against my better judgment, in order to improve the pitiful SMB performance to/from my NAS (less than half what I see using CIFS under Linux with jf disabled). I'd already done sustained, error-free 5-15 MB/s transfers to/from the NAS with jf enabled, so although disappointed, I didn't think anything of it. However, when I disabled jf and re-ran the NNTP tests, everything was happy again. No errors, no timeouts, just perfect.

I rebooted and ran the tests three more times - 32 SSL connections, sustained 1.0MB/s for 3-4 hours - not a single issue.

Like an idiot, I didn't try again with jf enabled to be 100% sure, but instead decided to celebrate by upgrading to 7.1-RC2 so I could work on the other issues I have.

I just tried re-enabling jumbo frames to check the netstat -m output, and whadyaknow, with 7.1-RC2 (+ jf) I see no more issues.

I'm using the sk driver (3Com 3C940 Gigabit) and it looks like there was a pretty substantial CVS checkin related to jumbo frames in August that might have fixed things, but at this stage it's hard to know for sure. If there's an easy way to downgrade the sk module then I'd be happy to test it, if it would help anyone. If not then I'm not going to complain further, as everything seems OK now. At least with my connection ...

Thanks again, and HNY :)

Oh, BTW here's the output of netstat -m on 7.1-RC2 towards the end of a 5GB inet transfer @ 1.0MB/s with jf enabled (mtu 9000). Looks like I have a long way to go before running out of any kind of buffer. Just wish I had a 7.0 one to compare it with...

[mart@bsddesktop /usr/home/mart]$ netstat -m
303/1242/1545 mbufs in use (current/cache/total)
1/249/250/25600 mbuf clusters in use (current/cache/total/max)
1/243 mbuf+clusters out of packet secondary zone in use (current/cache)
0/236/236/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
298/253/551/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
2762K/4029K/6791K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/5/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines