This is an absolutely infuriating problem, so if you don't have lots of time on your hands I suggest you save yourself the trouble.
First, the situation. I have a FBSD machine in a DMZ (vlan2 on lagg0), connected to my FBSD router. The router also has connections to the LAN (vlan3 on lagg0), the wireless network (vlan4 on lagg0), and the internet (fxp0). All machines have pf active on all non-loopback interfaces, and, excluding the problem described below, everything works perfectly.
Now, the only problem I have is that the machine in the DMZ cannot sync with CVS via csup. I'm trying to grab the FBSD source repo, and I've tried switching CVS servers, but it happens with all of them. What's really strange is that I can see the connection establish and the hosts exchange data, and then, about 4 or 5 seconds later, the router always stops forwarding packets for the csup connection on both interfaces (DMZ and WAN), even though I can still see the packets arrive with tcpdump. Other connections are still fine. The problem also seems specific to CVS (or maybe just csup?), which is strange. I've pulled >100MB files with wget without problem, using the exact same firewall rules (just different ports in the same rules). Furthermore, the router itself has no issues using csup to sync with the same servers for the same repos!
This is downright maddening. The only other things I can think to try are another CVS client on the problematic machine (assuming I can easily get one w/o using CVS), or swapping a machine that can use CVS just fine on my LAN into the DMZ and seeing if it still works, or taking the machine out of the DMZ, putting it on my LAN, and seeing if it can sync there.
Does anyone have any other thoughts on how I might solve or even diagnose the problem, before I spend a few more hours hacking away at it?
Edit: For the curious, here's a little more detail. The csup client log (reproduced further down in this post) shows the checkout starting and then timing out.
As I said earlier, it will pull a few files in the first few seconds if I rm /usr/src/*. But it will hang on anything it can't pull in the first few seconds. After the connection gets mysteriously severed by the router, the client eventually times out and sends an RST, as does the server.
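One way to narrow down where the csup flow gets dropped is to snapshot pf's view of that single connection on the router while the transfer is stalled. The following is only a sketch under stated assumptions: `pf_flow_snapshot` is a hypothetical helper name, the default address is simply the server from the csup log, and port 5999 is assumed to be the csup/cvsup port in use. It relies only on the standard `pfctl -ss` (state table) and `pfctl -si` (counters) output.

```sh
#!/bin/sh
# Hypothetical helper: show pf's state entry and drop counters for one flow.
# Assumptions: server IP taken from the csup log, port 5999 assumed for csup.
pf_flow_snapshot() {
    server="${1:-68.66.37.246}"
    port="${2:-5999}"

    echo "== pf states for ${server}:${port} =="
    # -v adds age, expiry and packet counters on the lines after each state
    pfctl -ss -v | grep -A 2 "${server}:${port}"

    echo "== pf counters =="
    # Counters that tend to climb when pf drops packets on an existing state
    pfctl -si | grep -E 'state-mismatch|state-limit|memory|short|normalize'
}
```

Run as root on the router, once while csup is transferring and again after it hangs, e.g. `pf_flow_snapshot 68.66.37.246 5999`. If the state entry is still present but a counter such as state-mismatch has increased, that would point at pf's state tracking rather than the rule set. Running `tcpdump -n -i vlan2 host 68.66.37.246` and the same on fxp0 alongside it shows which side stops seeing forwarded packets.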
UPDATE Dec 01: I finally got some time to work on troubleshooting this. Between when I last experienced it, and when I started working on solving it, I'd done one system update to the router (to 8.1-p2) and rebooted a few times. The DMZ host with problems hasn't had any reboots whatsoever (100+ days uptime now). Now, with absolutely 0 changes to the rule set on either the router or the DMZ host, I have no problems, and CVS works just fine! This is really frustrating as now I may never know the cause of this bug!
Maybe some gamma ray burst flipped some bits in the memory (unlikely--I remain suspicious of a wayward nondeterministic bug somewhere in the code), but until I can recreate the problem, I won't be able to debug it.
UPDATE Dec 02: I did manage to get it to break again last night. However, it's working again this morning, with 0 changes to anything. I'm now certain that this bug depends on the phase of the moon, and isn't worth spending time on unless you're far more knowledgeable and skilled than I am at kernel debugging...
FINAL UPDATE Dec 04: Well, I've finally updated the server machine to 8.1 (was previously on 8.0) since I was able to get it to sync the source when it was in one of its good moods, and I can't seem to ever get CVS to fail now. So it seems it was something specific to 8.0 that was fixed in 8.1, or something went wrong during compilation/installation/use of the last system.
Code (csup client log):
# csup src.sup
Connected to 68.66.37.246
Updating collection src-all/cvs
Checkout src/ObsoleteFiles.inc
Receiver: Operation timed out