Intel X710 Ethernet Issues - TCP connections hang

I've got two servers that are twins. They are SuperMicro SYS-1019-WTR motherboards with a Xeon Gold 6246R 16-core processor, 6 memory lanes, 96GB, and triple 4GB MLC2 SATA. Purchased 2023. Ethernet is configured as: ifconfig_ixl0="inet 192.168.xxx.xxx netmask 255.255.0.0 mtu 1468"
I also have a much older SuperMicro; the Ethernet on that is the 'igb0' flavor: ifconfig_igb0="inet 192.168.xxx.xxx netmask 255.255.0.0 mtu 1468"
The new servers are on freebsd-version 14.1-RELEASE-p5; I was able to upgrade the old server to 14.3-RELEASE-p7.

When the two new servers were on the same LAN getting their software developed, and the MTU was 1500, they could ssh/scp files between them and the old 'igb0' server.
One server is now out in the real world on a fiber connection. This is the "Hosted Server". The other is still on my local LAN. This is the digital twin. The old 'igb0' server is also on the LAN.
The smaller MTU is because that is what it took to make the LAN able to use StarLink with its CGNAT set-up, and that is what it took to make ssh connections out to the Hosted Server and keep the SSH connection up over longer periods of time (1-5 days). In the AI chat, the possibility came up that the X710 Ethernet hardware might be unhappy with the 1468 MTU. I suspect that the old server can connect to the Hosted Server out in the real world because it has to go through a router, to StarLink, then another router (albeit also CGNAT for the fixed IP address), and then a final router for port forwarding that has the Hosted Server on it. The router could be compensating for the 1468 MTU from the old server, which would point to a problem with the new Digital Twin server somehow.

The "Old Server" can ssh/scp out to the Hosted Server no problem. But the Digital Twin cannot ssh/scp to either the old server or the Hosted Server. It just hangs. I had a four-hour Google AI session working on the problem. While AI has never solved the problem, it often gets close and makes me aware of what to look at. I had the AI system create a summary of our discussion, which I enclose:

Subject: FreeBSD 14.1: ixl(4) Client Sees SYN-ACK in tcpdump but Handshake Times Out (SYN_SENT)
Environment:
Client: FreeBSD 14.1-RELEASE, Intel X710-series NIC using ixl(4) driver.
Target Server: FreeBSD 14.1-RELEASE, Intel i210-series NIC using igb(4) driver.
Network: Same Layer 2 LAN/Switch.
Tuning: Both hosts use MTU 1468 (MSS 1428) due to upstream Starlink requirements.
Comparison: Windows 11 on the same LAN connects to the target port without any issues.
The Problem:
Outbound SSH or nc from the ixl client to the igb server hangs indefinitely. Verbose SSH logs show a hang immediately after Connecting to.... On the client, the socket remains in SYN_SENT.
Observed in tcpdump:
Client sends SYN [S].
Target Server receives SYN and responds with SYN-ACK [S.].
Client-side tcpdump physically shows the SYN-ACK arriving at the ixl0 interface.
The OS ignores the packet. The client never sends the final ACK, and instead re-transmits the SYN.
Troubleshooting Steps Taken (No success):
Firewalls: Disabled PF on both ends. IPFW is not loaded.
TCP Extensions: Set net.inet.tcp.rfc1323=0, sack.enable=0, and blackhole=0.
Offloading: Disabled rxcsum, txcsum, tso, and lro on both ixl and igb interfaces.
Kernel Checks: Cleared hostcache and verified rfc1122_strong_es=0.
ARP Issue: Initially, the server would not automatically ARP for the client; a manual arp -s on the server was required to get the SYN-ACK onto the wire. Even with static ARP and successful ping (0.013ms RTT), TCP handshakes still fail.
QoS: Tested with ssh -o IPQoS=none to rule out IP_TOS 0x48 (AF21) drops by network hardware.
It appears the ixl driver or the 14.1 kernel is rejecting these inbound SYN-ACKs despite them being visible in promiscuous mode. Is this a known regression in the iflib based ixl driver or a specific conflict with non-standard MTUs on the X710?

Hardware & Contextual Note:
Identical Hardware: The local "problem" client and the target server are identical motherboards purchased at the same time.
Off-site Success: An identical production server (same motherboard/NIC/FreeBSD 14.1) is located off-site. The old local igb server can connect to the off-site server without issue over the Starlink and other routers in the path.
The Discrepancy: The ixl(4) client successfully handles TCP handshakes when the Destination MAC is the Starlink gateway (routed traffic), but it "ghosts" the SYN-ACK when the Destination MAC is a local peer on the same switch (L2 traffic).
ARP Behavior: The target server fails to automatically populate its ARP table for the client. Even after a manual arp -s is added and ping succeeds, the ixl driver appears to drop the inbound SYN-ACK before it reaches the TCP stack.

The last thing suggested to try (I've not done this yet) was:
In the BIOS:
This is primarily a firmware-level setting. Check your BIOS/UEFI settings under the NIC configuration for a "Hardware LLDP" or "Firmware LLDP" toggle.
Disabling LLDP in Supermicro BIOS
To disable the hardware agent, you must access the UEFI Device Settings during the boot process:
Enter BIOS: Press the <Del> or <F2> key during system boot.
Navigate to Advanced: Go to the Advanced tab in the BIOS menu.
Device Settings: Look for PCIe/PCI/PnP Configuration or a direct Intel(R) Ethernet Connection X710 entry under the Advanced tab.
NIC Configuration: Select the specific ixl port (e.g., NIC Configuration).
LLDP Agent: Find the setting labeled LLDP Agent and set it to Disabled.

I'm in over my head on this one. I'm hoping the above will lead to someone recognizing the problem so they can advise what to try next. The posts I found were about 2 years old, so hopefully this is a known issue. Once I can get the machine back online (it can't even do a package update), I plan to upgrade to 14.3. Once that works, I'll update the Hosted Server to 14.3 too.
TIA
 
An update: (more work with Google AI, so hopefully right direction even if there are syntax errors)
Looking in dmesg, I see: fw 4.1.59148 api 1.9 nvm 4.11 etid 80001db8 oem 1.265.0
Looking online, it seems that 'api 1.9' is a very old firmware level. From what I found, FreeBSD expects API 1.15 or higher.
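
A minimal sketch for pulling the API level out of that dmesg line and comparing it against a target. The function names are made up for illustration, and the 1.15 target is simply the figure quoted above, not something verified against the driver. The comparison is done per component because 1.9 is older than 1.15 even though it looks larger as a decimal.

```shell
#!/bin/sh
# Illustrative helpers only; the names are not part of any FreeBSD tool.

# Read dmesg text on stdin and print the first NVM API level found, e.g. "1.9".
nvm_api_version() {
    sed -n 's/.*fw [0-9.]* api \([0-9]*\.[0-9]*\) nvm.*/\1/p' | head -n 1
}

# Succeed if version $1 is older than version $2, comparing major and minor
# as integers rather than as one decimal number.
api_older_than() {
    fmaj=${1%%.*}; fmin=${1#*.}
    wmaj=${2%%.*}; wmin=${2#*.}
    [ "$fmaj" -lt "$wmaj" ] || { [ "$fmaj" -eq "$wmaj" ] && [ "$fmin" -lt "$wmin" ]; }
}

# Example with the line quoted in this post:
line='ixl0: fw 4.1.59148 api 1.9 nvm 4.11 etid 80001db8 oem 1.265.0'
found=$(echo "$line" | nvm_api_version)
if api_older_than "$found" 1.15; then
    echo "api $found is older than 1.15"
fi
```

On a live system the input would come from something like `dmesg | grep ixl0 | nvm_api_version`.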

There is a tool, "Intel NVM Update Tool for FreeBSD", that can be downloaded. Regrettably, this problem keeps me from being able to do port installs. But I could download it on my working "old" igb server and scp it over from the old server to the "Digital Twin".

Does this sound like a reasonable plan?
 
if you go mucking around with the MTU, you have to also use pf to clamp the MSS of any routed/NAT'd traffic. otherwise, TCP handshakes leave your network with an MTU that the underlying path cannot support, and this breaks random stuff in exactly the way you're describing.
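
As a rough illustration of the clamping advice above: for IPv4 without IP or TCP options, the MSS is the MTU minus 40 bytes (20 for the IP header, 20 for the TCP header), which is how MTU 1468 corresponds to MSS 1428. The sketch below computes that and prints a pf.conf scrub line using pf's max-mss option. The function names, the interface name ixl0, and the value 1468 are placeholders taken from this thread, not a tested configuration, and the rule belongs on whichever machine actually routes or NATs the traffic.

```shell
#!/bin/sh
# Illustrative helpers only; names and values are assumptions, not a tested config.

# IPv4 MSS for a given MTU, assuming a 20-byte IP header and a 20-byte TCP
# header with no options.
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Print a pf.conf scrub line that clamps the MSS on one interface.
# $1 = interface name (placeholder), $2 = MTU the traffic has to fit within.
pf_clamp_rule() {
    echo "scrub on $1 all max-mss $(mss_for_mtu "$2")"
}

pf_clamp_rule ixl0 1468
```

A line produced this way can be syntax-checked with `pfctl -nf` on a scratch file before it goes anywhere near the live ruleset.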

my condolences on trusting the slop bot.
 
When the issue started (which was when the network switched from DSL to StarLink, with the StarLink router bypassed and a 3rd party router for the LAN), the MTU was 1500. The ssh and scp that used to work failed, even to the local machine where it used to work.
Various StarLink threads mentioned lowering the MTU due to the CGNAT nature of StarLink.

Would it make sense to set the MTU to 1500 and continue debugging, since changing it can cause other problems?
 
OK, will put the MTU back to 1500. The Google AI had me do some things with tcpdump that showed that the ACK stage of the initial handshake was being dropped. That led to "Is the OS relying on the hardware?". But the options of 'VLAN_MTU,JUMBO_MTU,HWSTATS,MEXTPG' would lead one to believe that the hardware was not being used to compute checksums.
I can't do any port loads, but there is an Intel update program I can put on a USB stick and run to update the firmware.
I think I'm going to dig into using tcpdump to see each state change during the TCP connection, and if the "missing ack" is true then go the firmware update route.

Making this even crazier is that this box used to work on the DSL line, and the production server has exactly the same ixl0 api 1.9 as this server and it works great on the fiber drop with a fixed IP and a 3rd party router for port forwarding. From that perspective, the only thing that changed on the server on my LAN is that it was connected to StarLink in bypass mode and through a 3rd party router. The "old" server has no problem doing ssh and scp to the hosted router.

I was not aware that pf was sensitive to MTUs. Right now, PF is off on the LAN system.

More to look at, tnx for the comments. A lot of learning left for me to do.
 
when you've got multiple routers going on, you have to be able to compare the packets at each hop ideally, or at least at the source end and target end. you also have to be careful to allow the right ICMP through, otherwise path MTU discovery breaks. again, the book of pf covers this quite well, whereas the chatbot will mislead you in subtle ways. there's no substitute for thinking through and understanding things on your own.
 
The book (will be the 4th edition, ships out March 17) is ordered.
Everything I'm working on is contingent on running with StarLink in areas with no other communications available. I also need to set up a VPN so the outside world can get back into the deployed system. And I was thinking OPNsense so the deployed resource can use LTE data, a hardwired Ethernet with a DHCP address, and StarLink as the final backup.

In the interim, I'll update the firmware on the Ethernet chip on the motherboard. While this is a bigger server, deployed servers will be 8-core Atoms. They will both run the exact same FreeBSD OS and other software.
 
FWIW: Here's a tcpdump showing what goes on when I try to ssh into the remote machine:
21:45:11.885981 IP 192.168.xxx.xxx.21847 > 208.160.xxx.xxx.1019: Flags [S], seq 3059303938, win 65535, options [mss 1460], length 0
21:45:11.927037 IP 208.160.xxx.xxx.1019 > 192.168.xxx.xxx.21847: Flags [S.], seq 3318167620, ack 3059303939, win 65535, options [mss 1428], length 0
21:45:12.903438 IP 192.168.xxx.xxx.21847 > 208.160.xxx.xxx.1019: Flags [S], seq 3059303938, win 65535, options [mss 1460], length 0
21:45:12.939881 IP 208.160.xxx.xxx.1019 > 192.168.xxx.xxx.21847: Flags [S.], seq 3318167620, ack 3059303939, win 65535, options [mss 1428], length 0

I never get to see the final ACK. Will change the MTU back to 1500 again, but the last time I did this it made no difference.
Indeed, as atax 1A suggests, pf is now the main culprit to examine/learn about.
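
A small sketch for tallying what a capture like the one above shows. It counts SYN ([S]), SYN-ACK ([S.]) and bare ACK ([.]) lines in tcpdump's text output. The function name and the verdict wording are made up for illustration, and it does not separate flows or directions, so it is only meaningful on a capture already filtered down to a single connection attempt.

```shell
#!/bin/sh
# Illustrative helper only; not part of tcpdump.

# Read tcpdump text output on stdin and report how far the handshake got.
classify_handshake() {
    awk '
        /Flags \[S\],/   { syn++ }      # initial SYN (or a retransmission)
        /Flags \[S\.\],/ { synack++ }   # SYN-ACK from the peer
        /Flags \[\.\],/  { ack++ }      # bare ACK, e.g. the third handshake step
        END {
            printf "syn=%d synack=%d ack=%d\n", syn, synack, ack
            if (synack > 0 && ack == 0)
                print "verdict: SYN-ACK captured, but no ACK followed"
            else if (synack == 0)
                print "verdict: no SYN-ACK captured"
            else
                print "verdict: handshake ACK present"
        }'
}

# Example (capture.pcap is a hypothetical file name):
#   tcpdump -nr capture.pcap | classify_handshake
```

Running the same tally on captures taken at both ends makes it easier to see at which hop the handshake stops.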
 
Code:
ping -D -s 1472 destination_ip
ping -D -s 1500 destination_ip
ping -D -s 8972 destination_ip
route -n get destination_ip
For me only -s 1400 works; no idea why this means MTU 1448.
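
One way to read those ping results: -s sets the ICMP payload size, and the IPv4 packet on the wire is 28 bytes larger (20-byte IP header plus 8-byte ICMP header), so a largest working -s of 1400 corresponds to a 1428-byte packet. The sketch below binary-searches for the largest payload a probe accepts. The function names are made up for illustration; probe_ping assumes FreeBSD's ping flags (-D for don't-fragment, -c for count, -t for an overall timeout in seconds) and a DEST variable you set yourself.

```shell
#!/bin/sh
# Illustrative helpers only; the names are assumptions, not an existing tool.

# Size of the IPv4 packet for a given ping payload
# (20-byte IP header + 8-byte ICMP header).
payload_to_packet() {
    echo $(( $1 + 28 ))
}

# Binary search for the largest payload in [$1, $2] that the probe command $3
# accepts. Assumes the lower bound itself works and that success is monotonic.
find_max_payload() {
    lo=$1; hi=$2; probe=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Real probe for FreeBSD: one don't-fragment ping of the given payload size.
# DEST must be set by the caller.
probe_ping() {
    ping -D -c 1 -t 2 -s "$1" "$DEST" >/dev/null 2>&1
}

# Example (not run here; 192.0.2.1 is a documentation address):
#   DEST=192.0.2.1
#   payload_to_packet "$(find_max_payload 0 1472 probe_ping)"
```

The number that comes out is the largest unfragmented packet the path accepted at that moment, which is the figure an MTU or MSS setting has to fit under.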
 
I think I'm dealing with 3 separate problems. Having 3 instead of a single problem certainly explains why things are so difficult.
1) The X710 Intel Ethernet driver, which seems at times to be flaky based on searches all over the net. My firmware level is 'api 1.9' and the current version is 1.15. So there is a firmware update in my future. This was hampered by the fact that the server can't even do a pkg update.

2) The 'Starlink' factor: The server on my LAN worked great until I had to switch over to Starlink. ssh and scp both broke. Other machines on the LAN (both Windows 11 PCs and other FreeBSD systems) could not keep an ssh session up for more than an hour or two. That requires changes in the ssh configuration. Starlink uses CGNAT, so a number of sites recommended lowering the MTU. On my Windows machine, I had to take the size of a ping down to 1420 when I set the "No Fragmentation" option. There is no way to come in from the outside Internet to your LAN without paying for a 3-4X more expensive Starlink IPv6 fixed IP. The DDNS solutions (No-IP, etc.) don't work. Some have made a tunnel to a server on the internet work, so I'll be going down that path, and that means that headers get bigger. As in many rural areas, all ISPs here are IPv4 only; they do not support IPv6, so the IPv6 fixed IP is not a solution. I'm already having to turn off IPv6 to make things work.
As is often the case, if you line up all the solutions online end to end, they don't point anywhere. The bottom line is that changes need to be made in system configurations to work over Starlink due to CGNAT, headers, VPN impact and even TCP drivers, because the Starlink system changes satellites every 15 seconds or so and this constant change in latency seems to upset some drivers. I've got a lot to learn about a lot of things due to the changes in connectivity when working via satellite vs. terrestrial. I think we all have a bit to learn.

3) Late last night, I did get the FreeBSD server on my LAN working. I had to change my pf.conf; here is the final line that worked:
pass out quick on ixl0 proto { tcp, udp } to any modulate state
I'm still learning about all this, but it seems that the ACK was being dumped by either the X710 or FreeBSD. As far as I can tell, the "modulate state" was telling the system to "loosen up on the strictness" of processing packets. When the book on PF shows up, I hope to understand this better. Now I can at least update packages and, more importantly, load the pkg to download the system driver changes that work with the X710.

Once the firmware and system software is updated, then I can worry about the optimal MTU. When I bring up the VPN tunnel so I can come into my LAN from a server on the net, that may need to be changed.

I know there is a lot of "AI Sucks" sentiment out there, and I'm right with you. But multiple 3-hour "chats" with the Google AI have been very helpful. I view AI like any other tool (MSO oscilloscope, signal generator, VNA, spectrum analyzer) or software diagnostic tools. The tools are dumb, and there is some operator skill needed. The AI was poor at keeping in mind I was on FreeBSD 14.1, and kept providing "Type this" solutions that were wrong because, having been trained on Linux, Windows, and lots of different FreeBSD versions, the AI behaved like a drunk expert. I had to continuously remind it I was on FreeBSD 14.1. If I could not have recognized when it was making a mistake, it would have been useless. AI would have me use a spectrum analyzer to measure a voltage. The value of AI was pointing out that I needed to measure a voltage, which got me looking in new directions.

So my journey continues. I'll work on and juggle these three issues and learn what I need to to fix them. The proliferation of "Internet over Satellite" is creating new challenges in system configuration. For everything I'm doing (digital communications and servers for first responders deployed into areas with absolutely no communications at all) this has to be figured out. Getting networking working, and adding some of the custom hardware needed for deployable servers in vehicles will be a big win for everyone. The fire agency I work with had a team deployed into an area working on downed trees with chainsaws that had zero comms for 14 hours at a time. This is not safe and not acceptable.

I'll post when I figure this all out. I thank everyone for their input.
 