IPF + IPNAT intermittent problems, getting worse as uptime increases

Can you send output from dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }' please?

SDT probes don't work on 10; the additional (new) DTrace support only went into 11.
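If in doubt, you can first list which of these probes your release actually exposes (just a sketch; the probe names may differ between releases):

Code:
dtrace -l -n 'sdt:::ipf_fi_bad_*'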
 
I am running dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }' now. The gateway was just started up, so right now there are no NAT failures showing in ipfstat. The output of dtrace is:

Code:
[root@gateway02 ~]# dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }'
dtrace: description 'sdt:::ipf_fi_bad_* ' matched 34 probes
dtrace: buffer size lowered to 1m

So I guess that until there is a NAT failure, dtrace isn't going to show anything. I've never used dtrace!
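Maybe I'll just leave it running in the background and log to a file so I don't miss anything when a failure does happen; something like this should work (the output path is only an example):

Code:
nohup dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }' -o /var/tmp/ipf_bad.out &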
 
More than 6 hours later... Here's the ipfstat|egrep -i NAT output:

Code:
25      input block reason IPv4 NAT failure
0       input block reason IPv6 NAT failure
4       output block reason IPv4 NAT failure
0       output block reason IPv6 NAT failure

Meanwhile, dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }' yields nothing:

Code:
[root@gateway02 ~]# dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }'
dtrace: description 'sdt:::ipf_fi_bad_* ' matched 34 probes
dtrace: buffer size lowered to 1m

I'll keep the gateway running, but so far nothing comes forth from the dtrace command...

I really hope to get this bug fixed and not have to go to pf or ipfw!
 
Cy, here is the output of dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }':

Code:
[root@gateway02 ~]# dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }'
dtrace: description 'sdt:::ipf_fi_bad_* ' matched 34 probes
dtrace: buffer size lowered to 1m
CPU     ID                    FUNCTION:NAME
  0  58081 none:ipf_fi_bad_checkv4sum_manual
              ipl.ko`ipf_makefrip+0xf42
              ipl.ko`ipf_check+0x16a
              kernel`pfil_run_hooks+0x83
              kernel`ip_input+0x39d
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_demux+0x12a
              kernel`ether_nh_input+0x322
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_input+0x26
              kernel`if_input+0xa
              kernel`lem_rxeof+0x513
              kernel`lem_handle_rxtx+0x32
              kernel`taskqueue_run_locked+0x14a
              kernel`taskqueue_thread_loop+0xe8
              kernel`fork_exit+0x85
              kernel`0xffffffff80f8467e
 
I would guess that's because it's not a bad checksum problem, hence no matches from dtrace.

I would think tracing the NAT code with dtrace might show the problem, but it will probably be quite noisy.
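For example, to get a feel for which NAT functions are being hit before wading through full stacks, you could count calls per function with an aggregation first (a sketch, assuming the ipl module's NAT functions are visible to the fbt provider):

Code:
dtrace -n 'fbt:ipl:*nat*:entry { @calls[probefunc] = count(); }'

Press Ctrl-C once a failure shows up in ipfstat and the per-function counts are printed.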
 
I will post the results of ipnat -s after the server runs for a while. It primarily NATs http and smtp traffic. I am not running 11.0-RELEASE on this particular host all the time, as it's one of my gateways and I need reliability! But there are no problems flipping the server from one instance to another: it's running on ESXi 5.5, so it's just a matter of bringing one guest down and the other up.
 
Select output from ipfstat:
Code:
68      input block reason IPv4 NAT failure
0       input block reason IPv6 NAT failure
22      output block reason IPv4 NAT failure
0       output block reason IPv6 NAT failure

dtrace -n 'sdt:::ipf_fi_bad_* { stack(); }':
Code:
  0  58116           none:ipf_fi_bad_th_urg
              ipl.ko`ipf_makefrip+0xbb7
              ipl.ko`ipf_check+0x16a
              kernel`pfil_run_hooks+0x83
              kernel`ip_input+0x39d
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_demux+0x12a
              kernel`ether_nh_input+0x322
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_input+0x26
              kernel`if_input+0xa
              kernel`lem_rxeof+0x513
              kernel`lem_handle_rxtx+0x32
              kernel`taskqueue_run_locked+0x14a
              kernel`taskqueue_thread_loop+0xe8
              kernel`fork_exit+0x85
              kernel`0xffffffff80f8467e

  0  58119       none:ipf_fi_bad_th_rst_syn
              ipl.ko`ipf_makefrip+0xbb7
              ipl.ko`ipf_check+0x16a
              kernel`pfil_run_hooks+0x83
              kernel`ip_input+0x39d
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_demux+0x12a
              kernel`ether_nh_input+0x322
              kernel`netisr_dispatch_src+0xa5
              kernel`ether_input+0x26
              kernel`if_input+0xa
              kernel`lem_rxeof+0x513
              kernel`lem_handle_rxtx+0x32
              kernel`taskqueue_run_locked+0x14a
              kernel`taskqueue_thread_loop+0xe8
              kernel`fork_exit+0x85
              kernel`0xffffffff80f8467e

ipnat -s:
Code:
0       proxy create fail in
0       proxy fail in
68      bad nat in
68      bad nat new in
0       bad next addr in
81      bucket max in
0       clone nomem in
0       decap bad in
0       decap fail in
0       decap pullup in
0       divert dup in
0       divert exist in
68      drop in
0       exhausted in
0       icmp address in
0       icmp basic in
254     inuse in
0       icmp mbuf wrong size in
64      icmp header unmatched in
0       icmp rebuild failures in
0       icmp short in
0       icmp packet size wrong in
0       IFP address fetch failures in
56116   packets untranslated in
0       NAT insert failures in
3182    NAT lookup misses in
53828   NAT lookup nowild in
0       new ifpaddr failed in
0       memory requests failed in
0       table max reached in
3952    packets translated in
68      finalised failed in
0       search wraps in
0       null translations in
0       translation exists in
0       no memory in
0%      hash efficiency in
22.57%  bucket usage in
0       minimal length in
-1      maximal length in
0.000   average length in
0       proxy create fail out
0       proxy fail out
22      bad nat out
52      bad nat new out
0       bad next addr out
39      bucket max out
0       clone nomem out
0       decap bad out
0       decap fail out
0       decap pullup out
0       divert dup out
0       divert exist out
22      drop out
0       exhausted out
0       icmp address out
0       icmp basic out
257     inuse out
0       icmp mbuf wrong size out
2025    icmp header unmatched out
0       icmp rebuild failures out
0       icmp short out
0       icmp packet size wrong out
0       IFP address fetch failures out
62515   packets untranslated out
0       NAT insert failures out
4588    NAT lookup misses out
60641   NAT lookup nowild out
0       new ifpaddr failed out
0       memory requests failed out
0       table max reached out
2853    packets translated out
52      finalised failed out
0       search wraps out
0       null translations out
0       translation exists out
0       no memory out
0%      hash efficiency out
22.57%  bucket usage out
0       minimal length out
-1      maximal length out
0.000   average length out
0       log successes
0       log failures
245     added in
152     added out
0       active
0       transparent adds
0       divert build
397     expired
0       flush all
0       flush closing
0       flush queue
0       flush state
0       flush timeout
172     hostmap new
0       hostmap fails
32      hostmap add
0       hostmap NULL rule
0       log ok
0       log fail
0       orphan count
5       rule count
2       map rules
3       rdr rules
0       wilds

Switching the gateway back to 9.3-RELEASE for now. Hope this info helps!
 
Hi, you should try dtracing the NAT code; it will be noisy:

dtrace -f 'fbt:ipl:*nat* { stack(); }'

Every NAT probe will give stack() output, so expect a lot of it! It will be hard to tie the output of these 120+ probes to a specific NAT failure; ideally, if you can cause just one failure and timestamp it somehow against this output, that will help.
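One rough way to correlate, just as a sketch, is to print a wall-clock timestamp with each stack so you can line it up with when the ipfstat failure counters increment:

Code:
dtrace -n 'fbt:ipl:*nat*:entry { printf("%Y %s", walltimestamp, probefunc); stack(); }'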
 
Hello All,

This issue is still present, and now that 9.3 has reached its EOL date, a solution other than switching to pf/ipfw would be great. I upgraded several FreeBSD firewall systems from 9.3 to 10.x/11.0 and hit this issue on every single one, depending on the traffic passed. Different hardware, so it must be a software issue. In every case I rolled back to 9.3 because the access lines weren't usable anymore. Web traffic, for example: hitting the reload button 3 to 10 times before you get the page isn't fun. Now the day has come where staying on 9.3 isn't possible anymore.

Clearing the NAT table doesn't solve the problem. I currently have a test system running 11.0, and I need to reboot it to get proper access to the net again.
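For clarity, by clearing the NAT table I mean roughly the following (the rule file path is just an example from my setup):

Code:
ipnat -F                        # flush the active translation table entries
ipnat -CF -f /etc/ipnat.rules   # also clear the rule list and reload the rules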

If I can supply any information or data, I would be glad to do so.

Best Regards
Markus
 
I'm experiencing the same thing on 11.2-RELEASE-p5. My firewall and NAT rules were copied from an older (FreeBSD 7) server. I see the problem with IP Filter v5.1.2. Please help!
 
My suggestion is to move over to pf. I'm not sure if Cy's fixes ever made it into the release code, but overall, pf is much more versatile than ipf and not that hard to learn. Otherwise, set up a cron job to reboot the firewall every 24 hours or so.
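If you go the cron route, an /etc/crontab entry along these lines would do it (the time is just an example; pick a quiet window):

Code:
# minute hour mday month wday who  command
0        4    *    *     *    root /sbin/shutdown -r now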
 
Seconded. IPF in FreeBSD is (was?) supported by a single person, and it looks like that support has petered out. PF, on the other hand, enjoys constant attention from multiple developers because it's so widely used in FreeBSD.
 