ena interface down and up repeatedly

Jaehak Lee · Oct 26, 2018

I have a FreeBSD 11.2 instance in AWS.
It was installed with 11.1 and upgraded to 11.2 yesterday.

Code:

# uname -a
FreeBSD db-20 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0: Thu Sep 27 08:16:24 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

# kldstat
Id Refs Address            Size     Name
1   16 0xffffffff80200000 20647f8  kernel
2    1 0xffffffff82266000 19120    if_ena.ko
3    1 0xffffffff82280000 381080   zfs.ko
4    2 0xffffffff82602000 a380     opensolaris.ko
5    1 0xffffffff82819000 1820     fdescfs.ko

After upgrading to 11.2, I see these entries repeating in /var/log/messages:

Code:

# tail -n 100 /var/log/messages
Oct 26 11:26:47 db-20 kernel: ena0: device is going DOWN
Oct 26 11:26:47 db-20 kernel: ena0: device is going UP
Oct 26 11:26:47 db-20 kernel: ena0: queue 0 - cpu 0
Oct 26 11:26:47 db-20 kernel: ena0: queue 1 - cpu 1
Oct 26 11:26:47 db-20 kernel: ena0: queue 2 - cpu 2
Oct 26 11:26:47 db-20 kernel: ena0: queue 3 - cpu 3
Oct 26 11:56:47 db-20 kernel: ena0: device is going DOWN
Oct 26 11:56:47 db-20 kernel: ena0: device is going UP
Oct 26 11:56:47 db-20 kernel: ena0: queue 0 - cpu 0
Oct 26 11:56:47 db-20 kernel: ena0: queue 1 - cpu 1
Oct 26 11:56:47 db-20 kernel: ena0: queue 2 - cpu 2
Oct 26 11:56:47 db-20 kernel: ena0: queue 3 - cpu 3
Oct 26 12:26:47 db-20 kernel: ena0: device is going DOWN
Oct 26 12:26:48 db-20 kernel: ena0: device is going UP
Oct 26 12:26:48 db-20 kernel: ena0: queue 0 - cpu 0
Oct 26 12:26:48 db-20 kernel: ena0: queue 1 - cpu 1
Oct 26 12:26:48 db-20 kernel: ena0: queue 2 - cpu 2
Oct 26 12:26:48 db-20 kernel: ena0: queue 3 - cpu 3
Oct 26 12:56:47 db-20 kernel: ena0: device is going DOWN
Oct 26 12:56:47 db-20 kernel: ena0: device is going UP
Oct 26 12:56:47 db-20 kernel: ena0: queue 0 - cpu 0
Oct 26 12:56:47 db-20 kernel: ena0: queue 1 - cpu 1
Oct 26 12:56:47 db-20 kernel: ena0: queue 2 - cpu 2
Oct 26 12:56:47 db-20 kernel: ena0: queue 3 - cpu 3
Oct 26 13:26:47 db-20 kernel: ena0: device is going DOWN
Oct 26 13:26:48 db-20 kernel: ena0: device is going UP
Oct 26 13:26:48 db-20 kernel: ena0: queue 0 - cpu 0
Oct 26 13:26:48 db-20 kernel: ena0: queue 1 - cpu 1
Oct 26 13:26:48 db-20 kernel: ena0: queue 2 - cpu 2
Oct 26 13:26:48 db-20 kernel: ena0: queue 3 - cpu 3

It repeats every 30 minutes.
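
One way to confirm the interval is to compute the gap between consecutive "going DOWN" lines. This is a minimal sketch, assuming the syslog timestamp format shown above (time in the third field); down_intervals is a hypothetical helper name, not an existing tool:

```shell
# Hypothetical helper: print the number of seconds between consecutive
# "ena0: device is going DOWN" entries in syslog-formatted input.
# Assumes the time (HH:MM:SS) is the third whitespace-separated field.
# Reads the files given as arguments, or stdin when none are given.
down_intervals() {
    awk '/ena0: device is going DOWN/ {
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            gap = now - prev
            if (gap < 0) gap += 86400   # entry crossed midnight
            print gap
        }
        prev = now
        seen = 1
    }' "$@"
}

# Usage: down_intervals /var/log/messages
```

A steady stream of 1800 in the output would mean the resets are exactly 30 minutes apart.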

Is there a problem in the ena driver?

ikbendeman · Oct 29, 2018

Is there anything in dmesg -a | grep ena?

Jaehak Lee · Oct 30, 2018

Here is the result of dmesg -a.

Code:

 # dmesg -a | grep -E "ena|ENA"
ena0: <ENA adapter> mem 0x83000000-0x83003fff at device 3.0 on pci0
ena0: Elastic Network Adapter (ENA)ena v0.7.0
ena0: initalize 4 io queues
ena0: Ethernet address: 06:4d:4b:64:e1:86
ena0: Allocated msix_entries, vectors (cnt: 5)
ena0: evtchn0: link is UP
ena0: link state changed to UP
xbd0: synchronize cache commands enabled.
xbd14: synchronize cache commands enabled.
xbd13: synchronize cache commands enabled.
xbd12: synchronize cache commands enabled.
xbd11: synchronize cache commands enabled.
xbd10: synchronize cache commands enabled.
xbd9: synchronize cache commands enabled.
xbd8: synchronize cache commands enabled.
xbd7: synchronize cache commands enabled.
xbd6: synchronize cache commands enabled.
xbd5: synchronize cache commands enabled.
ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
DHCPREQUEST on ena0 to 255.255.255.255 port 67
ena0: device is going DOWN
ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
Starting Network: lo0 ena0.
ena0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9001
        inet6 fe80::44d:4bff:fe64:e186%ena0 prefixlen 64 scopeid 0x1
ena0: device is going DOWN
ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
... repeats 3 times more
ena0: Found a Tx that wasn't completed on time, qid 2, index 65.
ena0: device is going DOWN
ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
... repeats continually every 30 minutes

Thanks.

ikbendeman · Oct 30, 2018

Have you tried temporarily setting a static IP to see if the interface stays up? The other option would be to increase the syslog logging level, which I haven't messed with for some time. Is it precisely every 30 minutes?

ikbendeman · Oct 30, 2018

If it is precisely every 30 mins, maybe a look at /etc/crontab would help. How is DHCP/ifconfig set in /etc/rc.conf?

Jaehak Lee · Oct 31, 2018

It doesn't go down when I set a static IP.
But is setting a static IP the right way to configure an AWS instance?

It repeats precisely every 30 mins.

Code:

# cat /var/log/messages
Oct 31 08:47:25 db-20 kernel: ena0: device is going DOWN
Oct 31 08:47:25 db-20 kernel: ena0: device is going UP
Oct 31 08:47:25 db-20 kernel: ena0: queue 0 - cpu 0
Oct 31 08:47:25 db-20 kernel: ena0: queue 1 - cpu 1
Oct 31 08:47:25 db-20 kernel: ena0: queue 2 - cpu 2
Oct 31 08:47:25 db-20 kernel: ena0: queue 3 - cpu 3
Oct 31 09:17:25 db-20 kernel: ena0: device is going DOWN
Oct 31 09:17:25 db-20 kernel: ena0: device is going UP
Oct 31 09:17:25 db-20 kernel: ena0: queue 0 - cpu 0
Oct 31 09:17:25 db-20 kernel: ena0: queue 1 - cpu 1
Oct 31 09:17:25 db-20 kernel: ena0: queue 2 - cpu 2
Oct 31 09:17:25 db-20 kernel: ena0: queue 3 - cpu 3
Oct 31 09:47:25 db-20 kernel: ena0: device is going DOWN
Oct 31 09:47:26 db-20 kernel: ena0: device is going UP
Oct 31 09:47:26 db-20 kernel: ena0: queue 0 - cpu 0
Oct 31 09:47:26 db-20 kernel: ena0: queue 1 - cpu 1
Oct 31 09:47:26 db-20 kernel: ena0: queue 2 - cpu 2
Oct 31 09:47:26 db-20 kernel: ena0: queue 3 - cpu 3

No suspicious settings in /etc/crontab

Code:

# cat /etc/crontab
# /etc/crontab - root's crontab for FreeBSD
#
# $FreeBSD: releng/11.2/etc/crontab 194170 2009-06-14 06:37:19Z brian $
#
SHELL=/bin/sh
PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin
#
#minute hour    mday    month   wday    who     command
#
*/5     *       *       *       *       root    /usr/libexec/atrun
#
# Save some entropy so that /dev/random can re-seed on boot.
*/11    *       *       *       *       operator /usr/libexec/save-entropy
#
# Rotate log files every hour, if necessary.
0       *       *       *       *       root    newsyslog
#
# Perform daily/weekly/monthly maintenance.
1       3       *       *       *       root    periodic daily
15      4       *       *       6       root    periodic weekly
30      5       1       *       *       root    periodic monthly
#
# Adjust the time zone if the CMOS clock keeps local time, as opposed to
# UTC time.  See adjkerntz(8) for details.
1,31    0-5     *       *       *       root    adjkerntz -a

Interface setting in /etc/rc.conf

Code:

# cat /etc/rc.conf | grep -E "interface|if"
ifconfig_DEFAULT="SYNCDHCP accept_rtadv"
ipv6_activate_all_interfaces="YES"

ikbendeman · Oct 31, 2018

try:

Code:

ifconfig_ena0="DHCP accept_rtadv"

It may be SYNCDHCP causing problems. Do you happen to know what dhcpd your dhcp server runs?

ikbendeman · Oct 31, 2018

To me, it looks like one of three things: the wrong setting, a carrier that doesn't support synchronous mode (which I've never had to use, so...), or a conflict between the dhcpd and the client.

Jaehak Lee · Oct 31, 2018

I also found a message in dmesg.
It says "bound to 10.1.20.20 -- renewal in 1800 seconds."
This is the relevant part of dmesg:

Code:

ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
Starting dhclient.
DHCPREQUEST on ena0 to 255.255.255.255 port 67
DHCPACK from 10.1.20.1
ena0: device is going DOWN
ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
ena0: queue 2 - cpu 2
ena0: queue 3 - cpu 3
bound to 10.1.20.20 -- renewal in 1800 seconds.
/etc/rc.d/dhclient: WARNING: failed to start dhclient
Starting Network: lo0 ena0.

But here is another instance with an ena interface:

Code:

# uname -a
FreeBSD adc-web-10 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Its dmesg shows this
(the same message, "bound to 10.1.20.10 -- renewal in 1800 seconds.", but the interface does not go down):

Code:

ena0: device is going UP
ena0: queue 0 - cpu 0
ena0: queue 1 - cpu 1
Starting dhclient.
DHCPREQUEST on ena0 to 255.255.255.255 port 67
DHCPACK from 10.1.20.1
bound to 10.1.20.10 -- renewal in 1800 seconds.
/etc/rc.d/dhclient: WARNING: failed to start dhclient
Starting Network: lo0 ena0.

Both of these instances use ena driver 0.7.0.

I also found this link:
amzn-drivers : ena : skip setting the MTU for ENA if it is not changing
Is it related to my case?
The ena driver version there is 0.8.1.

Thanks ikbendeman.

ikbendeman · Oct 31, 2018

Your rc.conf is part of the problem, it would appear. If a static IP works, then you probably don't need synchronous mode; change DEFAULT to ena0. Unless your service provider/modem requires SYNCDHCP, don't enable it. I think it's mostly used for fiber NICs or integrated modems, but that's not my area of expertise. Did you try the settings above? If those don't work, do you see anything from sysctl -a | grep ena? I would try that first, before switching to different upstream code or applying a diff (probably what you would have to do to fix that kmod issue). I'll look to see if there have been changes in the driver on 11-STABLE.

ikbendeman · Oct 31, 2018

On AWS, a network interface can get reinitialized every 30 minutes due
to the MTU being (re)set when a new DHCP lease is obtained.
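
The idea named in the amzn-drivers link earlier in the thread, skipping the MTU set when the value would not change, can be illustrated at the shell level. This is only a sketch and not the actual driver or dhclient patch; mtu_action is a hypothetical name, and it assumes ifconfig output where the MTU follows the word "mtu" on the first line, as in the dmesg output above:

```shell
# Hypothetical helper: decide whether an MTU change is needed.
#   $1 = first line of `ifconfig <interface>` output
#   $2 = MTU offered in the DHCP lease
# Prints "skip" when the interface already has that MTU, "set" otherwise.
mtu_action() {
    current=$(printf '%s\n' "$1" | sed -n 's/.* mtu \([0-9][0-9]*\).*/\1/p')
    if [ "$current" = "$2" ]; then
        echo skip
    else
        echo set
    fi
}

# Example on a live system (interface name assumed to be ena0):
#   mtu_action "$(ifconfig ena0 | head -n 1)" 9001
```

With the interface already at mtu 9001 and the lease offering 9001, the guard reports "skip", so no reinitialization would be triggered by the renewal.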

If DHCP works for you instead of SYNCDHCP, you won't have to worry about the MTU bug. According to what I've found in the AWS docs, you shouldn't need SYNCDHCP unless you've set up the dhcpd yourself, are using software that requires it, or are running a slave node that, for some reason, doesn't have access to the master (dhcpd) requirements.

ikbendeman · Oct 31, 2018

Otherwise you can apply that diff to your source tree and rebuild the kernel module, or upgrade to 11-STABLE, as my source tree (11-STABLE) has the fix applied. Hope this helped. Any other questions, let me know.

Jaehak Lee · Oct 31, 2018

Setting DHCP instead of SYNCDHCP did not work.
The ena interface still goes down.

Thanks a lot ikbendeman.
I'll check the diff between 11.1 and 11.2.

SirDice · Oct 31, 2018

There seems to be some misconception regarding the difference between SYNCDHCP and DHCP. The only thing SYNCDHCP does differently from DHCP is that the boot scripts stop and wait for the interface to actually get an IP address before continuing the boot process. It does NOT change how dhclient(8) operates. A 'regular' DHCP simply starts dhclient(8) in the background and doesn't wait for the interface to actually receive anything. This could potentially cause problems with services getting started before the interface has an IP address.
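
As an illustration of the two forms described above (the interface name ena0 is taken from this thread, and only one of the two lines would be active at a time):

```shell
# /etc/rc.conf

# Boot scripts wait until ena0 has obtained an address before continuing:
ifconfig_ena0="SYNCDHCP"

# Alternative: dhclient(8) is started in the background and the boot
# continues without waiting for an address:
#ifconfig_ena0="DHCP"
```

Either way the same dhclient(8) runs and renews the lease on the same schedule, so switching between the two would not be expected to change the 30-minute behaviour.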

ikbendeman · Oct 31, 2018

Thanks, my bad. Like I said, I'd not used it. I was making the (obviously false) assumption that it had to do with synchronous mode on the adapter for some cloud-based application requirement, i.e. IP hopping, but alas I didn't man

ban25 · Nov 8, 2018

I've experienced the same issue since upgrading to 11.2. This didn't happen on prior versions -- I've been using FreeBSD on EC2 since 10.0.

bookwormep · Nov 9, 2018

I had a similar problem with a link endlessly going UP and DOWN. My hardware uses a different driver than yours, but here is a method I used which seems to work:

https://forums.freebsd.org/threads/wireless-network-using-the-iwn-4-driver.63606/

You add the "-ht" option to your ifconfig wlan0 settings and save it in /etc/rc.conf.
This option disables High Throughput. It is detailed in the link above, hope it helps.

ikbendeman · Nov 9, 2018

He probably doesn't want to disable high throughput. It looks like the driver issue was causing DHCP to retry every 30 minutes even though one instance had already received an ACK from the DHCP server. Did patching fix it?

Phishfry · Nov 9, 2018

ht is for wifi channels. aws is the cloud.
ht20 ht40 are different modes for wireless connections.
http://support.huawei.com/enterprise/en/knowledge/EKB1000079063

ikbendeman · Nov 9, 2018

Isn't ena0 10 Gb Ethernet? If I remember correctly (and it's likely deprecated), low throughput mode on Ethernet adapters would drop them down to 10M.

Phishfry · Nov 9, 2018

https://medium.com/@paccattam/aws-enhanced-networking-an-overview-aee8a852cf5c

Phishfry · Nov 9, 2018

Back to the original question. The first place I start when I have ethernet problems is to turn on debugging for the interface.
This is done via the sysctl function. You need to find the correct area to add the debug setting.
I would start with sysctl -a |grep ena_sysctl
If that is the correct location then add a sysctl for debugging:
sysctl ena_sysctl.0.debug=1
Then your /var/log/debug.log should show the problem.
You might also want to look at your /var/log/messages and /var/log/devd.log for clues.

I found the sysctl name here:
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena

Phishfry · Nov 9, 2018

I wonder if it's not an IPv6 config problem.
What happens if you only use:
/etc/rc.conf
ifconfig_DEFAULT="SYNCDHCP"

Phishfry · Nov 9, 2018

This seems like it might be worth trying: net/dhcpdump for debugging dhclient.
http://www.freebsdonline.com/content/view/713/524/

Jaehak Lee · Nov 22, 2018

Hi, I didn't notice so many replies.

Thank you Phishfry for replying.

Here is what I tested:

sysctl -a |grep ena_sysctl has no output.
ifconfig_DEFAULT="SYNCDHCP" has no effect.

I installed dhcpdump and got the result below
(but I can't work out what it means).

Are there any clues in it?

Result before the change, with ifconfig_DEFAULT="SYNCDHCP accept_rtadv":

Code:

root@db-20:~ # dhcpdump -i ena0



  TIME: 2018-11-21 10:54:41.937
    IP: 10.1.20.20 (06:4d:4b:64:e1:86) > 10.1.20.1 (06:a3:79:60:4d:3e)
    OP: 1 (BOOTPREQUEST)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 4ab5b9c3
  SECS: 0
 FLAGS: 0
CIADDR: 10.1.20.20
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 06:4d:4b:64:e1:86:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         3 (DHCPREQUEST)
OPTION:  61 (  7) Client-identifier         01:06:4d:4b:64:e1:86
OPTION:  12 (  5) Host name                 db-20
OPTION:  55 ( 10) Parameter Request List      1 (Subnet mask)
                                             28 (Broadcast address)
                                              2 (Time offset)
                                            121 (Classless Static Route)
                                              3 (Routers)
                                             15 (Domainname)
                                              6 (DNS server)
                                             12 (Host name)
                                            119 (Domain Search)
                                             26 (Interface MTU)

---------------------------------------------------------------------------

  TIME: 2018-11-21 10:54:41.938
    IP: 10.1.20.1 (06:a3:79:60:4d:3e) > 10.1.20.20 (06:4d:4b:64:e1:86)
    OP: 2 (BOOTPREPLY)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 4ab5b9c3
  SECS: 0
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 10.1.20.20
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 06:4d:4b:64:e1:86:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         5 (DHCPACK)
OPTION:  54 (  4) Server identifier         10.1.20.1
OPTION:  51 (  4) IP address leasetime      3600 (60m)
OPTION:   1 (  4) Subnet mask               255.255.255.0
OPTION:  28 (  4) Broadcast address         10.1.20.255
OPTION:  15 ( 31) Domainname                ap-northeast-1.compute.internal
OPTION:   6 (  4) DNS server                10.1.0.2
OPTION:  12 ( 13) Host name                 ip-10-1-20-20
OPTION:  26 (  2) Interface MTU             9001
OPTION:   3 (  4) Routers                   10.1.20.1
---------------------------------------------------------------------------

Result after changing to ifconfig_DEFAULT="SYNCDHCP":

Code:

  TIME: 2018-11-21 11:24:41.243
    IP: 10.1.20.20 (06:4d:4b:64:e1:86) > 10.1.20.1 (06:a3:79:60:4d:3e)
    OP: 1 (BOOTPREQUEST)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 4ab5b9c3
  SECS: 0
 FLAGS: 0
CIADDR: 10.1.20.20
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 06:4d:4b:64:e1:86:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         3 (DHCPREQUEST)
OPTION:  61 (  7) Client-identifier         01:06:4d:4b:64:e1:86
OPTION:  12 (  5) Host name                 db-20
OPTION:  55 ( 10) Parameter Request List      1 (Subnet mask)
                                             28 (Broadcast address)
                                              2 (Time offset)
                                            121 (Classless Static Route)
                                              3 (Routers)
                                             15 (Domainname)
                                              6 (DNS server)
                                             12 (Host name)
                                            119 (Domain Search)
                                             26 (Interface MTU)

---------------------------------------------------------------------------

  TIME: 2018-11-21 11:24:41.244
    IP: 10.1.20.1 (06:a3:79:60:4d:3e) > 10.1.20.20 (06:4d:4b:64:e1:86)
    OP: 2 (BOOTPREPLY)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 4ab5b9c3
  SECS: 0
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 10.1.20.20
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 06:4d:4b:64:e1:86:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         5 (DHCPACK)
OPTION:  54 (  4) Server identifier         10.1.20.1
OPTION:  51 (  4) IP address leasetime      3600 (60m)
OPTION:   1 (  4) Subnet mask               255.255.255.0
OPTION:  28 (  4) Broadcast address         10.1.20.255
OPTION:  15 ( 31) Domainname                ap-northeast-1.compute.internal
OPTION:   6 (  4) DNS server                10.1.0.2
OPTION:  12 ( 13) Host name                 ip-10-1-20-20
OPTION:  26 (  2) Interface MTU             9001
OPTION:   3 (  4) Routers                   10.1.20.1
---------------------------------------------------------------------------
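
For reading the captures above: the ACK carries a lease time of 3600 seconds and no explicit renewal time (no option 58 appears in the dump), and the two requests are 30 minutes apart (10:54:41 and 11:24:41). Under RFC 2131 the renewal timer T1 defaults to half the lease time, which matches the "renewal in 1800 seconds" line in dmesg. A small sketch of that arithmetic, with renewal_seconds as a hypothetical helper name:

```shell
# Hypothetical helper: default DHCP renewal time (T1) for a given lease
# time in seconds, using the RFC 2131 default of half the lease.
renewal_seconds() {
    echo $(( $1 / 2 ))
}

renewal_seconds 3600   # lease time from option 51 in the dump; prints 1800
```

So each interface reset lines up with a lease renewal, and both ACKs again carry Interface MTU 9001, which seems consistent with the MTU being set again on every renewal.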
