FreeBSD 12.2 - Mysterious offline problem

wxppro · Nov 22, 2020

Hello fellow FreeBSD users and experts,

I have been using FreeBSD for some time now. I have some background in Unix and I found it very familiar in many ways. In comparison to Linux, I like FreeBSD being very straightforward.

I am using FreeBSD 12.2 for a home server and I ran into a strange problem. The server will go offline (lost network connectivity) for some time, and then resume. I suspect it is related to memory management. But during the time period that the system goes offline, the log does not show anything. My question hence is, what debugging/logging can I enable so that I can see what happened?

This is an Azulle Byte 3 mini computer with 4GB memory. I have the FreeBSD OS installed on its built-in 32GB eMMC disk. I have another mechanical hard drive attached to the system. The hard drive is in one ZFS pool. On the system, I have Samba and Syncthing installed. I created two virtual machines using Bhyve to run Pi-Hole. I also have a jail to run Transmission. In terms of networking, I created a virtual bridge (bridge0) that includes the main NIC (re0), two tap interfaces for the virtual machines, and an epair interface for the jail.

The system has worked fine almost all the time. It is not a high stress system. But occasionally, it will become unresponsive for some time. It is like once or twice in a week. It may last 15 to 30 minutes. During this time, ping returns host down. Then it comes back online as if nothing has happened. I checked the system log (/var/log/messages). No information is there. The syncthing log shows that it failed to connect to other servers, indicating that network connectivity was lost for some time. I also checked the system log in the virtual machines. They are Debian systems. They are also like nothing ever happened.

I read somewhere that if FreeBSD runs into memory shortage, it will try to page memory to disk. Processes may hang after trying to access the disk for more than 20 seconds. It may resume once the paging is finally done. Revise the default vfs.zfs.arc_max parameter (75% of available memory) may help. So I changed it to be 2147483648 (2GB). With this change, it seems this lost connectivity thing happens less frequently. But it still happens.

I am not sure if it is the reason. But my challenge is that I need to get some information first. How can I obtain more information about this offline period? No network access. Attaching a monitor only gives a blank screen.

Thanks much!

ralphbsz · Nov 23, 2020

Down for 15 minutes or half hour? Explaining that with ARC, memory pressure, or paging IO is virtually impossible. Look at it this way: In half hour, the mechanical disk can 180,000 IOs, the SSD many times more. Paging all of the 4GiB of memory to the hard disk takes 40 seconds (much less on the root disk), you are hanging 20 or 50 times longer.

Also, I'm not sure I believe the correlation with adjusting arc_max. I'm not saying that you're wrong, I'm just saying that to actually believe that, you would have to present some measurements (like rate of hangs/day before and after).

Here is my suggestion: You said you have a monitor attached. Get that to work, so you can log in on it. Blank screen might be as easy to fix as power-cycling the monitor once (works for mine, for some reason FreeBSD booting confuses my monitor, which admittedly was bought for $99 20 years ago). Log in, start something like top on it, or perhaps (if you can have multiple windows, for example with screen) do vmstat in one, iostat in another, and gstat in a third (there is a lot of overlap), and perhaps even have a few pings going (ping an internal address, like your thermostat or weather station or whatever device is always online, and ping 8.8.8.8 or 1.1.1.1). Then, when the outage happens, walk over to the monitor and report what you see.

If no monitor: write a little script that gathers the output from utilities such as vmstat and friends (see above), and logs it into a file. Since your outages are long enough to be noticeable, you don't need much, just one every 15 seconds or so. Then see whether that keeps going, and whether you see any activity.

Final super-simple suggestion: when the outage happen, look at the disk activity light, ethernet activity light, and (if you have an ethernet hub with lights), the traffic on the network.

Argentum · Nov 23, 2020

wxppro said:
I am using FreeBSD 12.2 for a home server and I ran into a strange problem. The server will go offline (lost network connectivity) for some time, and then resume. I suspect it is related to memory management. But during the time period that the system goes offline, the log does not show anything. My question hence is, what debugging/logging can I enable so that I can see what happened?

If there is nothing in syslog, try to get something in. In such case I would put something like ifconfig|logger into crontab. Just record the output of your network interface every minute and see how does it look like.

goshanecr · Nov 23, 2020

Maybe problem in if_re(), it has very annoying bugs like: PR 166724, so first of all I think try to check solutions like:
- other nic
- tuning buf sysctl's described in provided bug PR.

For me it helps.

wxppro · Nov 24, 2020

Thanks much, ralphbsz. I know the long hanging time does not make sense, if paging is done correctly. I do have a reason to look at memory management. It is because the problem became less severe once a M.2 SSD is inserted and swap partition is moved to it. Previously, the swap partition is on the much slower eMMC disk. The machine would not resume at all (after 2~3 hours), or maybe I did not give it enough time to resume. I remembered seeing error messages of “out of swap space” once rebooted. The two bhyve virtual machines were killed. Later I inserted a M.2 SSD and moved the swap partition (I made it 8GB) to the SSD. My observation is that the problem still happens but seems less frequently. More importantly, the system can resume to normal and the two virtual machines are not killed. That leads me to suspect that a faster SSD makes the problem less severe. Then I read that a process may hang if not succeed in accessing the disk for 20 seconds (https://www.freebsd.org/doc/en/books/faq/troubleshoot.html#idp59131080). If there is a glitch in the paging process, a reasonable hypothesis is that the system resumes to normal once paging is finally done. It feels like the system just freezes for some time. And there is no log of error.

I must admit that there is a bias in this reasoning – kinda like fixed onto a hypothesis and looking for proofs. You are completely right that I need to collect more information, otherwise it will be a long shot in the darkness. I have implemented a monitoring mechanism on another peer LAN machine. It will ping the server and if not reachable, it will send an email to me. Hopefully I can then use the monitor during the down time (yes, it works as long as it is not attached after booting). The challenge is that, so far the down time happened at 2 or 3AM… I guess I will try to set up some cron job to do the data collection for me (thanks to Argentum!). Hope it will catch some useful information.

This server is not a heavy duty one. I cannot think of a trigger at 2 or 3AM. Maybe syncthing is trying to scan the disk. It is probably the only CPU/memory intensive app. Maybe it does not work well with ZFS? So somehow it has stressed the system?

And thanks to goshanecr – I am actually aware of the Realtek NIC problem (watchdog time out). I have replaced the built-in re0 driver with a compiled one using Realtek’s latest driver. But I will read the bug report to see if Realteck NIC is the culprit.

FYI - dmesg output regarding re0:

Code:

re0: <Realtek PCIe GbE Family Controller> port 0xe000-0xe0ff mem 0x81204000-0x81204fff,0x81200000-0x81203fff irq 22 at device 0.0 on pci1
re0: Using Memory Mapping!
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: version:1.96.04
re0: Ethernet address: xx:xx:xx:xx:xx:xx
re0: Ethernet address: xx:xx:xx:xx:xx:xx
re0: link state changed to UP
re0: promiscuous mode enabled

Many thanks to you all!

wxppro · Nov 30, 2020

Today my server had another outage and I have collected some information. Before getting into the details, I want to ask a question: What could cause a FreeBSD system to delay timed actions? That is, a script is supposed to run a task every two minutes, but it took the system way longer to get to it. The intervals in between were 17 and 7 minutes.

OK, here we go. Following folks’ suggestion, I implemented a monitoring mechanism and recorded the information into log files. To make it easier to understand, I have the following device names:

pfSense: My router
Samba: The FreeBSD machine of interest
Pihole1: Virtual machine 1 on the FreeBSD machine
Jail: A vnet peer jail on the FreeBSD machine.
Printer: A network printer.

The monitoring mechanism includes:

Ping from my router to several devices every two minutes.
Ping from Samba to Pihole1, Jail, pfSense, Printer, Google DNS, and CNN every two minutes.
Record system information on Samba every five minutes: ifconfig, vmstat, and top.

The outage

I ran a script on my router (pfSense). Every two minutes, it pings a bunch of devices and records an error message if not reachable. Here is the script:

Code:

while true
   do
      date >> /root/ping.log
      echo "Ping Samba, Pihole, Jail, and Printer.  Silence is good." >> /root/ping.log

      ping -c 4 192.168.xx.xxx > /dev/null
      if [ "$?" -ne 0 ]; then
        echo "Cannot reach Samba." >> /root/ping.log
      fi

     ### Repeat the above four lines for other devices.

      sleep 120
   done

Below is the part that is related to the outage. It is clear that the outage started sometime between 19:27:42 and 19:30:03, and ended before 19:50:14.

[FONT=courier new]Sun Nov 29 19:27:42 EST 2020
Ping Samba, Pihole1, Jail, and Printer. Silence is good.

Sun Nov 29 19:30:03 EST 2020
Ping Samba, Pihole1, Jail, and Printer. Silence is good.
Cannot reach Samba
Cannot reach Pihole1.
Cannot reach Jail.

Sun Nov 29 19:32:48 EST 2020
Ping Samba, Pihole1, Jail, and Printer. Silence is good.
Cannot reach Samba.
Cannot reach Pihole1.
Cannot reach Jail.

### The same repeated every two minutes ……

Sun Nov 29 19:47:20 EST 2020
Ping Samba, Pihole1, Jail, and Printer. Silence is good.
Cannot reach Samba.
Cannot reach Pihole1.
Cannot reach Jail.

Sun Nov 29 19:50:14 EST 2020
Ping Samba, Pihole, Jail, and Printer. Silence is good.[/FONT]

Log on Samba

Like happened before, Samba became inaccessible for about 20 minutes. After it came back online, I checked the logs on Samba.

The first log is the same ping log as implemented on the router. I set it to ping the router, virtual machine 1 on itself, the LAN printer, Google DNS (8.8.8.8) and cnn.com. The interval is two minutes. What is unusual in the log is the prolonged intervals during the outage period. See comments below preceded with ###.

[FONT=courier new]Sun Nov 29 19:26:37 EST 2020
Ping pfSense, Pihole1, Printer, Google DNS, and CNN. Silence is good.[/FONT]

[FONT=arial]### You can see that the logs are about 2 minutes apart.[/FONT]

[FONT=courier new]Sun Nov 29 19:28:56 EST 2020
Ping pfSense, Pihole1, Printer, Google DNS, and CNN. Silence is good.[/FONT]

[FONT=arial]### At this point, the outage should have happened. The weird thing is that the next log entry is indeed 17 minutes later.[/FONT]

[FONT=courier new]Sun Nov 29 19:45:29 EST 2020
Ping pfSense, Pihole1, Printer, Google DNS, and CNN. Silence is good.
Cannot reach Pihole1.[/FONT]

[FONT=arial]### The next log entry is 7 minutes later.[/FONT]

[FONT=courier new]Sun Nov 29 19:52:14 EST 2020
Ping pfSense, Pihole1, Printer, Google DNS, and CNN. Silence is good.[/FONT]

[FONT=arial]### Outage is gone by now, and the two minutes interval resumed.[/FONT]

[FONT=courier new]Sun Nov 29 19:54:32 EST 2020
Ping pfSense, Pihole1, Printer, Google DNS, and CNN. Silence is good.[/FONT]

I also recorded ifconfig, vmstat, and top on a five-minute interval. A similar pattern is that, during the outage, the system had a prolonged interval to record entries.

Code:

### The previous entry is at 19:20:00, five minutes earlier.

Sun Nov 29 19:25:00 EST 2020
re0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=2019<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,WOL_MAGIC>
    ether xx:xx:xx:xx:xx:xx
    inet 192.168.xx.xxx netmask 0xffffff00 broadcast 192.168.xx.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether xx:xx:xx:xx:xx:xx
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto stp-rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: epair3a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 6 priority 128 path cost 2000
    member: tap2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 5 priority 128 path cost 2000000
    member: tap1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 4 priority 128 path cost 2000000
    member: re0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 1 priority 128 path cost 20000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
tap1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether xx:xx:xx:xx:xx:xx
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    Opened by PID 1768
tap2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether xx:xx:xx:xx:xx:xx
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    Opened by PID 1769
epair3a: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8<VLAN_MTU>
    ether xx:xx:xx:xx:xx:xx
    inet6 fe80::29:72ff:fe00:d50a%epair3a prefixlen 64 scopeid 0x6
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

procs     memory        page                    disks     faults         cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 mm0   in    sy    cs us sy id
3 0 0 542968048  109784    79   3   0   0   118  375   0   0  187  1088  1886  1  4 95

last pid: 38976;  load averages:  0.15,  0.36,  0.34  up 2+04:44:16    19:25:00
54 processes:  1 running, 51 sleeping, 2 zombie
CPU:  0.1% user,  0.7% nice,  1.3% system,  3.0% interrupt, 95.0% idle
Mem: 608M Active, 1056M Inact, 612M Laundry, 1346M Wired, 107M Free
ARC: 509M Total, 185M MFU, 124M MRU, 1912K Anon, 8247K Header, 191M Other
     72M Compressed, 404M Uncompressed, 5.63:1 Ratio
Swap: 8192M Total, 1228M Used, 6964M Free, 14% Inuse


Sun Nov 29 19:30:07 EST 2020

### ifconfig information is exactly the same as above.  Omitted to save some space.

procs     memory        page                    disks     faults         cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 mm0   in    sy    cs us sy id
1 0 0 543019084  195128    80   3   0   0   120  378   0   0  188  1090  1890  1  4 95

last pid: 39090;  load averages: 24.96,  7.20,  2.88  up 2+04:49:24    19:30:08
59 processes:  1 running, 56 sleeping, 2 zombie
CPU:  0.1% user,  0.7% nice,  1.3% system,  3.0% interrupt, 95.0% idle
Mem: 1337M Active, 90M Inact, 859M Laundry, 1253M Wired, 190M Free
ARC: 485M Total, 61M MFU, 226M MRU, 3104K Anon, 5053K Header, 190M Other
     107M Compressed, 202M Uncompressed, 1.89:1 Ratio
Swap: 8192M Total, 1518M Used, 6674M Free, 18% Inuse

### Note that the next entry is 20 minutes later.

Sun Nov 29 19:50:00 EST 2020

### Again, ifconfig information is the same.

procs     memory        page                    disks     faults         cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 mm0   in    sy    cs us sy id
10 0 0 543026396  275084    79   3   0   0   119  378   0   0  187  1084  1884  1  5 94

last pid: 39135;  load averages: 31.00, 41.30, 37.09  up 2+05:09:16    19:50:00
61 processes:  2 running, 57 sleeping, 2 zombie
CPU:  0.1% user,  0.7% nice,  1.4% system,  3.4% interrupt, 94.4% idle
Mem: 1258M Active, 17M Inact, 856M Laundry, 1310M Wired, 289M Free
ARC: 542M Total, 62M MFU, 283M MRU, 2235K Anon, 5272K Header, 190M Other
     168M Compressed, 278M Uncompressed, 1.66:1 Ratio
Swap: 8192M Total, 1523M Used, 6669M Free, 18% Inuse

### The outage is gone.  Five minutes interval resumed.

Sun Nov 29 19:55:00 EST 2020

### Same ifconfig information

procs     memory        page                    disks     faults         cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 mm0   in    sy    cs us sy id
1 0 5 543270828  231712    80   3   0   0   123  382   0   0  205  1100  1930  1  5 94

last pid: 39304;  load averages:  2.22, 16.57, 26.78  up 2+05:14:16    19:55:00
64 processes:  1 running, 61 sleeping, 2 zombie
CPU:  0.1% user,  0.7% nice,  1.4% system,  3.5% interrupt, 94.3% idle
Mem: 1017M Active, 69M Inact, 334M Laundry, 2083M Wired, 226M Free
ARC: 1132M Total, 419M MFU, 277M MRU, 211M Anon, 10M Header, 214M Other
     510M Compressed, 903M Uncompressed, 1.77:1 Ratio
Swap: 8192M Total, 717M Used, 7475M Free, 8% Inuse

I am not sure how much value these logs can provide. To me, they seem quite “normal.” The only abnormal thing is the prolonged intervals. It feels like the system entered into a “slow motion” status and then resumed.

What can I do to investigate further? Thanks.

goshanecr · Dec 3, 2020

wxppro

I can suggest that problem related to RAM usage. Look, your devices stop responding at 19:30, and starts againg at 19:50. Your info shows that in that interval you have increased swap file usage, and after it less again.
Since your server has not much RAM, you can set zfs ram usage to something like 1024M in /boot/loader.conf

Code:

vm.kmem_size=1024M
vm.kmem_size_max=1024M

also what services on FreeBSD machine can use lot of RAM?

So I think wher system got into swap usage - it start be inresponsible.

wxppro · Dec 3, 2020

goshanecr. Thanks much. I will study these parameters and then test it out.

I am of the opinion that this is not so much a networking problem. So I had another post in the General forum: https://forums.freebsd.org/threads/swap-problem-memory-management-problem.77903/

From the discussion there, it seems that Syncthing is the main suspect. I will disable it to see how it goes. Then I will try some memory management parameters to see if the issue can be contained.

wxppro · Dec 16, 2020

I am quite certain that this is not a networking issue. By setting the parameter vfs.zfs.arc_max to 1G, the network connectivity outage does not happen again.

Please refer to the other thread: https://forums.freebsd.org/threads/swap-problem-memory-management-problem.77903/

diizzy · Dec 16, 2020

Periodic script runs at 3 AM (scans storage and populates the locate db).

wxppro · Dec 16, 2020

diizzy said:
Periodic script runs at 3 AM (scans storage and populates the locate db).

Can you be a little bit more specific on what to do? Thanks.

diizzy · Dec 17, 2020

There's not much you can do about that since it's a system service (well, I guess could disable it in theory but that's not your main issue) but your overall load is all over the place

"load averages: 31.00, 41.30, 37.09"
Swap: 8192M Total, 1228M Used, 6964M Free, 14% Inuse
This isn't great, kill your VMs, Transmission and monitor using top (and ssh) what eats memory.
My bet is that Syncthing eats a lot of memory, see: https://github.com/syncthing/syncthing/issues/4976

wxppro · Dec 18, 2020

I am 99.9999% certain Syncthing is the one causing problem on my home server. Per my reading, it uses kqueue to maintain a list of open files. When there are too many files opened, the system will be dragged down to crawling.

The problem is that Syncthing does not cause the same problem on Ubuntu. I installed Ubuntu on this very server before. Never had this problem. So, resource wise, this small server is sufficient. But Syncthing on FreeBSD requires a lot more resources, or I should say that the way Syncthing uses resources on FreeBSD is not that great?

rootbert · Dec 18, 2020

syncthing probably uses kqueue on FreeBSD which unfortunately does not scale well with many files

FreeBSD 12.2 - Mysterious offline problem

wxppro​

wxppro