FreeBSD 12.2 - Mysterious offline problem

wxppro

Hello fellow FreeBSD users and experts,

I have been using FreeBSD for some time now. I have some background in Unix and I found it very familiar in many ways. Compared to Linux, I like how straightforward FreeBSD is.

I am using FreeBSD 12.2 for a home server and I ran into a strange problem. The server goes offline (loses network connectivity) for some time and then resumes. I suspect it is related to memory management, but the log shows nothing for the period when the system is offline. So my question is: what debugging/logging can I enable so that I can see what happened?

This is an Azulle Byte 3 mini computer with 4GB memory. I have the FreeBSD OS installed on its built-in 32GB eMMC disk. I have another mechanical hard drive attached to the system. The hard drive is in one ZFS pool. On the system, I have Samba and Syncthing installed. I created two virtual machines using Bhyve to run Pi-Hole. I also have a jail to run Transmission. In terms of networking, I created a virtual bridge (bridge0) that includes the main NIC (re0), two tap interfaces for the virtual machines, and an epair interface for the jail.

The system works fine almost all the time. It is not a high-stress system. But occasionally, about once or twice a week, it becomes unresponsive for 15 to 30 minutes. During this time, ping returns host down. Then it comes back online as if nothing had happened. I checked the system log (/var/log/messages) and found nothing. The syncthing log shows that it failed to connect to other servers, indicating that network connectivity was lost for some time. I also checked the system logs in the virtual machines, which are Debian systems. They also look as if nothing ever happened.

I read somewhere that if FreeBSD runs short of memory, it will try to page memory out to disk. Processes may hang after trying to access the disk for more than 20 seconds, and may resume once the paging is finally done. Lowering the vfs.zfs.arc_max parameter from its default (75% of available memory) may help, so I changed it to 2147483648 (2GB). With this change, the lost connectivity seems to happen less frequently. But it still happens.

I am not sure if it is the reason. But my challenge is that I need to get some information first. How can I obtain more information about this offline period? There is no network access, and attaching a monitor only gives a blank screen.

Thanks much!
 

ralphbsz

Down for 15 minutes or half an hour? Explaining that with ARC, memory pressure, or paging IO is virtually impossible. Look at it this way: in half an hour, the mechanical disk can do 180,000 IOs, the SSD many times more. Paging all of the 4GiB of memory to the hard disk takes 40 seconds (much less on the root disk), but you are hanging 20 to 50 times longer.
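
The arithmetic behind those figures can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: the 100 IOs per second and 100 MiB/s are assumed round numbers for a typical mechanical disk, not measurements from this machine.

```python
# Assumed figures for a typical mechanical disk (not measured here).
ASSUMED_IOPS = 100                 # random IOs per second
ASSUMED_THROUGHPUT_MIB_S = 100     # sequential MiB per second
RAM_MIB = 4 * 1024                 # 4 GiB of memory


def ios_in(seconds):
    """Number of IOs the disk could complete in the given time."""
    return ASSUMED_IOPS * seconds


def seconds_to_page_out(ram_mib):
    """Time to write all of memory to the disk at the assumed throughput."""
    return ram_mib / ASSUMED_THROUGHPUT_MIB_S


page_out_s = seconds_to_page_out(RAM_MIB)
print("IOs in half an hour:", ios_in(30 * 60))
print("seconds to page out 4 GiB:", page_out_s)
print("15 min outage / page-out time:", 15 * 60 / page_out_s)
print("30 min outage / page-out time:", 30 * 60 / page_out_s)
```

With these assumptions a 15 to 30 minute outage is tens of times longer than writing out all of memory would take, which is the point being made.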

Also, I'm not sure I believe the correlation with adjusting arc_max. I'm not saying that you're wrong, I'm just saying that to actually believe that, you would have to present some measurements (like rate of hangs/day before and after).

Here is my suggestion: You said you have a monitor attached. Get that to work, so you can log in on it. Blank screen might be as easy to fix as power-cycling the monitor once (works for mine, for some reason FreeBSD booting confuses my monitor, which admittedly was bought for $99 20 years ago). Log in, start something like top on it, or perhaps (if you can have multiple windows, for example with screen) do vmstat in one, iostat in another, and gstat in a third (there is a lot of overlap), and perhaps even have a few pings going (ping an internal address, like your thermostat or weather station or whatever device is always online, and ping 8.8.8.8 or 1.1.1.1). Then, when the outage happens, walk over to the monitor and report what you see.

If no monitor: write a little script that gathers the output from utilities such as vmstat and friends (see above), and logs it into a file. Since your outages are long enough to be noticeable, you don't need much, just one sample every 15 seconds or so. Then see whether that keeps going during an outage, and whether you see any activity.

Final super-simple suggestion: when the outage happens, look at the disk activity light, the ethernet activity light, and (if you have an ethernet hub with lights) the traffic on the network.
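
Such a script could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the command list, the function names, and the idea of passing in a log path are not from this thread. The only property relied on is that vmstat and iostat print one report and exit when run without an interval.

```python
import os
import subprocess
import time
from datetime import datetime

# Assumed command set; extend as needed. Run without an interval argument,
# each of these prints a single report and exits.
COMMANDS = [["vmstat"], ["iostat"]]


def run_command(argv):
    """Run one command and return its output, or a note if it could not run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(could not run: {exc})\n"


def snapshot(commands=COMMANDS, runner=run_command):
    """Build one timestamped block of text with the output of every command."""
    lines = [f"=== {datetime.now().isoformat(timespec='seconds')} ==="]
    for argv in commands:
        lines.append("$ " + " ".join(argv))
        lines.append(runner(argv).rstrip())
    return "\n".join(lines) + "\n"


def log_snapshots(path, interval_s=15, count=None):
    """Append a snapshot to `path` every `interval_s` seconds.

    count=None keeps going until the process is interrupted.
    """
    done = 0
    while count is None or done < count:
        with open(path, "a") as log:
            log.write(snapshot())
            log.flush()
            os.fsync(log.fileno())  # push the data to disk before a possible hang
        done += 1
        if count is None or done < count:
            time.sleep(interval_s)
```

Calling log_snapshots with a path of your choice starts the loop. A gap in the timestamps afterwards would show that the whole machine stalled, while continuous timestamps would point at the network only.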
 

Argentum

wxppro said: "I am using FreeBSD 12.2 for a home server and I ran into a strange problem. The server will go offline (lost network connectivity) for some time, and then resume. I suspect it is related to memory management. But during the time period that the system goes offline, the log does not show anything. My question hence is, what debugging/logging can I enable so that I can see what happened?"

If there is nothing in syslog, try to get something in. In such a case I would put something like ifconfig|logger into crontab. Just record the state of your network interface every minute and see what it looks like.
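
For anyone who would rather run this from a script than as a shell pipeline, a rough Python equivalent is sketched below. The interface name re0 is taken from the first post; the "ifstate" prefix and the function names are made up for illustration. It uses Python's standard syslog module instead of logger.

```python
import subprocess
import syslog


def ifconfig_lines(interface="re0", runner=None):
    """Return the non-empty lines of `ifconfig <interface>` output."""
    if runner is None:
        def runner(argv):
            return subprocess.run(argv, capture_output=True, text=True).stdout
    output = runner(["ifconfig", interface])
    return [line.strip() for line in output.splitlines() if line.strip()]


def log_interface(interface="re0", runner=None, sink=syslog.syslog):
    """Send each line of the interface state to syslog (or another sink)."""
    for line in ifconfig_lines(interface, runner):
        sink(f"ifstate {interface}: {line}")
```

Run once a minute from cron, a call to log_interface() leaves a trail in the system log that shows whether the link state or flags changed around the time of an outage.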
 

goshanecr

Maybe the problem is in if_re; it has very annoying bugs like PR 166724. So first of all I would try solutions like:
- another NIC
- tuning the buffer sysctls described in that PR.

For me it helps.
 
wxppro

Thanks much, ralphbsz. I know the long hang time does not make sense if paging is done correctly. I do have a reason to look at memory management, though: the problem became less severe once an M.2 SSD was inserted and the swap partition was moved to it. Previously, the swap partition was on the much slower eMMC disk. The machine would not resume at all (after 2 to 3 hours), or maybe I did not give it enough time to resume. I remember seeing “out of swap space” error messages after rebooting. The two bhyve virtual machines were killed. Later I inserted an M.2 SSD and moved the swap partition (I made it 8GB) to the SSD. My observation is that the problem still happens, but seemingly less frequently. More importantly, the system can return to normal and the two virtual machines are not killed. That leads me to suspect that a faster SSD makes the problem less severe. Then I read that a process may hang if it does not succeed in accessing the disk for 20 seconds (https://www.freebsd.org/doc/en/books/faq/troubleshoot.html#idp59131080). If there is a glitch in the paging process, a reasonable hypothesis is that the system returns to normal once paging is finally done. It feels like the system just freezes for some time, and there is no error in the log.

I must admit that there is a bias in this reasoning, kind of like being fixed on a hypothesis and looking for proof. You are completely right that I need to collect more information, otherwise it will be a shot in the dark. I have implemented a monitoring mechanism on another machine on the LAN. It pings the server and, if the server is not reachable, sends an email to me. Hopefully I can then use the monitor during the downtime (yes, it works as long as it is not attached after booting). The challenge is that, so far, the downtime has happened at 2 or 3AM… I guess I will try to set up a cron job to do the data collection for me (thanks to Argentum!). Hopefully it will catch some useful information.
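
For reference, the peer-side check is along these lines. This is a simplified sketch, not the exact script: the function names are illustrative, and the `notify` callback stands in for whatever sends the email. It reports only state changes, so the messages give the start and end time of each outage.

```python
import subprocess
from datetime import datetime


def host_is_up(host, runner=None):
    """True if a single ping to `host` succeeds (exit status 0)."""
    if runner is None:
        def runner(argv):
            return subprocess.run(
                argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            ).returncode
    return runner(["ping", "-c", "1", host]) == 0


def check_once(host, was_up, notify, runner=None, now=datetime.now):
    """Ping once and call `notify` only when the state changes.

    Returns the new state so the caller can pass it to the next check.
    """
    is_up = host_is_up(host, runner)
    if is_up != was_up:
        state = "UP" if is_up else "DOWN"
        notify(f"{now().isoformat(timespec='seconds')} {host} is {state}")
    return is_up
```

Called once a minute with the previous result passed back in as `was_up`, this yields one message when the server drops off and one when it comes back.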

This server is not a heavy duty one. I cannot think of a trigger at 2 or 3AM. Maybe syncthing is trying to scan the disk. It is probably the only CPU/memory intensive app. Maybe it does not work well with ZFS? So somehow it has stressed the system?

And thanks to goshanecr: I am actually aware of the Realtek NIC problem (watchdog timeout). I have replaced the built-in re0 driver with one compiled from Realtek’s latest driver. But I will read the bug report to see if the Realtek NIC is the culprit.

FYI - dmesg output regarding re0:
Code:
re0: <Realtek PCIe GbE Family Controller> port 0xe000-0xe0ff mem 0x81204000-0x81204fff,0x81200000-0x81203fff irq 22 at device 0.0 on pci1
re0: Using Memory Mapping!
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: version:1.96.04
re0: Ethernet address: xx:xx:xx:xx:xx:xx
re0: Ethernet address: xx:xx:xx:xx:xx:xx
re0: link state changed to UP
re0: promiscuous mode enabled


Many thanks to you all!
 