UFS Mysterious "Out of inodes" in NanoBSD

Hello, everybody.

I have a strange problem with one of my servers. Some facts:
  • Prepared via slightly modified NanoBSD from 11.2-STABLE #0 r342216
  • Runs from an 8GB pendrive, UEFI boot, 4 GPT partitions:
    • #1 is UEFI
    • #2 is /cfg (r/o)
    • #3 is / (r/o)
    • #4 is /data, (r/w, created via makefs -t ffs -B little -o optimization=space ... ).
  • Runs (among others):
    • mosquitto with persistent storage on /data (the only program writing to it), pushing data to an external server.
    • telegraf pulling system stats and logs from the system and pushing them to the above-mentioned mosquitto (which means I have logs even when the disk is not accessible).
During normal operation, df -hi reports /data having 8 inodes used out of 3.1k (spiking to 9 when mosquitto compacts database by copying to a new file - this is confirmed by Telegraf graph of inode usage: it is always flat 8 with tiny spikes on compaction).
However, from time to time /data goes haywire. The logs are flooded with <3>pid 657 (mosquitto), uid 885 inumber 4 on /data: out of inodes (and I can't log in via SSH, which is strange, as login does not use any files or dirs on /data). The inode usage log does not indicate a lack of inodes; it stays flat at 8 (no spikes on rotation, as that no longer happens). There's no sudden rise in processed entries, no increase in used space, nothing suspicious. All applications continue to operate normally (unless they need to write something, as login or DB compaction do; luckily the only application I really care about runs entirely in memory with no fs accesses). So far, only 1 out of 6 machines exhibits this problem, and the problematic one has had its pendrive replaced to rule out physical damage to the media.

It feels as if all partitions (including all tmpfs mounts) suddenly turned read-only for no reason (and without any trace in syslog, for that matter).

TL;DR:
mosquitto reports out of inodes on a pendrive when that's clearly not the case.

Where to go next?
 
Oh, I do. This is effective command used by nanobsd.conf to make my /data. Creation of /data is one of a few changes I made to nanobsd.conf.

Code:
NANO_DATADIR=$(realpath $(pwd)/Data)

# ...

NANO_ENDIAN=little
NANO_MAKEFS_UFS="makefs -t ffs  -B ${NANO_ENDIAN} "

#...

NANO_SLICE_DATA_SIZE=5g
NANO_SLICE_DATA=p4

# ...

# Fall back to an empty directory when no data dir was provided
if [ ! -d "${NANO_DATADIR}" ]; then
    echo "Faking data dir, it's empty"
    NANO_DATADIR=${NANO_LOG}/_.empty
    mkdir -p "${NANO_DATADIR}"
fi

# Pass -s <size> only when NANO_SLICE_DATA_SIZE is set
sz=${NANO_SLICE_DATA_SIZE:+-s ${NANO_SLICE_DATA_SIZE}}
eval "${NANO_MAKEFS_UFS}" -o optimization=space $sz \
     "${NANO_LOG}/_.${NANO_SLICE_DATA}" "${NANO_DATADIR}"
 
Where to go next?
Basically there are several explanations:
  • You really run out of inodes for a short period of time. telegraf probably samples the inode usage at certain time intervals only, so this might remain undetected.
  • Something is wrong with your file system. Have you tried running fsck(8) on it?
  • You do not run out of inodes, which means that the error message must be caused by a bug.
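To check the first explanation, inode usage could be sampled much more often than a metrics agent usually does. The following is only a rough sketch: it assumes FreeBSD's `df -i` column layout (iused and ifree in columns 6 and 7), and the function names `parse_inodes` and `watch_inodes` as well as the `inodewatch` syslog tag are made up for illustration.

```shell
#!/bin/sh
# Sketch: sample inode usage of one mount point at short intervals,
# to catch spikes that a slower metrics agent might miss.
# Assumes FreeBSD's `df -i` column layout:
#   Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on

# Read `df -i` output on stdin and print "iused ifree" for the first data row.
parse_inodes() {
    awk 'NR == 2 { print $6, $7 }'
}

# Poll MOUNT every INTERVAL seconds; log to syslog whenever the numbers change.
watch_inodes() {
    mount=${1:-/data}
    interval=${2:-1}
    last=""
    while :; do
        now=$(df -i "$mount" | parse_inodes)
        if [ "$now" != "$last" ]; then
            logger -t inodewatch "$mount iused/ifree: $now"
            last=$now
        fi
        sleep "$interval"
    done
}

# Usage (hypothetical): watch_inodes /data 1 &
```

Because it only logs on change, a flat line at 8 produces almost no output, while even a brief jump would leave a record in syslog.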
If I were in that situation, I think I would try to find out exactly what the mosquitto process is doing at the time the problem occurs. One way to do that would be to use ktrace(1) or truss(1) on the process, or maybe even dtrace(1). However, that might be difficult on a small system that runs from a USB stick. Also, I have to admit that I don't know the mosquitto software or how it works.

By the way, is there a reason that you specify -B little? I'm not sure if UFS even supports a different byte order, but I think it would be advisable to stick to the default anyway. Also, -o optimization=space is rather useless because the kernel automatically switches between time and space optimization when the allocation crosses the „minfree“ limit (see the tunefs(8) manual page for more details). However, neither of these should be responsible for the problem that you described.
 
is there a reason that you specify -B little?

I don't. It is a default part of the standard nanobsd.conf.
If I were in that situation, I think I would try find out what the mosquitto process is doing exactly at the time when the problem occurs.
... and I can't login via SSH, which is strange ...
I'm unable to do anything "diagnosy" when the situation appears. I can't log in locally for policy reasons; the only way in is SSH, which ceases to work along with mosquitto. That, in turn, suggests to me that the problem does not lie in mosquitto itself. It is a mere victim of the same villain that cripples ssh. And it's not that sshd dies, it just won't let me log in (OTOH ssh to a non-shell service works, but that doesn't access any files on start).
 
I'm unable to do anything "diagnosy" when situation appears. I can't login locally for policy reasons, the only way is SSH - which ceases to work along with mosquitto.
Yes, I understood that. What I meant is to prepare tracing in advance, for example by starting mosquitto with ktrace. However, the problem with this approach is that the trace file can grow rather large very quickly, and tracing also has a negative impact on performance. That's why I wrote that this might not be feasible with a small system that runs on a USB stick.
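A minimal sketch of what "prepare tracing in advance" could look like, with a simple size check so the trace file does not fill a small disk. The paths, the variable names and both function names are assumptions for illustration; `-t cn` limits tracing to system calls and path name lookups, and `-i` makes child processes inherit the tracing.

```shell
#!/bin/sh
# Hypothetical paths; adjust to the actual system.
TRACE_FILE=${TRACE_FILE:-/var/tmp/mosquitto.ktrace}
MOSQUITTO_CONF=${MOSQUITTO_CONF:-/usr/local/etc/mosquitto/mosquitto.conf}

# Start mosquitto under ktrace, recording only system calls (c) and
# path name lookups (n) to keep the trace smaller.
start_traced_mosquitto() {
    ktrace -i -f "$TRACE_FILE" -t cn mosquitto -c "$MOSQUITTO_CONF"
}

# Succeed if FILE exists and is larger than LIMIT bytes.
trace_too_big() {
    file=$1
    limit=$2
    [ -f "$file" ] || return 1
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit" ]
}

# Possible use from a periodic job (100 MB cap is an arbitrary example):
#   trace_too_big "$TRACE_FILE" 104857600 && ktrace -C
# Afterwards, read the tail of the trace with:
#   kdump -f "$TRACE_FILE" | tail -n 200
```

`ktrace -C` switches tracing off for all processes the caller may trace, so it is a blunt stop button; restarting the traced daemon would be needed to resume.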

Which, in turn, gives me suggestion the problem does not lie in mosquitto itself. It is a mere victim of the same villain that cripples ssh. And it's not that sshd dies, it just won't let me login (OTOH ssh to non-shell service works, but it doesn't access any files on start).
You can try running ssh with maximum verbosity (-vvv). It then prints very detailed information about what it is doing, including executing the shell (if it gets that far). I suspect the problem is with your shell, or maybe with one of the profile or startup scripts of your shell. Maybe it tries to create a file for some reason (e.g. for the shell history), which fails if the system is out of inodes.
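For example, the debug output could be saved to a file and the last few lines inspected after a failed login, since those usually show how far the session got. This is just a sketch; the function names and the default log file name are made up, and it relies on OpenSSH prefixing its debug lines with debug1:, debug2: or debug3:.

```shell
#!/bin/sh
# Run ssh with maximum verbosity against TARGET (e.g. user@host) and keep
# the debug output, which ssh writes to stderr, in LOGFILE.
capture_ssh_debug() {
    ssh -vvv "$1" 2> "${2:-ssh-debug.log}"
}

# Print the last N debug lines of a captured log (default 20).
last_debug_lines() {
    grep -E '^debug[123]:' "$1" | tail -n "${2:-20}"
}

# Usage (hypothetical host name):
#   capture_ssh_debug admin@nanobox
#   last_debug_lines ssh-debug.log 10
```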

Depending on what shell you use, you might also try to run it with debugging options. Most Bourne-style shells support -v and -x (including FreeBSD's /bin/sh, zsh, and bash). The former option causes all commands to be echoed before expansion (basically as they are read from scripts); the latter causes the commands to be printed after expansion, right before they're executed. These are very useful for debugging shell scripts and shell behavior in general.
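A small sketch of the -x option in action; `run_traced` is a made-up helper name. It runs a command string under sh with tracing enabled and merges the trace (written to stderr, each line prefixed with "+") into normal output so it can be logged or piped.

```shell
#!/bin/sh
# Run a command string under sh with -x and merge the trace into stdout.
run_traced() {
    sh -x -c "$1" 2>&1
}

# Usage on the affected account (assumes a Bourne-style ~/.profile):
#   run_traced '. "$HOME/.profile"'
```

If a startup script tries to create a file, the failing command should be the last traced line before the error message.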
 