ZFS sshfs of remote storage to vnet jail with nullfs to child jail

I am trying to figure out how to better handle a dropped sshfs connection. I need to do some testing but am not sure what all I should test.

The setup is:
  • a vnet jail, with a child jail
  • the vnet jail establishes an sshfs connection to remote storage
  • the vnet jail then uses nullfs (rw, mountpoint must be empty) to mount a sub-directory from the sshfs mount to a child jail mountpoint
What we experienced was a dropped sshfs connection that stayed down for a while (it was not intermittent). I had the reconnect option in place but not ServerAliveInterval, which I am planning to add.
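
For reference, a rough sketch of the mount invocation I have in mind, written as a small Python helper so the option string is easy to check. The remote, the mountpoint and the interval values (15 seconds, 3 missed replies) are placeholders I made up for illustration, not values from our setup. reconnect is an sshfs option; ServerAliveInterval and ServerAliveCountMax are ssh_config options passed through with -o.

```python
import shlex


def build_sshfs_command(remote, mountpoint, alive_interval=15, alive_count_max=3):
    """Build an sshfs argument list with reconnect and keepalive options.

    All argument values are hypothetical; tune them for the real link.
    """
    options = [
        "reconnect",                               # sshfs: re-establish a lost connection
        f"ServerAliveInterval={alive_interval}",   # ssh: seconds between keepalive probes
        f"ServerAliveCountMax={alive_count_max}",  # ssh: unanswered probes before giving up
    ]
    return ["sshfs", remote, mountpoint, "-o", ",".join(options)]


if __name__ == "__main__":
    # Placeholder remote and mountpoint, printed as a shell command line.
    cmd = build_sshfs_command("storage@backup.example.org:/srv/data", "/mnt/remote")
    print(shlex.join(cmd))
```

With these placeholder values, ssh should declare an unresponsive server dead after roughly 45 seconds (interval times count), which is what should give reconnect something to act on instead of waiting indefinitely.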

The sshfs(1) man page says that apps attempting to use files from the sshfs mount will appear frozen, and this is exactly what I saw. The web app was no longer up and we got 502 responses. But it seemed like more than that to me. I could access the child jail without any trouble, but I could not run a simple ls to see what was in a directory that was not part of the sshfs->nullfs mount. Any filesystem-related command I tried would hang, and attempts to cancel those commands mostly got no response. But I could run htop.

Though it was different from a ZFS disk-full event, it felt very similar to one. Services were still running (NGINX, MySQL, etc.), so the problem appeared to be filesystem-related.

When using nullfs on top of an sshfs mount, can a locked-up sshfs potentially freeze (even indirectly) all nullfs operations in the jail, including nullfs mounts that are not related to the sshfs mount?

I believe I need to modify our child jail web app to handle a missing filesystem better. My test list so far:
  • test web app for missing filesystem handling
  • test ServerAliveInterval
  • ? sshfs-based nullfs mount - does it need to be remounted after reconnection?
I guess I am wondering if there is something I can do to make this workable should it happen again, or is this the nature of the beast due to the setup as it is?

My objective would be:
  1. handle the missing filesystem - the web app should run without it
  2. do something so the filesystem in the child jail is not unresponsive in this scenario - I have no idea about the root cause of this unresponsive behavior
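
For objective 1, one idea I am considering is to have the web app probe the mount from a throwaway child process with a timeout, so that a hung sshfs blocks the child and not the app itself. This is only a sketch under my own assumptions: the function name, the 3 second timeout and the path in the usage lines are hypothetical, and I have not verified that a stat of the mountpoint always reaches the sshfs backend rather than being answered from a cache.

```python
import subprocess
import sys


def mount_is_responsive(path, timeout_s=3.0):
    """Return True if `path` can be stat'ed within `timeout_s` seconds.

    The stat runs in a separate Python process, so if the filesystem hangs
    only that child blocks; the caller gets False once the timeout expires.
    """
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the stat succeeded; non-zero means it raised.
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck inside the kernel may ignore the
        # signal, so we deliberately do not wait for it to be reaped.
        probe.kill()
        return False


if __name__ == "__main__":
    # Hypothetical mountpoint inside the child jail.
    if mount_is_responsive("/mnt/remote-data"):
        print("storage available")
    else:
        print("storage unavailable, serving without it")
```

The app could call this before touching the mount and fall back to a degraded mode when it returns False. It does nothing for objective 2, since it only avoids entering the hung mount in the first place.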

I'm interested in learning more about this but believe I am missing insight on something. Thanks!
 