FreeBSD 13.1 : ZFS NFS : .zfs/snapshot : Stale file handle

miconof · Sep 5, 2022

Hi, since upgrading to FreeBSD 13.1-RELEASE I can no longer access the .zfs/snapshot directory over NFS.

On an Ubuntu or Debian client, when I try to access .zfs/snapshot I get: Stale file handle

Code:

medic:/home/user1 on /home/user1 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.80,mountvers=3,mountport=850,mountproto=udp,local_lock=none,addr=192.168.0.80)

I have 2 servers, one on 13.0-p7 and the other on 13.1-p2.
Several disk bays, all multi-attached;
each bay is connected to both servers.

I have used this setup since 12.0-RELEASE, and before 13.1 snapshot access worked fine.

I use CARP to distribute the load over my two servers; in case of trouble or when an upgrade is needed, I can import all my pools on one server and then upgrade the other.

So I have several IPs for this data service, in fact one per exported pool.

I only get the stale file handle on .zfs/snapshot over NFS on the 13.1 server; if I import my pool on the 13.0 server it works as normal.

Locally (on FreeBSD) I can list the snapshots normally on both servers.
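To compare the two servers from a client, a small check along these lines may help. It is only a sketch: the function name check_listing is hypothetical, and it assumes the client's ls prints an error message containing "Stale" and "file handle" (the wording differs slightly between Linux and FreeBSD).

```shell
#!/bin/sh
# Hypothetical helper: classify the result of listing a directory.
# Prints "ok", "stale", or "error" on stdout.
check_listing() {
    dir=$1
    # Capture stderr only; the listing itself is discarded.
    if err=$(LC_ALL=C ls -al "$dir" 2>&1 >/dev/null); then
        echo "ok"
    else
        case $err in
            # Matches "Stale file handle" and "Stale NFS file handle".
            *Stale*"file handle"*) echo "stale" ;;
            *) echo "error" ;;
        esac
    fi
}

# Example on the NFS client (mount point taken from the mount output above):
# check_listing /home/user1/.zfs/snapshot
```

Running it once with the pool imported on the 13.0 server and once on the 13.1 server should show whether only the latter reports "stale".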

I have to upgrade both servers to 13.1 because with 13.0 I was facing another problem which is solved in 13.1.

As I said, the NFS setup is based on a CARP IP:

Code:

lagg1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
[...]
        inet 192.168.0.80 netmask 0xffffff00 broadcast 192.168.0.255 vhid 80
[...]
        laggproto lacp lagghash l2,l3,l4
        laggport: bnxt0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bnxt1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        carp: MASTER vhid 80 advbase 1 advskew 100
[...]
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

My NFS config :

Code:

rpcbind_enable="YES"
nfs_server_enable="YES"
nfs_server_flags="-u -t -h 192.168.0.80 -h 192.168.0.81 -h 192.168.0.82 -h 192.168.0.83 --minthreads 12 --maxthreads 24"
mountd_enable="YES"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"

My sharenfs setup on the pool/vol :

Code:

# zfs get sharenfs tank/home/user1
NAME              PROPERTY  VALUE                                  SOURCE
tank/home/user1  sharenfs  -network 192.168.0.0 -mask 255.255.255.0  local

It seems the same problem exists with TrueNAS 13, see here: Forum TrueNAS - "Stale file handle" when list snapshots (.zfs)

If anyone can help.
Thanks.

miconof · Sep 5, 2022

Another issue, which also appears on TrueNAS:
Deleting a snapshot on which a simple "ls" via NFS has been attempted blocks completely and leaves the zfs destroy process in an unkillable I/O wait state.
On TrueNAS it seems that in this case the whole system becomes unstable or even totally unusable...

skeletor · Sep 12, 2022

Have you tried

Code:

zfs set snapdir=hidden tank/home/user1

?

mer · Sep 12, 2022

So the snapshot is exported via NFS, someone is actively "in it" (doing a ls) and then on the server side you are trying to do a zfs destroy of that snapshot?

ivosevb · Sep 27, 2022

We have the same issue, it's easy to reproduce every time on our backup storage.
On the 13.1 server:

Code:

zfs create tank0/exports/test
zfs snap tank0/exports/test@now

On a Linux or FreeBSD client:

Code:

mount -o vers=3 zfs1:/exports/test /mnt
ls -al /mnt/.zfs/snapshot
ls -al /mnt/.zfs/snapshot/now

The first ls command finishes OK, the second returns "Stale file handle". Then on the server:

Code:

zfs destroy tank0/exports/test@now

and the command hangs forever. If we open another ssh session to the server and run reboot, the console shows that the reboot process started but never finished. We have to manually power cycle the server.
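To spot such a hung destroy before trying to reboot, something like the following could be used. This is a sketch under assumptions: the function name find_stuck_destroy is hypothetical, and it assumes the hung process shows up in ps with a state starting with D (uninterruptible wait).

```shell
#!/bin/sh
# Hypothetical helper: read "pid state command..." lines on stdin and print
# the PIDs of zfs destroy processes whose state starts with D.
find_stuck_destroy() {
    awk '$2 ~ /^D/ && $3 ~ /(^|\/)zfs$/ && $4 == "destroy" { print $1 }'
}

# Example on the server (header-less ps output piped into the filter):
# ps -ax -o pid= -o state= -o command= | find_stuck_destroy
```

The filter only reports PIDs; it does not try to kill anything, since a process in that state does not respond to signals anyway.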

Initially our production storage hung completely after we tried to look at yesterday's database dump from a snapshot on the NFS client while the routine rolling ZFS snapshots (cron with zfsnap) ran on the server.

blanchet · Sep 27, 2022

I confirm that the same issue happens with TrueNAS Core 13.0u2 (based on FreeBSD 13.1).
Users cannot access the ZFS snapshots through NFS, so it prevents restoring files in self-service mode.

There is no workaround yet (except downgrading to FreeBSD 12.3).

A ticket already exists in the Bugzilla database:

266236 – ZFS NFS : .zfs/snapshot : Stale file handle

bugs.freebsd.org

miconof · Sep 29, 2022

still no news ...

miconof · Sep 29, 2022

skeletor said:
Have you tried

Code:

zfs set snapdir=hidden tank/home/user1

?

I don't want to hide them! It's a feature!

mer said:
So the snapshot is exported via NFS, someone is actively "in it" (doing a ls) and then on the server side you are trying to do a zfs destroy of that snapshot?

exactly

miconof · Sep 29, 2022

But when you say actively "in it": in fact, because of the "Stale file handle", the user is not really in it!

ivosevb · Oct 7, 2022

There is a resolution to the problem (see bug 266236, linked above). After we applied the patch from Mark Johnston, everything is working normally now.