What happens when an NFS server reboots

tuaris · May 17, 2015

I have an NFS server with a ZFS backend (NAS4Free) and a client machine (9.3-RELEASE). The client machine runs an application that is constantly reading and writing data on the mounted NFS export.

The NFS server is rebooted (power/hardware failure, kernel panic, or human error) without properly stopping the application on the client machine. As expected, processes trying to access the mount 'hang' until the server comes back.

In my specific example the application is a Bitcoin wallet. It keeps a log and databases on the NFS mount. When the NFS server reboots, the wallet hangs, then resumes once the server is back. Good.

The problem is that the log stops updating after the server is back online. The databases appear to continue reading and writing normally. Why? And more importantly, are the databases being corrupted?

According to fstat and procstat:

Code:

fstat /mnt/blockchains/Bitcoin/debug.log
USER  CMD  PID  FD MOUNT  INUM MODE  SZ|DV R/W NAME
root  bitcoind 23416  4 /mnt/blockchains  2126 -rw-------  53316008  w  /mnt/blockchains/Bitcoin/debug.log

Code:

procstat -f 23416
  PID COMM  FD T V FLAGS  REF  OFFSET PRO NAME
23416 Bitcoind  text v r r--------  -  - -  /usr/local/bin/bitcoind
23416 Bitcoind  cwd v d r--------  -  - -  /mnt/blockchains/Bitcoin
23416 Bitcoind  root v d r--------  -  - -  /
23416 Bitcoind  0 v x rw-------  33  31 -  -
23416 Bitcoind  1 v x rw-------  33  31 -  -
23416 Bitcoind  2 v x rw-------  33  31 -  -
23416 Bitcoind  3 v r rw-------  1  0 -  -
23416 Bitcoind  4 v r -wa------  1 53316062 -  /mnt/blockchains/Bitcoin/debug.log
23416 Bitcoind  5 v r -wa------  1  0 -  -
23416 Bitcoind  6 s - rw---n---  1  0 TCP ::.22346 ::.0
23416 Bitcoind  7 s - rw---n---  1  0 TCP 0.0.0.0:22346 0.0.0.0:0
23416 Bitcoind  8 v r -w-------  1  6569 -  -
23416 Bitcoind  9 v r rw-------  1  0 -  -
23416 Bitcoind  10 s - rw---n---  1  0 TCP 10.3.3.10:37200 x.x.x.x:22346
23416 Bitcoind  11 v r rw-------  1  0 -  -
23416 Bitcoind  12 s - rw---n---  1  0 TCP 10.3.3.10:41444 x.x.x.x:22346
23416 Bitcoind  13 v r rw-------  1 10485760 -  /mnt/blockchains/Bitcoin/database/log.0000000009
23416 Bitcoind  14 k - rw-------  2  0 -  -
23416 Bitcoind  15 p - rw---n---  1  0 -  -
23416 Bitcoind  16 p - rw---n---  1  0 -  -
23416 Bitcoind  17 s - rw---n---  2  0 TCP ::.22345 ::.0
23416 Bitcoind  18 s - rw---n---  1  0 TCP 10.3.3.10:50601 x.x.x.x:22346
23416 Bitcoind  20 v r rw-------  1  0 -  /mnt/blockchains/Bitcoin/wallet.dat
23416 Bitcoind  22 v r rw-------  1  0 -  -

Matthew Dresden · May 19, 2015

Not sure if this would be applicable to your application, but it may depend on how you have mounted the NFS mount.

If you created a directory and then mounted NFS on that directory, the application probably keeps writing to that local path when the NFS connection dies.

I have seen many servers fill up their local disks with logs for this reason.

As a best practice at my company we mount all NFS with autofs(5).

This creates a local link that is not writable while NFS is down.
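
For setups that still use a plain directory as the mount point, a guard like the following could refuse to start an application when nothing is mounted there. This is only a minimal sketch: `is_mount_point` is a hypothetical helper name, it relies on the POSIX `df -P` output format (mount point in the last column), and it assumes mount paths without spaces. Note that `df` itself can block on a hung hard NFS mount.

```shell
# Hypothetical helper: return 0 only if the given directory is itself
# a mount point, so writes cannot silently land on the local disk.
is_mount_point() {
    dir=$1
    [ -d "$dir" ] || return 1
    # Resolve to a physical absolute path so it compares cleanly with df output.
    dir=$(cd "$dir" && pwd -P) || return 1
    # df -P prints one header line, then one line per filesystem;
    # the last field of that line is the mount point.
    mnt=$(df -P "$dir" | awk 'NR == 2 { print $NF }')
    [ "$mnt" = "$dir" ]
}

# Example use (path taken from the first post):
#   is_mount_point /mnt/blockchains || { echo "NFS not mounted" >&2; exit 1; }
```

The check only helps at startup. It does not protect a process that already holds open files when the server goes away.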

Matthew Dresden · May 19, 2015

Just another thought.

If it continued writing to the local directory after the NFS connection died, the NFS export may fail to remount there until the local directory is no longer in use, but I would have to set up a test scenario before I could be certain of that.

After your connection dies, run lsof -p <pid of bitcoin process> to see the list of open files.

tuaris · May 19, 2015

I decided to run some tests and restart some of the 'Altcoins'. It looks like the database files are indeed getting damaged:

Code:

2015-05-19 04:50:46 UTC Verifying last 2500 blocks at level 1
2015-05-19 04:50:55 UTC ERROR: CBlock::ReadFromDisk() : GetHash() doesn't match index
2015-05-19 04:50:55 UTC ERROR: LoadBlockIndex() : block.ReadFromDisk failed
2015-05-19 04:50:55 UTC  block index  50791ms
2015-05-19 04:50:55 UTC Loading wallet...
2015-05-19 04:50:56 UTC nFileVersion = 60300
2015-05-19 04:50:56 UTC Error loading blkindex.dat
 wallet  1195ms
2015-05-19 04:50:56 UTC Done loading
2015-05-19 04:50:56 UTC mapBlockIndex.size() = 175040
2015-05-19 04:50:56 UTC nBestHeight = 174957
2015-05-19 04:50:56 UTC setKeyPool.size() = 103
2015-05-19 04:50:56 UTC mapWallet.size() = 385
2015-05-19 04:50:56 UTC mapAddressBook.size() = 1
2015-05-19 04:50:56 UTC PPCoin: Error loading blkindex.dat

2015-05-19 04:50:56 UTC DBFlush(false)
2015-05-19 04:50:56 UTC addr.dat refcount=0
2015-05-19 04:50:56 UTC addr.dat checkpoint
2015-05-19 04:50:56 UTC addr.dat closed
2015-05-19 04:50:57 UTC blkindex.dat refcount=0
2015-05-19 04:50:57 UTC blkindex.dat checkpoint
2015-05-19 04:50:57 UTC blkindex.dat closed
2015-05-19 04:50:57 UTC wallet.dat refcount=0
2015-05-19 04:50:57 UTC wallet.dat checkpoint
2015-05-19 04:50:57 UTC wallet.dat detach
2015-05-19 04:50:57 UTC wallet.dat closed
2015-05-19 04:50:57 UTC StopNode()
2015-05-19 04:50:57 UTC DBFlush(true)
2015-05-19 04:50:57 UTC addr.dat refcount=0
2015-05-19 04:50:57 UTC addr.dat checkpoint
2015-05-19 04:50:57 UTC addr.dat closed
2015-05-19 04:51:02 UTC PPCoin exiting

2015-05-19 04:55:01 UTC

Interesting that I had to restart the application in order to reveal the error. I will look at autofs(5).

tuaris · May 19, 2015

I'm on 9.3-RELEASE with these boxes, so my option is to use amd(8). I tested it out with net-p2p/zetacoin, and so far I like this better.

/etc/rc.conf:

Code:

# NFS Client
nfs_client_enable="YES"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"
amd_enable="YES"

#Wallets
zetacoin_enable="YES"
zetacoin_datadir="/host/storage01/mnt/internal/Blockchains/zetacoin"
...

I know this is slightly off topic, but I can't make enough sense of /etc/amd.map to re-map my paths to something like what I used to have in /etc/fstab:

Code:

storage01:/mnt/internal/Blockchains  /mnt/blockchains  nfs  rw  0  0
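
One way to get the old path back without editing the amd map at all might be a plain symlink from the former mount point to the amd-managed path. The sketch below assumes the default host map already serves the export under /host/storage01/mnt/internal/Blockchains, as the zetacoin_datadir setting above suggests. `link_legacy_path` is a hypothetical helper name, and this is not a substitute for a proper map entry.

```shell
# Hypothetical helper: expose an amd-managed path under a legacy location.
link_legacy_path() {
    target=$1   # path served by amd
    link=$2     # old fstab mount point

    # Refuse to replace anything that already exists at the legacy location.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing: $link already exists" >&2
        return 1
    fi
    ln -s "$target" "$link"
}

# Example use (paths taken from the posts above):
#   link_legacy_path /host/storage01/mnt/internal/Blockchains /mnt/blockchains
```

The old, empty /mnt/blockchains directory would have to be removed first, since the helper refuses to overwrite an existing path.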
