ZFS Metadata corruption after send/receive - kernel crashes on making changes to directory

jose.n · Jul 13, 2017

Hello. I've been bitten by what seems to be an enduring bug in the replication of pools with send/receive.

I had been using a relatively large ZFS volume reliably for the last year to store backups from Windows systems, made with robocopy (Microsoft's NIH rsync). Recently I had to migrate the volume to a bigger server and decided to send/receive the whole pool to a new pool. I took a recursive snapshot of the root volume and then sent it:

Code:

zfs snapshot -r zpool@temp
zfs send -Rv zpool@temp | ssh newserveripaddress zfs receive -F -d newzpool

My plan was then to move the system disks to the new server and keep using the system with the new pool.

Oddly, I had occasional losses of connection (even though I was using a direct attachment between the two servers), so I changed my copy plan to save partially received state:

Code:

zfs send -Rv zpool@temp | ssh newserveripaddress zfs receive -F -d -s newzpool

With that I was able to resume unfinished transfers (receive_resume_token stands for the value of that property on the receiving side):

Code:

zfs send -vt receive_resume_token | ssh newserveripaddress zfs receive -s -F -d newzpool

I then sent the remaining snapshots using the "-I" option. The whole process eventually finished, and I checked that the data was correct by comparing hashes on both sides for the few million files involved. All matched.
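For anyone wanting to repeat the hash comparison, here is a minimal sketch of one way to do it. It is not necessarily the method used above. The function name make_manifest, the HASH_CMD variable, and the paths in the usage comment are all made up for illustration, and the same hashing tool should be used on both servers because the two tools space their output differently.

```shell
#!/bin/sh
# Pick a SHA-256 tool: sha256sum on Linux, "sha256 -r" in the FreeBSD base
# system. Both print the hash first and the file path second.
if command -v sha256sum >/dev/null 2>&1; then
    HASH_CMD="sha256sum"
else
    HASH_CMD="sha256 -r"
fi

# make_manifest DIR (hypothetical helper): print one hash line for every
# regular file under DIR, with paths relative to DIR and sorted by path,
# so that manifests from two servers can be compared with diff.
# File names containing newlines are not handled.
make_manifest() {
    ( cd "$1" && find . -type f -exec $HASH_CMD {} + | LC_ALL=C sort -k 2 )
}

# Hypothetical usage (all paths are placeholders):
#   old server:  make_manifest /zpool/backups    > /tmp/old.sha256
#   new server:  make_manifest /newzpool/backups > /tmp/new.sha256
#   copy one manifest across, then:  diff /tmp/old.sha256 /tmp/new.sha256
```

An empty diff only says the file contents match. It says nothing about ACLs or extended attributes, which is where the trouble described below shows up.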

But as soon as I placed the new pool in production (same system disks, same configuration), I had kernel panics whenever Windows started a new backup to the samba volume:

Code:

panic: solaris assert: 0 == zfs_acl_node_read(dzp, &paclp, B_FALSE), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c, line: 1692
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b24477 at kdb_backtrace+0x67
#1 0xffffffff80ad97e2 at vpanic+0x182
#2 0xffffffff80ad9653 at panic+0x43
#3 0xffffffff824b520a at assfail+0x1a
#4 0xffffffff82263084 at zfs_acl_ids_create+0x1b4
#5 0xffffffff822689d0 at zfs_make_xattrdir+0x40
#6 0xffffffff82268c95 at zfs_get_xattrdir+0xc5
#7 0xffffffff8227e7e6 at zfs_lookup+0x106
#8 0xffffffff822871d1 at zfs_setextattr+0x181
#9 0xffffffff8110f03f at VOP_SETEXTATTR_APV+0x8f
#10 0xffffffff80b9c404 at extattr_set_vp+0x134
#11 0xffffffff80b9c544 at sys_extattr_set_file+0xf4
#12 0xffffffff80fa26ae at amd64_syscall+0x4ce
#13 0xffffffff80f8488b at Xfast_syscall+0xfb

The panic is always due to the same code, with the same backtrace. The line seems to be a verification of the correctness of some extended attribute on access to a file. It would be reasonable to return an error or even offline the file system, but crashing the whole kernel over a failed check seems counterproductive. Has anyone found similar problems after send/receive of pools?

I tracked down in the logs the file that caused the crash, found the specific directory where the problem happens, and noticed this odd behavior:

Code:

# ls -la
total 292
d---------+  7 200500  400513      8  3 jul 20:51 .
d---------+ 12 200500  400513     13  4 jul 20:33 ..
d---------+  2 200500  400513      4 19 jun 18:21 1. Ficha de abertura
d---------+  2 200500  400513      5 22 jun 15:59 2. Decisão de contratar
d---------+  2 200500  400513     12 20 jun 11:38 3. Peças procedimento
ls: ./4. Proposta: No such file or directory
d---------   2 200500  400513     13  3 jul 21:00 4. Proposta
ls: ./5. Esclarecimentos: No such file or directory
d---------   2 200500  400513      4  4 jul 20:54 5. Esclarecimentos
----------+  1 200500  400513  28672 12 jun 11:51 Thumbs.db

I can enter the "No such file or directory" subdirectories, but again ls shows similar behavior:

Code:

# ls -la
total 2590
ls: ./.: No such file or directory
d---------  2 200500  400513      13  3 jul 21:00 .
d---------+ 7 200500  400513       8  3 jul 20:51 ..
----------+ 1 200500  400513  655361 29 jun 17:07 Anexo I.pdf
----------+ 1 200500  400513  193260 29 jun 17:07 BV_Recorder.pdf
etc

The files themselves are readable and their contents are correct in the copied pool, despite the metadata problems. The filesystem in the original pool seems to have correct metadata and shows no problem whatsoever. But attempting to create any new file in one of these corrupted directories in the copied pool results in a kernel crash. Attempting to delete the directory has the same effect. Renaming existing files does not cause the crash.
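To find other affected directories without creating or deleting anything, a read-only scan along these lines might help. It is only a sketch: the function name list_suspect_dirs is made up, and it assumes that listing a directory is safe on the damaged pool, which rests solely on the observation above that ls printed errors but did not panic.

```shell
#!/bin/sh
# list_suspect_dirs ROOT (hypothetical helper): print every directory under
# ROOT for which "ls -la" writes anything to stderr, such as the
# "No such file or directory" messages shown above.
# Read-only: nothing is created, renamed, or deleted.
# Directory names containing newlines are not handled.
list_suspect_dirs() {
    find "$1" -type d -print | while IFS= read -r dir; do
        # Capture stderr only; the normal listing is discarded.
        errors=$(ls -la "$dir" 2>&1 >/dev/null)
        if [ -n "$errors" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Hypothetical usage (the path is a placeholder):
#   list_suspect_dirs /newzpool/backups > /tmp/suspect-dirs.txt
```

Going by the listings above, both a damaged subdirectory and its parent would likely be reported, since ls complains in each of them.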

I found a similar bug reported for FreeBSD 10.1 that was never closed (PR 198457), and some similar cases over at the FreeNAS forums. I posted an update to the PR because the problem is still present in 11.0-RELEASE-p9.

I'm willing to do more testing or poking around with zdb if that would help find the cause of this problem and other people are interested. Otherwise I guess I'll just start a new pool from scratch, but I will remain wary of trusting send/receive in the future.
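As a possible starting point for the zdb poking, a sketch like the following could dump the on-disk object behind one of the bad directories. The helper names inode_of and dump_object and the dataset and path in the usage comment are made up. It relies on the inode number that ls reports on ZFS being the object number inside the dataset, and on zdb accepting a dataset followed by an object number. How much detail zdb prints at this verbosity may vary between versions.

```shell
#!/bin/sh
# inode_of PATH (hypothetical helper): print the inode number of PATH.
# On ZFS this is the object number of the file or directory in its dataset.
inode_of() {
    ls -di "$1" | awk '{ print $1 }'
}

# dump_object DATASET PATH (hypothetical helper): print zdb's verbose
# description of the object behind PATH. DATASET must be the dataset that
# contains PATH. Needs root; zdb only reads from the pool.
dump_object() {
    zdb -dddd "$1" "$(inode_of "$2")"
}

# Hypothetical usage (dataset and path are placeholders):
#   dump_object newzpool/backups "/newzpool/backups/some/dir/4. Proposta"
```

Running the same dump against the matching directory in the original pool and comparing the two outputs might show which field differs.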
