Trouble with ZFS import/export

Hi all,

I'm fairly new to ZFS, but I've been using derivatives of FreeBSD for a few years now, namely pfSense. I have set up a system (Intel NUC NUC5CPYH, updated to the latest firmware) to hold a remote replica of a ZFS volume on a USB 3.0-attached Seagate Backup+ Hub 8TB. I was slightly in doubt about its stability, but I cannot see anything wrong beyond USB-related messages while booting:

Code:
Copyright (c) 1992-2016 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-RELEASE-p9 #0: Tue Apr 11 08:48:40 UTC 2017
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: Intel(R) Celeron(R) CPU  N3050  @ 1.60GHz (1600.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x406c3  Family=0x6  Model=0x4c  Stepping=3
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,AESNI,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 4294967296 (4096 MB)
avail memory = 3975299072 (3791 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <INTEL  NUC5CPYB>
WARNING: L1 data cache covers less APIC IDs than a core
0 < 1
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
random: unblocking device.
ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 128/32 (20160527/tbfadt-650)
ioapic0 <Version 2.0> irqs 0-114 on motherboard
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff8101d970, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard
acpi0: <INTEL NUC5CPYB> on motherboard
acpi0: Power Button (fixed)
unknown: I/O range not supported
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0x80000000-0x80ffffff,0x88000000-0x8fffffff at device 2.0 on pci0
vgapci0: Boot video device
ahci0: <AHCI SATA controller> port 0xf060-0xf07f mem 0x81427000-0x814277ff at device 19.0 on pci0
ahci0: AHCI v1.31 with 2 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
xhci0: <XHCI (generic) USB 3.0 controller> mem 0x81400000-0x8140ffff at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
usbus0 on xhci0
pci0: <encrypt/decrypt> at device 26.0 (no driver attached)
hdac0: <Intel (0x2284) HDA Controller> mem 0x81414000-0x81417fff at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pcib1: [GIANT-LOCKED]
pcib2: <ACPI PCI-PCI bridge> at device 28.1 on pci0
pci1: <ACPI PCI bus> on pcib2
pci1: <network> at device 0.0 (no driver attached)
pcib3: <ACPI PCI-PCI bridge> at device 28.2 on pci0
pci2: <ACPI PCI bus> on pcib3
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0x81204000-0x81204fff,0x81200000-0x81203fff at device 0.0 on pci2
re0: Using 1 MSI-X message
re0: Chip rev. 0x54000000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: b8:ae:ed:7d:4a:55
re0: netmap queues/slots: TX 1/256, RX 1/256
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
acpi_lid0: <Control Method Lid Switch> on acpi0
acpi_button0: <Sleep Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atrtc0: <AT realtime clock> at port 0x70 irq 8 on isa0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
usbus0: 5.0Gbps Super Speed USB v3.0
ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
            to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
nvme cam probe device init
hdacc0: <Realtek (0x0283) HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek (0x0283) Audio Function Group> at nid 1 on hdacc0
pcm0: <Realtek (0x0283) (Analog)> at nid 27 and 25 on hdaa0
pcm1: <Realtek (0x0283) (Front Analog Headphones)> at nid 33 on hdaa0
pcm2: <Realtek (0x0283) (Internal Digital)> at nid 30 on hdaa0
hdacc1: <Intel (0x2883) HDA CODEC> at cad 2 on hdac0
hdaa1: <Intel (0x2883) Audio Function Group> at nid 1 on hdacc1
pcm3: <Intel (0x2883) (HDMI/DP 8ch)> at nid 7 on hdaa1
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <KINGSTON SV300S37A60G 505ABBF0> ATA8-ACS SATA 3.x device
ada0: Serial Number 50026B733104EDC9
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 57241MB (117231408 512 byte sectors)
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 1600053580 Hz quality 1000
Trying to mount root from zfs:zroot/ROOT/default []...
Root mount waiting for: usbus0
uhub0: 13 ports with 13 removable, self powered
Root mount waiting for: usbus0
ugen0.2: <Seagate> at usbus0
uhub1: <Seagate Backup+ Hub, class 9/0, rev 2.10/48.85, addr 1> on usbus0
uhub1: MTT enabled
Root mount waiting for: usbus0
uhub1: 3 ports with 3 removable, self powered
ugen0.3: <vendor 0x05e3> at usbus0
uhub2: <vendor 0x05e3 USB2.0 Hub, class 9/0, rev 2.00/32.98, addr 2> on usbus0
uhub2: MTT enabled
Root mount waiting for: usbus0
uhub2: 4 ports with 4 removable, self powered
Root mount waiting for: usbus0
usb_alloc_device: set address 4 failed (USB_ERR_TIMEOUT, ignored)
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 4 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_TIMEOUT, ignored)
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 4 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_TIMEOUT, ignored)
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 4 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 4 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_TIMEOUT, ignored)
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 4 failed, USB_ERR_TIMEOUT
ugen0.4: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
ugen0.4: <Seagate> at usbus0
uhub3: <Seagate> on usbus0
Root mount waiting for: usbus0
uhub3: 3 ports with 3 removable, self powered
Root mount waiting for: usbus0
ugen0.5: <Seagate> at usbus0
umass0: <Seagate Backup+ Hub BK, class 0/0, rev 3.10/1.00, addr 5> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x0100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
da0: <Seagate Backup+ Hub BK D781> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number NA8TLK0H
da0: 400.000MB/s transfers
da0: 7630885MB (15628053167 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>
GEOM_ELI: Device ada0p3.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
re0: link state changed to DOWN
re0: link state changed to UP

I've wondered about those earlier USB messages and have tried a few different parameters in /boot/loader.conf, but as long as the hard drive is plugged in they appear at boot time. In no case do I receive any further messages about erroneous USB behaviour afterwards.

I've set up my ZFS pool on top of geli, so that the entire volume is encrypted.

Under normal circumstances, I can import the pool by doing geli attach -pk /tmp/key /dev/da0 followed by zpool import master. However, after doing this a few times the pool becomes unexportable, and I've been wondering why this is the case:

Code:
root@sunflower:~ # geli attach -pk /tmp/key /dev/da0
root@sunflower:~ # zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot  51.5G  1.11G  50.4G         -     0%     2%  1.00x  ONLINE  -
root@sunflower:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
     ada0p4    ONLINE       0     0     0

errors: No known data errors
root@sunflower:~ # zpool import -N master
root@sunflower:~ # zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
master  7.25T   167M  7.25T         -     0%     0%  1.00x  ONLINE  -
zroot   51.5G  1.11G  50.4G         -     0%     2%  1.00x  ONLINE  -
root@sunflower:~ # zpool status
  pool: master
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    master      ONLINE       0     0     0
     da0.eli   ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
     ada0p4    ONLINE       0     0     0

errors: No known data errors
root@sunflower:~ # zpool export master
cannot export 'master': pool is busy

There aren't any processes using the pool, so I don't know the reason for the frequent "pool is busy" messages I'm getting. I would appreciate some help getting to the bottom of the problem.

Thanks
 
At a glance I'd say the pool is busy due to a mounted FS. Did you verify that none of the filesets from this pool are mounted?
 
zfs unmount -a will tell you which filesystems are in use by the system. fstat(1) can help you narrow things down from there.
Whoa, don't do this blindly if you have more ZFS pools - it unmounts all of them. Just do the classic mount and check if any FS from that pool is mounted.
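For example, with the pool from this thread (a quick sketch; adjust the pool name as needed):
Code:
# mount | grep '^master'
# zfs list -H -o name,mounted -r master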
 
Thanks for your suggestions. I should add that zpool import -N master is supposed *not* to mount the filesystem. Even if I do mount it using zpool import master, I can unmount it (and indeed, as part of zpool export it gets unmounted if it is mounted), but I cannot export the pool. Not even when using zpool export -f master.

I've tried fstat and found nothing using it, as far as I could see (it's not even mounted, so how could I possibly check it that way?).

Since my original post I've opened a bug report which makes it clear that the odd behaviour is not related to geli/encryption at all, and is seemingly random: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219972 (see comment 1). I would very much appreciate it if someone could help me reproduce the issue on their system. (It's almost as if a race condition exists somewhere, since I cannot reproduce it 100% of the time.)

I've started digging into the ZFS source code to see if I can understand the root cause of this issue, but since I'm not very familiar with it, it's going to take a while before I'm in a position to analyse it. I could also do with some help debugging the kernel :)
 
Right, I didn't notice the -N there.

I could help with tests, though I don't have any box where I can connect disks via USB. I did run it on a machine like this:

Setup:
Code:
# dd if=/dev/zero of=blob00 bs=1024k count=256
# mdconfig -a -t vnode blob00
# geli init /dev/md0
# geli attach /dev/md0
# geli status
   Name  Status  Components
md0.eli  ACTIVE  md0
#
# zpool create crypt0 /dev/md0.eli
# zfs create crypt0/lun0              # eh, forgot the -V but it doesn't matter .. 
# zfs list -r crypt0
NAME          USED  AVAIL  REFER  MOUNTPOINT
crypt0        142K   208M    19K  /crypt0
crypt0/lun0    19K   208M    19K  /crypt0/lun0
#
# zpool export crypt0
#
So I set up a loop and kept importing/exporting:

Code:
# while true; do echo; zpool import -N crypt0 && sleep 1 && zfs list -r crypt0 && zpool export crypt0 && sleep 1 ; done
Worked without problem ..

truss/dtrace might be a good option here too - to know what is going on.
 
Thanks for trying.

I wonder, have you tried with two pools, and send/recv between them like I describe in comment 1 of #219972? It seems that this is necessary to trigger my problem.

I will attempt to rule out the USB devices by following your strategy with files, and also try to follow up with truss/dtrace. I've now compiled a new kernel with debugging enabled, in case I get some more time to dig into kernel land with this.

Thanks
 
I've successfully reproduced the problem using only two blobs, as I describe below.

First I created the blobs following your method:
Code:
root@sunflower:~ # dd if=/dev/zero of=blob00 bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 0.441321 secs (608253781 bytes/sec)
root@sunflower:~ # dd if=/dev/zero of=blob11 bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 0.350000 secs (766959449 bytes/sec)
root@sunflower:~ # mdconfig -a -t vnode blob00
md0
root@sunflower:~ # mdconfig -a -t vnode blob11
md1
root@sunflower:~ # zpool create master1 /dev/md0
root@sunflower:~ # zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
master1   240M   522K   239M         -     1%     0%  1.00x  ONLINE  -
zroot    51.5G  3.55G  47.9G         -     3%     6%  1.00x  ONLINE  -

I then created a file:
Code:
root@sunflower:~ # echo "testing" > /master1/test

and ran a couple of tests on this:
Code:
root@sunflower:~ # zpool export master1
root@sunflower:~ # zpool import master1
root@sunflower:~ # zpool export master1
root@sunflower:~ # zpool import master1
... (all successfully!)

I then created the second pool, and exported/imported it fine:
Code:
root@sunflower:~ # zpool create -O atime=off slave1 /dev/md1
root@sunflower:~ # zpool export slave1
root@sunflower:~ # zpool import -N slave1

At this point I created a snapshot in master1 and sent it over to slave1:
Code:
root@sunflower:~ # zfs snapshot -r "master1@today"
root@sunflower:~ # zfs send -Rev "master1@today" | zfs recv -Fu slave1
full send of master1@today estimated size is 12K
total estimated size is 12K
TIME        SENT   SNAPSHOT
root@sunflower:~ # zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
master1   240M   228K   240M         -     2%     0%  1.00x  ONLINE  -
slave1    240M   197K   240M         -     2%     0%  1.00x  ONLINE  -
zroot    51.5G  3.55G  47.9G         -     3%     6%  1.00x  ONLINE  -
root@sunflower:~ # zfs mount slave1
root@sunflower:~ # ls /slave1
test
root@sunflower:~ # zpool export slave1
root@sunflower:~ # zpool import slave1
root@sunflower:~ # zpool export slave1
cannot export 'slave1': pool is busy
...and bang, the pool is now unexportable following the recv :(

I wonder if you could try reproducing the above? I think I've now ruled out both geli and USB as problem sources.
 
Your last post here reminds me that something similar has happened to me too: if the send/receive process is interrupted in any way, the target pool will remain in a permanent "busy" state, because the filesystem creates a sort of "lock dataset" or partial snapshot during the receive process. (I don't know if "locking" is what's actually happening here, but it's similar to the lock or PID files that other processes create.) I'm trying to hunt down the solution I found when this last happened to me; I'll post it if I can.
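In the meantime, if such leftover receive state exists it should show up when listing everything on the target pool, e.g. (slave1 being the pool name used in this thread):
Code:
# zfs list -t all -r slave1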

If the process isn't being interrupted in some way, that may indeed be a sign of a bug.
 
Well, I tried it your way and I don't have a problem at all:
Code:
# while true; do zpool export crypt1 && zpool import crypt1 ; echo -ne "." ; done
....................................^C
My setup is 11.0-RELEASE-p1 #0 r306420.

UPDATE: tried this also on 11.0-RELEASE-p3 #1 r308208 and no problem either. I suggest doing some tracing with truss/dtrace to see what the problem is ..
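For example, something like this captures the whole trace to a file (slave1 being your failing pool):
Code:
# truss -f -o /tmp/zpool-export.truss zpool export slave1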
 
Thanks both for your help!

I'm now a bit puzzled as to what I'm doing differently. I've tested with both 11.0-RELEASE-p9 and 11.1-BETA-2 now, and I will see if I can reproduce it with an earlier version. I've got an ESXi box at hand, so I'll try to reproduce this on a VM as well.

Regarding truss, I have just run it on that failing command, but don't seem to get very far with the output (complete output available at: https://pastebin.com/hibmpwCA):
Code:
...
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)        = 0 (0x0)
getcontext(0x7fffffffdc30)            = 0 (0x0)
modfind("zfs")                    = 3 (0x3)
openat(AT_FDCWD,"/dev/zfs",O_RDWR,00)        = 3 (0x3)
open("/dev/zero",O_RDONLY,0666)            = 4 (0x4)
open("/etc/zfs/exports",O_RDONLY,0666)        = 5 (0x5)
openat(AT_FDCWD,"/dev/zfs",O_RDWR,00)        = 6 (0x6)
__sysctl(0x7fffffffc4f0,0x2,0x7fffffffc540,0x7fffffffc538,0x800e9b02c,0x15) = 0 (0x0)
__sysctl(0x7fffffffc540,0x4,0x7fffffffc5f4,0x7fffffffc5f8,0x0,0x0) = 0 (0x0)
ioctl(3,0xc0185a05 { IORW 0x5a('Z'), 5, 24 },0xffffc588) = 0 (0x0)
fstat(4,{ mode=crw-rw-rw- ,inode=21,size=0,blksize=4096 }) = 0 (0x0)
ioctl(4,TIOCGETA,0xffffd200)            ERR#25 'Inappropriate ioctl for device'
lseek(4,0x0,SEEK_SET)                = 0 (0x0)
lseek(4,0x0,SEEK_CUR)                = 0 (0x0)
getfsstat(0x0,0x0,0x1)                = 13 (0xd)
getfsstat(0x802c3b000,0x2ff0,0x1)        = 13 (0xd)
lseek(4,0x1,SEEK_CUR)                = 1 (0x1)
lseek(4,0x0,SEEK_CUR)                = 1 (0x1)
lseek(4,0x1,SEEK_CUR)                = 2 (0x2)
lseek(4,0x0,SEEK_CUR)                = 2 (0x2)
lseek(4,0x1,SEEK_CUR)                = 3 (0x3)
lseek(4,0x0,SEEK_CUR)                = 3 (0x3)
lseek(4,0x1,SEEK_CUR)                = 4 (0x4)
lseek(4,0x0,SEEK_CUR)                = 4 (0x4)
lseek(4,0x1,SEEK_CUR)                = 5 (0x5)
lseek(4,0x0,SEEK_CUR)                = 5 (0x5)
lseek(4,0x1,SEEK_CUR)                = 6 (0x6)
lseek(4,0x0,SEEK_CUR)                = 6 (0x6)
lseek(4,0x1,SEEK_CUR)                = 7 (0x7)
lseek(4,0x0,SEEK_CUR)                = 7 (0x7)
lseek(4,0x1,SEEK_CUR)                = 8 (0x8)
lseek(4,0x0,SEEK_CUR)                = 8 (0x8)
lseek(4,0x1,SEEK_CUR)                = 9 (0x9)
lseek(4,0x0,SEEK_CUR)                = 9 (0x9)
lseek(4,0x1,SEEK_CUR)                = 10 (0xa)
lseek(4,0x0,SEEK_CUR)                = 10 (0xa)
lseek(4,0x1,SEEK_CUR)                = 11 (0xb)
lseek(4,0x0,SEEK_CUR)                = 11 (0xb)
lseek(4,0x1,SEEK_CUR)                = 12 (0xc)
lseek(4,0x0,SEEK_CUR)                = 12 (0xc)
lseek(4,0x1,SEEK_CUR)                = 13 (0xd)
lseek(4,0x0,SEEK_CUR)                = 13 (0xd)
ioctl(3,0xc0185a03 { IORW 0x5a('Z'), 3, 24 },0xffffc1e8) ERR#16 'Device busy'
cannot export 'slave1': pool is busy
write(2,"cannot export 'slave1': pool is "...,37) = 37 (0x25)
close(3)                    = 0 (0x0)
close(4)                    = 0 (0x0)
close(5)                    = 0 (0x0)
close(6)                    = 0 (0x0)
exit(0x1)                   
process exit, rval = 1

The 'Device busy' (ERR#16, i.e. EBUSY) returned by that final ioctl(2) on /dev/zfs is likely where the error surfaces from.
 
So far I'm unable to reproduce this on FreeBSD 11.0-RELEASE-p1 #0 r306420 running as a VM on ESXi. I used the VMDK image from the official FreeBSD downloads page, which uses UFS rather than ZFS for the root filesystem (I hope that isn't in itself a potential source of problems).

Code:
FreeBSD  11.0-RELEASE-p1 FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29 01:43:23 UTC 2016     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

For comparison, here is the same section of the truss output for zpool export slave1 when run on 11.0-RELEASE-p1:
Code:
...
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)        = 0 (0x0)
getcontext(0x7fffffffdb80)            = 0 (0x0)
sysarch(AMD64_GET_XFPUSTATE,0x7fffffffdb48)    = 0 (0x0)
modfind("zfs")                    = 490 (0x1ea)
openat(AT_FDCWD,"/dev/zfs",O_RDWR,00)        = 3 (0x3)
open("/dev/zero",O_RDONLY,0666)            = 4 (0x4)
open("/etc/zfs/exports",O_RDONLY,0666)        ERR#2 'No such file or directory'
openat(AT_FDCWD,"/dev/zfs",O_RDWR,00)        = 5 (0x5)
__sysctl(0x7fffffffc580,0x2,0x7fffffffc5d0,0x7fffffffc5c8,0x800e9b02c,0x15) = 0 (0x0)
__sysctl(0x7fffffffc5d0,0x4,0x7fffffffc684,0x7fffffffc688,0x0,0x0) = 0 (0x0)
ioctl(3,0xc0185a05 { IORW 0x5a('Z'), 5, 24 },0xffffc618) = 0 (0x0)
fstat(4,{ mode=crw-rw-rw- ,inode=27,size=0,blksize=4096 }) = 0 (0x0)
ioctl(4,TIOCGETA,0xffffd290)            ERR#25 'Inappropriate ioctl for device'
lseek(4,0x0,SEEK_SET)                = 0 (0x0)
lseek(4,0x0,SEEK_CUR)                = 0 (0x0)
getfsstat(0x0,0x0,0x1)                = 4 (0x4)
getfsstat(0x802c32000,0xec0,0x1)        = 4 (0x4)
lseek(4,0x1,SEEK_CUR)                = 1 (0x1)
lseek(4,0x0,SEEK_CUR)                = 1 (0x1)
lseek(4,0x1,SEEK_CUR)                = 2 (0x2)
lseek(4,0x0,SEEK_CUR)                = 2 (0x2)
lseek(4,0x1,SEEK_CUR)                = 3 (0x3)
lseek(4,0x0,SEEK_CUR)                = 3 (0x3)
lseek(4,0x1,SEEK_CUR)                = 4 (0x4)
ioctl(3,0xc0185a12 { IORW 0x5a('Z'), 18, 24 },0xffffc1e8) = 0 (0x0)
ioctl(3,0xc0185a05 { IORW 0x5a('Z'), 5, 24 },0xffffaf38) = 0 (0x0)
lseek(4,0x0,SEEK_CUR)                = 4 (0x4)
unmount("/slave1",0)                = 0 (0x0)
ioctl(3,0xc0185a28 { IORW 0x5a('Z'), 40, 24 },0xffffbb88) = 0 (0x0)
rmdir("/slave1")                = 0 (0x0)
ioctl(3,0xc0185a03 { IORW 0x5a('Z'), 3, 24 },0xffffc278) = 0 (0x0)
close(3)                    = 0 (0x0)
close(4)                    = 0 (0x0)
close(5)                    = 0 (0x0)
exit(0x0)                   
process exit, rval = 0

I then updated FreeBSD using freebsd-update and rebooted the VM, so now I'm on FreeBSD 11.0-RELEASE-p10. Once more, I also cannot reproduce the odd behaviour.

I've tried allocating two cores to the VM, and even on p10 I still cannot reproduce the issue. I'm now wondering whether the issue I'm reporting could be related to having the root filesystem also on ZFS. This is something that I haven't covered in my VM tests yet, as it requires a bit more configuration than I can afford at this time of the day, but I will definitely try it just to rule this scenario out.

Or perhaps the specific CPU in my other device has some interaction with the synchronization primitives on FreeBSD that is causing this issue? Or is there faulty hardware to blame? (I thoroughly tested the memory with Memtest86+ a few days ago, though.)
 
Small update on this..

I've configured a VM with a ZFS root filesystem on FreeBSD 11.0-RELEASE-p1 (amd64, virtualized on an ESXi host with two CPU cores allocated on one socket). I then performed nearly all of the above steps, up to and including:
Code:
root@freetest2:~ # zfs snapshot -r "master1@today"
root@freetest2:~ # zfs send -Rev "master1@today" | zfs recv -Fu slave1
full send of master1@today estimated size is 10K
total estimated size is 10K
TIME        SENT   SNAPSHOT
root@freetest2:~ # zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
master1   240M   224K   240M         -     2%     0%  1.00x  ONLINE  -
slave1    240M   414K   240M         -     2%     0%  1.00x  ONLINE  -
zroot    5.97G   992M  5.00G         -     8%    16%  1.00x  ONLINE  -
root@freetest2:~ # zfs mount slave1
root@freetest2:~ # ls /slave1/
test
root@freetest2:~ # zpool export slave1
root@freetest2:~ # zpool import slave1
root@freetest2:~ # zpool export slave1
root@freetest2:~ # zpool import slave1
root@freetest2:~ # zpool export slave1
root@freetest2:~ # zpool import slave1
root@freetest2:~ # zpool export slave1
root@freetest2:~ # zpool import slave1
At the last step I decided to press some random keys in my SSH terminal... and zpool import slave1 got stuck this time; no amount of Ctrl-C could get it unstuck.

I peeked at the ESXi VM terminal and got the following:
Code:
(da0:mpt0:0:0:0): UNMAP. CDB: 42 00 00 00 00 00 00 00 78 00
(da0:mpt0:0:0:0): CAM status: SCSI Status Error
(da0:mpt0:0:0:0): SCSI status: Busy
(da0:mpt0:0:0:0): Retrying command.

Could this be due to the VM's drivers? I'm not sure, but it nevertheless makes me worried.

UPDATE: I've since gotten around this problem by selecting a different disk controller for the VM, so it isn't an issue anymore. However, the ZFS problem persists, which I discuss in the following reply.
 
OK, there is definitely a mysterious bug here, likely a race condition, which is going to be hard to debug.

For the sake of trying a few more things, I decided to get the FreeBSD test suite; to do this I initiated a make buildworld. While it was running I ran my test case (as explained above in my post, and which I have now written as a bash script) and I could not reproduce the bug. However, once the make buildworld finished, the problem came back.

Until recently, I could not reproduce this issue on my ESXi host, where I had allocated 2 vCPUs on a single virtual socket; the host CPU is an Intel Xeon D-1518 (4 cores). I then reserved 1000 MHz for the VM and changed the scheduling affinity to 0,2 (I have 8 logical cores, as Hyper-Threading is active). I then ran my bash script... and surprise, surprise, at the 56th attempt to import/export the pool it came back with 'pool is busy' and stayed that way:

Code:
Final test of export/import for slave1
(56/100)cannot export 'slave3': pool is busy
...
(99/100)cannot export 'slave3': pool is busy
cannot import 'slave3': a pool with that name is already created/imported,
and no additional pools with that name were found
(100/100)cannot destroy 'slave3': pool is busy
mdconfig: ioctl(/dev/mdctl): Device busy

I now have no doubt that there is a race condition of sorts in the FreeBSD code. Until now I thought this must be just a problem with my Intel NUC platform, but I can now more or less reliably reproduce this issue on a VM as well. The trick is that with two cores available, the other one needs to be lightly loaded: there must not be much else running on that other core, otherwise the bug does not surface.

I would appreciate it if anyone could have a go at reproducing the bug given the right conditions, i.e. where some other core is generally available, and then comparing that to a system that is being taxed. On my Intel NUC I can easily avoid the bug by running another process such as dd if=/dev/zero of=/dev/null at the same time as the test script. Interestingly, it seems that killing this process once the script has already sent/received the snapshot between the pools is sufficient to exhibit the problem, which suggests the bug is in the export ioctl code (but somehow depends on a snapshot having been sent across earlier).
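In other words (repro.sh below is just a stand-in name for the test script linked in the next paragraph):
Code:
# dd if=/dev/zero of=/dev/null &
# sh repro.sh    # with the other core kept busy, export/import keeps working
# kill %1
# sh repro.sh    # with the other core idle, 'pool is busy' comes back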

I include my test script for completeness at: https://pastebin.com/YcKSU1LA.
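For anyone who would rather not follow the link, here is a rough sketch of what the script does, reconstructed from the steps earlier in this thread (the real script differs in details such as pool names, number of pools and cleanup):
Code:
#!/bin/sh
# Sketch only -- reconstructed from the steps described in this thread.
dd if=/dev/zero of=blob00 bs=1024k count=256
dd if=/dev/zero of=blob11 bs=1024k count=256
md0=$(mdconfig -a -t vnode blob00)
md1=$(mdconfig -a -t vnode blob11)
zpool create master1 /dev/$md0
zpool create -O atime=off slave1 /dev/$md1
echo "testing" > /master1/test
zfs snapshot -r master1@today
zfs send -Rev master1@today | zfs recv -Fu slave1
i=1
while [ $i -le 100 ]; do
    printf "(%d/100)" $i
    zpool export slave1 || break
    zpool import -N slave1 || break
    i=$((i + 1))
done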
 
Sorry for not checking back in sooner, kathode. I saw the bug report you'd posted, but have you also brought this up on the freebsd-fs@ mailing list? It's pretty active and watched by the development team, so there's a good chance you'll find some willing testers there.
 
Hi ANONKNUSA, thanks for your reply. I opened PR 219972 on the bug tracker on the 13th of June, which subsequently got assigned to freebsd-fs@, so I think the original report, along with my latest updates on this issue, has been sent around. I'm waiting for further input from the community, as this is obviously a tricky issue to test and debug. I've looked into the source code for the zpool ioctls, but it is tricky for someone unfamiliar like me to get a good grasp of it.

Thanks!
 
Yeah, I'm also crazy busy these days, so I didn't look closer either. But this bug did catch my attention. Were you able to reproduce this on a physical server too? Did you try to attach a debugger to the stuck zpool command?
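Off the top of my head, something like this should at least show where it's sitting (procstat for the kernel side, gdb for userland -- whichever gdb you have installed):
Code:
# pgrep -fl 'zpool import'
# procstat -kk <pid>
# gdb /sbin/zpool <pid>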

UPDATE: now this is interesting, I was able to reproduce this bug on r306420 on different HW.

ltrace shows:

Code:
# ltrace -f zpool export crypt1
..
..
pid 68917] getopt(2, 0x7fffffffeac0, "fF")                                                                                       = -1
[pid 68917] zpool_open_canfail(0x802c1c000, 0x7fffffffed3d, 0x7fffffffed3d, 1, 0xffff800802c2a2d0)                                = 0x802c2c000
[pid 68917] zpool_disable_datasets(0x802c2c000, 0, 512, 986, 17)                                                                  = 0
[pid 68917] zpool_export(0x802c2c000, 0, 0x619060, 1, 0x1ffffffffffffffccannot export 'crypt1': pool is busy
)                                                         = 0xffffffff
[pid 68917] zpool_close(0x802c2c000, 0x7fffffffba30, 37, 0x83156b424db6c00b, 0x7ff7ff165273)                                      = 0x83156b424db6c00b
[pid 68917] libzfs_fini(0x802c1c000, 0x802c0d000, 0x801457600, 0x802c0d000, 0x7ff7ff165273)                                       = 0x83156b424db6c00b
[pid 68917] getenv("ZFS_ABORT")                                                                                                   = NULL
[pid 68917] exit(1 <no return ...>
I used export ZFS_ABORT=1 to get a core dump from this command. The problem is I don't have the kernel/world compiled with debug symbols; it's quite a reverse-engineering game doing this without them .. I'll try to find some time to do so ..

UPDATE2:
I'm using the GENERIC kernel on this HW, so I downloaded the debug symbols and tried it. It took some time to trigger the bug a second time, so I agree with you -- there seems to be a race condition here. My backtrace shows:

Code:
(gdb) bt
#0  0x00000008011fe7d8 in ioctl () from /lib/libc.so.7
#1  0x00000008024f7273 in zcmd_ioctl_compat (fd=3, request=3, zc=<optimized out>, cflag=0) at /usr/src/cddl/lib/libzfs_core/../../../sys/cddl/contrib/opensolaris/common/zfs/zfs_ioctl_compat.c:989
#2  0x0000000800e91e5a in zcmd_ioctl (fd=3, request=3, zc=0x7fffffffc610) at /usr/src/cddl/lib/libzfs/../../../cddl/contrib/opensolaris/lib/libzfs/common/libzfs_compat.c:110
#3  0x0000000800e7e3b5 in zpool_export_common (zhp=0x802c2c000, force=B_FALSE, hardforce=B_FALSE, log_str=0x619060 <history_str> "zpool export crypt1") at /usr/src/cddl/lib/libzfs/../../../cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c:1381
#4  0x000000000040aa9c in zpool_do_export (argc=<optimized out>, argv=<optimized out>) at /usr/src/cddl/sbin/zpool/../../../cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:1223
#5  0x00000000004073af in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/cddl/sbin/zpool/../../../cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:5609
(gdb)
Which means the ioctl call fails. The register state before the syscall is:


Code:
(gdb) i r $rax $rdi $rsi $rdx
rax 0x36 54
rdi 0x3 3
rsi 0xc0185a03 3222821379
rdx 0x7fffffffc148 140737488339272
(gdb)

Where rdx points to the formatted zfs command. I don't have any ZFS internals knowledge, but it would be interesting to know more. I guess I need to set up qemu and try to trigger the bug there -- it would be easier to debug that way. Or I'll try to find a serial cable lying around and debug this HW remotely.
 
Oh, wow! Now you've managed to get more info than me.. :)

My understanding from having looked at the code is that EBUSY is raised in the function spa_export_common at line 4672 of spa.c. I've looked into spa_inject_ref and it seems that it is likely not applicable here, because it appears to be a feature of the zinject fault-injection tool. Therefore, the problem is that at this point spa_refcount_zero(spa) (line 4667) is not true.

I looked into other files, and it seems that the refcount should only be changed by spa_open_ref (line 901 of spa_misc.c). There aren't that many places where spa_open_ref is called. Lines 254 and 267 of dsl_dir.c are intriguing, but I don't fully understand them, particularly line 254. The function dsl_dir_hold_obj is itself called in a number of different places, namely:

- line 247 of dsl_pool.c in dsl_pool_open_special_dir, and line 325 of dsl_pool.c in dsl_pool_open
- lines 453 and 470 of dsl_dataset.c in dsl_dataset_hold_obj; in particular, line 470 is doing something with snapshots. dsl_dataset_hold_obj is itself called in a number of places within dsl_dataset.c, quite a few of them related to snapshots.
- ... (amongst other files, like dmu_objset.c and dsl_destroy.c)

My feeling is that the 'recv' of the new snapshot, or the rollback (implied by the -F in zfs recv -Fu), has an impact on the treatment of snapshots when the pool is 'opened' (either following the recv, or on its own). But I could be wrong; I'm not familiar with ZFS. These are just my thoughts from having read through some of the source code. I haven't tried zfs recv without -F to see if that has an impact.

I very much appreciate your time and patience to get this bug to reproduce on your system.

I've seen somewhere that there is a DEBUG_ZFS option in the source code. I wonder if we could get debugging messages out of it by recompiling a kernel with that flag or similar?
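Short of recompiling, maybe the reference counting could also be watched on a live system with dtrace while the import/export loop runs -- an untested sketch, assuming the fbt probes for zfs.ko are available:
Code:
# dtrace -n '
fbt:zfs:spa_open_ref:entry,
fbt:zfs:spa_close:entry
{
    @[probefunc, stack()] = count();
}'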
 
My troubleshooting so far was only from userspace; now the fun begins in the kernel. I haven't had time to set up a qemu environment yet. I'll try to find some free time next week to look at it.
As mentioned, I have zero experience with the ZFS internals; I would start by looking at the ZFS structures first. It's easier to understand the functions once one has an idea about the data.

On another note, I haven't yet tried to trigger the bug without sending a snapshot; it might be worth trying that for a while too. And to your question regarding the CPU load -- those CPUs were pretty much idle, or idle-ish, while I was doing the import/export in the loop.

UPDATE: btw, a very nice and detailed description of ZFS internals and its structures: https://people.freebsd.org/~gibbs/zfs_doxygenation/html/annotated.html
 
I tried for quite some time to reproduce this bug in qemu, without luck. Tried a single CPU, more CPUs -- nothing.
 
I ran into this bug as well with my backup script (which uses snapshots and import/export). My workaround: delete /boot/zfs/zpool.cache (make a backup first) and restart your system. Afterwards all pools (except for zroot) will have to be reimported, but at least the stuck "busy" state is gone.
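Roughly (adjust the pool name to your own stuck pool):
Code:
# cp -p /boot/zfs/zpool.cache /boot/zfs/zpool.cache.bak
# rm /boot/zfs/zpool.cache
# shutdown -r now
# (after the reboot)
# zpool import master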
 