ZFS resilver stalled?


I'm running FreeBSD 10.2-RELEASE-p7

I replaced a bad disk in a zpool and it triggered a resilver as expected. It seemed to be running fine, but for a number of hours (at least four, possibly longer) it has been stuck at 8m remaining and 99.71% done (zpool status below).
  pool: s11d33R
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 11 10:53:06 2016
        40.0T scanned out of 40.1T at 242M/s, 0h8m to go
        3.99T resilvered, 99.71% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s11d33R                          DEGRADED     0     0     0
          raidz2-0                       ONLINE       0     0     0
            multipath/J11F12-1EJD21TJ    ONLINE       0     0     0
            multipath/J11F13-1EJDHA1J    ONLINE       0     0     0
            multipath/J11F14-1EJDGY9J    ONLINE       0     0     0
            multipath/J11F15-1EJDGGRJ    ONLINE       0     0     0
            multipath/J11F16-1EJ7X6JJ    ONLINE       0     0     0
          raidz2-1                       DEGRADED     0     0     0
            multipath/J11F17-1EJD8EWJ    ONLINE       0     0     0
            replacing-1                  OFFLINE      0     0     0
              8878648567307541532        OFFLINE      0     0     0  was /dev/multipath/J11F18-1EJDGJWJ
              multipath/J11F22-1EJ39HKJ  ONLINE       0     0     0  (resilvering)
            multipath/J11F19-1EJDGJSJ    ONLINE       0     0     0
            multipath/J11F20-1EJD2L3J    ONLINE       0     0     0
            multipath/J11F21-1EJD9A7J    ONLINE       0     0     0

errors: No known data errors
Is this normal? Can the last stage of a resilver take a long time without updating the scan stats?

I did notice in /var/log/messages earlier this morning
Jul 12 03:35:25 freebsd02 kernel: (da36:mps0:0:52:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 a2 91 40 00 00 00 60 00 00 length 49152 SMID 910 terminated ioc 804b scsi 0 state c xfer 0
smartctl(8) comes back OK against da36, although I can't help but think it's related, as da36 is part of the same zpool.
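For anyone following along, the kind of check being referred to would be something like the following (assuming smartmontools is installed from ports; the device name da36 is taken from the log line above):

```shell
# Full SMART output for da36: health status, attributes, and the error log.
# A terminated WRITE command may show up in the device error log even when
# the overall health assessment still reads PASSED.
smartctl -a /dev/da36

# Optionally schedule an extended offline self-test in the background;
# check the result later with: smartctl -l selftest /dev/da36
smartctl -t long /dev/da36
```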

Any advice/help appreciated.



Sorry for replying to my own post, but I left it for another 24 hours and pretty much nothing has changed; the estimated time remaining in the scan stats has only increased by a few minutes.

  scan: resilver in progress since Mon Jul 11 10:53:06 2016
        40.0T scanned out of 40.1T at 162M/s, 0h12m to go
        3.99T resilvered, 99.71% done

I'm fairly convinced this isn't going to finish. Is there a way to stop a resilver?


You could try detaching the drive that is being resilvered. As far as I know, this is the only way. How is your RAM usage, and are you using any deduplication in the pool?
Probably not the best solution but have you tried rebooting the machine?

Thank you for responding.

The server is not under any particular load and there is plenty of free memory available (output below is from a freebsd-memory.sh script):

mem_used: 8852557824 ( 8442MB) [ 41%] Logically used memory
mem_avail: + 12622278656 ( 12037MB) [ 58%] Logically available memory
______________ ____________ __________ _______
mem_total: = 21474836480 ( 20480MB) [100%] Logically total memory

The pool is not using deduplication

NAME           PROPERTY  VALUE  SOURCE
s11d33R        dedup     off    default
s11d33R/home1  dedup     off    default
s11d33R/home2  dedup     off    default
s11d33R/home3  dedup     off    default
s11d33R/home4  dedup     off    default
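For reference, output of that shape comes from querying the dedup property recursively; something like:

```shell
# Show the dedup property for the pool and every child dataset.
# -r recurses through all descendants of s11d33R.
zfs get -r dedup s11d33R
```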

I'm tempted to leave it for another 24 hours; I just wonder if something caused the resilver to restart but the stats are not reflecting it.

Forgive my lack of knowledge, but to detach the disk, is it just a case of:

zpool detach s11d33R multipath/J11F22-1EJ39HKJ



I've left it for another 24 hours and there is no change, except the time to go has increased to 16m and the rate has decreased to 124M/s.

  scan: resilver in progress since Mon Jul 11 10:53:06 2016
        40.0T scanned out of 40.1T at 124M/s, 0h16m to go
        3.99T resilvered, 99.71% done

The pool is read-only, but running zpool iostat s11d33R 2 does show activity, which I would not expect:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
s11d33R     40.5T  14.0T     97     64  3.57M  3.16M
s11d33R     40.5T  14.0T    613      0  2.45M      0
s11d33R     40.5T  14.0T    565    143  2.25M  1.07M
s11d33R     40.5T  14.0T    597      0  2.38M      0
s11d33R     40.5T  14.0T    732      0  2.88M      0

So it does appear to be doing something, just very, very slowly!
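One thing that may be worth checking before giving up on it: on FreeBSD 10.x, resilver pacing is throttled by a handful of sysctls. This is a sketch only, assuming the pre-OpenZFS tunable names from that era; verify the exact names on the box with sysctl -a | grep resilver before changing anything:

```shell
# Inspect the current resilver pacing tunables (FreeBSD 10.x-era names)
sysctl vfs.zfs.resilver_delay vfs.zfs.resilver_min_time_ms vfs.zfs.scan_idle

# Make the resilver more aggressive: remove the inter-I/O delay and give
# the scan a longer time slice per txg. Revert once the resilver completes,
# as this will hurt foreground I/O latency.
sysctl vfs.zfs.resilver_delay=0
sysctl vfs.zfs.resilver_min_time_ms=5000
```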

If I did go the detach route to stop the resilver, is it then just a case of using replace to start the process again, i.e.

zpool detach s11d33R multipath/J11F22-1EJ39HKJ
zpool replace s11d33R 8878648567307541532 multipath/J11F22-1EJ39HKJ
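Spelling it out with verification steps in between, my plan would be something like the following (the GUID 8878648567307541532 is the placeholder ZFS shows for the old, removed disk):

```shell
# 1. Detach the new disk that is stuck mid-resilver from the replacing vdev
zpool detach s11d33R multipath/J11F22-1EJ39HKJ

# 2. Confirm the replacing-1 vdev is gone; the pool should still be DEGRADED
zpool status s11d33R

# 3. Re-issue the replace against the old disk's GUID to start a fresh resilver
zpool replace s11d33R 8878648567307541532 multipath/J11F22-1EJ39HKJ

# 4. Watch progress; the resilver should restart from 0%
zpool status s11d33R
```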

I am tempted to just restart the server, as I've read that resilvering is resilient and can cope with a restart. Would you agree with that?

Thanks for your help.