ZFS Resilver taking very long time

edispah · Jul 17, 2017

Hi

I am using FreeBSD 10.2-RELEASE-p9.

I replaced a failing disk in a zpool some time ago - output below

Code:

root@freebsd03:~ # zpool status -v
  pool: s12d33
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun 15 14:20:04 2017
        49.3T scanned out of 49.8T at 18.6M/s, 6h20m to go
        4.92T resilvered, 99.18% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s12d33                           DEGRADED     0     0     0
          raidz2-0                       DEGRADED     0     0     0
            replacing-0                  OFFLINE      0     0     0
              1160678623310310837        OFFLINE      0     0     0  was /dev/multipath/J12F12-1EJDBAEJ
              multipath/J12F12-VK0Z530Y  ONLINE       0     0     0  (resilvering)
            multipath/J12F13-1EJDGHWJ    ONLINE       0     0     0
            multipath/J12F14-1EJAWSMJ    ONLINE       0     0     0
            multipath/J12F15-1EJDGL9J    ONLINE       0     0     0
            multipath/J12F16-1EJAUE5J    ONLINE       0     0     0
          raidz2-1                       ONLINE       0     0     0
            multipath/J12F17-1EJD9K1J    ONLINE       0     0     0
            multipath/J12F18-1EJAUZ4J    ONLINE       0     0     0
            multipath/J12F19-1EJ9PP2J    ONLINE       0     0     0
            multipath/J12F20-1EJ7X50J    ONLINE       0     0     0
            multipath/J12F21-1EJAUNKJ    ONLINE       0     0     0

errors: No known data errors

It just doesn't seem to be coming to an end and as you can see has been running over 1 month!! The pool has fairly constant heavy I/O and the files involved are millions of small ones.

I believe it reached the 99% mark within 3-4 days but has been stuck like this ever since.

I am thinking of restarting the server - is this safe to do whilst it's mid-resilver? Will it start again or pick up where it left off?

Is there any way I can tell if it's doing anything?

Thanks

Paul

SirDice · Jul 17, 2017

edispah said:
It just doesn't seem to be coming to an end and as you can see has been running over 1 month!!

It does take a while but a month seems to be excessive. It takes about 2 days for a 3 TB disk to resilver on my 4 x 3 TB RAID-Z at home.

edispah said:
I believe it reached the 99% mark within 3-4 days but has been stuck like this ever since.

I am thinking of restarting the server - is this safe to do whilst it's mid-resilver? Will it start again or pick up?

Yes, definitely try to reboot it. It might be stuck and the reboot may just give it that push it needs. It's safe to do, provided you do a "clean" shutdown or reboot. And by clean I mean properly shut it down, not hitting the power or reset button. Normally the resilver process will pick up where it left off.
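On the question of whether the resilver is still doing anything: one rough check is to sample the progress line of zpool status twice, some minutes apart, and see whether the resilvered figure changes. The following is only a minimal sketch; the helper name resilver_progress is made up for illustration, and the pool name is the one from the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the "<amount> resilvered, <pct>% done" line
# from `zpool status` text supplied on stdin.
resilver_progress() {
    # Reassigning $1 makes awk rebuild the line, which drops the indentation.
    awk '/resilvered,/ { $1 = $1; print; exit }'
}

# Usage sketch: take two samples ten minutes apart and compare them.
#   zpool status s12d33 | resilver_progress
#   sleep 600
#   zpool status s12d33 | resilver_progress
```

If both samples are identical after a long interval, zpool iostat -v s12d33 5 should show whether the replacement disk is receiving any writes at all.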

edispah · Aug 8, 2017

Hi

Thanks for the info.

I did a restart on Jul 26th and kept monitoring it; the resilver didn't move beyond 99.18%. I made sure no other I/O was occurring and ran iostat, and it did appear to be doing something, just very, very slowly.

I left it until Aug 3rd, when another disk failed. I thought that might have caused another issue with the resilver, so I restarted the system again. 5 days on, the 99.18% still has not moved.

Code:

root@freebsd03:/var/log # zpool status
  pool: s12d33
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun 15 14:20:04 2017
        49.3T scanned out of 49.8T at 1/s, (scan is slow, no estimated time)
        4.92T resilvered, 99.18% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s12d33                           DEGRADED     0     0     0
          raidz2-0                       DEGRADED     0     0     0
            replacing-0                  DEGRADED     0     0     0
              1160678623310310837        OFFLINE      0     0     0  was /dev/multipath/J12F12-1EJDBAEJ
              multipath/J12F12-VK0Z530Y  ONLINE       0     0     0
            multipath/J12F13-1EJDGHWJ    ONLINE       0     0     0
            multipath/J12F14-1EJAWSMJ    ONLINE       0     0     0
            multipath/J12F15-1EJDGL9J    ONLINE       0     0     0
            multipath/J12F16-1EJAUE5J    ONLINE       0     0     0
          raidz2-1                       DEGRADED     0     0     0
            multipath/J12F17-1EJD9K1J    ONLINE       0     0     0
            multipath/J12F18-1EJAUZ4J    ONLINE       0     0     0
            10036117701506245707         OFFLINE      0     0     0  was /dev/multipath/J12F19-1EJ9PP2J
            multipath/J12F20-1EJ7X50J    ONLINE       0     0     0
            multipath/J12F21-1EJAUNKJ    ONLINE       0     0     0

errors: No known data errors

I've noticed that the replacement disk (multipath/J12F12-VK0Z530Y) no longer has (resilvering) next to it - does that mean it's not doing anything?

Does anyone have any suggestions on how I can get myself out of this mode?

I was thinking of maybe offlining the disk and then onlining it again:

Code:

zpool offline s12d33 multipath/J12F12-VK0Z530Y
zpool online s12d33 multipath/J12F12-VK0Z530Y

I have a replacement disk attached to the system to replace the other failed disk - I presume I should resilver one disk at a time?

Thanks

Paul

edispah · Aug 14, 2017

Hi

Apologies for replying to my own message, but I now have a second server where a similar thing has happened. I replaced a disk and the resilver was going well, got to 99.03%, and now seems to have stopped. This time, though, I did get an error in /var/log/messages, which I believe occurred around the time the resilver "appears" to have stopped.

Code:

Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): READ(16). CDB: 88 00 00 00 00 01 a9 e5 6f b0 00 00 00 b0 00 00
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): CAM status: SCSI Status Error
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): SCSI status: Check Condition
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0):
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Field Replaceable Unit: 0
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Command Specific Info: 0
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Actual Retry Count: 153
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Descriptor 0x80: f7 2d
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Descriptor 0x81: 02 75 5d 05 07 91
Aug 10 06:24:24 freebsd01 kernel: (da8:mps0:0:20:0): Error 5, Unretryable error
Aug 10 06:24:24 freebsd01 kernel: GEOM_MULTIPATH: Error 5, da8 in J12F07-1EJ37DWJ marked FAIL
Aug 10 06:24:24 freebsd01 kernel: GEOM_MULTIPATH: da14 is now active path in J12F07-1EJ37DWJ
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 a9 e5 6f b0 00 00 00 b0 00 00
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): CAM status: SCSI Status Error
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): SCSI status: Check Condition
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0):
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): Field Replaceable Unit: 0
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): Command Specific Info: 0
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): Actual Retry Count: 153
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): Descriptor 0x80: f7 2d
Aug 10 06:24:27 freebsd01 kernel: (da14:mps0:0:27:0): Descriptor 0x81: 02 75 5d 05 0(da14:mps0:0:27:0): Error 5, Unretryable error
Aug 10 06:24:27 freebsd01 kernel: GEOM_MULTIPATH: Error 5, da14 in J12F07-1EJ37DWJ marked FAIL
Aug 10 06:24:27 freebsd01 kernel: GEOM_MULTIPATH: all paths in J12F07-1EJ37DWJ were marked FAIL, restore da8
Aug 10 06:24:27 freebsd01 kernel: GEOM_MULTIPATH: da8 is now active path in J12F07-1EJ37DWJ

So it had a read error on a multipath disk, failed that path, then got the same read error on the other path. Should this be enough to interrupt a resilver? The pool is raidz2, and the disk with the read error is now also showing "(resilvering)":

Code:

root@freebsd01:~ # zpool status
  pool: s12d32
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Aug  4 09:18:00 2017
        49.3T scanned out of 49.8T at 58.1M/s, 2h25m to go
        4.91T resilvered, 99.03% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s12d32                           DEGRADED     0     0     0
          raidz2-0                       ONLINE       0     0     0
            multipath/J12F02-1EJ1T6JH    ONLINE       0     0     0
            multipath/J12F03-1EJ39XJJ    ONLINE       0     0     0
            multipath/J12F04-1EJ38JXJ    ONLINE       0     0     0
            multipath/J12F05-1EJ2YH9F    ONLINE       0     0     0
            multipath/J12F06-1EJ2Y0TF    ONLINE       0     0     0
          raidz2-1                       DEGRADED     0     0     0
            multipath/J12F07-1EJ37DWJ    ONLINE       0     0     0  (resilvering)
            multipath/J12F08-1EJ37E8J    ONLINE       0     0     0
            replacing-2                  OFFLINE      0     0     0
              6780459540582333781        OFFLINE      0     0     0  was /dev/multipath/J12F09-1EJ17EBJ
              multipath/J12F09-1EJD98EJ  ONLINE       0     0     0  (resilvering)
            multipath/J12F10-1EJ2Y5EF    ONLINE       0     0     0
            multipath/J12F11-1EHZEP7F    ONLINE       0     0     0

errors: No known data errors

Any top tips or advice? I really need to get these pools back to a healthy state. Would a reboot help? Are there any known issues with multipath and resilvering?

Thanks

Paul

ralphbsz · Aug 14, 2017

Unfortunately, a reboot has a reasonable probability of helping. It seems that you have found a bug in ZFS (not the first one, won't be the last one, even though ZFS is a very solid file system). The bug seems to be in the interaction of resilvering and handling IO errors (not a surprise, error handling is hard). The nature of the bug seems to be that some data structure gets into a state that prevents resilvering from making progress. Why reboot? Because we hope that the faulty data structure is in memory (not on disk), and a restart of ZFS will recreate it in the correct state.

If a reboot doesn't help, then the logical next step probably has to be filing a bug report, to alert the ZFS developers.
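Since the stall seems to coincide with I/O errors, it may also help to see which disks logged medium errors and how often. A rough sketch, assuming the kernel log format shown in the excerpt above; the function name medium_errors_by_device is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "MEDIUM ERROR" log lines per da device.
# Reads kernel log text on stdin and prints "<device> <count>" per line.
medium_errors_by_device() {
    awk '/MEDIUM ERROR/ {
             # The device name is the start of the first "(daN:" group.
             if (match($0, /\(da[0-9]+:/)) {
                 dev = substr($0, RSTART + 1, RLENGTH - 2)
                 count[dev]++
             }
         }
         END { for (d in count) print d, count[d] }' | sort
}

# Usage sketch:
#   medium_errors_by_device < /var/log/messages
```

With multipath, the same physical disk appears under two da names (da8 and da14 in the excerpt), so gmultipath status is needed to map the counts back to a single disk.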

edispah · Oct 13, 2017

Sorry for replying to my own old post but I am still struggling with resilver issues and just wanted to add some information and request some advice that might help me track down the issue.

One of the servers sat at 99% for a good month but then all of a sudden completed and the pool is back Online and healthy.

I have another server where the same thing happened: it reached 99% within a week, sat there for about a month, and then completed.

Is there some extra process that occurs at 99% that could take some time depending on the system?

I was reading about setting up ZFS pools and how you should give them the whole disk (which is what I do: no partitions, just the whole disk). However, I recently read something else which indicated that you should create fixed-size partitions and use them instead, just in case, for example, a 6TB disk and a 6TB disk are not identical sizes, in which case a resilver will fail. Could this be what's happening here? (Although I would expect it not to complete at all if that was the case.)

My feeling is it's something configuration-wise I have done rather than a bug, as I now have 4 different systems exhibiting the same behaviour.

In summary, my setup is:

LSI card (non-RAID) direct-attached to SAS disks (either HGST 6TB or 8TB). I force a 4K block size when creating pools via vfs.zfs.min_auto_ashift=12 in sysctl.conf.

The LSI card is connected to both a primary and secondary expander on the SAS back plane so I used gmultipath to label the disks and then used the label to create the pool e.g.

Code:

root@freebsd04:~ # gmultipath status
                     Name    Status  Components
multipath/J11R00-1EJ2XR5F   OPTIMAL  da0 (ACTIVE)
                                     da11 (PASSIVE)
multipath/J11R01-1EJ2XT4F   OPTIMAL  da1 (ACTIVE)
                                     da12 (PASSIVE)
multipath/J11R02-1EHZE2GF   OPTIMAL  da2 (ACTIVE)
                                     da13 (PASSIVE)

zpool create -f store43 raidz2 multipath/J11R00-1EJ2XR5F multipath/J11R01-1EJ2XT4F etc.......

As mentioned I do not partition the disk and just give the whole thing to zpool create.

When a disk fails I offline the disk like this

zpool offline s11d33R multipath/J11F18-1EJDGJWJ

then replace using this

zpool replace s11d33R 8878648567307541532 multipath/J11F22-1EJ39HKJ

Can anyone see anything in what I've done that flags up a warning?

Thanks

Paul

ralphbsz · Oct 13, 2017

edispah said:
I was reading about setting up ZFS pools and how you should give them the whole disk (which is what I do, no partitions just the whole disk), ...

I don't agree with that advice. Giving ZFS a partition is safer (because then the disk becomes self-describing, which makes it less likely that other systems scribble on the ZFS area), and makes future administration easier. It has virtually no performance or capacity drawback. It gives future flexibility. It requires one more sys admin step, but a competent sys admin should know how to handle partitions anyhow.
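As a rough illustration of the partition approach, a disk could be given a GPT with a single fixed-size, labelled freebsd-zfs partition before it goes into the pool. This is a sketch, not a tested procedure for multipath setups: the helper name partition_for_zfs, the label J11R00 and the 5580G size (a little under the capacity of a nominal 6TB disk) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: put a GPT on a provider and add one fixed-size,
# labelled freebsd-zfs partition aligned to 1 MiB.
# Arguments: provider (e.g. multipath/NAME), GPT label, partition size.
partition_for_zfs() {
    disk=$1
    label=$2
    size=$3
    gpart create -s gpt "$disk" &&
    gpart add -t freebsd-zfs -a 1m -s "$size" -l "$label" "$disk"
}

# Usage sketch (assumed label and size); the pool is then built from
# the gpt/<label> providers instead of the whole disks:
#   partition_for_zfs multipath/J11R00-1EJ2XR5F J11R00 5580G
```

Because every partition has the same explicit size, a replacement disk that is a few sectors smaller than the original can still hold it.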

edispah said:
for example a 6TB disk and a 6TB disk are not identical sizes, in which case a resilver will fail? Could this be what's happening here? (although I would expect it not to complete at all if that was the case).

No, that should not be the cause. If ZFS has a mirror consisting of a 5.99TB disk and 6.01TB disk, it should only place 5.99TB worth of data on it. If one then resilvers with a disk that is too small, it should either complete or fail, not hang forever at 99%. You have found a bug.

edispah said:
My feeling is it's something configuration-wise I have done rather than a bug, as I now have 4 different systems exhibiting the same behaviour.

And that exact bug (resilver hanging at 99% for a long time, perhaps "forever") has been reported by several others on this forum alone.

But I agree with you: Most likely part of the problem is that your setup or configuration is slightly unusual, and that triggers this particular bug.

edispah · Nov 2, 2017

Hi

Thanks for your advice, I have logged a bug report.

One of my affected servers does not seem to be resilvering but is still showing a resilver in progress. It's been in this state since June, and I have had another disk failure which I have not yet replaced because the first replacement is still in this unusual state.

You can see the disk I replaced no longer has "(resilvering)" next to it.

I was going to try detaching the OFFLINE disk in the hope that it might solve it.

Would anyone be able to offer any advice on whether running zpool detach s12d33 1160678623310310837 could do any harm? I don't want to make matters worse!

I did a zpool replace when I changed the disk, which I believe should do a detach at the end. It may be worth noting that the disk showing as OFFLINE is no longer attached to the system.

Code:

  pool: s12d33
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun 15 14:20:04 2017
        49.3T scanned out of 49.8T at 1/s, (scan is slow, no estimated time)
        4.92T resilvered, 99.18% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s12d33                           DEGRADED     0     0     0
          raidz2-0                       DEGRADED     0     0     0
            replacing-0                  DEGRADED     0     0     0
              1160678623310310837        OFFLINE      0     0     0  was /dev/multipath/J12F12-1EJDBAEJ
              multipath/J12F12-VK0Z530Y  ONLINE       0     0     0
            multipath/J12F13-1EJDGHWJ    ONLINE       0     0     0
            multipath/J12F14-1EJAWSMJ    ONLINE       0     0     0
            multipath/J12F15-1EJDGL9J    ONLINE       0     0     0
            multipath/J12F16-1EJAUE5J    ONLINE       0     0     0
          raidz2-1                       DEGRADED     0     0     0
            multipath/J12F17-1EJD9K1J    ONLINE       0     0     0
            multipath/J12F18-1EJAUZ4J    ONLINE       0     0     0
            10036117701506245707         OFFLINE      0     0     0  was /dev/multipath/J12F19-1EJ9PP2J
            multipath/J12F20-1EJ7X50J    ONLINE       0     0     0
            multipath/J12F21-1EJAUNKJ    ONLINE       0     0     0

Thanks

Paul

edispah · Nov 6, 2017

Hi

Finally have a handle on this!!

The system takes a snapshot every 10 minutes and uses zfs send to copy it to another server.

I disabled the snapshot/zfs send for a period on two of my affected servers and also set this:

Code:

sysctl vfs.zfs.resilver_delay=0
sysctl vfs.zfs.resilver_min_time_ms=5000

Within about 6-7 hours the resilvers completed.

Is this expected? I don't see anything saying to stop snapshots whilst resilvering. My pools are quite full (around 90%), so maybe that's a factor?

Anyway, I am just very pleased to have some healthy pools again, and in future I will disable the snapshot/zfs send whilst resilvering until it completes.

Thanks

Paul
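The workaround above could be automated by having the snapshot/zfs send job skip its run while a resilver is in progress. A minimal sketch, assuming the job is a shell script; the function name resilver_in_progress is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the `zpool status`
# text on stdin reports a running resilver.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Usage sketch at the top of the 10-minute snapshot/send job:
#   if zpool status s12d33 | resilver_in_progress; then
#       echo "resilver running on s12d33, skipping snapshot/send" >&2
#       exit 0
#   fi
```

Before changing the two sysctls, their current values can be read with sysctl vfs.zfs.resilver_delay vfs.zfs.resilver_min_time_ms so that they can be restored once the resilver has finished.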
