ZFS resilvering is incredibly slow

ZFS resilvering is incredibly slow - and it doesn't seem to pick up speed again. It keeps getting slower, and HDD activity tends towards zero, according to what gstat is telling me. On top of that, the ZFS mounts are no longer accessible: trying to ls the contents of a directory leaves the shell permanently stuck. Is ZFS perhaps waiting for a timeout of something that never happens? Currently ada2 is the drive acting up.
Code:
  pool: zStar
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Jun 23 23:42:19 2013
    232M scanned out of 5,37T at 3,04K/s, (scan is slow, no estimated time)
    24,7M resilvered, 0,00% done
config:

        NAME         STATE     READ WRITE CKSUM
        zStar        ONLINE       2     9     0
          raidz1-0   ONLINE       7     7     0
            ada7p1   ONLINE       0     0     0
            ada8p1   ONLINE       0     0     0
            ada9p1   ONLINE       0     0     0
            ada10p1  ONLINE       0     0     0
            ada1p1   ONLINE       0     0     0
            ada4p1   ONLINE       0     0     0
            ada2p1   ONLINE       3    93     3  (resilvering)
            ada3p1   ONLINE       0     0     0  (resilvering)
            ada5p1   ONLINE       5    14     0
        logs
          gpt/ZIL    ONLINE       0     0     0
        cache
          gpt/L2ARC  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x48>
        <metadata>:<0x53>
        <metadata>:<0x87>
        <metadata>:<0x89>
        zStar/Private:<0x0>
        zStar/vMachines:<0x87c>
I also get messages like this in /var/log/messages:
Code:
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 26
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 7b800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 27
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 73800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 29
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 53800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 25
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 51800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 30
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 11800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 28
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 01800000
Jun 24 06:59:38 Storage-01 kernel: siisch1: Timeout on slot 24
Jun 24 06:59:38 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 06:59:38 Storage-01 kernel: siisch1:  ... waiting for slots 00800000
Jun 24 06:59:48 Storage-01 kernel: siisch1: Timeout on slot 23
Jun 24 06:59:48 Storage-01 kernel: siisch1: siis_timeout is 00040000 ss 7f800000 rs 7f800000 es 00000000 sts 80170000 serr 00000000
Jun 24 07:00:18 Storage-01 kernel: siisch1: Timeout on slot 30
Jun 24 07:00:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 40000000 rs 40000000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:00:48 Storage-01 kernel: siisch1: Timeout on slot 30
Jun 24 07:00:48 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 40000000 rs 40000000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:00:48 Storage-01 kernel: (ada2:siisch1:0:0:0): lost device
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 30
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 3f800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 29
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 1f800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 28
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 0f800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 27
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 07800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 26
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 03800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 25
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 01800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 24
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
Jun 24 07:01:18 Storage-01 kernel: siisch1:  ... waiting for slots 00800000
Jun 24 07:01:18 Storage-01 kernel: siisch1: Timeout on slot 23
Jun 24 07:01:18 Storage-01 kernel: siisch1: siis_timeout is 00000000 ss 7f800000 rs 7f800000 es 00000000 sts 801f0000 serr 00000000
What can I do to prevent further damage? I can't offline ada2 - it says "no valid replicas".

Thanks.
 
Looks like your drives are dying/dead (both ada2 AND ada3?).

How long has this setup been running? Is your power supply adequate? If it isn't, the drives may be dropping out due to insufficient power.

If your PSU is fine, then it looks to me like you have experienced a double disk failure (hence the permanent errors and inaccessible mounts) on a RAID-Z1 - which, unfortunately, is only resilient to a single disk failure.

That means you are unlucky, and may need to resort to backups.

I'd back up everything you can (if anything) and rebuild the array. Either way, to me it looks like you have already permanently lost data from the array.
 
Well, nine disks in raidz1 with no spare - that is close to Russian roulette (if you don't have a backup). As @throAU mentioned, from the output it seems both disks have failed - ada2 and ada3 - one too many.
 
Looking at the timeout messages in the logs, I would guess that either something went wrong with connectivity, or the disk has more or less gone south. A power-cycle may recover the disk firmware from its stuck state - or it may kill the disk completely. Who knows...
 
To add to the above.

Yes, nine disks in a single RAID-Z VDEV with no spares = risky.

When you rebuild your pool (which I believe is inevitable at this point), maybe consider 2x RAID-Z with a spare? You'll get 1 disk failure tolerance per VDEV, have a spare disk and double your write IOPS performance.
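To put rough numbers on that risk, here is a toy model - purely illustrative, assuming each surviving disk fails independently during the resilver window with a made-up probability of 5%:

```python
def loss_after_one_failure(width, p):
    """Probability that a RAID-Z1 vdev of `width` disks, already degraded
    by one failure, loses a second disk (and thus data) before the
    resilver completes."""
    survivors = width - 1
    return 1 - (1 - p) ** survivors

p = 0.05  # assumed per-disk failure probability during the resilver window

single = loss_after_one_failure(9, p)   # one 9-wide RAID-Z1
split = loss_after_one_failure(4, p)    # one vdev of a 2x 4-wide RAID-Z1 layout

print(f"9-wide RAID-Z1:      {single:.1%} chance of a second failure")
print(f"4-wide RAID-Z1 vdev: {split:.1%} chance of a second failure")
```

The exact percentages depend entirely on the assumed failure rate, but the narrower vdev always exposes fewer disks during a resilver.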
 
First of all: thanks for all the helpful responses to everyone.

gkontos said:
Assuming that you have backups on a nine-disk RAID-Z1, use sysutils/smartmontools to determine if this is a drive(s) or controller issue.

In any case be prepared for down time.
Well, I'm monitoring the disks with the S.M.A.R.T. daemon - and ada2 had been complaining for a while.
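For anyone finding this later, a minimal smartd.conf sketch for that kind of monitoring (the FreeBSD port installs it as /usr/local/etc/smartd.conf; device names here match my setup):

```
# Check all SMART attributes on each pool member and mail root on trouble
/dev/ada2 -a -m root
/dev/ada3 -a -m root
# ...or monitor every detected device instead of listing them one by one:
# DEVICESCAN -a -m root
```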

BUT it looks like I've found the cause. I booted the machine without ada2 - everything seemed to come up fine, but when I inserted ada2 again (hot plug), two other disks dropped out and immediately reconnected. It turns out all three of those disks hang off the same power cable. This rules out my first guess of a defective SATA PCI controller - instead it tells me my power supply couldn't keep the DC output balanced. In other words: I'm pretty sure my power supply has failed. I'm going to replace it within the next few days.

In general, nine drives with a parity of one shouldn't be an issue IF you select nine different drives (different manufacturers as well as different models), because then they'll definitely have different life expectancies.


Additionally, I noticed that my ada2 (which S.M.A.R.T. claimed was damaged) doesn't complain at all when running with only seven instead of eight other drives. Funny that S.M.A.R.T. reports timeouts and bad sectors when there don't seem to be any... ;)


Thanks thus far.

Best regards
 
Leander said:
In general, nine drives with a parity of one shouldn't be an issue IF you select nine different drives (different manufacturers as well as different models), because then they'll definitely have different life expectancies.

I am using eleven drives in RAID-Z3 on all my latest builds. So far it has been much more cost-effective with regard to reliability and performance.
 
Leander said:
In general, nine drives with a parity of one shouldn't be an issue IF you select nine different drives (different manufacturers as well as different models), because then they'll definitely have different life expectancies.

Drives fail for various reasons (as you've discovered, a non-permanent offline due to power can be one of them - even if the drive doesn't DIE per se, you still lose data); manufacturer tolerance is only one factor.
 