ZFS: resilvering loop after zpool attach

I have a dual-drive USB toaster with two identical HDDs in it. I created a pool (and some filesystems in it) on one drive, while the other drive sat unused for some time. Recently I decided to turn this setup into a mirrored configuration, so I ran:
# zpool attach pool0 da0 da1
where da1 was the previously unused drive.

The command didn't report any problems, but the resilvering of da1 never finishes: it runs for a while, stops, and then restarts. The pool state is ONLINE, but the scrub line reports a resilver in progress that never gets past a few percent of completion.

I have found some references to this kind of endless resilvering loop, but they were all related to replacing a failed drive. In my case there are no failed drives; I'm just trying to convert a single-drive configuration into a mirror.

I was able to zpool detach da1 and then attach it again, but with the same result.
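
For reference, the exact sequence I used was (it also shows up in the pool history further down):

# zpool detach pool0 da1
# zpool attach pool0 da0 da1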

I'm using 8.2-PRERELEASE from 2-3 weeks ago (cvsupped the source and built world).

Is this a known problem? Am I doing something wrong?
 
You may have a dying drive. Check dmesg(8) output, /var/log/messages, console output, etc.
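
For example, something along these lines should surface any USB/CAM errors that are being logged (the grep patterns here are just a guess at what a flaky drive or bridge would leave behind):

# dmesg | egrep -i 'umass|da[01]|error|timeout'
# egrep -i 'umass|da[01]|error|timeout' /var/log/messages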

You can also check the drive like so:

Install the sysutils/smartmontools port and use that to get a baseline reading for the different error counts.
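
For example (whether a USB enclosure's bridge passes SMART through at all is hit-and-miss; the -d sat form is worth trying if the plain one complains about an unknown device type):

# smartctl -a /dev/da1
# smartctl -d sat -a /dev/da1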

Then use dd(1) to write zeros to every byte of the drive:
# dd if=/dev/zero of=/dev/da1 bs=16M

Then read back everything:
# dd if=/dev/da1 of=/dev/null bs=16M

Then do the same with random data:
# dd if=/dev/random of=/dev/da1 bs=16M
# dd if=/dev/da1 of=/dev/null bs=16M

And re-check all the SMART values in between the tests to see if any are incrementing.
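
Something like this makes it easy to compare just the interesting counters between passes (attribute names differ a bit between vendors, so adjust the pattern as needed):

# smartctl -A /dev/da1 | egrep -i 'realloc|pending|uncorrect|crc'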

This may also be a USB disconnect/reconnect issue, if both drives are connected via a single USB channel/cord.
 
They are both on the same USB port/cable. As a matter of fact, they are both in the same enclosure (a dual-drive one).

Is there a way to verify if the reconnect is causing the problem?
 
Nope. Nothing there... :(

Code:
# zpool status pool0
  pool: pool0
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h6m, 0.33% done, 32h29m to go
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0  3.77G resilvered

errors: No known data errors


# tail /var/log/messages
Feb 20 17:37:21 atheneum dhclient: New Routers (fxp0): 192.168.1.1
Feb 20 17:37:23 atheneum kernel: fxp0: link state changed to UP
Feb 20 17:49:38 atheneum su: xxxxxxxx to root on /dev/pts/0
Feb 20 17:49:46 atheneum kernel: ZFS filesystem version 4
Feb 20 17:49:46 atheneum kernel: ZFS storage pool version 15
Feb 20 23:49:29 atheneum su: xxxxxxxx to root on /dev/pts/2
Feb 24 21:32:03 atheneum su: xxxxxxxx to root on /dev/pts/1
Feb 25 00:14:53 atheneum su: xxxxxxxx to root on /dev/pts/2
Feb 26 22:22:06 atheneum sshd[66096]: error: PAM: authentication error for xxxxxxxx from 192.168.1.101
Feb 26 22:22:14 atheneum su: xxxxxxxx to root on /dev/pts/0


# zpool history -l
History for 'pool0':
2011-02-01.21:04:00 zpool create pool0 /dev/da0 [user root on atheneum:global]
2011-02-02.13:18:14 zfs create pool0/vob [user root on atheneum:global]
2011-02-03.22:52:28 zfs create pool0/m [user root on atheneum:global]
2011-02-06.14:19:46 zfs create pool0/v [user root on atheneum:global]
2011-02-26.22:25:29 zpool attach pool0 da0 da1 [user root on atheneum:global]
2011-02-27.20:51:29 zpool scrub -s pool0 [user root on atheneum:global]
2011-02-27.21:07:08 zpool scrub -s pool0 [user root on atheneum:global]
2011-02-27.21:07:14 zpool scrub -s pool0 [user root on atheneum:global]
2011-02-27.21:15:37 zpool detach pool0 da1 [user root on atheneum:global]
2011-02-27.21:20:32 zpool attach pool0 da0 da1 [user root on atheneum:global]

# zpool history -i | tail
2011-03-01.18:50:45 [internal pool scrub txg:307530] func=1 mintxg=3 maxtxg=302581
2011-03-01.18:51:23 [internal pool scrub done txg:307533] complete=0
2011-03-01.18:51:23 [internal pool scrub txg:307533] func=1 mintxg=3 maxtxg=302581
2011-03-01.19:19:32 [internal pool scrub done txg:307584] complete=0
2011-03-01.19:19:32 [internal pool scrub txg:307584] func=1 mintxg=3 maxtxg=302581
2011-03-01.19:25:50 [internal pool scrub done txg:307594] complete=0
2011-03-01.19:25:50 [internal pool scrub txg:307594] func=1 mintxg=3 maxtxg=302581
2011-03-01.19:30:04 [internal pool scrub done txg:307602] complete=0
2011-03-01.19:30:04 [internal pool scrub txg:307602] func=1 mintxg=3 maxtxg=302581


:(
 
I replaced the 2-bay Thermaltake BlacX Duet with a big, fat 4-bay Mediasonic HF2-SU3S2, and so far things seem to work OK. I started from scratch, though: I backed up the data and recreated the pools. I hope it keeps working, because I have developed a taste for ZFS. :P
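
The migration itself was nothing fancy; roughly something like this (the scratch pool name is made up here for illustration, and I'm quoting from memory), then I recreated pool0 and sent everything back the same way:

# zfs snapshot -r pool0@migrate
# zfs send -R pool0@migrate | zfs receive scratch/pool0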

Thanks all for the help.
 
Saguaro said:
I replaced ...

Ah, I have also put GPT partition tables on da0 and da1 and created a freebsd-zfs partition on each. The pool now lives on da0p1/da1p1 instead of the raw da0/da1. Not sure if that matters though.
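
For completeness, the partitioning was basically this, done on each drive before recreating the pool (from memory, and without GPT labels):

# gpart create -s gpt da0
# gpart add -t freebsd-zfs da0
# gpart create -s gpt da1
# gpart add -t freebsd-zfs da1
# zpool create pool0 mirror da0p1 da1p1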
 