ZFS mount problem

My little NAS (EPIA SN10000) has this hard-disk configuration:

500GB SATA with:
  • / (UFS)
  • /var (UFS)
  • /tmp (UFS)
  • /usr (ZFS, a pool in a separate partition)

2x 1TB SATA, RAID 1, with:
  • /hd_raid (ZFS)

This morning I found it had rebooted (maybe a hardware error.. it crashes about every 20 days and I don't know why. It's not due to power failures because there is a UPS) and it was stuck with some FS errors.
I rebooted into single-user mode and ran fsck. All the UFS partitions were repaired.

Then I tried to mount my /usr ZFS partition with:
# zfs mount usr
But it hangs. The keyboard isn't frozen, so I can still type in the console, but the command never finishes.
Ctrl+C doesn't work, so I can only reboot with Ctrl+Alt+Del.

I tried to mount the other ZFS pool with zfs mount hd_raid, and it worked! So the problem is only with the usr pool.

zpool status shows the pool as ONLINE, and a scrub finds no errors..

What can I do? Where could the problem be?

Thanks.
 
What happens if you boot to single-user mode and run the following:
Code:
# /etc/rc.d/hostid start
# zpool list
# zpool status
# zpool export <name-of-pool-for-usr>
# zpool import <name-of-pool-for-usr>
Can you get a listing of the ZFS filesystems after that?
# zfs list

Can you do a scrub after that?
# zpool scrub <name-of-pool-for-usr>

Finally, what happens if you run:
# /etc/rc.d/zfs start
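
If it hangs again, it may also be worth switching to another virtual console (Alt+F2) and looking at where the hung command is stuck in the kernel. A rough sketch, assuming the hung process is the zfs mount; the PID is just a placeholder:
Code:
# ps ax | grep zfs
# procstat -k <pid-of-hung-zfs>
procstat -k prints the kernel stack of each thread, which shows what the process is sleeping on -- useful information if this turns out to be a bug.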
 
(I forgot to say I'm using FreeBSD 8.1)

# /etc/rc.d/hostid start
Code:
Setting hostuuid: 00020003-0004-0005-0006-000700080009
Setting hostid: 0x81f4ec68
# zpool list
Code:
ZFS filesystem version 3
ZFS storage pool version 14
NAME      SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
hd_raid   931G   21.7G  909G    2%  ONLINE  -
usr       588G   2.66G  584G    0%  ONLINE  -
# zpool status
Code:
 pool: hd_raid
state: ONLINE
scrub: none requested
config:
NAME       STATE   READ WRITE CKSUM
hd_raid    ONLINE     0     0     0
  ar0      ONLINE     0     0     0
errors: No known data errors

 pool: usr
state: ONLINE
scrub: none requested
config:
NAME       STATE   READ WRITE CKSUM
usr        ONLINE     0     0     0
  ad4s1f   ONLINE     0     0     0
errors: No known data errors
# zpool export usr
# zpool import usr

--> the shell hangs! I can only reboot with Ctrl+Alt+Del


After reboot I tried:

# zpool scrub usr
# zpool status usr
Code:
 pool: usr
state: ONLINE
scrub: scrub completed after 0h3m with 0 errors on Tue Oct 26 20:18:04 2010
config:
NAME       STATE   READ WRITE CKSUM
usr        ONLINE     0     0     0
  ad4s1f   ONLINE     0     0     0
errors: No known data errors

# /etc/rc.d/zfs start
--> same problem, the shell hangs. The only thing I can do is press Ctrl+Alt+Del.. :(

Do you think there is a solution? Or could it be a bug that I'll have to wait to get fixed? :\
Thank you very much.
 
One more thing to try. When it's blocked like that, press CTRL+T to get some stats about the currently running process. If you press it multiple times, you'll get multiple lines of output. Can you do that 10-15 times and paste the info here?
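
For reference, each CTRL+T prints one status line for the foreground process, something like the line below (the numbers, PID and wait channel here are made up; yours will differ):
Code:
load: 0.15  cmd: zfs 1234 [tx->tx_sync_done_cv] 12.34r 0.00u 0.02s 0% 3456k
The name in square brackets is the wait channel, i.e. what the process is currently waiting on, and that's the interesting part.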
 
I see you use ataraid for the second pool (ar0). That is a bad idea with ZFS. It is better to have ZFS manage the mirror itself: that way it can recover from errors -- with your current setup, ZFS can only report that data is corrupt.
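
Before detaching anything, it may help to check which drive sits on which ATA channel and how the array looks. ar0 is taken from your zpool status above; what these commands print depends on your hardware:

# atacontrol list
# atacontrol status ar0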

You can migrate the mirror to ZFS online. My memory of atacontrol is a bit rusty, but..

# atacontrol detach ataX

should detach one of the ATA channels, along with one of the drives. Then

# atacontrol reinit ataX

should re-discover the drive again.. hopefully outside of the ar0 array. If it shows up as another array, ar1, then:

# atacontrol delete ar1

You now have a 'spare' disk for ZFS. Note that ZFS won't even notice that the underlying mirror has failed -- which is exactly the problem, and why we do this transformation. Let's assume the drive you disconnected is adA. Just mirror it with your (now degraded) ar0:

# zpool attach hd_raid ar0 adA

This will take some time to resilver; monitor it with

# zpool status

until the resilver has completed.

After the resilver completes, you need to get rid of the pseudo-RAID completely: make ZFS forget about that device, then destroy ar0, and reconstruct the mirror.

# zpool detach hd_raid ar0
# atacontrol delete ar0
# zpool attach hd_raid adA adB

and again monitor the progress with

# zpool status

At the end, or at any time you feel it necessary, you may run

# zpool scrub hd_raid

to make sure all the data is OK.
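
To put it together with purely hypothetical device names -- say the two 1TB drives are ad6 on ata3 and ad8 on ata4, and you free ad8 first -- the whole sequence would look roughly like:

# atacontrol detach ata4
# atacontrol reinit ata4
# atacontrol delete ar1        (only if the freed drive comes back as a new array)
# zpool attach hd_raid ar0 ad8
  ... wait for the resilver to finish (zpool status) ...
# zpool detach hd_raid ar0
# atacontrol delete ar0
# zpool attach hd_raid ad8 ad6
  ... wait for the second resilver ...
# zpool scrub hd_raid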
 
phoenix said:
One more thing to try. When it's blocked like that, press CTRL+T to get some stats about the currently running process. If you press it multiple times, you'll get multiple lines of output. Can you do that 10-15 times and paste the info here?

I can't believe it.. :\ :\
After I pressed CTRL+T the system started to work!! Very, very strange..

Now everything is OK, even after a couple of reboots.

Thanks phoenix for your help! (..I wish I knew what happened.. :p)

danbi, thanks for your suggestion, I'll try it!
 
Seems I have the same problem, with no resolution. Ctrl+T shows the process just waiting: no I/O, and the same symptoms as described in this thread.
 
Solved. The solution was to remove the drive from my laptop and plug it into my testbed via a USB-to-IDE cable, then run zpool import followed by zpool scrub, zpool scrub -s, and zpool export, and finally put the drive back into my laptop.
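
For anyone hitting the same thing, the sequence on the second machine was roughly this (the pool name is just a placeholder):

# zpool import                 (lists the pools available for import)
# zpool import <poolname>
# zpool scrub <poolname>
# zpool scrub -s <poolname>    (stops the scrub)
# zpool export <poolname>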
 