ZFS Pool doesn't mount anymore

Epikurean · Dec 4, 2011

Hello Fellas!

Sorry for sounding a little desperate, but I'm at a complete loss. Sometimes my ZFS Pool (containing 2 of those dreaded WDEARS) of 5 disks and 8 TB refuses to mount. It's always the same: the system freezes, upon reeboot, it stops at "mounting filesystems". What I usually did was booting into single user mode, deleting the folder the ZFS is trying to mount ( in my case: /tank) and voilÃ , the system would work again. But not this time.

What can I do? I am running 9.0 stable. There are no logs of the system freezes, the only thing I was able to see is a decrease of RAM (usually ZFS consumes all the RAM and leaves only 2kb of them). My Hardware: Core i3 and 4 GB of RAM.

Your help would be much appreciated.

Epikurean · Dec 5, 2011

Last night, I was able to complete a [cmd=]zpool Scrub[/cmd] The system didn't find any errors, but unfortunately, it didn't help with my problem either.

Following the guide at http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Resolving_ZFS_Mount_Point_Problems_That_Prevent_Successful_Booting, I was still unable to remount the drive.

What I saw when using top is, that the zfs status is tx->tx (whatever that means). After 3 minutes gstat reports 0 activity on the disks of the pool

Is there any way to save the pool? What can I do next?

gkontos · Dec 5, 2011

Does /tank contain any OS files ?

If not try to unmount it and then run a:

[CMD=""]# zdb tank[/CMD]
[CMD=""]# zpool history tank[/CMD]

Observe /var/log/messages during those operations.

Epikurean · Dec 5, 2011

Thank you very much for your help
There are no OS files on /tank
I entered
# zdb tank
and go a lot of output. It is currently doing this since a couple of hours:

Code:

Traversing all blocks to verify checksums and verify nothing leaked ...

and judging from the output of
# gstat it will be doing this for the better part of this millenia.

What should I look for?

The pool remains unmountable

gkontos · Dec 5, 2011

Epikurean said:
What should I look for?

The pool remains unmountable

As long as the pool remains umounted you will not face any freeze issues (hopefully). Wait for the command to finish and keep an eye on /var/log/messages

Have you checked your memory for errors ?

Epikurean · Dec 6, 2011

/var/log/messages/ is still quiet. My machine is still performing the command # zdb tank.
Would booting into a OpenSolaris Live System help?

Here is the output from # gstat

Code:

dT: 1.004s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1     14     14     56   21.1      0      0    0.0   27.8| ada0
    0     14     14     56   16.0      0      0    0.0   22.2| ada1
    1     17     17     68   15.9      0      0    0.0   24.5| ada2
    0     17     17     68   11.6      0      0    0.0   19.7| ada3
    0     12     12     48   15.7      0      0    0.0   18.8| ada4

gkontos · Dec 6, 2011

Epikurean said:
/var/log/messages/ is still quiet. My machine is still performing the command # zdb tank.
Would booting into a OpenSolaris Live System help?

I don't remember if any opensolaris distribution ever made it to ZFS v28. You can try mfsbsd instead.

Honestly though, I ran out of any ideas, sorry!

Do you have backups ?

Epikurean · Dec 6, 2011

Unfortunately, I don't have any backups. The Pool stores all kind of files for me and my flatmates (Personal files, Laptop Backups etc. etc.) They will be very angry, but as long as I don't run out of candy, I should be fine.

How long should zdb take to complete?

EDIT: I was able to destroy an old snapshot. All of this doesn't make any sense to me:
# zpool export
works.
# zpool import
freezes after a while, but the pool is imported
# zfs destroy
works.
But I am still unable to mount the pool. How is this possible?

gkontos · Dec 6, 2011

It takes a while to run depending on your data size. You really shouldn't be doing anything else to the pool until it finishes. You can always interrupt it though.

When did this problem started in the first place ?

Epikurean · Dec 8, 2011

The problem started a couple of months ago.

The last things I tried were:
# zdb tank
and
# zpool import -F tank
The last command is from a suggestion on http://docs.oracle.com

Updates: I checked the RAM with Memtest: it's ok.

Even if I just boot up and do nothing (with zfs_enable set to NO), the pool has a gstat activity of 20-30%. I installed a new OS on a separate harddrive: the pool is not mountable either and zdb crashes after it eats up all memory. In all this, there are no messages besides when RAM is running low. I suspect 2 things: either 4GB RAM is too low to "heal" my pool or there is something wrong with my motherboard. While trying to install an alternative OS, I had to deal with freezes in the BIOS etc.

I think I just have to kill the pool, since I can't afford more RAM. If the system crashes even with the new OS and no ZFS, then I'll know for sure that I'm dealing with a bad motherboard.

Thanks for your help, gkontos!

peetaur · Dec 8, 2011

Even if I just boot up and do nothing (with zfs_enable set to NO), the pool has a gstat activity of 20-30%

The whole pool, or a few disks? If it is just one or two, you should have replaced those disks before this happened.

Do you have logs or spares? Strange things happen to people's pools when they don't use labels, and have more than 1 vdev. Can you post a "zpool status"?

Do you have NFS exports? I found ways to hang a single dataset on a ZFS system when you NFS export the .zfs directory.

Do you have zvols? I found some bugs, such as renaming a snapshot on a zvol can hang all of zfs.

How long did you let it run after you ran the import command? Maybe the import is just slow. For some reason, this page says it can be very slow sometimes on a failure: http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSCachefiles

Maybe there is a way to import the pool without mounting filesystems, and then use zfs send to move the data to a new working pool. I can only guess... just an idea. I have no idea if it makes sense.

My advice is always use labels, so I will use that in my example.

Code:

glabel label newtank1 da97
glabel label newtank2 da98
glabel label newtank3 da99
glabel label newtank4 da99
zpool create newtank raidz1 label/newtank1 label/newtank2 label/newtank3 label/newtank4
zpool import -o readonly -o canmount=noauto -R /z tank
zfs snapshot -r tank@recoverysnap
zfs send -R tank@recoverysnap | zfs recv -F -v newtank

or if you suspect a specific dataset is broken, snapshot something that works and send that instead, and skip the bad dataset. (-F means overwrite and destroy what was there before, so don't select the root dataset)

Code:

zfs snapshot -r tank/good1@recoverysnap
zfs create newtank/good1
zfs send -R tank/good1@recoverysnap | zfs recv -F -v newtank/good1

And if things don't work, maybe try disabling other things too, like compression and dedup.

And I'm sure people told you already, but try to always keep backups. Don't put all your eggs in one basket... put them in two. Every system will fail eventually. Raid is no exception. If your controller melts, or the zfs software or firmware has a bug, then it is simple to lose a raid system.

funky · Dec 8, 2011

Epikurean said:
If the system crashes even with the new OS and no ZFS, then I'll know for sure that I'm dealing with a bad motherboard.

Look for leaking or swollen capacitors on the motherboard, that's a typical "disease" of motherboards especially when they run 24/7.

Epikurean · Dec 9, 2011

Today, I convinced one of my flatmates to buy more RAM.

@peetaur
I don't use NFS. I already disabled compression and autoexpand, but nothing happened.
Again: zpool import fails because the system runs out of RAM and simply crashes (the last messages I see on my screen are things like "PID XYZ killing process blabla". I never used labels. Perhaps this is the root of my problem. I don't use any zvols.

rusty · Dec 9, 2011

Epikurean said:
/var/log/messages/ is still quiet. My machine is still performing the command # zdb tank.
Would booting into a OpenSolaris Live System help?

You could try OpenIndiana 151a, that definitely has ZFS v28
http://openindiana.org/

Savagedlight · Dec 14, 2011

You should probably check SMART status of the drives, see if any of them are dying.
I've had issues with horribly slow pools (but no zfs errors), where SMART info revealed a drive was having severe problems with writing the data to disk. Replacing that drive solved the problems for me.