Broken folder on ZFS

I have a problem with a raidz pool. There is a folder that cannot be deleted; any attempt to do so causes a kernel panic and the system then hangs. My system is FreeBSD 9.0-RELEASE amd64.

Code:
%zpool status -v
  pool: tank
 state: ONLINE
 scan: scrub repaired 0 in 1h26m with 0 errors on Fri Jul  6 21:39:22 2012
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            label/rdz2  ONLINE       0     0     0
            label/rdz3  ONLINE       0     0     0
            label/rdz1  ONLINE       0     0     0

errors: No known data errors

SMART shows the disks to be in good health too.
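(By SMART being healthy I mean a check roughly like the following, using sysutils/smartmontools; the ada0 device name is only an example, run it against each disk:)

Code:
smartctl -H /dev/ada0   # overall health self-assessment
smartctl -A /dev/ada0   # attribute table; watch Reallocated_Sector_Ct and Current_Pending_Sector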

Code:
%cd /usr/local/lib/ruby/1.8/irb/ext/
%ls -al
total 12911980442550336
drwxr-xr-x  2 root  wheel    11 Jan 22 19:54 .
drwxr-xr-x  4 root  wheel    21 Jul  5 21:08 ..
-rw-r--r--  1 root  wheel  1170 Feb 12  2007 change-ws.rb
-rw-r--r--  1 root  wheel  2169 Feb 12  2007 history.rb
-rw-r--r--  1 root  wheel  2307 Feb 12  2007 loader.rb
-rw-r--r--  1 root  wheel   625 Feb 12  2007 math-mode.rb
-rw-r--r--  1 root  wheel  4864 Nov 17  2009 multi-irb.rb
-rw-r--r--  1 root  wheel  2168 Aug  9  2009 save-history.rb
-rw-r--r--  1 root  wheel  1190 Feb 12  2007 tracer.rb
-rw-r--r--  1 root  wheel  1363 Feb 12  2007 use-loader.rb
-rw-r--r--  1 root  wheel   978 Feb 12  2007 workspaces.rb

Notice the value of total!

Code:
zdb -c tank

Traversing all blocks to verify metadata checksums and verify nothing leaked ...
leaked space: vdev 0, offset 0x10800, size 6144
leaked space: vdev 0, offset 0x51800, size 6144
leaked space: vdev 0, offset 0xf00599cc00, size 6144
block traversal size 969073810432 != alloc 969073828864 (leaked 18432)

        bp count:         5601946
        bp logical:    646810763264      avg: 115461
        bp physical:   645396452864      avg: 115209     compression:   1.00
        bp allocated:  969073810432      avg: 172988     compression:   0.67
        bp deduped:             0    ref>1:      0   deduplication:   1.00
        SPA allocated: 969073828864     used: 32.42%

That can't be good! If I try rm -r /usr/local/lib/ruby/1.8/irb/ext/ on the directory while at the console, I can capture the following with a camera:

Code:
panic: page fault 
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff808680fe at kdb_backtrace+0x5e
#1 0xffffffff80832cb7 at panic+0x187
#2 0xffffffff80b18400 at trap_fatal+0x290
#3 0xffffffff80b18749 at trap_pfault+0x1f9
#4 0xffffffff80b18c0f at trap+0x3df
#5 0xffffffff80b0313f at calltrap+0x8
#6 0xffffffff81430cf7 at bp_get_dsize+0x57
#7 0xffffffff814027ba at dmu_tx_hold_free+0x74a
#8 0xffffffff813f5ca6 at dmu_free_long_range_impl+0x106
#9 0xffffffff813f5f1c at dmu_free_long_range+0x4c
#10 0xffffffff81462549 at zfs_rmnode+0x69
#11 0xffffffff814790a6 at zfs_inactive+0x66
#12 0xffffffff8147926a at zfs_freebsd_inactive+0x1a
#13 0xffffffff808c2f81 at vinactive+0x71
#14 0xffffffff808c74a8 at vputx+0x2d8
#15 0xffffffff808cb3af at kern_unlinkat+0x1df
#16 0xffffffff80b17cf0 at amd64_syscall+0x450
#17 0xffffffff80b03427 at Xfast_syscall+0xf7
Uptime: 1m59s 
acpi0: reset failed - timeout 
Automatic reboot in 15 seconds - press a key on the console to abort

Otherwise my system works fine, with the exception that I cannot ever remove or upgrade Ruby. I don't know when this first occurred, as it had been a while since I updated ports. Any ideas?
 
An interrupted rsync on PC-BSD 9 corrupted my ZFS filesystem in the same manner.
It's a ZFS bug; I hope they'll fix it soon. This bug shattered my confidence in the almighty ZFS :-(
 
Is there any possibility that my pool can be repaired? Or is it a case of recreating the pool from one of my many non-existent backups?

Thanks for the reply.
 
Thanks for the replies. I guess now is a good time to upgrade to a new, bigger array. I can't really destroy that one filesystem, as it is too large to back up (well, I guess I could spend weeks burning DVDs). This might be a situation that advocates having many smaller filesystems in your pool. Will do that in the future.
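(Roughly what I mean, purely as an illustration; the dataset names are made up:)

Code:
zfs create tank/media
zfs create tank/home
zfs create tank/ports
# a single damaged dataset could then be destroyed on its own:
zfs destroy -r tank/ports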

Would it be entirely insane to degrade my current pool and create a new degraded pool with two new disks to copy the data over, before removing the remaining two original disks and resilvering the new pool with a third new disk?

Probably definitely, but I only have 4 SATA ports, lol.

I am assuming that replacing the disks one at a time would not fix the issue.
 
zfs send propagates the error; the only choice seems to be copying with tar.
You cannot create a raidz1 from 2 disks and then add a third later.
The rest of your pool is OK. In my case, since I had enough space in the pool, I created another ZFS filesystem in the same pool, copied the content over, and repaired the bad directories from a twin installation (I run several FreeBSD servers); then I deleted the corrupted ZFS filesystem.
If that's not an option for you, you may try to create a zpool with the copies=2 option (you get half of the space of the disk) on the fourth SATA port, to protect you from bad-sector errors.
Or mirror the 4th-port disk with a USB external disk.
I suppose you don't have any previous snapshots of that filesystem. I have been using periodic snapshots since.
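As a rough sketch of the copies=2 idea (the pool name and da3 are only placeholders for whatever sits on that fourth port):

Code:
zpool create -O copies=2 scratch da3     # single-disk pool, every block stored twice
rsync -a /tank/important/ /scratch/      # copy the data you care about onto it

Note that copies=2 only helps against bad sectors on that one disk, not against the whole disk failing.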
 
HarryE said:
You cannot create a raidz1 from 2 disks and then add a third later.

He could try faking the needed third disk using mdconfig(8).
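A rough sketch of that trick (pool, file, and device names are only examples; size the sparse file no larger than the real disks, or the later replace will fail):

Code:
diskinfo -v ada1                               # note the exact size in bytes of a real member
truncate -s 2000398934016 /var/tmp/fake.img    # sparse file of the same size (adjust to your disks)
mdconfig -a -t vnode -f /var/tmp/fake.img      # prints the md unit, e.g. md0
zpool create newtank raidz1 ada1 ada2 md0
zpool offline newtank md0                      # pool is now DEGRADED but usable; copy the data over
zpool replace newtank md0 ada3                 # later: resilver onto the real third disk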

But if disk space is the problem, he could also create a second ZFS filesystem, set dedup=on on both, rsync the broken filesystem into the new one, and then destroy the old one. Been there, done that; my broken ZFS showed 16.0E usage.
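If anyone wants to try that route, it was roughly along these lines (dataset names invented here; keep in mind dedup only matches blocks written while it is enabled, and the dedup table wants plenty of RAM):

Code:
zfs create tank/fixed
zfs set dedup=on tank/broken
zfs set dedup=on tank/fixed
rsync -a /tank/broken/ /tank/fixed/
zfs destroy -r tank/broken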

But it did not cure the whole problem. After a while I had 100% disk I/O on the disks, even in single-user mode (gptzfsboot mirror). I had to back up all the data at only 2 MB/s because of the high disk load.
 
Thanks for the heads-up on zfs send. Regarding the new array, I will just fake the third disk using mdconfig and offline it before copying data. I did that once before, when moving from one disk to a 3-disk raidz which had to include the original disk. Neat trick.

I have purchased three Seagate 2 TB ST2000DM001 drives. Will dedicating the whole disks to the pool (the bootloader is on a CF card along with some of the base system) result in proper alignment?

Is this still best practice with AF disks?

Does the SmartAlign technology in these drives interfere?

I can't seem to find a definitive answer.

P.S. Secondhand disks seem to fetch respectable prices on eBay at the moment, so I will probably just about break even once the new array is filled up and running :).
 
@rabfulton

Here's the method I use to partition and create ZFS optimized for 4k:
4x2TB disk partition help
It starts at post #12. It demonstrates creating a striped mirror pool, but just modify the pool creation accordingly.

EDIT: Just saw that you boot from CF. Following the guide will give you a bootable pool, which gives you better protection against boot issues; alternatively, you can omit creating the boot partition and the bootcode.
/Sebulon
 
The CF card contains a read-only boot partition and enough of the base system to run the machine, à la Vermaden's older ZFS guides. I like that setup.

I also like the simplicity of giving ZFS whole disks, assuming that is still okay with newer drives and the AF quirks. I've seen plenty of good guides regarding the use of GPT partitions, but it seems unnecessarily complex; if I do not need it, I'd rather avoid it.
 
You can have just the freebsd-boot GPT partition containing /boot/gptzfsboot on a CF card or a USB memory stick, and a complete bootable ZFS pool on unpartitioned disks. It works because the ZFS bootloader is smart enough to find the boot filesystem even on disks that don't have partitions.
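In gpart terms, the boot-only device looks something like this (da0 standing in for the CF card or memory stick):

Code:
gpart create -s gpt da0
gpart add -t freebsd-boot -s 512k da0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

The pool members themselves stay unpartitioned; gptzfsboot locates the ZFS root on them at boot time.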
 
Thanks kpa, but that's not my question. My boot setup is fine. I am asking about dedicating whole disks to ZFS with no partitions and whether this is still a good idea. Booting the pool has not been and will not be a problem.
 
rabfulton said:
I've seen plenty of good guides regarding the use of GPT partitions, but it seems unnecessarily complex; if I do not need it, I'd rather avoid it.

That's certainly a matter of opinion. Following that particular guide feels more complex to me than mine :)

But as to 4k hard drives and performance, yes, you need it. I recently read a thread here about creating a pool without any partitioning and benchmarking it with bonnie++, then destroying the pool, rebuilding it with proper partitioning and gnop, and benchmarking again; the second run showed a significant boost in performance:
ZFS performance problems with 2TB Samsung drives

And if you want to be able to import the pool into Solaris, the ZFS partitions need to start at sector 2048, or they will be unavailable. I noticed this when I tried importing a pool made in FreeBSD 8.0, upgraded to 9.0 and ZFS v28, into Solaris 11. Half the drives were partitioned right at the beginning, at sector 63, and the other half started at 1 MiB (sector 2048). When I ran zpool import, the ones that started at the beginning were UNAVAILABLE and the others were ONLINE. So it's good for both performance and compatibility; try it yourself to see the difference.

The gnop trick is also needed to force the ashift value to 12, since the drives still lie and report a 512-byte sector size.
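As a minimal sketch of what the guide boils down to (labels, device and pool names here are just examples):

Code:
# one partition per disk, starting at 1 MiB for alignment and Solaris compatibility
gpart create -s gpt ada1
gpart add -t freebsd-zfs -b 2048 -l disk1 ada1
# repeat for ada2 and ada3, then force ashift=12 with a 4k gnop shim on one member
gnop create -S 4096 gpt/disk1
zpool create tank2 raidz1 gpt/disk1.nop gpt/disk2 gpt/disk3
zpool export tank2
gnop destroy gpt/disk1.nop
zpool import tank2
zdb -C tank2 | grep ashift    # should report ashift: 12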

/Sebulon
 
Thanks Sebulon, that clears things up. Though it seems to me that this should be considered a bug in ZFS, or the Handbook should at least contain that information.
 
@rabfulton

It's easy to feel like that at first; I did too. But it's really not the filesystem's "fault" that the hard drives lie about their sector size. Although you could argue that ZFS is trying to be overly intelligent, deciding things on its own like that. I think ZFS should have an option to force-set ashift on demand. I saw talk about that way back in a Solaris mailing-list thread, where no one really saw anything negative about it either, but sadly it just never got done :(

/Sebulon
 
I have a broken folder too, but at least it doesn't panic my system. It's deep down in my /usr/ports filesystem, though.

I was able to mv it aside....

The Dreamer
 