Help with ZFS and block size alignment.

After a recent upgrade to FreeBSD 10.0-RELEASE (from 9.2-RELEASE), I get the following warning, and I'm not sure how to handle it properly:

Code:
# zpool status
  pool: system
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
  scan: resilvered 1.63G in 0h3m with 0 errors on Thu Mar 29 13:17:43 2012
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            gpt/system1  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gpt/system0  ONLINE       0     0     0  block size: 512B configured, 4096B native

After some research, this is what I gathered about this problem:
  1. This happened because my drives misrepresent their sector size;
  2. My pool's ashift is 9 which doesn't align with my drives' 4K sectors;
  3. A pool's ashift can't be changed after creation.
What can I do to solve this? I was about to do the following for each drive, will it work?

Code:
01.     # zpool detach system gpt/system1
02.     # gpart delete -i 3 ada1
03.     # gpart add -a4k -t freebsd-zfs -l system1 ada1
04.     # gnop create -S 4096 /dev/gpt/system1
05.     # zpool create -O mountpoint=/mnt -O atime=off -O setuid=off -O canmount=off -O xattr=on temporary /dev/gpt/system1.nop
06.     # zfs snapshot -r system@001
07.     # zfs send -R system@001 | zfs recv -d temporary
08.     # zfs umount temporary
09.     # zfs set mountpoint=/ temporary
10.     # zfs set mountpoint=/mnt system
11.     # zpool export system
12.     # zpool import system old
13.     # zpool export temporary
14.     # gnop destroy /dev/gpt/system1.nop
15.     # zpool import temporary system
The problem is that I'm not very familiar with ZFS's inner workings and I constructed my pool a while back, so I'm a little apprehensive about jumping right in and destroying pools and providers with unfamiliar commands like gnop(8) in step 04. I also can't tell whether I need to zero the partition (or part of it) between steps 02 and 03 with dd if=/dev/zero of=/dev/ada1p3 or something similar. Finally, I suspect the -a4k flag isn't needed in step 03, considering the original setup:

Code:
# gpart create -s GPT ada1
# gpart add -b 1m -s 128 -t freebsd-boot ada1
# gpart add -s 16g -t freebsd-swap -l swap1 ada1
# gpart add -t freebsd-zfs -l system1 ada1
This was intended to make everything align correctly. In fact, gpart list ada1 gives all offsets and lengths as multiples of 4096. The partitions' start sectors are not multiples of 4096, though, and I can't tell whether that is relevant. Also, Sectorsize is 512 and Stripesize is 4096.
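
For reference, something like the following pulls out just the relevant fields; as I understand it, a partition is 4K-aligned when its byte offset and length are both multiples of 4096:

Code:
# gpart list ada1 | egrep 'Name|Sectorsize|Stripesize|offset|length'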

I used the following guide as a reference when first installing FreeBSD: http://blogs.freebsdish.org/pjd/2010/08/06/from-sysinstall-to-zfs-only-configuration/. If you guys need the specifics of what I changed, I can post them here.

About this system: it's a very small media server with two identical 2 TB WD Caviar Green drives mirroring each other. Each is partitioned according to the guide above, with identical boot partitions and encrypted, mirrored swap. I do have backups, but only of critical data. The steps would be executed over SSH while the system is live, though I have also thought about doing it from a live FreeBSD disc booted on the actual machine.
Code:
# uname -a
FreeBSD domain.example.com 10.0-STABLE FreeBSD 10.0-STABLE #0 r267465: Sun Jun 15 02:04:26 BRT 2014     root@domain.example.com:/usr/obj/usr/src/sys/CUSTOM  amd64
 
ashift is not alignment, it's just the block size used by ZFS.

Alignment means making the 4K ZFS blocks line up with the drive's 4K blocks.

Don't mess with this without a full backup.

Code:
# gpart create -s gpt ada1
# gpart add -t freebsd-boot -a4k -s512k ada1
# gpart add -t freebsd-swap -a4k -l swap1 -s4g ada1
# gpart add -t freebsd-zfs -l system -a4k ada1

Although I would put swap at the end of the drive. The beginning of the drive is the fastest; why use that for swap?
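
Something like this would give that layout; the -s value for the ZFS partition is only an example and has to be chosen to leave enough room at the end of the disk for swap:

Code:
# gpart create -s gpt ada1
# gpart add -t freebsd-boot -a4k -s512k ada1
# gpart add -t freebsd-zfs -l system -a4k -s1858g ada1
# gpart add -t freebsd-swap -l swap1 -a4k ada1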

This does not use any encrypted space; you will have to do more work for that. Also, the Green drives sometimes cause problems by going into power save. They might work okay with ZFS, though.
 
wblock@ said:
ashift is not alignment, it's just the block size used by ZFS.
Alignment means making the 4K ZFS blocks line up with the drive's 4K blocks.
OK, that makes sense, thanks. That kind of misunderstanding is exactly what I was afraid of. How do I know whether a partition is aligned with the disk, then? Searching Google, all I can find is something akin to what I described doing in my post with -b 1m (which in turn I got from the linked blog post).

wblock@ said:
Don't mess with this without a full backup.
OK, thanks for the advice; I'll make a more thorough backup before continuing. Most of what matters (that can't be freely obtained from the Internet or rebuilt from ports) is backed up already, though.

wblock@ said:
Code:
# gpart create -s gpt ada1
# gpart add -t freebsd-boot -a4k -s512k ada1
# gpart add -t freebsd-swap -a4k -l swap1 -s4g ada1
# gpart add -t freebsd-zfs -l system -a4k ada1

Although I would put swap at the end of the drive. The beginning of the drive is the fastest, why use that for swap?
This is the order most guides I encountered at the time used, and a few explicitly recommended it. Even the Handbook gives swap a prominent place (http://www.freebsd.org/doc/handbook/bsdinstall-partitioning.html). Maybe it's old advice relevant only to computers with scarce memory, but it made sense when I read it. Another thing I considered is that, given the drive's size of almost 2 TB, an offset of 4 GB did not seem like a big deal for overall performance (that's less than a single movie). To be perfectly honest, I'm still not completely clear on the workings of swap, so my decision was based more on how many guides did it one way than on anything else.

wblock@ said:
This does not use any encrypted space, you will have to do more work for that. Also, the green drives sometimes cause problems by going into power save. Might work okay with ZFS, though.
Don't worry, the rest of the command chain is:
Code:
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
# gmirror label -F -h -b round-robin swap /dev/gpt/swap1
# zpool create -O mountpoint=/mnt -O atime=off -O setuid=off -O canmount=off -O xattr=on system /dev/gpt/system1
# cat > /etc/fstab
system/rootfs / zfs rw,noatime,xattr 0 0
/dev/mirror/swap.eli none swap sw 0 0
^D
I also read about these power-save problems while choosing the hard drives, but so far (two years and six months) there has not been a single hiccup. Energy efficiency was a big priority in the design.
 
For alignment, make sure the partitions start on even multiples of 4K and are also even multiples of 4K in size. The -a4k in my example forces this.

After that, use gnop(8) to create a "fake" device with 4K blocks. Most of the ZFS examples show the details. It's still ridiculous that just specifying the smallest desired ashift on the command line is not possible, but it's coming. Use the gnop(8) device when creating the pool, and the pool will use 4K blocks. Since the partition is aligned to 4K blocks, ZFS will be too.
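
Roughly, using the gpt/system1 label from above ("newpool" is just a placeholder name), the sequence is:

Code:
# gnop create -S 4096 /dev/gpt/system1
# zpool create newpool /dev/gpt/system1.nop
# zpool export newpool
# gnop destroy /dev/gpt/system1.nop
# zpool import newpool
# zdb newpool | grep ashift

The .nop device only has to exist when the pool is created; after the export/destroy/import dance the pool keeps its ashift of 12.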
 
How do I verify that the partition blocks are 4K and also that the ZFS blocks are perfectly aligned with the partition blocks before I perform an upgrade from FreeBSD 9.2 to 9.3?
 
The size of the blocks used by the drive is up to the drive. Check the vendor information.
To make sure ZFS blocks are aligned with drive blocks, put ZFS on a 4K-aligned partition.

I don't see how this comes into play with an upgrade.
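
For what it is worth, the ashift a pool is actually using can be read back with zdb(8); ashift: 12 means 4096-byte (2^12) blocks, ashift: 9 means 512-byte blocks:

Code:
# zdb system | grep ashift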
 
wblock@ said:
ashift is not alignment, it's just the block size used by ZFS.
If you want to dig a little deeper, there is a nice explanation on the Open-ZFS wiki.

This is interesting.
Flash-based solid state drives came to market around 2007. These devices report 512-byte sectors, but the actual flash pages, which roughly correspond to sectors, are never 512-bytes. The early models used 4096-byte pages while the newer models have moved to an 8192-byte page.
If we have flash drives, how can we determine the actual flash page/sector sizes? Maybe we should be using an ashift of 13? If only the drive manufacturers would stop lying.

The page also mentions a database of drives that are known to lie about their sector size, but I wonder if this excludes the SSDs mentioned above.
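
As a side note, what a drive claims can be checked from the OS, although a drive that lies will of course lie here as well (ada0 is just an example device):

Code:
# camcontrol identify ada0 | grep -i 'sector size'
# diskinfo -v ada0 | egrep -i 'sectorsize|stripesize'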
 
jrm said:
..
Maybe we should be using an ashift of 13? If only the drive manufacturers would stop lying.
...

They will stop doing that when the last machine using the age-old BIOS is long dead and gone. The BIOS will never work (at least for booting purposes) with anything but disks that report 512-byte sectors.
 
It seems that my zroot pool is fine, but my zdata pool is not:

Code:
[16]root@test:/usr/local/etc # zdb zroot | grep ashift
                ashift: 12
                ashift: 12
Segmentation fault (core dumped) 
[16]root@test:/usr/local/etc # zdb zdata | grep ashift    
                ashift: 9
                ashift: 9
Segmentation fault (core dumped) 
[16]root@test:/usr/local/etc #

Looking into the procedures I used to set up these pools, it struck me that I used gnop(8) to set up zroot but not zdata. The zroot pool is a single-drive setup, while zdata is a RAIDZ1 setup. I seem to recall having difficulties using gnop(8) on the zdata array due to the multiple disks. How does one use gnop(8) to set up a RAIDZ1 pool? Like this?

Code:
# gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k ada1
# gpart add -t freebsd-zfs -l disk2 -b 2048 -a 4k ada2
# gpart add -t freebsd-zfs -l disk3 -b 2048 -a 4k ada3
# gnop create -S 4096 /dev/gpt/disk1
# gnop create -S 4096 /dev/gpt/disk2
# gnop create -S 4096 /dev/gpt/disk3
# zpool create zdata raidz1 /dev/gpt/disk1.nop /dev/gpt/disk2.nop /dev/gpt/disk3.nop
# zpool export zdata
# gnop destroy /dev/gpt/disk1.nop /dev/gpt/disk2.nop /dev/gpt/disk3.nop
# zpool import zdata

~Doug
 
I think I figured out exactly how I configured the zdata pool. There was a script that I downloaded and modified for my purposes. The script worked fine for the mirrored disks that host my OS, but it didn't work for zdata. I went ahead without gnop(8), and while the zdata pool appeared to be set up properly, it wasn't aligned perfectly with the disk partitions.

Here is my script that was modified for the zdata zfs partition:

Code:
#!/bin/sh
# Based on http://www.aisecure.net/2012/01/16/rootzfs/ and
# @vermaden's guide on the forums
#
# drop into shell using Installation Media and do:
# # dhclient <name of Ethernet device>
# # scp install@test.dawnsign.com:/home/install/install_zfs_raidz1.sh .
#

#DISKS="ada0 ada1 ada2 ada3 ada4 ada5"
DISKS="ada1 ada2 ada3"
#DISKS="mfisyspd0 mfisyspd1 mfisyspd2 mfisyspd3 mfisyspd4 mfisyspd5 mfisyspd6 mfisyspd7 mfisyspd8 mfisyspd9"

echo ""
echo "# remove any old partitions on destination drive"
echo "# Create the ZFS data partitions"
echo "# Align the Disks for 4K and create the pool"

for I in ${DISKS}; do
		NUM=$( echo ${I} | tr -c -d '0-9' )
		gpart destroy -F ${I}
		gpart create -s gpt ${I}
		gpart add -t freebsd-zfs -b 2048 -a 4k -l data_disk${NUM} ${I}
		gnop create -S 4096 /dev/gpt/data_disk${NUM}
done

kldload zfs

echo ""
echo "# create ZFS raidz1"
zpool create -f -O atime=off -O setuid=off -O canmount=on zdata raidz1 /dev/gpt/data_disk1.nop  /dev/gpt/data_disk2.nop /dev/gpt/data_disk3.nop

zpool export zdata

for I in ${DISKS}; do
		NUM=$( echo ${I} | tr -c -d '0-9' )
		gnop destroy /dev/gpt/data_disk${NUM}.nop
done

zpool import -o altroot=/ -o cachefile=/tmp/zpool.cache zdata

echo ""
echo "# Set the zdata fs property and set options"
zpool set listsnapshots=on zdata
zpool set autoreplace=on zdata
zpool set autoexpand=on zdata
# FreeBSD 9.2 and 10 Compression, ZFS v5 pool v5000
zpool set feature@lz4_compress=enabled zdata
zfs set checksum=fletcher4 zdata

echo ""
echo "# create ZFS sets and set options"
zfs create -o compression=lz4 -o exec=on -o setuid=on  zdata/home
zfs create -o compression=lz4 -o exec=on -o setuid=on  zdata/home/install
zfs create -o compression=lz4 -o exec=on -o setuid=on  zdata/home/no-rsync
zfs create -o compression=off -o exec=on -o setuid=on  zdata/backup

zfs umount /zdata
zfs set mountpoint=/zdata zdata

sync
echo ""
echo "# Syncing... Install Done."
echo ""
echo "# Reboot the machine."
echo ""
sync

#### EOF ####

I decided to troubleshoot this script by feeding it to the command prompt one line at a time and seeing how each command worked out. Each command appeared to work, but when the time came to reboot, the zdata filesystems weren't there at all. I can see that the zdata pool was created, but its filesystems aren't displayed when executing the df command:

Code:
[12]root@test:/root/bin # df
Filesystem      1K-blocks    Used    Avail Capacity  Mounted on
zroot            63983492 1067348 62916144     2%    /
devfs                   1       1        0   100%    /dev
fdescfs                 1       1        0   100%    /dev/fd
zroot/usr        66867596 3951452 62916144     6%    /usr
zroot/usr/ports  64895844 1979700 62916144     3%    /usr/ports
zroot/usr/src    64258032 1341888 62916144     2%    /usr/src
zroot/var        63317244  401100 62916144     1%    /var
zroot/var/log    62942936   26792 62916144     0%    /var/log
[12]root@test:/root/bin # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zdata  8.12T  16.5M  8.12T     0%  1.00x  ONLINE  -
zroot  74.5G  9.21G  65.3G    12%  1.00x  ONLINE  -
[12]root@test:/root/bin #

I suspect that I'm not correctly setting the mountpoint. Can someone point me in the right direction?
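
For reference, the mount-related properties can be listed recursively like this (output omitted):

Code:
# zfs get -r canmount,mountpoint,mounted zdata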

~Doug
 
I can confirm that the newly created zdata pool is properly aligned:

Code:
[14]root@test:/root/bin # zdb 
zdata:
    version: 5000
    name: 'zdata'
    state: 0
    txg: 13
    pool_guid: 2557486137044716545
    hostid: 531525864
    hostname: 'test.example.com'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2557486137044716545
        children[0]:
            type: 'raidz'
            id: 0
            guid: 9432835568044788724
            nparity: 1
            metaslab_array: 34
            metaslab_shift: 36
            ashift: 12
            asize: 9001760980992
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 15293535160840931828
                path: '/dev/gpt/data_disk1'
                phys_path: '/dev/gpt/data_disk1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 6082184454211888348
                path: '/dev/ada2p1'
                phys_path: '/dev/ada2p1'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 9232995351358260693
                path: '/dev/gpt/data_disk3'
                phys_path: '/dev/gpt/data_disk3'
                whole_disk: 1
                create_txg: 4
    features_for_read:
        com.delphix:hole_birth
zroot:
    version: 5000
    name: 'zroot'
    state: 0
    txg: 8167823
    pool_guid: 15545892302639630615
    hostid: 531525864
    hostname: 'test.example.com'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 15545892302639630615
        children[0]:
            type: 'disk'
            id: 0
            guid: 534444360235656006
            path: '/dev/gpt/disk0'
            phys_path: '/dev/gpt/disk0'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 29
            ashift: 12
            asize: 80021553152
            is_log: 0
            DTL: 136
            create_txg: 4
    features_for_read:
[14]root@test:/root/bin #

Still haven't figured out how to mount zdata.
 
I'm not sure why, but the zdata filesystems now mount automatically at boot. It looks like I needed to actually unmount the zdata filesystem and then set the mountpoint for zdata. After that, on each reboot I'm able to see all of the ZFS filesystems I should see!
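
In command terms that was roughly the tail end of the script above:

Code:
# zfs umount /zdata
# zfs set mountpoint=/zdata zdata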

~Doug
 
I am not sure about 9.3-RELEASE, but in 10-STABLE there is vfs.zfs.min_auto_ashift, which can be used to specify the minimum ashift. So to build a pool with ashift=12, just set vfs.zfs.min_auto_ashift=12 with sysctl(8) first. There is no longer a need for the gnop trick.
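
So, reusing the labels from the script earlier in the thread, a fresh pool could be built without gnop(8) roughly like this (the zdb line just confirms the result):

Code:
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create zdata raidz1 gpt/data_disk1 gpt/data_disk2 gpt/data_disk3
# zdb zdata | grep ashift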
 