Does gpart -a 4096 improve performance on a WD 4k sector disk?

Hi,

I've been investigating this since I discovered abysmal performance on a Samba share. The disk is a 1 TB WD10EARX with a ZFS dataset on it. I wiped the disk and tried the gnop trick to fool ZFS on the unpartitioned device into using 4k sectors, but it would not stick (the ashift value remained at 9). So I gave up on trying to get ashift 12, partitioned the disk with a GPT scheme, and added a partition with:
# gpart add -t freebsd-zfs -a 4096 ada2

Resulting in:
Code:
[ian@serenity:~] gpart show ada2
=>        34  1953525101  ada2  GPT  (931G)
          34           6        - free -  (3.0k)
          40  1953525088     1  freebsd-zfs  (931G)
  1953525128           7        - free -  (3.5k)

Does this ensure better performance? Does it eliminate, or at least reduce, I/O that crosses 4k sector boundaries? Performance has certainly improved, but how much of that is simply because I reduced the amount of data and restored it fresh?

-Ian
 
For a 4K alignment the correct procedure would be:

[CMD=""]# gpart add -t freebsd-zfs -l disk0 -b 2048 -a 4k ada2[/CMD]
[CMD=""]# gnop create -S 4096 /dev/gpt/disk0[/CMD]
[CMD=""]# zpool create pool /dev/gpt/disk0.nop[/CMD]
[CMD=""]# zpool export pool[/CMD]
[CMD=""]# gnop destroy /dev/gpt/disk0.nop[/CMD]
[CMD=""]# zpool import pool[/CMD]

After this is done you will get a value of ashift=12.
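
To double-check afterwards:

[CMD=""]# zdb pool | grep ashift[/CMD]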
 
Nope, it doesn't work.

I cut and pasted your commands, changing only the label.
zpool refuses to import the pool.

Code:
serenity:~# zpool import zpool1
cannot import 'zpool1': one or more devices is currently unavailable
serenity:~# zpool import
  pool: zpool1
    id: 14604611665451009313
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        zpool1                  FAULTED  corrupted data
          10187083952381187777  UNAVAIL  corrupted data

If I recreate the gnop device I can reimport the pool.

Code:
serenity:~# gnop create -S 4096 /dev/gpt/zpool1
serenity:~# zpool import
  pool: zpool1
    id: 14604611665451009313
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        zpool1            ONLINE
          gpt/zpool1.nop  ONLINE
serenity:~# zpool import zpool1

serenity:~# zdb
[...]
zpool1:
    version: 28
    name: 'zpool1'
    state: 0
    txg: 428876
    pool_guid: 2534786943769185081
    hostid: 1596434886
    hostname: 'serenity'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2534786943769185081
        children[0]:
            type: 'disk'
            id: 0
            guid: 1564510495648812692
            path: '/dev/ada2p2'
            phys_path: '/dev/ada2p2'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 33
            ashift: 9
            asize: 991610011648
            is_log: 0
            create_txg: 4


I tried many permutations with the same result. I think this is caused by the WD Green drive's 512/4096 sector emulation.

Code:
serenity:~# diskinfo -v ada2
ada2
        512             # sectorsize
        1000204886016   # mediasize in bytes (931G)
        1953525168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        1938021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCC0T0165231 # Disk ident.

serenity:~# smartctl -i /dev/ada2 | grep Sector
Sector Sizes:     512 bytes logical, 4096 bytes physical

So, back to my original question: will # gpart add -t freebsd-zfs -a 4096 ada2, or offsetting the start as you suggested (# gpart add -t freebsd-zfs -b 2048 -a 4096 ada2), at least minimize crossing the 4k boundaries, if not eliminate it?
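
Checking the arithmetic with 512-byte LBAs: the -a 4096 partition above started at LBA 40, and -b 2048 would start at LBA 2048; both byte offsets are exact multiples of 4096:
Code:
serenity:~# echo "40 * 512 % 4096" | bc
0
serenity:~# echo "2048 * 512 % 4096" | bc
0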

PS: Sometimes I was able to re-import the pool, but after a reboot it would be offline in a FAULTED state.
 
OTOH, if the rule is that start LBA × 512 should divide evenly by 4k,
then # gpart add -t freebsd-zfs -b 64 -a 4k ada2 should be OK?
Code:
serenity:~# gpart show ada2
=>        34  1953525101  ada2  GPT  (931G)
          34          30        - free -  (15k)
          64  1953525064     1  freebsd-zfs  (931G)
  1953525128           7        - free -  (3.5k)
64 * 512 / 4096 = 8
 
@Snowe

I'll just tell you what I know. There are two very important things to know about Advanced Format drives (and also SSDs) with regard to ZFS performance:

1. With AF drives, if you are using any partitioning, you have to make sure that the partitions are aligned. The previous example you gave was correct:
# gpart add -t freebsd-zfs -b 64 -a 4k ada2
That is a 4k-aligned partition, both at the beginning and at the end. This makes sure that no I/Os cross any physical sector boundaries.

I have also read partitioning guides saying that a safer start boundary is 1 MiB (see the quick check after point 2):
# gpart add -t freebsd-zfs -b 2048 -a 4k ada2
I have never noticed any difference so far, but I'm going to keep using that start just because of the word "safer".

2. ashift is by no means an "alignment". ashift decides the smallest I/O ZFS will issue: 9 means 512-byte I/Os and 12 means 4096-byte I/Os (the value is the power-of-two exponent of the block size). AF drives never internally do I/Os smaller than 4096 bytes, since that is their real sector size, so there is no point in making ZFS send smaller I/Os; the drive just has to read, modify and rewrite a whole physical sector for every sub-sector write.
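
A quick bc(1) check of those numbers, the two ashift block sizes and the 1 MiB start offset from point 1:
Code:
[CMD="#"]echo "2^9; 2^12; 2048 * 512" | bc[/CMD]
512
4096
1048576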



So make sure your partitions are aligned and make sure that the vdev ashift is 12. That will give you awesome performance.

# gpart create -s gpt ada2
# gpart add -t freebsd-zfs -l disk0 -b 2048 -a 4k ada2

For convoluted reasons, you need to create a "fake" drive to build the pool with. You shouldn't have to do this, really, but it's the only workaround I've found that makes the gnop trick work and the labels show up. The recipe for dd is to seek forward to the total size of your drives, minus about 1 MB to leave room for the partition start and end boundaries. Use diskinfo to see the size of one of your drives in bytes:
Code:
[CMD="#"]diskinfo -v ada0[/CMD]
	512         	# sectorsize
	2000398934016	# mediasize in bytes (1.8T)
	3907029168  	# mediasize in sectors
        ...

[CMD="#"]echo "2000398934016 / 1024000 - 1" | bc[/CMD]
1953513
# dd if=/dev/zero of=/tmp/tmpdsk0 bs=1024000 seek=1953513 count=1
# mdconfig -a -t vnode -f /tmp/tmpdsk0 -u md0
# gnop create -S 4096 md0
# zpool create -o autoexpand=on pool1 mirror md0.nop gpt/disk0
# zpool detach pool1 md0.nop
# mdconfig -d -u 0
# rm /tmp/tmpdsk0
Code:
[CMD="#"]zdb | grep ashift[/CMD]
      ashift: 12

/Sebulon
 
Please have a look at this commit.

The following example shows a recently installed backup server consisting of 2 pools.

The first pool holds the OS and 4 jails that are constantly replicated in an HA scenario with CARP interfaces.
The second pool is used only for backups and collects data from 4 different servers.

[CMD=""]gkontos@backup1:~> gpart show -l [/CMD]
Code:
=>        34  1953525101  ada0  GPT  (931G)
          34          94     1  (null)  (47k)
         128  1953525007     2  disk0  (931G)

=>        34  1953525101  ada1  GPT  (931G)
          34          94     1  (null)  (47k)
         128  1953525007     2  disk1  (931G)

=>        34  3907029101  ada2  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  disk2  (1.8T)
  3907029128           7        - free -  (3.5k)

=>        34  3907029101  ada3  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  disk3  (1.8T)
  3907029128           7        - free -  (3.5k)

[CMD=""]backup1# zdb zroot | grep ashift[/CMD]

Code:
                ashift: 12
                ashift: 12

[CMD=""]backup1# zdb tank | grep ashift[/CMD]
Code:
                ashift: 12
                ashift: 12

zroot has been created like this and tank as described above in my previous post.

Regardless of the drives being used, it is a good idea to always align for 4K.
 
What about this scenario? These are apparently 4k Advanced Format drives.

FreeBSD 9 with gpart and ZFS:

Code:
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u1    RAID-5    OK             -       -       256K    11175.8   RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u1   2.73 TB   SATA  0   -            WDC WD30EZRX-00MMMB0
p1    OK             u1   2.73 TB   SATA  1   -            WDC WD30EZRX-00MMMB0
p2    OK             u1   2.73 TB   SATA  2   -            WDC WD30EZRX-00MMMB0
p3    OK             u1   2.73 TB   SATA  3   -            WDC WD30EZRX-00MMMB0
p4    OK             u1   2.73 TB   SATA  4   -            WDC WD30EZRX-00MMMB0

I tested four of these drives in RAID0 and got ~450 MB/sec (lowest read or write figure) using gpart with -a 512. Now that it's RAID5 (which would normally drop performance a bit anyway), it's ~170 MB/sec with five drives.

The question is, how does the RAID's 256k stripe play into the equation? Should it be 4k aligned as per a single drive? I'll back up the data and do some testing.

Code:
# gpart show
=>         34  23437410237  da1  GPT  (10T)
           34          478       - free -  (239k)
          512     33554432    1  freebsd-swap  (16G)
     33554944  23403854848    2  freebsd-zfs  (10T)
  23437409792          479       - free -  (239k)
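
As a quick sanity check, assuming the controller presents the volume starting on a stripe boundary: 256k is 512 of the 512-byte sectors, and both partition start LBAs above are exact multiples of that (and therefore 4k-aligned too):
Code:
# echo "512 % 512; 33554944 % 512" | bc
0
0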
 
@mworld

You have completely missed the point of ZFS by creating the pool on top of one big RAID volume. The point of ZFS is to turn your fancy RAID controller into a dumb HBA and export the disks as JBOD to the OS. Then you partition the disks one by one and create the ZFS pool from those individual partitions. ZFS does its own dynamic striping (with 128k records) on top of that. Otherwise you'll miss out on all the fancy error correction, self-healing and end-to-end checksumming that ZFS has to offer.
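
Roughly like this, as a sketch only; device names (da1-da5), labels (disk1-disk5), the pool name and the raidz layout are placeholders for whatever matches your five drives, and the two gpart commands are repeated on each drive:

# gpart create -s gpt da1
# gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k da1
# gnop create -S 4096 /dev/gpt/disk1
# zpool create tank raidz gpt/disk1.nop gpt/disk2 gpt/disk3 gpt/disk4 gpt/disk5
# zpool export tank
# gnop destroy /dev/gpt/disk1.nop
# zpool import tank

One .nop provider is enough to get the whole vdev created with ashift=12, same as with the single md0.nop in the mirror example earlier.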

Do that and it will fly past that even further :)

/Sebulon
 
Last time I tried software RAID, the performance was limited at best. Perhaps next time I have some free drives I'll try it out.

Sebulon said:
@mworld

You have completely missed the point of ZFS by creating the pool on top of one big RAID volume. The point of ZFS is to turn your fancy RAID controller into a dumb HBA and export the disks as JBOD to the OS. Then you partition the disks one by one and create the ZFS pool from those individual partitions. ZFS does its own dynamic striping (with 128k records) on top of that. Otherwise you'll miss out on all the fancy error correction, self-healing and end-to-end checksumming that ZFS has to offer.

Do that and it will fly past that even further :)

/Sebulon
 