Advice on ZFS ashift settings for a pool of HDDs

Hi

And best wishes for the new year to the whole FreeBSD community.

A few years ago I set up a server mainly for file storage (plus other things, but mainly file storage). For that purpose I bought an SSD and three WD10EARS drives (WD Green SATA, 1 TB). My goal was (and remains) storage, not necessarily performance. At the time I did not pay attention and created a RAID-Z pool directly on the devices. I recently realized that I might not be taking advantage of the Advanced Format (AF) features of my drives because I did not use an ashift value of 12.

Meanwhile I have read that using ashift=12 can improve performance but might reduce storage capacity a bit. As my goal is to favor storage over performance, and my SSD acts as a cache for the pool, do you think it would make sense to recreate my pool with ashift=12? Is it not worth the trouble, or should I keep the current settings since my main goal is storage?

Thanks in advance for any advice.

Best regards.

NB: I am not an expert on these topics, so I apologize if I am not precise enough or do not use the right terms.
 
Do those drives use 4K blocks? Not all of the Green drives do. If not, it doesn't matter much.

On 4K drives, both ashift and alignment need to be correct, or the write speed will be half of what the hardware can do, or less.
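A quick way to check what a drive actually reports is diskinfo(8); the device name below is just an example:
Code:
# diskinfo -v ada1
If the stripesize line shows 4096, the drive uses 4K physical sectors (Advanced Format) even when the sector size is still reported as 512.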
 
Good evening (European time).

Thanks for the fast answer. Yes, they are supposed to use 4K blocks - at least that is what the technical specifications say. So I suppose I should try to recreate my pool? To do so, should I perform the following operations on all disks:
Code:
# for X in 1 2 3
gpart create -s gpt /dev/adaX
gnop create -S 4096 /dev/adaX

zpool create -o ashift=12 mydatapool raidz /dev/ada1.nop /dev/ada2.nop /dev/ada3.nop
# Should I destroy the nop devices afterwards?

Thanks.
 
No, the first thing to do would be a full backup.

After that, partition and make sure the partitions are aligned to 4K blocks. The partitions must start on a sector number that is evenly divisible by 8, which with 512-byte sectors is a 4K multiple. (This is a separate issue from ashift, yet it is frequently ignored.)

It is not necessary to create multiple gnop(8) devices; you only need one with the right sector size. The pool will use the largest block size of all the devices.
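If you want to check what your existing pool uses before deciding, zdb(8) will show the ashift (substitute your pool name):
Code:
# zdb -C mydatapool | grep ashift
A value of 9 means 512-byte blocks, 12 means 4K blocks.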
 
wblock@ said:
No, the first thing to do would be a full backup.
Always good to mention, though I already perform regular backups and of course had planned to make a fresh one just before the operation, as I am aware this kind of operation will destroy the pool and therefore the data.

wblock@ said:
After that, partition and make sure the partitions are aligned to 4K blocks. The partitions must start on a block that is evenly divisible by 8, a 4K multiple. (This is a separate issue from ashift, yet is frequently ignored.)
If I use the whole disk, do I still have to do that, I mean start from a position other than 0?

Thanks again for your support.
 
If the ZFS blocks start at zero, it's fine. But don't do the gpart create step shown above, and use gpart destroy -F on the drives to remove old partition metadata.
 
OK, fine, so I must call destroy first, but after that can I run my commands, or do I need any other operation? Looking at old posts I see some calls to newfs(8), for example. There is also a -a option to gpart(8). So basically, maybe I could also do:

Code:
gpart destroy -F /dev/ada1
gpart create -s gpt /dev/ada1
gpart add -t freebsd-zfs -a 4k ada1  # inspired by the gpart(8) man page

I am not sure that would work either. Should I use that or gnop? And how does it compare to using "raw ZFS" - because here I explicitly create a partition?

Thanks
 
I was working on testing this in a VM, but it was fighting back, so the update will be delayed a bit. In the meantime:

There are not too many reasons to create partitions when you're using the whole disk for ZFS anyway.

gnop(8) does one thing here: it is a workaround to force ZFS to use 4K blocks. Otherwise, ZFS will use what the disks report as their block size, 512 bytes. Again, forcing ZFS to use 4K blocks does not guarantee that they line up with the 4K blocks on the disk; that is alignment. If they don't line up, write speed drops to half of what the hardware can do, or even less.

At present, you have to do both. In the future, ZFS will have a way to specify block size, and also do better testing for real physical block size on the disks.
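For what it's worth, newer FreeBSD releases (I am not sure which version you are running) have a sysctl that makes the gnop(8) workaround unnecessary by setting a minimum ashift for newly created vdevs:
Code:
# sysctl vfs.zfs.min_auto_ashift=12
With that set, zpool create will use 4K blocks even on drives that report 512-byte sectors.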
 
Okay, the VM seems to be cooperating now.

First, get rid of any leftover partitioning schemes:
Code:
# gpart destroy -F ada1
# gpart destroy -F ada2

When using full disks for ZFS, there aren't many reasons to use partitioning at all. One reason would be to use GPT labels, but those aren't strictly necessary because ZFS puts its own labels on disks so they are relocatable. You can take a disk array and randomly reconnect it to a new machine and it will still work.

If you want to use partitions, I recommend starting the first data partition at 1M, or block 2048. This is a 4K-aligned block. It's also not a bad idea to make it end a bit before the end of the disk, preferably by 1M or more. That has to be calculated, but here we'll pretend the resulting size is 960M.
Code:
# gpart create -s gpt ada1
# gpart add -t freebsd-zfs -a4k -b1m -s960m ada1
# gpart create -s gpt ada2
# gpart add -t freebsd-zfs -a4k -b1m -s960m ada2

That takes care of alignment, now to force ZFS to use 4K blocks. For mirrors and RAID-Z, ZFS will use the largest block size of the devices listed. So we create a fake device with gnop(8) that has 4K blocks, but otherwise just passes everything on to ada1.
Code:
# gnop create -S4k ada1
# zpool create tank mirror /dev/ada1.nop /dev/ada2

After a reboot, the nop device disappears, but all the ZFS data was written to ada1, so ZFS just sees it as if a device name changed.
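Note that if you did create the partitions as shown above, point gnop(8) and zpool at the partitions (ada1p1, ada2p1) rather than at the whole disks; otherwise ZFS will write over the GPT metadata:
Code:
# gnop create -S4k ada1p1
# zpool create tank mirror /dev/ada1p1.nop /dev/ada2p1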
 
Hi,

Thank you very much for all the details; I will take care of it soon.
Being a bit paranoid ;), I will make a second backup before destroying my pool.

Thanks again for the help, and I would like to take this opportunity to say how much I appreciate the quality of this forum - the responsiveness and accuracy of the answers.
Thank you very much.

Best Regards
 
2c.

Irrespective of whether or not your current drives use 4k blocks, I would set ashift=12.

Why?

Because drives you purchase in the future WILL use 4K blocks, and once you create a VDEV, its ashift is fixed.

You cannot remove a VDEV from a ZPOOL.

What does this mean? If you have a disk failure and need to replace a disk with a new one (or you want to upgrade the disks in your pool to expand capacity), the new drive will not match the VDEV's ashift, and you cannot fix that without re-creating the ZPOOL. (Currently this is a major problem with ZFS in my opinion, as eventually we will need to migrate to larger block sizes. ZFS capacity limits are fine and won't be exceeded, but block sizes will need to grow. Yes, the step from 4K to the next size up is a long way off, but if you're using ZFS for permanent storage, it's conceivable that your pool will still be around by then.)

I just went through this exact problem with my home setup. Luckily I had 2x mirror VDEVs: I removed one drive from each mirror, created a new pool with 2x single-disk VDEVs, migrated the data to the new pool, destroyed the old one, and added the freed disks to the new pool's VDEVs, promoting the single disks to mirrors (rough sketch of the commands below).

If I'd been using RAIDZ it would have been a lot more painful.
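Roughly, the commands looked something like this (pool and device names are made up for illustration, and I am going from memory):
Code:
# split one disk out of each existing mirror
zpool detach oldpool ada2
zpool detach oldpool ada4
# create the new pool on the freed disks (forcing ashift=12 with gnop or the sysctl)
zpool create newpool ada2 ada4
# copy everything across
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs receive -Fdu newpool
# destroy the old pool and promote the single-disk VDEVs to mirrors
zpool destroy oldpool
zpool attach newpool ada2 ada1
zpool attach newpool ada4 ada3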
 
Nice observation. So basically, you are saying that if I make a two-drive ZFS mirror and need to replace one of the drives, I format the new one the same way as before and then add it to the pool for resilvering, and this won't work?

LE: in a 4K-block situation!
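I mean something along these lines (device names hypothetical):
Code:
# gpart create -s gpt ada2
# gpart add -t freebsd-zfs -a4k -b1m ada2
# gnop create -S4k ada2p1
# zpool replace tank ada1p1 ada2p1.nop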
 
throAU said:
Because drives you purchase in future WILL be 4k blocks, and once you create a VDEV, its ashift is fixed.
For several years yet, you will be able to buy 4K transitional drives. They physically have 4K sectors on the platter, but they use read-modify-write techniques to allow atomic updates on 512-byte boundaries, and they pretend to have 512-byte sectors.

(Actually, the above description is somewhat sloppy; in reality there are concepts such as physical block size, logical block size, and sectors, and they are all different, but the above is good enough for most people).

I initially thought that these transitional drives would have really awful performance (after all, for every 512-byte write, you'd think they first have to log on the platter that they are about to begin an update, read the 4K, wait a rotation, rewrite it, and then write a log record indicating that the update completed successfully). But a friend has tested their performance, and they run reasonably well even while pretending to be 512-byte drives. The drive manufacturers are clearly using some magic sauce.

So don't panic that all existing file systems have to be immediately re-formatted with 4K sectors (or ashift=12) just to be future-compatible. Don't do anything rash; relax. On the other hand, if you are formatting a new file system today, it's obviously smart to go for 4K-sector compatibility, even if your current hardware still uses 512-byte sectors.
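If you are curious which kind of drive you have, camcontrol(8) shows both sizes; a transitional (512-emulation) drive reports a 512-byte logical and a 4096-byte physical sector size:
Code:
# camcontrol identify ada1 | grep "sector size"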
 
Another question, sorry, but I have just started working with ZFS. I am trying to make a ZFS mirror exactly as you described, and here is the problem:

Code:
nas4free: ~ # gpart add -t freebsd-zfs -a4k -b1m -s930gb ada1
ada1p1 added
nas4free: ~ # gpart create -s gpt ada0
ada0 created
nas4free: ~ # gpart add -t freebsd-zfs -a4k -b1m -s930gb ada0
ada0p1 added
nas4free: ~ # gnop create -S4k ada1
gnop: Provider ada1.nop already exists.
nas4free: ~ # gnop create -S4k ada0
gnop: Provider ada0.nop already exists.
nas4free: ~ # zpool create tank mirror /dev/ada1.nop /dev/ada0
nas4free: ~ # zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1.nop  ONLINE       0     0     0
            ada0      ONLINE       0     0     0

errors: No known data errors
nas4free: ~ # reboot
and after that:
Code:
Welcome to NAS4Free!
nas4free: ~ # zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1.nop  ONLINE       0     0     0
            ada0      ONLINE       0     0     0

errors: No known data errors
and in dmesg:
Code:
GEOM: ada1: the primary GPT table is corrupt or invalid.
GEOM: ada1: using the secondary instead -- recovery strongly advised.
GEOM: ada0: the primary GPT table is corrupt or invalid.
GEOM: ada0: using the secondary instead -- recovery strongly advised.
Any suggestions? Thanks in advance and sorry for my newbie style :)

LE: I think the command for creating the partition is not right, or something:

Code:
nas4free: ~ # gpart add -t freebsd-zfs -a4k -b1m -s220gb ada3
ada3p1 added
nas4free: ~ # gpart list ada3
Geom name: ada3
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 468862094
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: ada3p1
   Mediasize: 236223201280 (220G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   rawuuid: f1caba00-c6ce-11e3-aad9-d4ae52cfa0da
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 236223201280
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 461375487
   start: 2048
Consumers:
1. Name: ada3
   Mediasize: 240057409536 (223G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
 