4K Blocks and letting Zpool create a pool.

Hello,

I read about the 4K issue with drives like the Caviar Green 3TB ones I bought, so I used gnop to make the first drive in each of my RAID-Z vdevs report 4K sectors, created the zpool on the .nop device, exported the pool, then imported it again with the .nop devices removed.
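Roughly, the script boils down to this (the device names here are just placeholders, not my actual disks):
Code:
# make the first disk of the vdev report 4K sectors through a gnop overlay
gnop create -S 4096 /dev/da0
# create the pool on the .nop device so zpool picks ashift=12
zpool create tank raidz2 da0.nop da1 da2 da3 da4 da5
# export, drop the overlay, then import again without it
zpool export tank
gnop destroy da0.nop
zpool import tank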

When I do a [cmd=]zdb | grep ashift[/cmd] it's all listed as ashift 12.
Code:
root@Castor:/ # zdb | grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12

I actually made a script to do it since I knew I'd end up trying.

http://pastebin.com/MuWiLpXs

Should I be ok now?

Why I ask, and why I'm writing this:
I haven't made partitions myself; I gave zpool whole drives. The initial document I read about using these drives only showed the "gnop trick", but today I came across a site where they also partition manually.

I have been using ZoL (Lawrence Livermore's ZFS on Linux) on another system for some time now. When I give zpool there an empty drive instead of a partition, it aligns automatically. I assume the zpool command here does the same once it detects the drives as 4K.

I was also wondering about the sysctl value vfs.zfs.vdev.cache.size being 0 by default. I assume that means some kind of dynamic allocation? Some places say the default value should be 10M; was that for earlier FreeBSD installs? I am reluctant to change the value before I know for sure.

Code:
root@Castor:/ # sysctl -a | grep vfs.zfs.vdev.cache.size
vfs.zfs.vdev.cache.size: 0

I am using 9.1.

I have found info about most of the other sysctl values, where to change them, and ZFS usage on FreeBSD specifically around the web over these last two weeks.

Thank you in advance for any reply.
 
Two things are required to work well with 4K sector emulation drives, aka Advanced Format:

  1. proper alignment
  2. ashift=12, meaning ZFS will do I/O in multiples of 4 KiB (4096 bytes).

In your case you have optimized ZFS to use ashift=12, which is good. The alignment should also be fine if you use raw disks without partitions. My own preference is always GPT partitions, properly aligned at a 1024K offset. Using partitions means you get a lot of benefits: a human-assigned label to identify each disk, and protection against disks that are a fraction smaller or larger than the others, which can cause problems when you replace a failed disk and the replacement turns out to be a couple of kilobytes smaller. Then you are out of luck.
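A rough sketch of that kind of setup; the device name and labels are only examples:
Code:
# GPT scheme with a 1 MiB aligned freebsd-zfs partition and a human-readable label
gpart create -s gpt da0
gpart add -t freebsd-zfs -a 1m -l disk00 da0
# (you can also pass -s slightly below full capacity to guard against marginally smaller replacement disks)
# repeat for the other disks, then build the pool on the labels
zpool create tank raidz2 gpt/disk00 gpt/disk01 gpt/disk02 gpt/disk03 gpt/disk04 gpt/disk05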

However, please be aware of the slack issue with larger sector sizes. In your case you have 12 drives in RAID-Z2, which is not optimal; 6, 10 or 18 drives in RAID-Z2 is optimal, both for performance and for usable storage space. With 12 disks in RAID-Z2 and a 4K sector size you will lose some storage space to slack; this usually amounts to a few percent, which can grow to a meaningful number on large pools.
 
@Phoenixxl

Here's an example of aligned partitioning, etc.:
https://forums.freebsd.org/showpost.php?p=198755&postcount=12

And FYI, sh != bash in FreeBSD.

sub_mesa said:
6 or 10 or 18 drives in RAID-Z2 is optimal.
Really, 18 drives in one vdev? If you'd ever need to resilver (which you eventually would), it would just take forever. The chances would become too great that more crap starts piling up in the meantime. I would suggest sticking with the Sun/Oracle best practice of never having more than 8 drives in one vdev; I can stretch that to 10, but not more.

/Sebulon
 
Of course, 18 disks in RAID-Z2 is not recommended in general, for several reasons:

1) many disks in one RAID-Z-like vdev means the entire vdev has the random I/O capability of just one disk;
2) two parity disks may be too few for 18 disks in total; using two RAID-Z2 vdevs of 10 disks each would be more logical;
3) rebuild times grow with overly large vdevs;
4) with many disks in the same vdev, one slow disk drags ALL disks down to its speed. This is worse than traditional RAID5, where a single I/O is handled by one disk, whereas in RAID-Z it involves ALL disks.

However, the advice of 6/10/18 disks is not just some random number; there is some math to it:

ZFS recordsize / (<total disks> - <parity disks>) = 'stripesize' per disk

If the number of data disks is not a power of two, this leads to a sub-optimal 'stripesize' that the disks have to process. For example, 12 disks in RAID-Z2 is not optimal:

128KiB / (12 - 2) = 12.8K -> 13.0K = BAD alignment - not a multiple of 4K
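If you want to run those numbers yourself for a few layouts, it is plain arithmetic; bc(1) from the base system is enough:
Code:
# per-disk stripe size of a 128KiB record in RAID-Z2 (2 parity disks);
# anything that is not a multiple of 4 means slack with ashift=12
for disks in 6 10 12 18; do
    printf '%s disks: %s KiB per data disk\n' "$disks" "$(echo "scale=1; 128 / ($disks - 2)" | bc)"
done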


Pool configurations optimal for 4K Advanced Format hard drives:
Stripe: all configurations optimal
Mirror: all configurations optimal
RAID-Z: optimal configurations involve 3, 5, 9 disks
RAID-Z2: optimal configurations involve 4, 6, 10, 18 disks
RAID-Z3: optimal configurations involve 5, 7, 11, 19 disks

In other words, you want the number of data disks to be a power of two, which prevents the disks from getting an odd stripesize. Pool configurations with ashift=12 that do not conform to the list of optimal configurations will have both slightly lower performance and less usable storage space due to slack.

In most cases, you want to stick to: RAID-Z of 3 or 5 disks, or RAID-Z2 of 6 or 10 disks. Larger pools can simply use multiple vdevs.
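For example, a 20-disk pool could be created as two 10-disk RAID-Z2 vdevs in one go (disk names again just placeholders):
Code:
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 \
    raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19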
 
Thank you all for the replies.

I already have 28.3 TB of data on the thing, taking up 29.1 TB of space. I'll live with the loss.

My drives are aligned right and ashift is at 12; that's all I really cared about. I understand that when writing a block, it would ideally be spread over 32 data drives (128 KiB in 4 KiB pieces), and I only have 20 non-parity drives, so the writes will overlap. That's OK. I get 1.4 GB/sec in sequential writes as it is now, and I am the only one using the server. This is fine.

-

My second question, however, still remains, and it is quite important in my case: vfs.zfs.vdev.cache.size being 0 in a default installation. Is it really zero, or does that number indicate some kind of dynamic sizing? Some sites talk about the default value being 10M. Is that for installs with earlier versions of ZFS?

Code:
root@Castor:/root # sysctl -a | grep vfs.zfs.vdev.cache
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.max: 16384

bshift and cache.max seem to be the regular numbers...

I am inclined to manually set the size to 10M if it turns out that the current default (maybe chosen for hardware with little memory) avoids using the cache.

If anyone knows with some kind of certainty, please reply. This is quite important for me.

Thank you in advance.
 
Phoenixxl said:
Thank you for replying, but I really didn't need csh for my script; sh is more than enough.

Regards.

Oh no, it was I that must have misunderstood. I looked at your script on pastebin, where the header states:
[BASH] #!/bin/sh ...
I thought you had named the document that way, but it seems pastebin does that all by itself. I was just pointing out that bash is not sh; they behave very differently, in fact. But if you're coming from Linux land, it can be hard to know the difference, since "they" make no (well, at least very little) distinction between them.
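For example, a common bashism like [[ ]] is simply not there in /bin/sh (error text from memory, it may look slightly different on your box):
Code:
$ sh -c '[[ -f /etc/rc.conf ]] && echo found'
[[: not found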

/Sebulon
 
Well, I'm assuming the VDEV cache is disabled by default on 9.1 installs, and that FreeBSD uses the vfs.zfs.vdev.cache.size sysctl variable to do so, 0 meaning disabled?

I was under the impression the convention was to set vfs.zfs.vdev.cache.max to 1 to disable the VDEV cache.

I will manually set vfs.zfs.vdev.cache.size to "10M" and hope I'm not messing anything up by doing so.
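In other words, something like this in /boot/loader.conf, assuming it is a boot-time tunable rather than something I can change at runtime:
Code:
# /boot/loader.conf
vfs.zfs.vdev.cache.size="10M"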

Any comments concerning this are still welcome and appreciated.
 