Mitigating 4k disk issues on RAIDZ

Well, not even a particularly large stripe either, as I see it. If you have a raidz(2,3) vdev exporting a 4k zvol, what they were saying was that all of your writes basically get turned into mirroring, having to write out more parity than data. Ouch! But logical. A shame I didn't think of that before...
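A back-of-the-envelope sketch of why (my own arithmetic, not from the talk, assuming ashift=12, i.e. 4k sectors, and raidz2): every 4k logical block costs one data sector plus two parity sectors, so it eats as much space as a 3-way mirror would:
Code:
# One 4k write on an ashift=12 raidz2 vdev:
#   1 data sector + 2 parity sectors = 3 x 4k on disk
echo $(( (1 + 2) * 4 ))k   # => 12k allocated for 4k of data (3x)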

But then some other guy in the back starts mumbling something, and the speaker responds that the work that has been done in FreeBSD mitigates these issues. So I don't know, maybe we dodged that one, thanks to the devs thinking of it before anyone else did:)

/Sebulon
 
I just had a look to see how we're doing on a zvol that is exported with 4k block size... and it's bad, really bad. When you look in the initiator OS to see how much data it thinks it has written (Properties in Explorer for that volume), it says 575GB used. The storage tells a very different story:
Code:
# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
...
pool/vols/lun_1           1.06T  7.92T  1.06T  -
Big ouch!
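For the record, a quick way of checking what the storage side has actually allocated for that zvol, together with its block size and compression ratio:
Code:
# Allocated vs. referenced space, block size and compression ratio for the zvol:
zfs get used,referenced,volblocksize,compressratio pool/vols/lun_1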

I guess you would need to have zvols created with much bigger block sizes, like 64k, for ZFS to be able to stripe that efficiently with 4k IOs. But if the initiator then sets up e.g. a database that uses a smaller record size, e.g. 8k, then every 8k record would actually take up 64k on the storage... Funk! What are we supposed to do?!

/Sebulon
 
OK,

after a strong cup of coffee, I decided to do some benchmarking of this phenomenon, since there's so little known about it. First, the setup:
Code:
# gpart show
=>        34  3907029101  da0  GPT  (1.8T) <= All disks are partitioned like this.
          34        2014       - free -  (1M)
        2048  3907027080    1  freebsd-zfs  (1.8T)
  3907029128           7       - free -  (3.5k)
...
# zpool status
  pool: pool
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jan 27 03:30:16 2013
config:

	NAME             STATE     READ WRITE CKSUM
	pool             ONLINE       0     0     0
	  raidz2-0       ONLINE       0     0     0
	    gpt/rack1-1  ONLINE       0     0     0
	    gpt/rack1-2  ONLINE       0     0     0
	    gpt/rack1-3  ONLINE       0     0     0
	    gpt/rack1-4  ONLINE       0     0     0
	    gpt/rack1-5  ONLINE       0     0     0
	  raidz2-1       ONLINE       0     0     0
	    gpt/rack2-1  ONLINE       0     0     0
	    gpt/rack2-2  ONLINE       0     0     0
	    gpt/rack2-3  ONLINE       0     0     0
	    gpt/rack2-4  ONLINE       0     0     0
	    gpt/rack2-5  ONLINE       0     0     0
	logs
	  gpt/log1       ONLINE       0     0     0
	cache
	  gpt/cache1     ONLINE       0     0     0

errors: No known data errors
# zdb | grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
And then I took a large folder on the initiator that has very random content:

Ultra super testfolder X
  • Size: 147GB
  • Files: 145649
  • Directories: 25212

This I then copied out onto these zvols, which were created in the following manner:
Code:
# zfs create -b $volblocksize -o sync=always -o compress=on -s -V 2t pool/vols/lun_{1,2,3,4}
# zfs get volblocksize pool/vols/lun_{1,2,3,4}
NAME               PROPERTY      VALUE     SOURCE
pool/vols/lun_1    volblocksize  4K        -
pool/vols/lun_2    volblocksize  16K       -
pool/vols/lun_3    volblocksize  32K       -
pool/vols/lun_4    volblocksize  64K       -
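For clarity, the create line above is just shorthand; spelled out, with one zfs create per block size and the same flags, it would be something like:
Code:
# One create per volblocksize, same flags as the shorthand above:
zfs create -b 4K  -o sync=always -o compress=on -s -V 2t pool/vols/lun_1
zfs create -b 16K -o sync=always -o compress=on -s -V 2t pool/vols/lun_2
zfs create -b 32K -o sync=always -o compress=on -s -V 2t pool/vols/lun_3
zfs create -b 64K -o sync=always -o compress=on -s -V 2t pool/vols/lun_4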
On the initiator side, the disks were formatted with GPT, and the corresponding block size was used as the "Allocation unit size" for the NTFS filesystem.

Here is the storage capacity that was really used after the copying was done:
Code:
# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
...
pool/vols/lun_1            271G  7.16T   271G  -
pool/vols/lun_2            189G  7.16T   189G  -
pool/vols/lun_3            156G  7.16T   156G  -
pool/vols/lun_4            153G  7.16T   153G  -
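These numbers actually line up fairly well with a back-of-the-envelope RAIDZ2 allocation model. My own arithmetic below, assuming ashift=12 and one 5-disk raidz2 vdev (3 data + 2 parity sectors per stripe row, allocations padded to a multiple of 3 sectors), ignoring compression, metadata and NTFS packing:
Code:
# volblocksize=4K : 1 data + 2 parity               = 3 sectors
echo $(( (1 + 2) * 4 ))k        # => 12k on disk per 4k written   (3.0x)
# volblocksize=64K: 16 data + 12 parity + 2 padding = 30 sectors
echo $(( (16 + 12 + 2) * 4 ))k  # => 120k on disk per 64k written (~1.9x)
# Predicted 4K/64K ratio ~1.6x; measured lun_1 vs lun_4 is 271G/153G ~= 1.8x.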


Conclusion
If you are exporting a zvol to a client that is going to store lots of random, mixed files and folders: go big or go home:)

Then I started thinking about databases. What if the volume you're exporting is going to be used for storing databases? Well, since the most widely used OS in most organizations is MS based, I took some time to look up best practices for MS SQL:

http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/66b1796f-339e-49b0-a0e8-0f83b0cac4f7/
I've ever read that the ideal cluster size for SQL Server data file is 64k because the page size is 8k and the pages are read and written into group of 8 per time, called "extent".
http://technet.microsoft.com/en-us/library/dd758814%28v=sql.100%29.aspx
The file allocation unit size (cluster size) recommended for SQL Server is 64 KB
http://resources.arcfmsolution.com/DatabaseServer.html
When formatting the partition that will be used for SQL Server data files, you should use a 64-kilobyte (KB) allocation unit size for data, logs, and the tempdb database.
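So if I were to carve out a zvol specifically for SQL Server data following that advice, it would presumably look something like this (just a sketch, with a hypothetical dataset name; same flags as before, but with a 64k volblocksize to match the recommended 64 KB allocation unit size):
Code:
# Hypothetical zvol for SQL Server data files, 64k volblocksize to match
# the 64 KB NTFS allocation unit size recommended above:
zfs create -b 64K -o sync=always -o compress=on -s -V 2t pool/vols/sqldata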


/Sebulon
 