L2ARC and ZIL on SSD - 4K alignment?

I have added a whole SSD as L2ARC without partitioning it. Do we really need to optimize the alignment of L2ARC and ZIL on an SSD by using gpart and gnop?

I have tried partitioning an L2ARC (45G) and a ZIL (10G) with gpart for a performance test, but when I ran zdb | grep ashift, the ZIL showed 9 instead of 12. Does that mean the ZIL wasn't aligned?
 
@belon_cfy

Improper alignment on an SSD can cut performance in half, so yes, it is very important:

Code:
[CMD="#"]diskinfo -v daX[/CMD]
daX
	512         	# sectorsize
	240057409536	# mediasize in bytes (223G)
	468862128   	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	29185       	# Cylinders according to firmware.
	255         	# Heads according to firmware.
	63          	# Sectors according to firmware.
	ID-SDFG434156XVH	# Disk ident.

[CMD="#"]echo "240057409536 / 1024000 - 1" | bc[/CMD]
234430
[CMD="#"]dd if=/dev/zero of=tmpdsk0 bs=1024000 count=1 seek=234430[/CMD]
[CMD="#"]mdconfig -a -t vnode -f tmpdsk0[/CMD]
[CMD="#"]gnop -S 4096 md0[/CMD]
[CMD="#"]gpart create -s gpt[/CMD]
[CMD="#"]gpart add -t freebsd-zfs -l log1 -b 2048 -a 4k daX[/CMD]
[CMD="#"]zpool add pool log mirror md0.nop gpt/log1[/CMD]
[CMD="#"]zpool detach pool md0.nop[/CMD]
That gives you both proper alignment and ashift: 12 on the log vdev. Note that ashift is not the same thing as alignment.
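To double-check afterwards, ashift is stored per top-level vdev, so you can dump the pool configuration and look at the entry for the new log vdev (a sketch; substitute your own pool name):

Code:
[CMD="#"]zdb -C pool[/CMD]
The log vdev's children[N] entry should now show ashift: 12.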

/Sebulon
 
Are the parameters -b 2048 and -a 4k necessary when adding a partition with gpart? I'm using 4x 2TB 4k-optimized drives and the ZFS volumes were created 4K-optimized. My gpart show output is below; is it considered aligned?

Code:
=>        34  1953525101  da0  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da1  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da2  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da3  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)
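(For reference: on a 512-byte-sector disk a partition is 4k-aligned when its starting sector is divisible by 8, since 8 x 512 = 4096. Checking the two freebsd-zfs partitions above:

Code:
[CMD="#"]echo "128 % 8" | bc[/CMD]
0
[CMD="#"]echo "41943168 % 8" | bc[/CMD]
0
Both start on 4k boundaries; only the tiny freebsd-boot partition at sector 34 is unaligned, which doesn't matter.)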

How about L2ARC? Do I still need to create a 4k .nop device before adding it as L2ARC?
 
@einthusan

Not quite sure, to be honest. A SLOG gets its own vdev that receives an ashift value, but if you run zdb, you won't see that for L2ARC. It's also different in that it's part of the ARC, while the SLOG is part of the pool.

I settled for aligned partitioning on my cache device (Vertex3) and have noticed it reading and writing over 200MB/s at times in gstat, so I'm not worried.

/Sebulon
 

Hi Sebulon,

I'm not sure what a SLOG is, but I have a quick question regarding the L2ARC. I plan on putting in 2x SSDs as L2ARC. I used your "ZFS as root" guide to set up a few FreeBSD boxes, but could you explain what you think is the best way to set up 2x L2ARC devices? I thought L2ARC devices don't need disk alignment at all (4k or not), and that they "just work" by telling ZFS to use them as L2ARC. I've even heard some people say to mirror L2ARC devices; does that actually provide additional read performance?
 
The following two drives are 4k aligned.

Code:
$ gpart show -l ada0
=>       34  250069613  ada0  GPT  (119G)
         34        128     1  (null)  (64k)
        162       1886        - free -  (943k)
       2048   20971520     2  ssd1  (10G)
   20973568   10485760     3  log0  (5.0G)
   31459328  218103808     4  cache0  (104G)
  249563136     506511        - free -  (247M)

$ gpart show -l da6
=>       34  250069613  da6  GPT  (119G)
         34        128    1  (null)  (64k)
        162       1886       - free -  (943k)
       2048   20971520    2  swap1  (10G)
   20973568  228589568    3  cache1  (109G)
  249563136     506511       - free -  (247M)

And when accessing a file that is in the cache,

Code:
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2329.8   0.0 298218.9     0.0    4   2.7  67
da6      2433.7   1.0 311182.9   127.9    5   3.1  80
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2833.2   6.0 362309.5   422.1   10   2.7  82
da6      2932.1   0.0 375080.8     0.0   10   3.2  97
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2845.1   0.0 364178.7     0.0   10   2.8  82
da6      2990.0   0.0 382608.2     0.0   10   3.2  98
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     1583.4   2.0 202676.3   143.9    0   2.8  46
da6      1586.4   0.0 203059.9     0.0    0   3.2  52

You should also increase the sysctl variables vfs.zfs.l2arc_write_max and vfs.zfs.l2arc_write_boost. Their default of 8 MB/s is a bit slow for current-generation SSDs. Finally, you could set vfs.zfs.l2arc_noprefetch to 0 if you also want to cache streaming data.
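For example, something like this (the values are only illustrations; these tunables take bytes, roughly per one-second feed cycle):

Code:
[CMD="#"]sysctl vfs.zfs.l2arc_write_max=67108864[/CMD]
[CMD="#"]sysctl vfs.zfs.l2arc_write_boost=134217728[/CMD]
[CMD="#"]sysctl vfs.zfs.l2arc_noprefetch=0[/CMD]
That is 64 MB and 128 MB per feed instead of the default 8 MB, plus caching of prefetched (streaming) data. To make the settings persistent, the usual place is /boot/loader.conf.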
 
einthusan said:
I'm not sure what a SLOG is
Separate LOG device.

einthusan said:
I plan on putting in 2x SSDs as L2ARC.
Why? Is it for the size? Remember that you still need RAM to keep track of everything you put in the L2ARC.

einthusan said:
I thought L2ARC devices don't need disk alignment at all (4k or not), and that they "just work" by telling ZFS to use them as L2ARC
It depends on the SSD. Some SSDs have about the same write performance aligned or unaligned, while on others, like the Vertex, unaligned writes go half as fast as aligned ones.
Just keep the partitioning aligned and you'll never have any problems, SSDs or otherwise. I'm assuming the cache devices are SSDs, since otherwise, what's the point?

einthusan said:
I've even heard some people say to mirror L2ARC devices; does that actually provide additional read performance?
Them people, up to no good again. That is just incorrect. There is no way to mirror cache devices, because you don't have to; everything cached there is redundant, it also lives in the pool. I'm betting you're confusing the SLOG/ZIL with L2ARC. Log devices can be good to have mirrored. It's not strictly necessary nowadays, from ZPOOL v19 and up, but if you have a database machine in production doing heavy sync writing with only one log device and it dies, you are going to feel the paaaiiin of going from around 60-70MB/s write down to, like, 5 :(
With two mirrored SLOGs, one dying is not an issue.
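For the record, adding a mirrored log pair is just (partition labels assumed):

Code:
[CMD="#"]zpool add pool log mirror gpt/log0 gpt/log1[/CMD]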

/Sebulon
 

The reason I am using 2x 32 GB SSDs instead of 1x 64 GB SSD is that I believe that with two drives I will get more read IOPS, and thus better read performance, than with a single SSD. I need 64 GB because of video streaming, so the more cache space the better, right? We have 8 GB of RAM; I think that should be enough.

Yes, I will align the SSDs using the same procedure as for regular hard drives. And yes, they will be used as cache devices (L2ARC).
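As far as I understand, attaching them then just means something like (labels assumed):

Code:
[CMD="#"]zpool add pool cache gpt/cache0 gpt/cache1[/CMD]
and ZFS spreads the cached data over both devices by itself.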

You're right. I got ZIL mixed up with L2ARC. They were mirroring the SLOG devices, not L2ARC.
 
@einthusan

SSDs stripe internally across their NAND cells. 1x 64GB is as fast as 2x 32GB as long as it has twice the cells to stripe across, and that's really hard to find out; manufacturers rarely print it, even in the supposedly complete product sheet. With 2x 32GB you have to trust ZFS to be as effective at striping across two disks as a single disk is at striping internally.

In the end, I don't think anyone would notice the difference :) One con with 2x disks is that they take up more space in the chassis. One pro is that if one of them dies, you'd only lose half the cached information.

/Sebulon
 
@Sebulon,

I noticed that you are partitioning an L2ARC device with a ZFS type:

[CMD=""]# gpart add -t freebsd-zfs -l log1 -b 2048 -a 4k daX[/CMD]

Is there a particular reason why this is done?

Thanks
 
Sebulon said:
@gkontos

No, nothing in particular. Is there a "better" type, you think? Does it even matter?

/Sebulon

I guess not. I asked mainly out of curiosity. Usually, I don't partition LOG or CACHE SSD devices.

If you want to 4K align then this is the only way, but do you think that it really makes that difference?
 
Sebulon said:
@einthusan

SSDs stripe internally across their NAND cells. 1x 64GB is as fast as 2x 32GB as long as it has twice the cells to stripe across, and that's really hard to find out; manufacturers rarely print it, even in the supposedly complete product sheet. With 2x 32GB you have to trust ZFS to be as effective at striping across two disks as a single disk is at striping internally.

In the end, I don't think anyone would notice the difference :) One con with 2x disks is that they take up more space in the chassis. One pro is that if one of them dies, you'd only lose half the cached information.

/Sebulon

But what about sustained read rates? For example, a single SSD may have a sustained read rate of 200 MB/s. If you have 2 SSDs being read in parallel, wouldn't you get about 400 MB/s of sustained read speed? And if we add random reads to the picture, wouldn't parallel reads be better? Unless ZFS sucks at managing 2x L2ARC devices.
 
gkontos said:
...but do you think that it really makes that difference?

I know it makes a difference on OCZ's drives, because I have tested quite a few drives now looking for the perfect SLOG device, and one of the things I test is aligned vs. unaligned writes. Please test the difference yourself:
Code:
[CMD="#"]gpart add -t freebsd-zfs -l log(cache)1 (a)daX[/CMD]
Compared to:
[CMD="#"]gpart add -t freebsd-zfs -l log(cache)1 -b 2048 -a 4k (a)daX[/CMD]

Prepare:
[CMD="#"]mdmfs -s 2048m mdX /mnt/ram[/CMD]
[CMD="#"]umount /mnt/ram[/CMD]
(Or use the [FILE]mdconfig[/FILE]-command to do the same thing)
[CMD="#"]dd if=/dev/random of=/dev/mdX bs=1024000 count=2048[/CMD]

Then test three times:
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]

/Sebulon
 
einthusan said:
But what about sustained read rates? For example, a single SSD may have a sustained read rate of 200 MB/s. If you have 2 SSDs being read in parallel, wouldn't you get about 400 MB/s of sustained read speed? And if we add random reads to the picture, wouldn't parallel reads be better? Unless ZFS sucks at managing 2x L2ARC devices.

The same striping principle applies to reads as well.

Let's just say, hippopotamously, that one SSD with 16x 4GB (=64GB) NAND cells can read sustained at 200MB/s; then an SSD with 32x 4GB (=128GB) NAND cells can read sustained at 400MB/s. These numbers were pulled out of my ass, but you get the picture :)

/Sebulon
 
t1066 said:
You should also increase the sysctl variables vfs.zfs.l2arc_write_max and vfs.zfs.l2arc_write_boost. Their default of 8 MB/s is a bit slow for current-generation SSDs. Finally, you could set vfs.zfs.l2arc_noprefetch to 0 if you also want to cache streaming data.

What values have you chosen for write_max and write_boost? I cranked them up from 8 to 80MB/s, but what would be a good rule of thumb? My Vertex 3 can write sustained at about 240MB/s, but going too high may impact reading if the drive gets 100% busy writing. Should we say about half (120MB/s) of the write performance, maybe?

/Sebulon
 
@Sebulon

I think the L2ARC should mainly be used as a read cache. You do not want to end up reading from the pool instead of the L2ARC because the cache device is busy writing. So the write speed should be set to at most half of the device's maximum write speed; I would prefer a quarter or a third of it.

There are two sysctls that I had overlooked: vfs.zfs.l2arc_feed_again and vfs.zfs.l2arc_feed_min_ms. They increase the frequency of feeding the L2ARC when writing to it exceeds half of write_max. So there are actually two ways to increase the write speed: one is to set a hard limit (write_max = hard limit, and l2arc_feed_again=0); the other is to set a lower write_max and let the feed-again mechanism kick in. Fortunately, these settings can be tested without rebooting the system.

But I think cranking write_boost up to almost the maximum write speed should not cause any problems. Finally, the L2ARC works in a round-robin fashion across cache devices rather than striping; for streaming data, though, it should perform more or less the same as striping.
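In sysctl terms, the two variants would look roughly like this (the values are only examples, in bytes):

Code:
[CMD="#"]sysctl vfs.zfs.l2arc_write_max=125829120 vfs.zfs.l2arc_feed_again=0[/CMD]
(a hard limit of 120 MB per feed, turbo disabled)
[CMD="#"]sysctl vfs.zfs.l2arc_write_max=26214400 vfs.zfs.l2arc_feed_again=1 vfs.zfs.l2arc_feed_min_ms=200[/CMD]
(a lower 25 MB per feed, letting the feed-again mechanism shorten the feed interval when needed)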
 
If your array contains a mix of 512-byte and 4K drives, will it do any harm to align everything to 4K? I'm assuming not, as the 512-byte ones won't care either way?

In terms of aligning them, is it sufficient to make sure that the GPT partitions are on 4k boundaries? I'm not clear on why using gnop helps so much, and why you only need to use it once.
 
Aligning on 4k boundaries without using gnop(8) is not enough, because ZFS would still use 512-byte sectors as the smallest I/O unit. Creating a fake device with 4KB sectors forces ZFS to use 4KB sectors as the smallest I/O unit in the pool.
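The usual dance at pool creation therefore looks roughly like this (a sketch, label and pool name assumed); the .nop device can be thrown away afterwards because the ashift is recorded on disk:

Code:
[CMD="#"]gnop create -S 4096 /dev/gpt/disk0[/CMD]
[CMD="#"]zpool create tank /dev/gpt/disk0.nop[/CMD]
[CMD="#"]zpool export tank[/CMD]
[CMD="#"]gnop destroy /dev/gpt/disk0.nop[/CMD]
[CMD="#"]zpool import tank[/CMD]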
 
Ok, so why is it that you don't need to create the gnop device every time? Most of the walkthroughs I've seen only seem to require it to be created once?
 
kpa said:
It's only needed when you create the pool; once the ashift property is set, it cannot be changed.

Not quite correct. You need a gnop device for every new vdev in the pool. If your pool consists of 8 drives in one raidz(2,3), you only need one gnop device. If it is 2x 4-drive raidz(2,3) vdevs, you need two gnops. 4x 2-drive mirror vdevs require four gnops. And finally, for a pool with 8x single-drive vdevs (no fault tolerance), all eight need to be gnops :)

Also, because the ashift value is set per vdev instead of pool-wide, it is possible to have, e.g., a 2-drive mirror with ashift=9, and the next time you buy two more hard drives (most probably AF ones) you can add another 2-drive mirror to the pool, with this new vdev getting ashift=12 instead.
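Adding such a new 4k mirror to an existing pool would then go roughly like this (a sketch, labels assumed; only one member per vdev needs to be a gnop):

Code:
[CMD="#"]gnop create -S 4096 /dev/gpt/disk2[/CMD]
[CMD="#"]zpool add tank mirror gpt/disk2.nop gpt/disk3[/CMD]
[CMD="#"]zpool export tank[/CMD]
[CMD="#"]gnop destroy /dev/gpt/disk2.nop[/CMD]
[CMD="#"]zpool import tank[/CMD]
The new vdev gets ashift=12 while the existing one keeps its ashift=9.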

/Sebulon
 
This is an update on how L2ARC works.

I had upgraded x11/nvidia-driver, but when I ran

# kldunload nvidia

the command just hung, so I rebooted the system.

I had set write_max to 25MB/s. First, while the ARC was warming up, I got the following result.

Code:
$ iostat -xz -w 1 ada0 da6

                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  19.0     0.0  2317.5    0   0.5   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  20.0     0.0  2557.2    0   1.6   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  26.0     0.0  3325.3    0   1.2   2
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  35.0     0.0  4475.7    0   1.4   2
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  14.0     0.0  1678.3    0   0.7   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  24.0     0.0  3069.0    0   1.2   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  19.0     0.0  2429.5    0   0.9   1

We can see from the above that the L2ARC is indeed written in a round-robin fashion: each drive got a write in alternate seconds. (ada0 and da6 together form the L2ARC.)

After the ARC had warmed up, I dd'd a 3G file to /dev/null and got the following result.

Code:
$ iostat -d -w 1 -c 60 ada0 da6
            ada0              da6
  KB/t tps  MB/s   KB/t tps  MB/s
 86.79  18  1.53  105.60  16  1.68
 128.00   1  0.12   0.00   0  0.00
 72.00   2  0.14   0.00   0  0.00
  0.00   0  0.00   0.00   0  0.00
 128.00   1  0.12  128.00   2  0.25
  0.00   0  0.00   0.00   0  0.00
 119.38  39  4.54  128.00  31  3.87
 126.58  45  5.56  127.04  67  8.30
 124.44  54  6.56  114.33  96 10.71
 128.00  43  5.37  105.65  48  4.95
 119.80  41  4.79  128.00  47  5.87
 128.00  88 10.99  123.33 120 14.44
[B] 119.68 148 17.28  128.00 114 14.24
 128.00 293 36.59  125.91 322 39.55
 126.77 729 90.29  128.00 737 92.16
 128.00 543 67.93  126.50 448 55.29[/B]
 128.00   3  0.37  128.00 150 18.73
 128.00   1  0.12  128.00   7  0.87
 128.00   6  0.75  128.00   1  0.12

The bold part shows when turbo warmup kicked in.
 
Sebulon said:
Not quite correct. You need a gnop device for every new vdev in the pool. If your pool consists of 8 drives in one raidz(2,3), you only need one gnop device. If it is 2x 4-drive raidz(2,3) vdevs, you need two gnops. 4x 2-drive mirror vdevs require four gnops. And finally, for a pool with 8x single-drive vdevs (no fault tolerance), all eight need to be gnops :)

Also, because the ashift value is set per vdev instead of pool-wide, it is possible to have, e.g., a 2-drive mirror with ashift=9, and the next time you buy two more hard drives (most probably AF ones) you can add another 2-drive mirror to the pool, with this new vdev getting ashift=12 instead.

/Sebulon

So we use gnop when creating a vdev to make sure ZFS writes in 4K blocks, but once that's done, that's it? And using -b 4096 with GPT plus gnop is enough for alignment?

Sorry, just wanted to make sure I understand this fully :)
 