Yet another ZFS performance thread

Hi all,

So I'm setting up a new server for myself at home. I tend to overbuild things, and in this case I'm using this as an exercise to brush up my FreeBSD skills which are several years out of date. I have no data on the machine at all right now, so this is the perfect time to play around.

I've got a nice Intel motherboard with an i5 and 16 GB of RAM, though for the time being I'm testing with two old 160 GB drives in a zpool mirror (given how absurd hard drive prices are right now).

I've only learned about ZFS in the last two weeks. I like what I've read. I like it a lot. I don't expect it to perform equivalently to UFS, as it does and stores so much more, but these performance impacts are unacceptable. I've got two issues, one of which I've defeated but the other has me stumped.

Under heavy write loads, the system became seriously unresponsive. bonnie++ showed latency in excess of 8,500 msec on some operations. By tuning vfs.zfs.write_limit_override down to 128 - 192 MB (still haven't decided exactly where I want it) I've brought that down to just over 1,000 msec. That's still a lot, but I recognize these are old, small slow disks in a worst-case workload scenario.
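For reference, I'm just setting it at runtime with sysctl(8) (the value is in bytes; the number below is only an example from the range I mentioned, and this assumes your kernel exposes it as a writable sysctl):
Code:
# cap the amount of dirty data ZFS batches per transaction group (~128 MB, example only)
sysctl vfs.zfs.write_limit_override=134217728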

Writing to the drives themselves tells me they're capable of performing somewhere in the 60 MB/sec range. And indeed sequential read performance on the ZFS mirror is about 90 MB/sec. The thing that's killing me is sequential writes - I can't get it over 25 MB/sec, and during a sequential write
# dd if=/dev/zero of=/some/file
I can hear the drives thrashing around. I tried giving ZFS a 4 GB md(4) ZIL, and that made no difference. Sequential writes are affected by the write_limit_override tuning above: small limits mean lower latency (to a point) but also reduced speed; higher numbers improve speed but also increase latency. 25 MB/sec gives me about 1,500 msec latency in bonnie++.
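For the record, the throwaway md(4) log device was set up roughly like this (a sketch; the size, device, and pool names are just placeholders, and it was removed again afterwards):
Code:
# create a 4 GB swap-backed memory disk (prints the md unit, e.g. md0)
mdconfig -a -t swap -s 4g
# attach it to the pool as a dedicated log vdev
zpool add tank log /dev/md0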

Can anyone suggest how to improve ZFS's sequential writes? Also, do these numbers seem reasonable given the hardware involved? Detailed configuration follows:

FreeBSD 9.0-RELEASE
Fresh install
gpt partitioning
p1 256k freebsd-boot (aligned to 4k on disk)
p2 2g freebsd-ufs (gmirrored /boot)
p3 freebsd-zfs (zfs mirrored zpool (/) on geli)

GELI is using 4k sectors and hardware AES-NI crypto on the i5 (aesni(4)). CPU usage does not appear to be an issue. ZFS is not using dedup or compression.
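For completeness, the GELI setup was roughly this (a sketch from memory; the device name and key length here are just examples):
Code:
# load the AES-NI crypto driver, then initialise the provider with 4k sectors
kldload aesni
geli init -s 4096 -l 256 ada0p3
geli attach ada0p3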

If anyone wants more details, ask and I'll provide as soon as I get home from the office.
 
Remove GELI from the equation and test again.

Right now, you aren't testing "ZFS performance". You're testing "ZFS on GELI on GPT on yadda yadda performance".

And, for the love of Pete, please, please, please do not use dd(1) as a benchmarking tool on ZFS. Especially not with /dev/zero.
 
Hi

To get an idea of the actual hardware performance of the disk, use # diskinfo -c -t -v ada1. You can then compare this to the actual performance of your disks under different file systems (preferably using a benchmarking program like bonnie++).

My system's read performance is about 93% of the transfer rate on the middle part of the disk with UFS, and about 132% on a zpool with raidz across 5 disks. Or I might be comparing something totally wrong (if the above command actually measures writes and not reads :e).

Also, since you are trying to check write performance: the write performance of UFS is 92% of the read performance, and on ZFS it is 67% (without dedup etc) on my system.

regards
Malan
 
Goose997 said:
Hi

To get an idea of the actual hardware performance of the disk, use # diskinfo -c -t -v ada1. You can then compare this to the actual performance of your disks under different file systems (preferably using a benchmarking program like bonnie++).
Thanks for this. Very helpful. My two disks are the same rotation speed and manufacturer (Seagate), but different generations. Transfer rates are almost the same, but one is a much zippier seeker than the other. I'm confused now because bonnie++ seems to be able to achieve higher transfer rates on these disks than they appear to be capable of. Is compression turned on by default on ZFS? (To the best of my knowledge, it's not.) Now that I think about it, this might be ZFS showing off its aggressive caching. I should probably use more than a 32 GB test size. Which invalidates my test results thus far. Phooey!
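Checking it is easy enough, for the record (the dataset name here is just an example):
Code:
# show whether compression is enabled on a dataset; the default is "off"
zfs get compression tank/data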
Goose997 said:
My system's read performance is about 93% of the transfer rate on the middle part of the disk with UFS, and about 132% on a zpool with raidz across 5 disks. Or I might be comparing something totally wrong (if the above command actually measures writes and not reads :e).
This is getting a little off-topic, but to my understanding read and write performance are roughly equivalent on magnetic drives. Drives have separate read and write elements, with the read head positioned after the write head, so your drives are constantly reading back their own recording to verify as they go. Both reads and writes are limited by the time taken to get the head into place and wait for the media to pass underneath.

Goose997 said:
Also, since you are trying to check write performance: the write performance of UFS is 92% of the read performance, and on ZFS it is 67% (without dedup etc) on my system.
Can you tell me what the latencies reported by bonnie++ are? Now that I've rerun my benchmarks without tuning zfs, the actual transfer rates are perfectly acceptable - better than the disks can achieve somehow, which I don't understand as compression is off - but the latency is horrific. bonnie++ is showing me latencies measured in the thousands of msec.
 
phoenix said:
Remove GELI from the equation and test again.

Right now, you aren't testing "ZFS performance". You're testing "ZFS on GELI on GPT on yadda yadda performance".

Thank you for the reminder to Keep It Simple Stupid. I need one of those from time to time.
I've found a USB harddrive to install the system on, freeing me to just run filesystems on the bare disk devices for testing.

phoenix said:
And, for the love of Pete, please, please, please do not use dd(1) as a benchmarking tool on ZFS. Especially not with /dev/zero.

May I ask why this is? If you wish to evaluate sequential reads or writes to or from a filesystem, this seems an appropriate tool, assuming compression and dedup are not enabled, of course. Nowhere near as informative as something like bonnie++, but an appropriate quick test to see whether sequential performance is sane, or to evaluate the interactive impact of maxed-out sequential I/O.
 
Because the default settings for dd are extremely pessimistic (reading/writing 512 bytes at a time), and determining the correct bs= option is a black art. For example, bs=1M will be a *HELL* of a lot faster than the default. Sometimes bs=16M is even faster. Other times it's slower. It depends on the drives.
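As a rough illustration on a raw device (the device name and sizes are examples only):
Code:
# the same 1 GB read, first with the 512-byte default, then with 1 MB blocks
dd if=/dev/ada1 of=/dev/null count=2097152
dd if=/dev/ada1 of=/dev/null bs=1m count=1024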

Using dd to read from a raw disk (if=/dev/da0) is okay and will completely bypass the filesystem, so you get a decent approximation of the drive's overall speed. Using dd to read from a file in a filesystem like ZFS ([b]if=/path/to/some.fil[/b]) will not give you accurate results. The first time, you may be measuring how fast data can be read from the disk into the ARC. The second time, you're measuring how fast data can be read from the ARC. So you have to make sure you use a file that's a lot bigger than your RAM. ZFS settings like compression, dedup, ARC size, L2ARC, primarycache/secondarycache, prefetch, etc. also have very different effects on dd.

Using dd to write zeroes to ZFS doesn't do what you think it does, as ZFS does a lot of stuff in the background to more efficiently write (or not write, as the case may be) files full of zeroes (aka, sparse files). And enabling any of the compression settings makes it even less valid. If you're going to use dd to test write speeds, then you need to read from an existing file, and not /dev/zero, so that you are writing actual data to the filesystem. And you'll want to read it a few times first, in order to make sure it's in RAM/ARC, otherwise you are testing the read+write speed, and not just write speed (meaning the reads may slow down the writes). And you definitely have to make sure you aren't reading from the same drive(s) as you are writing to.
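If you insist on dd, a less-misleading write test looks something like this (the paths are placeholders; the source file sits on a different disk from the pool under test and is read once first so it's already cached):
Code:
# prime the cache with the source file (stored on a different disk than the pool)
dd if=/otherdisk/testfile of=/dev/null bs=1m
# now write real, already-cached data to the pool instead of zeroes
dd if=/otherdisk/testfile of=/tank/testfile bs=1m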

dd is not, has never been, and should not be shoe-horned into being, a benchmarking tool.
 
So I think I'm making some headway here. Now that I'm testing on bare disk (just the faster one) with a dataset large enough to mitigate caching, zfs actually doesn't look that bad at all. And I think I might have a handle on the latency issue.

For the record, my previous posting about vfs.zfs.write_limit_override was barking up the wrong tree; it decimates sequential write performance and is an all-around bad idea that no one should listen to.

This is the performance of UFS w/ SoftUpdates on the bare drive. I'm treating this as my reference performance. The outside edge manages just over 70MB/sec on this drive.
Code:
pegasus# cat ufs/single/su.txt
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pegasus.skyhawk 96G  1011  99 52465   5 27084   5  1991  99 57264   4  84.7   1
Latency              8335us     853ms   17357ms    4900us   95381us    5571ms
Version  1.96       ------Sequential Create------ --------Random Create--------
pegasus.skyhawk     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 12232  13 136235  51 147860  98 31032  33 230164  99 180042  99
Latency               208ms     404ms      56us     171ms      25us      16us
Not bad, though the latency on Sequential Rewrite is nuts.

This is ZFS on the same drive without tuning:
Code:
pegasus# cat zfs/single/notweak.txt
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pegasus.skyhawk 96G   159  98 46392   6 24573   3   395  84 66568   3  57.1   1
Latency             45948us    8571ms    4499ms     755ms    1276ms    2011ms
Version  1.96       ------Sequential Create------ --------Random Create--------
pegasus.skyhawk     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 24054  94 56060  99 24831  97 25557  99 56084  98 24219  98
Latency               132ms     154us     472us   68248us   24250us     609us
These are pretty awesome numbers. ZFS sucks at per-character IO, but who cares? I like these numbers. Faster than UFS in some cases, and while rewrite latency is very high it's less crazy than UFS' rewrite latency. But look at that sequential output latency - 8.5 seconds. That's when your system takes 30 seconds to render a manual page.

But lookie what I found here.... This is with vfs.zfs.txg.synctime_ms tweaked from its default of 1000 down to 750.
Code:
pegasus# cat zfs/single/synctime_750.txt
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pegasus.skyhawk 96G   165  98 45794   6 24258   3   388  89 65299   3  53.8   2
Latency             48700us    3465ms    3833ms     449ms    1326ms    1818ms
Version  1.96       ------Sequential Create------ --------Random Create--------
pegasus.skyhawk     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 24461  93 53064  99 23586  96 22688  94 83625  99 26582  98
Latency             84425us     129us   49778us     121ms   14043us     231us
Negligible impact on throughput, with a huge reduction in latency.
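For anyone following along, the tweak itself is just the one sysctl (if it's not writable at runtime on your system, it should be settable from /boot/loader.conf instead):
Code:
# shorten the txg sync target from the default 1000 ms to 750 ms
sysctl vfs.zfs.txg.synctime_ms=750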

I'm testing the variable set to 500 right now. As I'm sure you can imagine, bonnie++ runs with a 96GB dataset take a little while...
 
starslab said:
Can you tell me what the latencies reported by bonnie++ are? Now that I've rerun my benchmarks without tuning zfs, the actual transfer rates are perfectly acceptable - better than the disks can achieve somehow, which I don't understand as compression is off - but the latency is horrific. bonnie++ is showing me latencies measured in the thousands of msec.

My numbers (5x 2 TB Seagate in raidz + ZIL + L2ARC, no compression or dedup):
Code:
Version      1.96   ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
yablonski2.msho 32G   158  95 144197  20 89745  14   468  99 183498  11 418.4  11
Latency             55823us    7510ms    1649ms   30508us     733ms     704ms
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
yablonski2.mshom 16 31374  90 +++++ +++ 31007  98 +++++ +++ +++++ +++ 32576  98
Latency              7764us      79us    4801us   10292us      23us    4028us

My latencies look lower, but there are too many differences in setup here for the comparison to be helpful. I think what should make a big performance difference is a third disk in raidz (2x storage, 1 parity), since performance should then be nearly double that of the mirror setup?
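Something like this would give you that layout (the pool and device names are only placeholders):
Code:
# three-disk raidz: roughly two disks of capacity plus one disk of parity
zpool create tank raidz ada1 ada2 ada3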

In terms of the effect of latency, I have not noticed it in practice; the only time is occasionally when opening a large directory for the first time via a Samba share. But that could also be Windows caching and search-indexing the directory, so unfortunately I have nothing better than anecdotal evidence.

regards
Malan
 
I've just noticed something disturbing - I ran bonnie++, noted the results, then ran the exact same command again without changing anything, and the reported sequential write latency was different by about 3000 msec. This benchmarking may be a waste of time.
 
I'm not sure if this helps at all, but I'll throw it out. When I built my first FreeBSD ZFS server I rsynced about 1 TB over to the ZFS pool from a couple of existing UFS arrays. What I found was that the growing UFS filesystem cache put memory pressure on the system, which shrank the ZFS ARC to the point where ZFS slowed to a crawl. I noticed the ZFS pool drives were seeking a lot even when large sequential files were being copied to them. This was on a system with 4 GB of RAM, a hardware-based array with UFS, and an 8 x 1 TB RAIDZ2 pool.

The ugly hack that fixed it and let me copy files over at a reasonable speed was a script that attempted to unmount the UFS filesystem(s) every minute. Because the filesystem was busy it wouldn't actually unmount, but the UFS cache was released, so it never grew enough to push the ARC out of RAM.
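Roughly this, if memory serves (the mount point is just an example):
Code:
# attempt to unmount the busy UFS filesystem once a minute; the unmount fails,
# but it still releases cached buffers so the ARC isn't squeezed out of RAM
while true; do
    umount /mnt/ufs0 2>/dev/null
    sleep 60
done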

I'm not sure if your benchmarking is creating a similar situation and I know it's been said over and over but ZFS needs lots of memory and if something else is allocating as much as it can get you'll see strange behavior.
 
So I've found out what causes the latency. I finally got around to getting Samba up on this machine, so I can throw files to and from ZFS from Windows.

It's the ZIL. This surprised me because I had gotten the impression that the ZIL only seriously impacted performance on heavy synchronous write loads like databases.

If I run [cmd=]zfs set sync=disabled pegasus-root/data/software[/cmd] then bonnie++ produces the following result:
Code:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pegasus.skyhawk 96G   188  99 94259  10 47274   6   408  91 106286   5 129.4   1
Latency             46344us     928ms    3352ms     469ms    2925ms    1160ms
Version  1.96       ------Sequential Create------ --------Random Create--------
pegasus.skyhawk     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 28660  99 59655 100 24338  98 26555  99 55129  92 22447  99
Latency             62734us     118us   36769us   73585us     105ms    1721us
Sending a large file from Windows to Samba also clearly shows the difference - steady high throughput.

Same test with sync=standard shows high throughput, followed by a several seconds long total stall, followed by high throughput, followed by a several seconds long total stall.
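The burst-and-stall pattern is easy to watch directly while the copy runs:
Code:
# one-second samples of pool bandwidth; the stalls show up as near-zero write rows
zpool iostat pegasus-root 1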

I'd experiment with a log drive, but I've just learned a root pool cannot have multiple vdevs or separate logs. I'm glad I learned this now rather than in a year when I add two more drives to this machine.
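On a non-root pool the experiment would just be (pool and device names are placeholders):
Code:
# attach a dedicated log (SLOG) device to an existing pool
zpool add tank log ada3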
 