ZFS: Worse read performance on striped mirror vdevs vs. striped single-disk vdevs

I'm putting together a small ZFS file server using 9.0-RC1, and I'm noticing strange sequential read performance behavior. When I put two drives in a pool, each as its own vdev, I actually get better read performance than with four drives in a pool arranged as two mirror vdevs! This goes completely against the common wisdom that striped mirrors provide really good read performance, and I would expect the mirrored layout to perform at least as well, so something must be horribly wrong here.

Here is my system configuration:

Motherboard: Supermicro X8SIL-F w/ Intel 3420 chipset + 6 AHCI SATA ports
CPU: Intel Core i5-650 (dual core)
Memory: 4GB DDR3-1333 ECC (more is on the way)
Hard drives: 2x Seagate ST2000DL003-9VT166 (Barracuda Green), 2x Samsung HD204UI (F4 EcoGreen). All 2 TB, all Advanced Format (4K sectors).

zpool version 28, ashift=12 via the gnop trick
filesystem version 5
Checksum: fletcher4, compression: off
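For what it's worth, a quick way to confirm the gnop trick actually took is to check the cached pool configuration with zdb; it should report ashift: 12 for each top-level vdev (just a sanity check, not shown in the outputs below):

Code:
# zdb | grep ashift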

Drive detection:

Code:
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST2000DL003-9VT166 CC3C> ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad8
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <ST2000DL003-9VT166 CC32> ATA-8 SATA 3.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad10

We start out with a simple non-redundant pool with two drives in separate vdevs:

$ zpool status
Code:
  pool: zroot
 state: ONLINE
 scan: resilvered 11.9G in 0h1m with 0 errors on Sun Oct 23 23:21:19 2011
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  ada1      ONLINE       0     0     0
	  ada0      ONLINE       0     0     0

errors: No known data errors

We create a 10 GB test file, which comfortably exceeds the maximum possible ARC size:

$ dd if=/dev/random of=random.bin bs=1m count=10000

Now we wait out the 30-second transaction flush period so the pool is completely quiet, and then we read the file back:

$ dd if=random.bin of=/dev/null bs=1m

Code:
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 41.524555 secs (252519503 bytes/sec)

Repeating the command yields the same results.
Seeing that both drives do at least 130 MB/sec separately (the Seagate being a bit faster at 145 MB/sec), I think this is a very nice figure.
$ zpool iostat -v 1 indeed shows that both vdevs handle half of the traffic. Great stuff. CPU utilization hovers around 11-12% (system).

Now we make it a redundant pool, with the Seagates in one mirror vdev and the Samsungs in the other:

# zpool attach zroot ada0 ada2
# zpool attach zroot ada1 ada3

After resilvering, it looks like this:

$ zpool status
Code:
  pool: zroot
 state: ONLINE
 scan: resilvered 11.9G in 0h1m with 0 errors on Mon Oct 24 00:11:57 2011
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada1    ONLINE       0     0     0
	    ada3    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    ada0    ONLINE       0     0     0
	    ada2    ONLINE       0     0     0

errors: No known data errors

Again we wait a bit to be sure there are no pending write operations, and we try to read the file again:

$ dd if=random.bin of=/dev/null bs=1m

Code:
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 74.110371 secs (141488429 bytes/sec)

So we see that a generous 100 MB/sec has just vanished into the big void of emptiness. This time, $ zpool iostat -v 1 reports that each of the four drives handles a 35 MB/sec share.
CPU utilization is less stable in this test, fluctuating from 7% to 17%... also a bit strange.

As for write performance, both pool configurations manage about 220 MB/sec when writing a zero-filled file like this:

$ dd if=/dev/zero of=zero.bin bs=1m count=10000

Reading this zero-filled file instead of the random-filled one, or mixing drive brands within one mirror vdev, makes no difference either.

Lowering the vfs.zfs.vdev.min_pending and vfs.zfs.vdev.max_pending tunables (since NCQ is already active) does not make any difference either.

So I am hitting a roadblock here. Does anyone have experience with such a performance difference, and can you advise how to solve it or which tunables I should set? Keeping the pool non-redundant is obviously not an option, and I want to pop in another mirror pair later if I need more storage space.
I cannot see that the CPU is the bottleneck here, and if it was the RAM, I would think I would see at least the same throughput using mirrors...

Thanks for any help you can give me.
 
Can you destroy that pool and recreate one from scratch with a single mirror vdev using ada0 and ada1? In theory that should provide the same read performance as your original stripe, and since it uses the same physical devices it rules out any possible hardware issues from the test...?

thanks Andy.
 
I started from scratch:

# gnop create -S 4096 ada0
# gnop create -S 4096 ada1
# zpool create -o altroot=/tmp/zmirror zmirror mirror ada0.nop ada1.nop

This mirror performs exactly the same as a single-drive vdev, at around 130 MB/sec (or 145 MB/sec, depending on the drive brand). It does not appear to be load balancing reads at all, even though zpool iostat reports each drive handling half the IOPS. So the bottom line is that I cannot get the pool to exceed the read speed of a single drive whenever there is a mirror in the pool.

Same story with any other set of two drives in a mirror, just to exclude a possible flaky drive.
I'm sure I'm not saturating the Intel 3420 chipset bandwidth with just one drive, since I do get a speedup when simply striping without any mirror present in the pool.

As an attempt to maximize throughput, I finally created a 4-way vdev stripe of all four drives:
# zpool create -o altroot=/tmp/zstripe zstripe ada0.nop ada1.nop ada2.nop ada3.nop

Reading from that results in 325 MB/sec throughput, so that at least approaches the value I would expect...

Mirrors are supposed to distribute read load as well, right? Even if not, I would still expect two striped mirrors to perform at least as well as two striped single-disk vdevs...
 
Hmmm, well I haven't studied mirror read performance before, but this certainly isn't what I'd have expected.
Can you also look at the percent busy on the disk devices while doing this? Are they maxing out? I.e. use gstat.
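Something like this should do, filtered down to just the four drives (adjust the pattern to your device names):

Code:
# gstat -f '^ada[0-3]$'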

Andy.
 
Yeah, that looks fine. For more throughput, I would need to populate some PCI-E slots with the right controller cards.

For now I would already be happy to be able to max out the on-board controller using four drives in a striped mirror setup, or at the very least see a read performance improvement over a single drive :)
 
I did some measurements with gstat for a number of different configurations:

Stripe of 4 single-disk vdevs:
Code:
ada0: 46%, 76 MB/sec
ada1: 32%, 76 MB/sec
ada2: 46%, 76 MB/sec
ada3: 32%, 76 MB/sec

The Seagates (ada1, ada3) are a bit faster, so that could explain the load difference. The SATA bus looks saturated nicely.


Mirror of two different drive models:
Code:
ada0: 60%, 55 MB/sec
ada1: 50%, 55 MB/sec

Now this is strange... the throughput is nowhere near the maximum capabilities of bus or drive.


Mirror of identical drive models:
Code:
ada0: 56%, 55 MB/sec
ada2: 56%, 55 MB/sec

Drives should be fully loaded, and they aren't. Same result for ada1 and ada3, only with a little less load (50%), probably because they are faster drives.


Stripe of two drives of different models:
Code:
ada0: 80%, 105 MB/sec
ada1: 65%, 105 MB/sec

Stripe of two drives of identical models:
Code:
ada0: 72%, 105 MB/sec
ada1: 72%, 105 MB/sec

Striping clearly increases drive utilization, though still nowhere near optimal. Also curious that a modest load increase results in double the throughput...


All tests done by reading a 10 GB file using dd to /dev/null with a block size of 1 MB. Decreasing or increasing the block size did not yield any better results, and actually made it worse when decreasing it too much.

Writing is fine... when writing a file to a mirror, both drives are completely maxed out.
 
OK, I've just tested this on a server I have with exactly the config you want: one pool with two 2-way mirrors. I did the same 10 GB test file, and I get 100% busy on the disks during the read, as you would expect. So there seems to be some issue with your system specifically rather than a general problem with ZFS.
My system is an old P4 Xeon based Dell server with 7200 rpm SATA disks running FreeBSD 8.2-RELEASE and ZFS v15. I'd be surprised if the difference was due to FreeBSD 9 or the newer ZFS version, but unless you have anything else to test, this might be worth checking out, i.e. boot from an 8.2 CD and do some ZFS testing...

Andy.

PS: 10485760000 bytes transferred in 51.437944 secs (203852626 bytes/sec)
 
When I get home I'll test things using 8.2-RELEASE and also try my Athlon 64 desktop machine with the same drives; its CPU should handle this test fine and it has the same amount of RAM installed.

Also going to grab a pair of 7200 RPM drives to see if these lower-RPM Green drives could be an issue.

Thanks for the help so far.
 
Hm, here the Seagates actually perform better than the Samsungs when used individually, getting 145 MB/sec sequential read speeds when directly reading the ada device... And the Samsungs have the same mirror problem here :/

And I've just tested four Seagate Constellation ES 7200 RPM drives, and I still get single-drive speed when using mirrors in the pool.

Next test: hooking up the drives to my AMD64 desktop machine and testing them there. Will be interesting.
 
In the gstat output, what's shown for the L(q) column (length of the command queue, or number of requests in the queue)? Ideally, this should be 0, meaning the drive is processing commands as fast as they come in and the system is not waiting on the drive. If this is greater than 0, then the drive is processing things slower than the system is sending them and things are bottlenecked.

You can play with the vfs.zfs.vdev.max_pending sysctl to try and bring L(q) down to 0. The default is 10. On the zfs-discuss mailing list, various people recommend setting it to 2 for SATA disks, and definitely no higher than 4. Depends on the quality of the NCQ implementation, speed of the controllers, speed of the drives, and size of the drive's onboard cache.
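For example, to try the recommended value of 2 at runtime (it can also go in /boot/loader.conf as vfs.zfs.vdev.max_pending="2" to apply at boot):

Code:
# sysctl vfs.zfs.vdev.max_pending=2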
 
So I just tested four drives in my other machine using an 8.2-RELEASE DVD, striped mirrors like this:

mirror ada0 ada1 mirror ada2 ada3

and still the same issue. This time I also zpool offlined one disk of each mirror during the test, and guess what: the pool performance doubled instantly, with each remaining drive handling four times the throughput it did before! So with all four drives online I get 35 MB/sec on each drive, and with only one member of each mirror online I get > 110 MB/sec on both drives. I'm lost.
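For reference, the offline/online dance was along these lines (the pool name here is just a placeholder for whatever the test pool is called):

Code:
# zpool offline ztest ada1
# zpool offline ztest ada3
... run the dd read again ...
# zpool online ztest ada1
# zpool online ztest ada3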

L(q) is mostly 0 during all of these tests, with the occasional peak to 1 or 2. However, when writing (e.g. /dev/zero to a file, compression=off), all drives stay at 100% load with L(q) = 10 the whole time, with a total throughput of > 200 MB/sec. I can actually write data faster than I can read it back.
 
With reference to my last post, it might be worth disabling AHCI (and therefore NCQ) and seeing if that affects performance...
 
My Athlon system does not have an AHCI option at all, so there's no NCQ there; it's a shared SATA bus. I enabled AHCI mode on my new system right after I got it, so I will switch it off and see if it makes a difference there.

I do have an old Promise TX2 SATA300 PCI card here which I can test in both systems, but it only has two SATA ports. Still, I could at least move two drives onto it to see what happens.
 
The Intel specification for the 3420 chipset (the chipset used on the X8SIL-F board) does mention the presence of FIS at the SATA controller as part of its AHCI function (pages 18 and 76):

http://www.intel.com/content/dam/doc/design-guide/5-and-3400-chipset-specification-update.pdf

...so I would expect NCQ to be functioning correctly there. When I get home I'll plug in that PCI card and move one half of each mirror onto it to see if it load balances. The only thing I recall is that good old PCI only does 133 MB/sec...
 
I was suggesting it perhaps used something like command-based switching; FIS-based switching is the better, more complex option. Anyway, command-based switching, FIS-based switching and NCQ aren't mutually exclusive. The document I found on your motherboard does show that a single SATA bus is shared between multiple ports, so I was guessing that however the SATA chipset achieves this, it may be command-based switching or suffer from similar performance issues as command-based switching. Those being, quoting from Wikipedia:

The controller can issue commands to only one disk at a time and cannot issue commands to another disk until the command queue has been completed for the current transactions

I think it's fair to assume bandwidth sharing is going to negatively impact performance to some degree; the logic used to do that sharing will define how negatively. For the record, the benchmark I posted earlier uses an FIS-based port multiplier and a PCI-X SATA card, so it's not really a high-end config, but it may well be better than some onboard SATA systems.

With your 2-port SATA card, the thing I'd be most interested to see is, for each of the two mirrors, one half of the mirror onboard and the other half on the PCI card.

ta Andy.
 
@Sfynx

I find this benchmark very interesting, but it's a shame you're without better hardware to test I/O with. I happen to have a 2-port SATA II Highpoint PCI-e x1 card I got in the same package as a 3TB WD Green drive I bought a while back - it's just collecting dust. If you're interested, I could donate it to the cause?

/Sebulon
 
With two drives on the PCI card (one from each mirror), the read performance from each drive is still ~35 MB/sec, again totaling only 140 MB/sec.

It completely kills write performance in that setup though: ~60 MB/sec vs. > 200 MB/sec when everything is onboard :p


By the way, if I bypass ZFS altogether by dd-ing drive contents to /dev/null from all ada devices at the same time, gstat reports four fully busy drives at around 100 MB/sec each. That would indicate that ZFS is somehow the bottleneck, but seeing that the CPU is barely busy at all during any ZFS test, I doubt the difference would be this dramatic...
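For the record, that raw-device test was roughly this, one dd per drive running in parallel (the count just bounds each read to about 10 GB):

Code:
# dd if=/dev/ada0 of=/dev/null bs=1m count=10000 &
# dd if=/dev/ada1 of=/dev/null bs=1m count=10000 &
# dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &
# dd if=/dev/ada3 of=/dev/null bs=1m count=10000 &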

Sebulon said:
@Sfynx

I find this benchmark very interesting, but it's a shame you're without better hardware to test I/O with. I happen to have a 2-port SATA II Highpoint PCI-e x1 card I got in the same package as a 3TB WD Green drive I bought a while back - it's just collecting dust. If you're interested, I could donate it to the cause?

/Sebulon


Thanks for the offer, but at this point I'm honestly not sure if it would make any difference, seeing that I get the same results on two different systems with 3 different SATA controllers using both 'green' and 7200 RPM drives.
At work we have a 3Ware 8-port SATA PCI-X card which pushes 100 MB/sec for each of the six connected drives without a problem.
 
Interesting development here.

I started from scratch: all four drives on the on-board controller, AHCI mode on, and a 4K-sector v28 pool with striped mirrors created from the FreeBSD 9.0-RC1 LiveCD, with explicit checksum=fletcher4 and compression=off settings on the root file system. I dd'ed a 10 GB file to it, and reading it back gives the usual bad ~30 MB/sec per drive (visible with zpool iostat and gstat). Lowering vfs.zfs.vdev.min_pending and vfs.zfs.vdev.max_pending makes no difference.
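Roughly, the setup was along these lines (the pool name and mirror pairing are just illustrative; the .nop providers only matter at creation time so that ZFS picks ashift=12):

Code:
# gnop create -S 4096 ada0
# gnop create -S 4096 ada1
# gnop create -S 4096 ada2
# gnop create -S 4096 ada3
# zpool create ztest mirror ada1.nop ada3.nop mirror ada0.nop ada2.nop
# zfs set checksum=fletcher4 ztest
# zfs set compression=off ztest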

So I exported the pool, booted an OpenIndiana installation CD, imported the pool there, and dd-read the same file with the same block size. And there we go: zpool iostat reports 70 MB/sec per drive and everything completes twice as fast.

So apparently something is up with FreeBSD's ZFS implementation combined with my hardware combo...
 
I would also be very interested to see the same tests performed on 8.2-RELEASE and also on 8-STABLE, to see whether this is due to the v28 implementation in general, or restricted to FreeBSD 9.

/Sebulon
 
8.2-RELEASE (pool v15, fs v4) also fails to load-balance mirrors during reads. Will see what happens with 8-STABLE.
 
And I got a breakthrough, feeling a bit stupid for not trying this earlier:

vfs.zfs.prefetch_disable was set to 1 automatically because slightly less than 4 GB of RAM is available to FreeBSD. Setting it to 0 myself finally gives the excellent read speeds I also saw under OpenIndiana: a steady 290 MB/sec read from the pool. This also explains why writes were not affected.
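In case anyone else runs into this, flipping it back is simply (assuming the sysctl is writable at runtime on your build):

Code:
# sysctl vfs.zfs.prefetch_disable=0

and to keep it across reboots, the corresponding /boot/loader.conf line would be:

Code:
vfs.zfs.prefetch_disable="0"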

I never expected this setting to make that big a difference on mirrors, but in my case it clearly does.

This rig will have 16 GB of RAM fairly soon, I expect that to be enough to leave prefetching enabled.

By the way, when I scrub the pool, it reports > 200 MB/sec scrub performance regardless of prefetch settings.
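For completeness, the scrub figure comes from simply kicking one off and watching the status (substitute your pool name):

Code:
# zpool scrub zroot
# zpool status zroot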
 