ZFS Single drive pool faster than 8 drive stripe?

I have a weird problem with ZFS read/write speeds. When I create a single drive pool, I'm getting better results than when I create an 8 drive stripe pool. How is that possible? Whatever I do, I can't get the expected results. The server is an IBM x3620 M3 with a ServeRAID M5014 SAS/SATA controller (I've tried creating JBOD with mfiutil and with the mfip driver) and eight 4 TB SAS drives.

Code:
root@spin:~ # zpool create storage /dev/da0.nop

root@spin:/storage # dd if=/dev/zero of=./test.dat bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 16.746805 secs (256464875 bytes/sec)

root@spin:/storage # dd if=./test.dat of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes transferred in 29.072411 secs (147733442 bytes/sec)
root@spin:/storage #


root@spin:~ # zpool create storage /dev/da0.nop /dev/da1.nop /dev/da2.nop /dev/da3.nop /dev/da4.nop /dev/da5.nop /dev/da6.nop /dev/da7.nop

root@spin:/storage # dd if=/dev/zero of=./test.dat bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 22.161224 secs (193805512 bytes/sec)

root@spin:/storage # dd if=./test.dat of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes transferred in 29.321912 secs (146476373 bytes/sec)
root@spin:/storage #
 
See what you get with a proper performance test like bonnie. Using a stream of null bytes from /dev/zero usually gives completely spurious results; among other things, if compression is enabled on the dataset, zero-filled blocks compress to almost nothing and you end up measuring RAM and CPU rather than the disks.
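For example, with bonnie++ from ports (just a sketch; adjust the file size to at least a couple of times your RAM, and -u is required when running as root):

Code:
root@spin:~ # pkg install bonnie++
root@spin:~ # bonnie++ -d /storage -s 64g -u root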

Also, with fairly recent FreeBSD versions you can forget all the .nop stuff and just set the sysctl vfs.zfs.min_auto_ashift=12 to force a minimum 4 KiB sector size (ashift=12) on newly created pools.
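Something like this (sketch only; set it before creating the pool, and add it to /etc/sysctl.conf if you want it to survive a reboot):

Code:
root@spin:~ # sysctl vfs.zfs.min_auto_ashift=12
root@spin:~ # zpool create storage da0 da1 da2 da3 da4 da5 da6 da7
root@spin:~ # zdb -C storage | grep ashift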
 
You are probably not measuring the speed of your disks. How do I know that? The very first test claims to have written to a single disk at 256.5 MB/s. Unfortunately, real-world spinning disks top out at roughly 150-200 MB/s of sequential throughput (it's theoretically possible that you are using high-performance SSDs, but I doubt it). What you probably really measured is not the speed of your disks but the speed of the buffer cache (the data you wrote in the write test was probably mostly still in memory) and the performance of the file system code itself.
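A quick sanity check is to watch what the disks themselves are doing while dd runs, for example:

Code:
root@spin:~ # zpool iostat -v storage 1

If dd reports 256 MB/s but the per-disk bandwidth shown there is much lower, or the writes keep going for several seconds after dd has exited, you're seeing cached/asynchronous writes rather than raw disk speed.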

The other thing is that your read test uses a different and very small block size. You wrote the file using 1 MiB writes (which is good; that IO size is large enough to probably give decent performance, although for optimal results you may want to tune your IO size to match what the file system really wants). But then you read the file using dd's default block size of 512 bytes, which is why the read shows 8388608 records. That means that for every 512-byte block, your application (dd running in user space) had to perform a read() system call, go into the kernel, and get the data from there. That's very inefficient. If you're trying to measure system performance, you'll have to use sensible block sizes for the reads too.
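For instance, simply repeating the read with an explicit block size removes most of that per-syscall overhead:

Code:
root@spin:/storage # dd if=./test.dat of=/dev/null bs=1M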

From that viewpoint, it even makes some sense that the 8-way striped pool is somewhat slower: the ZFS code needs to do more work to distribute the data across multiple disks, and none of the extra parallelism shows up while you're mostly hitting the cache anyway.

Suggestions for repeating the test:

(a) Make sure the data is flushed to disk at the end of the write, and include the time needed for that flush in your measurement. I don't think dd can do it, so you'll either have to write a small read/write program yourself that uses the fsync() call, or use other performance testing tools (see the fio sketch below).

(b) Make sure your read tests are actually reading from disk, not from memory. The most reliable way to do this is to reboot before reading; there are other ways to completely purge the cache before reading, but they're harder to get right.

(c) The dd command simulates a very simple workload: reading one big file in one fell swoop, with uniformly large IOs. At the same time it is completely single-threaded (no parallel or asynchronous reads) and has no think time. Even for single-file IO that's pretty unrealistic. And in reality, I'm sure you didn't buy an expensive server and 8 even more expensive disk drives just to run dd on them. You really need to benchmark your actual workload, which probably has a mix of small and large files, a wider mix of IO sizes and of reads and writes, probably quite a few metadata operations, and most importantly more than one IO thread. The results of micro-benchmarks such as dd can be very misleading.
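As a rough sketch of what such a test could look like with fio (just an example; the file size, block size and thread count below are placeholders you'd adapt to your real workload), end_fsync makes each job fsync its file so the flush time is included in the result:

Code:
root@spin:~ # pkg install fio
root@spin:/storage # fio --name=seqwrite --directory=/storage --rw=write --bs=1m \
    --size=4g --numjobs=4 --end_fsync=1 --group_reporting
root@spin:/storage # fio --name=seqread --directory=/storage --rw=read --bs=1m \
    --size=4g --numjobs=4 --group_reporting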
 