ZFS performance issue

Hi all..

I have a performance problem with ZFS that I just can't seem to resolve.

I have an external four-drive eSATA enclosure which I'm using essentially as a NAS on my home network. Each of these drives, when accessed directly with dd, can easily eclipse 100 MBps throughput. When I create a ZFS pool on them, in any configuration, the performance tanks dramatically. This is true of all four drives, which are four different models from four different manufacturers (Seagate, Samsung, WD, Hitachi).

My end goal is a pool containing four disks in two mirrors (essentially RAID 10). The performance seems limited to roughly 60-70 MBps regardless of configuration. I've tried:
  • A pool containing a single disk.
  • A pool containing two disks striped.
  • A pool containing two disks in a mirror.

A single disk under ZFS will sustain about 60-70 MBps read. Two disks in a ZFS mirror or stripe configuration will together sustain the same read throughput of about 60-70 MBps (each disk will only read at about 30 MBps). If I launch two instances of dd reading from the same two raw devices they will saturate the SATA bus. I have tried both giving the whole disks directly to ZFS to manage and using GPT partitions to ensure 4k alignment on two of the drives. No significant difference.
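For reference, the pools were created along roughly these lines (device and pool names below are just examples, not my exact commands):
Code:
# single disk
zpool create test ada2
# two-disk stripe
zpool create test ada2 ada3
# two-disk mirror
zpool create test mirror ada2 ada3
# mirror on 1MiB-aligned GPT partitions (the variant used for the 4k drives)
gpart create -s gpt ada2
gpart add -t freebsd-zfs -a 1m ada2
gpart create -s gpt ada3
gpart add -t freebsd-zfs -a 1m ada3
zpool create test mirror ada2p1 ada3p1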

CPU: Intel(R) Core(TM)2 CPU 4300 @ 1.80 GHz (1794.23 MHz K8-class CPU)
Real memory = 6442450944 (6144 MB)
Installed version of FreeBSD is 9.1-RELEASE.

I'm about to migrate this server either to FreeBSD 10 or to Linux. I'd prefer to keep it on FreeBSD, but I don't see this performance issue with ZFS on Linux. As it stands my ZFS array can't even saturate my gigabit Ethernet, which is unacceptable.

Below are some ZFS details. I know atime is enabled on the mount, but I'm testing on a single file so it's only one access time update, and the other ZFS mirror in that chassis has atime disabled.

Thanks in advance for your time and thoughts!
Code:
# zfs get all test
NAME  PROPERTY              VALUE                  SOURCE
test  type                  filesystem             -
test  creation              Wed Feb 26 16:00 2014  -
test  used                  9.77G                  -
test  available             2.67T                  -
test  referenced            9.77G                  -
test  compressratio         1.00x                  -
test  mounted               yes                    -
test  quota                 none                   default
test  reservation           none                   default
test  recordsize            128K                   default
test  mountpoint            /test                  default
test  sharenfs              off                    default
test  checksum              on                     default
test  compression           off                    default
test  atime                 on                     default
test  devices               on                     default
test  exec                  on                     default
test  setuid                on                     default
test  readonly              off                    default
test  jailed                off                    default
test  snapdir               hidden                 default
test  aclmode               discard                default
test  aclinherit            restricted             default
test  canmount              on                     default
test  xattr                 off                    temporary
test  copies                1                      default
test  version               5                      -
test  utf8only              off                    -
test  normalization         none                   -
test  casesensitivity       sensitive              -
test  vscan                 off                    default
test  nbmand                off                    default
test  sharesmb              off                    default
test  refquota              none                   default
test  refreservation        none                   default
test  primarycache          all                    default
test  secondarycache        all                    default
test  usedbysnapshots       0                      -
test  usedbydataset         9.77G                  -
test  usedbychildren        240K                   -
test  usedbyrefreservation  0                      -
test  logbias               latency                default
test  dedup                 off                    default
test  mlslabel                                     -
test  sync                  standard               default
test  refcompressratio      1.00x                  -
test  written               9.77G                  -


ahci0: <Marvell 88SE912x AHCI SATA controller> port 0xdce0-0xdce7,0xdcd8-0xdcdb,0xdce8-0xdcef,0xdcdc-0xdcdf,0xdcf0-0xdcff mem 0xdfbff800-0xdfbfffff irq 16 at device 0.0 on pci1
ahci0: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported



vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 2289741312
vfs.zfs.mfu_ghost_metadata_lsize: 412580864
vfs.zfs.mfu_ghost_size: 2702322176
vfs.zfs.mfu_data_lsize: 30138880
vfs.zfs.mfu_metadata_lsize: 1067008
vfs.zfs.mfu_size: 86690304
vfs.zfs.mru_ghost_data_lsize: 1084065792
vfs.zfs.mru_ghost_metadata_lsize: 356906496
vfs.zfs.mru_ghost_size: 1440972288
vfs.zfs.mru_data_lsize: 2139996160
vfs.zfs.mru_metadata_lsize: 535059456
vfs.zfs.mru_size: 2708894720
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 2408448
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 1262219264
vfs.zfs.arc_meta_used: 2000077656
vfs.zfs.arc_min: 631109632
vfs.zfs.arc_max: 5048877056
vfs.zfs.dedup.prefetch: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.write_limit_override: 0
vfs.zfs.write_limit_inflated: 19024920576
vfs.zfs.write_limit_max: 792705024
vfs.zfs.write_limit_min: 33554432
vfs.zfs.write_limit_shift: 3
vfs.zfs.no_write_throttle: 0
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 0
vfs.zfs.mg_alloc_failures: 8
vfs.zfs.check_hostid: 1
vfs.zfs.recover: 0
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 10
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_replay_disable: 0
vfs.zfs.zio.use_uma: 0
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.version.zpl: 5
vfs.zfs.version.spa: 28
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
security.jail.param.allow.mount.zfs: 0
security.jail.mount_zfs_allowed: 0
kstat.zfs.misc.xuio_stats.onloan_read_buf: 0
kstat.zfs.misc.xuio_stats.onloan_write_buf: 0
kstat.zfs.misc.xuio_stats.read_buf_copied: 0
kstat.zfs.misc.xuio_stats.read_buf_nocopy: 0
kstat.zfs.misc.xuio_stats.write_buf_copied: 0
kstat.zfs.misc.xuio_stats.write_buf_nocopy: 404258
kstat.zfs.misc.zfetchstats.hits: 175179193
kstat.zfs.misc.zfetchstats.misses: 5846446
kstat.zfs.misc.zfetchstats.colinear_hits: 1250
kstat.zfs.misc.zfetchstats.colinear_misses: 5845196
kstat.zfs.misc.zfetchstats.stride_hits: 174390568
kstat.zfs.misc.zfetchstats.stride_misses: 1348
kstat.zfs.misc.zfetchstats.reclaim_successes: 28295
kstat.zfs.misc.zfetchstats.reclaim_failures: 5816901
kstat.zfs.misc.zfetchstats.streams_resets: 82
kstat.zfs.misc.zfetchstats.streams_noresets: 788536
kstat.zfs.misc.zfetchstats.bogus_streams: 0
kstat.zfs.misc.arcstats.hits: 91927308
kstat.zfs.misc.arcstats.misses: 1396356
kstat.zfs.misc.arcstats.demand_data_hits: 84422170
kstat.zfs.misc.arcstats.demand_data_misses: 10883
kstat.zfs.misc.arcstats.demand_metadata_hits: 5929721
kstat.zfs.misc.arcstats.demand_metadata_misses: 614442
kstat.zfs.misc.arcstats.prefetch_data_hits: 12958
kstat.zfs.misc.arcstats.prefetch_data_misses: 646102
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 1562459
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 124929
kstat.zfs.misc.arcstats.mru_hits: 4522308
kstat.zfs.misc.arcstats.mru_ghost_hits: 313472
kstat.zfs.misc.arcstats.mfu_hits: 85829777
kstat.zfs.misc.arcstats.mfu_ghost_hits: 237884
kstat.zfs.misc.arcstats.allocated: 1947469
kstat.zfs.misc.arcstats.deleted: 963406
kstat.zfs.misc.arcstats.stolen: 1126801
kstat.zfs.misc.arcstats.recycle_miss: 435272
kstat.zfs.misc.arcstats.mutex_miss: 1804
kstat.zfs.misc.arcstats.evict_skip: 1123669
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_l2_eligible: 115656109056
kstat.zfs.misc.arcstats.evict_l2_ineligible: 10684319744
kstat.zfs.misc.arcstats.hash_elements: 311004
kstat.zfs.misc.arcstats.hash_elements_max: 322470
kstat.zfs.misc.arcstats.hash_collisions: 1187880
kstat.zfs.misc.arcstats.hash_chains: 89713
kstat.zfs.misc.arcstats.hash_chain_max: 12
kstat.zfs.misc.arcstats.p: 3910849857
kstat.zfs.misc.arcstats.c: 4171730979
kstat.zfs.misc.arcstats.c_min: 631109632
kstat.zfs.misc.arcstats.c_max: 5048877056
kstat.zfs.misc.arcstats.size: 4170212696
kstat.zfs.misc.arcstats.hdr_size: 73015904
kstat.zfs.misc.arcstats.data_size: 2797993472
kstat.zfs.misc.arcstats.other_size: 1299203320
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 76
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 0
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 0
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 0
kstat.zfs.misc.arcstats.l2_write_in_l2: 0
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 0
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 96729
kstat.zfs.misc.arcstats.l2_write_full: 0
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 0
kstat.zfs.misc.arcstats.l2_write_pios: 0
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 0
kstat.zfs.misc.vdev_cache_stats.delegations: 0
kstat.zfs.misc.vdev_cache_stats.hits: 0
kstat.zfs.misc.vdev_cache_stats.misses: 0
 
There are some things missing:

  • Are these 4K drives? diskinfo -v ada0 | grep stripesize
  • Was gnop(8) used to force ZFS to use an ashift of 12? zdb | grep ashift
  • When GPT partitions were used, were they aligned? gpart show

FreeBSD 9.1 is also relatively old. ZFS has advanced a lot since then.
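If gnop(8) hasn't been tried, the usual workflow to force ashift=12 is roughly this (device and pool names are placeholders):
Code:
# put a temporary 4k-sector gnop device on top of one disk
gnop create -S 4096 ada2
# create the pool through the .nop device so ZFS picks an ashift of 12
zpool create test mirror ada2.nop ada3
# export, drop the gnop layer, re-import; the ashift sticks
zpool export test
gnop destroy ada2.nop
zpool import test
# verify
zdb | grep ashift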
 
"Are these 4K drives? diskinfo -v ada0 | grep stripesize"

2 drives are 512 byte sector, 2 are 4k but report as 512.


"Was gnop(8) used to force ZFS to use an ashift of 12? "

Yes. I've tried it both with ashift of 9 and of 12. No change at all in performance.


"When GPT partitions were used, were they aligned? gpart show"
Code:
=>        34  5860533101  ada2  GPT  (2.7T)
          34        2014        - free -  (1M)
        2048  5860531080     1  freebsd-zfs  (2.7T)
  5860533128           7        - free -  (3.5k)

=>        34  5860533101  ada3  GPT  (2.7T)
          34        2014        - free -  (1M)
        2048  5860531080     1  freebsd-zfs  (2.7T)
  5860533128           7        - free -  (3.5k)

"FreeBSD 9.1 is also relatively old. ZFS has advanced a lot since then."

I'm sure it has, but ZFS was already in wide use by 9.1, and I can't fathom it gaining popularity with such glaringly bad performance. So I'm assuming there's a problem with my configuration that I'm just missing. As I mentioned, my plan is to migrate to FreeBSD 10 or Linux, with this performance issue being the deciding factor. I've not observed any such issues with ZFS on Linux; if it weren't for my love of jails and pf I'd simply switch. I've lived with the performance up until now, but if I'm going to expend the energy to move to 10 or Linux, I want to get the performance I should have had all along.

Thanks for the response!
 
The problem is not ZFS itself. I have only two drives (512 byte sectors, a 1TB and a 3TB SATA), with a much weaker CPU (1.8 GHz Atom running in 32-bit mode, with 3GB of memory), and a ZFS mirror, and I can read through ZFS at 130 to 140 MB/sec (about 2/3 of the raw disk bandwidth), and write at about 60 MB/s (also 2/3 of the raw disk bandwidth). This using an even older version of ZFS, namely FreeBSD 9.0.

Three questions about how you did your performance measurement. First, you say that each disk can do 100 MB/s. Can they all do that simultaneously? Try running four instances of dd in parallel against the raw disks, one per disk, both reading and writing. Second, when you use dd to read and write through the ZFS file system, try using large IOs; I personally have standardized my performance testing on 16MiB IOs. Last question: when you read and write through ZFS, do you know what your bottleneck is? Try to gather some CPU and IO performance statistics (top would be a good starting point). For example, on my system writing to ZFS is not CPU bound at all (about 10% CPU utilization, probably mostly for checksumming), while reading causes about 30% CPU utilization.
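As a rough sketch of what I mean (device names and the file path are placeholders, and note that writing to the raw devices is destructive, so only do that on disks with no data on them):
Code:
# all four raw disks at once, 16MiB reads
dd if=/dev/ada1 of=/dev/null bs=16m count=512 &
dd if=/dev/ada2 of=/dev/null bs=16m count=512 &
dd if=/dev/ada3 of=/dev/null bs=16m count=512 &
dd if=/dev/ada4 of=/dev/null bs=16m count=512 &
wait
# the same through ZFS, while watching top and gstat in other terminals
dd if=/test/bigfile of=/dev/null bs=16m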
 
I'm primarily testing using dd.

I've tried using 128k and 1M blocks, though neither affects the numbers. I used 128k to align with the block sizes ZFS uses, though in theory 1M will align as well being evenly divisible by 128k.

I did state in my original post that I can use dd to read two raw disks at once and saturate my SATA bus, so that can't be the bottleneck. Also, while a single disk reads at its performance limit when directly accessed (north of 100 MBps), that same disk does well to hit 40 MBps when reading a file from ZFS.

CPU is well within reason during operations. While reading a file from ZFS using dd:
Code:
CPU:  0.0% user,  0.0% nice,  9.1% system,  0.4% interrupt, 90.5% idle

gstat shows both drives with L(q) pegged at its max of 10, both 100% busy, and each reading at about 40 MBps (that was one of the faster runs for ZFS).

When I use dd to read one of the raw devices, L(q) never exceeds 1 while it sustains 140-150 MBps at 95% busy.
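For completeness, this is roughly how those numbers were gathered (the file path is a placeholder):
Code:
# terminal 1: read a large file off the ZFS mirror
dd if=/test/bigfile of=/dev/null bs=1m
# terminal 2: per-disk queue length L(q), %busy and throughput, 1 second samples
gstat -I 1s
# terminal 3: CPU breakdown
top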
 
I'm stumped. Everything you say indicates that it should be working at high speed, but it doesn't. It's sort of sad that my system (half as many disks, and much slower CPU) runs at twice the speed. ZFS experts to the rescue please ...
 
Over the weekend I booted the machine to FreeBSD 10-RELEASE from a USB stick and ran performance tests (dd), first with the pool at ZFS version 28 and then after upgrading the on-disk pool to version 5000. In both cases performance was stuck in the 60 MBps range from the mirror.
 
And, if you test filesystem performance using something other than dd, what happens? Like using, I don't know, an actual filesystem benchmarking tool like iobench, or bonnie++, or fio, etc?

dd is one of the worst tools to use when it comes to benchmarking filesystems, especially ZFS. Just don't use it.
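For example, a sequential read job in fio that mimics the dd pattern but can be scaled up to several threads might look something like this (fio is in ports as benchmarks/fio; the file path and sizes are placeholders):
Code:
# single-threaded 128k sequential reads through the filesystem
fio --name=seqread --filename=/test/bigfile --rw=read --bs=128k \
    --size=4g --ioengine=psync --runtime=60 --time_based
# raise --numjobs to 4 to see how concurrency changes the picture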
 
phoenix said:
dd is one of the worst tools to use when it comes to benchmarking filesystems, especially ZFS. Just don't use it.

Why?

dd does a very simple IO pattern: open file, lots of reads or writes, all the same size, one at a time (single-threaded sequential). There is really only one thing to adjust, which is IO size. That is its strength: simplicity. As such, it makes for a very fine micro-benchmark, which simulates sequential IO from real-world applications.

Sure, there are other patterns (yes, I'm very familiar with them, since my specialty used to be workload characterization and generation), and there are tools that can generate those (I think a good overview of the state of the art from a decade ago can be found in Eric Anderson's Buttress paper). But dd makes an excellent starting point, and it is available on any computer. If a file system fails the dd test by a large margin (meaning: its performance is a factor of several away from where one would expect it), there is no need to go on to complicated workload generators.
 
There are some simple read-only disk speed tests in diskinfo(8). It's essentially the same thing as using dd(1) to read from the drive with a moderate buffer size.
diskinfo -tv ada0
 
Here are similar tests on my HP N40L MicroServer (3x3TB raidz1 with ashift=12, 8GB ECC RAM, FreeBSD 10).

dd if=PCBSD10.0-RELEASE-x64-DVD-USB-latest.iso of=/dev/null
7479756+0 records in
7479756+0 records out
3829635072 bytes transferred in 44.136510 secs (86767963 bytes/sec)

dd if=PCBSD10.0-RELEASE-x64-DVD-USB-latest.iso of=/dev/null bs=128k
29217+1 records in
29217+1 records out
3829635072 bytes transferred in 11.680077 secs (327877558 bytes/sec)
Performance has been similar since ZFS version 15 (FreeBSD 8).
Mirror should be even faster!

Is the SATA controller in AHCI mode in the BIOS? I suspect an eSATA cabling issue or the enclosure's controller.
Try connecting the drives to the internal SATA ports and re-benchmark.
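A few quick checks before rewiring anything (the device name is a placeholder):
Code:
# confirm the controller attached via the AHCI driver and how the disks probed
dmesg | grep -i ahci
camcontrol devlist
# check what link speed each disk negotiated
camcontrol identify ada2 | grep -i sata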
 
I did some more testing yesterday.

I booted the server to Linux Mint Debian edition on a USB stick and installed ZFS on Linux.

I imported my test pool which was created in FreeBSD 9.1 and then upgraded to ZFS 5000 in FreeBSD 10.

Using dd with no options on the test file on that pool I got the same terrible performance. But when I retried it specifying a block size of 128k, it saturated the SATA link. I retried again with a block size of 4096 bytes and it again saturated the SATA link. I tried once more forcing a block size of 512 bytes and the performance tanked back to the usual poor numbers.

No amount of fiddling with block sizes seemed to really make much difference on my existing mirror which contains my actual data.

I then destroyed my test pool and rebooted back into FreeBSD 9.1. I created a new test pool and test file, then retried the same steps above. Using 4096 byte or 128k blocks the performance wasn't as good as under Linux, but it was acceptable (maybe 20% short of saturating the link). When forced to 512 byte blocks it went right back into the toilet.
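For anyone who wants to repeat the sweep, it was essentially this (the file path is a placeholder; use a file larger than the ARC, or export/import the pool between runs, so reads aren't served from cache):
Code:
dd if=/test/bigfile of=/dev/null bs=512
dd if=/test/bigfile of=/dev/null bs=4096
dd if=/test/bigfile of=/dev/null bs=128k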

So phoenix was onto something questioning the use of dd in this case, at least with default values.

For now I'll leave alone the question of why ZFS on Linux is able to saturate the SATA link while ZFS under FreeBSD 9.1 can't.

My assumption is that this is a bug in ZFS, and since the main code is now shared between FreeBSD and Linux, both suffer from it. When I use 4096 or 128k blocks the drives both light up and stay virtually solid while reading. When I use the 512 byte blocks the drives literally light up for about half a second, go idle for a second, and repeat that pattern the entire time. It looks like something is going on internally which is causing reads to stall.

Another interesting thing I've found is how the ARC seems selective about what it caches.

I tried this test:

  • Copied a 1GB file (which will fit into the ARC) from the array to /dev/null using cp.
  • Recopied the file and timed it. It reread the file from the disks instead of from the cache.
  • I then copied the file from the array to a stand-alone drive. It again reread from the disks instead of the cache.
  • After I copied it to the stand-alone disk I copied the file again from the array to /dev/null, and that time it read from cache.
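A rough reproduction of those steps, with the ARC hit/miss counters checked before and after (paths are placeholders):
Code:
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
time cp /tank/1gb-file /dev/null        # first read, comes from the disks
time cp /tank/1gb-file /dev/null        # expected to come from ARC, but re-read the disks
time cp /tank/1gb-file /mnt/spare/      # copy to the stand-alone drive, read the disks again
time cp /tank/1gb-file /dev/null        # only now served from the ARC
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses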

I'm pretty stumped. At this point I don't believe I'll increase the performance of this hardware in ZFS enough to warrant the work involved to migrate to Linux, so I'll migrate to FreeBSD 10 instead, but my faith in ZFS is a little shaken by the poor handling of sub-sector reads. I'm also still at a loss to explain the poor performance of my existing array under any conditions.

I have now added the 2 additional 3TB drives to my existing array as a mirror vdev so they are not available for further testing.

I would love any insight anyone may have into this.
 
yourlord said:
When I use 4096 or 128k blocks the drives both light up and stay virtually solid while reading. When I use the 512 byte blocks the drives literally light up for about half a second, go idle for a second, and repeat that pattern the entire time. It looks like something is going on internally which is causing reads to stall.

The great advantage of using dd is that you actually know which system calls it makes. If you do bs=512 or leave it at the default, it will do read() or write() system calls with a block size of 512 bytes. If you set bs=4096 or =1048576, it will do 4KiB or 1MiB read() and write() calls.
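If you want to confirm what dd is actually issuing, truss from the base system will show the read() sizes (the file path is a placeholder):
Code:
truss -o /tmp/dd.truss dd if=/test/bigfile of=/dev/null bs=512 count=8
grep read /tmp/dd.truss | head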

It looks to me like the limiting factor in your tests is the system call interface into the file system, with the bottleneck probably being the VFS layer implementation of ZFS. Here are a few speed tests I just did (FreeBSD 9.0 on a 32-bit Atom CPU), all using dd without an fsync() (so mostly testing transfer to/from buffer cache). Note that I'm still using disks with 512 byte sectors; people with more modern disks should probably completely forego doing any IO with block sizes smaller than 4KiB. And all these numbers are probably inaccurate by 10% or 20%, since I just measured them a few times and reported the last value (no averaging or data quality control here, this is quick and dirty):
  • Copying from /dev/zero to /dev/null: 512 byte blocks give 152 MB/s, and 1MiB blocks give 1.46 GB/s. This gives us an upper limit of what the system call interface is capable of handling through the user/system boundary (on my wimpy machine).
  • Writing into a UFS file system (without fsync at the end, just testing the VFS layer) gives 34 MB/s with 512 byte blocks, and 1MiB blocks give 171 MB/s.
  • Same writing into a ZFS file system (2-way mirrored) with 512 byte blocks gives 10.2 MB/s, and with 1MiB blocks 53 MB/s. The disk light starts blinking immediately, so ZFS seems to be less aggressive about write-behind of dirty data in the buffer cache.
  • Reading from a UFS file system, a file that is still warm in cache (was read or written seconds ago): with 512 byte blocks 85 MB/s, with 1MiB blocks 853 MB/s. No IO to the drive actually occurs (makes sense, the SATA disk wouldn't be able to do IO at that speed anyhow).
  • Same reading from a ZFS file system (again, warm file): 512 byte blocks 14.8 MB/s, with 1MiB blocks 151 MB/s. Actually, it seems that the speed of reading from warm cache is the same as the speed of reading from disk. Weird. Even weirder: Reading a file that SHOULD be cached (it was read two seconds ago) goes back to disk, as verified by running iostat on the disk simultaneously. On the other hand, the speed of reading from SATA is respectable. So ignore this test, it is not a test of the VFS layer, but of whole stack IO throughput.
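The tests above were all variations on this pattern, without an fsync at the end (paths are placeholders):
Code:
# ceiling of the system call interface itself
dd if=/dev/zero of=/dev/null bs=512 count=200000
dd if=/dev/zero of=/dev/null bs=1m count=2000
# write into the filesystem under test (UFS or ZFS mountpoint)
dd if=/dev/zero of=/ufs/testfile bs=1m count=1000
dd if=/dev/zero of=/zfs/testfile bs=1m count=1000
# re-read the same file while it should still be warm in cache
dd if=/zfs/testfile of=/dev/null bs=1m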

I think what this tells me is that ZFS's system call implementation is slower for writes, in particular for small block sizes. It could just be that ZFS has more constant overhead. Even that would make sense: it might be doing interestingly complicated caching decisions (ARC and such) that improve total throughput for aggregate workloads, but don't look so hot on these microbenchmarks. Or it could be performing ACL checks (I don't use ACLs, but I haven't even looked whether they are enabled on my machine). For reads from ZFS cache, something is seriously broken, and it is probably my fault for not configuring it right.

What do we learn from this? That engineering is the art of the compromise. Both for file system implementors, and for file system users. And that I need to figure out why my ZFS setup is not caching files for multiple reads, and fix it.
 