ZFS stripe performance

Hello all,

I have a problem that I have been struggling with for the past few days and cannot solve, or even understand. My configuration is as follows:
Code:
FreeBSD rabbit.example.com 9.2-RC4 FreeBSD 9.2-RC4 #0: Tue Sep 24 15:33:26 UTC 2013     root@rabbit.example.com:/usr/obj/usr/src/sys/GENERIC  amd64
2x Intel Xeon E5-2620
128 GB RAM
3x SAS2308 (built-in mps driver, driver_version: 14.00.00.01-fbsd)
12x Samsung SSD 840
and some other hardware that is not relevant here.
The issue is that the performance of a striped ZFS pool degrades as more disks are added, when it should be the other way around.
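For completeness, the pools in the tests below are plain stripes with no redundancy, built roughly along these lines (da0, da1 and da2 are placeholders for the actual SSD device names):
Code:
zpool create Store da0     # single-disk pool
zpool add Store da1        # grow the stripe to two top-level vdevs
zpool add Store da2        # grow the stripe to three top-level vdevs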

ZFS customization:
Code:
zfs set recordsize=4k Store
zfs set compression=lz4 Store
(everything else is left at the defaults, i.e. sync=standard)
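The dataset properties can be double-checked with a plain zfs get:
Code:
zfs get recordsize,compression,sync Store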
And now for some tests:
1 SSD in the pool

Code:
       No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  328333       0   728087        0  676709  134946
2 SSDs in the pool, the second one on a second controller
Code:
        No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  277307       0   695430        0  678698  140415
3 SSDs in a stripe, spread across all 3 controllers
Code:
       No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  235144       0   428078        0  451033  214316
 
What makes it worse is that the per-disk IOPS actually degrade: gstat reports around 6000 IOPS with one SSD, but with two SSDs it shows only about 3000 on each, and the totals are actually lower, as iostat confirms as well.
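For reference, these are the commands I keep running next to iozone to watch the per-disk numbers (the one-second interval is just a habit; da0 and da1 stand for the actual pool disks):
Code:
gstat -I 1s                # per-provider ops/s and %busy, refreshed every second
iostat -x -w 1 da0 da1     # extended per-device statistics every second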
Tests made with sync=always
Code:
   File size set to 204800 KB
        Command line used: iozone -O -i 0 -i 1 -i 2 -+n -r4K -s 200m
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
          204800       4    6103       0   207337        0  190212    3511
And with 2 SSDs:
Code:
    File size set to 204800 KB
        Command line used: iozone -O -i 0 -i 1 -i 2 -+n -r4K -s 200m
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
          204800       4    5900       0   207428        0  190251    4704

Here are the ZFS sysctl settings:
Code:
vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 131072
vfs.zfs.mfu_ghost_metadata_lsize: 24576
vfs.zfs.mfu_ghost_size: 155648
vfs.zfs.mfu_data_lsize: 5358592
vfs.zfs.mfu_metadata_lsize: 614400
vfs.zfs.mfu_size: 6358016
vfs.zfs.mru_ghost_data_lsize: 313931776
vfs.zfs.mru_ghost_metadata_lsize: 776704
vfs.zfs.mru_ghost_size: 314708480
vfs.zfs.mru_data_lsize: 12297728
vfs.zfs.mru_metadata_lsize: 4158976
vfs.zfs.mru_size: 20823040
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 49152
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 33051212800
vfs.zfs.arc_meta_used: 27936768
vfs.zfs.arc_min: 16525606400
vfs.zfs.arc_max: 132204851200
vfs.zfs.dedup.prefetch: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.nopwrite_enabled: 1
vfs.zfs.write_limit_override: 3221225472
vfs.zfs.write_limit_inflated: 0
vfs.zfs.write_limit_max: 8584478720
vfs.zfs.write_limit_min: 33554432
vfs.zfs.write_limit_shift: 0
vfs.zfs.no_write_throttle: 1
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.no_scrub_prefetch: 0
vfs.zfs.no_scrub_io: 0
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.free_min_time_ms: 1000
vfs.zfs.scan_min_time_ms: 1000
vfs.zfs.scan_idle: 50
vfs.zfs.scrub_delay: 4
vfs.zfs.resilver_delay: 2
vfs.zfs.top_maxinflight: 32
vfs.zfs.write_to_degraded: 0
vfs.zfs.mg_alloc_failures: 36
vfs.zfs.check_hostid: 1
vfs.zfs.deadman_enabled: 1
vfs.zfs.deadman_synctime: 1000
vfs.zfs.recover: 0
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.bshift: 13
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 29
vfs.zfs.vdev.min_pending: 1
vfs.zfs.vdev.max_pending: 2
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.cache_flush_disable: 1
vfs.zfs.zil_replay_disable: 0
vfs.zfs.sync_pass_rewrite: 2
vfs.zfs.sync_pass_dont_compress: 5
vfs.zfs.sync_pass_deferred_free: 2
vfs.zfs.zio.use_uma: 0
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.version.ioctl: 3
vfs.zfs.version.zpl: 5
vfs.zfs.version.spa: 5000
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
 
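For anyone who wants to compare with their own box, the list above is simply the vfs.zfs sysctl tree:
Code:
sysctl vfs.zfs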
Yes, I did, and the performance gap really shows when I limit the L2ARC so that caching is disabled. Read throughput almost doubles when the second SSD is added, but when I add the third the increase is very small, and the same goes for the fourth, fifth, and so on.
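(If somebody wants to reproduce the cache-limited runs: one way to keep the cache from masking the disks is to cap the ARC in /boot/loader.conf; the value below is only an example, not the exact figure used here.)
Code:
# /boot/loader.conf
vfs.zfs.arc_max="4294967296"   # example: cap the ARC at 4 GB so reads actually hit the disks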
 
It is hard to say anything for sure without looking at a live system with proper tools. At the moment I am working on improving FreeBSD block-level storage performance. After a lot of changes on a similar system with 16 SSDs on 4 LSI HBAs, I was able to reach 800-900K IOPS and about 3.5 GB/s of total bandwidth from the raw disks (http://people.freebsd.org/~mav/disk.pdf). Experiments with ZFS on top of that unfortunately showed much lower numbers, only about 120-180K IOPS (from the disks, without ARC caching). Analysis shows congestion on several locks inside ZFS. I've shown that to Andriy Gapon (avg@) and he is now thinking about it too. Your situation may be similar.
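If you want a very rough raw-device baseline on your side (this is only the stock utility, not the tooling behind the numbers above), diskinfo against one of the SSDs gives a first impression; da0 is a placeholder:
Code:
diskinfo -ct /dev/da0      # -c measures command overhead, -t runs a simple transfer benchmark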
 