ZFS scrub/resilver performance

I am installing a big storage server to back up other backups...

Performance is not really important because we only have a gigabit connection, except for scrub and resilver.

I find the performance really bad:
  • FreeBSD 10.3 RC2
  • H830 in JBOD mode, mrsas driver (same performance with MFI driver)
  • 24 * 4TB hard disks in 2 MD1400 enclosures
  • zfs: 2 raidz2 volumes of 12 disks
    zpool create MD1400 raidz2 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12
    zpool add MD1400 raidz2 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23 da24
  • 512-byte sectors for ZFS and for the hard drives (according to the datasheet)
  • 96GB of RAM, 2 Xeon E5 CPUs, 4 cores each at 2.4GHz

For scrub I get around 250MB/s, for resilver around 30MB/s (with no other disk load). I find those numbers really slow. I know I should put fewer disks in each raidz2 vdev and add an SSD ZIL device, but there is no money for that at the moment. I am a FreeBSD beginner; I only have experience on Linux and with btrfs (on a local backup server with 12 disks, scrubbing runs at around 750 MB/s).
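
One thing worth double-checking, given the 512-byte sector point above, is the ashift the pool actually ended up with. A minimal check, assuming the pool name MD1400 from the zpool create command above (zdb output formatting may differ slightly between releases):
Code:
# Dump the cached pool configuration and look at the ashift of each top-level vdev
# (ashift=9 means 512-byte alignment, ashift=12 means 4K alignment)
zdb -C MD1400 | grep ashift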

Did I make a really big mistake somewhere?

I have tried to modify some sysctls
Code:
vfs.zfs.scrub_delay=0
vfs.zfs.top_maxinflight=256
vfs.zfs.resilver_min_time_ms=5000
vfs.zfs.min_auto_ashift=12
vfs.zfs.resilver_delay=0

But there is no real performance gain, even when there is no other disk load.
Performance with iozone is good, maybe too good, except for really big files (memory cache?).
iozone -a -e -I -i 0 -i 1 -i 2
Code:
                                                              random   random
             KB  reclen    write  rewrite     read   reread     read    write
              4       4      129      481  1310136  4000000  1905268      160
            128       4     5271     2864  1684006  2843510  2238774     3326
            128     128     4541     3329 14200794 15881078 12842051     4528
           1024       4    30094    35292  2056183  2694787  2297012    35918
           1024    1024    30695    17596 15295157 15516181 15240881    29417
         131072      64   692812   663117  5492899  4988311  4607916   520182
         131072   16384   489109   576190  3880159  3040704  3300888   505766
        1048576      64  1158665  1269999  6342591  5880403  5766660  1205105
        1048576   16384   899561  1087380  3574267  2776977  3605299  1097044
        4194304      64   265900   237421  4017686  7439293  7003663   244079
        4194304   16384   311796   294150  2884899  3617101  1771297   307059
      134217728      64   486427    64904   568734   564655
 
In FreeBSD 10.2 (and probably 10.3-RC2) ZFS differentiates between different kinds of I/O when scheduling I/O operations. The defaults are, on purpose, far too low to saturate most pools with scrubs. It's a trade-off between throughput and latency. Spinning disks are limited to about 150 seeks per second. Your disks are under full load if the disk I/O queue is never empty. SATA disks are limited to a queue depth of 32. SAS disks might offer a higher queue depth, but using such deep queues with spinning disks will severely increase your I/O latency.

You can modify the scheduler parameters with sysctl vfs.zfs.vdev.*active*=$VALUE. Experiment with values up to 30 for spinning disks. Be aware that playing with these knobs can trash ZFS performance for some workloads; they are per system and affect all ZFS pools.
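
For example, something along these lines (only a starting point for experimentation, not recommended values):
Code:
# Show the current per-vdev queue depths used by the ZFS I/O scheduler
sysctl vfs.zfs.vdev | grep active
# Allow more concurrent scrub/resilver I/Os per vdev (this raises latency for other I/O)
sysctl vfs.zfs.vdev.scrub_min_active=2
sysctl vfs.zfs.vdev.scrub_max_active=10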

Scrubbing with just a handful of outstanding read requests at a time won't degrade your system throughput by that much. Does a slight performance degradation over the weekend hurt your use case? If not, start scrubs on Friday evenings and let them run over the weekend.
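
A cron entry is enough for that; a sketch, with the pool name and start time as placeholders:
Code:
# /etc/crontab -- kick off a scrub every Friday at 20:00
#minute  hour  mday  month  wday  who   command
0        20    *     *      5     root  /sbin/zpool scrub MD1400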
 
Thanks for your answer. Waiting a whole weekend for a scrub is not a problem. The problem is that at 250MB/s a weekend is not enough: when the volume is filled, I will have almost 80TB to scrub/resilver.
At 250MB/s that is almost 4 days (80 TB / 250 MB/s ≈ 320,000 s ≈ 3.7 days) with no other disk activity...
And with the current resilvering performance, I would need weeks to resilver.

As I said previously, normal performance is capped by the gigabit connection, and this volume only gathers copies from other backup servers, so I do not care about "interactive" performance.
But I cannot accept several weeks to resilver; the risk of another failure in the meantime is too high.

Now my sysctl.conf is:
Code:
vfs.zfs.scrub_delay=0
vfs.zfs.top_maxinflight=256
vfs.zfs.resilver_min_time_ms=10000
vfs.zfs.resilver_delay=0
vfs.zfs.vdev.scrub_max_active=8
vfs.zfs.vdev.scrub_min_active=4
vfs.zfs.txg.timeout=15
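
To confirm the values were actually applied after boot, they can be read back (same names as above):
Code:
sysctl vfs.zfs.vdev.scrub_min_active vfs.zfs.vdev.scrub_max_active vfs.zfs.txg.timeout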

And my performance is abysmal:
3.17T scanned out of 26.8T at 14.8M/s, 463h22m to go
 
I suppose you want an average:
gstat -I 10s

Code:
dT: 10.009s  w: 10.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0    231    228    673    1.3      3     11   22.6   13.2| da1
    0    230    227    702    1.4      3     11   18.1   11.3| da2
    0    232    229    700    1.2      3     11   16.0   10.4| da3
    0    232    228    671    1.3      4     12   15.5   11.2| da4
    0    231    227    701    1.4      3     12   17.0   11.2| da5
    0    227    224    669    1.4      3     12   17.1   11.5| da6
    0    229    226    703    1.6      3     11   18.7   12.0| da7
    0    229    225    700    1.4      3     11   24.5   13.1| da8
    0    234    230    678    1.3      3     11   16.2   11.0| da9
    0    234    231    701    1.2      3     11   17.1   11.4| da10
    0    231    227    700    1.3      3     11   16.3   11.2| da11
    2    329    327    627    0.4      2      6   17.9    8.5| da12
    1    326    324    666    0.5      2      6   21.4    9.1| da13
    1    329    327    666    0.5      2      6   19.1    8.5| da14
    1    327    325    627    0.4      2      6   21.9    9.1| da15
    1    329    327    669    0.5      2      6   12.5    7.1| da16
    2    329    327    669    0.4      2      6   21.8    8.9| da17
    2    326    324    624    0.4      2      6   18.0    8.5| da18
    2    329    327    667    0.5      2      6   25.8   10.5| da19
    2    330    328    631    0.4      2      6   21.7    8.5| da20
    2    330    328    667    0.5      2      6   22.7    9.1| da21
    2    327    325    666    0.4      2      6   22.4    8.7| da22
    1    104      0      0    0.0    104    315    8.6   88.9| da23
    0     26      0      0    0.0     26    370    8.5   21.6| da24
    0      0      0      0    0.0      0      0    0.0    0.0| diskid/DISK-00a912aa105588e11d0068fc48708741
    0      0      0      0    0.0      0      0    0.0    0.0| diskid/DISK-00a912aa105588e11d0068fc48708741p1
    0      0      0      0    0.0      0      0    0.0    0.0| diskid/DISK-00a912aa105588e11d0068fc48708741p2
    0      0      0      0    0.0      0      0    0.0    0.0| diskid/DISK-00a912aa105588e11d0068fc48708741p3
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/MD1400_12/CTMM

Nota bene: I have lost two hard disks.
zpool status MD1400_12
Code:
        NAME                        STATE     READ WRITE CKSUM
        MD1400_12                   DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            da1                     ONLINE       0     0     0
            da2                     ONLINE       0     0     0
            da3                     ONLINE       0     0     0
            da4                     ONLINE       0     0     0
            spare-4                 DEGRADED     0     0     0
              3926900443355641810   OFFLINE      0     0     0  was /dev/da5
              da24                  ONLINE       0     0     0  (resilvering)
            da5                     ONLINE       0     0     0
            da6                     ONLINE       0     0     0
            da7                     ONLINE       0     0     0
            da8                     ONLINE       0     0     0
            da9                     ONLINE       0     0     0
            da10                    ONLINE       0     0     0
            da11                    ONLINE       0     0     0
          raidz2-1                  DEGRADED     0     0     0
            da12                    ONLINE       0     0     0
            da13                    ONLINE       0     0     0
            da14                    ONLINE       0     0     0
            da15                    ONLINE       0     0     0
            da16                    ONLINE       0     0     0
            da17                    ONLINE       0     0     0
            da18                    ONLINE       0     0     0
            da19                    ONLINE       0     0     0
            da20                    ONLINE       0     0     0
            replacing-9             DEGRADED     0     0     0
              10166242517980806007  REMOVED      0     0     0  was /dev/da22/old
              da23                  ONLINE       0     0     0  (resilvering)
            da21                    ONLINE       0     0     0
            da22                    ONLINE       0     0     0
 
H830 in JBOD mode, mrsas driver (same performance with MFI driver)
What do you mean by JBOD mode? As far as I know, the only way you'll get daN devices on a PERC H830 is by creating individual volumes for each drive. With the mfi(4) driver, what does # mfiutil show volumes report? How about # mfiutil cache 0?
zfs: 2 raidz2 volumes of 12 disks
zpool create MD1400 raidz2 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12
zpool add MD1400 raidz2 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23 da24
As you point out, you probably want to have fewer drives in each vdev. But if normal file I/O is not horribly slow, this probably isn't your problem, or at least not a major part of it.
512-byte sectors for ZFS and for the hard drives (according to the datasheet)
What brand/model/firmware?
Did I make a really big mistake somewhere?
Do you happen to have deduplication enabled? I found (back in FreeBSD 8.x, so it may have changed) that dedupe had a terrible impact on scrub / resilver performance.
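
It is easy to verify (a quick check, using the pool name from your zpool create; dedup is off by default):
Code:
# The DEDUP column should read 1.00x if nothing was ever deduplicated
zpool list
# Shows whether any dataset has the dedup property turned on
zfs get -r dedup MD1400
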
I have tried to modify some sysctls
You generally don't want / need to do that. On somewhat similar hardware (2 x E5620, 96GB DDR3-1333, 16 x 2TB drives) with default settings I can scrub a 27TB pool (20TB used) or resilver a failed drive in around 5.5 hours (first is scrub on 10.3, second is resilver on 8.4):

Code:
(0:2) pool4:/sysprog/terry# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  27.2T  20.0T  7.16T         -    49%    73%  1.00x  ONLINE  -
(0:3) pool4:/sysprog/terry# zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 5h38m with 0 errors on Sun Feb  7 12:03:30 2016
Code:
(0:2) rz2:/sysprog/terry# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data  27.2T  20.4T  6.79T    75%  1.00x  ONLINE  -
(0:3) rz2:/sysprog/terry# zpool status
  pool: data
 state: ONLINE
  scan: resilvered 1.26T in 5h25m with 0 errors on Wed Jun 17 03:22:44 2015
 
What do you mean by JBOD mode? As far as I know, the only way you'll get daN devices on a PERC H830 is by creating individual volumes for each drive. With the mfi(4) driver, what does # mfiutil show volumes report? How about # mfiutil cache 0?
I have enabled the mrsas driver, so no mfiutil (I was unable to access smartctl info with the default driver).
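
For reference, the switch from mfi(4) to mrsas(4) is done with a loader tunable, something like this (from memory; check mrsas(4) for your release):
Code:
# /boot/loader.conf -- let mrsas(4) claim supported MegaRAID/PERC controllers instead of mfi(4)
hw.mfi.mrsas_enable="1"
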
The latest generation of PERC controllers (H730/H830) has an HBA mode:
HBA mode: In HBA mode, the PERC controller operates as a Host Bus Adapter (HBA). This mode does not contain virtual disks or the ability to create them. All physical disks function as non-RAID disks under operating system control. The PERC card acts as a conduit between the host server and the physical disks. Input and output requests originate from the host and are passed through the controller to the physical drives.


What brand/model/firmware?
smartctl -a /dev/da12
Code:
=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG03SCA400
Revision:             DG09
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
...

The firmware is Dell-specific, but iozone performance seems correct.
diskinfo -t /dev/da12 gives results that are a bit slow, but not catastrophic (and I am resilvering):
Code:
 diskinfo -t /dev/da12
/dev/da12
        512             # sectorsize
        4000787030016   # mediasize in bytes (3.6T)
        7814037168      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        486401          # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        75R0A07IFVL8    # Disk ident.

Seek times:
        Full stroke:      250 iter in   4.765073 sec =   19.060 msec
        Half stroke:      250 iter in   3.596500 sec =   14.386 msec
        Quarter stroke:   500 iter in   6.203005 sec =   12.406 msec
        Short forward:    400 iter in   2.329779 sec =    5.824 msec
        Short backward:   400 iter in   2.454308 sec =    6.136 msec
        Seq outer:       2048 iter in   0.080367 sec =    0.039 msec
        Seq inner:       2048 iter in   0.261640 sec =    0.128 msec
Transfer rates:
        outside:       102400 kbytes in   0.717927 sec =   142633 kbytes/sec
        middle:        102400 kbytes in   0.873888 sec =   117177 kbytes/sec
        inside:        102400 kbytes in   1.771477 sec =    57805 kbytes/sec

Do you happen to have deduplication enabled? I found (back in FreeBSD 8.x, so it may have changed) that dedupe had a terrible impact on scrub / resilver performance.

You generally don't want / need to do that. On somewhat similar hardware (2 x E5620, 96GB DDR3-1333, 16 x 2TB drives) with default settings I can scrub a 27TB pool (20TB used) or resilver a failed drive in around 5.5 hours (first is scrub on 10.3, second is resilver on 8.4):

I have not enabled deduplication (it would be useless for my data), unless it is somehow enabled by default:
zpool get all MD1400_12
Code:
NAME       PROPERTY                       VALUE                          SOURCE
MD1400_12  size                           87T                            -
MD1400_12  capacity                       30%                            -
MD1400_12  altroot                        -                              default
MD1400_12  health                         DEGRADED                       -
MD1400_12  guid                           15975816981946802139           default
MD1400_12  version                        -                              default
MD1400_12  bootfs                         -                              default
MD1400_12  delegation                     on                             default
MD1400_12  autoreplace                    off                            default
MD1400_12  cachefile                      -                              default
MD1400_12  failmode                       wait                           default
MD1400_12  listsnapshots                  off                            default
MD1400_12  autoexpand                     off                            default
MD1400_12  dedupditto                     0                              default
MD1400_12  dedupratio                     1.00x                          -
MD1400_12  free                           60.2T                          -
MD1400_12  allocated                      26.8T                          -
MD1400_12  readonly                       off                            -
MD1400_12  comment                        -                              default
MD1400_12  expandsize                     -                              -
MD1400_12  freeing                        0                              default
MD1400_12  fragmentation                  5%                             -
MD1400_12  leaked                         0                              default
MD1400_12  feature@async_destroy          enabled                        local
MD1400_12  feature@empty_bpobj            active                         local
MD1400_12  feature@lz4_compress           active                         local
MD1400_12  feature@multi_vdev_crash_dump  enabled                        local
MD1400_12  feature@spacemap_histogram     active                         local
MD1400_12  feature@enabled_txg            active                         local
MD1400_12  feature@hole_birth             active                         local
MD1400_12  feature@extensible_dataset     enabled                        local
MD1400_12  feature@embedded_data          active                         local
MD1400_12  feature@bookmarks              enabled                        local
MD1400_12  feature@filesystem_limits      enabled                        local
MD1400_12  feature@large_blocks           enabled                        local
 

Since you have such bad performance and you're running your controller in HBA mode, have you checked what transfer speed the OS has negotiated for the disks?

I'm about to make a post regarding my PERC H830, which can do 12G and has 12G drives, and my PERC H730, which can do 6G and has 6G drives, asking why the transfer rate is only negotiated as 150MB/sec according to camcontrol and dmesg.
 
I'm about to make a post regarding my PERC H830, which can do 12G and has 12G drives, and my PERC H730, which can do 6G and has 6G drives, asking why the transfer rate is only negotiated as 150MB/sec according to camcontrol and dmesg.
The transfer rates reported for some devices are made-up numbers and don't reflect reality. A bunch of the drivers have had things cleaned up, for example mfi(4) doesn't report any transfer rate information for multi-drive volumes. It looks like mps(4) gets it right in IT mode, correctly reporting "300.000MB/s transfers" for my 3Gbps SATA drives and "600.000MB/s transfers" for my 6Gbps SAS drives. But it incorrectly reports the speed in IR mode, where it reports a stripeset of 4 SATA SSDs as "150.000MB/s transfers". The mpsutil(8) utility, just committed to 10-STABLE yesterday (so it isn't in 10.3-RELEASE) displays the actual port speed(s) correctly:
Code:
(0:1) host:/sysprog/terry# mpsutil show devices
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   06   4433221100000000 0009    0001      SATA Target   3.0   0001 03    1
00   07   4433221101000000 000a    0002      SATA Target   3.0   0001 02    1
00   08   4433221102000000 000b    0003      SATA Target   3.0   0001 01    1
00   09   4433221103000000 000c    0004      SATA Target   3.0   0001 00    1
That's the 4-SSD SATA stripeset that is mis-reported as "150.000MB/s transfers":
Code:
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <LSI Logical Volume 3000> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number  1569089174241760694
da0: 150.000MB/s transfers
da0: Command Queueing enabled
da0: 301360MB (617185280 512 byte sectors)
So, when in doubt, ask the hardware and don't believe the dmesg(8) output if it seems incorrect.
 
I have the same message with my configuration. I do not understand the cap at 150MB/s; those are SAS drives, and even first-generation SAS is 3Gbit/s. But it does not seem really important to me, because the Toshiba MG03SCA400 4TB drives have a maximum sustained transfer rate of only 165MB/s.
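
To see the link rate the drive itself negotiated, smartctl can read the SAS phy log page (a sketch; assuming smartmontools supports the sasphy log on these drives):
Code:
# Protocol-specific port log page, including the negotiated logical link rate per phy
smartctl -l sasphy /dev/da12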

For resilvering, has the sequential resilvering fix from OpenSolaris been applied to FreeBSD?
https://blogs.oracle.com/roch/entry/sequential_resilvering

But from asking other people, resilvering with this kind of hard disk should take less than a week, not a month. A bad interaction between FreeBSD and the H830?
 
Thanks for the info. It's good to know it's only cosmetic.

I was wrong in saying the PERC H730 can only do 6G; it can do 12G as well. But because I have (for some reason?) a couple of 6G system drives, it pulls down the reported speed for the rest of the drives.
 
What do you mean by JBOD mode? As far as I know, the only way you'll get daN devices on a PERC H830 is by creating individual volumes for each drive. With the mfi(4) driver, what does # mfiutil show volumes report? How about # mfiutil cache 0?

As Fabrice pointed out, the 730/830 series can be configured as an HBA. The 730 can even do mixed mode (I haven't tried this on the 830 controller), so you can select some drives for RAID (e.g. the system on a RAID1) and configure the rest as Non-RAID.
Selecting multiple disks as Non-RAID is most easily done in the iDRAC interface.

Sorry I got a little off topic.
 