
ZFS performance problems with 2TB Samsung drives

Discussion in 'Storage' started by arad85, Apr 30, 2012.

  1. arad85

    arad85 New Member

    Hi all,

    Hopefully you can help. I have four Samsung 2TB drives in a RAIDZ array. They are given to ZFS as whole disks, so their stripe offset is 0:

    Code:
    diskinfo -v /dev/ada3
    /dev/ada3
            512             # sectorsize
            2000398934016   # mediasize in bytes (1.8T)
            3907029168      # mediasize in sectors
            4096            # stripesize
            0               # stripeoffset
            3876021         # Cylinders according to firmware.
            16              # Heads according to firmware.
            63              # Sectors according to firmware.
            S2H7J1CB702251  # Disk ident.
    


    They are connected into the fileserver with Adaptec 1430SA SATA controllers. This is the performance I'm getting:

    Straight dd from the raw disks:
    Code:
    dd if=/dev/ada1 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada4 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada6 of=/dev/null bs=1m count=10000 &
    
    gives:
    
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 72.958848 secs (143721568 bytes/sec)
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 75.176198 secs (139482446 bytes/sec)
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 75.296263 secs (139260032 bytes/sec)
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 75.555696 secs (138781860 bytes/sec)
    

    gstat shows the disks as 100% busy and reading 135-140MBytes/sec.

    dd through the filesystem:
    Code:
    $ dd if=/storage/temp/test.file of=/dev/null bs=1m count=10000
    10000+0 records out
    10485760000 bytes transferred in 57.488728 secs (182396799 bytes/sec)
    

    gstat shows the disks as 90-100% busy, with each reading 45-50MBytes/sec.

    Straight dd to one of the disks (same disk type, but uninitialised):

    Code:
    $ dd if=/dev/zero of=/dev/ada3 bs=1m count=10000
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 77.051038 secs (136088498 bytes/sec)
    

    with gstat showing the disk 90+% busy and 130MBytes/sec.

    dd to the pool
    Code:
    $ dd if=/dev/zero of=/storage/temp/test.file bs=1m count=10000
    10000+0 records in
    10000+0 records out
    10485760000 bytes transferred in 191.702491 secs (54698089 bytes/sec)
    


    with gstat showing all four disks as 100% active at 25-30MBytes/sec.

    CPU barely registers - the system thinks it is 95+% idle (CPU is an AMD 630 quad core).

    Any ideas?
     
  2. t1066

    t1066 Member

    arad85 thanks for this.
  3. arad85

    arad85 New Member

    Thanks. Is there any way to get the drives to appear as 4096 byte block devices without destroying the ZFS pool? I think I have enough spare space to move stuff around but....
     
  4. t1066

    t1066 Member

    No, you cannot. Maybe you could first check that these drives really are 4k drives emulating 512-byte sectors before taking the plunge.
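
    For reference, a couple of ways to check the physical sector size from FreeBSD (the smartctl one needs the sysutils/smartmontools port, and the stripesize line in the diskinfo output above is already a strong hint):

    Code:
    # camcontrol identify ada3 | grep -i "sector size"    # look for the physical sector size
    # smartctl -i /dev/ada3 | grep -i "sector size"       # reports logical and physical sizes
    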
     
  5. arad85

    arad85 New Member

    Yes, they are, so I took the plunge.

    bonnie++ benchmarks from before the rework (I have 10G of memory in the server):

    Code:
    # bonnie++ -d /storage/dir -u 0:0  -s 20g
    ...
    Version  1.96       -------Sequential Output------- --Sequential Input--  --Random-
    Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block--  --Seeks--
    Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
    MAINSERVER      20G   140  99 186515  50 109949  29   369  98 295539  34 122.0   6
    Latency               150ms     2414ms     1607ms    71019us     728ms    1267ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 25232  91 +++++ +++ 21512  97 19158  71 +++++ +++ 24000  85
    Latency              9114us     194us     251us   22049us     107us     489us
    


    and after:
    Code:
    # bonnie++ -d /storage/dir -u 0:0  -s 20g
    ...
    Version  1.96       -------Sequential Output------- --Sequential Input-- --Random-
    Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block-- --Seeks--
    Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
    MAINSERVER      20G   151  99 276372  74 151062  39   376  99 383933  45 122.4   6
    Latency             88207us      746ms     1049ms    37431us     210ms    1147ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 18532  70 +++++ +++ 24457  86 26525  96 +++++ +++ 28279  97
    Latency              6297us     122us     228us   18907us      82us     212us
    
    


    So, ignoring the per-character values, and knowing that this is a fileserver serving (primarily) music and video to the network, I think the most important benchmarks are the block-based sequential reads, writes and rewrites, which gives:

    Code:
                     ashift=9    ashift=12      Gain
    Block write        186M        276M       48% faster
    Block rewrite      110M        151M       37% faster
    Block read         295M        384M       30% faster
    
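
    As a quick sanity check of those gain figures, the ratios of the bonnie++ block K/sec numbers above come out at roughly 1.48, 1.37 and 1.3 (plain arithmetic, not another benchmark run):

    Code:
    # echo "scale=2; 276372/186515; 151062/109949; 383933/295539" | bc
    1.48
    1.37
    1.29
    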


    Having said that, the CPU percentage has increased, but that's OK if I can get better throughput. In addition, the file-creation latencies (although not the maximum per-second values) are much lower.

    I'm just restoring the data from the single disk, and whilst I'm not able to read the data at 100% speed (I think rsync is the bottleneck there), the write speeds reported by gstat look much better: before, I was only ever getting 40-50Mbytes/sec per disk with the interface maxed out; now I'm getting 100+Mbytes/sec and the interface isn't maxed out.

    Hopefully, real world performance will improve now too (most of my writing is files to/from SMB shares).
     
    coppermine thanks for this.
  6. arad85

    arad85 New Member

    For completeness, these are the commands I used to rework the array:

    # gpart create -s gpt ada1
    # gpart create -s gpt ada2
    # gpart create -s gpt ada4
    # gpart create -s gpt ada6
    # gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k ada1
    # gpart add -t freebsd-zfs -l disk2 -b 2048 -a 4k ada2
    # gpart add -t freebsd-zfs -l disk3 -b 2048 -a 4k ada4
    # gpart add -t freebsd-zfs -l disk4 -b 2048 -a 4k ada6
    # gnop create -S 4096 /dev/gpt/disk1
    # gnop create -S 4096 /dev/gpt/disk2
    # gnop create -S 4096 /dev/gpt/disk3
    # gnop create -S 4096 /dev/gpt/disk4
    # zpool create storage raidz /dev/gpt/disk1.nop /dev/gpt/disk2.nop /dev/gpt/disk3.nop /dev/gpt/disk4.nop
    # zpool export storage
    # gnop destroy /dev/gpt/disk1.nop
    # gnop destroy /dev/gpt/disk2.nop
    # gnop destroy /dev/gpt/disk3.nop
    # gnop destroy /dev/gpt/disk4.nop
    # zpool import storage
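
    As an aside for anyone reading this later: newer FreeBSD releases grew a vfs.zfs.min_auto_ashift sysctl, and if your version has it, the whole gnop dance can be skipped by forcing the minimum ashift before creating the pool. A rough sketch, reusing the same labels and pool name:

    Code:
    # sysctl vfs.zfs.min_auto_ashift=12
    # zpool create storage raidz /dev/gpt/disk1 /dev/gpt/disk2 /dev/gpt/disk3 /dev/gpt/disk4
    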
     
    thethirdnut, rabfulton, kpa and 3 others thank for this.
  7. coppermine

    coppermine New Member

    A nice and complete post, with a clean presentation of the problem, the analysis and the solution. I would encourage you to write a short summary of what was done, since many people will run into this or a related issue. Keep up the good work.
     
  8. arad85

    arad85 New Member

    Post #6 says it all really - from bare disks to an array with 4096-byte-aligned blocks. AFAICT, the commands do the following:

    • gpart create puts a GPT partitioning scheme on each disk, so that partitions can then be created on it with gpart.
    • gpart add adds a partition starting at sector 2048 (1MByte into the disk) and aligned to a 4k boundary. The -l parameter also gives the partition a label, which is what makes it show up under /dev/gpt/...
    • gnop create creates a "pseudo" device on top of each partition with a sector size of 4096 bytes, so the provider now looks like a native 4096-byte-sector disk (a quick check of this follows after the list).
    • zpool create creates the pool on the pseudo devices, so ZFS picks up the 4096-byte sector size.
    • zpool export removes the pool from active duty.
    • gnop destroy destroys the pseudo devices the pool was created on.
    • zpool import re-imports the pool, which now references the real partitions rather than the transparent pseudo devices.
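
    Before the gnop devices are destroyed, it is easy to double-check that the pseudo providers really present 4096-byte sectors (a quick sketch, reusing the disk1 label from post #6):

    Code:
    # diskinfo -v /dev/gpt/disk1.nop | grep sectorsize
    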

    Running:

    Code:
    # zdb storage | grep ashift
                ashift: 12
    


    which means I'm 4096 byte aligned (2^12 = 4096).

    This can be shown to be true by running gpart show on any drive:

    Code:
    # gpart show ada1
    =>        34  3907029101  ada1  GPT  (1.8T)
              34        2014        - free -  (1M)
            2048  3907027080     1  freebsd-zfs  (1.8T)
      3907029128           7        - free -  (3.5k)
    


    I'll report back later when the 2.5TBytes of data have been restored and I've run some more benchmarking on the re-setup disk. I also have a mirror of 2x1TByte Samsung drives that I might do the same to and see if the performance increases.
     
  9. arad85

    arad85 New Member

    PS. The dd command from /dev/random showed no improvement in performance but more realistic benchmarks (bonnie++) seemed to show significant improvements.
     
  10. Sebulon

    Sebulon Member

    @arad85
    First of all, congratulations! It feels good conquering technology :)
    One quick note about gnop though: it's only necessary on the first drive in every vdev. So in your case, with your raidz, you would only need disk1.nop.
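
    A minimal sketch of that shortened variant (reusing the labels and pool name from post #6; ZFS picks the largest sector size in the vdev, so one 4k provider is enough):

    Code:
    # gnop create -S 4096 /dev/gpt/disk1
    # zpool create storage raidz /dev/gpt/disk1.nop /dev/gpt/disk2 /dev/gpt/disk3 /dev/gpt/disk4
    # zpool export storage
    # gnop destroy /dev/gpt/disk1.nop
    # zpool import storage
    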

    But.
    ashift has never been any kind of alignment! The ashift value determines the smallest IO that ZFS will send; in this case 2^12 = 4096. And it would have kept sending 4k IOs regardless of what alignment you may have had.

    Your alignment is the gpart partitioning itself (the partition starting at sector 2048 with -a 4k), and that part is done perfectly.

    For people reading this and wondering why zpool falls back to using the device names instead of the labels after the re-import: you can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
    4x2TB disk partition help

    The procedure begins at post #12 and is for a bootable striped mirror pool, which you may change to a different pool layout to better suit your needs. Omit the first partition and the bootcode, plus the zpool set bootfs step, if you don't want to boot from it.

    /Sebulon
     
    arad85 thanks for this.
  11. t1066

    t1066 Member

    If I remember correctly, dd if=/dev/random is limited to less than 100MB/s.
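
    A quick way to see that limit on your own box is to time a read from the device directly (nothing pool-specific here):

    Code:
    # dd if=/dev/random of=/dev/null bs=1m count=1000
    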

    You can also retain the labels by using

    Code:
    # zpool export mypool
    # zpool import -d /dev/gpt mypool      # if you are using gpt labels
    # or:
    # zpool import -d /dev/label mypool    # if you are using plain labels
    
     
    arad85 thanks for this.
  12. arad85

    arad85 New Member

    Of course... that makes sense.

    Ahh rats... I wanted to keep the labels, but you are right, you lose them on re-import. I now have adaXp1 as my drives.

    Is there any way to get the system to see them by label without having to rebuild the array (I've nearly finished transferring the files back, but if I must, I will...)? I'm guessing this matters if I ever move the array to a different controller (which I did a couple of months ago), as it then doesn't matter how the disks are physically connected up.
     
  13. phoenix

    phoenix Moderator Staff Member

    For the pool to search the /dev/gpt directory for labels instead of using the device nodes directly:
    Code:
    # zpool export poolname
    # zpool import -d /dev/gpt poolname
     
    arad85 thanks for this.
  14. arad85

    arad85 New Member

    Thanks, yes, saw that from t1066's post. Just waiting for my restore to finish before trying that.

    Got to love ZFS..... :e
     
  15. arad85

    arad85 New Member

    Ahh.. I was nearly finished. Yes, that works - thanks all:

    Code:
    # zpool export storage
    # zpool import -d /dev/gpt storage
    # zpool status storage
      pool: storage
     state: ONLINE
     scan: none requested
    config:
    
            NAME           STATE     READ WRITE CKSUM
            storage        ONLINE       0     0     0
              raidz1-0     ONLINE       0     0     0
                gpt/disk1  ONLINE       0     0     0
                gpt/disk2  ONLINE       0     0     0
                gpt/disk3  ONLINE       0     0     0
                gpt/disk4  ONLINE       0     0     0
    
    errors: No known data errors
    
    
     
  16. Sebulon

    Sebulon Member

    @phoenix

    impot? Talk about a Freudian slip :)

    /Sebulon
     
  17. phoenix

    phoenix Moderator Staff Member

    Doesn't everyone pay a ZFS tax? ;)

    Spelling fixed in original post.
     
  18. arad85

    arad85 New Member

    Just as an update to this, I tried running 3x simultaneous dd reads from local disks, each writing to the array (all 7 disks are spread across the 2x Adaptec 1430SA controllers), and got this:

    Code:
    
    dd if=/dev/ada3 of=/storage/testing bs=1024000 count=10000 &
    dd if=/dev/ada0 of=/storage/testing1 bs=1024000 count=10000 &
    dd if=/dev/ada5 of=/storage/testing2  bs=1024000 count=10000 &
    [1] 15331
    [2] 15332
    [3] 15333
    # 10000+0 records in
    10000+0 records out
    10240000000 bytes transferred in 112.665802 secs (90888271 bytes/sec)
    10000+0 records in
    10000+0 records out
    10240000000 bytes transferred in 120.971984 secs (84647698 bytes/sec)
    10000+0 records in
    10000+0 records out
    10240000000 bytes transferred in 124.426937 secs (82297292 bytes/sec)
    

    So that's 3x reads plus writes across all 4 disks of the 3+1 RAIDZ array, with the two controllers hosting all 7 disks, and I get a sustained 275Mbytes/sec write rate. Not bad IMHO - and the disks aren't maxed out according to gstat :)
     
  19. arad85

    arad85 New Member

    A further update. I scrub this array weekly and have a cron job mail me the status 7 hours after the scrub starts (at 11am); a sketch of that cron setup is at the end of this post. This is last week's mail:

    Code:
      pool: storage
     state: ONLINE
     scan: scrub in progress since Sun Apr 29 04:00:02 2012
        2.74T scanned out of 3.27T at 114M/s, 1h21m to go
        0 repaired, 83.73% done
    


    and this is this week's run:

    Code:
      pool: storage
     state: ONLINE
     scan: scrub repaired 0 in 2h35m with 0 errors on Sun May  6 06:35:58 2012
    


    Admittedly there is slightly less data (2.76T) on the array as I tidied it up before reworking it, but that is a MASSIVE improvement in speed (close to 300M/s as opposed to 114M/s).
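
    For anyone wanting to copy the scrub-and-report setup, a rough sketch of what the two /etc/crontab entries could look like (assuming the pool name storage, a Sunday 04:00 start, an 11:00 report and root as the recipient):

    Code:
    # kick off the weekly scrub early on Sunday morning
    0   4    *    *    0    root    zpool scrub storage
    # ...and mail the pool status seven hours later
    0   11   *    *    0    root    zpool status storage | mail -s "weekly scrub: storage" root
    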