[Solved] ZFS performance problems with 2TB Samsung drives - The FreeBSD Forums

#1  arad85, April 30th, 2012, 20:08

Hi all,

Hopefully you can help. I have four Samsung 2TB drives in a RAIDZ array. They are given to ZFS as whole disks, so their stripe offset is 0.

Code:
diskinfo -v /dev/ada3
/dev/ada3
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S2H7J1CB702251  # Disk ident.
They are connected to the fileserver via Adaptec 1430SA SATA controllers. This is the performance I'm getting:

Straight dd from the raw disks:
Code:
dd if=/dev/ada1 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada4 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada6 of=/dev/null bs=1m count=10000 &

gives:

10000+0 records out
10485760000 bytes transferred in 72.958848 secs (143721568 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.176198 secs (139482446 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.296263 secs (139260032 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.555696 secs (138781860 bytes/sec)
gstat shows the disks as 100% busy and reading 135-140MBytes/sec.

dd through the filesystem:
Code:
$ dd if=/storage/temp/test.file of=/dev/null bs=1m count=10000
10000+0 records out
10485760000 bytes transferred in 57.488728 secs (182396799 bytes/sec)
gstat shows the disks as 90-100% busy, reading at 45-50MBytes/sec.

Straight dd to one of the disks (same disk type, but uninitialised):

Code:
$ dd if=/dev/zero of=/dev/ada3 bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 77.051038 secs (136088498 bytes/sec)
with gstat showing the disk 90+% busy and 130MBytes/sec.

dd to the pool:
Code:
$ dd if=/dev/zero of=/storage/temp/test.file bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 191.702491 secs (54698089 bytes/sec)
with gstat showing all four disks as 100% active at 25-30MBytes/sec.

CPU barely registers - the system thinks it is 95+% idle (CPU is an AMD 630 quad core).

Any ideas?

#2  t1066, May 1st, 2012, 11:02

Since you get poor performance with dd, you most probably have Advanced Format drives (4K physical sectors presented as 512-byte sectors). The following FAQ should be helpful.

https://forums.freebsd.org/showthread.php?t=21644

#3  arad85, May 1st, 2012, 11:17

Thanks. Is there any way to get the drives to appear as 4096 byte block devices without destroying the ZFS pool? I think I have enough spare space to move stuff around but....

#4  t1066, May 2nd, 2012, 10:28

No, you cannot. Maybe you could first check that these drives really are 4k drives emulating 512-byte sectors before taking the plunge.
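
One rough way to make that check is to compare the sector size and stripe size that diskinfo -v reports, as in the output in post #1 (512 and 4096). The sketch below is only illustrative: is_512e is a hypothetical helper name, and it assumes the drive firmware reports its physical sector size honestly as the stripe size, which not every drive does.

```shell
#!/bin/sh
# Sketch only: classify a drive from `diskinfo -v` output read on stdin.
# is_512e is a hypothetical helper, not a FreeBSD tool.
# Assumption: stripesize reflects the physical sector size of the drive.
is_512e() {
    awk '
        $2 == "#" && $3 == "sectorsize" { sector = $1 }
        $2 == "#" && $3 == "stripesize" { stripe = $1 }
        END {
            if (sector == 512 && stripe == 4096)
                print "512e"    # 4k physical sectors, 512-byte logical sectors
            else
                print "other"
        }'
}

# Usage on a FreeBSD host (device name taken from post #1):
#   diskinfo -v /dev/ada3 | is_512e
```

If the drive reports a stripe size of 0, this check says nothing either way and the manufacturer's data sheet is the better source.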

#5  arad85, May 2nd, 2012, 10:54

Yes, they are, so I took the plunge.

bonnie++ benchmarks from before the rework (I have 10G of memory in the server):

Code:
# bonnie++ -d /storage/dir -u 0:0  -s 20g
...
Version  1.96       -------Sequential Output------- --Sequential Input--  --Random-
Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block--  --Seeks--
Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
MAINSERVER      20G   140  99 186515  50 109949  29   369  98 295539  34 122.0   6
Latency               150ms     2414ms     1607ms    71019us     728ms    1267ms
Version  1.96       ------Sequential Create------ --------Random Create--------
MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 25232  91 +++++ +++ 21512  97 19158  71 +++++ +++ 24000  85
Latency              9114us     194us     251us   22049us     107us     489us
and after:
Code:
# bonnie++ -d /storage/dir -u 0:0  -s 20g
...
Version  1.96       -------Sequential Output------- --Sequential Input-- --Random-
Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block-- --Seeks--
Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
MAINSERVER      20G   151  99 276372  74 151062  39   376  99 383933  45 122.4   6
Latency             88207us      746ms     1049ms    37431us     210ms    1147ms
Version  1.96       ------Sequential Create------ --------Random Create--------
MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 18532  70 +++++ +++ 24457  86 26525  96 +++++ +++ 28279  97
Latency              6297us     122us     228us   18907us      82us     212us
So, ignoring the per-character values, and knowing that this is a fileserver serving (primarily) music and video to the network, I think the most important benchmarks are the block-based sequential reads, writes and rewrites, which give:

Code:
                  ashift=9   ashift=12      Gain
Block write        186M/s      276M/s     48% faster
Block rewrite      110M/s      151M/s     37% faster
Block read         295M/s      384M/s     30% faster
Having said that, the CPU percentage has increased, but that's OK if I can get better throughput. In addition, the latencies (although not the max per sec values) of the file creation are much lower.
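
For anyone wanting to check the gain column, the percentages can be recomputed from the K/sec figures in the two bonnie++ runs above. This is a small sketch; pct_gain is a hypothetical helper name and the result is rounded to whole percent.

```shell
#!/bin/sh
# Sketch: percentage gain between two throughput figures (same units).
# pct_gain is a hypothetical helper; arguments are "before" and "after".
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) * 100 / before }'
}

# K/sec values copied from the bonnie++ output above.
pct_gain 186515 276372   # block write
pct_gain 109949 151062   # block rewrite
pct_gain 295539 383933   # block read
```

The three calls print 48, 37 and 30, matching the table.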

I'm just restoring the data from the single disk, and whilst I'm not able to read the data at 100% speed (I think rsync is causing a bottleneck there), the write speeds reported by gstat seem to be much better: before, I was only ever getting 40-50Mbytes/sec per disk when writing, with the interface maxed out; now I'm getting 100+Mbytes/sec and the interface isn't maxed out.

Hopefully, real world performance will improve now too (most of my writing is files to/from SMB shares).

#6  arad85, May 2nd, 2012, 11:02

For completeness, these are the commands I used to rework the array:

Code:
# gpart create -s gpt ada1
# gpart create -s gpt ada2
# gpart create -s gpt ada4
# gpart create -s gpt ada6
# gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k ada1
# gpart add -t freebsd-zfs -l disk2 -b 2048 -a 4k ada2
# gpart add -t freebsd-zfs -l disk3 -b 2048 -a 4k ada4
# gpart add -t freebsd-zfs -l disk4 -b 2048 -a 4k ada6
# gnop create -S 4096 /dev/gpt/disk1
# gnop create -S 4096 /dev/gpt/disk2
# gnop create -S 4096 /dev/gpt/disk3
# gnop create -S 4096 /dev/gpt/disk4
# zpool create storage raidz /dev/gpt/disk1.nop /dev/gpt/disk2.nop /dev/gpt/disk3.nop /dev/gpt/disk4.nop
# zpool export storage
# gnop destroy /dev/gpt/disk1.nop
# gnop destroy /dev/gpt/disk2.nop
# gnop destroy /dev/gpt/disk3.nop
# gnop destroy /dev/gpt/disk4.nop
# zpool import storage

#7  coppermine, May 2nd, 2012, 11:29

A nice and complete post containing a clean presentation of a problem, analysis and solution. I would encourage you to write up a short memo of what was done, since many people will be affected by this or related issues. Keep up the good work.

#8  arad85, May 2nd, 2012, 12:18

Post #6 says it all really - from bare disks to an array with 4096 aligned blocks. AFAICT, the commands do:
  • gpart create creates a GPT partitioning scheme on each disk (i.e. enables partitions to be added to them with gpart).
  • gpart add adds a partition starting at sector 2048 (1Mbyte into the disk) and aligns it on a 4k byte boundary. The partition is also labelled with the -l parameter, which makes it appear as a device under /dev/gpt/.
  • gnop create creates a "pseudo" device referring to the partition with a sector size of 4096 bytes (the partition now looks like a 4096 byte sector device).
  • zpool create creates the pool using the pseudo devices.
  • zpool export removes the pool from active duty.
  • gnop destroy destroys the pseudo devices that the array was created with.
  • zpool import re-imports the pool, which now references the actual partitions rather than the transparent pseudo devices.
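
The sequence above can also be written as a loop. The sketch below only prints the commands rather than running them, so the plan can be reviewed before anything destructive happens. plan_pool is a hypothetical helper name; the disk1..diskN label scheme and the flags are simply the ones used in post #6, and the per-disk ordering differs slightly from the original listing.

```shell
#!/bin/sh
# Sketch: print (do not execute) the commands from post #6 for a given pool
# name and list of disks. plan_pool is a hypothetical helper.
plan_pool() {
    pool=$1
    shift
    n=0
    vdevs=""
    for disk in "$@"; do
        n=$((n + 1))
        echo "gpart create -s gpt $disk"
        echo "gpart add -t freebsd-zfs -l disk$n -b 2048 -a 4k $disk"
        echo "gnop create -S 4096 /dev/gpt/disk$n"
        vdevs="$vdevs /dev/gpt/disk$n.nop"
    done
    echo "zpool create $pool raidz$vdevs"
    echo "zpool export $pool"
    n=0
    for disk in "$@"; do
        n=$((n + 1))
        echo "gnop destroy /dev/gpt/disk$n.nop"
    done
    echo "zpool import $pool"
}

# Same pool name and disks as in post #6.
plan_pool storage ada1 ada2 ada4 ada6
```

Once the printed plan looks right, the lines can be run by hand as root. These commands destroy whatever is on the listed disks, so check the device names first.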

Running:

Code:
# zdb storage | grep ashift
            ashift: 12
which means I'm 4096 byte aligned (2^12 = 4096).

This can be shown to be true by running gpart show on any drive:

Code:
# gpart show ada1
=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5k)
I'll report back later when the 2.5TBytes of data have been restored and I've run some more benchmarking on the re-setup disk. I also have a mirror of 2x1TByte Samsung drives that I might do the same to and see if the performance increases.

#9  arad85, May 2nd, 2012, 12:20

PS. The dd command from /dev/random showed no improvement in performance but more realistic benchmarks (bonnie++) seemed to show significant improvements.

#10  Sebulon, May 2nd, 2012, 15:42

@arad85
First of all, congratulations! It feels good conquering technology.
One quick note about gnop though: it's only necessary on the first drive in every vdev. So in your case with your raidz, you would only need the disk1.nop.

But.
Quote:
Originally Posted by arad85 View Post
Running:
Code:
# zdb storage | grep ashift
            ashift: 12
which means I'm 4096 byte aligned (2^12 = 4096).
ashift has never been any kind of alignment! The ashift value determines the smallest IO that ZFS will send; in this case 2^12 = 4096. And it would have kept sending 4k IOs regardless of what alignment you may have had.

This is your alignment, which is done perfectly:
Quote:
Originally Posted by arad85 View Post
Code:
# gpart show ada1
=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5k)
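
The distinction can be checked with a little arithmetic: ashift gives the smallest IO size, while alignment depends on where the partition starts. A minimal sketch, assuming 512-byte logical sectors as reported by diskinfo in post #1; is_aligned is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check whether a partition start falls on a given byte boundary.
# is_aligned is a hypothetical helper.
# Arguments: start sector, logical sector size in bytes, boundary in bytes.
is_aligned() {
    start_sector=$1
    sector_bytes=$2
    align_bytes=$3
    if [ $(( start_sector * sector_bytes % align_bytes )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

echo $(( 1 << 12 ))        # ashift=12: smallest IO is 4096 bytes
is_aligned 2048 512 4096   # partition start from the gpart show output
is_aligned 34 512 4096     # first usable GPT sector, for comparison
```

This prints 4096, aligned and misaligned: sector 2048 is 1048576 bytes into the disk, a multiple of 4096, whereas a partition starting at sector 34 would not be.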
People reading this and wondering why the pool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help

The procedure begins at post #12 and is for a bootable striped mirror pool, which you may change to a different pool layout to better suit your needs. Omit the first partition and the bootcode, plus zpool set bootfs, if you don't want to boot from it.

/Sebulon

#11  t1066, May 2nd, 2012, 16:03

Quote:
Originally Posted by arad85 View Post
PS. The dd command from /dev/random showed no improvement in performance but more realistic benchmarks (bonnie++) seemed to show significant improvements.
If I remember correctly, dd if=/dev/random is limited to less than 100MB/s.

Quote:
Originally Posted by Sebulon View Post
People reading this and wondering why the pool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help

The procedure begins at post #12 and is for a bootable striped mirror pool, which you may change to a different pool layout to better suit your needs. Omit the first partition and the bootcode, plus zpool set bootfs, if you don't want to boot from it.

/Sebulon
You can also retain the labels by using

Code:
# zpool export mypool
# zpool import -d /dev/gpt mypool      # if you are using GPT labels
or
# zpool import -d /dev/label mypool    # if you are using plain labels

#12  arad85, May 2nd, 2012, 16:08

Quote:
Originally Posted by Sebulon View Post
But.

ashift has never been any kind of alignment! The ashift value determines the smallest IO that ZFS will send; in this case 2^12 = 4096. And it would have kept sending 4k IOs regardless of what alignment you may have had.
Of course... that makes sense.

Quote:
Originally Posted by Sebulon View Post
People reading this and wondering why the pool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help
Ahh rats... I wanted to keep the labels, but you are right, you lose the labels. I now have adaXp1 as my drives.

Is there any way to get the system to see them as labels without having to rebuild the array (nearly finished transferring back the files, but if I must, I will...)? I'm guessing this may be important if I ever moved the disk array to a different controller (which I did a couple of months ago) as it doesn't then matter how they are physically connected up.

#13  phoenix, May 2nd, 2012, 16:34

To make the pool search the /dev/gpt directory for labels on import, instead of using the device nodes directly:
Code:
# zpool export poolname
# zpool import -d /dev/gpt poolname
__________________
Freddie

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.

#14  arad85, May 2nd, 2012, 16:48

Quote:
Originally Posted by phoenix View Post
For the pool to search the /dev/gpt directory for labels instead of using the device nodes directly:
Code:
# zpool export poolname
# zpool impot -d /dev/gpt poolname
Thanks, yes, saw that from t1066's post. Just waiting for my restore to finish before trying that.

Got to love ZFS.....

#15  arad85, May 2nd, 2012, 16:53

Ahh.. I was nearly finished. Yes, that works - thanks all:

Code:
# zpool export storage
# zpool import -d /dev/gpt storage
# zpool status storage
  pool: storage
 state: ONLINE
 scan: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        storage        ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0

errors: No known data errors

#16  Sebulon, May 2nd, 2012, 19:16

@phoenix

impot? Talk about Freudian slip

/Sebulon

#17  phoenix, May 2nd, 2012, 21:02

Doesn't everyone pay a ZFS tax?

Spelling fixed in original post.

#18  arad85, May 3rd, 2012, 00:56

Just as an update to this, I tried running 3x reads from local disks and write to the array simultaneously (all 7 disks are across 2x Adaptec 1430SA controllers) and got this:

Code:
dd if=/dev/ada3 of=/storage/testing bs=1024000 count=10000 &
dd if=/dev/ada0 of=/storage/testing1 bs=1024000 count=10000 &
dd if=/dev/ada5 of=/storage/testing2  bs=1024000 count=10000 &
[1] 15331
[2] 15332
[3] 15333
# 10000+0 records in
10000+0 records out
10240000000 bytes transferred in 112.665802 secs (90888271 bytes/sec)
10000+0 records in
10000+0 records out
10240000000 bytes transferred in 120.971984 secs (84647698 bytes/sec)
10000+0 records in
10000+0 records out
10240000000 bytes transferred in 124.426937 secs (82297292 bytes/sec)
So, 3x reads, 4x writes across the 3+1 RAIDZ array from two controllers hosting all 7 disks and I get a sustained 275Mbytes/sec write rate. Not bad IMHO - and the disks aren't maxed out according to gstat.

#19  arad85, May 6th, 2012, 11:10

A further update. I scrub this array on a weekly basis and have a cron job mail me 7 hours (11am) after it started. This is last week's mail:

Code:
  pool: storage
 state: ONLINE
 scan: scrub in progress since Sun Apr 29 04:00:02 2012
    2.74T scanned out of 3.27T at 114M/s, 1h21m to go
    0 repaired, 83.73% done
and this is this week's run:

Code:
  pool: storage
 state: ONLINE
 scan: scrub repaired 0 in 2h35m with 0 errors on Sun May  6 06:35:58 2012
Admittedly there is slightly less data (2.76T) on the array as I tidied it up before reworking it, but that is a MASSIVE improvement in speed (close to 300M/s as opposed to 114M/s).
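
The "close to 300M/s" figure can be sanity-checked from the scrub summary. This sketch assumes the whole 2.76T was scanned in the 2h35m reported, and that zpool's T and M are binary units (1T = 1024 x 1024 M); scrub_rate is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: average scrub rate in M/s from data size (in T), hours and minutes.
# scrub_rate is a hypothetical helper.
# Assumption: 1T = 1024 * 1024 M, and all of the data was scanned.
scrub_rate() {
    awk -v t="$1" -v h="$2" -v m="$3" \
        'BEGIN { printf "%.0f\n", t * 1024 * 1024 / (h * 3600 + m * 60) }'
}

scrub_rate 2.76 2 35   # this week's scrub: 2.76T in 2h35m
```

That works out to roughly 311M/s, against the 114M/s reported partway through the previous week's scrub.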