ZFS performance problems with 2TB Samsung drives

Hi all,

Hopefully you can help. I have four Samsung 2TB drives in a RAIDZ array. They are given to ZFS as whole disks, so their stripe offset is 0.

Code:
diskinfo -v /dev/ada3
/dev/ada3
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S2H7J1CB702251  # Disk ident.
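
For anyone checking their own drives: the giveaway in that output is sectorsize 512 combined with stripesize 4096, i.e. a 4K-sector ("Advanced Format") drive emulating a 512-byte interface. A minimal sketch of the check (the two values are hard-coded from the diskinfo output above; a live check would parse `diskinfo -v /dev/ada3` instead):

```shell
# Values taken from the diskinfo -v output above; in practice you would
# parse them out of the live `diskinfo -v` listing.
sectorsize=512
stripesize=4096
if [ "$sectorsize" -eq 512 ] && [ "$stripesize" -eq 4096 ]; then
    echo "512e: 4K physical sectors behind a 512-byte emulated interface"
fi
```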

They are connected to the fileserver via Adaptec 1430SA SATA controllers. This is the performance I'm getting:

Straight dd from the raw disks:
Code:
dd if=/dev/ada1 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada4 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada6 of=/dev/null bs=1m count=10000 &

gives:

10000+0 records out
10485760000 bytes transferred in 72.958848 secs (143721568 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.176198 secs (139482446 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.296263 secs (139260032 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 75.555696 secs (138781860 bytes/sec)
gstat shows the disks as 100% busy and reading 135-140MBytes/sec.

dd through the filesystem:
Code:
$ dd if=/storage/temp/test.file of=/dev/null bs=1m count=10000
10000+0 records out
10485760000 bytes transferred in 57.488728 secs (182396799 bytes/sec)
gstat showing disk reads as 90-100% busy at 45-50MBytes/sec read.

Straight dd to one of the disks (same disk type, but uninitialized):

Code:
$ dd if=/dev/zero of=/dev/ada3 bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 77.051038 secs (136088498 bytes/sec)
with gstat showing the disk 90+% busy and 130MBytes/sec.

dd to the pool
Code:
$ dd if=/dev/zero of=/storage/temp/test.file bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 191.702491 secs (54698089 bytes/sec)

with gstat showing all four disks as 100% active at 25-30MBytes/sec.

CPU barely registers - the system thinks it is 95+% idle (CPU is an AMD 630 quad core).

Any ideas?
 
Thanks. Is there any way to get the drives to appear as 4096 byte block devices without destroying the ZFS pool? I think I have enough spare space to move stuff around but....
 
No, you cannot. Maybe you could first check that these drives really are 4k drives emulating 512-byte sectors before taking the plunge.
 
Yes, they are, so I took the plunge.

bonnie++ benchmarks from before the rework (I have 10G of memory in the server):

Code:
# bonnie++ -d /storage/dir -u 0:0  -s 20g
...
Version  1.96       -------Sequential Output------- --Sequential Input--  --Random-
Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block--  --Seeks--
Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
MAINSERVER      20G   140  99 186515  50 109949  29   369  98 295539  34 122.0   6
Latency               150ms     2414ms     1607ms    71019us     728ms    1267ms
Version  1.96       ------Sequential Create------ --------Random Create--------
MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 25232  91 +++++ +++ 21512  97 19158  71 +++++ +++ 24000  85
Latency              9114us     194us     251us   22049us     107us     489us

and after:
Code:
# bonnie++ -d /storage/dir -u 0:0  -s 20g
...
Version  1.96       -------Sequential Output------- --Sequential Input-- --Random-
Concurrency   1     -Per Chr-  --Block--  -Rewrite- -Per Chr-  --Block-- --Seeks--
Machine        Size K/sec %CP  K/sec %CP  K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
MAINSERVER      20G   151  99 276372  74 151062  39   376  99 383933  45 122.4   6
Latency             88207us      746ms     1049ms    37431us     210ms    1147ms
Version  1.96       ------Sequential Create------ --------Random Create--------
MAINSERVER          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 18532  70 +++++ +++ 24457  86 26525  96 +++++ +++ 28279  97
Latency              6297us     122us     228us   18907us      82us     212us

So, ignoring the per-character values, and knowing that this is a fileserver serving (primarily) music and video to the network, I think the most important benchmarks are the block-based sequential reads/writes and rewrites. Which gives:

Code:
                 ashift=9   ashift=12       Gain
Block write        186M        276M       48% faster
Block rewrite      110M        151M       37% faster
Block read         295M        384M       30% faster
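
As a sanity check, the percentages can be recomputed straight from the bonnie++ K/sec figures above:

```shell
# Percentage gains from the before/after bonnie++ block figures (K/sec):
awk 'BEGIN {
    printf "write   +%.0f%%\n", (276372 / 186515 - 1) * 100
    printf "rewrite +%.0f%%\n", (151062 / 109949 - 1) * 100
    printf "read    +%.0f%%\n", (383933 / 295539 - 1) * 100
}'
```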

Having said that, the CPU percentage has increased, but that's OK if I can get better throughput. In addition, the latencies (although not the max per sec values) of the file creation are much lower.

I'm just restoring the data from the single disk, and whilst I'm not able to read the data at 100% speed (I think rsync is the bottleneck there), the write speeds as reported by gstat are much better: before, I was only ever getting 40-50MBytes/sec per disk with the interface maxed out; now I'm getting 100+MBytes/sec and the interface isn't maxed out.

Hopefully, real world performance will improve now too (most of my writing is files to/from SMB shares).
 
For completeness, these are the commands I used to rework the array:

# gpart create -s gpt ada1
# gpart create -s gpt ada2
# gpart create -s gpt ada4
# gpart create -s gpt ada6
# gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k ada1
# gpart add -t freebsd-zfs -l disk2 -b 2048 -a 4k ada2
# gpart add -t freebsd-zfs -l disk3 -b 2048 -a 4k ada4
# gpart add -t freebsd-zfs -l disk4 -b 2048 -a 4k ada6
# gnop create -S 4096 /dev/gpt/disk1
# gnop create -S 4096 /dev/gpt/disk2
# gnop create -S 4096 /dev/gpt/disk3
# gnop create -S 4096 /dev/gpt/disk4
# zpool create storage raidz /dev/gpt/disk1.nop /dev/gpt/disk2.nop /dev/gpt/disk3.nop /dev/gpt/disk4.nop
# zpool export storage
# gnop destroy /dev/gpt/disk1.nop
# gnop destroy /dev/gpt/disk2.nop
# gnop destroy /dev/gpt/disk3.nop
# gnop destroy /dev/gpt/disk4.nop
# zpool import storage
 
A nice and complete post containing clean presentation of a problem, analysis and solution. I would encourage you to drop a small memo what has been done, since many people will be concerned by this or related. Keep up the good work.
 
Post #6 says it all really - from bare disks to an array with 4096-byte-aligned blocks. AFAICT, the commands do:

  • gpart create creates a GPT partitioning scheme on each disk (so partitions can then be added to it with gpart)
  • gpart add adds a partition starting at sector 2048 (a 1MByte offset), aligned to a 4kByte boundary. The -l parameter labels the partition, which makes it appear as a device in /dev/gpt/...
  • gnop create creates a "pseudo" device on top of each partition with a reported sector size of 4096 bytes (the partition now looks like a 4096-byte-sector disk)
  • zpool create creates the pool using the pseudo devices
  • zpool export removes the pool from active duty
  • gnop destroy destroys the pseudo devices that the pool was created with
  • zpool import re-imports the pool, which now references the actual partitions rather than the (destroyed) pseudo devices
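
Since the per-disk steps are identical, the setup from post #6 can also be written as a loop. This is a dry-run sketch (it only echoes the commands; drop the echo to actually execute them, as root, on disks you intend to wipe), using the disk names and labels from this thread:

```shell
# Dry run: prints one gpart/gnop sequence per disk.
# Remove "echo" to run the commands for real.
i=1
for disk in ada1 ada2 ada4 ada6; do
    echo gpart create -s gpt $disk
    echo gpart add -t freebsd-zfs -l disk$i -b 2048 -a 4k $disk
    echo gnop create -S 4096 /dev/gpt/disk$i
    i=$((i + 1))
done
```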

Running:

Code:
# zdb storage | grep ashift
            ashift: 12

which means I'm 4096 byte aligned (2^12 = 4096).
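
The exponent arithmetic is easy to confirm in the shell:

```shell
# ashift is the log2 of the block size ZFS uses for the vdev:
echo $((1 << 12))   # ashift=12 -> 4096
echo $((1 << 9))    # ashift=9  -> 512
```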

This can be shown to be true by running gpart show on any drive:

Code:
# gpart show ada1
=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5k)
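
Starting at sector 2048 with 512-byte sectors puts the partition at a 1MByte offset, which is an exact multiple of 4096, so the partition is 4K-aligned. Quick arithmetic:

```shell
echo $((2048 * 512))          # partition byte offset: 1048576 (1MByte)
echo $((2048 * 512 % 4096))   # remainder 0 => aligned to 4K boundaries
```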

I'll report back later when the 2.5TBytes of data have been restored and I've run some more benchmarking on the re-setup disk. I also have a mirror of 2x1TByte Samsung drives that I might do the same to and see if the performance increases.
 
PS. The dd command from /dev/random showed no improvement in performance but more realistic benchmarks (bonnie++) seemed to show significant improvements.
 
@arad85
First of all, congratulations! It feels good conquering technology :)
One quick note about gnop though: it's only necessary on the first drive in every vdev. So in your case, with your raidz, you would only need disk1.nop.

But.
arad85 said:
Running:
Code:
# zdb storage | grep ashift
            ashift: 12

which means I'm 4096 byte aligned (2^12 = 4096).
ashift has never been any kind of alignment! The ashift value determines the smallest IO that ZFS will send; in this case 2^12 = 4096. And it would have kept sending 4k IOs regardless of what alignment you may have had.

This is your alignment, which is done perfectly:
arad85 said:
Code:
# gpart show ada1
=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3907027080     1  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5k)

And people reading this who are wondering why zpool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help

The procedure begins at post #12 and is for a bootable striped mirror pool, which you may change to a different pool layout to better suit your needs. Omit the first partition and boot code, plus zpool set bootfs, if you don't want to boot from it.

/Sebulon
 
arad85 said:
PS. The dd command from /dev/random showed no improvement in performance but more realistic benchmarks (bonnie++) seemed to show significant improvements.

If I remember correctly, dd if=/dev/random is limited to less than 100MB/s.

Sebulon said:
And people reading this who are wondering why zpool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help

The procedure begins at post #12 and is for a bootable striped mirror pool, which you may change to a different pool layout to better suit your needs. Omit the first partition and boot code, plus zpool set bootfs, if you don't want to boot from it.

/Sebulon

You can also retain the labels by using

Code:
# zpool export mypool
# zpool import -d /dev/gpt mypool     (if you are using gpt labels)
or
# zpool import -d /dev/label mypool   (if you are using plain labels)
 
Sebulon said:
But.

ashift has never been any kind of alignment! The ashift value determines the smallest IO that ZFS will send; in this case 2^12 = 4096. And it would have kept sending 4k IOs regardless of what alignment you may have had.
Of course... that makes sense.

Sebulon said:
And people reading this who are wondering why zpool falls back to using the device names instead of the labels after re-import can try my approach instead, which doesn't need exporting/importing and keeps the labels showing:
4x2TB disk partition help
Ahh rats... I wanted to keep the labels, but you are right, you lose the labels. I now have adaXp1 as my drives.

Is there any way to get the system to see them as labels without having to rebuild the array (nearly finished transferring back the files, but if I must, I will...)? I'm guessing this may be important if I ever moved the disk array to a different controller (which I did a couple of months ago) as it doesn't then matter how they are physically connected up.
 
For the pool to search the /dev/gpt directory for labels instead of using the device nodes directly:
Code:
# zpool export poolname
# zpool import -d /dev/gpt poolname
 
phoenix said:
For the pool to search the /dev/gpt directory for labels instead of using the device nodes directly:
Code:
# zpool export poolname
# zpool import -d /dev/gpt poolname

Thanks, yes, saw that from t1066's post. Just waiting for my restore to finish before trying that.

Got to love ZFS..... :e
 
Ahh.. I was nearly finished. Yes, that works - thanks all:

Code:
# zpool export storage
# zpool import -d /dev/gpt storage
# zpool status storage
  pool: storage
 state: ONLINE
 scan: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        storage        ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0

errors: No known data errors
 
Just as an update to this: I tried running 3x reads from local disks, writing to the array simultaneously (all 7 disks are spread across 2x Adaptec 1430SA controllers), and got this:

Code:
dd if=/dev/ada3 of=/storage/testing bs=1024000 count=10000 &
dd if=/dev/ada0 of=/storage/testing1 bs=1024000 count=10000 &
dd if=/dev/ada5 of=/storage/testing2  bs=1024000 count=10000 &
[1] 15331
[2] 15332
[3] 15333
# 10000+0 records in
10000+0 records out
10240000000 bytes transferred in 112.665802 secs (90888271 bytes/sec)
10000+0 records in
10000+0 records out
10240000000 bytes transferred in 120.971984 secs (84647698 bytes/sec)
10000+0 records in
10000+0 records out
10240000000 bytes transferred in 124.426937 secs (82297292 bytes/sec)
So, 3x reads and 4x writes across the 3+1 RAIDZ array, from two controllers hosting all 7 disks, and I get a sustained 275MBytes/sec write rate. Not bad IMHO - and the disks aren't maxed out according to gstat :)
 
A further update. I scrub this array weekly and have a cron job mail me the status 7 hours after it starts (11am). This is last week's mail:

Code:
  pool: storage
 state: ONLINE
 scan: scrub in progress since Sun Apr 29 04:00:02 2012
    2.74T scanned out of 3.27T at 114M/s, 1h21m to go
    0 repaired, 83.73% done

and this is this week's run:

Code:
  pool: storage
 state: ONLINE
 scan: scrub repaired 0 in 2h35m with 0 errors on Sun May  6 06:35:58 2012

Admittedly there is slightly less data on the array (2.76T), as I tidied it up before reworking it, but that is a MASSIVE improvement in speed (close to 300M/s as opposed to 114M/s).
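
As a rough back-of-envelope check on the second run (treating the reported 2.76T as TiB and the 2h35m as exact):

```shell
# 2.76 TiB scrubbed in 2h35m, expressed in MiB/s:
awk 'BEGIN { printf "%.0f M/s\n", 2.76 * 1024 * 1024 / (2*3600 + 35*60) }'
```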
 