ZFS write performance issues with WD20EARS

I've seen a couple of other posts about using zfs on WD EARS drives which have the fun
feature of using 4096 byte sectors while reporting 512 byte sectors to the OS.

I tried the trick of using gnop to make a new device that reports itself as having 4096 byte sectors, but the improvements seem to be superficial.

system specs:
22x 2TB WD20EARS attached to a single 3ware 9690SA-4I with a SAS expander in the backplane.
They are configured as 22 separate 'single' units on the controller, and the cache is on.

# tw_cli /c0 show
Code:
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    SINGLE    OK             -       -       -       1862.63   RiW    OFF

The CPU is a Xeon 3450 @ 2.67 GHz.
8GB ECC DDR3
FreeBSD 8.1-BETA1 amd64
The drives are in a 3x (7-disk raidz) configuration. I've tried using the last drive as a
hot spare, a log device, and a cache device; all three setups have the same performance issues.
Code:
vm.kmem_size="4G"
vm.kmem_size_max="4G"
vm.zfs.arc_max="2G"

Before gnop, I was getting about 1.5 MB/s write speeds.
Now it looks like I'm getting about 3 MB/s.

I've tested the bandwidth to the drives, and found that using 20 simultaneous dd processes, I can write 15-30MB/s to each drive.

I've tested by dd'ing into a file on the ZFS file system, and by copying a file over rsync,
over NFS, and from a local UFS2 partition.

Looking at the pool with zpool iostat, writes appear to occur only sporadically (this is with gnop).

# zpool iostat 1
Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
store        114G  37.8T      1     19   127K  1002K
store        114G  37.8T      0      0      0      0
store        114G  37.8T      0      0      0      0
store        114G  37.8T      2    252  12.0K  12.2M
store        114G  37.8T      0      0      0      0
store        114G  37.8T      1    118  7.97K  10.9M
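
To put a number on how bursty that output is, here is a small sketch that averages the write-bandwidth column of the six samples above. It assumes the K/M suffixes are binary units (1024-based), and the helper name `to_bytes` is just for illustration.

```python
# Write-bandwidth column from the six one-second `zpool iostat 1` samples above.
samples = ["1002K", "0", "0", "12.2M", "0", "10.9M"]

# Assumption: zpool's K/M/G suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def to_bytes(field):
    """Convert a zpool-style size such as '12.2M' to a byte count."""
    if field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)

rates = [to_bytes(s) for s in samples]
idle = sum(1 for r in rates if r == 0)
average_mib = sum(rates) / len(rates) / 1024**2
peak_mib = max(rates) / 1024**2

print(f"idle samples: {idle} of {len(rates)}")
print(f"average write rate: {average_mib:.1f} MiB/s, peak: {peak_mib:.1f} MiB/s")
```

Half of the samples show no writes at all, and the average over the window works out to roughly 4 MiB/s even though individual bursts exceed 10 MiB/s.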

I've also seen posts about writes starving everything else, causing stalls and only intermittent disk activity,
so I tried following the advice for that by setting vfs.zfs.txg.write_limit_override.
I am unsure of the units for that, so I tried a bunch of different values: 256, 262144, and 268435456 (and some others).
Mostly I saw either no change in speed, or it dropped to under 100 KB/s (262144 did that).
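
For reference, here is what those three values come to if the tunable is interpreted as a byte count. That interpretation is an assumption on my part, not something confirmed in this thread, and the `human` helper is purely illustrative.

```python
# Values tried for vfs.zfs.txg.write_limit_override.
# ASSUMPTION: the tunable is a size in bytes; this thread does not confirm the unit.
tried = [256, 262144, 268435456]

def human(n_bytes):
    """Render a byte count using binary units (B, KiB, MiB, GiB)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n_bytes < 1024 or unit == "GiB":
            return f"{n_bytes:g} {unit}"
        n_bytes /= 1024

for value in tried:
    print(f"{value:>10} -> {human(value)}")
```

Under that reading the values are 256 B, 256 KiB and 256 MiB, so the two smaller settings would cap each transaction group far below the size of a single large write.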

I've tried using only 4 disks, and it looked like it was working better, but I did not do thorough testing.
I'm sure there is more useful information that I haven't provided, and I am more than willing to provide anything that I've left out.
 
Have you seen this post?

I'd be interested in seeing the results of your more thorough testing on a configuration with a larger number of smaller raidzs.
 
I don't really know, but maybe the 3ware 9690SA can't handle that sort of drive with the present firmware version.

At least it has not been tested with it:
http://www.3ware.com/products/pdf/Drive_compatibility_list_9690SA_9.5.3codeset_900-0070-02RevK.pdf

Also, it is a "desktop drive" and is not certified for 24/7 operation. It could still be a good drive, but I don't know whether LSI/3ware will care about supporting it.

In the last few months I have seen some posts in German forums where people reported performance problems with the WD20EARS + 3ware controllers.

In your case I would contact LSI/3ware support and ask for a solution. I hope there is one.
 
I have a similar setup, and am having similar issues. However, I think the problem lies with the ZFS configuration, and not with the 3ware / EARS combination. Here's why.

There are ten 2TB EARS drives connected to a 3ware 9650SE-24M8, and 8 of them are configured (with gnop in between for 4K sectors) in a pool with two four-drive raidzs.

First, I make a memory disk to hold the test data, so we won't be bound by the speed of /dev/random.

# mdmfs -s 500m md2 /mnt

Next, I have a script for grabbing 400 megs of random junk.
Code:
# cat freshen.sh
dd if=/dev/random of=/mnt/a bs=1m count=400
Copying straight to the device (even without gnop) is quite zippy.
Code:
# sh freshen.sh
419430400 bytes transferred in 5.719221 secs (73336979 bytes/sec)
# dd if=/mnt/a of=/dev/da9 bs=1m
419430400 bytes transferred in 3.333524 secs (125821923 bytes/sec)
But copying onto the zfs filesystem is much slower.
Code:
# sh freshen.sh
419430400 bytes transferred in 5.721352 secs (73309667 bytes/sec)
# dd if=/mnt/a of=/tank/tmp/a bs=1m
419430400 bytes transferred in 25.927887 secs (16176806 bytes/sec)

So it's gotta be an issue with the ZFS config, right?
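
To make the gap concrete, this sketch parses the two dd summary lines above and compares them. The regular expression only assumes the "N bytes transferred in T secs" wording shown in the output; the function name is illustrative.

```python
import re

# dd summary lines copied from the runs above (raw device vs. file on the pool).
raw_line = "419430400 bytes transferred in 3.333524 secs (125821923 bytes/sec)"
zfs_line = "419430400 bytes transferred in 25.927887 secs (16176806 bytes/sec)"

DD_SUMMARY = re.compile(r"(\d+) bytes transferred in ([\d.]+) secs")

def mib_per_sec(line):
    """Recompute throughput in MiB/s from a dd summary line."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a dd summary line: {line!r}")
    return int(match.group(1)) / float(match.group(2)) / 1024**2

raw = mib_per_sec(raw_line)
zfs = mib_per_sec(zfs_line)
ratio = raw / zfs

print(f"raw device: {raw:.1f} MiB/s")
print(f"zfs file:   {zfs:.1f} MiB/s")
print(f"raw device is {ratio:.1f}x faster")
```

That is a single raw disk at about 120 MiB/s against about 15 MiB/s for the whole eight-disk pool, a factor of nearly eight.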

mjrosenb, do you have a similar result? That is, if you copy directly to the device, do your benchmarks perform well?
 
sub_mesa said:
zpool status output?

Here is my config.

Code:
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da5.nop  ONLINE       0     0     0
            da6.nop  ONLINE       0     0     0
            da7.nop  ONLINE       0     0     0
            da8.nop  ONLINE       0     0     0

errors: No known data errors
 
Yes, writing to a single drive maxed out at about 90 MB/s, and I was able to write to 20 drives at the same time at 20 MB/s *each*.

I tried creating 5x (4-disk raidz). Initially the performance was good (as with 3x (7-disk raidz)); however, it quickly dropped off.
Code:
  9386557875 100%   25.25MB/s    0:05:54 (xfer#1, to-check=799/801)
  4699592172 100%    7.39MB/s    0:10:06 (xfer#2, to-check=798/801)
  4694425121 100%    5.38MB/s    0:13:51 (xfer#3, to-check=797/801)
  9391526150 100%    4.71MB/s    0:31:43 (xfer#4, to-check=796/801)
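
A quick sketch of how the rate falls as data accumulates, using only the four progress lines above. It assumes rsync's "MB" means 2^20 bytes (the recomputed figures agree with the reported ones under that reading); `parse` is an illustrative helper.

```python
# Progress lines copied from the rsync run above.
lines = """\
  9386557875 100%   25.25MB/s    0:05:54 (xfer#1, to-check=799/801)
  4699592172 100%    7.39MB/s    0:10:06 (xfer#2, to-check=798/801)
  4694425121 100%    5.38MB/s    0:13:51 (xfer#3, to-check=797/801)
  9391526150 100%    4.71MB/s    0:31:43 (xfer#4, to-check=796/801)
""".splitlines()

def parse(line):
    """Return (bytes, elapsed seconds, reported MB/s) for one progress line."""
    size, _percent, reported, elapsed = line.split()[:4]
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return int(size), total_seconds, float(reported.replace("MB/s", ""))

transfers = [parse(line) for line in lines]

written = 0   # bytes already on the pool before each transfer starts
rates = []    # recomputed MiB/s per transfer
for size, seconds, reported in transfers:
    rate = size / seconds / 1024**2
    rates.append(rate)
    print(f"after {written / 1024**3:5.1f} GiB written: "
          f"{rate:5.2f} MiB/s (rsync reported {reported})")
    written += size
```

The first file of roughly 9 GB goes at full speed, and every transfer after it is slower than the one before.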
 
As a side note, from a speed perspective, does it really make sense to use one of the other drives as a log or cache device? I thought the idea behind those was that they would be hosted on speedier hardware than the main storage devices.

For the cache, for example, it uses main memory if you don't specify any cache devices. And for the ZIL, "the intent log is allocated from blocks within the main pool", according to the man page for zpool.
 
gah, why is there no "edit reply" button

# zpool status
Code:
  pool: store
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        store         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da0.nop   ONLINE       0     0     0
            da1.nop   ONLINE       0     0     0
            da2.nop   ONLINE       0     0     0
            da3.nop   ONLINE       0     0     0
            da4.nop   ONLINE       0     0     0
            da5.nop   ONLINE       0     0     0
            da6.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da7.nop   ONLINE       0     0     0
            da8.nop   ONLINE       0     0     0
            da9.nop   ONLINE       0     0     0
            da10.nop  ONLINE       0     0     0
            da11.nop  ONLINE       0     0     0
            da12.nop  ONLINE       0     0     0
            da13.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da14.nop  ONLINE       0     0     0
            da15.nop  ONLINE       0     0     0
            da16.nop  ONLINE       0     0     0
            da17.nop  ONLINE       0     0     0
            da18.nop  ONLINE       0     0     0
            da19.nop  ONLINE       0     0     0
            da20.nop  ONLINE       0     0     0

errors: No known data errors
and for the smaller vdev raidz:
# zpool status
Code:
 pool: store
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        store         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da0.nop   ONLINE       0     0     0
            da1.nop   ONLINE       0     0     0
            da2.nop   ONLINE       0     0     0
            da3.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da4.nop   ONLINE       0     0     0
            da5.nop   ONLINE       0     0     0
            da6.nop   ONLINE       0     0     0
            da7.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da8.nop   ONLINE       0     0     0
            da9.nop   ONLINE       0     0     0
            da10.nop  ONLINE       0     0     0
            da11.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da12.nop  ONLINE       0     0     0
            da13.nop  ONLINE       0     0     0
            da14.nop  ONLINE       0     0     0
            da15.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da16.nop  ONLINE       0     0     0
            da17.nop  ONLINE       0     0     0
            da18.nop  ONLINE       0     0     0
            da19.nop  ONLINE       0     0     0

errors: No known data errors

As far as I can tell, it does not make sense to use one of the raid disks as a log or cache device, but I figured it could not hurt to try.
 
I was able to get decent write speeds (61MB/s) out of my array, but only by making a bunch of mirror pairs instead of raidzs. :(

I still haven't given up, though, as I really want that extra space.
 
I tried to put each disk at the top level and got speeds somewhere around 6MB/s.
I'm going to see if osol has similar behavior with this setup.
 
Check each disk separately. Also check the SMART data for each HDD for UDMA CRC Error Count, which indicates cabling problems.

It's entirely possible for one or more disks to kill the array's performance because they are duds or have cabling CRC errors.
 
ZFS works wonderfully on 4K disks IF AND ONLY IF the hard drive firmware reports a 4K physical sector size.

IOW, ZFS is broken on the WD 4K disks as they report a physical sector size of 512 B.

This is covered fairly often on the zfs-discuss mailing list. :) It applies to ZFS, not the OS it's running on. IOW, it affects OSol, Sol, FreeBSD, Linux (via FUSE), etc.

Until WD fixes their firmware, you cannot use their 4K disks with ZFS.
 
ZFS works wonderfully on 4K disks IF AND ONLY IF the hard drive firmware reports a 4K physical sector size.
I have read those, and it seems like the recommendation is "use gnop to create a dummy device that is a duplicate of the original device, but reports back a 4096 byte sector size".

Until Maxtor fixes their firmware, you cannot use their 4K disks with ZFS.
Maxtor? You mean WD? Pretty sure Maxtor was bought out several years ago at this point.
 
sub_mesa said:
Check each disk separately. Also check the SMART data for each HDD for UDMA CRC Error Count, which indicates cabling problems.

It's entirely possible for one or more disks to kill the array's performance because they are duds or have cabling CRC errors.

Neat suggestion; thanks. Although mine were all good in this respect.
 
I think everyone who owns these drives needs to send a letter to WD and let them know you want a firmware upgrade.

This is going to be a problem for quite a while. This is why I use Hitachi, Seagate, and SOMETIMES Samsung drives.

Right now I'm getting the best ZFS performance from the Hitachi 7200 RPM 2TB drives.
 
Yeah, being able to upgrade the firmware to report 4K sectors seems trivial, but it seems like gnop and/or other means should be able to circumvent these issues.

Increasing the amount of memory for ARC and the kernel seems to have increased the write speed that it bottoms out at; however, I am still confused as to why
it decreases over the course of 20GB or so, when this is significantly larger than the amount of memory on the system.
 
I've got the same problem. Western Digital offers a tool, which could solve the problem.
From http://www.amandtec.com
In order to solve the misalignment issue, Western Digital is offering two solutions. The first solution for correcting misaligned partitions is specifically geared towards Win 5.x, and that is an option on the drive itself to use an offset. Through the jumpering of pins 7 and 8 on an Advanced Format drive, the drive controller will use a +1 offset, resolving Win 5.xx’s insistence on starting the first partition at LBA 63 by actually starting it at LBA 64, an aligned position. This is exactly the kind of crude hack it sounds like since it means the operating system is no longer writing to the sector it thinks it’s writing to, but it’s simple to activate and effective in solving the issue so long as only a single partition is being used. If multiple partitions are being used, then this offset cannot be used as it can negatively impact the later partitions. The offset can also not be removed without repartitioning the drive, as the removal of the offset would break the partition table.

The second method of resolving misaligned partitions is through the use of Western Digital’s WD Align utility, which moves a partition and its data from a misaligned position to an aligned position.

Has anyone already tried this?
 
These drives (Advanced Format) work fine in Windows, and it seems that the Linux guys have already figured it out too. Reports say that performance is also good in Mac OS X. Not sure why FreeBSD is always way behind the curve.
 
I've got the same problem. Western Digital offers a tool, which could solve the problem.
I do not believe that ZFS's issue is alignment. The issue with older versions of Windows is that they write in 4K chunks at offsets that are multiples of 4K from the beginning of the partition. Since the default offset of the first partition is 63 sectors of 512 bytes each, the partition itself does not start on a multiple of 4096 bytes, which destroys the performance.
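
The arithmetic behind that is easy to check. The following is a minimal sketch; the function names are made up for illustration, and the 512/4096 figures are the logical and physical sector sizes discussed in this thread.

```python
LOGICAL_SECTOR = 512     # bytes per sector as reported by the drive
PHYSICAL_SECTOR = 4096   # bytes per sector actually used on the platter

def is_aligned(start_lba):
    """True if a partition starting at this LBA begins on a 4K boundary."""
    return (start_lba * LOGICAL_SECTOR) % PHYSICAL_SECTOR == 0

def physical_sectors_touched(start_lba, length_bytes=4096):
    """How many physical sectors one write at the partition start must touch."""
    first = (start_lba * LOGICAL_SECTOR) // PHYSICAL_SECTOR
    last = (start_lba * LOGICAL_SECTOR + length_bytes - 1) // PHYSICAL_SECTOR
    return last - first + 1

for lba in (63, 64):
    print(f"LBA {lba}: aligned={is_aligned(lba)}, "
          f"4K write touches {physical_sectors_touched(lba)} physical sector(s)")
```

Starting at LBA 63, every 4K write straddles two physical sectors, so the drive has to read, modify and rewrite both; starting at LBA 64, each 4K write maps onto exactly one.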

The issue with ZFS is that it attempts to pack data into sector-sized chunks. As long as the drive reports 512-byte sectors, ZFS will write in units of 512 bytes, so no matter where you start the partition, it will make no attempt to align to 4K boundaries. This is why gnop should theoretically fix the issue: if ZFS thinks the disk has 4K sectors, it should write everything out aligned to 4K. It is also recommended not to use a partition table at all and to let ZFS handle the raw device directly, so aligning the "partition" with the WD tool would most likely do nothing.
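
Here is a toy model of that packing behaviour. It is not how the ZFS allocator actually works; it only lays writes out back to back, rounded up to the sector size the disk claims, and the write sizes are hypothetical.

```python
def pack(sizes, sector):
    """Lay writes out back to back, rounding each up to a whole number of sectors."""
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        cursor += -(-size // sector) * sector   # round up to a multiple of `sector`
    return offsets

def aligned_fraction(offsets, physical=4096):
    """Fraction of writes that start on a physical 4K boundary."""
    return sum(1 for offset in offsets if offset % physical == 0) / len(offsets)

# Hypothetical mix of write sizes in bytes, chosen only for illustration.
sizes = [512, 1536, 4096, 2560, 8192, 1024, 3072, 4096]

for sector in (512, 4096):
    fraction = aligned_fraction(pack(sizes, sector))
    print(f"reported sector size {sector}: {fraction:.0%} of writes start 4K-aligned")
```

With a reported sector size of 512, only the very first write in this sample lands on a 4K boundary; with 4096, every write does, at the cost of some padding.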
 
The bottom line is WD just doesn't care about ZFS customers.

If they did, they'd release firmware for these drives.

The correct response is:

do not buy WD for ZFS.

If you're unlucky enough to have already done this, then you need to write them and ask for:

a working firmware
or a refund.

It's unlikely this will work at first, but if enough people voice this concern, it will eventually have an effect.
 
I wrote them, asking for this. I also encourage others to do the same.

In the meantime, gnop is working out just fine for me.
 
I wrote them, asking for this. I also encourage others to do the same.
Who did you write to, the customer service department?

Also, how do you have your disks attached to the system?

I am also beginning to wonder if this is an issue with the combination of raidz, WD*EARS and my 3ware RAID controller, e.g. ZFS says "write these 4K chunks to these 20 disks", then the 3ware card itself breaks them up into 512-byte chunks.
 