ZFS write performance issues with WD20EARS

sub_mesa said:
If the .nop procedure only has to be performed once, that would simplify setup, since you wouldn't need the .nop providers any more after a reboot, correct?
That would make these lying 4k drives much less discouraging to purchase. Can anyone confirm whether this works in practice? :)
 
Never mind whether it works in practice or not, does it actually make performance acceptable? The benchmark results from sub.mesa show negligible improvement at best. I'm starting to think ZFS and the drives aren't to blame for my situation at all. Even the UFS OS drive has poor performance through Samba (which is what I need).
 
I'd say the performance of 4K disks is very acceptable, especially considering the workloads these disks are most suitable for (sequential I/O), and they develop unreadable sectors less rapidly than 512-byte disks with their smaller per-sector ECC.

It would make sense to avoid some disk configurations when using 4K sector disks. A 5-disk RAID-Z will be good but a 6-disk RAID-Z will be bad. I posted a full list of these combinations somewhere; if you can't find it I'll write it again.

For Samba performance you would want async I/O enabled when compiling the samba port, as well as a client capable of async I/O; Windows 7 clients, for example, and I think Vista as well. XP is old and would indeed give lower performance. Tuning Samba may be worthwhile, but it appears to be less necessary on FreeBSD 8+, thanks to automatic TCP buffer size tuning.
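
Roughly, the relevant smb.conf settings would look like this; the 16KB thresholds are only a starting point to experiment with, not tested values, and the config path depends on how your port was installed:

Code:
# /usr/local/etc/smb.conf (path may differ for your install)
[global]
    # hand reads/writes larger than 16KB to async I/O
    # (requires samba built with AIO support)
    aio read size = 16384
    aio write size = 16384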
 
In case anyone is curious about the performance of mixed sector-size pools, I replaced two dead disks in a legacy 6-drive 512b raidz2 pool with two 4k WD greens (512b emulation mode).

Looking at gstat, it appears that write cycles sometimes end up aligned and sometimes not. When aligned I get better than 90MB/s; when misaligned, less than 10MB/s.

On average (say, writing 30GB), write speed seems to settle around 30MB/s, which is better than some results earlier in this thread with all drives running in emulated mode. So perhaps the more emulated drives there are in the pool, the higher the chance of misalignment on a given write cycle and the lower the performance.
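
For anyone who wants to watch the same thing, this is roughly how I was looking at it (the filter just limits gstat output to the ada disks):

Code:
# live per-provider I/O statistics, filtered to the ada devices
gstat -f ada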
 
So I'm about to redo my array, since it's currently 2x WD20EADS and 2x WD20EARS with 512-byte sectors and apparently I wouldn't be able to add an actual 4K (or gnop'ed) drive to the current array. Right now I know to gnop create -S 4096 at least one of the drives before doing the zpool create, but is there something else that I'm missing? Will there be an alignment issue? I'm looking for things that I can't do after the array is created...

Thanks!
 
palmboy5 said:
I see in the zdb output that ashift exists at the level above the drives, so ashift doesn't exist for each individual drive. Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to use 4K on all of them?

I only made one 4k gnop device, and after a reboot my pool still has ashift=12. So unless there is some parameter other than ashift, it works.
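
For reference, this is roughly how I checked it; "tank" is just a placeholder for your pool name:

Code:
# print the cached pool configuration and pick out the ashift value
# (ashift=12 means 2^12 = 4096-byte alignment, ashift=9 means 512 bytes)
zdb -C tank | grep ashift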
 
I can't seem to glabel the .nop drive
Code:
[root@brisbane-1 /dev]# gnop create -S 4096 /dev/ada0
[root@brisbane-1 /dev]# glabel label wd20ears01 /dev/ada0.nop
glabel: Can't store metadata on /dev/ada0.nop: Invalid argument.

Any help is appreciated! :|
 
Epikurean said:
I tried the suggested patch, but unfortunately it killed my ZFS Pool:

Maybe someone else can confirm this, but I'd guess the patch makes your system incompatible with pools created on an unpatched system. If you reboot with your old kernel you may still be able to get your pool back...
 
Does anyone know if there are still problems with the newer WD20EARS drives?
They've become REALLY cheap here now.
 
olav said:
Does anyone know if there are still problems with the newer WD20EARS drives?
They've become REALLY cheap here now.

Actually someone just posted a solution to one of the fundamental problems with 4k drives.

http://forums.freebsd.org/showthread.php?p=122617

The thread has gone off a bit onto how to measure performance, but the important thing is the trick for setting ashift correctly on a pool containing EARS or other 4k disks. You use gnop to temporarily create a 4k device in /dev, which is then used in the initial zpool create. Once the pool is created, the ashift is set correctly and can never be changed, even when you remove the gnop device and replace it with the regular disk device.
I've not tested this, but setting ashift is fundamental to ZFS performing correctly on 4k drives, so in theory this is a solution.

cheers Andy.

PS: misalignment will still be possible depending on how the disk is partitioned, so care is needed there as well. You can't just set ashift correctly and assume that's all there is to it.
 
Aha,

So if I have a raidz with 10 disks I only need to format the "first" one with gnop?

What about the spin-down after 8 seconds issue?
 
olav said:
So if I have a raidz with 10 disks I only need to format the "first" one with gnop?

Yep, and after the raidz creation you can destroy the one gnop device. I just tested this: ZFS automatically uses the real devices after you delete the gnop device. I.e., you export the pool, destroy the ada1.nop device, re-import the pool, and ZFS uses the ada1 device instead. So there is no further need for gnop apart from at the time of vdev creation.
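
To spell the whole thing out for anyone following along (device and pool names are only examples, adjust for your setup):

Code:
# put a temporary 4k provider on top of one member disk
gnop create -S 4096 ada1
# create the pool using the .nop device; ashift=12 is then set for the whole vdev
zpool create tank raidz ada1.nop ada2 ada3 ada4
# remove the gnop layer again
zpool export tank
gnop destroy ada1.nop
# on re-import ZFS picks up the plain ada1 device and ashift stays at 12
zpool import tank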

olav said:
What about the spin-down after 8 seconds issue?

Separate issue; this isn't related to ashift or 4k sectors.
 
olav said:
Does anyone know if there are still problems with the newer WD20EARS drives?
They've become REALLY cheap here now.

I have been running 6x WD15EARS (4k sector) for several months now with no problems... set ashift with gnop and you are good to go.
 
olav said:
What about the spin-down after 8 seconds issue?

I've just noticed the following in the 8.2 Release Notes:

The ada(4) driver now supports a new sysctl(8) variable kern.cam.ada.spindown_shutdown which controls whether or not to spin-down disks when shutting down if the device supports the functionality. The default value is 1.[r215173]

I wonder if this will prevent these Green drives from powering themselves down?
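
I haven't tried it yet, but toggling it should just be a matter of the following (assuming the sysctl behaves as the release note describes):

Code:
# disable spinning the disks down at shutdown
sysctl kern.cam.ada.spindown_shutdown=0
# or make the setting persistent across reboots
echo 'kern.cam.ada.spindown_shutdown=0' >> /etc/sysctl.conf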
 
Hello, I'll just join the discussion.

I've got this baby: an HP ProLiant N36L (pdf)
plus 3x WD20EARS to use in a raidz1 pool (FreeNAS 0.7.2 w/ ZFS v3 or v13, I think :q).

If I format them with ufs, I get dd write speeds of ~120mbyte/s.
If I use them in the 3-drive raidz1 pool, I get dd write speeds of ~35mbyte/s.
I've used gnop pseudo-4k-devices (ad*.nop) in both cases.

For days I've been crawling the web, reading threads like this one, and I've completely lost my head over this.

  1. If I had known about the 4k issues with raidz, I wouldn't have bought such drives.
  2. 4k drives seem to be the future. What's the future on the software side? When will zfs natively deal with this?
  3. I've spent 215 euros on these 3 drives and I sincerely believe that there's a way to have them perform nearly as fast as in single use (as I stated above).
  4. Afaik, the Samsung F4 emulates 512b just like the WD20EARS. Then how is it that those drives perform better in a raidz pool?
  5. Everyone who's reporting NO PROBLEMS with 4k + raidz, for god's sake, tell us your read/write speeds! "no problems" doesn't necessarily mean "great speed", as I also have "no problems" using them but also "no speed"...
 
bfreek said:
Everyone who's reporting NO PROBLEMS with 4k + raidz, for god's sake, tell us your read/write speeds!

Code:
$ dd if=/tmp/blah of=blah bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 2.584883 secs (405657012 bytes/sec)

/tmp/blah is on a TMPFS mount, generated from /dev/urandom.

5 Samsung F4 drives in raidz1.
 
bfreek said:
If I format them with ufs, I get dd write speeds of ~120mbyte/s.
If I use them in the 3-drive raidz1 pool, I get dd write speeds of ~35mbyte/s.
I've used gnop pseudo-4k-devices (ad*.nop) in both cases.
Quick point: write performance will be lower on ZFS RAIDZ than on UFS; to what degree depends on a lot of things. About your setup, have you verified that your pool actually has ashift=12 set? Also, even with ashift set correctly you can have misalignment problems. You seem to be using the whole disk, which I assume will be OK; maybe others with "no problems" can comment on what they have done to avoid these issues.

bfreek said:
  1. 4k drives seem to be the future. What's the future on the software side? When will zfs natively deal with this?


  1. ZFS already supports 4k drives perfectly; unfortunately, due to lack of support in other OSes, current 4k drives all emulate 512-byte drives! When drives start reporting their real block size to ZFS, ZFS will work great without any messing about. In the mean time I agree it would be nice to have a fix for the 4k emulation fix :p

    thanks Andy.
 
ZFS performance is highly dependent on the amount of available memory. The amount of RAM a ZFS system has should be stated along with any speed test results.

I'm no expert on such things, but as I understand it you need to ensure that the size of a read or write test is large enough so that it will definitely involve accessing the disk immediately instead of just caching/buffering the entire I/O operation in RAM. A 1GB file read or write on a ZFS system with 4GB of RAM will probably give unrealistic results.

I too have just set up an HP ProLiant MicroServer with 8GB RAM and four 2TB Samsung F4 disks (which have 4KB sector sizes). I have a raidz1 pool with ashift=12 across those four disks. I ran some speed tests writing 100GB of /dev/zero to a file on the zpool. I don't have my exact results to hand as the server is currently packed up for a house move, but if I recall correctly the speeds were around the 150MB/sec mark.
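
For what it's worth, a 100GB write test of that sort is just something along the lines of the following (the target path is only an example):

Code:
# write roughly 100GB of zeros to a file on the pool (102400 x 1MB blocks)
dd if=/dev/zero of=/tank/test.dat bs=1m count=102400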

This is also with a sub-optimal number of disks in the raidz pool: 4 = 3 data + 1 parity.
 
AndyUKG said:
Quick point: write performance will be lower on ZFS RAIDZ than on UFS; to what degree depends on a lot of things. About your setup, have you verified that your pool actually has ashift=12 set? Also, even with ashift set correctly you can have misalignment problems. You seem to be using the whole disk, which I assume will be OK; maybe others with "no problems" can comment on what they have done to avoid these issues.
Sure, I just wanted to prove that the drives are capable of more.
But 120MB/s compared to 35MB/s is just too big a difference.

ashift: Yes, all drives are *.nop devices, at least at the time of zpool create.
And of course I have no partitions on them; the entire drive is used.

AndyUKG said:
ZFS already supports 4k drives perfectly; unfortunately, due to lack of support in other OSes, current 4k drives all emulate 512-byte drives! When drives start reporting their real block size to ZFS, ZFS will work great without any messing about. In the mean time I agree it would be nice to have a fix for the 4k emulation fix :p
I may be lacking a bit of understanding here: don't we circumvent the emulation with the *.nop method?
Are native 4k drives available already?

jem said:
ZFS performance is highly dependent on the amount of available memory. The amount of RAM a ZFS system has should be stated along with any speed test results.
Good point, I missed it: 1GB ECC RAM.
FreeNAS 0.7.2 (FreeBSD 7.2-based)

jem said:
I'm no expert on such things, but as I understand it you need to ensure that the size of a read or write test is large enough so that it will definitely involve accessing the disk immediately instead of just caching/buffering the entire I/O operation in RAM. A 1GB file read or write on a ZFS system with 4GB of RAM will probably give unrealistic results.
As stated above, 1GB. The results have been too bad to suggest any caching effect in my tests, though.

jem said:
I too have just set up an HP ProLiant MicroServer with 8GB RAM and four 2TB Samsung F4 disks (which have 4KB sector sizes). I have a raidz1 pool with ashift=12 across those four disks. I ran some speed tests writing 100GB of /dev/zero to a file on the zpool. I don't have my exact results to hand as the server is currently packed up for a house move, but if I recall correctly the speeds were around the 150MB/sec mark.

This is also with a sub-optimal number of disks in the raidz pool: 4 = 3 data + 1 parity.
150MB/s. My eyes are getting wet. What OS are you using?
To clear this one up: does the F4 do 4k emulation just like the WD20EARS? (That's what I know/have read...)
If that's the case, why do they perform better? RAM?

btw: Forced capitalization? The internet doesn't have time for such pointless things.
 
Firstly, don't use dd for performance testing. When I first started testing my own server I preferred the simple, clear throughput figure you get from dd, but I quickly realised it's all over the place. Just changing dd's blocksize option can make a massive difference to the result. Please use something like bonnie with a test file at least twice the size of your RAM.
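
For example, on a box with 2GB of RAM, something like this (if I remember bonnie's flags right: -d is the scratch directory, -s the file size in MB; the directory is only an example):

Code:
# 4096MB test file, i.e. twice the 2GB of RAM, written to a directory on the pool
bonnie -d /tank/bench -s 4096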

Looking at your results, I doubt you're actually getting a real sustained 120MB/s throughput on UFS. jem's result of 150MB/s also seems unrealistic (although possible with fast hardware). It's well known that using /dev/zero for throughput testing is a bad idea.

You also have to remember that ZFS is a complex filesystem. It has to calculate all the checksums and redundancy data and write all of this to disk along with metadata and backup metadata. It's designed to run on machines with at least 2GB of RAM (preferably 4) and modern, powerful CPUs. You'll probably see an improvement by increasing RAM and moving to FreeBSD 8.2 amd64 with ZFS v15.

In regard to Samsung/WD 4k drives, they both do the same emulation.

Just for comparison, here are some quick throughput figures from my system using bonnie (and dd) with 4GB test files. (I have four 2TB 512b disks in raidz.)

Bonnie Write: 86MB/s
Bonnie Read: 130MB/s
Bonnie Write (compress=on): 95MB/s
Bonnie Read (compress=on): 104MB/s (Think my poor atom is struggling here...)

I think these are reasonable results for a zfs redundant array of 4 consumer disks with an atom d525 cpu / 2GB ram (FreeBSD 8.2 amd64).

Completely useless dd results:

Write from /dev/random and 8m block size to file: 19MB/s
Write from /dev/zero and 8m block size to file: 150MB/s
Write from /dev/zero with 1m block size: 379MB/s
Read with standard block size: 27MB/s
Read with 1m block size: 192MB/s
 
bfreek said:
I may be lacking a bit of understanding here: don't we circumvent the emulation with the *.nop method?
Are native 4k drives available already?

I was responding to your question of when ZFS will natively deal with 4k disks; as I stated, it already does. What ZFS doesn't currently deal with natively is 4k disks that report 512-byte sectors to the OS. My little afterthought was simply that, in the meantime, while vendors are forcing 4k drives with 512-byte emulation on us, it would be nice if this were automatically taken care of by ZFS. Using gnop is obviously not "native"; it will be missed by some people completely and it still allows for misalignment (when using partitions).

Andy.

PS: or perhaps the answer you were looking for is that gnop does not directly circumvent the 512-byte emulation. The disk presents each 4096-byte sector to the OS as eight 512-byte sectors, and gnop maps eight of those emulated sectors back into one 4096-byte sector. So you are not disabling the emulation, just putting another layer of abstraction on top to get back to where you started (i.e. the real physical sector size).
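
You can actually see the layering with diskinfo: the drive itself still claims 512-byte sectors while the .nop provider on top of it reports 4096 (device names are only examples):

Code:
# the raw drive reports its emulated 512-byte sectors
diskinfo -v ada1 | grep sectorsize
# the gnop provider reports the real 4096-byte sectors
diskinfo -v ada1.nop | grep sectorsize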
 
bfreek said:
FreeNAS 0.7.2 (FreeBSD 7.2-based)

You should note that ZFS has different versions. Each new version adds new features and sometimes dramatic performance improvements.

FreeBSD 7.2 has ZFS v6, 7.3 and 8.0 have v13, 8.1 has v14, 8.2 has v15 and 9.0 has v28.

v15, which is in the current freebsd-stable, already brings significant performance improvements over v13, which in turn is far better than v6. v28 has even greater improvements for NAS use, primarily because of the "slim ZIL" feature.
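
If you are not sure what you are running, you can check the version of an existing pool and what your system supports ("tank" is just an example pool name):

Code:
# show the version of an existing pool
zpool get version tank
# list the pool versions this system supports
zpool upgrade -v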

It is a real pity that FreeNAS has not moved to a more recent FreeBSD version.
 