ZFS write performance issues with WD20EARS

I also believe the patch works only for new pools. The problem is that even after recompiling the kernel without the patch, the pool has not come back.

I thought it would be best to leave a comment like this in the thread: I lost ~3.5TB of data. Fortunately for me it was only nonessential things; all the important stuff was on a backup.
 
What if you first used gnop on each drive (one at a time, allowing ZFS to recover each time), then tried the patch?

EDIT:
Also has anyone done performance comparisons for the patch yet?
 
After following Palmboy's suggestion, here is what I did

1. Created a new ZFS Pool
2. gnop create -S 4096 on each drive at the same time
3. copied some data on the pool
4. compiled and installed a new kernel with the patch
5. reboot

As expected, the *.nop drives are "lost" after a reboot, BUT the ZFS Pool is in perfect shape!

The only thing that bugs me is that there was no message whatsoever indicating the use of the .nop drives in the ZFS pool: no degraded state, and no indication that the *.nop drives are in use when running "zpool status" (except when I tried to replace my adX drive with the adX.nop drive, which didn't work: ZFS told me the .nop drive was already in use).
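
For anyone who wants a quick way to see which providers are actually in play, here is a rough sketch of some read-only checks (assuming a pool named tank; adjust the names to your setup):

Code:
gnop list            # shows the transparent .nop providers and their 4096-byte sector size
zdb | grep ashift    # ashift=12 means the pool uses a 4K minimum block size
zpool status tank    # lists the vdev members exactly as ZFS currently sees them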
 
Epikurean said:
I tried the suggested patch, but unfortunately it killed my ZFS Pool:

Code:
pool: tank
state: UNAVAIL
scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        UNAVAIL      0     0     0  insufficient replicas
	  raidz1    UNAVAIL      0     0     0  corrupted data
	    ad6     ONLINE       0     0     0
	    ad10    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0

It won't work because of the variable block size ZFS uses for RAIDZ.

Until firmware updates come out, avoid those drives for raidz.
 
wonslung said:
It won't work because of the variable block size ZFS uses for RAIDZ.

Until firmware updates come out, avoid those drives for raidz.
Believe me, I would have avoided the 4K drives if I had known then what I know now, but most of us in this thread have already invested in 4K drives and are stuck with them. As such we are trying to find workarounds to make these 4K drives behave adequately. I see that you repeatedly slam these 4K drives but otherwise are not offering any real help.
 
palmboy5 said:
Believe me, I would have avoided the 4K drives if I had known then what I know now, but most of us in this thread have already invested in 4K drives and are stuck with them.

Why not just SELL them and buy good ones?
 
To whom, and at what loss? They would sell for less than the purchase price AND cost money to ship on top of that. Not practical.
 
palmboy5 said:
To whom, and at what loss? They would sell for less than the purchase price AND cost money to ship on top of that. Not practical.

Have you ever heard of eBay, maybe? :)

You will probably lose a little money (used vs. new price), but they are under warranty, so there is not much difference in price, and shipping is paid by the buyer, so it costs you nothing to ship.

It seems it's not that big a problem if you want to stay with them (along with the bundled PITA) ;)
 
vermaden said:
It seems that this little patch can 'fix' issues with 4k WD Green drives:
http://lists.freebsd.org/pipermail/freebsd-fs/2010-October/009706.html


Code:
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
-*ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;
+*ashift = highbit(MAX(MAX(4096, pp->sectorsize), SPA_MINBLOCKSIZE)) - 1;
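
For anyone following along, applying a one-line source change like this means rebuilding and reinstalling the kernel. A rough sketch of the usual FreeBSD steps (this assumes the stock GENERIC configuration; substitute your own KERNCONF if you have one):

Code:
cd /usr/src
# edit sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c as shown above
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now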

Is this patch along the same lines as this one:
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

They both deal with ashift, setting the minimum block size for the pool to 4 KB, but they do it in two very different places in the code.

(I've posted a reply to that message to find out.)
 
phoenix said:
Is this patch along the same lines as this one:
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

They both deal with ashift, setting the minimum block size for the pool to 4 KB, but they do it in two very different places in the code.

(I've posted a reply to that message to find out.)


It looks like it's doing the same thing. The patch from the freebsd mailing list alters the calculation of ashift, while the solarismen.de solution bypasses the calculation and directly sets ashift to the correct value.

Assuming 4096 is larger than pp->sectorsize and SPA_MINBLOCKSIZE, and that highbit() returns the 1-based position of the highest set bit in an int, the FreeBSD patch also sets ashift to 12: highbit(4096) = 13, and 13 - 1 = 12, i.e. a minimum block size of 2^12 = 4096 bytes.

Oh, I got my Samsung Spinpoint F4 drives. They have 4k sectors but use 512b emulation, and there doesn't seem to be a way to turn the emulation off. My ashift is 9. :( However, I did see an improvement over my EARS drives, possibly due to the denser platters.

Code:
$ dd if=/dev/random of=./testfile bs=1m count=500
Running the above is about 10 seconds quicker, at 19.446565 secs (26960443 bytes/sec). It's still terrible, but it's better than before and I now have 2TB. :B
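
For anyone who wants a comparable before/after number, a minimal sketch of a sequential write/read test (the path, pool name, and size are just examples; write to a dataset on the pool under test, and note the ARC will inflate the read figure unless you test after a fresh boot):

Code:
# sequential write of ~500 MB of incompressible data
dd if=/dev/random of=/tank/testfile bs=1m count=500
# sequential read of the same file back
dd if=/tank/testfile of=/dev/null bs=1m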

I think I'll wait for an official fix before I try to recompile stuff on my own.
 
Instead, you may want to check a thread with valuable information regarding the Samsung F4 / WD EARS under FreeBSD + ZFS; these drives can perform quite nicely:

http://hardforum.com/showthread.php?t=1546137

More likely you just have some ZFS tuning to do. Feel free to use the benchmark feature described in that thread, so you can test your own disks and get the same kind of benchmark charts as are posted there.
 
After following Palmboy's suggestion, here is what I did

1. Created a new ZFS Pool
2. gnop create -S 4096 on each drive at the same time
3. copied some data on the pool
4. compiled and installed a new kernel with the patch
5. reboot

As expected, the *.nop drives are "lost" after a reboot, BUT the ZFS Pool is in perfect shape!

The only thing that bugs me is that there was no message whatsoever indicating the use of the .nop drives in the ZFS pool: no degraded state, and no indication that the *.nop drives are in use when running "zpool status" (except when I tried to replace my adX drive with the adX.nop drive, which didn't work: ZFS told me the .nop drive was already in use).


I did the same, and it works perfectly. :)

I hope there won't be any problems later on...
 
Does anyone have any before/after performance stats after applying this patch?

Having just bought 6 WD20EARS and then come across this issue, I want to know if it's worth applying the patch or just selling them and getting non-4k drives.
 
raab said:
Does anyone have any before/after performance stats after applying this patch?

Having just bought 6 WD20EARS and then come across this issue, I want to know if it's worth applying the patch or just selling them and getting non-4k drives.

I haven't tried it, but you may also check a method I found on [H]ard|Forum for picking a disk count that aligns the pool's stripes with the 4KiB sectors (the idea being that the 128KiB record, divided across the data disks, should come out as a whole multiple of 4KiB per disk):

Code:
disks   type     recordsize / ( disks - parity disks )   sector   status
3       raidz1   128KiB / 2                               64KiB   good
4       raidz1   128KiB / 3                               43KiB   BAD
4       raidz2   128KiB / 2                               64KiB   good
5       raidz1   128KiB / 4                               32KiB   good
6       raidz2   128KiB / 4                               32KiB   good
9       raidz1   128KiB / 8                               16KiB   good
10      raidz2   128KiB / 8                               16KiB   good

[H]ard|Forum --> http://hardforum.com/
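
To spell out the arithmetic behind the table, here is a small sketch in plain sh (131072 bytes is the 128KiB default recordsize; the disk counts are the examples from the table):

Code:
# 4-disk raidz1: 128KiB record split across 3 data disks
echo $((131072 / 3))   # 43690 bytes -- not a multiple of 4096, hence "BAD"
# 5-disk raidz1: 128KiB record split across 4 data disks
echo $((131072 / 4))   # 32768 bytes = 8 x 4096, hence "good"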
 
Yeah, I'll have 8 in total, but I only ordered 6 to ease the pain between pay cycles, and I'm hesitant to get an additional two, what with the 512-byte emulation issues.

Which thread in particular on hardforum were you referring to?
 
sub.mesa posted that table/chart in many threads, including, I believe, this one a few pages back. He says it is just a theory of his; it hasn't actually been proven yet.
 
phoenix said:
Is this patch along the same lines as this one:
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

They both deal with ashift, setting the minimum block size for the pool to 4 KB, but they do it in two very different places in the code.

(I've posted a reply to that message to find out.)

Those patches are not needed. It seems ZFS figures out the sector size of the zpool by looking at the sector size of the underlying "real" devices.

Now if you create your zpool on top of gnop devices emulating 4k sectors, your zpool will end up using 4k sectors (you can verify that by running zdb; it will display ashift=12, for 2^12 = 4k sectors). Then you can reboot, which removes the gnop 4k sector emulation. ZFS won't even notice, as it identifies the disks by GUID and not by device name.

Just do this:
Code:
gnop create -S 4096 adaX
for each device.
Create a zpool on top of the adaX.nop devices.
Check with zdb that the ashift value is correct:
Code:
zdb <poolname>
Reboot.
Recheck the ashift value: it's still 12, the zpool is fine, and the nop devices are gone.

And you have a zpool with 4k sector size without having to patch the source.
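
Spelled out end to end, a minimal sketch of that procedure for a hypothetical three-disk raidz1 (the device names ada0/ada1/ada2 and the pool name tank are just examples):

Code:
# create transparent 4K-sector providers on top of each disk
gnop create -S 4096 ada0
gnop create -S 4096 ada1
gnop create -S 4096 ada2

# build the pool on the .nop providers so ZFS picks ashift=12
zpool create tank raidz1 ada0.nop ada1.nop ada2.nop

# confirm the 4K minimum block size
zdb tank | grep ashift

# reboot; the .nop providers disappear, the pool imports on the raw
# ada devices, and zdb still reports ashift=12
shutdown -r now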
 
I see in the zdb output that ashift exists at the level above the drives, so ashift isn't stored per drive. Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to do 4K on all?
 
Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to do 4K on all?

Good question, could be. As I have already set up my drives and am using them again, maybe someone else wants to give it a shot?

Or even better, I guess this behavior should be documented somewhere in the ZFS docs.

Perhaps I will have a look over the weekend... but maybe someone else wants to try it...?
 
I was interested in finding this out the other day when I learnt that ashift appears to be stored at the vdev level. I wondered what would happen if you mixed drives, or used a new 4k-native device to replace a drive in a 512b vdev.

Here's what happens when you do it with mdX devices:

Code:
files-backup# mdconfig -a -t malloc -s 100M -S 512
md2
files-backup# mdconfig -a -t malloc -s 100M -S 4096
md3
files-backup#
files-backup#
files-backup# zpool create test mirror md2 md3 # <- 512b disk is specified first
files-backup# zdb |grep 'ashift'
                ashift=12
files-backup# zpool destroy test
files-backup#
files-backup# zpool create test md2
files-backup# zdb | grep 'ashift'
                ashift=9
files-backup# zpool attach test md2 md3
cannot attach md3 to md2: devices have different sector alignment
files-backup#

It appears ZFS uses the largest sector size among the disks you are adding to the vdev, so in theory you could gnop just one of them.
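
If that theory holds, a minimal sketch of the single-gnop variant would look like this (untested here; the device and pool names are hypothetical, and the expectation rests on the mirror experiment above):

Code:
# put a 4K nop provider on only one member of the new vdev
gnop create -S 4096 ada0
# ZFS should adopt the largest sector size in the vdev, giving ashift=12
zpool create tank raidz1 ada0.nop ada1 ada2
zdb | grep ashift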

Also, the ZFS designers have clearly thought about all this (although it isn't documented well), and you can't add a 4k-native disk to a 512b vdev (to be expected). That's not much of an issue at the moment, and maybe it never will be, but it means that in the future you *could* (unless there's some workaround I'm not aware of) be in trouble if you need to replace a failed disk in a 512b vdev and can only get hold of 4k-native disks.
 
If the .nop procedure only has to be performed once, that would simplify setup, since you won't need the .nop providers anymore after a reboot, correct? This makes it possible to automate 4K installations in my ZFSguru distribution with ease. Update: this feature is now implemented on the Pools->Create page, available in ZFSguru version 0.1.7-preview2c; update via System->Update.

@usdmatt
perhaps a GEOM class can be designed that emulates lower sector sizes. Right now you can go up using GNOP or GELI or something similar, but you can't go lower; that's usually a job for the filesystem to handle. But this may be an interesting small project for anyone interested. Something like:

Code:
gsect -S 512 /dev/my4Ksectdrive
and you would get:
Code:
/dev/my4Ksectdrive       (4K sector size)
/dev/my4Ksectdrive.sect  (512B sector size)
 