ZFS write performance issues with WD20EARS

Sorry. It's FreeBSD 7.3 actually. ZFS version is 13.

Your info leaves me thinking the 4k drives aren't the problem, but rather ZFS. I'm considering moving to zfsguru or my own BSD setup. All this is really sickening :(
 
Updates:

I installed FreeBSD 8.2, created a pool, installed Samba, shared the pool and put some files (6-8GB) on it (remotely from a Windows 7 machine).
Write speed: 60-70 mbyte/s
Read speed: 70-80 mbyte/s

I then did nothing for quite some time and eventually decided to do some more file copying.
All of a sudden, without further ado, transfer rates dropped to 15-20 mbyte/s in both directions.

Any explanation, anyone?
 
You might have the same problem I had. Memory leak (disappearing). It happens with the combination of 8.2-release and Samba.
 
bfreek said:
1gb ecc ram.

Sorry I didn't point this out earlier, but that there is your problem. ZFS wants LOTS of memory. I'd strongly recommend at least 4GB for a system running raidz pools. It's why I threw 8GB RAM into my MicroServer.
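If you are stuck with 1GB for the moment, the usual stopgap is to cap the ARC (and kmem on older setups) in /boot/loader.conf so ZFS doesn't fight the rest of the system for memory. The values below are only an example for a ~1GB amd64 box, not a recommendation - tune to taste:

Code:
# /boot/loader.conf - example limits for a low-memory ZFS box
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="256M"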
 
olav said:
You might have the same problem I had. Memory leak (disappearing). It happens with the combination of 8.2-release and Samba.
Please, elaborate on that. What's up with that (disappearing) memory leak?

jem said:
Sorry I didn't point this out earlier, but that there is your problem. ZFS wants LOTS of memory. I'd strongly recommend at least 4GB for a system running raidz pools. It's why I threw 8GB RAM into my MicroServer.
I now have 5GB of ram put into it. Nothing really changed -.-
 
It's a bug in 8.2-Release, there is already a patch for it available with 8.2-Stable. Use top to check if all your memory is there.
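A quick way to compare what the kernel sees against what's installed (just a rough check, your numbers will differ):

Code:
# memory the kernel detected at boot
sysctl hw.physmem hw.realmem
# then watch whether the Mem: line in top still adds up over time
top -b | head -n 8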
 
olav said:
It's a bug in 8.2-Release, there is already a patch for it available with 8.2-Stable. Use top to check if all your memory is there.

Have you got a link to a bug report or anything? What does the leak affect? ZFS?

ta Andy.
 
bfreek said:
I now have 5GB of ram put into it. Nothing really changed -.-

A lot of RAM won't help write performance in a benchmark (where the system is otherwise idle). RAM helps read performance with ZFS, and can help write performance on busy systems by avoiding reads from the disks which cause IO contention...
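If you want to see how much of that RAM ZFS is actually using for its read cache, the arcstats sysctls give a rough idea (names as I remember them on 8.x):

Code:
# current ARC size and its ceiling, in bytes
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max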

ta Andy.
 
vermaden said:
The Samsung F3 ones are fast and low on power at the same time (available sizes 500GB/1TB):
http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-10.html

The only drawback is a little worse 'access time' than in WD Black/Blue drives.
Not sure about the 2TB drives, but the 1.5TB samsungs are hopeless IME. I bought 5 of them, have RMA'd 3 of them and about to send back the 4th, in less than 1 year. I can't understand how they get good reviews on Newegg.
 
olav said:
Interesting, thanks!

carlton_draught said:
Not sure about the 2TB drives, but the 1.5TB samsungs are hopeless IME. I bought 5 of them, have RMA'd 3 of them and about to send back the 4th, in less than 1 year. I can't understand how they get good reviews on Newegg.
Because they test them under unrealistic circumstances, that is, they buy them and test them for a few days. This is nowhere near a real use case...
I have very bad experience with Samsung drives in the long run - return rates of 60-70% within 1-2 years.

In the meantime, I just set up FreeNAS again and up until now I'm getting good Samba speeds of 60-90 mbyte/s in both directions across my 1 Gbit LAN. (Drives set up via gnop 4k providers and the usual zpool raidz1 spanning my 3 WD20EARS. Disabled head parking on the WD20EARS as well.)
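For anyone wondering, the gnop trick itself boils down to something like this (pool and device names here are just placeholders):

Code:
# expose each disk as a 4k-sector provider, then build the raidz1 on the .nop devices
for i in 0 1 2; do gnop create -S 4096 /dev/ada$i; done
zpool create tank raidz1 ada0.nop ada1.nop ada2.nop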

Sidenote: The distinction between consumer and non-consumer drives is just bullshit as computers tend to run for hours each day even at home.
 
bfreek said:
Because they test them under unrealistic circumstances, that is, they buy them and test them for a few days. This is nowhere near a real use case...
Well, to newegg's credit most of the 3-star-or-less reviews involve an RMA. So judging reliability by the combined percentage of 1-3 star reviews is not a terrible methodology in the scheme of things. If you have a better one, I'm all ears. Perhaps only use reviews from the last 6 months and do the same thing. If I do this to the drive I bought, it jumps from 26% to 40%. That's only 56 reviews to judge them on (down from 505), but I suppose as long as your sample size is 30+ it's reliable enough.

And by contrast, if you look at what I see as the shining example of reliability for the money (it's an SSD of course, but at least it points to the utility of newegg as a review site), the Intel X-25 series scores something like 7% 3-star-or-less reviews. That's awesome. And when you browse to see why people actually gave them those ratings, it's stuff like "Never got the rebate", "cant use ssd toolbox with raid hp aloe motherboard", "Windows boot up wasn't all that impressive", and "Intel's Data Migration Utility software doesn't recognize the Intel SSD as made by Intel, so it won't even install!". Those are the 3 star reviews. And the 1 star? (There are no 2 star reviews.) 3/5 are non-reliability related! (2 are rebates, 1 doesn't know how to format a HDD.) To me, all those are actually really encouraging!

And when Intel publishes, with their new 320 series, that they had <1% return rates for their 2nd gen, it's believable to me. And it cross-checks with what I see on newegg and other sites (e.g. Amazon). And it is why, when I go to buy another SSD, I will be buying Intel despite other drives being reputedly faster.

Still, using that same methodology, Hitachi, which I think I'll try next, does not seem to be much better - 39% initially and 36% in the last 6 months. That is for the 7K2000. The 7K3000 scores only 14%, but all its reviews are from the last 6 months. Too early to tell, really. I don't think any drive that has been out a while gets less than about 35% in the last 6 months.

And if you look at the 2TB Samsungs, which can't have been out for much more than 6 months judging by the number of reviews, they start at 17% and climb to 22% in the last 2 weeks (i.e. after roughly 6 months). Again, the Hitachi 2TB 7K3000 appears to be better: the overall is 14%, dropping to 12% in the last two weeks, but with only 16 reviews, how reliable is that?

Meh. The only thing I can gather from this is that HDDs do not seem to be reliably manufacturable in the 1.5TB+ era. It's like riding a motorcycle - it's not a matter of if you come off, but when. With HDDs it's not a matter of if they will fail, but what percentage will fail in the first year of ownership.

Which is why running ZFS + redundancy + backups is a no-brainer for any data you care about these days.

I have very bad experience with Samsung drives in the long run - return rates of 60-70% within 1-2 years.
Well, I'm at 80% in 1 year. We'll see how I go.
Sidenote: The distinction between consumer and non-consumer drives is just bullshit as computers tend to run for hours each day even at home.
Exactly. I'm not sure what the solution is. At least Hitachi are honest enough to rate their drives for 24x7 usage, though they qualify that by saying it's for low duty cycle.
 
Recently bought a WD30EZRS, a 3TB 4K drive, and actually didn't get what all the fuss was about. I've been using it now for some time with zfs send/recv, as a secondary pool, replicating my primary for disaster recovery - no problemo. I just shrugged it off, thinking that I'm just lucky, or that it liked me better than everyone else... =)

Well... it didn't. Yesterday I set into place eight 1TB drives in raidz2 for my primary pool and (until it's filled) one 3TB for the secondary, adding another 3TB later to match the size of the primary. So I had switched the polarity of the pools, so that I could tear down my previous primary 4-drive pool and build this:

Code:
[karli@main ~]$ zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

	NAME               STATE     READ WRITE CKSUM
	pool1              ONLINE       0     0     0
	  label/rack1:1    ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: scrub completed after 2h59m with 0 errors on Fri Apr  8 16:48:04 2011
config:

	NAME                STATE     READ WRITE CKSUM
	pool2               ONLINE       0     0     0
	  raidz2            ONLINE       0     0     0
	    label/rack-1:2  ONLINE       0     0     0
	    label/rack-1:3  ONLINE       0     0     0
	    label/rack-1:4  ONLINE       0     0     0
	    label/rack-1:5  ONLINE       0     0     0
	    label/rack-2:1  ONLINE       0     0     0
	    label/rack-2:2  ONLINE       0     0     0
	    label/rack-2:3  ONLINE       0     0     0
	    label/rack-2:4  ONLINE       0     0     0
	cache
	  label/cache1      ONLINE       0     0     0
	  label/cache2      ONLINE       0     0     0

errors: No known data errors

Then, when it was time to replicate everything back from the single 3TB pool1 to the eight 1TB drives of pool2, it backfired - with force. /var/log/messages was flooded with ZFS errors and the send/recv failed shortly after. Scrubbing just made it worse. Fortunately for me, these things have made me paranoid enough to keep another, somewhat older backup on an ordinary 2TB drive, from the last time the entire system almost went down the drain. What doesn't kill you...
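For the record, the replication itself was nothing exotic - basically a recursive snapshot piped across, something along these lines (the snapshot name here is made up):

Code:
zfs snapshot -r pool1@backup
zfs send -R pool1@backup | zfs receive -vFd pool2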

So to sum up, don't think you're safe just because you can write to it, 'cause once you try to read that data out again, you're gonna be sorry if you haven't gnop'ed it first!

/Sebulon
 
carlton_draught said:
And when Intel publishes, with their new 320 series, that they had <1% return rates for their 2nd gen, it's believable to me. And it cross-checks with what I see on newegg and other sites (e.g. Amazon). And it is why, when I go to buy another SSD, I will be buying Intel despite other drives being reputedly faster.

No doubt. I bought a 40G 320 this week even though it is slower and more expensive than the alternatives.
 
I'm not sure if I get this right. I have 6 of those drives, which I plan to use for a raidz2 pool. Do I have to worry about the alignment if I use the gnop method to create the zpool? And if so, would this achieve what is necessary?

Code:
for i in /dev/ada*;do dd if=/dev/zero of=$i bs=1m count=1;done
for i in {0..5};do glabel label disk$i /dev/ada$i;done
for i in /dev/ada*;do gnop create -S 4096 $i;done
for i in /dev/ada*;do gpart create -s gpt $i;done
for i in /dev/ada*;do gpart add -t freebsd-zfs -b 2048 $i.nop;done
zpool create storage raidz2 ada0p1.nop ada1p1.nop ada2p1.nop ada3p1.nop ada4p1.nop ada5p1.nop
 
Sebulon said:
Recently bought a WD30EZRS, a 3TB 4K drive, and actually didn't get what all the fuss was about. I've been using it now for some time with zfs send/recv, as a secondary pool, replicating my primary for disaster recovery - no problemo. I just shrugged it off, thinking that I'm just lucky, or that it liked me better than everyone else... =)

Well... it didn't. Yesterday I set into place eight 1TB drives in raidz2 for my primary pool and (until it's filled) one 3TB for the secondary, adding another 3TB later to match the size of the primary. So I had switched the polarity of the pools, so that I could tear down my previous primary 4-drive pool and build this:

The issue doesn't affect single-drive or mirrored pools anywhere near as much as it does raidz and raidz2.
 
carlton_draught said:
Exactly. I'm not sure what the solution is. At least Hitachi are honest enough to rate their drives for 24x7 usage, though they qualify that by saying it's for low duty cycle.

I've used TONS of Hitachi 7k2000 drives with raidz and have had hardly any fail. I've had a COUPLE DOA, but I attribute this to the fact that Hitachi used glass platters, which can get broken during shipping.

I've recently started using the 5k3000's and 7k3000's and so far haven't had to return any.

Now, I am aware this is anecdotal evidence, but I've used about 300+ of the 7k2000's in different servers (most running OpenSolaris or OpenIndiana).

Take it for what it's worth.
 
fadolf said:
I'm not sure if I get this right. I have 6 of those drives, which I plan to use for a raidz2 pool. Do I have to worry about the alignment if I use the gnop method to create the zpool? And if so, would this achieve what is necessary?

Code:
for i in /dev/ada*;do dd if=/dev/zero of=$i bs=1m count=1;done
for i in {0..5};do glabel label disk$i /dev/ada$i;done
for i in /dev/ada*;do gnop create -S 4096 $i;done
for i in /dev/ada*;do gpart create -s gpt $i;done
for i in /dev/ada*;do gpart add -t freebsd-zfs -b 2048 $i.nop;done
zpool create storage raidz2 ada0p1.nop ada1p1.nop ada2p1.nop ada3p1.nop ada4p1.nop ada5p1.nop

I've only played around a little with gnop, but I have an idea that if you create a gnop device for, say, ada0, that doesn't mean that ada0p1.nop will also exist. If I'm wrong, then your steps look good; if I'm right, then you will need to do a gnop create for the ada*p1 devices... Apart from that, yes, you need to worry about alignment, but you are fine on that front using the "-b 2048" option with gpart.

thanks Andy.
 
Why are you labeling the disks, then partitioning the disks directly, then using the partitions to create the pool?

A simpler method is to just label the disks, create the gnop devices using the labels, then create the pool using the gnop devices:
Code:
$ for i in 0 1 2 3 4 5; do glabel label disk0$i ada$i; done
$ for i in 00 01 02 03 04 05; do gnop create -S 4096 label/disk$i; done
$ zpool create storage raidz2 label/disk00.nop label/disk01.nop label/disk02.nop label/disk03.nop label/disk04.nop label/disk05.nop
 
AndyUKG said:
I've only played around a little with gnop, but I have an idea that if you create a gnop device for, say, ada0, that doesn't mean that ada0p1.nop will also exist. If I'm wrong, then your steps look good; if I'm right, then you will need to do a gnop create for the ada*p1 devices... Apart from that, yes, you need to worry about alignment, but you are fine on that front using the "-b 2048" option with gpart.

You're actually right, or well, half right: upon creating the partitions on the ada*.nop devices, there were ada*.nopp1 slices, which I used to create the zpool, but when I exported it and destroyed the gnop devices, the zpool was gone too. Those probably weren't valid devices to begin with. I should probably have made ada*p1.nop devices.

How can I verify the alignment by the way?

phoenix said:
Why are you labelling the disks, then partitioning the disks directly, then using the partitions to create the pool?
A simpler method is to just label the disks, create the gnop devices using the labels, then create the pool using the gnop devices:
Code:
$ for i in 0 1 2 3 4 5; do glabel label disk0$i ada$i; done
$ for i in 00 01 02 03 04 05; do gnop create -S 4096 label/disk$i; done
$ zpool create storage raidz2 label/disk00.nop label/disk01.nop label/disk02.nop label/disk03.nop label/disk04.nop label/disk05.nop

The gnop devices are needed to tell the OS the disks have 4K sectors and the partitions are needed to get the proper alignment.
I only labelled the disks to know which one is connected to which port, in case the disks get disconnected from the cables (they also have a physical label though, but those come off more easily).

So taking everything into consideration, this would probably have been better (I have yet to try whether it works).

Code:
for i in {0..5};do glabel label disk$i /dev/ada$i;done
for i in /dev/ada*;do gpart create -s gpt $i;done
for i in label/disk*;do gpart add -t freebsd-zfs -b 2048 $i;done
gnop create -S 4096 label/disk0p1
zpool create storage raidz2 label/disk0p1.nop label/disk1p1 label/disk2p1 label/disk3p1 label/disk4p1 label/disk5p1
 
If you gnop the disk but then add a GPT table to the disk and add the partition to the pool ... you get an ashift of 9 (512 byte sectors).

You have to gnop the device that you add to the pool. Meaning, you have to gnop the partition in order to get ashift=12. Which means you can't label the disk.

And, you can't gnop a GPT label.

However, if you use the entire disk, you automatically get proper alignment, since you start at sector 0.

So, again, why not just label the disk, gnop the label, and add the gnop device to the pool?

I've been playing around with glabel, gpart, gnop, and zpool for a couple of hours now, and what you want to do is not possible.
 
Okay, after more testing, and realising the difference between # gnop create -s 4096 and # gnop create -S 4096 (lowercase -s sets the provider's media size, uppercase -S sets its sector size), the following is what you want:

Code:
# gpart create -s GPT ada0
# gpart add -b 2048 -t freebsd-zfs ada0
# gpart modify -i 1 -l disk1 ada0
# gnop create -S 4096 gpt/disk1
# zpool create poolname gpt/disk1.nop

That will:
  • create a single partition on the disk starting at 1 MB
  • label the partition with a meaningful name
  • configure the labelled partition GEOM provider to use 4 KB sectors
  • add the labelled/gnop'd partition to the pool, thus setting the ashift to 12

I'll leave it up to you to figure out how to script it. :)
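One more thing: once the pool has been created, the .nop provider is no longer needed, because the ashift is stored in the pool itself. The usual follow-up is roughly this (pool and label names as in the example above), and zdb lets you confirm the result:

Code:
# the ashift sticks with the pool, so the gnop layer can be removed
zpool export poolname
gnop destroy gpt/disk1.nop
zpool import poolname
# should report ashift: 12
zdb poolname | grep ashift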
 
phoenix said:
However, if you use the entire disk, you automatically get proper alignment, since you start at sector 0.

So if I just add a gnop provider with -S 4096 to the device, it will already be aligned correctly? So far the pool has an ashift of 12, which would be correct.
Is this the way it is supposed to look if it's aligned correctly?

Code:
geom label list ada0
Geom name: ada0
Providers:
1. Name: label/disk0
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada0
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2
 
FreeBSD 9.0-RELEASE, Atom D510, 2GB RAM, 2x WD20EARX

UFS (frag-size 4096) on gmirror = 110M/s write, 156M/s read
ZFS mirror = 39M/s write, 48M/s read
ZFS mirror on 4k nop device = 80M/s write, 82M/s read
 