ZFS: write performance challenges

I'm a happy FreeBSD user and have decided to dig into ZFS a bit more.
My current setup is as follows:
- Intel E7400 @ 2.8GHz
- MSI P43 mainboard
- 4GB of DDR2 PC6400 ram
- 2x WDC WD1200JD (120GB each, SATA-150) in a simple ZFS storage pool
- 5x WDC WD10EADS (1TB each, SATA-300) in a RAIDZ1 storage pool
The first two are connected to the onboard JMicron SATA controller, the latter five to the onboard Intel ICH10R controller.

I'm running FreeBSD 8.0 with ZFS version 13.

Over the last few days I've tried to optimize my ZFS performance as much as I could. /boot/loader.conf has been extended with the following ZFS-related settings:
Code:
vfs.zfs.prefetch_disable=0
vm.kmem_size="2048M"
vfs.zfs.arc_min="1024M"
vfs.zfs.arc_max="1536M"
vfs.zfs.vdev.min_pending=2
vfs.zfs.vdev.max_pending=8
vfs.zfs.txg.timeout=5
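
As a sanity check, here is a small sketch of how to confirm after a reboot that these tunables actually took effect; it assumes the corresponding read-only sysctls are exposed by this ZFS version.
Code:
# read back the loader.conf tunables the kernel is actually using
sysctl vm.kmem_size vfs.zfs.arc_min vfs.zfs.arc_max
sysctl vfs.zfs.prefetch_disable vfs.zfs.txg.timeout
sysctl vfs.zfs.vdev.min_pending vfs.zfs.vdev.max_pending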

Write tests have been performed on both pools by using the following command:
Code:
dd if=/dev/random of=./file1 bs=1M count=1024

Read tests have been performed by using the following command:
Code:
dd if=./file1 of=/dev/null bs=1M count=1024

After running these commands 5 times on both pools, I got the following (best-case) results:
on zpool "tank" (2x WDC WD1200JD):
Code:
Write:    68,050,720 bytes/sec (approx.  68 MB/sec)
Read:  4,338,138,680 bytes/sec (approx. 4.3 GB/sec)
on zpool "data" (5x WDC WD10EADS):
Code:
Write:    67,108,264 bytes/sec (approx.  67 MB/sec)
Read:  5,004,572,331 bytes/sec (approx. 5.0 GB/sec)

I do understand that reads are heavily affected by caching, which explains these tremendous numbers (the initial, uncached runs are always around 140 MB/sec for zpool "tank" and 230 MB/sec for zpool "data", so that part is fine).
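
If you want read numbers without the ARC helping, one crude way (just a sketch; it assumes "tank" is mounted at /tank, that file1 lives in its root, and that nothing else is using the pool) is to export and re-import the pool before the read test, which throws away its cached data:
Code:
# drop the pool's cached data, then repeat the read test uncached
zpool export tank
zpool import tank
dd if=/tank/file1 of=/dev/null bs=1M count=1024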

What I don't understand is why write performance seems "capped" at 67-68 MB/sec, regardless of the storage pool. I'd be happy with that figure coming from a single SATA-300 disk, but in this particular case I'm not too excited about the results.

What could be causing this limitation, and in which direction should I search to increase performance? I've searched this forum for topics related to ZFS write performance but found little. I hope you can help...
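
For reference, one direction to search is to watch the pool and the individual disks while the dd write is running; this is only a monitoring sketch, using the pool name from above:
Code:
# per-vdev / per-disk throughput while the write test runs (1-second interval)
zpool iostat -v data 1
# GEOM-level view with %busy per physical disk
gstat -a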

Looking forward to your thoughts!
Maurice
 
Root cause found!

I think I've found the solution myself, silly me :-)
It makes quite a difference if I use:
Code:
dd if=/dev/zero of=./file1 bs=1M count=1024

instead of the original:
Code:
dd if=/dev/random of=./file1 bs=1M count=1024

So apparently /dev/random was the bottleneck here!
I now manage to get the following results on both pools:
zpool "tank": 166,684,468 bytes/sec (approx. 166 MB/sec)
zpool "data": 369,130,567 bytes/sec (approx. 369 MB/sec)

This all makes perfect sense given the number of disks allocated to each of the two pools.
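
For what it's worth, the /dev/random theory is easy to confirm by reading it straight into /dev/null, which takes ZFS and the disks out of the picture entirely:
Code:
# measure how fast /dev/random itself can be read
dd if=/dev/random of=/dev/null bs=1M count=1024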

As such, this topic can be closed.
cheers!
 
I doubt your array is that fast; I think the original test is a closer indication of its real speed. You're just writing a whole bunch of zeros.

My understanding is that a raidz is only about as fast as a single disk within it. So your 5 disks will run at roughly the speed of 1, and I don't think a SATA WD disk can write at 350 MB/s :).

ZFS stripes across multiple top-level vdevs, so the more raidz vdevs you put in the pool, the faster it will be.
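
As an illustration of that point (a sketch only, with made-up device names), a pool built from two raidz vdevs stripes its writes across both top-level vdevs:
Code:
# hypothetical pool made of two 3-disk raidz vdevs; writes are striped
# across the two top-level vdevs
zpool create bigpool raidz ada1 ada2 ada3 raidz ada4 ada5 ada6
zpool status bigpool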
 
Correct. You should not be seeing more than about 100 MB/s of write throughput on any consumer-level SATA hard drive.

Due to the way raidz works, you are limited to the write IOPS of a single drive. Read IOPS can be greater than that of a single disk, depending on the data layout, stripe size, block size, etc. This is the main reason why all the recommendations are to keep raidz vdevs narrow (1-8 devices) but to have multiple raidz vdevs in a single pool.

If you need pure IOPS, go with multiple mirror vdevs.
If you need raw storage space, go with raidz. Which level of parity protection depends on your needs (raidz3 provides the best redundancy, but the worst performance; raidz1 is the reverse).
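
For comparison, a mirror-based layout of the kind described above might look like this (sketch only, made-up device names); each additional mirror vdev adds IOPS at the cost of capacity:
Code:
# hypothetical pool of two 2-disk mirror vdevs, favouring IOPS over space
zpool create fastpool mirror ada1 ada2 mirror ada3 ada4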
 
phoenix said:
Correct. You should not be seeing more than about 100 MB/s of write throughput on any consumer-level SATA hard drive.

A plain Samsung F3 1TB drive is capable of about 150 MB/s read and 140 MB/s write, sequential of course. While doing a zpool scrub I often see 180 MB/s in zpool iostat, and about 110 MB/s in a dd 'test'.

I have 3 such drives in a raidz pool.
 
Plan of approach

Hi all,

Very useful information!
I bought the WD drives mainly for their low energy consumption, but performance lags a bit behind. In the short term I think the easiest way to increase write performance is to add a sixth WD 1TB disk, so that I can set up the pool with two raidz sets of three disks each.
If I understand correctly, write performance should then go up.
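
Roughly, the plan would look like the sketch below (made-up device names). Note that a raidz vdev cannot be reshaped in place on this ZFS version, so the existing 5-disk pool would have to be backed up, destroyed and recreated:
Code:
# after backing up the data elsewhere, rebuild as two 3-disk raidz vdevs
zpool destroy data
zpool create data raidz ada2 ada3 ada4 raidz ada5 ada6 ada7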
 
vermaden said:
A plain Samsung F3 1TB drive is capable of about 150 MB/s read and 140 MB/s write, sequential of course. While doing a zpool scrub I often see 180 MB/s in zpool iostat, and about 110 MB/s in a dd 'test'.

I have 3 such drives in a raidz pool.

Hi Vermaden, what tunings are you using? I have 4 Samsung F1s in raidz and don't come near those speeds.
 
Blueprint said:
Hi Vermaden, what tunings are you using? I have 4 Samsung F1s in raidz and don't come near those speeds.
/boot/loader.conf
Code:
# boot delay
autoboot_delay="1"

# modules
zfs_load="YES"
aio_load="YES"
geom_mirror_load="YES"
ahci_load="YES"
coretemp_load="YES"
snd_hda_load="YES"
vboxdrv_load="YES"
vboxnetflt_load="YES"

# firefox HTML5 fix
sem_load="YES"

# zfs tuning
vfs.zfs.arc_max=1024M
vfs.zfs.prefetch_disable=0

# page share factor per proc
vm.pmap.shpgperproc=512

# avoid additional 128 interrupts per second per core
hint.atrtc.0.clock=0

# do not power devices without driver
hw.pci.do_power_nodriver=3

# disable throttling
hint.p4tcc.0.disabled=1
hint.acpi_throttle.0.disabled=1

# reduce sound generated interrupts
hint.pcm.0.buffersize=65536
hint.pcm.1.buffersize=65536
hint.pcm.2.buffersize=65536
hw.snd.feeder_buffersize=65536
hw.snd.latency=7

# ahci power management
hint.ahcich.0.pm_level=5
hint.ahcich.1.pm_level=5
hint.ahcich.2.pm_level=5
hint.ahcich.3.pm_level=5
hint.ahcich.4.pm_level=5
hint.ahcich.5.pm_level=5

This is my whole /boot/loader.conf, but only these settings directly relate to ZFS:

Code:
vfs.zfs.arc_max=1024M
vfs.zfs.prefetch_disable=0

The more important thing may be that I use AHCI in FreeBSD:
Code:
ada0 at ahcich0 bus 0 target 0 lun 0
ada0: <SAMSUNG HD103SJ 1AJ100E4> ATA/ATAPI-8 SATA 2.x device
ada0: 300.000MB/s transfers
ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada0: Native Command Queueing enabled

Other than that, it's still a stock amd64 8.0-RELEASE-p2.

The machine is an Intel Q35 motherboard + Intel Q8300 CPU + 4GB RAM.

EDIT:

I have Samsung F3 drives and you have Samsung F1 drives ...
 
I find dd if=/dev/zero very unreliable. For example, the first time I run it I get around 190 MB/s write, but the second time it's more like this:
Code:
[olav@zpool ~]$ dd if=/dev/zero of=/tank/raidz/file1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 1.739134 secs (617400257 bytes/sec)
600MB/s? Yeah right :p

/dev/random gives me around 67 MB/s, which can't be correct either, because I copy files from my Windows computer over Samba at 90 MB/s.
 
olav said:
I find dd if=/dev/zero very unreliable
That also depends on how much data you have written to the pool. For example, I got about 110 MB/s with a 24GB file from /dev/zero (which I am sure does not fit into any cache/RAM, since this machine has 4GB of RAM).
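
A rough rule of thumb is to make the test file several times the size of physical RAM so the ARC cannot hide the real disk speed; a small sh sketch (the target path is made up):
Code:
# size the test file to ~6x physical RAM (hw.physmem is reported in bytes)
RAM_MB=$(( $(sysctl -n hw.physmem) / 1048576 ))
dd if=/dev/zero of=/tank/testfile bs=1M count=$(( RAM_MB * 6 ))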
 
Hmm, if what you're saying is correct, then I can write 270 MB/s.

Code:
[olav@zpool ~]$ dd if=/dev/zero of=/tank/raidz/file3 bs=1M count=24576
24576+0 records in
24576+0 records out
25769803776 bytes transferred in 93.948762 secs (274296364 bytes/sec)

I still find that unbelievably fast; I have 5x 2TB 5400rpm disks in raidz1.
 
olav said:
Hmm, if what you're saying is correct, then I can write 270 MB/s.

Indeed strange:
Code:
% dd if=/dev/zero of=FILE bs=1M count=24576
24576+0 records in
24576+0 records out
25769803776 bytes transferred in 219.905149 secs (117185995 bytes/sec)

What are your system's ZFS tunings? (/boot/loader.conf | /etc/sysctl.conf)

I would assume that your setup should do about 90-100 MB/s at most; maybe ~90 seconds is a bit too short to measure the 'true' performance.
 
Well, I get almost the same speed when I create a 100GB file:
Code:
[olav@zpool ~]$ dd if=/dev/zero of=/tank/raidz/file3 bs=1M count=98304
98304+0 records in
98304+0 records out
103079215104 bytes transferred in 393.421886 secs (262006815 bytes/sec)

/boot/loader.conf
Code:
aio_load="YES"
ahci_load="YES"
vm.kmem_size="3072M"
vm.kmem_size_max="3072M"
vfs.zfs.arc_max="1504M"

/etc/sysctl.conf is empty
 
Same here!

Interesting to see the discussion going on, so I've tried the same experiment:

Code:
dd if=/dev/zero of=FILE bs=1M count=24576
Code:
24576+0 records in
24576+0 records out
25769803776 bytes transferred in 120.776447 secs (213367792 bytes/sec)

So that's 213 MB/sec with the 5-disk WD10EADS raidz1 pool!

my loader.conf settings:
Code:
vfs.zfs.prefetch_disable=0
vm.kmem_size="2048M"
vfs.zfs.arc_min="1024M"
vfs.zfs.arc_max="1536M"
vfs.zfs.vdev.min_pending=2
vfs.zfs.vdev.max_pending=8
vfs.zfs.txg.timeout=5
aio_load="YES"

Any thoughts?
 
Shuffle said:
Interesting to see the discussion going on, so I've tried the same experiment:

Code:
vfs.zfs.vdev.min_pending=2
vfs.zfs.vdev.max_pending=8

Do these settings really make any difference?
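
A quick first check, assuming these sysctls are exposed on this ZFS version, is to compare what the kernel is actually using with and without the loader.conf entries, and re-run the same dd test both ways:
Code:
# show the queue-depth tunables the kernel is currently using
sysctl vfs.zfs.vdev.min_pending vfs.zfs.vdev.max_pending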
 
Just to avoid confusion: I made a typo in my previous post! The article was not written by me; I just found it via Google.
 