8.2 zfs on i386 tuning help

Hi all,

I've had a stable ZFS setup for home use for a year or so now on an old dual-Xeon box with 2GB of RAM (sadly, it's so old that there's no 64-bit support), and it's been great. I started with 9 250GB PATA drives in a raidz1 array (6 in use, 3 idle) on a 3Ware card, and since they are old and crappy (Maxtors from 2002) I went through 3 drive failures. This was good - I got to see ZFS in action, I got to practice replacing drives, everything was great. I filled this array to about 90%, and my reads were around 100MB/s with writes around 80MB/s (bonnie++).
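For reference, my bonnie++ runs are along these lines; the directory and size below are just placeholders, the main thing being a file size well past RAM so the ARC can't cache the whole test:

Code:
# -d: test directory on the pool, -s: file size in MB (comfortably bigger than RAM), -n 0: skip the small-file creation tests
bonnie++ -d /tank1/bench -s 4096 -n 0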

I recently upgraded to 8.2, moved my root to mirrored (UFS2) CF cards, and added 4 Samsung HD103SJ 1TB drives hanging off of two PCI-X siis cards. 1 drive is a spare, the other 3 are in a raidz1 array.

All changes at once, yay for complicating troubleshooting!

I now see about 36MB/s reads and 29MB/s writes. This seems a bit low - even though I cut my number of spindles in half, I've got a faster controller, marginally better drives with more cache onboard each drive, plus stuff like NCQ. From seeing others with 3-disk arrays on consumer drives, I'd expect at least 80MB/s reads.

Given that I'm on i386, what are some of these crazy tunables I saw discussed on -stable recently that would be appropriate for this setup? Currently I've just got loader.conf set up like so to prevent panics from memory exhaustion:

Code:
# cap kernel memory so ZFS can't exhaust the i386 kernel address space
vm.kmem_size_max="1000M"
vm.kmem_size="1000M"
# keep the ARC small to avoid the kmem exhaustion panics I was seeing
vfs.zfs.arc_max="200M"
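These can be checked at runtime to confirm the loader settings actually took effect and to see how big the ARC really gets (sysctl names from memory, so double-check them on your version):

Code:
sysctl vm.kmem_size vm.kmem_size_max
sysctl vfs.zfs.arc_max
sysctl kstat.zfs.misc.arcstats.size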

I really wish the tunables were documented somewhere - the thread I can't find now from -stable had some interesting settings for someone with bad SMB performance, but it didn't give me much insight into why these various tunables need to be changed or how they all interact.

I do still need to grab the spare Samsung and see what kind of non-ZFS single-drive speeds I get from it with bonnie.
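In the meantime, a quick raw-device check would probably tell me most of what I want to know; the device name here is just a guess at what the spare shows up as on the siis card:

Code:
# sequential read straight off the raw device, bypassing ZFS entirely
diskinfo -tv /dev/ada3
dd if=/dev/ada3 of=/dev/null bs=1m count=4096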

Any tips welcome!
 
So you now have a pool with a 9-drive raidz1 and a 3-drive raidz1?

Or you created a new pool with just the 3-drive raidz1?

Or something else?

Output of # zpool status or # zpool iostat -v would be useful to understand how your pool(s) are set up.
 
phoenix said:
So you now have a pool with a 9-drive raidz1 and a 3-drive raidz1?

Had a 6 drive raidz1 pool which ate 3 drives over time. :) Then added the 3 drive raidz1 pool, copied the data over and removed the old 6 drive raidz1 pool.

phoenix said:
Output of # zpool status or # zpool iostat -v would be useful to understand how your pool(s) is/are setup.

I'm still at v14 as I'm not sure I'm committed to 8.2 yet.

Code:
[spork@media ~]$ zpool status

  pool: tank1
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: none requested
config:

	NAME              STATE     READ WRITE CKSUM
	tank1             ONLINE       0     0     0
	  raidz1          ONLINE       0     0     0
	    gpt/tank1-d0  ONLINE       0     0     0
	    gpt/tank1-d1  ONLINE       0     0     0
	    gpt/tank1-d2  ONLINE       0     0     0

errors: No known data errors

Here's some iostat output; the more repetitive and uniform the data, the shorter the snippet. Probably more than is useful, but I was running bonnie to generate some load, so I figured I might as well include it.

This is while bonnie is writing "a byte at a time":

Code:
[spork@media ~]$ zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.02T  1.70T      4      3   160K  49.0K
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      1     51  4.50K   490K
tank1       1.02T  1.70T      3     56  4.50K   108K
tank1       1.02T  1.70T      0      0      0      0
tank1       1.02T  1.70T      0      0      0      0

And during "writing intelligently":

Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.03T  1.70T      4      3   160K  49.2K
tank1       1.03T  1.70T      0    258      0  31.8M
tank1       1.03T  1.70T      2    521  3.99K  52.5M
tank1       1.03T  1.70T      3    781  7.49K  84.4M
tank1       1.03T  1.70T      2    780  3.99K  84.3M
tank1       1.03T  1.69T      0    295      0  36.2M
tank1       1.03T  1.69T      2    516  3.99K  48.2M
tank1       1.03T  1.69T      3    780  3.99K  84.3M
tank1       1.03T  1.69T      2    781  4.00K  84.4M
tank1       1.03T  1.69T      2    111  5.91K  13.8M
tank1       1.03T  1.69T      3    752  4.99K  71.0M
tank1       1.03T  1.69T      4    792  7.49K  84.4M
tank1       1.03T  1.69T      0    545      0  67.3M
tank1       1.03T  1.69T      4    237  6.49K  17.0M

And "rewriting":

Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.03T  1.69T      4      3   160K  50.9K
tank1       1.03T  1.69T    325      0  40.3M      0
tank1       1.03T  1.69T    201    537  25.0M  67.2M
tank1       1.03T  1.69T    195    253  24.2M  31.4M
tank1       1.03T  1.69T     86    708  9.51M  73.6M
tank1       1.03T  1.69T    284      0  35.4M      0
tank1       1.03T  1.69T    305      0  38.0M      0
tank1       1.03T  1.69T    131    845  16.0M  99.0M
tank1       1.03T  1.69T    242     62  29.2M   106K
tank1       1.03T  1.69T    320      0  39.7M      0
tank1       1.03T  1.69T    240    204  29.8M  25.2M
tank1       1.03T  1.69T     90    701  10.5M  73.9M
tank1       1.03T  1.69T    299      0  37.2M      0
tank1       1.03T  1.69T    312      0  38.7M      0
tank1       1.03T  1.69T    106    926  12.4M  99.2M
tank1       1.03T  1.69T    267      0  33.2M      0

And "reading a byte at at time":

Code:
[spork@media ~]$ zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.03T  1.69T      4      3   162K  52.2K
tank1       1.03T  1.69T      0      0   128K      0
tank1       1.03T  1.69T      0      0      0      0
tank1       1.03T  1.69T      0      0   128K      0
tank1       1.03T  1.69T      0      0   128K      0
tank1       1.03T  1.69T      0      0      0      0
tank1       1.03T  1.69T      0      0   128K      0
tank1       1.03T  1.69T      0      0      0      0

And "reading intelligently":

Code:
[spork@media ~]$ zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.03T  1.69T      4      3   162K  52.2K
tank1       1.03T  1.69T    395      0  49.1M      0
tank1       1.03T  1.69T    392      0  48.7M      0
tank1       1.03T  1.69T    410      0  51.0M      0
tank1       1.03T  1.69T    409      0  50.8M      0
tank1       1.03T  1.69T    429      0  53.3M      0
tank1       1.03T  1.69T    420      0  52.1M      0
tank1       1.03T  1.69T    410      0  51.0M      0

And lastly the "seeker - start 'em" sequence:

Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       1.03T  1.69T      4      3   163K  52.2K
tank1       1.03T  1.69T    176      0  22.1M      0
tank1       1.03T  1.69T    179      0  22.4M      0
tank1       1.03T  1.69T    166      0  20.7M      0
tank1       1.03T  1.69T    169      0  21.0M      0
tank1       1.03T  1.69T    162      0  20.1M      0
tank1       1.03T  1.69T    171      0  21.5M      0
tank1       1.03T  1.69T    174      0  21.9M      0
tank1       1.03T  1.69T    169      0  21.2M      0
tank1       1.03T  1.69T     98    946  10.7M  55.8M
tank1       1.03T  1.69T    154      0  19.1M      0

Maybe that gives some hint as to what's going on. Regardless, it's interesting to see how little data is moved during the per-character read/write. That's expected, looking at other bonnie runs from "real" boxes at work. Also I think I had my numbers reversed originally - my reads are actually slower than my writes in all cases.

The block read/write CPU usage is around 20% system, 50% bonnie. These are 2 x 2.4GHz Xeons from the era before Xeons were available with 64-bit extensions, with Hyperthreading enabled.
 
If using the "zpool iostat" command to watch throughput, be sure to use 1 second as the granularity. Otherwise, you get an average over the time period. I used to use 15 or even 30 so that I had time to read the output, and couldn't figure out why the numbers were low. Using 1 gave much more accurate numbers.

ZFS is very bursty, in that it accumulates small random writes into a single larger sequential write (transaction group) and then writes that out to disk in a burst. Then waits for the buffer to fill again, then writes it out. And so on.

Due to that, a lot of the "normal" filesystem benchmark tools aren't always accurate.
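If you want to see that behaviour directly, watch a 1-second iostat against the transaction group flush interval; the sysctl name below is from memory for 8.x, so double-check it exists on your box:

Code:
# seconds between forced transaction group flushes
sysctl vfs.zfs.txg.timeout
# the write bursts should line up with those flushes
zpool iostat tank1 1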

Are your 1 TB drives 7200 RPM? Or are they "green" 5900 RPM drives? That would (potentially) lower your throughput. Going from 6 drives in a raidz1 to 3 drives in a raidz1 would also affect your throughput (fewer disks to stripe across).

Are you using compression on the filesystem?

ZFS prefetch disabled (it should be by default with less than 4 GB of RAM)?

Why did you set your ARC so low (200 MB)? I can't access my home machine to check loader.conf settings, but my 2 GB i386 box uses around 1 GB or so of ARC, with kmem_size set around 1596 MB, I believe.
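Both of the earlier questions are quick to check; something like this (swap in your own pool/filesystem name) will show whether compression is on and whether prefetch really is disabled:

Code:
# compression setting on the filesystem
zfs get compression tank1
# 1 means prefetch is disabled (the default with less than 4 GB of RAM)
sysctl vfs.zfs.prefetch_disable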
 
phoenix said:
If using the "zpool iostat" command to watch throughput, be sure to use 1 second as the granularity.

Yep, these are all with 1-second samples.

phoenix said:
ZFS is very bursty, in that it accumulates small random writes into a single larger sequential write (transaction group) and then writes that out to disk in a burst. Then waits for the buffer to fill again, then writes it out. And so on.

Due to that, a lot of the "normal" filesystem benchmark tools aren't always accurate.

Yep. We run a ton of ZFS at work, all on amd64 with gobs of RAM and no real tweaking of any ZFS tunables - it "just works" and works well. I've sort of settled on bonnie as my "how does box X compare to box Y?" benchmark. Something doesn't quite feel right on this one though.

phoenix said:
Are your 1 TB drives 7200 RPM? Or are they "green" 5900 RPM drives? That would (potentially) lower your throughput. Going from 6-drives in a raidz to 3-drives in a raidz would also affect your throughput (less disks to stripe across).

Yep, these are "Spinpoint F3", 7.2K, 32MB cache. Supposedly they do not have the 4K sector size. I am using gpt partitions and did go ahead and do the alignment "fix" for good measure just in case.

phoenix said:
Are you using compression on the filesystem?

Nope.

phoenix said:
ZFS prefetch disabled (it should be by default with less than 4 GB of RAM)?

Yep: vfs.zfs.prefetch_disable: 1

phoenix said:
Why did you set your ARC so low (200 MB)? I can't access my home machine to check loader.conf settings, but my 2 GB i386 box uses around 1 GB or so of ARC, with kmem_size set around 1596 I believe.

That's just the number I arrived at when I first installed 8.0 on this hardware. I was able to reliably panic the box by unarchiving the ports tree, so I just kept lowering the ARC until it didn't panic anymore and then lowered it another 100MB or so for good measure. Considering how much has changed since 8.0 came out, maybe I can now bump that up... Right now I'm waiting on a scrub, and then I'll bite the bullet and upgrade the pool to v15. I don't think that's going to make a difference, but it can't hurt.
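If the scrub and the v15 upgrade go cleanly, I may try working up toward numbers like yours. From what I've read, pushing kmem that high on i386 also means a kernel built with a larger KVA_PAGES, so this is a plan rather than something I've tested:

Code:
# /boot/loader.conf - untested, roughly matching phoenix's 2GB i386 numbers
vm.kmem_size="1536M"
vm.kmem_size_max="1536M"
vfs.zfs.arc_max="1024M"

# and in the kernel config (i386 only), to enlarge the kernel address space:
# options KVA_PAGES=512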

I still need to benchmark a single drive for comparison I think...

Ah, found the thread that mentions a few of the mystery tunables:

http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/ce59028befdba655#

I'm sure part of the "fix" there was just AIO, Samba, and network tuning.
 