Yet another ZFS performance thread (I am sorry)

Hello:

My sincere apologies for yet another ZFS thread; I'm sure you're all sick of seeing these by now.

My hardware:
  • E5500 CPU
  • 4 GB of memory
  • FreeBSD 8.1
  • PCI controllers for additional SATA ports
  • 10 x 1 TB drives

I recompiled the kernel with options KVA_PAGES=512 and have set the following in /boot/loader.conf:
Code:
vm.kmem_size_max="1024M"
vm.kmem_size="1024M"

My purpose:
To create a single RAIDZ storage pool from the 10 drives (the real usable storage would be about 8 TB).

However, when I copy a file to the server over the gigabit network, ZFS performance is terribly slow. I get around 7 MB/s, which is worse than USB.

My goal is a write speed of 50 MB/s. Is this possible?
 
First, why are you limiting kmem to 1 GB? Are you running the 32-bit version of FreeBSD? Install the 64-bit version, and don't limit your kmem. That way, the ARC will expand to fill the available RAM.

Next, don't put all 10 drives into a single raidz vdev. Performance will be horrible (as you note) and if you ever need to resilver (replace) a drive, it will take a week or more to complete. Use multiple smaller vdevs (2x 5-disk raidz1, or 2x 5-disk raidz2, or 5x 2-disk mirror). Mirrors would give the best performance, at the cost of disk space.
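For example, something along these lines (the pool and device names are just placeholders for your actual disks):
Code:
# two 5-disk raidz1 vdevs in one pool (~8 TB usable with 1 TB drives)
zpool create tank raidz ada0 ada1 ada2 ada3 ada4 raidz ada5 ada6 ada7 ada8 ada9

# or five 2-disk mirrors (~5 TB usable, best performance)
zpool create tank mirror ada0 ada1 mirror ada2 ada3 mirror ada4 ada5 mirror ada6 ada7 mirror ada8 ada9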

Finally, I hope that's a typo, and you aren't using actual PCI SATA controllers, otherwise those will be a bottleneck. You should be using PCI-X or PCIe. Plain PCI tops out around 133 MB/s theoretical (closer to 100 MB/s in practice), and that bandwidth is shared by every device on the bus.
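You can confirm what kind of controller you actually have with:
Code:
# lists PCI devices with vendor/device names; look for your SATA controller
pciconf -lv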
 
Thanks for the reply:

phoenix said:
First, why are you limiting kmem to 1 GB? Are you running the 32-bit version of FreeBSD? Install the 64-bit version, and don't limit your kmem. That way, the ARC will expand to fill the available RAM.

Yes, I am using the 32-bit version. The reason I am limiting kmem to 1 GB is that with anything above 1 GB the OS won't boot; I have to escape to the recovery console and manually set those values back to 1 GB.

phoenix said:
Next, don't put all 10 drives into a single raidz vdev. Performance will be horrible (as you note) and if you ever need to resilver (replace) a drive, it will take a week or more to complete. Use multiple smaller vdevs (2x 5-disk raidz1, or 2x 5-disk raidz2, or 5x 2-disk mirror). Mirrors would give the best performance, at the cost of disk space.

I guess the problem here is that I want all the drives to appear as one giant drive; redundancy does not matter to me. But at the same time, if I lose a drive, I should still have all the other drives intact. The data on those drives is not important; if I lose it, I can get it back easily.

phoenix said:
Finally, I hope that's a typo, and you aren't using actual PCI SATA controllers, otherwise those will be a bottleneck. You should be using PCI-X or PCIe. Plain PCI tops out around 133 MB/s theoretical (closer to 100 MB/s in practice), and that bandwidth is shared by every device on the bus.

Unfortunately, being a poor college student does not afford me the luxury of all the latest gear :( Yeah, I am using a PCI SATA controller (http://canadacomputers.com/product_info.php?cPath=19_252_254&item_id=028984).

I suppose the worst case for me would be to just use the drives as they are, as single, individual drives, and avoid ZFS.
 
I'd recommend connecting more of the RAID disks to the mainboard SATA controllers, possibly swapping them with the system disk(s), which usually need less performance.
 
Thanks for the reply!

My motherboard has 4 SATA ports, and I used one for the OS. But the way I see things, my options are:
1) keep tweaking ZFS
2) switch to Ubuntu and just use the drives as they are. I am happy with 100 Mbps, since I am the only user.
3) switch to Windows Server.
 
billli said:
3) switch to Windows Server.

Didn't you say you wanted better performance???


I've just installed ZFS and haven't played with it much yet, but from what I've learned it's designed for a 64-bit OS.

The PCI controller, as stated above, is going to be a bottleneck. Consider: 10 drives sharing ~100 MB/s of bandwidth. That's only ~10 MB/s that can be read from or written to each drive at any given time, assuming no overhead. With a RAID5/raidz configuration, data is written to all the disks on every write. Limiting the ARC (basically a cache, from what I understand) to 1 GB would just exacerbate the problem under heavy load.

No software tweaks or changes are going to eliminate the limitations of the hardware you're using. If necessary, return one or two drives and use the money to buy another controller card or two, preferably of a type that has a faster bus to the system. Moving the data drives to the motherboard SATA headers should help, but still not provide the performance level you're looking for.
 
Ruler2112 said:
The PCI controller, as stated above, is going to be a bottleneck. Consider: 10 drives sharing ~100 MB/s of bandwidth.

My motherboard comes with 4 SATA ports; I purchased two PCI SATA cards with 4 ports each.

4 drives are plugged into the motherboard
4 drives into one PCI card and 3 into the other

So essentially I would be better off if I just used the drives as they are. In theory, sequential read/write on a single drive should achieve better results than 7 MB/s.
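As a quick sanity check, I could probably test raw sequential write speed locally with something like the following, to take the network and Samba out of the picture (the path is only a placeholder for wherever the pool ends up mounted):
Code:
# write a 4 GB file of zeros; FreeBSD's dd prints the transfer rate when it finishes
# /storage is just an example mountpoint
dd if=/dev/zero of=/storage/testfile bs=1m count=4096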

Thanks
B
 
You haven't given a real reason why you're using 32-bit. I am curious why you have gone against best practice for ZFS.
 
I'm using 32-bit simply because I didn't have a 64-bit copy on hand, but I'm going to try to install 64-bit and see where it gets me.

But from the above, I doubt I will get decent write speeds with ZFS due to my PCI card.
 
A larger ARC will give you faster write response times, so although the actual physical write will not be much faster, you will get improved performance as more can be cached.
 
I'm not sure why the ARC should have much (if any) impact on write speed. As far as I'm aware, it's just a read cache.
Writes are logged by the ZIL, which you'll only really speed up using a low-latency, high-IOPS log device (or by disabling it, which isn't recommended, as you'll probably lose the pool in the event of a power failure or crash).

First off, I would move to 64-bit and not bother doing any memory tuning, as the defaults should be fine with 4 GB of RAM.
I don't see any compelling reason to use 32-bit unless you're running a desktop and need to run closed-source 32-bit binaries. ZFS is much happier on 64-bit.

If your available memory comes to less than 4 GB and the boot process tells you it's disabling prefetch, you can manually re-enable it to help improve read speeds. Just add the following to /boot/loader.conf:

Code:
vfs.zfs.prefetch_disable=0
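You can check the current value with:
Code:
sysctl vfs.zfs.prefetch_disable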

Next, I would split the pool into 2x 5-drive raidz vdevs. You will still get 8 drives' worth of usable storage space. You can also try 5 mirror vdevs, if you are OK with 5 TB of space, and see how that performs. Use the PCI card for the OS drive and make use of all the mainboard ports for ZFS. It will still show up as one pool with x TB of space, regardless of how many vdevs you split the disks into.

After that, how you are writing will also make a big difference.
NFS performance (when mounted sync) is generally terrible unless you have high-end hardware, because every write request has to be fully flushed to disk. If you are using NFS, adding an SSD on the mainboard as a ZIL log device will massively improve write performance, but with the current version of ZFS in FreeBSD you'll lose the pool if that device fails (unless you fork out for two and mirror them).

In fact, an SSD ZIL device will greatly improve write performance however you are copying to the box, but obviously they cost money.
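For reference, adding a log device later is a one-liner (pool and device names are placeholders):
Code:
# single log device
zpool add tank log ada4

# or mirrored, so losing one SSD doesn't take the pool with it
zpool add tank log mirror ada4 ada5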

The home NAS I built a few months ago is a D525 Atom with 2GB memory and 4 x 2TB drives in one raidz. The 4 disks are connected to the board, configured as AHCI and running 8.1 amd64.
Over Samba (the only way I access files on the system currently), I get around 30 MB/s (240 Mbps) write throughput. This could probably be improved, as my Samba knowledge is pretty minimal, especially in regard to tuning.
 
If this is purely a storage box, you will be much better off with the experimental ZFS v28 code. There are already snapshots with this code available for download. It should be integrated into the mainline code very soon, I guess.
If not, by all means get FreeBSD 8.2 (or just freebsd-current); it's at RC2 as of now and will be released any time soon, but is already quite stable. The performance of ZFS v15 (in 8.2) is measurably better than that of ZFS v14 in 8.1. ZFS v28 should be much better in your case, because of the slim ZIL feature (from v23, I think).
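You can check which ZFS version the system and your pools are currently running with:
Code:
# shows the ZFS version the kernel supports and flags any pools on an older version
zpool upgrade

# lists all supported versions and their features
zpool upgrade -v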

Running amd64 FreeBSD with 4GB RAM should not require any tuning, unless you run lots of applications on that server.
 
billli said:
I guess the problem here is that I want all the drives to appear as one giant drive
If that is the case, why don't you create a pool from two raidz sets?
For example, if I have 10 disks of 512 MB capacity, I can create one big pool:

Code:
lab01# zpool create bigone raidz /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7
lab01#

lab01# zpool list bigone
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bigone  2.47G   132K  2.47G     0%  ONLINE  -
lab01#

lab01# zpool add bigone raidz /dev/da8 /dev/da9 /dev/da10 /dev/da11 /dev/da12
lab01#

lab01# zpool list bigone
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bigone  4.94G   141K  4.94G     0%  ONLINE  -
lab01#

lab01# zpool status bigone
  pool: bigone
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bigone      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da12    ONLINE       0     0     0

errors: No known data errors
lab01#

lab01# df -m /bigone
Filesystem 1M-blocks Used Avail Capacity  Mounted on
bigone          3975    0  3975     0%    /bigone
lab01#
It strongly depends on the hardware layout (so you don't hit a bottleneck on the motherboard or HBA), but I think that was already mentioned here.
 
billli said:
I guess the problem here is that I want all the drives to appear as one giant drive; redundancy does not matter to me. But at the same time, if I lose a drive, I should still have all the other drives intact. The data on those drives is not important; if I lose it, I can get it back easily.

You can have multiple raidz vdevs, or mirrors, and as long as they are in the same ZFS pool they will (or can) appear as one giant drive. But you will need to use raidz or mirrors, otherwise a single drive failure in the pool will result in the whole pool failing.
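Just to illustrate the danger: a plain stripe of all ten disks would be created like this (placeholder pool/device names), and losing any single disk there loses the entire pool and everything on it:
Code:
# no redundancy at all -- one dead disk and the whole pool is gone
zpool create bigpool ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 ada8 ada9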

WRT the PCI bottleneck, yes, it clearly is one. Are both your PCI cards connected to the same PCI bus? Even if they are, I think you should still be able to achieve overall write speeds of 50 MB/s in theory.
Also, you didn't mention how you are writing the data over the network. Are you using Samba?
 
usdmatt, I can't be sure on this, but all I can say is based on my experience.

If I set the ARC low and then dump a large amount of data to the pool, it takes longer, as it cannot all go into the write cache, whilst with a large enough ARC I get a quicker response time. But of course the HDDs are then busy in the background flushing the cache.

So my understanding is that the ARC is for both reads and writes, but writes will eventually be synced to disk with the regular syncs, whilst reads stay cached until the RAM needs freeing up or the data is replaced by newer data. As I said, I am not sure on this, and if you are confident I am wrong, I will accept it. I considered the ZIL to be just a journaling mechanism, not a cache.
 
chrcol said:
So my understanding is that the ARC is for both reads and writes, but writes will eventually be synced to disk with the regular syncs, whilst reads stay cached until the RAM needs freeing up or the data is replaced by newer data. As I said, I am not sure on this, and if you are confident I am wrong, I will accept it. I considered the ZIL to be just a journaling mechanism, not a cache.

The ARC is just for read caching, as you originally stated, but it will normally have an indirect impact on write performance on a system handling reads and writes simultaneously. I.e. write performance will be improved where the system can serve read operations from cache, because this avoids putting additional load on the physical disks.
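If you want to see what the ARC is actually doing, the arcstats sysctls are worth a look, e.g.:
Code:
# current ARC size in bytes, plus hit/miss counters
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses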
 