ZFS / Samba: performance issue

Hi there.

I have a machine (FreeBSD 8.1, 12GB of RAM, Intel i7 920, ZFS pool v14: raidz of 6 x 2TB disks).
AHCI is enabled.

One of the folders is exported using Samba and mounted on Windows 7 clients connected over a 100Mbit/s LAN.

When saving from the application to the Samba shared drive, it takes between 35 and 55s to save the file. The save progress bar reaches 100% in about 2-3s, then sits at 100% for a long time.

During this time, the server CPU is still about 99% idle, with smbd/nmbd taking about 2% of CPU.

Now, if the share points to a UFS partition, the time to save the file goes down to 23s, and that's constant; no matter what you do, it always takes 23s.

With the share pointing to the ZFS pool again, if I disable the ZIL, the saving time goes down to a constant 6s!

So obviously the ZIL is the culprit here: ZIL active = 55s, ZIL disabled = 6s (the file being saved is around 5MB).

I've played with every single loader option I could find and tweaked sysctl.conf values: it never gets any better, quite the opposite.

Here is my loader.conf
Code:
nvidia_load="YES"
hw.ata.to=15
vboxdrv_load="YES"
aio_load="YES"
ahci_load="YES"

# ZFS tuning
kern.maxvnodes=800000
vm.kmem_size_max="4096M"
vfs.zfs.arc_max="1024M"
vfs.zfs.vdev.min_pending=4
vfs.zfs.vdev.max_pending=12
vfs.zfs.cache_flush_disable=1

#vfs.zfs.zil_disable="1"

Here is /etc/sysctl.conf
Code:
vfs.read_max=64

I thought I could try playing with L2ARC options, but it didn't give me any write benefit (and I've only just added it, so I didn't expect much there).

So I added a 40GB Intel X25 SSD drive.

Code:
[root@server4 /pool/data/shares/elec/.zfs/snapshot]# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	pool        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ada2    ONLINE       0     0     0
	    ada3    ONLINE       0     0     0
	    ada4    ONLINE       0     0     0
	    ada5    ONLINE       0     0     0
	    ada6    ONLINE       0     0     0
	    ada7    ONLINE       0     0     0
	cache
	  ada1      ONLINE       0     0     0

errors: No known data errors

I also added AIO support to Samba and added the following to smb.conf:
Code:
aio write size = 16384
aio read size =	16384
write cache size = 262144
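For what it's worth, I also checked that the aio kernel module was loaded and that Samba was actually built with AIO support (this is the Samba 3.x port; I'm not sure of the exact HAVE_AIO_* define names, so I just grepped):
Code:
# aio.ko should show up once aio_load="YES" has taken effect
kldstat | grep aio
# list Samba build options and look for the AIO-related defines
smbd -b | grep -i aio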

Not much difference in the end.

I have been using ZFS for this file system for 18 months now. I do take daily snapshots; I'm not sure if that could affect performance like this?

So what am I supposed to do now?

I don't really feel like moving this share to the UFS boot disk and backing it up to the ZFS pool a few times a day; it would be such a waste.

Any help would be greatly appreciated.

Merry Christmas to all!
 
What hard drives do you have? WD Green?
Have you tried without tuning ZFS?

This is my /boot/loader.conf
Code:
aio_load="YES"
ahci_load="YES"
vm.kmem_size="9G"

And I get awesome performance.

Have you tuned Samba?

My /usr/local/etc/smb.conf
Code:
[global]
    workgroup = WORKGROUP
    server string = Zpool
    log file = /var/log/samba/log.%m
    max log size = 50
    min receivefile size = 131072
    interfaces = 192.168.1.113 127.0.0.1
    socket address = 192.168.1.113
    bind interfaces only = yes
    socket options = SO_KEEPALIVE TCP_NODELAY IPTOS_LOWDELAY IPTOS_THROUGHPUT
    dns proxy = No
    aio read size = 16384
    aio write size = 16384
    aio write behind = true
    use sendfile = Yes
    wins support = Yes
    hosts allow = 192.168.1. 192.168.2.
 
olav said:
What hard drives do you have? WD Green?
Have you tried without tuning ZFS?

I only started to tune ZFS because of poor performance.

It's just always slow.

They are WD Green drives, but the RAID edition; they have 64MB of cache each.

How much RAM do you have, to be able to afford 9GB allocated to the kernel?

Given the speed of Samba with the ZIL disabled, I figured that would rule out Samba itself.
 
I see that you have aio write behind activated; this is typically not advised, as Samba will tell the client that all is good before the write has completed. If something goes bad, the client will not be notified.
 
Hi,

Try playing with this setting: vfs.zfs.txg.write_limit_override. It controls how ZFS commits writes to disk, i.e. how much data it will save up to commit in one transaction group. You can change it via sysctl without a reboot; the optimum value will depend on your system. On my system I have this set:

Code:
vfs.zfs.txg.write_limit_override=2000000000
Using the wrong setting will make write performance horrible, so you will need to play around with it.
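If you want to experiment at runtime, something along these lines works (the 256MB value below is just an arbitrary starting point, not a recommendation):
Code:
# check the current value (0 means no override)
sysctl vfs.zfs.txg.write_limit_override
# try a 256MB limit, then re-test the save from the Windows client
sysctl vfs.zfs.txg.write_limit_override=268435456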

I reckon the above should do it. Also, not sure if this will help, but I have this set in loader.conf:

Code:
vfs.zfs.vdev.min_pending=4
vfs.zfs.vdev.max_pending=8

Which I think was recommended for slow SATA drives.

Andy.

PS: yes, I wouldn't recommend aio write behind either, although it might improve the lag you are seeing.

PPS

For Samba I set these network tuning settings:

Code:
kern.ipc.maxsockbuf=2097152
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=262144
net.inet.tcp.recvspace=262144
net.inet.tcp.mssdflt=1452
net.inet.udp.recvspace=65535
net.inet.udp.maxdgram=65535
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

That's FYI more than anything; I don't think it will help your issue.
 
Could you please let me know what your system is? Mainly RAM size and whether you're using gigabit ethernet.

This is mainly so I can adjust your suggested settings to my system.

Regarding aio write behind, this is what was said about it when it was released:
aio write behind was an attempt to see if we could fool Windows
clients into pipelining. If set true, smbd *lies* about writes
being done (and assumes the aio will always succeed) and returns
early "success" to the client. Don't set this if you have *any*
interest in your data :).

So obviously you're going to see great performance with it, because all checks are gone...

Regarding the ZIL, I read the Solaris page on how it shouldn't be disabled. However, this machine is on a dedicated UPS, so power failures etc. aren't really applicable.
Would it really be unsafe to run without the ZIL?
Thanks
 
My system has 4GB RAM and gigabit ethernet; the ZFS pool is a raidz of 4 SATA disks. As mentioned, try playing with vfs.zfs.txg.write_limit_override. If you are on 100Mbit ethernet, then start with a smaller value than I have...
 
jyavenard said:
Regarding the ZIL, I read the Solaris page on how it shouldn't be disabled. However, this machine is on a dedicated UPS, so power failures etc. aren't really applicable.
Would it really be unsafe to run without the ZIL?
Thanks

What happens if you have a fault with your UPS, or the system crashes, or your disk controller fails? ZFS is designed to use the ZIL; the advice from Sun is to never disable it. Therefore I wouldn't.
 
AndyUKG said:
What happens if you have a fault with your UPS, or the system crashes, or your disk controller fails? ZFS is designed to use the ZIL; the advice from Sun is to never disable it. Therefore I wouldn't.

Sure... but a 10x speed-up without the ZIL is hard to overlook.
 
I should also mention that throughput isn't the problem here either.

When I run Bonnie, I get well in excess of 200MB/s for both reading and writing.

Running Bonnie over the Samba share mounted on a Mac mini (gigabit network), I get about 40MB/s average for both writing and reading.

It's just that the latency seems awfully off, and Samba takes forever to tell the client that the write is over (that's my understanding of the situation anyway).
 
Which network card & switch are you using?
You might want to experiment with setting the MTU to 9000 on your internal network; it should improve throughput quite a bit.
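(For a quick test on the server, that would be something like the line below, assuming an em(4) NIC; the switch and the clients have to support jumbo frames as well.)
Code:
ifconfig em0 mtu 9000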

For the latency you're seeing, it might be caused by client AV software scanning every file on the file share. You may want to disable scanning of network files on the clients and install a separate antivirus on the file server to handle scanning there.
 
As I mentioned earlier, the issue occurs *only* with the ZIL on.

I can't set jumbo frames either... most clients wouldn't support them.

I have upgraded the pool to v28 using the stable/8 ZFS backport.

I will use an SSD for the ZIL; we'll see how it goes.
 
So according to the opening post, it's taking 6-55 seconds with ZFS and 23 seconds with UFS to write data that can be transferred in about 3 seconds over a 100BaseT interface.

How are you using the RAID with UFS? Gvinum?

I had some trouble with WD Green disks some time ago due to their half-hearted support for Native Command Queuing ( http://kerneltrap.org/mailarchive/linux-kernel/2007/10/1/326700 ), though I had a previous revision of your disks.

Another possibility is that the Samba server is calling sync() way too often (though while this would slow down the RAID a lot, it wouldn't increase the time it takes to close() the files after their content has been written with write(), which I guess is what the progress bar indicates).
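If it does turn out to be sync()-related, it might be worth double-checking the standard smb.conf sync options on that share (just a suggestion; these are the usual defaults anyway):
Code:
# in the share definition (or [global]) in smb.conf
strict sync = no
sync always = no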
Since you said performance is better with local access, can you try using NFS (needs Windows 7 Professional, though) or FTP for remote access?

Also, I wouldn't recommend using an MTU of 9000. Many so-called gigabit NICs perform very poorly if the frame size is increased beyond 6000 octets, and the gains of 9k frames versus 6k are small on a low-latency LAN.
 
xibo said:
So according to the opening post, it's taking 6-55 seconds with ZFS and 23 seconds with UFS to write data that can be transferred in about 3 seconds over a 100BaseT interface.

How are you using the RAID with UFS? Gvinum?

No, this is the boot disk; straight UFS with soft updates.

Since you said performance is better with local access, can you try using NFS (needs Windows 7 Professional, though) or FTP for remote access?

Also, I wouldn't recommend using an MTU of 9000. Many so-called gigabit NICs perform very poorly if the frame size is increased beyond 6000 octets, and the gains of 9k frames versus 6k are small on a low-latency LAN.

I did try with NFS, and Samba is actually faster.

As for network speed, using a cifs/smbfs-mounted share I can easily transfer well in excess of 50MB/s (and well over 200MB/s locally from ZFS to /dev/null).

Agreed on the jumbo frames, and most devices on this LAN are 100Mbit/s anyway.
 
A little update: I upgraded to pool v28. It now takes 16.4s consistently. That's with the ZIL enabled. Will try with the ZIL on an SSD.
Edit: I added a separate ZIL on an SSD (Intel X25-M 40GB); the time hasn't changed, still around 16s.

Booting with the ZIL disabled now also takes 16s!

Oh well. At least upgrading to v28 was worth it; it made writing over Samba roughly three times faster (from around 55s down to 16s).

And don't try removing the log device from a live pool! It may crash your pool big time.

Edit 2: Actually, it looks like adding
Code:
vfs.zfs.zil_disable="1"
to /boot/loader.conf doesn't actually disable the ZIL any longer; that sysctl doesn't even exist anymore.
 
Yep, confirmed here, the sysctl is missing in v28. I hope this isn't deliberate, as it should be adjustable for diagnostic purposes.
 
ZFS v28 includes an "improved ZIL" that incurs far fewer additional IOs (if not using a separate ZIL device). If you have slow disks, IOPS-limited disks, or disks that do not behave well with NCQ -- as appears to be the case with those WD Green disks -- then this will have a significant impact on performance.

By the way, the ZIL does not have to be on an SSD -- that is a Sun 'design' for use in their big storage arrays (tens of disks). For a smaller installation it is sufficient to put the ZIL on another disk. The ZIL needs to be very small; no need to waste an entire drive for it! For example, you could create a slice on your boot drive for the ZIL -- the boot drive is unlikely to see any interfering activity anyway.
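Something along these lines, as a sketch -- assuming the boot drive is ada0, already uses GPT, and has a few GB of free space (adjust the device name, label and size to your layout):
Code:
# carve out a small partition on the boot drive and give it a GPT label
gpart add -t freebsd-zfs -s 4G -l slog ada0
# attach it to the pool as a separate log device
zpool add pool log gpt/slog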

The point of separating the ZIL from the main pool is that when the ZIL is allocated in the main ZFS storage it occupies variable-length records that are then freed -- incurring additional IOPS on the drives and causing fragmentation. The "improved ZIL" makes a big difference because the amount of ZIL traffic is reduced.
 
danbi said:
By the way, the ZIL does not have to be on an SSD -- that is a Sun 'design' for use in their big storage arrays (tens of disks). For a smaller installation it is sufficient to put the ZIL on another disk. The ZIL needs to be very small; no need to waste an entire drive for it! For example, you could create a slice on your boot drive for the ZIL -- the boot drive is unlikely to see any interfering activity anyway.

I have read various articles and posts saying that the ZIL (or the cache) should be located on its own disk rather than on slices, and that using slices significantly reduces performance.

I'm trying to find one of those battery-backed RAM disks that Sun uses, to see if it's any better.
I'd like to get to the same speed as when the ZIL was disabled with v14: 6s or so. While 16s is much better, it's still a long time for saving just 5MB of data.
 
jyavenard said:
The sysctl to limit the speed at which it can write to the disks is also gone

Wow, and it's a very important sysctl.

Code:
# sysctl vfs.zfs.txg.write_limit_override
sysctl: unknown oid 'vfs.zfs.txg.write_limit_override'

x(x(

OK, it's not gone, it's renamed.

Here:

Code:
vfs.zfs.write_limit_override
vfs.zfs.zil_replay_disable

I think the latter is the ZIL disable, as it isn't present on a v14 box I checked. :)

Others I see renamed are:

vfs.zfs.txg.synctime_ms: the good news is this now defaults to 1000ms (the old value of 1) instead of 5.

The write speed actually has some new sysctls, listed below with their default values.

Code:
vfs.zfs.write_limit_inflated: 6389514240
vfs.zfs.write_limit_max: 266229760
vfs.zfs.write_limit_min: 33554432
vfs.zfs.write_limit_shift: 3
vfs.zfs.no_write_throttle: 0
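(For reference, the whole set can be listed by grepping the vfs.zfs tree:)
Code:
sysctl vfs.zfs | grep write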
 
jyavenard said:
I have read various articles and posts saying that the ZIL (or the cache) should be located on its own disk rather than on slices, and that using slices significantly reduces performance.

The storage abstraction layers in Solaris and FreeBSD differ, and one thing stated about ZFS in the Solaris documentation is not true for FreeBSD: the write cache is enabled in FreeBSD irrespective of whether you use an entire device, a slice or a partition.

There is no reason not to use slices or partitions with FreeBSD for the SLOG (ZIL) and L2ARC (cache). In fact, if you have "only" a 32GB SSD it would be a waste to use it for the SLOG alone, as the SLOG will never grow that large. A few megabytes might be all you need (especially with the reduced SLOG found in ZFS v28).
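As a sketch of that idea -- assuming the SSD shows up as ada1, is first removed from the pool where it currently serves as a plain cache device, and you are happy with GPT and these illustrative sizes:
Code:
# remove the SSD from the pool if it is currently a whole-disk cache device
zpool remove pool ada1
# partition it: a small slog plus the rest as L2ARC
gpart create -s gpt ada1
gpart add -t freebsd-zfs -s 8G -l slog ada1
gpart add -t freebsd-zfs -l l2arc ada1
# add both back to the pool
zpool add pool log gpt/slog cache gpt/l2arc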

It is not because of performance that people suggest using entire separate devices -- it eases administration and management, and in the long run is the wiser decision, especially for systems that will be supported/managed by others, not only their designer.
 
I was reading this blog entry http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

And I was quite surprised by their results table. It looks like if you're using only one slog device, you get worse performance than using none at all.

It only seems to get faster if you use more than two slog devices. That blog was written in 2007; SSDs weren't common back then.

Now if only I could get my hands on one of those SRAM disks...

Edit: is there actually a difference between using slices and partitions?

I mean, if I were to boot another OS (Solaris-type), would it be able to work with either slices or partitions? Or would it only see partitions, or vice versa?
 
jyavenard said:
I was reading this blog entry http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

And I was quite surprised by their results table. It looks like if you're using only one slog device, you get worse performance than using none at all.

It's not that surprising if you think about what ZFS is actually doing. The benefit of using separate log disks is firstly to isolate the ZIL IO, so that ZIL IO doesn't impact other pool IO and vice versa (ZIL IO is sequential, so it is fast if undisturbed). The second idea is that you would normally use a smaller, faster disk for the ZIL than for the other disks in the pool (even if not an SSD, e.g. a 15K SAS disk). But the maximum performance is always limited by the disks you are writing to, so writing to log disks of the same spec as the rest of the pool isn't going to raise your maximum write speed beyond what the physical disks are capable of.

And if you compare (with the same disk type) a dedicated log disk against a pool containing multiple disks without a log, the multiple disks have more raw IO. That is then weighed against the fact that they are handling both the ZIL and the data store, which is a negative for real-world ZFS IO. Also, artificial benchmarks don't always show the full picture: if you tested with separate log disks on a busy system that is also servicing read requests, I think you would start to see more benefit from separating ZIL IO from other pool IO.

jyavenard said:
Edit: is there actually a difference between using slices and partitions?

I mean, if I were to boot another OS (Solaris-type), would it be able to work with either slices or partitions? Or would it only see partitions, or vice versa?

About partitions: apart from the terminology etc. being specific to each OS, I think the main issue is really performance. If you have multiple slices/partitions on a disk, ZFS is going to treat them like separate physical devices and therefore won't be well optimised (ZFS is optimised for dedicated drives).
The other thing is that a Solaris slice isn't going to be readable on FreeBSD, I don't think, and I imagine the compatibility issues go both ways. Assuming you still want to use partitions, GPT partitioning would be best for compatibility, I think.

ta Andy.

PS: did you test changing the ZFS TXG settings?
 
AndyUKG said:
It's not that surprising if you think about what ZFS is actually doing. The benefit of using separate log disks is firstly to isolate the ZIL IO, so that ZIL IO doesn't impact other pool IO and vice versa (ZIL IO is sequential, so it is fast if undisturbed). The second idea is that you would normally use a smaller, faster disk for the ZIL than for the other disks in the pool (even if not an SSD, e.g. a 15K SAS disk). But the maximum performance is always limited by the disks you are writing to, so writing to log disks of the same spec as the rest of the pool isn't going to raise your maximum write speed beyond what the physical disks are capable of.

OK, here I'm using an SSD (Intel X25-M drive).

So I would assume that the write speed on that is much greater than on the WD 2TB Green RE3 drives.

And if you compare (with the same disk type) a dedicated log disk against a pool containing multiple disks without a log, the multiple disks have more raw IO. That is then weighed against the fact that they are handling both the ZIL and the data store, which is a negative for real-world ZFS IO. Also, artificial benchmarks don't always show the full picture: if you tested with separate log disks on a busy system that is also servicing read requests, I think you would start to see more benefit from separating ZIL IO from other pool IO.



About partitions: apart from the terminology etc. being specific to each OS, I think the main issue is really performance. If you have multiple slices/partitions on a disk, ZFS is going to treat them like separate physical devices and therefore won't be well optimised (ZFS is optimised for dedicated drives).

Well, here I have a 40GB SSD; my plan was to use 8GB for the ZIL partition and 32GB for the cache.

The other thing is that a Solaris slice isn't going to be readable on FreeBSD, I don't think, and I imagine the compatibility issues go both ways. Assuming you still want to use partitions, GPT partitioning would be best for compatibility, I think.

Well, that's my point: I can either create two slices, or one slice with two partitions.

Now, if Solaris isn't going to be able to read my two slices, I have a problem. Considering the nightmare I went through yesterday, when FreeBSD failed miserably after I removed the log device and the only thing that saved me was booting OpenIndiana and re-importing the pool there, I definitely want my ZFS setup to work with Solaris/OI *just in case*.

PS: did you test changing the ZFS TXG settings?

I did, but I can't say I saw much difference; timings varied so much with v14 (from 30s to 55s) that it's hard to tell.

Upgrading to v28, however, did make a massive difference.
Now I just want to try without the ZIL and see if the difference between ZIL and no-ZIL is still there.
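Since vfs.zfs.zil_disable is gone in v28, I assume the way to test that now is the per-dataset sync property (if it is included in this backport); something like:
Code:
# disable synchronous writes for the shared dataset only, then put it back afterwards
zfs set sync=disabled pool/data/shares
zfs get sync pool/data/shares
zfs set sync=standard pool/data/shares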

JY
 