Creating a ZFS network share over NFS

I fell into two Dell PowerVault storage arrays that each hold 15 disks, and I have 30 750GB drives that I intend to fill them with. I want to create a high availability centralized storage server, primarily for our Xen servers. I am hoping to use deduplication, since the servers contain 32GB of RAM each and we deploy about 10-20 identical servers every couple of months.

For testing, I created a single 10-disk raidz2 array with dedup enabled on a single host. Performance is good for the sequential tests I did (>200MB/s). However, if I try to export the array via NFS, performance drops like a rock to well under 10MB/s, and I can't figure out what is causing it.
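(For reference, a sequential write/read test of roughly this form is the kind of thing I mean; the test file path and sizes are only illustrative.)
Code:
# local sequential write, then read, on the pool's mountpoint
dd if=/dev/zero of=/virt/ddtest bs=1m count=4096
dd if=/virt/ddtest of=/dev/null bs=1m
# note: with dedup enabled, all-zero data dedups heavily, so treat the write figure as optimistic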

I enabled sharing by doing [CMD=""]zfs set sharenfs=on /virt[/CMD] and allowed it in /etc/exports.

My NFS settings in /etc/rc.conf are
Code:
rpcbind_enable="YES"
nfs_reserved_port_only="YES"
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 10"
nfs_client_enable="NO"
nfs_client_flags="-n 4"
rpc_lockd_enable="NO"
rpc_statd_enable="NO"
mountd_enable="YES"

The performance I get is horrible on a Xen server (CentOS based) only one hop away, using the following command, where ubu is the name of the server:
[CMD=""]mount -o async,rsize=128000,wsize=128000,nolock,tcp ubu:/virt /mnt[/CMD]

I turned off the ZFS intent log (ZIL) and performance didn't change.

If I scp something, performance is normal, so I don't think it is the switch. The other odd thing is that if I use the same NFS options as above (minus the nolock) and mount the NFS share locally, the speed is ~80MB/s, which is much less than the >200MB/s but much higher than the <10MB/s. Also, if I create an NFS share on the local gmirror array and mount it on the Xen server, I see the same low performance. So I feel like it isn't ZFS, but NFS is so simple to set up that I don't see how I could have screwed it up.
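(For what it's worth, since scp is CPU/encryption bound, a raw TCP check with iperf would separate the network from NFS more cleanly; this assumes iperf is installed on both ends, e.g. from benchmarks/iperf on the FreeBSD side.)
Code:
# on the FreeBSD storage server
iperf -s
# on the CentOS Xen client (ubu is the server name from above)
iperf -c ubu -t 30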


Any ideas? Also, if I am barking up the wrong tree trying to use NFS to share my ZFS array let me know. I am new to the ZFS scene and especially high availability. I think once I get a handle on the NFS performance I will start looking into HAST and CARP. So if anyone sees some red flags or has some advice for me, I would love to hear it. If you need more information, please ask.

Thanks in advance.
 
stunirvana21 said:
I enabled sharing by doing [CMD=""]zfs set sharenfs=on /virt[/CMD] and allowed it in /etc/exports.
If you set sharenfs to on it will already be added to /etc/zfs/exports. There's no need to add it to /etc/exports too.
 
SirDice said:
If you set sharenfs to on it will already be added to /etc/zfs/exports. There's no need to add it to /etc/exports too.

So I started from scratch and did this:

[CMD=""]zpool create tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9[/CMD]
[CMD=""]zfs create tank/xen[/CMD]
[CMD=""]zfs set dedup=on tank/xen[/CMD]
[CMD=""]zfs set sharenfs=xxx.xxx.xxx.xxx:rw,fsid=0,no_subtree_check,async,no_root_squash tank/xen[/CMD]

I also removed the entry from /etc/exports and restarted the mountd service.

/etc/zfs/exports
Code:
# !!! DO NOT EDIT THIS FILE MANUALLY !!!

/tank/xen       xxx.xxx.xxx.xxx:rw fsid=0 no_subtree_check async no_root_squash

But now I can't mount the share on a client; I get a permission denied error. Unfortunately, I can't find good documentation on how to do this either, otherwise I wouldn't be bothering you all. Any ideas?
 
Try setting some actual options instead of just on. I have:
Code:
dice@molly:~> zfs get sharenfs fbsd0/ports
NAME         PROPERTY  VALUE                          SOURCE
fbsd0/ports  sharenfs  network 2001:xxx:xxx::/64,ro  local

This works, at least for all my other FreeBSD hosts.

Looking at your options, it looks like you are using Linux or Solaris exports options. That's not going to work. Because sharenfs on FreeBSD is more or less a hack, you need to use the FreeBSD exports(5) options.
 
SirDice said:
Try setting some actual options instead of just on. I have:
Code:
dice@molly:~> zfs get sharenfs fbsd0/ports
NAME         PROPERTY  VALUE                          SOURCE
fbsd0/ports  sharenfs  network 2001:xxx:xxx::/64,ro  local

This works, at least for all my other FreeBSD hosts.

Looking at your options, it looks like you are using Linux or Solaris exports options. That's not going to work. Because sharenfs on FreeBSD is more or less a hack, you need to use the FreeBSD exports(5) options.

Ok so I did this:
[CMD=""]zfs set sharenfs=xxx.xxx.xxx.xxx:ro tank/xen[/CMD]

Which results in:
Code:
root@ubu:/ # zfs get sharenfs tank/xen
NAME      PROPERTY  VALUE              SOURCE
tank/xen  sharenfs  xxx.xxx.xxx.xxx:ro  local

Still can't mount it on the CentOS client.

Stupid question: Is there a service I should be restarting? I see there is a /etc/rc.d/zfs script.
 
stunirvana21 said:
Ok so I did this:
[CMD=""]zfs set sharenfs=xxx.xxx.xxx.xxx:ro tank/xen[/CMD]

Which results in:
Code:
root@ubu:/ # zfs get sharenfs tank/xen
NAME      PROPERTY  VALUE              SOURCE
tank/xen  sharenfs  xxx.xxx.xxx.xxx:ro  local

Still can't mount it on the CentOS client.
Wrong format I think, try something like this:
# zfs set sharenfs='network 192.168.1.0/24,ro' tank/xen

To be honest, I've had some trouble getting sharenfs to do what I wanted too. As far as I understand it, you can use the options from exports(5) such as -network, but without the leading dash (-), so it's network instead of -network and maproot= instead of -maproot=. Different options need to be separated by commas.
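For example, following that pattern, something like this ought to end up as a sane exports line (the network is just a placeholder, and I haven't tested this exact combination):
Code:
zfs set sharenfs='maproot=root,network 192.168.1.0/24' tank/xen
zfs get sharenfs tank/xen
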
Stupid question: Is there a service I should be restarting? I see there is a /etc/rc.d/zfs script.
Not a stupid question. Normally, after you edit /etc/exports, you need to send a SIGHUP to mountd(8) and nfsd(8). ZFS, however, takes care of this automatically :D

Also make use of showmount(8) so you can see if your settings are correct. Definitely keep an eye on /var/log/messages; any errors in any of the exports files should show up there.

If all else fails, you can always just turn sharenfs off and use /etc/exports the old fashioned way.
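If you do fall back to /etc/exports, the whole dance is roughly this (the network and the root mapping are placeholders, adjust to taste):
Code:
# turn the property off so ZFS stops managing the share
zfs set sharenfs=off tank/xen

# /etc/exports entry for the dataset's mountpoint:
/tank/xen -maproot=root -network 192.168.1.0 -mask 255.255.255.0

# make mountd re-read the exports, then verify
kill -HUP `cat /var/run/mountd.pid`
showmount -e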
 
SirDice said:
Wrong format I think, try something like this:
# zfs set sharenfs='network 192.168.1.0/24,ro' tank/xen

...

Also make use of showmount(8) so you can see if your settings are correct. Definitely keep an eye on /var/log/messages; any errors in any of the exports files should show up there.

If all else fails, you can always just turn sharenfs off and use /etc/exports the old fashioned way.

I tried a bunch of different options but haven't had any success. Whenever I run [CMD=""]showmount -e[/CMD], I don't see tank/xen. I do, however, see my other NFS mounts that are listed in /etc/exports. If I stop mountd, I still see nothing pertinent to my ZFS share. I feel like I must be missing something elementary. The logs seem to verify that my options are fine. If I go back to regular NFS, I will be back at square one with really slow speeds. Perhaps I will try to update my userland and kernel to see what that gets me.
 
stunirvana21 said:
If I go back to regular NFS, I will be back at square one with really slow speeds. Perhaps I will try to update my userland and kernel to see what that gets me.
There is no difference between the 'regular' NFS and sharenfs from ZFS.
 
Well for me sharenfs does not work either. So I use the /etc/exports file and all is fine. There are more threads and discussions on this topic on the net.

regards
Johan
 
Sylhouette said:
Well for me sharenfs does not work either. So I use the /etc/exports file and all is fine. There are more threads and discussions on this topic on the net.

regards
Johan

Is there any in particular that you think would be helpful? I find a lot of Solaris docs but I am not sure how much of that pertains to ZFS on FreeBSD.
 
stunirvana21 said:
I turned off dedup and remounted my nfs share, yet performance is still low.

Yes, but you need to set sync=disabled and have dedup=off at the same time to notice any difference, though I strongly advise against disabling sync, since it will cause data corruption in case of a power failure. I suggest that you buy e.g. an OCZ Deneva 2 240GB MLC, because of the built-in super-capacitor (battery) that protects against power failures. I have no affiliation with the company whatsoever, it's just a flipping good SLOG :)
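If you want to run that comparison, it is only a couple of commands (just remember to put sync back to standard afterwards):
Code:
zfs set dedup=off tank/xen
zfs set sync=disabled tank/xen   # benchmark only - leaves NFS clients' writes unprotected
# ...rerun the NFS test, then restore the default:
zfs set sync=standard tank/xen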

/Sebulon
 
Sebulon said:
Yes, but you need to set sync=disabled and have dedup=off at the same time to notice any difference, though I strongly advise against disabling sync, since it will cause data corruption in case of a power failure. I suggest that you buy e.g. an OCZ Deneva 2 240GB MLC, because of the built-in super-capacitor (battery) that protects against power failures. I have no affiliation with the company whatsoever, it's just a flipping good SLOG :)

/Sebulon

I have a few Micron M4s, which don't have the extra capacitor that some of the SandForce SSDs do, but I have the systems on a UPS and the building generator, so I shouldn't have any issues. Should I use the SSDs to speed up the ZIL?

I will try turning both dedup and sync off at the same time and see what performance I get over NFS.
 
stunirvana21 said:
Should I use the SSDs to speed up the ZIL?

Yes. But striped? No, not according to recent findings:
t1066 said:
Just found the following article. Basically, it says that sync writing to a separate ZIL is done at a queue depth of 1 (ZFS sends sync write requests to the log drive one at a time and waits for each write to finish before sending the next one).

Which means that even though you have more than one SLOG, ZFS only writes to one at a time, so you won't see any higher throughput from striped SLOGs :( And quoted from that article:
It is very fair to say that even if your chosen log device has reportedly extremely high IOPS, we'll never notice it with how we write to it (send down, cache flush, and only upon completing send down more -- as opposed to a write cache utilizing sequential write workload) if it does not ALSO have a very, very low average write latency (we're talking on the low end of microseconds, here).

So I started out thinking "more IOPS", when I perhaps should have been thinking "lower write latency". I can't really vouch for the Microns as I haven't tested them myself, but they should at least have a lower write latency than the spinning drives in your pool.

I have found the closest thing to a relevant benchmark for the latest SSDs. The performance of the Vertex 3 is probably true for the Deneva 2 as well, since they have the same hardware and very similar firmware, except that the Deneva has a built-in super-cap and more blocks set aside for garbage collection, which is why it looks "smaller" in size. The performance of the Vertex 4 256GB and 512GB models is also probably similar, because their write performance is tied to their built-in RAM module, which is used specifically as an internal write cache.

The closest (most relevant) benchmark I have been able to find is:
Code:
[U][B]4k Write at QD=1, Maximum I/O Response Time in (ms)[/B][/U]
Vertex 4 512GB MLC:        0.99
Vertex 4 256GB MLC:        (Probably similar score as the 512GB model)
Crucial m4 128GB:          0.99
Corsair Neutron GTX 240GB: 1.55
Samsung 830 256GB MLC:     4.73
Intel 520 240GB MLC:       32.91
Kingston HyperX 240GB MLC: 35.94
Vertex 3 240GB MLC:        64.63
Deneva 2 200GB MLC:        (Probably similar score as the Vertex 3)

Sources:
http://www.storagereview.com/corsair_neutron_gtx_ssd_review
http://www.storagereview.com/ocz_petrol_ssd_review
http://www.tuicool.com/articles/uaiYve

Now, I can't say for sure whether these numbers really translate into better SLOGs, as I haven't tested all of these disks myself, but it is a rather qualified guess :) Also, I would advise partitioning only a smaller piece, at most 48GB in size. That increases the longevity of the SSD by increasing the number of unused (good) blocks it can swap in internally as heavy writing wears them down over time.

And there are a few other things that you need to get right when installing and configuring (aligning, gnop'ing, sysctls, etc.) to have it "just right", so feel free to ask about the technical details when the time comes.
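As a rough sketch of what that usually looks like (the device name and label are placeholders, and the gnop(8) step assumes you want to force 4k sectors on the log vdev):
Code:
# partition the SSD, leaving most of it unused (ada1 is a placeholder)
gpart create -s gpt ada1
gpart add -t freebsd-zfs -a 4k -s 48G -l slog0 ada1

# wrap the partition in a 4k gnop device and add it as a log vdev
gnop create -S 4096 /dev/gpt/slog0
zpool add tank log /dev/gpt/slog0.nop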

/Sebulon
 
stunirvana21 said:
I turned off dedup and remounted my nfs share, yet performance is still low.

Check out my blog post below for a possible solution to your problem. I've been running a modified NFS server for nearly 2 years now in a production environment, without loss or issues (and the servers have had a few crashes due to other causes - no file corruption).

It relies on ZFS to keep your writes safe, not NFS, and I assume you have a properly functioning ZIL.

My particular problem was with ESX and Sync writes, but it may help you as well.

I hope to have the time to further streamline (hack out) portions of NFS that are redundant or unnecessary with a ZFS file system.

http://christopher-technicalmusings.blogspot.ca/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html
 
Oh, and FYI, I stripe both my L2ARC and ZIL - but I do it with a hardware controller (an LSI9211 is one example), not ZFS. You won't see any speed increase from using ZFS to stripe these items.

My SANs are usually 48-drive monsters, so a little extra complexity in terms of a hardware RAID stripe isn't a big deal. Since we're all on ZFS v28 now, the pool handles the failure of a ZIL or L2ARC device gracefully.

The danger with a striped ZIL is the server crashing at the same time one of the ZIL drives goes bad, killing the ZIL array. However, I feel that is so rare that I will risk having to go back to my backups (on a separate pool, of course) for the performance increase.

Another item of note: if you do have a 48-drive monster, you only need a separate ZIL device if you expect to put a lot of synchronous write load on the SAN. Without one, ZFS writes the ZIL blocks to the main pool, striped across all your drives, and that's going to be very fast with a lot of drives.
 
cforger said:
Check out my blog post below for a possible solution to your problem. I've been running a modified NFS server for nearly 2 years now in a production environment, without loss or issues (and the servers have had a few crashes due to other causes - no file corruption).

It relies on ZFS to keep your writes safe, not NFS, and I assume you have a properly functioning ZIL.

My particular problem was with ESX and Sync writes, but it may help you as well.

I hope to have the time to further streamline (hack out) portions of NFS that are redundant or unnecessary with a ZFS file system.

http://christopher-technicalmusings.blogspot.ca/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html

I have also been thinking along these lines, but no, it's just not right. You can call it whatever you want, but anything that turns off syncing puts your data at risk. I mean, I could disable the brakes on my car and be fine for two years, until I'd need to brake. Bad example, but you get the point.

ZFS's default sync mode is "standard", which means "give sync to those who want it, and async to those who want that". Then there's "always", which means "I don't care what you say, I'm syncing this regardless". Lastly there's "disabled", which always gives async. When you modify the NFS server to only serve out async while the dataset is in "standard" sync mode, NFS says "I'm going to serve this out async", and ZFS replies "Fine, whatever you want". So by doing this you have practically done the same thing as setting sync=disabled, or disabling the ZIL altogether. A proof that my assumptions are correct would be to set sync=always on that dataset and watch your "performance" drop to what it was before you modified NFS. There are no shortcuts. Invest in a damn good SLOG instead.
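In concrete terms, that test is just (using tank/xen from earlier in the thread as the example dataset):
Code:
zfs set sync=always tank/xen     # force every write through the ZIL
# rerun the NFS write test from the client, then restore the default:
zfs set sync=standard tank/xen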

/Sebulon
 