NFS write performance with mirrored ZIL

peetaur said:
That isn't the behavior I had. I could see the ramdisk [ZIL] being used in gstat every time.

Oh by the way... now gstat doesn't show much load on the ZIL during a sync Linux client write. I don't know why it stopped, but my best guess is that it is because I destroyed the pool and recreated it. The old one was an old version that was upgraded to v28. The new one was created as v28. (Another quirk I found in upgraded pools vs. pools created as v28 is that you really can't remove the log... You can remove log vdevs, or run with the log OFFLINE, but the last one won't go away.)
 
@RusDyr

To find out for yourself the exact difference between ZFS and UFS, you can benchmark striped ZFS mirrors (without log or cache devices) and gstriped gmirrors to make a completely accurate comparison. Then, as peetaur has touched on, comes the next question: "What do I really need?" Then you can compare your initial results with the results from one raidz vdev, two raidz's, one raidz2, two raidz2's, and so on. Please post your results in a new thread called something like "Comparative benchmark between ZFS mirrors and gstriped gmirrors". Use my MO to create a RAM-disk, fill up a big file with random data and use that with dd to test write speed. Then it's fine to read from that random data file and write to another file in the ZFS or UFS file system, like:
# dd if=/mnt/ram/randfile of=/foo/bar/randfile bs=1m
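In case the RAM-disk part of that MO needs repeating, it goes roughly like this (sizes and paths are only examples):
# mdmfs -s 3g md /mnt/ram
# dd if=/dev/random of=/mnt/ram/randfile bs=1m count=2048
The first command creates and mounts a swap-backed RAM-disk; the second fills a 2GB file with random data to read from during the write tests.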
Also install benchmarks/bonnie++ from ports and test:
# bonnie++ -d /foo/bar -u 0 -s Xg
-d /foo/bar (the directory with the ZFS or UFS filesystem mounted)
-u 0 (if you're running as root)
-s Xg ("X" should be double the size of your RAM)
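So with, say, 8GB of RAM (just an example), that would be:
# bonnie++ -d /foo/bar -u 0 -s 16g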
I would love to see those numbers.
Also keep in mind the tips I gave you about gpart and gnop. They have been big performance enhancers for me personally.
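In short, the gnop part goes like this (ada1 and the label disk0 are only examples):
# gpart create -s gpt ada1
# gpart add -b 2048 -t freebsd-zfs -l disk0 ada1
# gnop create -S 4096 /dev/gpt/disk0
# zpool create tank gpt/disk0.nop
# zpool export tank
# gnop destroy /dev/gpt/disk0.nop
# zpool import tank
The 1MB-aligned partition plus the temporary 4K gnop provider makes ZFS create the vdev with ashift=12; after the export/import the pool runs on the plain gpt label again.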



@peetaur

I quite like rants. It's about the only way to find out what the support and sales people won't tell you. How many times have you heard "Our products are not good at etc, etc" or "Our products have N bugs that muck up this and that in this way"?:)
(Luckily, I am in charge of this, so I can decide whether or not to throw away ESXi; Do you have the same control?)
No. We are running 300-400 VMs in an HP blade chassis with a NetApp NAS serving NFS to VMware and SMB to our users, farming about 200TB. We are, however, planning a much more price-efficient solution for a gigantic video archive running Supermicro HW and FreeBSD or FreeNAS, which I will be in charge of.

Oh by the way... now gstat doesn't show much load on the ZIL during a sync Linux client write. I don't know why it stopped, but my best guess is that it is because I destroyed the pool and recreated it. The old one was an old version that was upgraded to v28. The new one was created as v28.
Aha! So we have the same behaviour. That is so strange. A big regression, I'd say.
(Another quirk I found in upgraded pools vs. pools created as v28 is that you really can't remove the log... You can remove log vdevs, or run with the log OFFLINE, but the last one won't go away.)
Big bummer. I wonder how a power outage would affect the pool if running in that state... Best to have a new pool created as v28 and send/recv between them.
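Something like this should do for moving the data over (pool names are just examples):
# zfs snapshot -r oldpool@migrate
# zfs send -R oldpool@migrate | zfs recv -F -d newpool
That sends the whole pool, including child datasets and their properties, into the new one.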

/Sebulon
 
Just a side note: VMware ESXi's NFS client has absolutely horrible performance when connected to a FreeBSD NFS server backed by ZFS. A separate ZIL does not help much. An async mount does not help much. Disabling the ZIL doesn't help much. The only way to get good performance from it is to modify the FreeBSD NFS server in such a way that it eliminates all data protection.

There are several threads about this on the -stable and -current mailing lists.

Best solution is to dump ESXi. Second best solution is to dump ZFS.
 
@phoenix

WHAT, WHAT, WHAAAT?!

Please post a link or three, for us poor search-impaired people:) It would be nice for others reading this to have a direct reference from someone who's read about it. I could start searching for them and post links myself, but chances are I'd just find the "wrong" ones...

/Sebulon
 
@phoenix

Awesome, thanks!

@peetaur

Wooow! I remember you writing about this earlier, but I didn't know quite how bad it was, or that more people are starting to notice it as well. Fortunately, the article link you posted in the mail thread completely solves the problem:
http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html
When you are running ZFS with a ZIL you know you can trust (after extensive testing, of course), you'll want to depend on that instead of having the NFS server flush the ZIL all of the time.
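If anyone just wants to see what their pool can do without the sync penalty before patching anything, v28 also has a much blunter per-dataset knob, which of course carries the same data-loss risk on a power failure (dataset name just an example):
# zfs set sync=disabled pool2/perftest
# zfs set sync=standard pool2/perftest
The first turns all synchronous writes into asynchronous ones for that dataset; the second puts it back to normal.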

/Sebulon
 
Sebulon said:

With respect to the link, I would guess disabling NFS flushing will likely cause data corruption of the VMware file system in the event of a power loss on the NFS server. The article said that this hack had been in use for some time without issues, but didn't specify whether they had actually tested an NFS server power failure. The whole point of NFS flushes, the ZIL and so on is to ensure data consistency when something goes wrong.
Anyway something to at least investigate before you start to use a hack like this...

cheers Andy.
 
@AndyUKG

That is 100% correct, excellent of you to point that out! I'm hoping I have a chance to test this personally very soon. I'm hunting for the perfect SLOG at the moment, which I then intend to install in our dev storage and serve out an NFS datastore to our dev ESXi, so this can be investigated properly.

/Sebulon
 
I've read that istgt with ZFS provides better performance with ESXi than NFS. Have you looked into that, or discounted it for any reason?
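As far as I understand it, the backing store would just be a zvol, created along these lines (names only examples), which istgt then exports as a LUN via its config (pointing at /dev/zvol/tank/esxi-vol):
# zfs create -V 200G tank/esxi-vol
But I haven't benchmarked that setup myself.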

cheers Andy.
 
If you use iSCSI for virtual disks, is it easy to export and use the disks in other ways without ESXi?

For example, if I create a VMware ESXi VM that uses iSCSI, and then later switch it to VirtualBox, can I simply use it as is? Can I mount it and run # VBoxManage clonehd ... to recover it?

Or can I mount the disk and read the files with the ESXi host machine shut down?
 
peetaur said:
If you use iSCSI for virtual disks, is it easy to export and use the disks in other ways without ESXi?

An iSCSI disk device, when connected to an iSCSI client system, behaves just like a direct-attached disk. So anything you can do with a physical disk attached to VMware, you can do with an iSCSI disk in terms of connecting it to other systems. I haven't tried mounting a VMFS volume on any other OS, if that's what you want to do, but a quick Google search did turn up an open-source read-only VMFS driver:

http://code.google.com/p/vmfs/

thanks Andy.
 
I know you can mount a vmdk file in other VMs, or mount it with a FUSE driver, but does VMware's iSCSI client add some metadata junk to the iSCSI share that is unique to VMware's implementation and unusable in others?
 
I haven't tried it, but I'd guess it will be mountable on non-VMware systems. As I said, an iSCSI volume is treated as a local disk, and VMware state this is the case on their side too, so if FUSE and the like can mount normal VMware devices, they should work with an iSCSI device as well. But unless anyone else can confirm it, I guess you'll have to try it to be sure...

thanks Andy.
 
Hi Sebulon
Sebulon said:
Hi all!
[...]
Code:
# camcontrol devlist
<WDC WD30EZRS-00J99B0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD103SJ 1AJ10001>         at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG HD103SJ 1AJ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD103SJ 1AJ10001>         at scbus5 target 0 lun 0 (ada3,pass3)
<SAMSUNG HD103SJ 1AJ10001>         at scbus6 target 0 lun 0 (ada4,pass4)
<SAMSUNG HD103SJ 1AJ10001>         at scbus7 target 0 lun 0 (ada5,pass5)
<SAMSUNG HD103SJ 1AJ10001>         at scbus8 target 0 lun 0 (ada6,pass6)
<SAMSUNG HD103SJ 1AJ10001>         at scbus9 target 0 lun 0 (ada7,pass7)
<SAMSUNG HD103SJ 1AJ10001>         at scbus10 target 0 lun 0 (ada8,pass8)
<OCZ-VERTEX2 1.29>                 at scbus12 target 0 lun 0 (ada9,pass9)
<OCZ-VERTEX2 1.29>                 at scbus13 target 0 lun 0 (ada10,pass10)
[...]
I also have 6 of those HD103SJ and 2 Spinpoint F1. Are you sure they really have a 4k block size? I think I read somewhere that they are actually 512b, but I cannot find the source anymore.

My setup is: 4 mirrors of geli-encrypted disks in a stripe, without any cache or log. If yours is also encrypted, we could do a couple of benchmarks together, as we have the same disks.
 
@lockdoc

I've posted several HW specs for different servers:)

Those HDs are in my home NAS and they are good old 512s, ashift=9.

No geli for me, yet. I've found that my CPU (Core i5 560) supports HW-assisted AES-256, so I'm very tempted to give it a go some time and see how the performance is affected. But that testing is another topic completely;)
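If/when I do, my understanding is that it's basically just a matter of loading the driver (a rough sketch):
# kldload aesni
# dmesg | grep -i aesni
If the aesni(4) device shows up, geli should use the hardware acceleration automatically through crypto(9).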

/Sebulon
 
GELI is setting the minimum block size to 4K. Hence, ZFS configured the vdev to use 4K blocks as the smallest size (aka ashift=12).
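For reference, that 4K sector size comes from how the provider is initialized, e.g. something like (device name only an example):
# geli init -s 4096 /dev/ada1
# geli attach /dev/ada1
ZFS then sees 4K sectors on ada1.eli and picks ashift=12 on its own.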
 
Initial scores from the OCZ Deneva 2 R-series MLC (sync). The HW is set up to test the Deneva first and will be reconfigured later with 4x Vertex 3s in two mirrors for fault tolerance.

Code:
[B][U]HW[/U][/B]
1x  Supermicro H8SGL-F
1x  Supermicro AOC-USAS2-L8i
1x  Supermicro SC111T-560CB
1x  AMD Opteron 8C 6128 2.0GHz
2x  16GB 1333MHZ DDR3 ECC REG
1x  OCZ Deneva 2 R-series 200GB
2x  OCZ Vertex 3 240GB

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD default 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Tue Jan  3 07:46:30 UTC 2012     
root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
[CMD="#"]zpool get version pool2[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default
[CMD="#"]zpool status[/CMD]
  pool: pool1
 state: ONLINE
 scan: resilvered 392K in 0h0m with 0 errors on Wed Mar  7 14:32:33 2012
config:

	NAME          STATE     READ WRITE CKSUM
	pool1         ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/usb0  ONLINE       0     0     0
	    gpt/usb1  ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scan: resilvered 0 in 0h0m with 0 errors on Thu Mar  8 16:42:41 2012
config:

	NAME         STATE     READ WRITE CKSUM
	pool2        ONLINE       0     0     0
	  gpt/disk0  ONLINE       0     0     0
	  gpt/disk1  ONLINE       0     0     0
	logs
	  gpt/log1   ONLINE       0     0     0

errors: No known data errors
[CMD="#"]zdb | grep ashift[/CMD]
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12
[CMD="#"]camcontrol devlist[/CMD]
<ATA D2RSTK251M11-020 E>           at scbus0 target 0 lun 0 (da0,pass0)
<ATA OCZ-VERTEX3 2.15>             at scbus0 target 4 lun 0 (da1,pass1)
<ATA OCZ-VERTEX3 2.15>             at scbus0 target 5 lun 0 (da2,pass2)
<USB Mass Storage Device \001\000\000?>  at scbus7 target 0 lun 0 (da3,pass3)
<USB Mass Storage Device \001\000\000?>  at scbus8 target 0 lun 0 (da4,pass4)
[CMD="#"]gpart show[/CMD]
=>       34  390721901  da0  GPT  (186G)
         34       2014       - free -  (1M)
       2048  390719880    1  freebsd-zfs  (186G)
  390721928          7       - free -  (3.5k)

=>       34  468862061  da1  GPT  (223G)
         34       2014       - free -  (1M)
       2048  468860040    1  freebsd-zfs  (223G)
  468862088          7       - free -  (3.5k)

=>       34  468862061  da2  GPT  (223G)
         34       2014       - free -  (1M)
       2048  468860040    1  freebsd-zfs  (223G)
  468862088          7       - free -  (3.5k)

=>     34  7744445  da3  GPT  (3.7G)
       34       30       - free -  (15k)
       64      128    2  freebsd-boot  (64k)
      192     1856       - free -  (928k)
     2048  7742424    1  freebsd-zfs  (3.7G)
  7744472        7       - free -  (3.5k)

=>     34  7744445  da4  GPT  (3.7G)
       34       30       - free -  (15k)
       64      128    2  freebsd-boot  (64k)
      192     1856       - free -  (928k)
     2048  7742424    1  freebsd-zfs  (3.7G)
  7744472        7       - free -  (3.5k)

Code:
[CMD="#"]iperf -c 10.10.0.12[/CMD]
------------------------------------------------------------
Client connecting to 10.10.0.12, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[  3] local 10.10.0.10 port 63461 connected with 10.10.0.12 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   934 Mbits/sec

Code:
[B][U]LOCAL WRITES[/U][/B]
128k)  284MB/s
4k)    60MB/s

Code:
[B][U]OVER NFS[/U][/B]
sync)   2147483648 bytes transferred in 31.963803 secs (67184860 bytes/sec)
async)  2147483648 bytes transferred in 22.719088 secs (94523321 bytes/sec)
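For easier comparison, that works out to roughly 67 MB/s sync and 95 MB/s async, against a theoretical gigabit ceiling of about 117 MB/s (the 934 Mbit/s measured with iperf above).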

The tests over NFS were run like this:
Code:
[B]sync)[/B]
[CMD="#"]mount 10.10.0.12:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]
[B]async)[/B]
[CMD="#"]mount -o async 10.10.0.12:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]

Next up is Ivan Drago: "I will break you."

/Sebulon
 
Almost everything running around 60-65 MB/s (even the ZEUS) makes me think there is a software bottleneck: a needless sleep(), bad interrupt handling, lock acquisition, etc.

Maybe it would be interesting to try setting your CPU multiplier or some buses lower (underclock it) to see if it changes the speed, but I don't think server boards have those options. Or run a nice -n -19 CPU waster with 100+ threads in the background. Or someone else more familiar with low-level things could suggest a better test.
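Something crude like this would do as a CPU waster (Bourne shell syntax, run as root so the negative nice value sticks; kill the sh processes afterwards):
# for i in $(jot 100); do nice -n -19 sh -c 'while :; do :; done' & done
That spawns 100 busy loops at maximum scheduling priority.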

Also, have you tried striping across a few SSDs for your log to compare? I see you wrote
At least with ZFS V15, I would say I won about 20% striped logs vs mirrored.
but I found the same effect between mirrored and single SSD (non-stripe).
 
OK, explain this to me...

I've spent two days now trying to fault this non-redundant pool and it just won't break.

First, with the Deneva as log, I started with the same client that did the performance tests. I basically re-ran the performance tests, but I reset the server during every transfer. At that point the client just paused and waited for the server to come back up again (it rebooted without any issues), then resumed the transfers and lived happily ever after. After all three transfers were complete, I scrubbed the pool and, to my surprise, there were no errors.

# mount 10.10.0.12:/export/perftest /mnt/tank/perftest
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m
(Reset)
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m
(Reset)
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m
(Reset)
# umount /mnt/tank/perftest

# mount -o async 10.10.0.12:/export/perftest /mnt/tank/perftest
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m
(Reset)
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m
(Reset)
# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m
(Reset)
# umount /mnt/tank/perftest

At that point, I realised I needed a sure-fire way of faulting it so that I would have something to compare with, so I switched the Deneva for a Vertex, which shouldn't work as a SLOG since it lacks the battery backing the Deneva has.

So the pool now looks like this:
Code:
  pool: pool2
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar  9 16:12:29 2012
config:

	NAME         STATE     READ WRITE CKSUM
	pool2        ONLINE       0     0     0
	  gpt/log1   ONLINE       0     0     0
	  gpt/disk0  ONLINE       0     0     0
	logs
	  gpt/disk1  ONLINE       0     0     0

errors: No known data errors

Redid the battery of transfers, but it still passed. All is still well. Why?

I then gave NFS access to our test ESXi hosts, mounted it as a datastore, started a Storage vMotion of one of the guests, and reset the server during that migration... Still OK. Why?

So from within a guest, I started a buildworld process and reset the server; the guest just paused, the server came back up A-OK, and then the guest resumed as if nothing had happened. Why?

I mean, I'm actually trying to make it go wrong here; how hard can it be?:)

Suggestions are most welcome!

/Sebulon
 
You are trying to make ZFS lose data? Something it is designed not to do.

About the only way you can achieve this is if you have VERY faulty hardware.
 
Hi,

Regarding trying to cause a ZFS data error by resetting the ZFS server or attached storage: you shouldn't need a battery-backed disk so long as the disks do not ignore the cache flush commands sent by ZFS.
Do you have any reason to believe that the Vertex won't do cache flushes when requested?
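You can at least check whether the drive's volatile write cache is reported as enabled, for example (assuming the controller passes the ATA identify data through):
# camcontrol identify da2 | grep -i "write cache"
That only tells you the cache is switched on, though, not whether the drive actually honours flush requests, which is the part you really have to trust (or test with resets, like you are doing).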

thanks Andy.
 
@danbi

Really? Ahoy, Captain Obvious!

@AndyUKG

And the operative word in that sentence is "should":) I mean, what good does a cache flush command do when there's no power to flush its caches with?

How come everyone else, including Sun/Oracle, uses battery backing for their logs? Well, in most cases it's about battery-backed RAM, which would be useless otherwise, but if a consumer-grade SSD would do the same job, even without battery backing, then I'm having a hard time understanding why they don't just use that instead. Imagine the savings for Oracle if they ditched the ZEUSes for ordinary Vertexes, and got better performance at the same time.

I have always thought that you definitely needed battery backing to maintain a consistent ZIL. Say you build a database NAS, or a VMware datastore, and export it over NFS; once you've gone into production and that database/datastore has grown to 50TB, then you have a power outage, which does happen to everyone from time to time. What happens then?

For everyone who wants to build mission-critical systems based on FreeBSD and ZFS, is it really enough to rely on a non-battery-backed SSD?

/Sebulon
 
Sometimes, the obvious is the answer, no?

You need large capacitors for FLASH storage, in order to prevent the FLASH drive from being *destroyed* by a power failure.

Obeying cache sync is a different thing. If your drive obeys cache sync, then ZFS can be confident that critical data is already on stable storage and can thus fulfill its promise to keep your data safe.

If you have faulty hardware that claims to support cache sync but does not, you will get corrupt data and ZFS can't help you.

All these things are obvious, of course :)

PS: In any case, battery-backed RAM for the ZIL is way, way faster and higher-performing than any SSD could be, now or in the future. Perhaps this is why people are using it at the higher end.
 