ZFS + NFS performance improved by setting vfs.zfs.cache_flush_disable=1?

I'm experiencing a huge performance penalty on ZFS + NFS + ESXi: write performance is only 3 MB/s on a 4-disk server in RAID 10, and 10-22 MB/s with an SSD as ZIL.

After reading the following article, setting vfs.zfs.cache_flush_disable=1 gave me a huge performance boost in the virtual machine, up to 60-70 MB/s even without an SSD as ZIL.
http://christopher-technicalmusings.blogspot.com/2010/09/zfs-and-nfs-performance-with-zil.html

I'm wondering whether vfs.zfs.cache_flush_disable=1 will result in data corruption after a power failure. Is it the same as sync=disabled? Will adding an SSD as ZIL solve the problem?

Another question: does sync=standard provide roughly the same protection against data corruption during write operations as sync=always? My understanding is that sync=standard writes the data to the intent log first and flushes it to disk later, so I can assume it is safe, right?
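
For reference, a minimal sketch of how the tunable is set on FreeBSD (the loader.conf entry takes effect at the next boot; whether the sysctl is writable at runtime depends on the FreeBSD version):

Code:
# /boot/loader.conf - applied at next boot
vfs.zfs.cache_flush_disable="1"

# or, if the oid is writable on your FreeBSD version:
# sysctl vfs.zfs.cache_flush_disable=1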
 
Sebulon said:
Hi,

optimizing performance on ESX is as easy as tweaking NFS a little:
http://christopher-technicalmusings.blogspot.se/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html

Makes NFS go async while preserving all the goodness ZFS has to offer. A FreeBSD-UFS guest on ESX with its "hard drive" on a tweaked ZFS/NFS datastore gives me 145 MB/s write and read over 1 Gb/s with bonnie++.

/Sebulon

Hi,
I have tried the method, but I'm just wondering how safe it is to patch the NFS server to ignore O_SYNC operations.

With the NFS patches, I get the same speed as sync=disabled while running sync=standard (60-80 MB/s), but with sync=always I get the same speed as before, 3-6 MB/s.

With vfs.zfs.cache_flush_disable=1, I got around 50 MB/s with both sync=standard and sync=always.

Which setting is more risky?
 
Well, I set (not quite sure of the exact name):
# sysctl kern.da.default_timeout=360
on the guest and ran bonnie++ again. I then reset the server three times while writing, three times while rewriting and three more times while reading. The guest just paused and resumed when the server had booted up again. The server never had any issues and a scrub afterwards showed no errors. That's enough for me :)
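
For completeness, the scrub check was roughly the following (a sketch; the pool name tank is a placeholder):

Code:
# start a scrub, then check whether any errors were found
zpool scrub tank
zpool status -v tank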

/Sebulon
 
Sebulon said:
Hi,

optimizing performance on ESX is as easy as tweaking NFS a little:
http://christopher-technicalmusings.blogspot.se/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html

Makes NFS go async while preserving all the goodness ZFS has to offer. A FreeBSD-UFS guest on ESX with its "hard drive" on a tweaked ZFS/NFS datastore gives me 145 MB/s write and read over 1 Gb/s with bonnie++.

/Sebulon

I have tried this method, however it will cause data loss when a power failure occurs on the storage server.

I did a dd with oflag=direct in the virtual machine to ensure the data was flushed to stable storage, and output the file size simultaneously. After that I simulated a power-down on the storage.
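
Roughly what I ran in the guest (a sketch; the paths are examples and it assumes a Linux guest whose dd supports oflag=direct):

Code:
# write with O_DIRECT so the guest's page cache is bypassed,
# and print the growing file size from a second shell
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=2048 oflag=direct &
while true; do ls -l /mnt/nfs/testfile; sleep 1; done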

After powering the storage and the virtual machine back on, I found that the file size did not match what had been output on the console before the cut.
 
Yes, well, that's to be expected, since you have another layer of complexity (ESX) in between. But hey, that's what snapshots are for :)

I noticed however that a guest with ZFS on its virtual hard drives did withstand at least one power failure. After two consecutive ones it panicked and rebooted, and at that point the guest's zpool reported checksum errors. And one time, it refused to boot and said that all block I/O was suspended. I needed to boot live from an install CD, import -F the pool and rewind three minutes, then reboot like normal. That is a good way of knowing if a rollback is needed.
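
For reference, the recovery from the live CD looked roughly like this (a sketch; the pool name is a placeholder):

Code:
# dry run first: report whether recovery is possible and how much would be lost
zpool import -F -n tank
# extreme rewind import, discarding the last few transactions if necessary
zpool import -F -R /mnt tank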

/Sebulon
 
sync=disabled will cause all writes to be handled as if they were async - just stored in memory and flushed to disk with the next transaction. I would expect ESX (sync) writes to perform very similarly to having the NFS server modified to make all writes async.

Obviously ESX uses sync writes because it wants to know the data is on disk before telling the client OS the data was written. With either setting above you run the risk of a VM thinking data is written and that data being lost in a power failure before the current ZFS transaction is committed. ZFS will be fine - it doesn't guarantee async writes; the client should use sync for that, which is why ESX does - but you may find the client has corruption or lost data. With the test in the post above (resetting the ZFS server during bonnie++) I would expect ZFS to be fine, but the bonnie++ test data may have been corrupted.

sync=standard is neither purely sync nor async - it's both.
If the client application asks for sync, the write goes to the ZIL; if async, it doesn't.
With this setting you would expect a standard NFS server to perform badly and a modified one to perform as if you had sync=disabled.

sync=always treats every write as sync, so everything goes through the ZIL. This is great in some ways - ZFS should never lose any data on power loss, even async writes - but ESX will perform badly regardless of whether you modify the NFS server or not. (You'll have modified NFS to make all sync writes async, then configured ZFS to make all async writes sync.)
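
For clarity, the sync property is set per dataset; a minimal sketch (the dataset name tank/esx is a placeholder):

Code:
# choose one of: standard (default), always, disabled
zfs set sync=standard tank/esx
zfs get sync tank/esx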

The cache_flush_disable is an interesting setting (I hadn't known about it until now).
Unfortunately I can't find out much about it. The page above suggests it forces a flush of the transaction/ZIL, but I can't see why ZFS would let client apps force a flush of the entire transaction. The best I can find is this -

Code:
# disable BIO flushes
# disables metadata sync mode and uses async I/O without flushes
# ONLY USE FOR PERFORMANCE TESTING
#vfs.zfs.cache_flush_disable="1"
(From http://zfsguru.com/forum/zfsgurusupport/82)

Interestingly, the tuning above is for FreeBSD, from ZFSguru, which is written by someone who I believe posts on this forum every now and then. They've obviously learnt about this setting from somewhere.

To me it suggests the data is written to disk async and no flushes of the disk cache are done. This *may* be OK if you have battery-backed ZIL devices. I would be very interested to have a definitive answer on whether this setting is completely safe with a battery-backed ZIL. Referencing an earlier topic on here, this sort of optimisation may be why Sun/Oracle always use enterprise ZIL devices. It would make sense for them to use battery/supercap-backed ZIL devices and configure ZFS to not bother flushing them all the time.
 
I'd be very interested to see the results of sync NFS testing with the Deneva 2 R-Series as ZIL and everything else standard, other than cache flushing disabled.

It would then be good to perform a test similar to the bonnie++ test above, but copying a file with a known checksum, so that not only ZFS recovery but also the integrity of the copied file can be checked.
 
Something I need to mention about my test results with vfs.zfs.cache_flush_disable=1 and with the NFS patch applied: the NFS patch seems to perform faster than the cache_flush option. I wrote a script that keeps writing to disk with sync and outputs the counter simultaneously. Below is a sample:
Code:
#!/usr/bin/env perl
my $i = 0;
while (1) {
    # append the counter to the log file, force a sync, then echo it to the console
    system("echo $i >> /mnt/count_1.txt; sync; echo $i");
    $i++;
}

I created five processes and wrote to different log files for comparison.

Once it had been running for a while, I "hard" powered off my storage to simulate a power cut. All of the virtual machines had successfully flushed their data to the storage. I verified this by stopping all VMs, starting my storage, then powering on all of my VMs and comparing the logs with the last screen output on each VM when it hung.

Both options appear safe in this scenario, but dd is showing an inconsistent file size.
 
I just simulated the power-cut scenario with vfs.zfs.cache_flush_disable and with the NFS/ESXi patch, running my script above. With vfs.zfs.cache_flush_disable, the results were consistent with what had been written to disk. With the patched NFS/ESXi setup, I lost some data compared to the last output.

I will stick with vfs.zfs.cache_flush_disable because it seems much safer during a power-loss incident, with acceptable performance.
 
Code:
# md5 /mnt/ram/test2GB.bin
MD5 (/mnt/ram/test2GB.bin) = 2a5173922c992ee6f2079ace52c7377d


vfs.zfs.cache_flush_disable=0
# dd if=/mnt/ram/test2GB.bin of=/mnt/tank/perftest/test2GB.bin bs=1m
2048+0 records in
2048+0 records out
2147483648 bytes transferred in 31.803638 secs (67523208 bytes/sec)
# rm /mnt/tank/perftest/test2GB.bin
# cp /mnt/ram/test2GB.bin /mnt/tank/perftest/
<- blackout ->
# md5 /mnt/tank/perftest/test2GB.bin
MD5 (/mnt/tank/perftest/test2GB.bin) = 2a5173922c992ee6f2079ace52c7377d


vfs.zfs.cache_flush_disable=1
# dd if=/mnt/ram/test2GB.bin of=/mnt/tank/perftest/test2GB.bin bs=1m
2048+0 records in
2048+0 records out
2147483648 bytes transferred in 30.185444 secs (71143020 bytes/sec) <- 17368 NFS IOPS
# rm /mnt/tank/perftest/test2GB.bin
# cp /mnt/ram/test2GB.bin /mnt/tank/perftest/
<- blackout ->
# md5 /mnt/tank/perftest/test2GB.bin
MD5 (/mnt/tank/perftest/test2GB.bin) = 2a5173922c992ee6f2079ace52c7377d

YES YES YES! Both a performance increase and absolute resiliency. Suck it! :)

A lingering feeling in me says that I'd want to know if another disk, like the Vertex 3, as SLOG would give the same results. Since it's without a supercap, my gut tells me it shouldn't pass the md5 check, but perhaps it does, I don't know. Someone needs to verify that with an equivalent test. But at least we now know that the Deneva 2 R-Series is 100% sweetness :)

/Sebulon
 
belon_cfy said:
I just simulated the power-cut scenario with vfs.zfs.cache_flush_disable and with the NFS/ESXi patch, running my script above. With vfs.zfs.cache_flush_disable, the results were consistent with what had been written to disk. With the patched NFS/ESXi setup, I lost some data compared to the last output.

I will stick with vfs.zfs.cache_flush_disable because it seems much safer during a power-loss incident, with acceptable performance.

Interesting that you had data loss with my patch. I haven't experienced that in my testing, or in 2 years of production running that includes a number of crashes.

I find that some people think my patch disables sync on the ZFS side - it doesn't. It disables sync between the NFS client/server.

With my patch, NFS is saying it's sync'ed when it's possibly only living in the ZIL. If you trust your ZIL, you are fine.

Once ZFS gets a hold of your data (and hasn't been compromised with settings that change its ZIL/cache behaviour), it's safe.

If you have corruption problems after a crash, I would suggest it's for a different reason.
 
cforger said:
Interesting that you had data loss with my patch. I haven't experienced that in my testing, or in 2 years of production running that includes a number of crashes.

I find that some people think my patch disables sync on the ZFS side - it doesn't. It disables sync between the NFS client/server.

With my patch, NFS is saying it's sync'ed when it's possibly only living in the ZIL. If you trust your ZIL, you are fine.

Once ZFS gets a hold of your data (and hasn't been compromised with settings that change its ZIL/cache behaviour), it's safe.

If you have corruption problems after a crash, I would suggest it's for a different reason.

You are saying that the patch disables sync between the NFS client/server.

ZFS's syncing modes work like this:
  • always = Always sync, regardless
  • disabled = Never sync, regardless
  • standard = Both. If a client/server wants sync, it gets synced. And if a client/server wants async, it doesn't.

Which means that your patch together with ZFS's sync parameter "standard" is dangerous, because the data doesn't get synced (it never hits the ZIL).

Two practical tests I can think of are 1) with a patched NFS server, watch gstat to see that the SLOG never gets any I/O, and 2) change the sync parameter to "always" and see that the SLOG now gets hit by massive I/O and throughput goes down.
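
Roughly (a sketch; the device regex and dataset name are placeholders for your SLOG device and ESX dataset):

Code:
# 1) with the patched NFS server and sync=standard, the SLOG should stay idle
gstat -f 'da5'

# 2) force every write through the ZIL and watch the SLOG light up
zfs set sync=always tank/esx
gstat -f 'da5'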

There aren't any silver bullets. If you care about your data and still want decent throughput, invest in a really good SLOG, preferably two, mirrored.

/Sebulon
 
I find that some people think my patch disables sync on the ZFS side - it doesn't. It disables sync between the NFS client/server.

With my patch, NFS is saying it's sync'ed when it's possibly only living in the ZIL. If you trust your ZIL, you are fine.

Just to clarify.
If you modify your NFS server to make all writes ASYNC (which is what you have done), ZFS will not bother putting them in the ZIL at all (unless you set sync=always). It treats them as ASYNC and just stores them in RAM, waiting for the transaction to flush to the main pool disks. If the ZFS server crashes in the middle of the transaction, you *will* lose that data. ZFS won't care; it doesn't guarantee ASYNC writes, so it will boot up as if everything's fine.
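
To put a rough bound on that window: async data sits in the open transaction group until ZFS syncs it out, which on FreeBSD happens at most every vfs.zfs.txg.timeout seconds (a sketch; the default has been 5 or 30 seconds depending on version):

Code:
# how often transaction groups are forced out to the pool, in seconds
sysctl vfs.zfs.txg.timeout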

When you say you've had no problems even though the ZFS server has crashed, are you sure? Your ZFS pool won't have any errors, even with a scrub; it's the file systems of the VMs that will have the problems.

Just to add: I'm actually running sync=disabled on one of my servers as it's the only way I can get good enough performance at the moment. I'm fully aware of what it's doing though, and will live with fscks or restores if I have to. What worries me is that quite a few people are being given the view that this NFS mod is completely safe because ZFS still has its ZIL, when it isn't safe. Sure, if the server never crashes you're fine, and you may even get away with it if it does, but as Sebulon says it's not a silver bullet that increases performance without drawbacks. The scale of the performance increase should raise some warning signs that something is amiss.
 
Also, this from the original linked blog about cache_flush seems wrong.

I believe this is an enhancement in newer ZFS pools anyway, so I'm really not too worried about it. If it's on the ZIL, why do we need to flush it to the drive? A crash at this point will still have the transactions recorded on the ZIL, so we're not losing anything.

I've spent an hour looking into this setting now and still don't have a clear idea of when it's safe to use. It's not about flushing ZIL data to the drives, though; it's about telling the disks to flush their write caches.

If your ZIL devices (either separate or part of the pool) have write caches with no battery backup, turning off cache flushing may damage your pool. Even worse, while the sync/NFS mods keep ZFS happy but may screw up client data, this setting may screw up the zpool itself. This is why the actual source tells you it's for debugging/performance tests only:

Code:
/*
 * Tunable parameter for debugging or performance analysis.  Setting
 * zfs_nocacheflush will cause corruption on power loss if a volatile
 * out-of-order write cache is enabled.
 */
boolean_t zfs_nocacheflush = B_FALSE;

What I still can't find out is whether this setting is 100% safe if you have a supercap ZIL and standard disks, or whether you would need to disable the write cache on the disks. It's driving me a bit mad really; no one seems to know. It does appear that Solaris has disk drivers that are designed to ignore cache-flush requests when they know the device has a protected cache, so it's automatic for them:

A recent fix is that the flush request semantic has been qualified to instruct storage devices to ignore the requests if they have the proper protection. This change required a fix to our disk drivers and for the storage to support the updated semantics.
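
On FreeBSD, one way to inspect or disable the on-disk write cache is sketched below (device names are placeholders; the exact knobs differ between the ada and da drivers and between releases):

Code:
# SATA disks (ada driver): 1 = enabled, 0 = disabled, -1 = leave as the drive default
sysctl kern.cam.ada.write_cache

# SCSI/SAS disks (da driver): inspect the caching mode page (WCE bit)
camcontrol modepage da0 -m 0x08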
 
Hmm, some interesting points. I'll have to try a few tests and get back to you.

Now that we are stable on v28 of ZFS it may be time to review in detail some of these smaller but critical processes to make sure I fully understand how they work under v28.

And I agree - there is no perfect solution. Try designing speaker enclosures. :)
 
usdmatt said:
What I still can't find out is whether this setting is 100% safe if you have a supercap ZIL and standard disks, or whether you would need to disable the write cache on the disks. It's driving me a bit mad really; no one seems to know. It does appear that Solaris has disk drivers that are designed to ignore cache-flush requests when they know the device has a protected cache, so it's automatic for them:

Yes, this is something I've been thinking about as well. But as I see it, if a power outage comes in the middle of a transaction with a supercap SLOG, ZFS should be able to replay that latest transaction to correct things upon boot, right? This is speculative, but it would be interesting to test.

/Sebulon
 
If the ZIL has no battery backup then I can easily see corruption happening on power loss, as mentioned in the source comment. The ZIL is written as a chain (each entry points to the next), so losing data in the write cache will quite likely cause ZFS to see it as corrupt. I can't remember: is it possible to recover from a corrupt ZIL these days by rolling back to a previous transaction?

With a safe ZIL, the only issue I can see is if the pool disks empty their caches out of order and somehow a transaction is marked as complete while the system loses power with other data from the same transaction still in the caches. I don't know if this is possible, or whether it's unlikely enough not to be worth thinking about.
 
@usdmatt

One time while testing stuff in a VM booting from ZFS, it said "I/O suspended" at BTX and I had to boot from an install CD. While in the live environment I tried to import, and it said that the pool had somehow been damaged and that if I wanted to import it, it had to resort to an extreme rewind and a portion of time would be lost. It went through, and it was then possible to boot just like normal. I cannot say that it would play out exactly like that in this particular scenario; I'm just sharing past experience.

/Sebulon
 