NFS write performance with mirrored ZIL

danbi · Dec 6, 2011

Perhaps the only useful settings in that thread are

Code:

vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"

This is because SATA disks don't perform well under concurrent load. ZFS has defaults that are probably designed for SAS drives, although the current values were updated to

Code:

vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 10

What is your current ZFS tuning?

peetaur · Dec 6, 2011

My post above (#50) is confusing... I remember writing that, but not in this thread. It is out of context... nothing to do with ZIL from what I can see. I don't even remember going to this thread yesterday. So I'll reply to danbi in a visitor message. (And FYI 1200 MB/s is only when there is cache so it would be misleading to read it like I worded it).

peetaur · Dec 6, 2011

Is your 70 MB/s number over a plain ol' 1 Gbps link? I've never been able to get FreeBSD to report more than 77 MB/s (using ssh, scp, rsync, etc.) going over 1 Gbps. Linux commands report 110 sometimes. iperf always shows the same numbers for FreeBSD and Linux though, so I think it is just weird calculations (compression? overhead included? Linux reporting MB and FreeBSD reporting MiB?) rather than different tuning or performance capability.
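As a rough check on the MB versus MiB idea, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and protocol overhead is ignored, so real transfers land below these ceilings.

```sh
#!/bin/sh
# Convert a link speed in Mbit/s to its raw ceiling in MB/s (10^6) and MiB/s (2^20).
# Ethernet/IP/TCP/NFS overhead is ignored, so real numbers are lower.
link_ceiling() {
    awk -v mbit="$1" 'BEGIN {
        bytes = mbit * 1000000 / 8
        printf "%.1f MB/s = %.1f MiB/s\n", bytes / 1000000, bytes / 1048576
    }'
}

# Convert a reported rate from MB/s (10^6 bytes) to MiB/s (2^20 bytes).
mb_to_mib() {
    awk -v mb="$1" 'BEGIN { printf "%.1f\n", mb * 1000000 / 1048576 }'
}

link_ceiling 1000   # 1 Gbit/s link
mb_to_mib 110       # the figure Linux tools sometimes report
```

The unit difference is only about 5%, so on its own it would not turn 110 into 77. Something else would have to account for most of that gap.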

So I suggest you should repeat some of your tests with 10Gbps... or a local synchronous NFS mount. If NFS refuses to work, use an ssh tunnel to localhost to trick it.

Also, I suggest you test a ramdisk ZIL just to prove it is not a stupid software bug, and to have a practical test of what the real limit of your network might be. My ramdisk ZIL used for an ESXi virtual disk storage host makes it write at only 80 MB/s. That makes me think it is a software bug. But with an SSD, it goes 5-9MB/s unlike yours, so my weird virtual disk case is likely the worst case... maybe your ramdisk would saturate your network and be limited only by that.

BTW what I wanted to do and didn't have time to was:

Share a ramdisk on a few hosts as an iSCSI target.

Connect them together with a dedicated network (we have so many servers with dual/quad network onboard and we only use one port... could use the 2nd for this)

Mirror the local ramdisk and the iSCSI ramdisks together into my ZIL device.

Run off of that and hope at least one of the systems is alive at all times (so the ZIL is not lost, corrupting whatever changes were there).

Based on my lame 80 MB/s ramdisk number, I was discouraged from investing any time in that.
Maybe you have time to try that if you are interested.

I was also thinking of just hooking the machine up to the UPS over the serial port, setting sync=disabled, and when the UPS reports that power went out, have a script set sync=standard again... but there are other ways to fail other than power outage. I don't know if this is a good idea (or if the UPS can do this).
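A minimal sketch of that UPS idea, under assumptions that are not from this thread: the UPS is monitored with NUT (whose `upsc` client reports `ups.status` as `OL` on line power and `OB` on battery), the UPS is named `myups@localhost`, and the pool is `tank`. Only the decision logic runs as written; the lines that touch the system are commented out.

```sh
#!/bin/sh
# Sketch only: pick a ZFS sync setting from a NUT-style UPS status string.
# "OL" = on line power, "OB" = on battery. Anything else is treated as unsafe.
sync_mode_for_status() {
    case "$1" in
        OL*) echo disabled ;;   # mains power present: take the speed-up
        *)   echo standard ;;   # on battery, low battery, or unknown: be safe
    esac
}

# Hypothetical names: UPS "myups@localhost" and pool "tank".
# Uncomment on a real system with NUT installed, and run it periodically:
# status=$(upsc myups@localhost ups.status 2>/dev/null)
# zfs set sync="$(sync_mode_for_status "$status")" tank

sync_mode_for_status "OL"
sync_mode_for_status "OB DISCHRG"
```

Treating every unknown status as "standard" means a dead UPS daemon or a pulled serial cable falls back to the safe setting. It does nothing for a kernel panic or a failed PSU.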

But my dream is for them to come up with something like this cheap consumer quality Gigabyte iRAM thing but have it copy the RAM to flash memory when the power goes out, and not be so expensive. NetApp has such a thing, but for like $5,000. I think this might be one.

Sebulon · Dec 7, 2011

@peetaur
LOL

Too much blood in your caffeine system

My 70MB/s is indeed over a plain ol' 1Gbps link. I have tried setting up server and client with just one switch between, and also tried a wire going directly from server to client. Both server and client were FBSD and with async NFS it actually did perform around 100-110MB/s

About that ramdisk, have you checked if it has write cache dis- or enabled? At least that SSD performing at 5-9MB/s sounds exactly like how the X-25E performed for me when write cache was disabled

/Sebulon

peetaur · Dec 7, 2011

The SSD goes way faster than 5-9MB/s when I write directly to it... it goes over 220 with a simple async dd test (I forget what the bs=4k test says, but it is way higher than 10). I said it only goes that slow when I do this:

on ESXi, mount the zfs share over NFS, create a virtual machine with a virtual disk on that NFS share, install a file system on it (another layer of caching and flushing), and then do writes with dd, copy, scp, etc. to the disk in the running virtual machine's OS.

Another solution I thought of is to use iSCSI to share zvols, but zvols seem dangerously unstable and inefficient. At some point I will test a file on the zfs fs as an iSCSI target instead of a zvol. I tested a UFS zvol the same way with a virtual disk, and it went 60 MB/s, which is 10x as fast, but when doing so, it would put all of my disks in the pool at 100% load.

1 Gbps link, sync NFS client

Code:

# dd if=/dev/zero of=/tmp/testfile bs=128k count=6000
6000+0 records in
6000+0 records out
786432000 bytes (786 MB) copied, 11.8221 s, 66.5 MB/s

Code:

dT: 5.021s  w: 5.000s  filter: gpt/root|label/.ank|gpt/log|gpt/cache
    0    715      0      0    0.0    703  69324   11.8   30.4| gpt/log0
    0    715      0      0    0.0    703  69324   19.2   47.2| gpt/log1

1 Gbps link, virtual disk files (ESXi NFS client) (10 Gbps seems the same)

Code:

# dd if=/dev/zero of=testfile bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 37,618 s, 7,0 MB/s

Code:

dT: 5.021s  w: 5.000s  filter: gpt/root|label/.ank|gpt/log|gpt/cache
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    372      0      0    0.0    186   6608    0.1   75.5| gpt/log0
    1    372      0      0    0.0    186   6608    0.1   75.5| gpt/log1
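One thing that can be read out of the two gstat listings above is the average size of each write hitting the log devices, which is simply kBps divided by w/s. A small sketch of that division (the function name is only illustrative):

```sh
#!/bin/sh
# Average size of each write hitting a log device: kBps divided by w/s,
# both taken from the gstat lines above.
avg_write_kb() {
    awk -v kbps="$1" -v wps="$2" 'BEGIN { printf "%.1f\n", kbps / wps }'
}

avg_write_kb 69324 703   # plain sync NFS client
avg_write_kb 6608 186    # ESXi NFS client with a virtual disk
```

So the ESXi case issues both fewer and smaller log writes (roughly 36 kB each versus roughly 99 kB). In that case ops/s is also exactly twice w/s (372 versus 186); that might be one extra non-write operation, such as a cache flush, per write, but that is a guess from the numbers.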

The VMs are running Linux.

SSD is a 256GB Crucial m4 SSD CT256M4SSD2 2.5" (6.4cm) SATA 6Gb/s MLC synchronous

Code:

root@bcnas1:/tank/bcnasvm1/bcvm01-02# camcontrol modepage da5 -m 0x08
IC:  0
ABPF:  0
CAP:  0
DISC:  0
SIZE:  0
WCE:  1
MF:  0
RCD:  0
Demand Retention Priority:  0
Write Retention Priority:  0
Disable Pre-fetch Transfer Length:  0
Minimum Pre-fetch:  0
Maximum Pre-fetch:  0
Maximum Pre-fetch Ceiling:  0

And no I don't know how to check the write cache on the ramdisk.

# camcontrol modepage md10 -m 0x08

Code:

camcontrol: cam_lookup_pass: CAMGETPASSTHRU ioctl failed
cam_lookup_pass: No such file or directory
cam_lookup_pass: either the pass driver isn't in your kernel
cam_lookup_pass: or md10 doesn't exist

# kldstat

Code:

Id Refs Address            Size     Name
 1   35 0xffffffff80100000 dc8658   kernel
 2    1 0xffffffff80ec9000 203848   zfs.ko
 3    2 0xffffffff810cd000 49d0     opensolaris.ko
 4    1 0xffffffff810d2000 c818     if_ixgb.ko
 5    1 0xffffffff810e1000 30990    mpslsi.ko
 6    1 0xffffffff81112000 117d8    ahci.ko
 7    1 0xffffffff81124000 ab40     siis.ko
 8    1 0xffffffff81212000 40c2     linprocfs.ko
 9    1 0xffffffff81217000 1de21    linux.ko

# ls /boot/kernel/*pass*

Code:

ls: /boot/kernel/*pass*: No such file or directory

Ramdisk was created this way (seems like a hack... unmounting it because for some reason it decides to implicitly format it UFS and mount it for me)

Code:

mkdir /mnt/ramdisk
/sbin/mdmfs -s 4G md10 /mnt/ramdisk
umount /mnt/ramdisk/
zpool add tank log md10
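A possible way around the implicit UFS format and mount, sketched here as an assumption and not something that was run in this thread: mdconfig(8) can attach a bare swap-backed memory disk without putting a file system on it. The wrapper function and the RUN variable are made up for illustration; by default the script only prints the commands.

```sh
#!/bin/sh
# Dry run by default: RUN=echo only prints the commands.
# Set RUN to an empty string to actually execute them (as root, on FreeBSD).
RUN=${RUN-echo}

# Attach an unformatted swap-backed memory disk and add it to a pool as a log device.
# Arguments: size (e.g. 4g), md unit number, pool name.
make_ram_slog() {
    size=$1 unit=$2 pool=$3
    $RUN mdconfig -a -t swap -s "$size" -u "$unit"
    $RUN zpool add "$pool" log "md$unit"
}

make_ram_slog 4g 10 tank
```

This ends up with the same md10 log device as the mdmfs sequence above, just without the newfs, mount and umount steps. Like any ramdisk ZIL, it is only suitable for testing.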

Sebulon · Dec 8, 2011

@peetaur

Oh you meant THAT kind of ramdisk

I thought it was a hardware-based PCIe card or something. Yeah, strangely, I've noticed that behavior as well. Quoting myself here:

Odd reflection from these tests was from when I added a 1GB large ram-md disk as a "best possible" disk for ZIL and it didn't even use it?! I mean, I added md0 as ZIL in the pool, started the sync'ed transfers from the test-client, watched gstat on the server during this time and the md0 drive was never written to. I then removed it from the pool and re-added the Vertex as ZIL instead, and instantly ZFS started using the ZIL as it normally does. Tried restarting the server, destroyed and created with a bigger sized md-device, partitioned it, destroyed the Vertex's partition and label and used that same gpart-label on the md0p1 (gpt/log1) instead. Nothing worked. The only disk it wrote to as ZIL was the Vertex. Has worked in earlier tests though. Very odd.

peetaur said:
Ramdisk was created this way (seems like a hack... unmounting it because for some reason it decides to implicitly format it UFS and mount it for me)

Yeah it does that. I used the same approach as you, except I sized the log after a 1Gbps connection:

Code:

# mkdir /mnt/ramdisk
# mdmfs -s 1G md10 /mnt/ramdisk
# umount /mnt/ramdisk/
# zpool add tank log md10

I have only benchmarked the server towards a client, an ESXi for example. This was made to see what kind of performance the ESXi could expect from its datastore. You took this one step further, that was interesting. I will try to do that as well and see how it performs inside of a VM.

That Crucial SSD looks like a good performer, I read some benchmarks on it, 4k writes should be around 60-70MB/s. Shame it's not safe to use as a ZIL, since it's without any capacitor.

/Sebulon

peetaur · Dec 8, 2011

Isn't an Intel X25-E also without a capacitor? Would you use one as a ZIL without a capacitor? Doesn't synchronous writing prevent problems with not having a capacitor? (doesn't sync mean that the write cache must be fully flushed before the command is complete?)

peetaur · Dec 8, 2011

Sebulon said:
Yeah, strangely, I've noticed that behavior as well. Quoting myself here:
Quote:
... added a 1GB large ram-md disk as a "best possible" disk for ZIL and it didn't even use it?! [...] was never written to. I then removed it from the pool and re-added the Vertex as ZIL instead, and instantly ZFS started using the ZIL as it normally does. ... Nothing worked. The only disk it wrote to as ZIL was the Vertex. Has worked in earlier tests though. Very odd.

That isn't the behavior I had. I could see the ramdisk being used in gstat every time. The "odd" thing I was talking about was just that when I have a virtual machine with a virtual disk file accessed through ESXi's NFS client, even a ramdisk ZIL would go way below the normal sync nfs client speed.

Here are all my numbers I got with various experiments... to add to your MUCH appreciated data.

Code:

virtual disk comparison tests
    no log device (offline mirror)

        $ sudo dd if=/dev/zero of=/testfile4 bs=128k count=10000
        ^C924+0 records in
        924+0 records out
        121110528 bytes (121 MB) copied, 45.4686 s, 2.7 MB/s

    2 way mirror with ramdisk, gpt/log0 (crazy idea, just to see what happens)

        $ sudo dd if=/dev/zero of=/testfile10 bs=128k count=3280
        3280+0 records in
        3280+0 records out
        429916160 bytes (430 MB) copied, 65.7471 s, 6.5 MB/s

    'striped' SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C2682+0 records in
        2682+0 records out
        351535104 bytes (352 MB) copied, 54.3825 s, 6.5 MB/s

    mirrored SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C2726+0 records in
        2726+0 records out
        357302272 bytes (357 MB) copied, 45.5657 s, 7.8 MB/s

    software striped  (non-ZFS stripe) log device

        zpool remove tank gpt/log0 gpt/log1
        kldload geom_stripe
        gstripe label -v st0 gpt/log0 gpt/log1
        zpool add tank log stripe/st0

        $ sudo dd if=/dev/zero of=/testfile6 bs=128k count=10000
        ^C2816+0 records in
        2816+0 records out
        369098752 bytes (369 MB) copied, 45.2785 s, 8.2 MB/s

    3 way stripe with ramdisk, gpt/log0, gpt/log1

        $ sudo dd if=/dev/zero of=/testfile10 bs=128k count=10000
        ^C3174+0 records in
        3174+0 records out
        416022528 bytes (416 MB) copied, 46.7078 s, 8.9 MB/s

    single SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C3130+0 records in
        3130+0 records out
        410255360 bytes (410 MB) copied, 44.9023 s, 9.1 MB/s

    ramdisk log device

        mkdir /mnt/ramdisk
        /sbin/mdmfs -s 4G md10 /mnt/ramdisk
        umount /mnt/ramdisk/
        zpool add tank log md10

        dd if=/dev/zero of=/testfile10 bs=128k count=10000
        10000+0 records in
        10000+0 records out
        1310720000 bytes (1.3 GB) copied, 16.2831 s, 80.5 MB/s

    UFS zvol (bypassing log)

        zfs create -V 110g tank/vmufstest

        45-117 MB/s

    unexpected:
        ZFS-Striped <  mirror
        ZFS-Striped < single disk
        software-striped < single disk

    expected but noteworthy anyway:
        mirror < single disk

2 tests seem to be missing.

I don't see my "spinning disk ZIL" test which I think was somewhere between the no log test, and the mirrored SSD test.

I don't see the test where I used an 8 consumer disk striped mirrored setup with no log device instead of my 16 enterprise disk raidz2, which was the same as the other "no log" test. (and in my 16 consumer vs 16 enterprise tests, the numbers were only a few % different, so I don't think that threw it off)

peetaur · Dec 8, 2011

Here is some text about SSDs that I did not read before, but something like it made me think that with a ZFS ZIL, I don't care about a capacitor. Do you suggest otherwise? And I don't know if MLC vs SLC is an issue. What do you think?

"But the drive may have answered to the OS, that it wrote the data to the non-volatile media. That's really a problem."
"Someone has lied in the chain OS to disk."
"You have to disable the drive caches, to ensure that the data is really on disk"
"ZFS doesn't have that problem because it flushes the cache after every write to the ZIL. It circumvents the problem of the write cache by effectively disabling it due to the frequent flushing of the caches."

http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html

Sebulon · Dec 8, 2011

@peetaur

I was misinformed about the X25-E having a capacitor; I always believed it had one. OK, screw that, I've dug up something even better: the OCZ Deneva 2 240GB MLC. It has the exact same specs as the Vertex 3 240GB and is also equipped with a capacitor. I would love to get my hands on one of those and test to see if it really does what it's supposed to in case of a power failure.

SLC is definitely better than MLC at the same size. For example 32GB SLC is way better than 32GB MLC. But it's also a question of market price. SLCs are too expensive for manufacturers to make in big sizes, so far fewer people buy SLC and because of that, they are more expensive. In fact, I tried to find SLC drives in Sweden's consumer market and found just one internet shop that still sells the X25-E, and it cost more than buying a 240GB MLC that has the same performance, if not better.

The Zeus IOPS drives have capacitors. It is a must for any SLOG to ensure that data gets written to the ZIL safely. It's either a capacitor that works or disabling the write cache altogether, and that lands you at about 5-10MB/s

/Sebulon

peetaur · Dec 8, 2011

Did you do a test yet, with an SSD that has a capacitor, in an ESXi virtual disk over NFS test like mine that goes 5-9MB/s?

My SSD has the write cache enabled. My normal NFS clients set to synchronous go 65 MB/s. So why should the ESXi client somehow tell my SSD to not use the write cache? Is this what is really happening, or is it just a client-side bug/braindead configuration?

This page is highly unrelated... about ESXi's iSCSI initiator and "starwind" iSCSI target (probably not ZFS). But sounds like a very similar problem, and then the guy reports better results when using the recommendations on this page. I don't think it applies to NFS though.

But it did make me think of one thing... I can do a test that does not have another file system in there to remove "another layer of caching and flushing" as I said above.

Code:

    mirrored SSD log device. writing to direct virtual disk with no file system

        # dd if=/dev/zero of=/dev/sdc1 bs=128k count=1000
        1000+0 records in
        1000+0 records out
        131072000 bytes (131 MB) copied, 24.0635 s, 5.4 MB/s

So there is something about the ESXi NFS client that really sucks. And you were planning on using it for that, right? So we are in the same boat.

And you didn't answer my question about the necessity of a capacitor on ZIL on zfs.
"ZFS doesn't have that problem because it flushes the cache after every write to the ZIL. It circumvents the problem of the write cache by effectively disabling it due to the frequent flushing of the caches."
I have believed that was true in my initial research, and everything in between, and still now. Do you believe that also? If so, then unlike other SSD applications, using it in a ZIL does not require it have a capacitor, not for performance nor data integrity (but makes it slower compared to non-zfs applications).

danbi · Dec 9, 2011

peetaur said:
I don't see the test where I used an 8 consumer disk striped mirrored setup with no log device instead of my 16 enterprise disk raidz2, which was the same as the other "no log" test. (and in my 16 consumer vs 16 enterprise tests, the numbers were only a few % different, so I don't think that threw it off)

This sort of test is pretty much useless and very much misleading.

Consumer disks may have the same or even better sequential read/write speeds compared to enterprise drives. Where enterprise drives excel is reliability and multi-threaded performance. Also, it is unwise to compare SATA and SAS disks with sequential loads such as dd, because such scenarios almost never happen in a multitasking OS.
For example, SAS drives have independent read/write data paths, while SATA devices by design have a single data path that is switched for read/write -- there are usage patterns where this significantly impacts performance.

You will be much better off doing benchmarks with bonnie++, especially multi-threaded tests.

The SLOG is intended to help with multiple small sync writes, which would typically happen in a database or heavily multitasking setup. I believe in newer versions of ZFS large synchronous writes bypass the ZIL already. The primary purpose of the SLOG is to reduce sync write latency by using a small, fast dedicated device.

Daniel

peetaur · Dec 9, 2011

If the test is misleading, you read it wrong. I didn't say the consumer disks are the same speed. I said that in my testing (zfs file system, caching, sync NFS, in some cases an SSD ZIL), the system performs the same. And in real world tests, I am not testing raw disks without cache, so I don't want to run your benchmarks with cache disabled; that would tell me nothing practical.

And what I was saying is not useless; I was talking about the effect of mirroring it to try to fix the slow ESXi problem, not enterprise vs consumer disks. You need to be wise in how you read results from anywhere, considering the context including benchmarks. And benchmarks are even worse with ZFS, showing obvious fake results due to caching. I can disable the cache or do real world tests to avoid that. So don't tell me it is useless. Try to be constructive instead.

And bonnie++ won't run at all for me on FreeBSD, only on the Linux machines I tried it on.

Code:

/tank/test# bonnie++ -d /tank/test -c 8 -s 1500 -x 10 -u peter
Using uid:1001, gid:1001.
File size should be double RAM for good results, RAM is 49124M.

/tank/test/bonnie# bonnie++ -d /tank/test -c 8 -s 100000 -x 10 -u peter
Using uid:1001, gid:1001.
format_version,bonnie_version,name,file_size,io_chunk_size,putc,putc_cpu,put_block,
put_block_cpu,rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu,seeks,
seeks_cpu,num_files,max_size,min_size,num_dirs,file_chunk_size,seq_create,
seq_create_cpu,seq_stat,seq_stat_cpu,seq_del,seq_del_cpu,ran_create,ran_create_cpu,
ran_stat,ran_stat_cpu,ran_del,ran_del_cpu,putc_latency,put_block_latency,
rewrite_latency,getc_latency,get_block_latency,seeks_latency,seq_create_latency,
seq_stat_latency,seq_del_latency,ran_create_latency,ran_stat_latency,ran_del_latency
Can't open file ./Bonnie.97370

root@bcnas1bak:/tank/test/bonnie# ls -l
total 3
drwxr-xr-x  3 peter  peter  3 Dec  9 08:37 largefile1

The directory called largefile1 was created with filebench, not bonnie++. Filebench seems to work other than reporting bogus CPU numbers, but again, it gives bogus "disk" results due to caching, and basically valid but not real-world "file system" results.

Sebulon · Dec 9, 2011

@peetaur

OK, to answer your question: I believe you MUST have an SSD with a capacitor, or you WILL end up with a corrupt ZIL (possibly killing the entire pool) after a power failure. And I'm far from alone about that:

http://hardforum.com/showthread.php?t=1591181

use SSDs with supercapacitor protection to protect your SLOG from corruption

http://www.nexentastor.org/boards/1/topics/972

in case of a power failure, data in the device's internal cache must not be lost, or the device will at least have to honor ZFS cache flush requests

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30412.html

You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can't be disabled neither flushed by ZFS.

http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/

The only thing it lacks is a supercapacitor, but that shouldn't be a problem if you have dual PSU's connected to dual UPS's

BUT I haven't had the opportunity to test this personally, so I cannot say I know this for sure yet. I'm hoping I can build a rig soon where it's ok to test a power failure, first with a non-capacitor SSD like the Vertex 3, and then again with a capacitor-backed SSD, like the Deneva 2, to see if there's any difference. I mean, if the pool will get faulted with the Vertex as SLOG and then the same for the Deneva. That kind of test.
After that, we will know for sure if the capacitor really has an impact or not.

/Sebulon

peetaur · Dec 9, 2011

Sounds like fun. Can't wait to hear the results.

Add this to your fun list: ramdisk ZIL with hard reboot during a sync write.

RusDyr · Jan 27, 2012

Very interesting topic. I have two Intel SSDSA2CW120G3 (120GB SSD) and can say that UFS is really good. I get double the speed on UFS compared to ZFS:
ZFS partition:

Code:

dd if=/tmp/ram/rand of=/tmp/z/testfile.bin bs=4k
515584+0 records in
515584+0 records out
2111832064 bytes transferred in 32.458456 secs (65062616 bytes/sec)

UFS partition:

Code:

dd if=/tmp/ram/rand of=/tmp/ufs/testfile.bin bs=4k
515584+0 records in
515584+0 records out
2111832064 bytes transferred in 14.829704 secs (142405543 bytes/sec)

And since I wanted to use ZFS, it drives me really crazy.

Sebulon · Jan 27, 2012

@RusDyr

That doesn't quite show the whole picture, I've got a couple of questions to go with those dd's:

1. Are you writing to one disk with zfs and one with ufs?
2. How are the disks partitioned? Output of:
# gpart show
3. What is the output of:
# zdb | grep ashift
4. Were you planning on having those two for a system pool and a bunch of other disks for "tank", or are those two it?
5. Are you planning on using some type of redundancy on the SSDs, like gmirror or zfs mirror?

You are writing to a file in a file system, which cancels the bs-flag on dd and uses the block-size the file system wants. If you want to benchmark that difference, you have to write directly to a device, eg: /dev/daX(pX)

/Sebulon

RusDyr · Jan 30, 2012

1. Yes.
2.

Code:

=>       34  234441581  ada4  GPT  (111G)
         34          6        - free -  (3.0k)
         40        128     1  freebsd-boot  (64k)
        168    2097152     2  freebsd-swap  (1.0G)
    2097320   52428800     3  freebsd-zfs  (25G)
   54526120    6291456     4  freebsd-zfs  (3.0G)
   60817576    6291456     5  freebsd-zfs  (3.0G)
   67109032   83886080     6  freebsd-zfs  (40G)
  150995112   83446496     7  freebsd-zfs  (39G)
  234441608          7        - free -  (3.5k)

=>       34  234441581  ada5  GPT  (111G)
         34          6        - free -  (3.0k)
         40        128     1  freebsd-boot  (64k)
        168    2097152     2  freebsd-swap  (1.0G)
    2097320   52428800     3  freebsd-ufs  (25G)
   54526120    6291456     4  freebsd-zfs  (3.0G)
   60817576    6291456     5  freebsd-zfs  (3.0G)
   67109032   83886080     6  freebsd-zfs  (40G)
  150995112   83446496     7  freebsd-zfs  (39G)
  234441608          7        - free -  (3.5k)

3. It was "ashift 9".
4. I would like to use it for system pool (ZFS mirror), for ZIL (ZFS mirror, per pool), and for L2ARC (ZFS stripe, per pool).
5. Yeah, currently system partition is gmirror'ed, but I'm slightly disappointed that TRIM isn't supported over gmirror.

Sebulon said:
You are writing to a file in a file system, which cancels the bs-flag on dd and uses the block-size the file system wants. If you want to benchmark that difference, you have to write directly to a device, eg: /dev/daX(pX)

I did it like you did.

The benchmark against the device directly (or more accurately, against a partition) is pretty close to the UFS results.

P.S. Current config:
# camcontrol devlist

Code:

<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD204UI 1AQ10001>         at scbus1 target 0 lun 0 (pass1,ada1)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (pass2,ada2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (pass3,ada3)
<INTEL SSDSA2CW120G3 4PC10362>     at scbus4 target 0 lun 0 (pass4,ada4)
<INTEL SSDSA2CW120G3 4PC10362>     at scbus5 target 0 lun 0 (pass5,ada5)

# zpool list

Code:

NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage  3,53T  1,56T  1,97T    44%  1.00x  ONLINE  -
zstripe   103G  9,73G  93,3G     9%  1.00x  ONLINE  -

# mount -v

Code:

/dev/mirror/system on / (ufs, local, noatime, journaled soft-updates, fsid 72fe234fbf939f57)
devfs on /dev (devfs, local, multilabel, fsid 00ff007171000000)
procfs on /proc (procfs, local, fsid 01ff000202000000)
storage on /storage (zfs, local, noatime, nfsv4acls, fsid 791cf6fedea1d203)
zstripe on /zstripe (zfs, local, noatime, nfsv4acls, fsid bb656ff7de90fc76)

# zpool status

Code:

  pool: storage
 state: ONLINE
  scan: resilvered 52K in 0h0m with 0 errors on Fri Jan 27 10:39:43 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/73247b20-46b2-11e1-8642-a0369f0010fc  ONLINE       0     0     0
            gptid/cc27ad35-475d-11e1-8383-00259052b005  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/84178b08-46b2-11e1-8642-a0369f0010fc  ONLINE       0     0     0
            gptid/cae28809-475d-11e1-8383-00259052b005  ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            gptid/3600f2e2-48eb-11e1-9076-485b39c5c747  ONLINE       0     0     0
            gptid/3b54ceb9-48eb-11e1-9076-485b39c5c747  ONLINE       0     0     0
        cache
          ada4p6                                        ONLINE       0     0     0
          ada5p6                                        ONLINE       0     0     0

errors: No known data errors

  pool: zstripe
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zstripe     ONLINE       0     0     0
          ada3p4    ONLINE       0     0     0
          ada0p4    ONLINE       0     0     0
          ada1p4    ONLINE       0     0     0
          ada2p4    ONLINE       0     0     0
        logs
          mirror-4  ONLINE       0     0     0
            ada4p5  ONLINE       0     0     0
            ada5p5  ONLINE       0     0     0
        cache
          ada4p7    ONLINE       0     0     0
          ada5p7    ONLINE       0     0     0

errors: No known data errors

Sebulon · Jan 30, 2012

@RusDyr

Thank you, that helped me understand your system better.

RusDyr said:
I did it like you did.
Benchmark with direct device (or more accurately, to partition) pretty close to UFS results.

Don't do what I do, do what I say

I know, I did that, and it was wrong of me. I didn't know that the filesystem interfered that way before. I did however explain it in a later post (#35)

Now, I'm like the worst person at math, so I'm just gonna ask; those partitions, they don't look to me as if they are aligned to 4k? The partitions should have at least a start offset evenly divisible by 1M, like:
# gpart add -t freebsd-zfs -b 1m -a 4k daX
On some SSDs I've tested, that has made a huge difference, like doubled the performance. Might not have been the case with the 320's though, but it couldn't hurt anyways.
Also I would suggest you make sure to have ashift set to 12 on a SLOG. Have you used the gnop trick before? I can describe the steps for you if you like.
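The alignment question can be settled with a bit of shell arithmetic. This is a minimal sketch, assuming the start sectors printed by gpart are 512-byte sectors (so 4 KiB is 8 sectors and 1 MiB is 2048 sectors); the function name is only illustrative.

```sh
#!/bin/sh
# Classify a partition's start sector (512-byte sectors, as printed by gpart show).
# 4 KiB = 8 sectors, 1 MiB = 2048 sectors.
alignment() {
    start=$1
    if [ $((start % 2048)) -eq 0 ]; then
        echo "1M"
    elif [ $((start % 8)) -eq 0 ]; then
        echo "4k"
    else
        echo "unaligned"
    fi
}

# Start sectors copied from the gpart output for ada4/ada5 earlier in the thread.
for s in 40 168 2097320 54526120 60817576 67109032 150995112; do
    echo "$s: $(alignment "$s")"
done
```

By this arithmetic every partition in that listing starts on a 4k boundary, but none starts on a 1M boundary, because each start is offset by 168 sectors from a multiple of 2048.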

In the beginning, I tested having two disks with two partitions each, acting as both ZIL and L2ARC, which was a very bad idea. I noticed a performance hit of about 30% that you should be aware of. Using the same disks for multiple things, like boot, ZIL and L2ARC is bad practice. The point is to have separate (undisturbed) ZIL and L2ARC that you provision to the level of your performance, so that the data-disks behind are less important performance-wise, because the ZIL and L2ARC will always guarantee that level.
Or more clearly; if you have good enough ZIL and L2ARC, you can configure the data-disks for biggest possible storage, instead of acceptable performance.

I would suggest you boot off of a USB stick, set up and mount root on the pool, and have one of the 320's dedicated as ZIL and the other as L2ARC for best performance. That way you only have one big pool with best possible performance. If you run 8.2-STABLE or 9.0-RELEASE you don't have to worry about mirroring the ZIL any more.

/Sebulon

peetaur · Feb 1, 2012

RusDyr, can we see your /boot/loader.conf and /etc/sysctl.conf?

In particular, I am looking for these settings (and here are my values for my 48GB machine, all in /boot/loader.conf, nothing zfs related in my /etc/sysctl.conf):

Code:

vm.kmem_size="44g"
vm.kmem_size_max="44g"
vfs.zfs.arc_min="80m"
vfs.zfs.arc_max="38g"
vfs.zfs.arc_meta_limit="24g"
vfs.zfs.vdev.cache.size="32m"
vfs.zfs.vdev.cache.max="256m"
kern.maxfiles="950000"

Please note that at every chance they get, many people say you should never set "vm.kmem_size_max". So you probably should not set that one. But I have not had any problems with it so far.

peetaur · Feb 1, 2012

RusDyr,

I did some tests on an old computer: Dell PowerEdge 2850

Code:

    device     = 'Expandable RAID Controller (PERC 4e/Si and PERC 4e/Di)'
    class      = mass storage
    subclass   = RAID

Can't tell you exactly what the disks are, but they are 10k RPM SCSI.

Code:

6 disk gstripe UFS:
    write at 161356565 B/s, 
    read at 738257082 B/s (cached)
    unmount, remount and read 37920878 B/s (uncached, but bursty looking in gstat, with disks spending much time at 0 kbps 0% load)

6 disk zfs 'stripe':
    write at 159819737 B/s
    read at 1996571694 B/s (cached)
    playing with "primarycache" setting to read uncached: 224047484 (confirmed this works as uncached in gstat) 
    [EDIT: apparently I repasted the above read number here before, so fixed that with the correct value averaged 
    over 5 tests; and previous edit of 211382222 was with raidz1]

So it would look like generalizing and saying "UFS is twice as fast as ZFS" is not correct. So we should look into why your numbers seem that way. My above post about loader.conf is based on my experience, where ZFS is a terrible performer with low RAM (such as the default settings), even compared to other file systems with low RAM.

I was mainly testing the difference between a raid10 like setup and raidz1. Raidz1 is much faster writing sequentially, and same reading (as I hypothesized). But you were talking about UFS vs ZFS rather than gmirror vs ZFS. Can someone recommend the best way to test random read and write (to compare raidz1 and striped mirror config)? I am probably not interested in avoiding caching (disk test, such as what bonnie++ would do for me), only the 'real world' test (file system test), which would include caching, and sync writes.

Sebulon · Feb 1, 2012

@peetaur

I believe you are confusing RAID-levels:

RAID0 = 1x- or >1x no parity zfs stripe/d vdev/s (gstripe)

RAID1 = 1x zfs mirror vdev (gmirror)
RAID5 = 1x zfs raidz1 vdev
RAID6 = 1x zfs raidz2 vdev

RAID10 = >1x zfs mirror vdev (gstriped gmirrors)
RAID50 = >1x zfs raidz1 vdev
RAID60 = >1x zfs raidz2 vdev

You can, of course, make a raidz1 vdev with only two disks in it, but IIRC performance is better using mirrored vdevs.

/Sebulon

peetaur · Feb 1, 2012

I agree with you on those definitions. What part of my previous post was wrong? [and not listed is RAID 0+1 which as far as I know is not possible with pure zfs.]

I mentioned raid10 in the last paragraph, which I also tested, but didn't include the results in the code blocks above because (1) they are out of context of this "zfs vs UFS" performance question. I wanted to make my post shorter, since your thread is about NFS and ZILs, not comparing UFS.

and (2) I didn't want to bother creating gstriped gmirrors in my test (the UFS comparison of that otherwise incomparable test), because although related to this discussion, it is not related to what I want to do with this server I am building.

Here is my zfs stripe:

Code:

zpool create pool \
    gpt/pool1d1 gpt/pool1d2 \
    gpt/pool2d1 gpt/pool2d2 \
    gpt/pool3d1 gpt/pool3d2

Here is my UFS stripe:

Code:

gstripe create pool \
    gpt/pool1d1 gpt/pool1d2 \
    gpt/pool2d1 gpt/pool2d2 \
    gpt/pool3d1 gpt/pool3d2

newfs /dev/stripe/pool
mkdir /pool
mount /dev/stripe/pool /pool

Here is my zfs RAID10:

Code:

zpool create pool \
    mirror gpt/pool1d1 gpt/pool1d2 \
    mirror gpt/pool2d1 gpt/pool2d2 \
    mirror gpt/pool3d1 gpt/pool3d2

Sebulon · Feb 2, 2012

@peetaur

Yeah, a little OT, but I don't mind. This is what I reacted to:

"I was mainly testing the difference between a raid10 like setup and raidz1."

RAID10 and raidz1(RAID5/0) are nothing alike. You're comparing apples and pears. It's not useless, but untrue.

What you can benchmark instead is the difference between striped zfs mirrors and gstriped gmirrors. That would be a fun and true comparison.

"and not listed is RAID 0+1 which as far as I know is not possible with pure zfs."

Correct. You could perhaps create two gstripe-devices made up of N disks and create a zfs pool with one mirrored vdev using those two gstripes. Might also be a fun test.
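
A rough, untested sketch of what that layout might look like, using the same gpt/ labels as the pools earlier in the thread. The stripe names `half0` and `half1` and the `raid01_plan` function are made up for this example. By default it only prints the commands (a dry run); nothing here has been run on real disks.

```shell
#!/bin/sh
# Dry-run sketch of a RAID 0+1 layout: two gstripes mirrored by ZFS.
# Assumptions: stripe names half0/half1 are invented for illustration;
# the gpt/ labels are the ones used earlier in the thread.
# RUN defaults to "echo", so the commands are printed, not executed.
RUN=${RUN-echo}

raid01_plan() {
    # One stripe per "side" of the mirror, three disks each.
    $RUN gstripe create half0 gpt/pool1d1 gpt/pool2d1 gpt/pool3d1
    $RUN gstripe create half1 gpt/pool1d2 gpt/pool2d2 gpt/pool3d2
    # A single mirrored vdev built from the two stripe devices.
    $RUN zpool create pool mirror stripe/half0 stripe/half1
}

raid01_plan
```

Unlike striped mirrors, losing one disk here takes out a whole stripe, so ZFS would have to resilver half the pool rather than one disk.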

/Sebulon

peetaur · Feb 2, 2012

It is not untrue to compare apples and pears. It just depends on what you want. If you simply say "apples are better than pears", then you are mistaken, but if you say "apples suit my needs better than pears", then there is no problem, although not everyone can say the same.

What I am not sure of:

Do I need the excellent random read and write of RAID10?
Do I need the extra space or faster sequential speed from raidz1?

What I was testing was:

Is raidz1 slower than raid10 like everyone says, or does it match my hypothesis: sequentially it is always faster than raid10; equal to raid0 in read, and about 80% of raid0 in write (1/(disks-1) slower). In "random" operations I expect it to be slower, but I'm not sure how to test that... My plan for testing random access is to "make buildworld". I don't like benchmark tools; they seem to just test block-level access to a big randomly generated file, not a full file system (creating files, reading directories, etc.), which is more real world.
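
The write part of that hypothesis can be put into numbers with a very idealized model. It assumes identical disks, purely sequential I/O, and that redundancy (one parity block per disks-1 data blocks for raidz1, a second copy for raid10) is the only overhead. `seq_write_model` is a hypothetical helper name, and real pools will deviate from this.

```shell
#!/bin/sh
# Idealized sequential-write model for an n-disk pool.
# Assumes identical disks and that parity or mirror copies are the
# only overhead; seq_write_model is a made-up helper name.
seq_write_model() {
    awk -v n="$1" 'BEGIN {
        # raidz1 writes 1 parity block for every n-1 data blocks
        printf "raidz1 extra raw writes: %.1f%%\n", 100 / (n - 1)
        # so n-1 of the n disks carry user data at any time
        printf "raidz1 vs raid0 write throughput: %.1f%%\n", 100 * (n - 1) / n
        # two-way mirrors write every block twice
        printf "raid10 vs raid0 write throughput: %.1f%%\n", 50
    }'
}

seq_write_model 6
```

For 6 disks the "1/(disks-1)" term is the 20% of extra raw writes, which works out to about 83% of raid0 throughput in this model, close to the 80% estimate.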

Others make stupid but intelligent sounding conclusions. So it is very difficult to figure out the whole truth just from reading.

And I am slowly getting fed up with ESXi.

<ESXi rant>
Today, the ESXi server was happily running while it said some vms are down and others are up. Pinging the "down" vms and using their web servers, etc. worked. So ESXi is reporting that they are "down" when really they are up. How is that even possible? One server that was "up" was responding VERY slowly, and others are reporting load >4 when idle. So I rebooted ESXi. Why should I need to reboot just to fix this? I feel like I'm running Windows... And commands like "top" and "vmstat" or even looking in /proc/... to find cpu stats isn't possible in the ESXi command line. (I'm guessing there is a way, but it is a mystery to me). Things like that make me want a real OS, not this incomplete VMware ESXi Busybox, with a semi-well documented GUI and completely obscure proprietary command line. I wanted to know the CPU usage of the vmware-vmx processes (or the ESXi equivalent), because it is a common problem for them to all hog 100% CPU and make the system crawl in other VMware products.

And a week ago, I wanted to create my first non-NFS virtual machine, and to my surprise, there was no local datasource. The path was "dead", it said. But the OS disk worked fine, so clearly the hardware was working. The BIOS RAID setup said things were optimal. A reboot fixed it, so I can only assume the hardware is fine, but how can I trust it in the future?

So my new plan is: Run my NAS and replication/backup server with lots of disk space. Run the VMs on separate FreeBSD + ZFS + VirtualBox machines, with the virtual disk files stored locally. Send replication snapshots to the backup server, and put large-volume stuff that demands low latency (which is most of what we do here) on the NAS directly. (Luckily, I am in charge of this, so I can decide whether or not to throw away ESXi; do you have the same control?) Another option would be to netboot the ESXi, or to run it in a VirtualBox. Both of those sound like bad hacks, and since I'm fed up with ESXi, I am leaning towards a VirtualBox solution.

So my "raid10 apples vs raidz1 pears" comparison is just to decide... do I want the significantly higher space and sequential performance from raidz (raidz1 in the case of this 6 disk machine, and maybe also the 4 disk machine that currently runs ESXi, and raidz2 for larger ones), or do I want the performance characteristics of the raid10 (~50% better random [did not test myself], ~33% slower sequential write (63MB/s vs 92MB/s) and equal read (204MB/s vs 193MB/s) [my own testing on the old PowerEdge 2850]).
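
The trade-off above can be checked with a bit of arithmetic on the quoted numbers. A small sketch, assuming equal-size disks and two-way mirrors; `pct_slower` and `usable_disks` are hypothetical helper names, and the MB/s figures are the ones from the PowerEdge 2850 test.

```shell
#!/bin/sh
# pct_slower and usable_disks are made-up helper names.

# How many percent slower is rate a than rate b?
pct_slower() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (b - a) / b }'
}

# Disks' worth of usable space, assuming equal-size disks.
usable_disks() {
    case "$1" in
        raid10) echo $(( $2 / 2 )) ;;   # two-way mirrors: half the disks
        raidz1) echo $(( $2 - 1 )) ;;   # one disk's worth of parity
        raidz2) echo $(( $2 - 2 )) ;;   # two disks' worth of parity
        *) echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

echo "raid10 sequential write is $(pct_slower 63 92)% slower than raidz1"
echo "raidz1 sequential read is $(pct_slower 193 204)% slower than raid10"
echo "6 disks: raid10 keeps $(usable_disks raid10 6), raidz1 keeps $(usable_disks raidz1 6)"
```

With these numbers the write gap comes out nearer 32% than 33%, the read figures differ by about 5%, and on 6 disks raidz1 leaves five disks of space against three for raid10.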

So far, I think I will choose raidz1 for the faster sequential writes. I think for my needs, faster transfers over the network are more important than random performance (like for a database, or compiling things). (Maybe I will compare random speeds by running make buildworld)

We are getting further off the topic of ZILs... but I think what you are interested in is virtual machine performance... so I guess we are slightly on topic. And let me know if you prefer less long-winded replies in the future.