ZFS and no ZIL / LOG reported usage

We're attempting a test ZFS+HAST+CARP+iSCSI build, and before we get too far down the pipeline we wanted to just reel back a little and start with ZFS and some base benchmarks with a solo (single head) HAST configuration (not replicating YET!).

Some details:
2 Head nodes (6 disks) - FreeBSD 9.1-RELEASE
2 JBODS (45 disks)

Disk layout:
Head: 2xSATA (zfs root)
Head: 4xSSD (cache)
JBOD: 4xSSD 6gbps (zil/log)
JBOD: 41xSAS 6gbps

For the SAS disks I trimmed a little off the end, started the partitions at -b 2048, and aligned them for 4K sectors (even though the disks are 512-byte sector, it doesn't hurt much and future disks will likely be 4K). The SSDs also start at -b 2048 and are provisioned to ~180G (80%), all of them.
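
For reference, the gpart commands look roughly like this (device names and slot labels here are placeholders, not the exact ones used):

Code:
# SAS data disk: GPT, partition starting at sector 2048 (2048 * 512 B = 1 MiB, so 4K-aligned);
# in reality the partition size is also capped slightly below the full disk
gpart create -s gpt da10
gpart add -t freebsd-zfs -b 2048 -l jbod-slotNN da10

# SSD for log/cache: same starting offset, sized to ~180G (~80%) to leave spare area
gpart create -s gpt da2
gpart add -t freebsd-zfs -b 2048 -s 180G -l h1slotNN da2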

During some cursory testing we're seeing pretty poor performance (considering the hardware involved) reported by iozone, but the alarming discovery was the complete lack of ZIL usage. I see NOTHING on the log device. The zpool I built is 5 x 8-device raidz2, so this should pretty much scream on IO.
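
For the record, the pool was created along these lines (paraphrasing, but this is the shape of it):

Code:
zpool create zhast1 \
    raidz2 hast/jbod-slot04 hast/jbod-slot05 hast/jbod-slot06 hast/jbod-slot07 \
           hast/jbod-slot08 hast/jbod-slot09 hast/jbod-slot10 hast/jbod-slot11 \
    raidz2 hast/jbod-slot12 hast/jbod-slot13 hast/jbod-slot14 hast/jbod-slot15 \
           hast/jbod-slot16 hast/jbod-slot17 hast/jbod-slot18 hast/jbod-slot19 \
    raidz2 hast/jbod-slot20 hast/jbod-slot21 hast/jbod-slot22 hast/jbod-slot23 \
           hast/jbod-slot24 hast/jbod-slot25 hast/jbod-slot26 hast/jbod-slot27 \
    raidz2 hast/jbod-slot28 hast/jbod-slot29 hast/jbod-slot30 hast/jbod-slot31 \
           hast/jbod-slot32 hast/jbod-slot33 hast/jbod-slot34 hast/jbod-slot35 \
    raidz2 hast/jbod-slot36 hast/jbod-slot37 hast/jbod-slot38 hast/jbod-slot39 \
           hast/jbod-slot40 hast/jbod-slot41 hast/jbod-slot42 hast/jbod-slot43 \
    log hast/jbod-slot00 \
    cache gpt/h1slot02 gpt/h1slot03 gpt/h1slot04 gpt/h1slot05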

I wanted to see what might be going on underneath. I've built similar setups on Linux, and this is my first attempt at a FreeBSD build of this kind.

The hardware is certainly highly capable, so I'm trying to understand which settings affect ZIL usage, and how to figure out why it isn't being used.

I do see CACHE usage, just not ZIL usage.
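
I'm watching the vdev activity while the tests run with something like:

Code:
zpool iostat -v zhast1 1

and the cache SSDs show steady write traffic while the log device sits at zero.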

Here are some current tuning settings:
Code:
sysctl ---
vfs.zfs.prefetch_disable: 1
vfs.zfs.txg.timeout: 5
kern.maxvnodes: 1605403
vfs.zfs.write_limit_override: 0
vfs.zfs.l2arc_headroom: 8
vfs.zfs.l2arc_write_max: 597688320
vfs.zfs.l2arc_write_boost: 597688320
vfs.zfs.l2arc_noprefetch: 0

zfs properties ---
NAME    PROPERTY  VALUE     SOURCE
zhast1  sync      standard  default
NAME    PROPERTY  VALUE  SOURCE
zhast1  atime     off    local
NAME    PROPERTY              VALUE                 SOURCE
zhast1  zfs:zfs_nocacheflush  1                     local

Certainly happy to provide more information on request.
 
IIRC an 8-device raidz2 is not a recommended configuration; the recommended widths are 2^n + 2, i.e. 6, 10.

As for the ZIL, it is used only for synchronous write operations, which are issued a lot by NFS, databases and some other software.
 
Try using the formula below when building RAIDZ pools:

Code:
RAID-Z1 = 2^n + 1 disks, e.g. 3, 5, 9
RAID-Z2 = 2^n + 2 disks, e.g. 4, 6, 10
RAID-Z3 = 2^n + 3 disks, e.g. 5, 7, 11

How much memory does this system have?
 
mav@ said:
IIRC an 8-device raidz2 is not a recommended configuration; the recommended widths are 2^n + 2, i.e. 6, 10.

As for the ZIL, it is used only for synchronous write operations, which are issued a lot by NFS, databases and some other software.

My testing in the past has shown no performance benefit from following the common RAID parity math, and in fact ZFS appears to throw it out the window completely. As for the ZIL - yes, I am well aware that it only comes into play on synchronous writes... and on my ZFS on Linux system I do see the ZIL used during these IOZone tests. There are five tests that it runs:
  1. Sequential read
  2. Sequential write
  3. Re-read
  4. Random write
  5. Random read

I should see some activity on the ZIL device during a portion of these repeated tests, but I do not. Granted, I'm not very familiar with FreeBSD and don't fully understand the characteristics of UFS, but I would think I'd see sequential/synchronous writes during these tests, given that my other test system shows exactly that activity.

RE: @gkontos

Yes, I'm well aware of the math regarding RAID optimization, but as I mentioned above, my testing on other ZFS systems has shown no benefit from it. We're certainly running the numbers during these tests so I have a chance to see it again, but thus far the results look pretty poor with regard to sequential performance.

Are there flags for dd on FreeBSD to test this? From my initial attempts, dd with conv=sync, or with 'sync' at the end of the command, does not appear to do anything. Downloading a DVD to the zvol also doesn't appear to do anything. Is UFS my issue here?
 
So zpool construction aside, I am re-running iozone with sync declared:

example: http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013988.html

Code:
        SYNC Mode.
        Include close in write timing
        Record Size 128 KB
        File size set to 4194304 KB
        Command line used: iozone -o -c -t 8 -r 128k -s 4G -F /mnt/io-synctest/test1.txt /mnt/io-synctest/test2.txt /mnt/io-synctest/test3.txt /mnt/io-synctest/test4.txt /mnt/io-synctest/test5.txt /mnt/io-synctest/test6.txt /mnt/io-synctest/test7.txt /mnt/io-synctest/test8.txt
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 8 processes
        Each process writes a 4194304 Kbyte file in 128 Kbyte records

Here are the results of zpool iostat -v, watching while the test is running:

Code:
                         capacity     operations    bandwidth
pool                  alloc   free   read  write   read  write
--------------------  -----  -----  -----  -----  -----  -----
zhast1                49.7G  72.5T      0  21.3K      0   167M
  raidz2              9.97G  14.5T      0  4.29K      0  33.7M
    hast/jbod-slot04      -      -      0     69      0  6.33M
    hast/jbod-slot05      -      -      0     70      0  6.33M
    hast/jbod-slot06      -      -      0     69      0  6.33M
    hast/jbod-slot07      -      -      0     69      0  6.33M
    hast/jbod-slot08      -      -      0     69      0  6.33M
    hast/jbod-slot09      -      -      0     69      0  6.33M
    hast/jbod-slot10      -      -      0     69      0  6.33M
    hast/jbod-slot11      -      -      0     69      0  6.33M
  raidz2              9.96G  14.5T      0  4.28K      0  33.7M
    hast/jbod-slot12      -      -      0     67      0  6.32M
    hast/jbod-slot13      -      -      0     67      0  6.32M
    hast/jbod-slot14      -      -      0     66      0  6.32M
    hast/jbod-slot15      -      -      0     66      0  6.32M
    hast/jbod-slot16      -      -      0     68      0  6.32M
    hast/jbod-slot17      -      -      0     66      0  6.32M
    hast/jbod-slot18      -      -      0     67      0  6.32M
    hast/jbod-slot19      -      -      0     67      0  6.32M
  raidz2              9.90G  14.5T      0  4.17K      0  32.7M
    hast/jbod-slot20      -      -      0     57      0  6.15M
    hast/jbod-slot21      -      -      0     57      0  6.15M
    hast/jbod-slot22      -      -      0     57      0  6.15M
    hast/jbod-slot23      -      -      0     56      0  6.11M
    hast/jbod-slot24      -      -      0     57      0  6.15M
    hast/jbod-slot25      -      -      0     57      0  6.15M
    hast/jbod-slot26      -      -      0     57      0  6.15M
    hast/jbod-slot27      -      -      0     56      0  6.15M
  raidz2              9.95G  14.5T      0  4.27K      0  33.6M
    hast/jbod-slot28      -      -      0     59      0  6.31M
    hast/jbod-slot29      -      -      0     60      0  6.31M
    hast/jbod-slot30      -      -      0     61      0  6.31M
    hast/jbod-slot31      -      -      0     61      0  6.31M
    hast/jbod-slot32      -      -      0     61      0  6.31M
    hast/jbod-slot33      -      -      0     60      0  6.31M
    hast/jbod-slot34      -      -      0     60      0  6.31M
    hast/jbod-slot35      -      -      0     60      0  6.31M
  raidz2              9.97G  14.5T      0  4.27K      0  33.5M
    hast/jbod-slot36      -      -      0     72      0  6.29M
    hast/jbod-slot37      -      -      0     72      0  6.29M
    hast/jbod-slot38      -      -      0     71      0  6.29M
    hast/jbod-slot39      -      -      0     72      0  6.29M
    hast/jbod-slot40      -      -      0     72      0  6.29M
    hast/jbod-slot41      -      -      0     73      0  6.29M
    hast/jbod-slot42      -      -      0     73      0  6.29M
    hast/jbod-slot43      -      -      0     72      0  6.29M
logs                      -      -      -      -      -      -
  hast/jbod-slot00       4K   179G      0      0      0      0
cache                     -      -      -      -      -      -
  gpt/h1slot02        5.95G   173G      0    163      0  19.2M
  gpt/h1slot03        5.95G   173G      0    185      0  21.9M
  gpt/h1slot04        6.00G   173G      0    189      0  22.4M
  gpt/h1slot05        5.96G   173G      0    162      0  19.2M
--------------------  -----  -----  -----  -----  -----  -----

I'm not seeing any synchronous write activity on the ZIL device.
 
gkontos said:
Try using the formula below when building RAIDZ pools:

Code:
RAID-Z1 = 2^n + 1 disks, e.g. 3, 5, 9
RAID-Z2 = 2^n + 2 disks, e.g. 4, 6, 10
RAID-Z3 = 2^n + 3 disks, e.g. 5, 7, 11

How much memory does this system have?

96G of memory... plenty
 
What mav@ is saying is that the ZIL is used only for writes that the application or service wants done in a synchronous manner, to guarantee data integrity in case of a system crash. If none of your applications or services request synchronous writes, the ZIL won't be used at all.
 
kpa said:
What mav@ is saying is that the ZIL is used only for writes that the application or service wants done in a synchronous manner, to guarantee data integrity in case of a system crash. If none of your applications or services request synchronous writes, the ZIL won't be used at all.

I'm aware of this.

The IOZone tests that I'm performing are synchronous write tests... which should show ZIL activity, but I'm not seeing any.

Is there another test I can perform that will show this, or show the lack thereof?
 
dd's conv=sync is not related to synchronous writes; it is about handling some special cases of reading (it pads short input blocks to the full block size).

I can't say for sure about iozone, but if it just writes into the files with write(2), then it is not a synchronous write, as filesystem code is allowed to postpone writes indefinitely. I guess [cmd=]zfs set sync=... ...[/cmd] may affect this.
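
If you want a quick way to confirm whether the log device gets used at all, something like this should do it (the dataset name is just an example, and I'm assuming the default mountpoint):

Code:
# force every write on a scratch dataset to be synchronous
zfs create zhast1/synctest
zfs set sync=always zhast1/synctest

# generate some writes...
dd if=/dev/zero of=/zhast1/synctest/testfile bs=128k count=10000

# ...while watching the log vdev in another terminal
zpool iostat -v zhast1 1

# clean up afterwards
zfs inherit sync zhast1/synctest
zfs destroy zhast1/synctest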
 
@mengesb

Are you running the iozone tests on the storage server itself, or are you running them from a "client" system with a share/LUN attached? If there is any iSCSI involved, just letting you know that istgt isn't synchronous by default like NFS is; you need to, like @mav@ said, [CMD=]zfs set sync=always foo/bar[/CMD] for that to happen. And as @AndyUKG also pointed out, gstat is a great tool for real-time IO monitoring, but as it shows one row per disk, partition and label, I usually apply this filter to get a better overview of just the gpt labels: # gstat -f gpt\/

In your case you might want # gstat -f hast\/
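
And since the filter is a regular expression, something like this should show both the HAST resources and the gpt labels at once:

Code:
gstat -f "hast/|gpt/"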

/Sebulon
 
I'll have a look at gstat. Indeed, I understand that it is not locally synchronous, but iSCSI should obey synchronous write requests, no? If not, then I don't see the point of having a ZIL/log device on a zpool when using iSCSI. With my ZoL box, I definitely see synchronous writes coming across the iSCSI service to the exported zvols.
 
@mengesb

As implementations vary, you should set sync=always on the dataset regardless of what service is using it, if that is what you want.

/Sebulon
 
mengesb said:
I'll have a look at gstat. Indeed, I understand that it is not locally synchronous, but iSCSI should obey synchronous write requests, no? If not, then I don't see the point of having a ZIL/log device on a zpool when using iSCSI. With my ZoL box, I definitely see synchronous writes coming across the iSCSI service to the exported zvols.

From an iSCSI perspective ZFS ARC is equivalent to HDD cache, just a bit bigger. Unless you disable caching, iSCSI writes will go to the ZFS ARC and wait there for another flush. The iSCSI SYNCHRONIZE CACHE command should (at least theoretically) cause ZFS to initiate an immediate synchronous write and wait for its completion. That should be a more efficient way to use resources, compared to burning your ZIL with every single write operation.
 
mav@ said:
From an iSCSI perspective ZFS ARC is equivalent to HDD cache, just a bit bigger. Unless you disable caching, iSCSI writes will go to the ZFS ARC and wait there for another flush. The iSCSI SYNCHRONIZE CACHE command should (at least theoretically) cause ZFS to initiate an immediate synchronous write and wait for its completion. That should be a more efficient way to use resources, compared to burning your ZIL with every single write operation.

Yeah, I understood and thought that the initiator would issue the sync and it would show up on the underlying devices. Unfortunately, I've seen this be a problem with the target software, not the initiator. It would seem that istgt doesn't honour this command, and as such I don't see ZIL/log usage until I set the dataset to [CMD=]zfs set sync=always pool/dataset[/CMD] (which isn't what I want).

As I saw with some Linux tests, IET iSCSI didn't show sync writes despite the initiator end running sync tests. It wasn't until I changed to targetcli (by RisingTide) that I saw the sync tests use the ZIL/log. Are there other iSCSI target software options on FreeBSD?
 
Resurrecting this to say that it's also possible for the log device to sit idle if you have logbias=throughput. I accidentally left it set to throughput globally during some testing, and the ZIL device was completely inactive.
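
Easy enough to check and put back; something along these lines (using the pool name from this thread as an example):

Code:
# show any datasets where logbias has been set locally
zfs get -r -s local logbias zhast1

# latency (the default) lets sync writes use the separate log device again
zfs set logbias=latency zhast1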
 