ZFS: Troubleshooting a ZFS storage pool

I am looking for advice on how to identify performance issues with a ZFS storage server. The server is a prototype that I have built as a proof of concept for a NAS device that will allow me to offload camera memory cards to shared storage. We are talking around 512 GB transfers with checksum verification performed by the client. I have a fair bit of Linux experience, but opted for FreeBSD for this build because of its native support for ZFS.

Details:
FreeBSD 12.1
Intel Xeon E3-1225
Intel S1200BTS motherboard
16 GB DDR3 RAM
Dell H310 flashed to IT mode (installed in an x16 slot)
Aquantia AQC107 10GbE network card
6x 2TB HGST Ultrastar SAS drives (configured into one striped, RAID 0 storage pool)
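
For reference, the pool is just a plain stripe across the six drives. It was created along these lines (the pool name and device nodes here are illustrative rather than the exact command):

# Striped (RAID 0) pool across all six SAS drives; no redundancy
zpool create tank da0 da1 da2 da3 da4 da5
zpool status tank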

Initial Expectations
I was hoping that the server would run at around 900 MB/s read and write over 10GbE.

Testing and Results
I have found it difficult to measure pool speed over 10GbE as the tools that I usually use have not been able to give me consistent results. I have tested with AJA System Test and Blackmagic Disk Speed Test and get different results from each.
AJA: 140 MB/s write / 80 MB/s read
Blackmagic: 450 MB/s write / 80 MB/s read

In actual use, offloading camera memory cards to the server with the Hedge offloading software, a 10 GB transfer averages around 170 MB per second.

I found these results strange and quite low compared to my expectations. I have tested the transfer speeds against a RAM disk and got around 1 GB per second each way, which rules out the network connection itself.

Questions
  1. Why does the benchmarking software report different speeds?
  2. What is the best way of identifying the bottleneck in this situation? My initial thought is that it may be the Dell H310.
  3. Will configuration and caching help in this case, or is the issue purely hardware related?
I think I have included everything that will be useful to solve this issue. Please forgive me if anything has been omitted.
H
 
Looking at gstat -po while running the transfer can be informative, especially the busy %. I’d start there to see what the drives are seeing.
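
For example (watch it while a copy is in flight):

# -p: show only physical providers (the disks); -o: also count "other" ops such as cache flushes (BIO_FLUSH)
gstat -po
# Columns worth watching: w/s (writes per second), kBps (throughput), ms/w (average write latency),
# o/s (flush-type ops per second) and %busy (how saturated each drive is).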

I worry about the fact that you are saving data with client checksum (which sounds like the data is important to you) but with no redundancy. If this is an extra backup copy, perhaps that is ok, but certainly not if this is the main archive. This doesn’t address your question, but I thought I should point it out.
 
Thanks for the reply. This storage pool is one of several backups. As this is a prototype server, I wanted to establish a baseline for the speed I could get with RAID 0. If it ever reaches a stage where it is good enough to run in production, I will probably run with a raidz configuration of some kind.

I ran your command whilst a transfer was in progress and got the following output:


dT: 1.010s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    o/s   ms/o   %busy Name
    0      9      0      0    0.0      9    246    0.5      0    0.0     0.1| ada0
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| ada1
    0    455      0      0    0.0    453  48784   50.7      2   72.9    48.4| da0
    0    445      0      0    0.0    442  47864   48.2      3   59.1    66.1| da1
    0    457      0      0    0.0    454  50593   64.6      3   70.1    59.0| da2
    0    436      0      0    0.0    433  50725   35.7      3   65.4    54.7| da3
    0    407      0      0    0.0    405  47764   50.9      2   72.0    50.7| da4
    0    434      0      0    0.0    432  50652   53.5      2   62.5    49.3| da5


An interesting observation is that every second or so the output drops to all zeros, like this:


dT: 1.007s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    o/s   ms/o   %busy Name
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| ada0
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| ada1
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da0
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da1
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da2
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da3
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da4
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| da5


I also ran a test of the pool speed locally with bonnie++ and believe that my initial suspicion of the HBA being the bottleneck may be wrong:

[Attachment: Screenshot 2019-12-14 at 10.27.32.png (bonnie++ results)]

Am I correct in thinking that the block speed on the sequential output is the local write speed, i.e. 883 MB/s?
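
(For completeness, the local test was run roughly as follows; the directory is a placeholder for the pool's mount point, and the size is set to about twice the 16 GB of RAM so the ARC can't absorb the whole test.)

# -d: directory on the pool to test in, -s: total file size in MiB, -u: user to run as
bonnie++ -d /pool/testdir -s 32768 -u root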


Cheers,
H
 
In comparison, a test with gstat -po whilst bonnie++ was running gave the following results:


dT: 1.005s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    o/s   ms/o   %busy Name
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| ada0
    0      0      0      0    0.0      0      0    0.0      0    0.0     0.0| ada1
    0    169      0      0    0.0    164   3748    1.6      5   23.0    34.8| da0
    0    215      0      0    0.0    210   3609    1.1      5   40.3    40.7| da1
    0    157      0      0    0.0    153   3295    1.8      4   38.1    38.0| da2
    0    203      0      0    0.0    199   3633    1.3      4   47.5    44.5| da3
    0    157      0      0    0.0    153   3567    1.7      4   33.5    37.7| da4
    0    218      0      0    0.0    213   4089    1.6      5   29.7    46.3| da5


When testing this way, the output did not drop to zero every second or so like it did during the transfer over Ethernet.
 
Start testing with your desired production layout (raidz-n) NOW. The performance characteristics of a RAIDZ-n layout (with redundancy) are much different from those of a striped layout.
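
As a sketch, a six-disk double-parity layout for comparison testing could be created along these lines (pool name and device nodes are placeholders):

# RAIDZ2: survives two simultaneous drive failures; usable space of roughly four of the six drives
zpool create tank raidz2 da0 da1 da2 da3 da4 da5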

It looks to me like one (Blackmagic) is issuing much larger writes (~128 kB vs ~8 kB), based on the kBps and w/s readings from gstat. (I'm assuming these are both writing to the same ZFS dataset; you can set the maximum block write size via the recordsize setting, for example.) In addition, AJA is asking for a sync (flush to disk / FUA / fsync(); it goes by many names) with each write, while Blackmagic is not (one sync per write, based on the o/s and w/s fields in the snapshot above). If you want better statistics from gstat(8), you can use -I 5s for a 5-second average display, for example.
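
A rough sketch of both knobs (the dataset name is a placeholder; recordsize values above 128K need the large_blocks pool feature, and only affect newly written data):

# Check and, if the workload is large sequential files, raise the dataset record size
zfs get recordsize pool/dataset
zfs set recordsize=1M pool/dataset

# 5-second averages smooth out the bursty once-a-second pattern in the earlier output
gstat -po -I 5s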

What your transfer application will use for write sizes and/or fsync() behavior will depend on the application.

Dangerous suggestions ahead.

If you can't adjust the application, you can adjust how zfs(8) and Samba (in smb.conf(5)) deal with sync requests, which may help throughput if you're getting slower performance than some other application on the same hardware. (Note that IO size still matters, but small IOs + sync == slow performance without extra hardware like SLOG devices.)

Option 1: At the ZFS dataset level, you can zfs set sync=disabled pool/dataset/path on a per-dataset basis. Syncs are mainly of concern if (1) you care about your data and (2) you are concerned about a power/network outage. (Some would add system crashes, but I would argue that if there is a crash, all bets are off anyway.) This set of concerns is especially present for files that are required to keep your operating system running, or an important database consistent, for example. If this is an *extra* backup copy of bulk data -- and even better, if it is an extra copy and you have a battery backup -- then turning off sync to get the data transferred over more quickly is a fairly low-risk choice. The risks are mitigated by the fact that at most the last few seconds of writes are at risk in the event of an outage, and you have the source data that you can re-copy if something does go south. Also, "going south" here will be obvious -- a power outage or system crash. (Note that once you hit near-100% utilization on the drives, or saturate your network pipe, you'll be slowed down regardless of sync settings as buffers fill.)
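
A minimal sketch, using the same placeholder dataset path:

# Disable sync handling for this dataset only, confirm it took effect, and revert when done testing
zfs set sync=disabled pool/dataset/path
zfs get sync pool/dataset/path
zfs set sync=standard pool/dataset/path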

Option 2: At the Samba level, you can set strict sync = no in smb.conf if you don't want syncs coming from the client to be honored. I haven't played with this setting; I think I would go with Option 1 first, and only test with/without this if you didn't get enough benefit. It should enable sync requests to be "short-circuited" earlier in the write path.
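
In smb.conf that would look something like this (share name and path are placeholders; on FreeBSD the file is typically /usr/local/etc/smb4.conf):

[dit_share]
    # Do not pass client-requested syncs through to the filesystem
    path = /pool/dataset/path
    strict sync = no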

Option 3: You can disable ZFS's sending of force-unit-access commands to the drives. This is achieved via sysctl vfs.zfs.vdev.bio_flush_disable=1. This means that a write is considered complete once it has been delivered to the drive, and ZFS will never wait for the disk to flush data out of its write cache. It will still slow down as the write buffer fills -- it's not going to drop bits on the floor. If you have a battery backup (and potentially a redundant power supply) this isn't an altogether bad option, either. I personally would never set this without at least a battery backup. This will impact everything ZFS does with the disks, from directory operations to client sync() requests. (Although a non-disabled sync() will still make sure the data has at least reached the on-drive write cache in this case.)
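
For reference, that would be:

# Runtime setting (lost on reboot)
sysctl vfs.zfs.vdev.bio_flush_disable=1
# To make it persistent, add the same line to /etc/sysctl.conf
echo 'vfs.zfs.vdev.bio_flush_disable=1' >> /etc/sysctl.conf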

All of these come with risks; I've listed them in what I feel is least-to-highest risk.

Good luck!

Edit: fix zfs set command.
 
Thanks for this. The data that will be stored on here will also be stored on several other production drives. I work with an NVMe RAID as my working media storage, and that tends to fill up pretty quickly. My plan was to offload from that to this server in order to free up room on the NVMe RAID. By the time I start offloading onto this server, there is a strong possibility that the media will already have gone through an LTO backup at another facility.

Having experimented with the different configurations that you provided, I am now getting the following.

Just running with zfs set sync=disabled pool/dataset/path helped massively.

[Attachment: Screenshot 2019-12-15 at 18.09.28.png]



Am I right in thinking that running the pool with sync disabled means that any SSD caching is redundant?

You mentioned the block size that the different applications use to write files to the server. My understanding is that this is determined by each application?

zfs get recordsize pool returns:

NAME       PROPERTY    VALUE    SOURCE
dit_share  recordsize  128K     default

If I find out the block write size of the offloading application and match the ZFS record size to it, will that increase performance?

Could you offer any insight into why the read speed is much slower? I do have lz4 compression enabled; will that affect the read speed?

Thanks again,
H
 
Read performance likewise depends heavily on the application's behavior; ZFS does a lot of caching when memory is available. LZ4 will not be the bottleneck, but if you're saving video files, it's highly unlikely to gain you anything, so turning it off should have some marginal benefit in CPU utilization if nothing else. You can check the compressratio property to see if LZ4 is buying you anything with representative data.
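
For example (dataset name is a placeholder):

# A compressratio at or near 1.00x means LZ4 isn't finding anything compressible in the footage
zfs get compressratio pool/dataset
# Compression changes only apply to newly written data
zfs set compression=off pool/dataset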

You can check read performance with dd(1): dd bs=64k if=<path to test file> of=/dev/null. Note that if the file has been read recently, you'll just be benchmarking the cache. Watch gstat again during the process to make sure it is actually hitting the disks.
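
Something like this, with a placeholder path; pick a file larger than RAM, or one that hasn't been read since boot, so you aren't just reading from the ARC:

dd if=/pool/dataset/path/large-test-file.mov of=/dev/null bs=64k
# Press Ctrl+T while it runs for FreeBSD's SIGINFO progress line (bytes transferred and bytes/sec)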

If AJA is doing small transfers for the read benchmark, you may just be seeing the fastest Samba can service the requests. Watch gstat and top during the benchmark to see if anything jumps out.

Again, the benchmarks you are doing now do not guarantee any performance with a different drive layout, or with your actual application. Filesystem benchmarks are primarily beneficial to compare configurations, and even then, the variability from a wide variety of factors makes IO benchmarking a black art primarily useful for marketing.
 