Bad performance with multipath FC

Hello,

As it's my first post here on this board, a little background on myself:
I'm quite new to BSD, but I've been a Linux user for about 14 years and a network/sysadmin for around 5 years. I used mainly Debian since the sarge/etch days; thanks to systemd I have now switched to Devuan. Over the years I glanced at the different BSDs several times, but never found the time to do it properly.
Thanks to ZFS and the pfSense box I've been running on my network for about 2 months, my interest in FreeBSD got quite a boost, and over the last few weeks I finally found the time to dive in a little deeper - and I really like it :)

I currently have FreeBSD 10.3 installed on a diskless test machine, which boots from FC targets. The system is an old-ish Intel S5000 with a single Xeon L5410, 8GB of FB-ECC RAM, and a QLogic QLE2462 4Gbit HBA.
The targets are zvols on my Devuan storage server running ZFS on Linux, with 16GB RAM, 3 mirror vdevs, and ZIL + L2ARC on 2 SSDs. The mirror devices and SSDs are spread over 2 SAS controllers, so the performance of the pool is rather overkill for my small home network...
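For context, the pool has roughly this shape - a sketch only, the pool and device names below are placeholders, not the ones actually used:
Code:
# 3 mirror vdevs, mirrored ZIL and striped L2ARC across two SSDs
zpool create tank \
    mirror sda sdb \
    mirror sdc sdd \
    mirror sde sdf \
    log mirror sdg1 sdh1 \
    cache sdg2 sdh2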

After a little research, the FreeBSD setup on a multipath FC target with ZFS went really smoothly - even without much BSD experience it was much faster and more stable than on Linux, where dm-multipath is constantly blowing up...
The only issue I ran into with FreeBSD is the secondary GPT table overwriting the gmultipath labels (or the labels overwriting the secondary GPT...). As a workaround I had to set up the multipath labels on each partition individually, roughly as sketched below.
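A sketch only - the partition and label names here are made up:
Code:
# Label each partition instead of the whole disk, so the gmultipath metadata
# (stored in the provider's last sector) doesn't collide with the backup GPT.
gmultipath label -v boot /dev/da0p1 /dev/da2p1
gmultipath label -v root /dev/da0p2 /dev/da2p2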

Sadly, the performance is quite bad. And by bad I mean only ~30% of what I get with a Devuan Jessie install on the same box. (Well, until dm-multipath shoots itself in the foot again and I can only use the individual paths...)

For the test I created a new 20GB zvol and exported it to the client, so I'm not measuring against the LUN with the system on it.
For I/O measurement I'm using this tool: https://github.com/cxcv/iops
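For reference, the test zvol was created with something along these lines on the storage box (pool and dataset names are placeholders):
Code:
# plain 20GB zvol, later exported as a LUN via SCST
zfs create -V 20G tank/fbsd-test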

Performance on Devuan Jessie with active/active multipath was quite good out of the box:
Code:
# ./iops /dev/mapper/mpathc
/dev/mapper/mpathc,  21.47 GB, 32 threads, random:
 512  B blocks: 138146.1 IO/s,  70.7 MB/s (565.8 Mbit/s)
   1 kB blocks: 139215.5 IO/s, 142.6 MB/s (  1.1 Gbit/s)
   2 kB blocks: 137167.2 IO/s, 280.9 MB/s (  2.2 Gbit/s)
   4 kB blocks: 140778.5 IO/s, 576.6 MB/s (  4.6 Gbit/s)
   8 kB blocks: 89767.0 IO/s, 735.4 MB/s (  5.9 Gbit/s)
  16 kB blocks: 49945.3 IO/s, 818.3 MB/s (  6.5 Gbit/s)
  32 kB blocks: 25116.8 IO/s, 823.0 MB/s (  6.6 Gbit/s)
  65 kB blocks: 12702.6 IO/s, 832.4 MB/s (  6.7 Gbit/s)
 131 kB blocks: 6203.5 IO/s, 813.1 MB/s (  6.5 Gbit/s)
 262 kB blocks: 2008.3 IO/s, 526.5 MB/s (  4.2 Gbit/s)
 524 kB blocks: 1193.9 IO/s, 625.9 MB/s (  5.0 Gbit/s)
   1 MB blocks:  418.8 IO/s, 439.1 MB/s (  3.5 Gbit/s)
   2 MB blocks:  161.9 IO/s, 339.6 MB/s (  2.7 Gbit/s)
   4 MB blocks:   90.3 IO/s, 378.9 MB/s (  3.0 Gbit/s)
   8 MB blocks:   46.8 IO/s, 392.5 MB/s (  3.1 Gbit/s)
  16 MB blocks:   23.2 IO/s, 389.4 MB/s (  3.1 Gbit/s)
This should be the PCIe 1.1 x4 interface of the HBA being maxed out at just over 800MB/s, given a theoretical maximum throughput of ~1000MB/s before protocol overhead. So absolutely nothing to complain about here.
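For reference, the rough math behind that ceiling:
Code:
# PCIe 1.1: 2.5 GT/s per lane with 8b/10b encoding -> ~250 MB/s usable per lane
# x4 link:  4 * 250 MB/s = ~1000 MB/s, before FC/SCSI protocol overhead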

But with FreeBSD the numbers are rather bad:
Code:
# ./iops /dev/multipath/test
/dev/multipath/test,  21.47 GB, 32 threads, random:
 512  B blocks: 30555.5 IO/s,  15.6 MB/s (125.2 Mbit/s)
  1 kB blocks: 29949.8 IO/s,  30.7 MB/s (245.3 Mbit/s)
  2 kB blocks: 29921.4 IO/s,  61.3 MB/s (490.2 Mbit/s)
  4 kB blocks: 29601.1 IO/s, 121.2 MB/s (970.0 Mbit/s)
  8 kB blocks: 24925.1 IO/s, 204.2 MB/s (  1.6 Gbit/s)
  16 kB blocks: 17865.4 IO/s, 292.7 MB/s (  2.3 Gbit/s)
  32 kB blocks: 10089.4 IO/s, 330.6 MB/s (  2.6 Gbit/s)
  65 kB blocks: 5067.6 IO/s, 332.1 MB/s (  2.7 Gbit/s)
 131 kB blocks: 2535.4 IO/s, 332.3 MB/s (  2.7 Gbit/s)
 262 kB blocks: 1271.0 IO/s, 333.2 MB/s (  2.7 Gbit/s)
 524 kB blocks:  634.1 IO/s, 332.4 MB/s (  2.7 Gbit/s)
  1 MB blocks:  316.7 IO/s, 332.1 MB/s (  2.7 Gbit/s)
  2 MB blocks:  153.9 IO/s, 322.8 MB/s (  2.6 Gbit/s)
  4 MB blocks:  63.0 IO/s, 264.2 MB/s (  2.1 Gbit/s)
  8 MB blocks:  30.0 IO/s, 251.7 MB/s (  2.0 Gbit/s)

Changing the multipath mode from Active/Active to Active/Read (see below) decreases performance even more.
Not only is the throughput much lower, the IOPS are just hopelessly bad - 140k vs. 30k at 4kB blocksize...
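For reference, switching the mode on an existing gmultipath device works roughly like this (using the device name from the test):
Code:
gmultipath configure -A test   # Active/Active: read and write over all paths
gmultipath configure -R test   # Active/Read: write over one path, read over all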

During the test the CPU load is ~20% on both OSes. Local caching shouldn't affect the measurement, as the tool syncs everything directly to disk. RAM usage stays low on both installs and only fluctuates by a few MB, so apparently neither OS is "cheating" here.

One thing I noticed was camcontrol reporting a blocksize of 512 bytes for the LUNs:
Code:
# camcontrol readcap /dev/da4 -b
Block Length: 512 bytes
Does the blocksize detected by CAM affect any internal command queuing, or how the system handles I/O to/from the device?
If the system tries to access 512-byte blocks, this might add just enough overhead to explain the bad performance...

Is it possible to manually set the blocksize of a device? camcontrol only seems to offer reading the blocksize (or I overlooked it in the manpage...)
Are there any other places to look at or knobs to tweak?

Thanks,
Sebastian
 
camcontrol readcap only reports what the device tells it; it is up to your device what to report. FreeBSD properly detects both logical and physical sector sizes when they are reported by the device. ZFS can live well with sector sizes up to 8KB. There is no way to manually set sector sizes in CAM, but it is possible to tune the minimal ashift ZFS uses during new pool creation via a sysctl.
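The sysctl in question is presumably vfs.zfs.min_auto_ashift, e.g. to force at least 8k sectors on newly created vdevs:
Code:
# 2^13 = 8192-byte minimum sector size for new vdevs
sysctl vfs.zfs.min_auto_ashift=13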
 
The ZFS pool was set up with an 8k sector size, of course. The partitions on the installation LUN were also aligned to 8k sectors. But the performance was measured against a "naked" LUN with no partition table or filesystem on it.

I manually set the blocksize reported by SCST to 8k (config sketched below), and camcontrol now reports the correct blocksize:
Code:
# camcontrol readcap /dev/da2
Last Block: 131071, Block Length: 8192 bytes
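The change on the SCST side looks roughly like this - a sketch only, device and zvol names are placeholders and the rest of the target configuration is omitted:
Code:
# /etc/scst.conf fragment: export the zvol with 8k logical blocks
HANDLER vdisk_blockio {
        DEVICE test {
                filename /dev/zvol/tank/fbsd-test
                blocksize 8192
        }
}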
As it turned out, with blocksizes other than 512 the iops script fails on FreeBSD. I have already contacted the author about this issue.

To see if the blocksize is actually the root cause of the performance issues, I used diskinfo -tc to get a rough idea of whether anything improves. First with the 512-byte blocksize:

Code:
# diskinfo -tc /dev/multipath/test
/dev/multipath/test
  512  # sectorsize
  21474836480  # mediasize in bytes (20G)
  41943040  # mediasize in sectors
  8192  # stripesize
  0  # stripeoffset
  2610  # Cylinders according to firmware.
  255  # Heads according to firmware.
  63  # Sectors according to firmware.
  d404a3d2  # Disk ident.

I/O command overhead:
  time to read 10MB block  0.038979 sec  =  0.002 msec/sector
  time to read 20480 sectors  1.366110 sec  =  0.067 msec/sector
  calculated command overhead  =  0.065 msec/sector

Seek times:
  Full stroke:  250 iter in  0.017895 sec =  0.072 msec
  Half stroke:  250 iter in  0.018014 sec =  0.072 msec
  Quarter stroke:  500 iter in  0.035301 sec =  0.071 msec
  Short forward:  400 iter in  0.028349 sec =  0.071 msec
  Short backward:  400 iter in  0.027527 sec =  0.069 msec
  Seq outer:  2048 iter in  0.135102 sec =  0.066 msec
  Seq inner:  2048 iter in  0.133832 sec =  0.065 msec
Transfer rates:
  outside:  102400 kbytes in  0.377660 sec =  271143 kbytes/sec
  middle:  102400 kbytes in  0.373138 sec =  274429 kbytes/sec
  inside:  102400 kbytes in  0.371721 sec =  275475 kbytes/sec

And with the 8k blocksize:
Code:
# diskinfo -tc /dev/multipath/test
/dev/multipath/test
  8192  # sectorsize
  21474836480  # mediasize in bytes (20G)
  2621440  # mediasize in sectors
  0  # stripesize
  0  # stripeoffset
  163  # Cylinders according to firmware.
  255  # Heads according to firmware.
  63  # Sectors according to firmware.
  d404a3d2  # Disk ident.

I/O command overhead:
  time to read 10MB block  0.038540 sec  =  0.002 msec/sector
  time to read 20480 sectors  2.030732 sec  =  0.099 msec/sector
  calculated command overhead  =  0.097 msec/sector

Seek times:
  Full stroke:  250 iter in  0.025881 sec =  0.104 msec
  Half stroke:  250 iter in  0.025766 sec =  0.103 msec
  Quarter stroke:  500 iter in  0.050826 sec =  0.102 msec
  Short forward:  400 iter in  0.040845 sec =  0.102 msec
  Short backward:  400 iter in  0.039267 sec =  0.098 msec
  Seq outer:  2048 iter in  0.192181 sec =  0.094 msec
  Seq inner:  2048 iter in  0.214216 sec =  0.105 msec
Transfer rates:
  outside:  102400 kbytes in  0.374317 sec =  273565 kbytes/sec
  middle:  102400 kbytes in  0.376855 sec =  271723 kbytes/sec
  inside:  102400 kbytes in  0.375223 sec =  272904 kbytes/sec

The seek times are slightly higher, as expected, but the transfer rates are still the same (or even slightly lower). So the blocksize FreeBSD assumes/uses for the device is not what's causing the low performance...
 