Are virtual machines really so slow in terms of IOPS?

Hello, I've got an old Xeon 5530 server (which I'll upgrade to a 5680) that I plan to use for virtual machines.
Storage configuration:
1. 2 TB SATA disk
2. NVMe drive for ZIL/L2ARC

fio run on the host console:
Code:
abishai@alpha:/test % doas fio --randrepeat=1 --ioengine=sync --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [m(1)][75.0%][r=299MiB/s,w=99.2MiB/s][r=76.5k,w=25.4k IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=83097: Sun Jul 29 17:01:52 2018
   read: IOPS=71.8k, BW=280MiB/s (294MB/s)(768MiB/2738msec)
   bw (  KiB/s): min=138019, max=331536, per=97.63%, avg=280253.00, stdev=81043.62, samples=5
   iops        : min=34504, max=82884, avg=70063.00, stdev=20261.20, samples=5
  write: IOPS=23.0k, BW=93.7MiB/s (98.2MB/s)(256MiB/2738msec)
   bw (  KiB/s): min=46131, max=110634, per=97.47%, avg=93481.40, stdev=26987.84, samples=5
   iops        : min=11532, max=27658, avg=23369.80, stdev=6747.06, samples=5
  cpu          : usr=27.29%, sys=72.71%, ctx=44, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=196498,65646,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=280MiB/s (294MB/s), 280MiB/s-280MiB/s (294MB/s-294MB/s), io=768MiB (805MB), run=2738-2738msec
  WRITE: bw=93.7MiB/s (98.2MB/s), 93.7MiB/s-93.7MiB/s (98.2MB/s-98.2MB/s), io=256MiB (269MB), run=2738-2738msec

Virtual machine with a zvol backend (UFS inside the guest):
Code:
Jul 29 16:54:50: bhyveload -c /dev/nmdm1A -m 4096M -e autoboot_delay=3 -d /dev/zvol/zdata/bhyve/test2/disk0 test2
Jul 29 16:54:53:  [bhyve options: -c 1 -m 4096M -AHP -U 885b274f-934d-11e8-ba9a-bcaec547a985]
Jul 29 16:54:53:  [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,virtio-blk,/dev/zvol/zdata/bhyve/test2/disk0 -s 5:0,virtio-net,tap1,mac=58:9c:fc:02:a1:30]
Jul 29 16:54:53:  [bhyve console: -l com1,/dev/nmdm1A]
Code:
abishai@test2:/tmp % doas fio --randrepeat=1 --ioengine=sync --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=64
fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=24.4MiB/s,w=8403KiB/s][r=6258,w=2100 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=628: Sun Jul 29 17:03:07 2018
   read: IOPS=4959, BW=19.4MiB/s (20.3MB/s)(768MiB/39620msec)
   bw (  KiB/s): min= 3243, max=30898, per=91.38%, avg=18128.09, stdev=7126.81, samples=74
   iops        : min=  810, max= 7724, avg=4531.61, stdev=1781.73, samples=74
  write: IOPS=1656, BW=6628KiB/s (6787kB/s)(256MiB/39620msec)
   bw (  KiB/s): min= 1114, max= 9657, per=91.24%, avg=6046.36, stdev=2359.85, samples=74
   iops        : min=  278, max= 2414, avg=1511.23, stdev=590.00, samples=74
  cpu          : usr=4.41%, sys=26.93%, ctx=253477, majf=3, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=196498,65646,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=19.4MiB/s (20.3MB/s), 19.4MiB/s-19.4MiB/s (20.3MB/s-20.3MB/s), io=768MiB (805MB), run=39620-39620msec
  WRITE: bw=6628KiB/s (6787kB/s), 6628KiB/s-6628KiB/s (6787kB/s-6787kB/s), io=256MiB (269MB), run=39620-39620msec

I'm not familiar with virtualization, so I'm not sure whether this is good or bad, or where the bottleneck is, but it looks unusable :(
Maybe you can give me some hints?
 
Yes, a VM is going to be a lot slower than the host, especially in IOPS. That said, ZFS can be confusing to benchmark; you might set primarycache=metadata first to make sure you're not just hitting memory for reads.
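For example, something along these lines on the dataset that holds the fio test file (a sketch - zdata/test is a hypothetical dataset name; set it back to all when you're done):
Code:
doas zfs set primarycache=metadata zdata/test
# run fio here, then restore the default:
doas zfs set primarycache=all zdata/test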

Also, are you running a single disk with a single NVMe for L2ARC and SLOG?
And is that SLOG mirrored and battery- or capacitor-backed? If not, you're using a SLOG wrong and it may hurt you in the long run.
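If you do keep a dedicated SLOG, it would normally be a mirror of power-loss-protected devices, added along these lines (just a sketch - the partition names are hypothetical):
Code:
doas zpool add zdata log mirror nvd0p2 nvd1p2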

For reference, it looks like your virtual machine is a lot faster than my two striped SATA SSDs... I get 485 IOPS read and 152 write with your command, primarycache=metadata, and sync=standard. Though that pool is 92% full and 49% fragmented, which is certainly a bad scenario...
 
Also, are you running a single disk with a single NVMe for L2ARC and SLOG?
Well, I'm evaluating the server; I'll add redundancy later. :)

I'm a little bit confused - are zvols accelerated by the ZIL/L2ARC at all? gstat shows that data comes from the HDD only when I run tests in the VM.
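For reference, I'm watching it with something like the following; the filter patterns are just placeholders for my HDD and NVMe device names:
Code:
doas gstat -f 'ada|nvd'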

Also, I'm very interested in what to choose:
1. ZFS on an image file
2. UFS on an image file
3. ZFS on a zvol
4. UFS on a zvol

I'm thinking about #4 with primarycache=all. UFS is a simple filesystem without much overhead, and it gets the benefits of ZFS when it lies on a zvol. I wonder whether the NVMe cache will help in this scenario.
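To be concrete, for #4 I would create the zvol roughly like this (sizes and names are placeholders, based on the paths I'm already using):
Code:
doas zfs create -V 32G -o volmode=dev zdata/bhyve/test2/disk0
doas zfs set primarycache=all zdata/bhyve/test2/disk0
and then point bhyve's virtio-blk at /dev/zvol/zdata/bhyve/test2/disk0.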
 
You can also use raw devices/partitions for your VM images, but the disk space will be preallocated and growing such an image will be much harder. Any particular reason why you insist on using VMs? Jails are much more lightweight.
 
UFS inside a VM with the virtio driver on top of a physical zvol should give you the best performance, as far as FreeBSD virtual machines go. An image-file-backed VM will probably be very close, though I need to benchmark this for myself before giving a definitive answer. ZFS inside a VM on top of ZFS will be very inefficient and provide no benefits.

Zvols will use their parent pool's L2ARC or SLOG, but only if the workload is actually cacheable / making sync writes. I once saw a very large system with a whole lot of money spent on SLOGs, only to have the NFS share exported as async.

I have also seen a "data-warehousing" type system with over 4 TB of L2ARC that was never touched - because the data was write-once, read-rarely. The admin just kept adding RAM and SSDs, wondering why it wouldn't get faster...

Like I said, ZFS performance is a tricky subject. It really depends on your workload, and the only way to determine whether a SLOG or L2ARC will actually help is with a real workload, with real data.
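If you want a quick sanity check on whether reads are even touching the ARC/L2ARC, the kstat counters are handy (a sketch; the exact sysctl names can vary between releases):
Code:
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses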
 
Any particular reason why you insist on using VMs?
I'm not insisting; actually, I have plans to use jails as well. However, the lack of good management tools and the absence of a network stack upsets me. Have VIMAGE jails stabilized now? I evaluated them in the FreeBSD 9.0 era and there were dragons (panics under load).

An image-file-backed VM will probably be very close, though I need to benchmark this for myself before giving a definitive answer. ZFS inside a VM on top of ZFS will be very inefficient and provide no benefits.
Can you share your tests so I can compare numbers? I'm rewriting my setup with Ansible, so it would be easy to provision test VMs. It looks like the different methods have their own use patterns and behave very differently.

1. UFS in an image file (dataset with default settings): random write tests create pressure on the ZIL according to gstat; slightly better.
2. UFS on a zvol (dataset with default settings): zero pressure on the ZIL.
3. ZFS on a zvol. This one is tricky. First of all, there is double caching overhead, so the primary/secondary caches (and probably sync) should be switched off or set to metadata - see the sketch below. (On the guest? On the host?) If given enough memory, performance is good, but I don't think it's good practice to bypass the host cache and flood VMs with memory for file caching.
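What I have in mind for #3 on the host side is something like this (a sketch; test3 is a hypothetical zvol for the ZFS-on-zvol guest):
Code:
doas zfs set primarycache=metadata zdata/bhyve/test3/disk0
doas zfs set secondarycache=none zdata/bhyve/test3/disk0
doas zfs set sync=disabled zdata/bhyve/test3/disk0   # risky: the guest's sync writes are no longer honored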
 
Test line: doas fio --name=test --iodepth=4 --rw=randrw:2 --rwmixread=70 --rwmixwrite=30 --bs=8k --direct=0 --size=256m --numjobs=8
UFS in an image file:
startup parameters:
Code:
Jul 29 16:57:14: bhyveload -c /dev/nmdm0A -m 4096M -e autoboot_delay=3 -d /usr/local/bhyve/test/disk0.img test
Jul 29 16:57:17:  [bhyve options: -c 1 -m 4096M -AHP -U c0d20bf9-9346-11e8-ba9a-bcaec547a985]
Jul 29 16:57:17:  [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,virtio-blk,/usr/local/bhyve/test/disk0.img,nocache,direct -s 5:0,virtio-net,tap0,mac=58:9c:fc:0d:05:43]
Jul 29 16:57:17:  [bhyve console: -l com1,/dev/nmdm0A]
Results:
Code:
Run status group 0 (all jobs):
   READ: bw=75.3MiB/s (78.9MB/s), 9650KiB/s-16.4MiB/s (9881kB/s-17.2MB/s), io=1433MiB (1502MB), run=10895-19031msec
  WRITE: bw=32.3MiB/s (33.9MB/s), 4125KiB/s-7269KiB/s (4224kB/s-7444kB/s), io=615MiB (645MB), run=10895-19031msec
Big pressure on the NVMe ZIL/L2ARC according to gstat.

UFS on a zvol:
startup parameters:
Code:
Jul 29 16:54:50: bhyveload -c /dev/nmdm1A -m 4096M -e autoboot_delay=3 -d /dev/zvol/zdata/bhyve/test2/disk0 test2
Jul 29 16:54:53:  [bhyve options: -c 1 -m 4096M -AHP -U 885b274f-934d-11e8-ba9a-bcaec547a985]
Jul 29 16:54:53:  [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,virtio-blk,/dev/zvol/zdata/bhyve/test2/disk0 -s 5:0,virtio-net,tap1,mac=58:9c:fc:02:a1:30]
Jul 29 16:54:53:  [bhyve console: -l com1,/dev/nmdm1A]
Results:
Code:
Run status group 0 (all jobs):
READ: bw=43.2MiB/s (45.2MB/s), 5527KiB/s-68.7MiB/s (5660kB/s-72.1MB/s), io=1433MiB (1502MB), run=2609-33200msec
WRITE: bw=18.5MiB/s (19.4MB/s), 2365KiB/s-29.4MiB/s (2421kB/s-30.8MB/s), io=615MiB (645MB), run=2609-33200msec
Zero activity on the NVMe.

So, the host cache is not used for zvols?
 
Ahh, the second VM lacks the nocache,direct options!
Now I see NVMe activity and the numbers are much better.

Code:
  READ: bw=88.0MiB/s (92.3MB/s), 11.0MiB/s-66.2MiB/s (11.6MB/s-69.4MB/s), io=1433MiB (1502MB), run=2721-16272msec
  WRITE: bw=37.8MiB/s (39.6MB/s), 4825KiB/s-27.9MiB/s (4941kB/s-29.2MB/s), io=615MiB (645MB), run=2721-16272msec
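For reference, the zvol device line with those options would look something like this (same slot and path as in the earlier log, with the extra options appended):
Code:
-s 4:0,virtio-blk,/dev/zvol/zdata/bhyve/test2/disk0,nocache,direct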
 
1. UFS in an image file (dataset with default settings): random write tests create pressure on the ZIL according to gstat.
2. UFS on a zvol (dataset with default settings): zero pressure on the ZIL; numbers are slightly better.

That's very interesting. Surely zvol VMs don't just write asynchronously all the time? Or does an image file just always synchronize everything? Are you using thin provisioning anywhere?

Edit: oh, I see.

I'll definitely have to put together some spare parts next weekend and go through this.

Are you aware of `zpool iostat -v 1`?
 
Are you aware of `zpool iostat -v 1`?
No. Thank you :)

If I read the fio man page right, my test means 70% reads, 50% sequential. Obviously, I added vfs.zfs.l2arc_noprefetch="0" to /boot/loader.conf (NVMe >>> a single spindle drive).
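For the record, that's the loader.conf line, plus what I believe is the runtime equivalent (treat the sysctl as a sketch):
Code:
# /boot/loader.conf
vfs.zfs.l2arc_noprefetch="0"
# or at runtime:
doas sysctl vfs.zfs.l2arc_noprefetch=0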
 