ZFS: high system CPU as iSCSI storage for VMware

We are using a FreeBSD 11.0-RELEASE-p9 server as an iSCSI storage backend for a VMware ESXi 6 cluster. The server is a Dell PowerEdge R620 with an attached MD1220 disk enclosure full of SSDs. The SSDs are set up in mirrored vdevs, and a zvol is shared over iSCSI with VMware, which has formatted it with VMFS5. What we are experiencing is high system CPU that seems to correspond with high write activity. This has caused some VM performance issues, including I/O timeouts requiring fscks/reboots of VMs.

Another thing to note is that we are relying on lz4 compression to allow for some "oversubscription" -- we present an 8 TB volume to VMware via iSCSI, but there is really only about 5 TB available in the pool.
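For what it's worth, the zvol is thin-provisioned (no refreservation). Reconstructing from the property list further down, the creation command would have looked roughly like this -- a sketch, not the exact command we ran:

Code:
# sketch of creating a sparse (thin-provisioned) zvol with lz4,
# reconstructed from the zfs get output below
zfs create -s -V 8T -o volblocksize=4K -o compression=lz4 -o volmode=dev storage/vm-storage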

My suspicion is that this is a result of fragmentation in the metaslabs, but it is only at 52% capacity -- previous experience (and what is considered the "traditional" wisdom in ZFS-land) is that 80% capacity is about the magic tipping point. I'm not sure how to definitively prove that is or is not actually the issue. Block size is 4 KB to match the underlying SSDs, and from the zdb -M output, it looks like there are still plenty of 4 KB blocks available. Any pointers or suggestions would be most appreciated.
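In case it's useful to anyone digging into this, these are the sorts of commands that can be used to look at fragmentation beyond plain zpool list (the summary below came from the last one); the per-metaslab output is very verbose:

Code:
# per-vdev capacity and fragmentation breakdown
zpool list -v storage

# per-metaslab space maps (very verbose)
zdb -mm storage | less

# pool-wide free-space histogram (summary shown below)
zdb -M storage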

zdb -M output (just the summary):
Code:
pool storage fragmentation 71%
  9: 25113486 **************
10: 69352157 **************************************
11: 73587828 ****************************************
12: 73809494 ****************************************
13: 33958358 *******************
14: 16266259 *********
15: 7790609 *****
16: 3767282 ***
17: 1679263 *
18: 507941 *
19:  99539 *
20:  11143 *
21:   4242 *
22:   1517 *
23:    157 *
24:     26 *
25:      1 *

Code:
$ zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  5.09T  2.66T  2.42T         -    71%    52%  1.00x  ONLINE  -

Code:
$ zpool status storage
  pool: storage
 state: ONLINE
  scan: scrub canceled on Fri Jul  7 01:37:15 2017

config:
NAME                    STATE     READ WRITE CKSUM
storage                 ONLINE       0     0     0
  mirror-0              ONLINE       0     0     0
    multipath/0RY81DLA  ONLINE       0     0     0
    multipath/0RY82EYA  ONLINE       0     0     0
  mirror-1              ONLINE       0     0     0
    multipath/0RY82PAA  ONLINE       0     0     0
    multipath/0RY82PGA  ONLINE       0     0     0
  mirror-2              ONLINE       0     0     0
    multipath/0RY82SJA  ONLINE       0     0     0
    multipath/0RY82XHA  ONLINE       0     0     0
  mirror-3              ONLINE       0     0     0
    multipath/0RY82ZGA  ONLINE       0     0     0
    multipath/0RY842VA  ONLINE       0     0     0
  mirror-4              ONLINE       0     0     0
    multipath/0RY842YA  ONLINE       0     0     0
    multipath/0RY84A8A  ONLINE       0     0     0
  mirror-5              ONLINE       0     0     0
    multipath/0RY84BNA  ONLINE       0     0     0
    multipath/0RY84BXA  ONLINE       0     0     0
  mirror-6              ONLINE       0     0     0
    multipath/0RY84D2A  ONLINE       0     0     0
    multipath/0RY84D4A  ONLINE       0     0     0
spares
  multipath/0RY857DA    AVAIL

Code:
$ zfs get all storage/vm-storage

NAME                PROPERTY              VALUE                  SOURCE
storage/vm-storage  type                  volume                 -
storage/vm-storage  creation              Fri Jul  7 13:17 2017  -
storage/vm-storage  used                  2.66T                  -
storage/vm-storage  available             2.26T                  -
storage/vm-storage  referenced            2.66T                  -
storage/vm-storage  compressratio         1.78x                  -
storage/vm-storage  reservation           none                   default
storage/vm-storage  volsize               8T                     local
storage/vm-storage  volblocksize          4K                     -
storage/vm-storage  checksum              on                     default
storage/vm-storage  compression           lz4                    local
storage/vm-storage  readonly              off                    default
storage/vm-storage  copies                1                      default
storage/vm-storage  refreservation        none                   default
storage/vm-storage  primarycache          all                    local
storage/vm-storage  secondarycache        all                    default
storage/vm-storage  usedbysnapshots       0                      -
storage/vm-storage  usedbydataset         2.66T                  -
storage/vm-storage  usedbychildren        0                      -
storage/vm-storage  usedbyrefreservation  0                      -
storage/vm-storage  logbias               throughput             inherited from storage
storage/vm-storage  dedup                 off                    default
storage/vm-storage  mlslabel                                     -
storage/vm-storage  sync                  standard               local
storage/vm-storage  refcompressratio      1.78x                  -
storage/vm-storage  written               2.66T                  -
storage/vm-storage  logicalused           4.65T                  -
storage/vm-storage  logicalreferenced     4.65T                  -
storage/vm-storage  volmode               dev                    local
storage/vm-storage  snapshot_limit        none                   default
storage/vm-storage  snapshot_count        none                   default
storage/vm-storage  redundant_metadata    all                    default

Relevant config from ctl.conf:
Code:
target iqn.2017-03.com.flightaware:target0 {
    alias "VMware Target"
    portal-group pg0
    auth-group ag0
    lun 0 {
        path /dev/zvol/storage/vm-storage
        size 8796093022208
    }
}

Output of top:
Code:
CPU:  0.7% user,  0.0% nice, 64.8% system,  3.7% interrupt, 30.8% idle
Mem: 5880K Active, 39M Inact, 104G Wired, 21G Free
ARC: 90G Total, 10G MFU, 69G MRU, 46M Anon, 7070M Header, 3924M Other
Swap: 16G Total, 1800K Used, 16G Free

  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    0 root            8    -     0K 12240K -      13  65.8H  36.05% kernel{cblk0 taskq_4}
    0 root            8    -     0K 12240K -      17  65.5H  33.79% kernel{cblk0 taskq_9}
    0 root          -16    -     0K 12240K -       7  88.7H  32.92% kernel{zio_write_intr_7}
    0 root            8    -     0K 12240K -      18  65.7H  32.66% kernel{cblk0 taskq_6}
   12 root          -92    -     0K  1440K WAIT    6  94.8H  31.78% intr{irq316: ix1:q6}
    0 root          -16    -     0K 12240K -      13  88.6H  31.24% kernel{zio_write_intr_5}
    0 root            8    -     0K 12240K -      17  65.7H  31.07% kernel{cblk0 taskq_7}
    0 root            8    -     0K 12240K CPU12  12  65.5H  30.49% kernel{cblk0 taskq_11}
    0 root            8    -     0K 12240K -       6  65.8H  30.09% kernel{cblk0 taskq_8}
    0 root            8    -     0K 12240K -       4  65.8H  29.43% kernel{cblk0 taskq_13}
    0 root          -15    -     0K 12240K -      17 113.9H  29.00% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      13 113.8H  28.95% kernel{zio_write_issue_}
    0 root            8    -     0K 12240K -      19  65.7H  28.86% kernel{cblk0 taskq_0}
    0 root          -16    -     0K 12240K -       3  88.6H  28.84% kernel{zio_write_intr_3}
    0 root            8    -     0K 12240K -       8  65.2H  28.69% kernel{cblk0 taskq_3}
    0 root          -16    -     0K 12240K -       4  88.6H  28.53% kernel{zio_write_intr_2}
    0 root          -15    -     0K 12240K -      13 113.9H  28.46% kernel{zio_write_issue_}
    0 root            8    -     0K 12240K CPU1    1  65.7H  28.42% kernel{cblk0 taskq_5}
    0 root          -16    -     0K 12240K -      13  88.6H  28.29% kernel{zio_write_intr_0}
    0 root          -15    -     0K 12240K -      14 113.8H  28.24% kernel{zio_write_issue_}
    0 root          -16    -     0K 12240K -       5  88.7H  28.24% kernel{zio_write_intr_1}
    0 root          -16    -     0K 12240K -      19  88.6H  28.01% kernel{zio_write_intr_4}
    0 root            8    -     0K 12240K -      23  65.8H  27.92% kernel{cblk0 taskq_12}
    0 root          -15    -     0K 12240K -       2 113.9H  27.84% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      20 113.9H  27.14% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -       8 113.9H  26.90% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      14 113.9H  26.60% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -       9 113.9H  26.58% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -       5 113.9H  26.55% kernel{zio_write_issue_}
    0 root          -16    -     0K 12240K -      15  88.6H  26.20% kernel{zio_write_intr_6}
    0 root          -15    -     0K 12240K -      18 113.9H  26.17% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      23 113.9H  26.07% kernel{zio_write_issue_}
    0 root            8    -     0K 12240K -      17  65.7H  25.62% kernel{cblk0 taskq_10}
    0 root          -15    -     0K 12240K -       1 113.9H  25.41% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      15 113.9H  25.33% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      17 113.9H  25.31% kernel{zio_write_issue_}
    0 root            8    -     0K 12240K -      21  65.7H  25.12% kernel{cblk0 taskq_1}
    0 root            8    -     0K 12240K -      21  65.2H  25.02% kernel{cblk0 taskq_2}
    0 root          -15    -     0K 12240K -       7 113.9H  24.84% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      17 113.8H  23.88% kernel{zio_write_issue_}
    0 root          -15    -     0K 12240K -      20 113.9H  22.09% kernel{zio_write_issue_}
   12 root          -88    -     0K  1440K WAIT    4  36.1H  19.28% intr{irq320: mps1}
   12 root          -88    -     0K  1440K WAIT    9  36.0H  19.05% intr{irq319: mps0}
    4 root          -16    -     0K    96K RUN    11  33.8H  18.50% cam{doneq1}
    4 root          -16    -     0K    96K -      15  35.6H  17.57% cam{doneq3}
    4 root          -16    -     0K    96K -       9  33.9H  17.13% cam{doneq2}
    4 root          -16    -     0K    96K -       3  35.2H  16.59% cam{doneq0}
 1319 root          -16    -     0K   128K -      19  78.4H  15.66% ctl{work0}
    5 root           -8    -     0K   160K tx->tx  5 119.0H  13.98% zfskern{txg_thread_enter}
    4 root          -16    -     0K    96K -       8  25.1H  13.61% cam{doneq4}
    5 root           -8    -     0K   160K arc_us 17  58.1H  11.11% zfskern{arc_user_evicts_}
    0 root          -16    -     0K 12240K icl_tx  3  42.4H  10.84% kernel{cfiscsitx}
    0 root          -16    -     0K 12240K icl_rx  3  41.8H  10.31% kernel{cfiscsirx}
   12 root          -92    -     0K  1440K WAIT    0  63.5H   5.83% intr{irq301: ix0:q0}
 7401 adam.schumach  22    0 26264K  6844K CPU20  20   0:02   4.11% top
 1319 root          -16    -     0K   128K -       2  18.3H   3.80% ctl{work4}
    0 root          -92    -     0K 12240K -      15 354:53   2.88% kernel{ix1:q6}
    0 root          -16    -     0K 12240K icl_rx  7  26.4H   2.83% kernel{cfiscsirx}
34115 root           52    0 30168K  5412K piperd 19 108:19   2.17% zabbix_agentd
    0 root          -16    -     0K 12240K icl_tx  8  18.4H   2.12% kernel{cfiscsitx}
 1319 root          -16    -     0K   128K -       0 691:39   2.08% ctl{work3}
34117 root           52    0 30168K  5436K piperd 19 108:45   1.98% zabbix_agentd
    0 root          -16    -     0K 12240K -      18 143:22   1.46% kernel{zio_read_intr_7_}
    0 root          -16    -     0K 12240K -       2 143:35   1.45% kernel{zio_read_intr_7_}
    0 root          -16    -     0K 12240K -       9 143:35   1.44% kernel{zio_read_intr_7_}
    0 root          -16    -     0K 12240K -       7 143:03   1.43% kernel{zio_read_intr_2_}
    0 root          -16    -     0K 12240K -      19 142:17   1.43% kernel{zio_read_intr_1_}
    0 root          -16    -     0K 12240K -       5 142:13   1.38% kernel{zio_read_intr_3_}
    [....]
 
We are using a FreeBSD 11.0-RELEASE-p9
I would recommend updating to 11.1-RELEASE before going any further. It's possible this issue has already been resolved. Besides that, FreeBSD 11.0 will be EoL in a month or so, so you're going to have to update anyway.
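For reference, the stock binary upgrade path is roughly the following (see freebsd-update(8) and the Handbook for the full procedure, config-file merging, and so on):

Code:
# fetch and stage the upgrade to 11.1-RELEASE
freebsd-update -r 11.1-RELEASE upgrade
freebsd-update install
shutdown -r now
# after the reboot, install the remaining userland updates
freebsd-update install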
 
Okay, I've upgraded to 11.1-RELEASE-p2 and it doesn't seem to have solved the issue. Immediately after a reboot, everything is fast and appears to perform normally. After a couple of days (e.g., over the weekend), even simple things like calling `netstat -m` take a few seconds and cause the system CPU to spike briefly in top. I'm starting to suspect it has something to do with kernel memory and ZFS. I noticed that if I start copying large files around, the ARC increases and Wired memory increases correspondingly. However, if I delete these files, the ARC decreases, but Wired never does:

Code:
last pid: 79093;  load averages:  1.16,  1.65,  1.89                                                                                                                                                                                                                                                                                            up 2+21:36:56  14:54:18
38 processes:  2 running, 36 sleeping
CPU:  0.8% user,  0.0% nice,  4.7% system,  0.0% interrupt, 94.5% idle
Mem: 7548K Active, 102M Inact, 8044K Laundry, 119G Wired, 5223M Free
ARC: 25G Total, 29M MFU, 24G MRU, 32K Anon, 50M Header, 328M Other
    24G Compressed, 24G Uncompressed, 1.01:1 Ratio
Swap: 16G Total, 16G Free

I ran bonnie++ both immediately after a reboot and then this morning after seeing the problem come back. The difference is striking:

Code:
sudo bonnie++ -u root -r 1024 -s 32768:1048576 -d /storage/ -f -b -n 1 -c 12

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency  12     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
royoh.hou.fl 32G:1m           3405363  99 2142021  99           5249593  99 12124 1662
Latency                        1883us    2580us               597us    3524us
Version  1.97       ------Sequential Create------ --------Random Create--------
royoh.hou.flightawa -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               617us      59us     295us     210us      19us     343us

vs

Code:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-           
Concurrency  12     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--           
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP           
royoh.hou.fl 32G:1m           183368  99 95829  99           214537  99 484.6 1628       
Latency                       23936us   46553us             12381us   66021us             
Version  1.97       ------Sequential Create------ --------Random Create--------           
royoh.hou.flightawa -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--           
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP           
                  1   307  68 +++++ +++   297  67   300  67 +++++ +++   304  66           
Latency              4042us    1682us   14770us   16957us     557us   24299us


vmstat -z output attached.
 

Attachments

  • vmstat-z_output.txt
    18.1 KB
...I noticed that if I start copying large files around, the ARC increases and correspondingly WIRED memory increases. However, if I delete these files, the ARC decreases, but WIRED never does:

If wired memory doesn't decrease when you delete large files -- just a rough guess -- maybe ZFS is not able to unmap the blocks that should be released. By the way, how large are the files you are deleting/copying around?
 
The files are 8 GB. They are just random data dd'd from /dev/random for the purposes of testing. Nothing else should be holding them open or using them in any way.
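For completeness, the test is essentially just this (file names are illustrative), watching ARC versus wired memory in top or via sysctl before and after the delete:

Code:
# write an 8 GB file of random data onto the pool
dd if=/dev/random of=/storage/test1 bs=1m count=8192

# copy it around, then remove both copies
cp /storage/test1 /storage/test2
rm /storage/test1 /storage/test2

# compare ARC size and wired page count before/after
sysctl kstat.zfs.misc.arcstats.size vm.stats.vm.v_wire_count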
 
Rebuild the zvol with a volblocksize of 64K!
iSCSI does transfers much larger than 4K. This worked for me.
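Since volblocksize can only be set at creation time, that means making a new zvol and migrating the VMs onto it (e.g. with Storage vMotion). As a sketch, with a hypothetical dataset name:

Code:
# volblocksize is fixed at creation, so a new zvol is required
zfs create -s -V 8T -o volblocksize=64K -o compression=lz4 storage/vm-storage-64k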

I switched back to the default of 128K and am still seeing higher than expected system CPU. I've tried several block sizes, including 1M (which is the block size that VMFS uses), and none of them have made any apparent difference.
 
forums.freenas.org has some very interesting topics about iSCSI and problems like yours, especially by jgreco -- for example, "Why iSCSI often requires more resources for the same result" in the resources section. As I remember, for decent performance with iSCSI you must use mirrors (as you do), have at least 64 GB of RAM, and keep the pool at most around 30% full.
 
Just a guess: you have too much ARC.

Try limiting it to 4 GB by setting vfs.zfs.arc_max="4G" in /boot/loader.conf and performing a reboot.
If that doesn't help, what does CTRL+T produce when you execute netstat -m and it hangs?
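That is, something along these lines (and, if memory serves, vfs.zfs.arc_max can also be lowered at runtime via sysctl if you want to test without a reboot):

Code:
# /boot/loader.conf -- cap the ARC at 4 GB (takes effect on the next boot)
vfs.zfs.arc_max="4G"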
 
ivosevb: yeah, I've read a number of posts on the FreeNAS forums that have at least appeared tangentially related. I'll look more closely at jgreco's stuff; I've already seen a few posts from him that may be helpful. Thanks.


Nils Beyer: 4 GB seems kind of extreme, given that the box has 128 GB of RAM and the files I'm testing with are 8 GB. I did have the thought of trying to reduce arc_max earlier and then dismissed it for some reason I cannot recall now. I'll give it a shot, though, and report back.

netstat -m with CTRL+T during "hang":
Code:
$ netstat -m

load: 1.05  cmd: netstat 98835 [running] 0.27r 0.01u 0.12s 4% 2512k
(regular netstat output here)
 
4GB seems kind of extreme,[...]
Yes, it's only a test to see whether the system still slows down after a while or not. You can increase the limit step by step if that reduction helps. I don't have a machine with 128 GB of RAM available to check it myself, I'm sorry.

netstat -m with CTRL+T during "hang"
Nothing serious besides the "running" state, I'd say. You can press CTRL+T several times in quick succession and check whether different states show up during the hang...
 