ZFS read incredibly slow

I have a FreeBSD 10.0 system running MySQL with its InnoDB database files stored on ZFS. It seemed to be going fine for months, but lately, and seemingly suddenly, performance has bottomed out. After debugging around MySQL for a while, I finally decided to just check whether simply reading the database files from the filesystem was slow.

I picked one table whose database file was about 16GB and ran:
Code:
time cat table.ibd > /dev/null

and got:
Code:
cat table.ibd > /dev/null
1.24s user 64.35s system 0% cpu 1:00:34.65 total

In comparison, a copy of the file (with some changes, I'm sure: the database is live) living on UFS on the same system gives me:
Code:
cat table.ibd > /dev/null
0.20s user 9.34s system 5% cpu 9.550 total
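
(If it's useful, a slightly more direct way to time a raw sequential read is dd, which reports the throughput itself; the path below is just a stand-in for wherever the .ibd actually lives:)
Code:
# read the whole file sequentially in 1 MB chunks; dd prints bytes/sec at the end
dd if=/path/to/table.ibd of=/dev/null bs=1m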

Here are non-default ZFS options on this system:
/boot/loader.conf:
Code:
vfs.zfs.arc_max=17179869184

/etc/sysctl.conf:
Code:
vfs.zfs.prefetch_disable=1

zfs get all:
Code:
recordsize      16K
compression      on
atime           off
primarycache    metadata
zfs:zfs_nocacheflush  1
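
(Those were pulled from zfs get all; to spot-check just these properties, the shorter form below works, with the dataset name being a placeholder:)
Code:
# query only the properties of interest for one dataset
zfs get recordsize,compression,atime,primarycache mysqlrot/db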

There are 12 snapshots on this filesystem right now. (Which seems excessive; I'm going to see if any of them can be deleted.)
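
(Quick way to list them, for anyone curious; recursive in case child datasets have their own:)
Code:
zfs list -t snapshot -r mysqlrot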

There's no L2ARC for this pool (or any other pool on the system).

I've tried all three values for primarycache, and I've tried re-enabling prefetch; neither seems to have had any significant effect.
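
(For reference, those are the usual knobs; the dataset name below is a placeholder, and prefetch is toggled at runtime via sysctl:)
Code:
# try caching data as well as metadata on the MySQL dataset
zfs set primarycache=all mysqlrot/db
# 0 = prefetch enabled, 1 = disabled
sysctl vfs.zfs.prefetch_disable=0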

The zpool is four 2-disk mirrors:

Code:
% zpool status mysqlrot
  pool: mysqlrot
state: ONLINE
  scan: scrub repaired 0 in 9h45m with 0 errors on Fri Jun 26 12:46:33 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mysqlrot    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        mfid9   ONLINE       0     0     0
        mfid10  ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        mfid11  ONLINE       0     0     0
        mfid12  ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        mfid13  ONLINE       0     0     0
        mfid14  ONLINE       0     0     0
      mirror-3  ONLINE       0     0     0
        mfid15  ONLINE       0     0     0
        mfid16  ONLINE       0     0     0
    spares
      mfid19    AVAIL

errors: No known data errors

One oddity is how the raw devices are set up. The storage controller is an MFI controller, and each raw disk is actually configured in the MFI controller as a 1-disk RAID0 volume:

Code:
% sudo mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
mfid0 (  185G) RAID-1      64K OPTIMAL Disabled <OS>
mfid1 (  558G) RAID-0      64K OPTIMAL Disabled <DB0A>
mfid2 (  558G) RAID-0      64K OPTIMAL Disabled <DB0B>
mfid3 (  558G) RAID-0      64K OPTIMAL Disabled <DB1A>
mfid4 (  558G) RAID-0      64K OPTIMAL Disabled <DB1B>
mfid5 (  558G) RAID-0      64K OPTIMAL Disabled <DB2A>
mfid6 (  558G) RAID-0      64K OPTIMAL Disabled <DB2B>
mfid7 (  558G) RAID-0      64K OPTIMAL Disabled <DB3A>
mfid8 (  558G) RAID-0      64K OPTIMAL Disabled <DB3B>
mfid9 (  558G) RAID-0      64K OPTIMAL Disabled <DB4A>
mfid10 (  558G) RAID-0      64K OPTIMAL Disabled <DB4B>
mfid11 (  558G) RAID-0      64K OPTIMAL Disabled <DB5A>
mfid12 (  558G) RAID-0      64K OPTIMAL Disabled <DB5B>
mfid13 (  558G) RAID-0      64K OPTIMAL Disabled <DB6A>
mfid14 (  558G) RAID-0      64K OPTIMAL Disabled <DB6B>
mfid15 (  558G) RAID-0      64K OPTIMAL Disabled <DB7A>
mfid16 (  558G) RAID-0      64K OPTIMAL Disabled <DB7B>
mfid17 (  558G) RAID-0      64K OPTIMAL Disabled <DB8A>
mfid18 (  558G) RAID-0      64K OPTIMAL Disabled <DB8B>
mfid19 (  558G) RAID-0      64K OPTIMAL Disabled <SPARE0>
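
(Side note on those 1-disk RAID0 volumes: the cache policy the controller applies to each one can be inspected with mfiutil's cache subcommand; mfid9 below is just one pool member as an example:)
Code:
% sudo mfiutil cache mfid9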

Anyone have any ideas where to look?
 
Code:
# zpool list
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mysql      2.17T   497G  1.69T    22%  1.00x  ONLINE  -
mysqllogs   556G   168G   388G    30%  1.00x  ONLINE  -
mysqlrot   2.17T  1.49T   696G    68%  1.00x  ONLINE  -
 
I do not run databases, but you may be running into the problem of fragmentation on a COW filesystem. Either go for an all-flash solution or move the database to UFS instead.
 
I've been looking at slow ZFS read performance problems that I've seen on my machines recently. It feels like they started when I upgraded from FreeBSD 9.2 to 9.3; I'm not sure if it's getting worse or my tolerance has dropped to the point where I started looking. And looking makes it feel even worse....

Today I was specifically looking at why there's a large disparity between read and write I/Os for MySQL when I came across this thread. For MySQL, I'm using databases/percona56-server.

The zpool is on a mirrored pair of SanDisk SDSSDHP256G drives on the same SATA-II controller. The machine has 3 other SATA controllers and makes use of port multipliers, as it's mainly a file/archive server (+ mail). So the MySQL databases on here are just for Roundcube, ownCloud and phpMyAdmin.

I forget which way around it was at the moment, but the choice between SATA-II and SATA-III port-multiplier-capable cards came down to whether they used FIS-based or command-based switching, and the SATA-III ones ended up being slower. The PCIe interface is my limiting factor anyway (the difference between 1.0 and 2.0).

In previous releases I played around with sysctls like vfs.zfs.vdev.min_pending and vfs.zfs.vdev.max_pending, which seemed to help. Though with a mix of workloads (throughput vs. latency), it was a challenge to set them to something that felt good for each system. Now that these have been replaced by the many vfs.zfs.vdev.[a]sync_(read|write)_(min|max)_active parameters, I have no feel for whether changing them makes things worse, better, or does nothing at all.
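
(They can at least all be listed in one go, if anyone wants to poke at them:)
Code:
# current per-vdev I/O queue depth limits
sysctl vfs.zfs.vdev | grep '_active'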

Some of the sysctls that I have changed (through /etc/sysctl.conf and/or /boot/loader.conf) include:

Code:
vfs.zfs.resilver_delay=1
vfs.zfs.min_auto_ashift=12
vfs.zfs.l2arc_write_max=16777216
vfs.zfs.l2arc_write_boost=16777216

and

Code:
vfs.zfs.arc_max="7680M"

None of them should impact MySQL. The first one was probably set when an eSATA cable fell off an array while I was adding a disk two weeks ago, and I wanted the resilver to finish before the end of the weekend. I'm dreading the day I need to replace a disk, or replace one or both of the large zpools currently on this system, though I should do it before I add a third... Both the pool of 1.5TB disks and the pool of 2TB disks were created with 512-byte sectors. I think I'm out of cold spares in the 1.5TB size now, and a couple of the disks are starting to get pretty high on reallocated sectors.

But the main thing I meant to say is that the spinning-disk pools have ZIL and L2ARC SSDs. Currently an unbalanced pair, a 240G and a 320G SSD... Originally it was only a single SSD, but I was alone at work on the necessity of mirrored ZILs.

Elsewhere in the datacenter, other co-workers had set up an ownCloud server for an SLA customer, with a 6-disk raidz plus a single SSD serving as both ZIL and L2ARC. They originally intended it as ZIL only, but given how little of it was used.... Aside from it still running FreeBSD 9.1, it stopped responding for a while a few weeks ago as it logged controller timeouts and resets for the SSD. Not sure where to begin now that I've inherited it.

But, back to my workstation.

The vfs.zfs.arc_max is more to make sure ZFS doesn't prevent mysql and the like from being able to get contiguous blocks of memory for their needs at boot time. My average ARC size this past week was ~4.5G. That might get worse, as I gave innodb_buffer_pool_size another nudge last night. It is somewhat confusing when 3 different tuning tools give conflicting recommendations, though they all agree that I needed more here. But 2/3 or 80% of memory (depending on the tool) doesn't work for me.
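
(For context, that's just the usual my.cnf knob; the value below is illustrative rather than what I actually run:)
Code:
# my.cnf: InnoDB gets its own cache, since primarycache=metadata
# means ZFS isn't caching the table data in the ARC anyway
[mysqld]
innodb_buffer_pool_size = 4G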

Some values from zfs get all zpool0/var/db/mysql:

Code:
recordsize            16K                    local
mountpoint            /var/db/mysql          local
checksum              fletcher4              inherited from zroot
compression           lz4                    inherited from zroot/var
atime                 off                    inherited from zroot
primarycache          metadata               local
logbias               throughput             inherited from zroot

I use innodb_file_per_table, and what prompted this investigation was that I noticed it taking a long time to OPTIMIZE a 2.1GB .ibd file. Using gstat(8), I saw it reading steadily at ~40MB/s, with occasional writes at ~150MB/s. But it'll easily do ~150MB/s reads during the periodic scrub. (I've also seen instantaneous reads approach 300MB/s... using sysutils/gkrellm2, though I'm not sure why cf-agent is sustaining or performing such high-rate reads.)
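
(For anyone wanting to watch the same thing, this is roughly the gstat invocation, per gstat(8):)
Code:
# refresh once per second, only show providers with activity
gstat -a -I 1s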

My first thought was whether compression was slowing me down in this case, but it's not a change I'm prepared to make without any facts, since all my experience so far says it helps.
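
(One easy data point to gather first is the actual ratio on the dataset, e.g.:)
Code:
zfs get compression,compressratio zpool0/var/db/mysql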

I don't have any UFS filesystems to do comparisons with. Pretty much since ZFS hit the scene, I almost never get to make non-ZFS filesystems. The only exception was for emulators/pipelight, and that was backed by a zvol.

The Dreamer.
 