ZFS: How to optimize and tune a ZFS datastore for databases like PostgreSQL or MySQL?

I have read numerous articles on this forum where members claim ZFS datasets don't work well for databases.

Some have said that ZFS degrades or corrupts data, or just has poor performance.

In response to some of those posts I have seen the forum admins reply that ZFS works fine but needs some specific tuning.

However, the tuning that was needed was not elaborated on.

My question is how do I determine the optimal tuning for a dataset?

What is a good place to start from with the tuning and what areas need tweaking, etc?

Also, I could not find how to edit my title.

I want to tune a ZFS dataset, not the entire pool.

In my specific instance I have the following:

Dual SSDs
Five SATA3 7200 RPM enterprise drives

Here is my layout:
Code:
# zpool status
  pool: Datastore
 state: ONLINE
config:

        NAME                      STATE     READ WRITE CKSUM
        Datastore                 ONLINE       0     0     0
          raidz1-0                ONLINE       0     0     0
            gpt/Datastore-zfs0    ONLINE       0     0     0
            gpt/Datastore-zfs1    ONLINE       0     0     0
            gpt/Datastore-zfs2    ONLINE       0     0     0
            gpt/Datastore-zfs3    ONLINE       0     0     0
            gpt/Datastore-zfs4    ONLINE       0     0     0
        logs
          mirror-1                ONLINE       0     0     0
            gpt/Datastore-log0    ONLINE       0     0     0
            gpt/Datastore-log1    ONLINE       0     0     0
        cache
          gpt/Datastore-cache0    ONLINE       0     0     0
          gpt/Datastore-cache1    ONLINE       0     0     0

Around 160 GB of L2ARC cache on SSD
7 TB RAIDZ1
8 GB mirrored ZIL (SLOG) on SSD
16 GB ECC RAM

Code:
Processor Information
  Family: Xeon
  Signature: Type 0, Family 6, Model 60, Stepping 3
  Version: Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
  External Clock: 100 MHz
  Max Speed: 3800 MHz

zfs-bonnie-bench-results
 
Thanks, this article looks fairly in depth on the topic.

I will work through this, but I expect I will have a few additional questions about the content and its implementation.

I am hoping to narrow the scope of some of the tuneables to just the dataset which stores DB files.

Fortunately, some things I already have globally, such as the L2ARC and log devices on my pool.
 
To be honest with you, I only have one large/demanding MySQL installation. It is running on CentOS 6.6, on 2x 650 GB 15K SAS drives in RAID1 behind a battery-backed controller...
 
I have a monitoring solution at work tracking several hundred thousand items at high frequency from about 900 servers, running under VMware on a pretty beefy set of Cisco UCS blades, and it keeps crashing under high-I/O conditions.

DBAs and other senior admins have redone it three times, but it still crashes every two months.

At this point it comes down to, "If you want something done right, do it yourself."

I would like to move the systems to real hardware, but we have $20,000 worth of very fast SSD storage on the SAN dedicated to these instances, so I am pretty much open to any supportable way to set up some bulletproof HA Postgres servers there.

I will also be needing Postgres shortly as a PuppetDB backend, and for a Foreman / Hiera ENC / Puppet project I am planning.

I am exploring setting up two BSD servers there, perhaps with DRBD and ZFS, if I can first get something decent together in my home lab.

I will post back on this when I get further along with my ZFS / Postgres / FreeBSD experiment.
 
DRBD is only available on Linux. On FreeBSD there is HAST, but I would strongly recommend not using HAST with ZFS.
 
Do you really need a Postgres failover cluster?

The application you use is probably capable of addressing a replicated DB. Then you can omit the cluster stuff and work with replication (streaming replication).
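To illustrate: a minimal sketch of streaming replication as configured on the PostgreSQL versions current at the time (9.x, where the standby still used recovery.conf). The hostnames, addresses, and the "replicator" user are made-up examples:

Code:
# postgresql.conf on the primary
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# pg_hba.conf on the primary: allow the standby to connect for replication
host  replication  replicator  192.0.2.10/32  md5

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=192.0.2.1 user=replicator'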

Regards
Markus
 
If I wanted clustered storage between 2 zfs pools or datasets, how would you recommend to accomplish this?
There is only one way: HAST. The problem is that HAST is a network RAID1 (mirror) over individual disks. Assume that you have 3 disks in each node and you want a replicated RAIDZ1 pool. To do that, you would need to create 3 HAST resources, each one mirroring a disk on one node to its counterpart on the other. Then you would create your pool from those 3 resources.

Now consider that one disk goes bad. You would expect that a zpool status on the active machine would indicate it. It will not! Because the bad disk has a network mirror, ZFS has no way of knowing that something is wrong.
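For illustration only (and keeping the caveat above in mind), one such HAST resource per disk would be declared roughly like this in /etc/hast.conf; the node names, device paths, and addresses here are hypothetical:

Code:
resource disk0 {
        on nodeA {
                local /dev/ada1
                remote 192.0.2.2
        }
        on nodeB {
                local /dev/ada1
                remote 192.0.2.1
        }
}
# repeat for disk1 and disk2, then on the active node:
#   zpool create tank raidz1 /dev/hast/disk0 /dev/hast/disk1 /dev/hast/disk2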

For DB high availability you will need to look at a different technology. If you were running MySQL, things would be easier and I would recommend the Percona XtraDB Cluster suite.
 
This is from my experience with Sun systems and Oracle databases, but the key tunables for a database are:

Set recordsize equal to the block size used by the database.
Put indexes and logs on faster storage, e.g. an SSD pool.
Don't use the ZFS cache for data; most databases already do their own caching, so primarycache=none (or metadata).
There is also a database-relevant setting called logbias: one dataset can use logbias=latency for logs and indexes, while another uses logbias=throughput for tables.
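As a concrete sketch of those tunables (the dataset names are made up, "Datastore" is the pool from the first post, and recordsize assumes PostgreSQL's default 8 KB page size; InnoDB defaults to 16 KB, so use 16K there):

Code:
# Table data: match the DB page size, skip ARC data caching, favor throughput
zfs create -o recordsize=8K -o primarycache=metadata \
    -o logbias=throughput Datastore/pgdata

# WAL / redo logs and indexes: low-latency synchronous writes
zfs create -o recordsize=8K -o logbias=latency Datastore/pgwal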
 