ZFS Tuning Question - ZFS + PostgreSQL (in jail) - High-load transfer causes panic: I/O to pool 'zroot' appears to be hung

freebsdinator

Member

Thanks: 5
Messages: 34

#1
Hello,

I am currently transferring data over the network to a PostgreSQL database (in a jail) on a new system I've set up, and something has been happening that really has me scratching my head.

It seems that while the system is under heavy load from this PostgreSQL transfer, only one of the two drives in the mirrored pool writes data while the other sits idle (for up to five minutes); the drives alternate in this pattern, although at other times both drives write data as you would normally expect.

Bash:
dT: 1.060s  w: 1.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   20      2      0      0    0.0      1    117  15842  152.5| ada0
    0      1      0      0    0.0      0      0    0.0    0.0| ada1
This never occurs when data is written to the zpool by other means; even 500GB randomly generated files do not trigger it. That leads me to believe it's related to some configuration I need to tweak in ZFS and/or PostgreSQL.

The system also sometimes feels unresponsive. For example, after I signed out of root, it hung for 10-15 minutes before dropping me back to my unprivileged user.

Sometimes the situation resolves itself; other times it ends in a kernel panic:
Bash:
panic: I/O to pool 'zroot' appears to be hung on vdev guid 3397587246704100575 at '/dev/label/encroot0.eli'
The machine has 16GB of RAM. In an attempt to resolve this, I doubled the two variables below:


kern.ipc.shmall=262144 #System default was: 131072
kern.ipc.shmmax=1073741824 #System default was: 536870912


I also increased the PostgreSQL variables shared_buffers to 1638MB and effective_cache_size to 3GB, which delayed the issue. The table I'm transferring has around 100 million rows; before these changes, the transfer failed at around 30 million rows, and afterward at around 90 million.
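For anyone following along, here is a sketch of how those settings might look in postgresql.conf (the values are the ones described above; the commented-out full_page_writes line is a commonly suggested ZFS-specific tweak, since copy-on-write already guards against torn pages, but I have not verified it on this workload):

Code:
# postgresql.conf -- values tried on this 16GB machine
shared_buffers = 1638MB
effective_cache_size = 3GB
# Often suggested for PostgreSQL on ZFS; verify before relying on it:
#full_page_writes = off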

I've ruled out the hard drives by running smartmontools and by unplugging one drive at a time to confirm that the kernel panic still occurred. I know I could split the table and move smaller chunks, but I'd like to address the root cause so my server won't panic in the future.

Any suggestions are welcome.


Code:
  pool: zroot
 state: ONLINE
config:

	NAME                    STATE     READ WRITE CKSUM
	zroot                   ONLINE       0     0     0
	  mirror-0              ONLINE       0     0     0
	    label/encroot0.eli  ONLINE       0     0     0
	    label/encroot1.eli  ONLINE       0     0     0


Code:
Filesystem                 1K-blocks       Used      Avail Capacity  Mounted on
zroot                     7411268064   74435124 7336832940     1%    /
/boot/loader.conf
Code:
kern.ipc.semmni=256                                                    
kern.ipc.semmns=512                                                    
kern.ipc.semmnu=256
 

freebsdinator

Member

Thanks: 5
Messages: 34

#3
don't know what it does, but I always see it on pages about pgsql in jails, and you didn't mention it... did you allow sysvipc?
I've been eyeballing that option as well. I believe it allows direct interaction with the host's memory, which defeats the purpose of keeping a jail isolated, but I only run PostgreSQL in a jail to allow easier/safer upgrades.

I did another update in sysctl.conf:
#Total amount of shared memory available (bytes or pages)
kern.ipc.shmall=2147483648 #System default was: 131072
#Maximum size of shared memory segment (bytes)
kern.ipc.shmmax=2147483648 #System default was: 536870912

If it still fails, I'm going to look at flipping that flag or moving PostgreSQL outside of the jail to see if that has any impact. I'm assuming having a SQL instance in a jail is fairly common, but I wouldn't gather that from the lack of detail available on the required optimizations. :-(
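In case it helps anyone searching later, enabling SysV IPC for a jail is a one-line change in jail.conf (the jail name and path below are placeholders, not my actual setup):

Code:
# /etc/jail.conf -- hypothetical jail definition
pgsql {
	path = "/usr/jails/pgsql";
	# coarse-grained switch: shares the host's SysV IPC namespace
	allow.sysvipc = 1;
	# FreeBSD 11+ alternative: per-jail IPC namespaces
	#sysvshm = new; sysvsem = new; sysvmsg = new;
}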
 

nihr43

Member

Thanks: 18
Messages: 45

#4
alright, dumb question; what version of FreeBSD?
also, in what condition is this hardware? are we talking a 10-year-old server in a dusty closet .... ?
does it always hang on the same disk?
I've spent a lot of time chasing my tail with kernel panics that turned out to be hardware problems. I know you said SMART is fine, but what else?
 

freebsdinator

Member

Thanks: 5
Messages: 34

#5
alright, dumb question; what version of FreeBSD?
also, in what condition is this hardware? are we talking a 10-year-old server in a dusty closet .... ?
does it always hang on the same disk?
I've spent a lot of time chasing my tail with kernel panics that turned out to be hardware problems. I know you said SMART is fine, but what else?
FreeBSD 11.2
2012, Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (3499.06-MHz K8-class CPU)
The issue alternates disks.

I have a mirrored pool, so I've pulled one disk, had the same issue, pulled the other disk, had the same issue.

Both drives are new Barracuda 8TB drives.

I believe it's some sort of memory constraint between PostgreSQL and ZFS, as I can generate a random 500GB file multiple times with no hanging and no panics; it just writes the file.
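If it is a memory tug-of-war between the ZFS ARC and PostgreSQL, one untested thing to try is capping the ARC in /boot/loader.conf so it can't grow into the memory shared_buffers needs. The 4G figure below is a guess for a 16GB machine, not a recommendation:

Code:
# /boot/loader.conf -- cap the ZFS ARC (value is a guess, tune to taste)
vfs.zfs.arc_max="4G"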
 

nihr43

Member

Thanks: 18
Messages: 45

#6
Well, you've got my interest now. What version of Postgres? And if I'm going to try to recreate this, can you get it to crash with pgbench? It comes with Postgres.

A short pgbench script that you know to fail - that I can copy and paste into a jail - would be a good way to go about this.

to start, create a database, and run `pgbench --initialize -s 1000 $testdb` to populate 100 million rows.
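Something like this, say (untested sketch; the database name is arbitrary, run as the postgres user inside the jail):

Code:
createdb testdb
# scale factor 1000 = 100 million rows in pgbench_accounts
pgbench --initialize -s 1000 testdb
# then hammer it: 8 clients, 2 worker threads, 10 minutes
pgbench -c 8 -j 2 -T 600 testdb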
 

freebsdinator

Member

Thanks: 5
Messages: 34

#7
Hey nihr43,

Sorry for the delay. I managed to get a billion-row table to transfer using pgloader3, but the settings made the progress unbelievably slow.

I ran the benchmark tool you recommended. Around each 5% mark, it would pause for 10-30 seconds, and gstat would show one drive idle while the other was maxed at 100% busy. However, the machine itself did not crash.

To further reduce variables, I created a new PostgreSQL database outside of the jail (mirroring the same PostgreSQL configuration) and got identical results: table creation took around 1000 seconds, with delays around each 5% mark. If nothing else, having the times in the same ballpark shows how little overhead jails add.

Again, the ZFS pool is just two mirrored 8TB drives. The performance issue appears to come down to ZFS tuning I haven't yet identified.
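One commonly suggested starting point for PostgreSQL on ZFS, which I have not tried yet, is a dedicated dataset whose recordsize matches PostgreSQL's 8K page size (the dataset name below is a placeholder):

Code:
# untested: create a dataset tuned for the database files
zfs create -o recordsize=8k -o compression=lz4 zroot/pgdata
# worth benchmarking for bulk loads:
zfs set logbias=throughput zroot/pgdata
# then move the PostgreSQL data directory (or a tablespace) onto it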
 