ZFS Tuning Question - ZFS + PostgreSQL (in jail) - High-load transfer causes panic: I/O to pool 'zroot' appears to be hung

freebsdinator

Member

Thanks: 5
Messages: 34

#1
Hello,

I am currently transferring data over the network to a PostgreSQL database (in a jail) on a new system I've set up, and something has been happening that really has me scratching my head.

It seems that while the system is under heavy load from this PostgreSQL transfer, only one of the two drives in the mirrored pool writes data while the other sits idle (for up to five minutes); the drives alternate in this pattern, although at other times both drives write data as you would normally expect.

Bash:
dT: 1.060s  w: 1.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   20      2      0      0    0.0      1    117  15842  152.5| ada0
    0      1      0      0    0.0      0      0    0.0    0.0| ada1
This never occurs when data is written to the zpool by other means; even 500GB randomly generated files do not trigger it. That leads me to believe it's related to some configuration I need to tweak in ZFS and/or PostgreSQL.

The system also sometimes feels unresponsive. For example, after I signed out of root, it hung for 10-15 minutes before dropping me back to my unprivileged user.

Sometimes the situation resolves itself; other times it ends in a kernel panic:
Bash:
panic: I/O to pool 'zroot' appears to be hung on vdev guid 3397587246704100575 at '/dev/label/encroot0.eli'
The machine has 16GB of RAM. In an attempt to resolve this, I doubled the two variables below:


kern.ipc.shmall=262144 #System default was: 131072
kern.ipc.shmmax=1073741824 #System default was: 536870912


I also increased the PostgreSQL variables shared_buffers to 1638MB and effective_cache_size to 3GB, which delayed the issue. The table I'm transferring has around 100 million rows; before these changes, the transfer failed at around 30 million rows, and afterward at around 90 million.
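For anyone following along, here is a sketch of how those settings might look in postgresql.conf (the values are the ones described above; the commented-out full_page_writes line is a commonly suggested ZFS-specific tweak, since copy-on-write already guards against torn pages, but I have not verified it on this workload):

Code:
# postgresql.conf -- values tried on this 16GB machine
shared_buffers = 1638MB
effective_cache_size = 3GB
# Often suggested for PostgreSQL on ZFS; verify before relying on it:
#full_page_writes = off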

I've ruled out the hard drives by running smartmontools and by unplugging one drive at a time to confirm that the kernel panic still occurred. I know I could split the table and move smaller chunks, but I'd like to address the root cause so my server won't panic in the future.

Any suggestions are welcome.


Code:
  pool: zroot
 state: ONLINE
config:

	NAME                    STATE     READ WRITE CKSUM
	zroot                   ONLINE       0     0     0
	  mirror-0              ONLINE       0     0     0
	    label/encroot0.eli  ONLINE       0     0     0
	    label/encroot1.eli  ONLINE       0     0     0


Code:
Filesystem                 1K-blocks       Used      Avail Capacity  Mounted on
zroot                     7411268064   74435124 7336832940     1%    /
/boot/loader.conf
Code:
kern.ipc.semmni=256                                                    
kern.ipc.semmns=512                                                    
kern.ipc.semmnu=256
 

freebsdinator

Member

Thanks: 5
Messages: 34

#3
don't know what it does, but I always see it on pages about pgsql in jails, and you didn't mention it... did you allow sysvipc?
I've been eyeballing that option as well. I believe it allows direct interaction with the host's memory, which defeats the purpose of keeping a jail isolated, but I only run PostgreSQL in a jail to allow easier/safer upgrades.

I did another update in sysctl.conf:
#Total amount of shared memory available (bytes or pages)
kern.ipc.shmall=2147483648 #System default was: 131072
#Maximum size of shared memory segment (bytes)
kern.ipc.shmmax=2147483648 #System default was: 536870912

If it still fails, I'm going to look at flipping that flag or moving PostgreSQL outside of the jail to see if that has any impact. I'm assuming having a SQL instance in a jail is fairly common, but I wouldn't gather that from the lack of detail available on the required optimizations. :-(
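In case it helps anyone searching later, enabling SysV IPC for a jail is a one-line change in jail.conf (the jail name and path below are placeholders, not my actual setup):

Code:
# /etc/jail.conf -- hypothetical jail definition
pgsql {
	path = "/usr/jails/pgsql";
	# coarse-grained switch: shares the host's SysV IPC namespace
	allow.sysvipc = 1;
	# FreeBSD 11+ alternative: per-jail IPC namespaces
	#sysvshm = new; sysvsem = new; sysvmsg = new;
}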
 

nihr43

Member

Thanks: 18
Messages: 45

#4
alright, dumb question; what version of FreeBSD?
also, in what condition is this hardware? are we talking a 10-year-old server in a dusty closet .... ?
does it always hang on the same disk?
I've spent a lot of time chasing my tail with kernel panics that turned out to be hardware problems. I know you said SMART is fine, but what else?
 

freebsdinator

Member

Thanks: 5
Messages: 34

#5
alright, dumb question; what version of FreeBSD?
also, in what condition is this hardware? are we talking a 10-year-old server in a dusty closet .... ?
does it always hang on the same disk?
I've spent a lot of time chasing my tail with kernel panics that turned out to be hardware problems. I know you said SMART is fine, but what else?
FreeBSD 11.2
2012, Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (3499.06-MHz K8-class CPU)
The issue alternates disks.

I have a mirrored pool, so I've pulled one disk, had the same issue, pulled the other disk, had the same issue.

Both drives are new Barracuda 8TB drives.

I believe it's some sort of memory constraint between PostgreSQL and ZFS, as I can generate a random 500GB file multiple times with no hanging and no panics; it just writes the file.
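If it is a memory tug-of-war between the ZFS ARC and PostgreSQL, one untested thing to try is capping the ARC in /boot/loader.conf so it can't grow into the memory shared_buffers needs. The 4G figure below is a guess for a 16GB machine, not a recommendation:

Code:
# /boot/loader.conf -- cap the ZFS ARC (value is a guess, tune to taste)
vfs.zfs.arc_max="4G"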
 

nihr43

Member

Thanks: 18
Messages: 45

#6
Well, you've got my interest now. What version of Postgres? And if I'm going to try to recreate this, can you get it to crash with pgbench? It comes with Postgres.

A short pgbench script that you know to fail - that I can copy and paste into a jail - would be a good way to go about this.

to start, create a database, and run `pgbench --initialize -s 1000 $testdb` to populate 100 million rows.
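Something like this, say (untested sketch; the database name is arbitrary, run as the postgres user inside the jail):

Code:
createdb testdb
# scale factor 1000 = 100 million rows in pgbench_accounts
pgbench --initialize -s 1000 testdb
# then hammer it: 8 clients, 2 worker threads, 10 minutes
pgbench -c 8 -j 2 -T 600 testdb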
 

freebsdinator

Member

Thanks: 5
Messages: 34

#7
Hey nihr43,

Sorry for the delay. I managed to get a billion-row table to transfer using pgloader3, but the settings made the progress unbelievably slow.

I ran the benchmark tool you recommended. Around each 5% mark, it would pause for 10-30 seconds, and gstat would show one drive idle while the other was maxed at 100% busy. However, the machine itself did not crash.

To further reduce variables, I created a new PostgreSQL database outside of the jail (mirroring the same PostgreSQL configuration) and got identical results: table creation took around 1000 seconds, with delays around each 5% mark. If nothing else, having the times in the same ballpark shows how little overhead jails add.

Again, the ZFS pool is just two mirrored 8TB drives. The performance issue appears to come down to ZFS tuning I haven't yet identified.
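One commonly suggested starting point for PostgreSQL on ZFS, which I have not tried yet, is a dedicated dataset whose recordsize matches PostgreSQL's 8K page size (the dataset name below is a placeholder):

Code:
# untested: create a dataset tuned for the database files
zfs create -o recordsize=8k -o compression=lz4 zroot/pgdata
# worth benchmarking for bulk loads:
zfs set logbias=throughput zroot/pgdata
# then move the PostgreSQL data directory (or a tablespace) onto it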
 