ZFS Kernel Panics under high ZFS Load with 32GB RAM

Hi,
I have been using FreeBSD on four file servers for more than five years now and I am more than happy with it. Besides traditional file services (ZFS, NFS, CIFS with Samba), those servers also host VMs (VirtualBox) and serve web (Apache, MySQL, PHP) as well as mail (DBMail, Postfix) services to a couple of customers.

A few weeks ago I decided to upgrade my primary server from 16 to 32 GB RAM - and that's where my trouble started.

Here are some facts about this server:

I built this server last year starting with 16 GB of ECC RAM (2x 8 GB modules).
About one month ago I added 2 more 8 GB modules of the same type.
They have a different manufacturing date but all 4 modules actually consist of the same Hynix chips.
Because of my problems I have already been in contact with Kingston and they confirmed that there was no change regarding the RAM and those 4 modules should be compatible with each other.

After upgrading the RAM to 32 GB, the FreeBSD kernel occasionally panics, something that never happened to me in all the years before.
I noticed that those panics occur when I put the machine under high ZFS load for some time.
For example, I run a monthly backup of the smaller zpool to the larger zpool with simple zfs send $vol > $file commands, where $vol is a ZFS volume on my smaller zpool and $file is usually an ordinary file on my backup zvol.
During such a backup job the load increases to ~10, most likely due to the compression I enabled on the backup zvol. But basically it does what I want and the job usually completes after 3-4 hours.
I can run this backup script multiple times without any problems when only 16 GB of RAM is installed. I tried all modules in all slots to make sure a memory fault is unlikely. With all 4 modules installed, I am able to run sysutils/memtest86+ for >24 hours and >20 passes without any errors. There are no ECC errors either; I checked this through IPMI after the kernel panics happened.

The panics themselves are always related to ZFS. For example, I had one just last weekend while running the backup job described above. I captured a screenshot of this panic through the KVM console (see attachment).

The panic "invalid ab" is thrown in line 2171 of arc.c.
My knowledge of the internal ZFS code is quite limited, but to me the condition for this panic looks like something that should never happen.

Code:
if (ab->b_type > ARC_BUFC_NUMTYPES)
    panic("invalid ab=%p", (void *)ab);
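To illustrate why I think this condition should never be true: a type field like this is normally only ever assigned one of a few enum values, so an out-of-range value suggests that something else overwrote the header. Below is a minimal sketch of that idea. It is not the real arc.c code; all names starting with sketch_ are made up for illustration, and the stray write is simulated with memset.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the ARC buffer content types. */
typedef enum {
    SKETCH_BUFC_DATA,
    SKETCH_BUFC_METADATA,
    SKETCH_BUFC_NUMTYPES
} sketch_buf_type_t;

/* Hypothetical, heavily simplified buffer header. */
typedef struct {
    char     payload[16];
    uint32_t b_type;   /* only ever assigned a sketch_buf_type_t value */
} sketch_hdr_t;

/* Mirrors the quoted condition: anything above NUMTYPES is "invalid". */
int
sketch_hdr_is_sane(const sketch_hdr_t *hdr)
{
    return hdr->b_type <= SKETCH_BUFC_NUMTYPES;
}

/* Simulates a stray write that tramples the whole header with 0xff bytes. */
void
sketch_stray_write(sketch_hdr_t *hdr)
{
    memset(hdr, 0xff, sizeof (*hdr));
}
```

With a header set up normally, sketch_hdr_is_sane() returns 1; after sketch_stray_write() the type field holds 0xffffffff and the check fails, which is the situation the panic seems to guard against.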

I have no explanation for why this does not happen with less memory. If it were a software bug, the amount of installed memory should make no difference, although a larger ARC due to more memory might make a crash more likely if something is wrong with ARC management.

Has anybody ever had similar problems, or an explanation for what can cause those panics?

Besides those kernel panics I also noticed a different behaviour when running zdb -v on the larger RAIDz2 pool. With 16 GB memory this command seems to run forever. It scans the whole pool and I aborted it after ~12 hours. With 32 GB it aborts after 1-3 hours with the following error:
error: buffer modified while frozen!

This alert is triggered by line 1103 of arc.c.
According to the source code this looks like a checksum failure, which IMHO should never occur either.

Code:
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
    panic("buffer modified while frozen!");
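As far as I understand it, the idea is that a checksum is stored when a buffer is frozen and recomputed later, so any write in between is detected. Here is a minimal sketch of that mechanism under my own assumptions. The sketch_ types and functions are hypothetical, and the checksum is only a simplified Fletcher-2-style sum over 64-bit words, not the actual ZFS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 256-bit checksum. */
typedef struct {
    uint64_t word[4];
} sketch_cksum_t;

/* Hypothetical cached buffer with the checksum taken at freeze time. */
typedef struct {
    uint64_t       data[8];
    sketch_cksum_t freeze_cksum;
    int            frozen;
} sketch_buf_t;

/* Simplified Fletcher-2-style checksum over pairs of 64-bit words. */
void
sketch_fletcher2(const uint64_t *ip, size_t nwords, sketch_cksum_t *out)
{
    uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

    assert(nwords % 2 == 0);   /* words are consumed two at a time */
    for (size_t i = 0; i < nwords; i += 2) {
        a0 += ip[i];
        a1 += ip[i + 1];
        b0 += a0;
        b1 += a1;
    }
    out->word[0] = a0;
    out->word[1] = a1;
    out->word[2] = b0;
    out->word[3] = b1;
}

/* Record the checksum of the current contents and mark the buffer frozen. */
void
sketch_freeze(sketch_buf_t *buf)
{
    sketch_fletcher2(buf->data, 8, &buf->freeze_cksum);
    buf->frozen = 1;
}

/* Returns 1 if the buffer is unchanged since freezing, 0 otherwise. */
int
sketch_verify(const sketch_buf_t *buf)
{
    sketch_cksum_t now;

    if (!buf->frozen)
        return 1;
    sketch_fletcher2(buf->data, 8, &now);
    return memcmp(&now, &buf->freeze_cksum, sizeof (now)) == 0;
}
```

If anything flips even a single bit in data[] after sketch_freeze(), sketch_verify() returns 0. In the real code that is the point where the "buffer modified while frozen!" panic would fire, regardless of whether the write came from a software bug or from bad hardware.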

Since this error only crashed zdb and not the whole kernel, I was able to confirm with sysutils/mcelog that no hardware errors were logged, which I would expect to see if an ECC memory error had happened.

So a similar question: Is it possible that the ARC corrupts itself in certain situations, maybe by writing data into other cache buffers that are still allocated?
I found Thread 41880 regarding ZFS/ARC memory leaks, but I do not think it is related to my problem because nobody there talked about actual panics. When my problem occurs, I am unable to change terminals, I cannot ping the system, and even the NumLock LED does not light up when pressing the NumLock key. That's also why this problem is so severe for me. There's nothing worse than a suddenly freezing server.

I appreciate any hints or help on how to troubleshoot this issue further!
Thanks in advance, and best regards
David
 

Attachments

  • filer-kernel-panic.png