Rsync push task from FreeNAS causes kernel panic on FreeBSD destination machine

First of all, apologies if this belongs in a different forum. Please advise if I need to move it elsewhere.

Quick summary: I'm running two machines connected by a local network, one running FreeBSD 12.1-p2 and the other running FreeNAS-11.3-U1. I have a daily rsync task set up on the FreeNAS machine to push the FreeNAS data disk contents to the FreeBSD machine. When the rsync runs, it gets partway through and then the destination FreeBSD box crashes.

I have been consulting on the FreeNAS Forum, and this link will provide a few details of the suggestions I received there and some of the things I've tried:

Code:
https://www.ixsystems.com/community/threads/rsync-push-task-causes-kernel-panic-on-freebsd-destination-machine.82965/

The FreeBSD crash is quite repeatable, and results in a kernel panic of this nature. (The addresses change, which CPU is involved changes, but the general form looks like this):

Code:
Fatal trap 12: page fault while in kernel mode                                                                                                                              
cpuid = 0; apic id = 00                                                                                                                                                     
fault virtual address   = 0xfffff80f75d46e00                                                                                                                                
fault code              = supervisor read data, page not present                                                                                                            
instruction pointer     = 0x20:0xffffffff82674048                                                                                                                           
stack pointer           = 0x0:0xfffffe00a0886850                                                                                                                            
frame pointer           = 0x0:0xfffffe00a0886890                                                                                                                            
code segment            = base 0x0, limit 0xfffff, type 0x1b                                                                                                                
                        = DPL 0, pres 1, long 1, def32 0, gran 1                                                                                                            
processor eflags        = interrupt enabled, resume, IOPL = 0                                                                                                               
current process         = 0 (zio_write_intr_6)                                                                                                                              
trap number             = 12                                                                                                                                                
panic: page fault                                                                                                                                                           
cpuid = 0                                                                                                                                                                   
time = 1583081613                                                                                                                                                           
KDB: stack backtrace:                                                                                                                                                       
#0 0xffffffff80c1d297 at kdb_backtrace+0x67                                                                                                                                 
#1 0xffffffff80bd05cd at vpanic+0x19d                                                                                                                                       
#2 0xffffffff80bd0423 at panic+0x43                                                                                                                                         
#3 0xffffffff810a7d2c at trap_fatal+0x39c                                                                                                                                   
#4 0xffffffff810a7d79 at trap_pfault+0x49                                                                                                                                   
#5 0xffffffff810a736f at trap+0x29f                                                                                                                                         
#6 0xffffffff81081a0c at calltrap+0x8                                                                                                                                       
#7 0xffffffff82671979 at arc_access+0x109                                                                                                                                   
#8 0xffffffff8267541e at arc_write_done+0x25e                                                                                                                               
#9 0xffffffff8272b8a1 at zio_done+0x8d1                                                                                                                                     
#10 0xffffffff82726b7c at zio_execute+0xac                                                                                                                                  
#11 0xffffffff80c2fa74 at taskqueue_run_locked+0x154                                                                                                                        
#12 0xffffffff80c30da8 at taskqueue_thread_loop+0x98                                                                                                                        
#13 0xffffffff80b90c23 at fork_exit+0x83                                                                                                                                    
#14 0xffffffff81082a4e at fork_trampoline+0xe                                                                                                                               
Uptime: 23m42s

As discussed on the thread in the FreeNAS forum, I originally was getting an "Unknown error: 122 (122)" in the rsync task log. That proved to be due to corrupted data on the destination pool. The corruption has been cleaned up, but the rsync problem persists. Now the first indication I have of a problem is the notification in the log on the source machine of a broken pipe when the destination machine crashes.

At this point, I'm looking for suggestions for where to look next. Things I've thought of include file size, memory issues, file contents, and file permissions/owner issues. I also updated everything to the latest versions on both FreeBSD and FreeNAS.
 
This forum is mostly populated by users, and is reasonably good at user support. Not very many developers.

Here's what I would do if I were into debugging the code: Compile zio_done, arc_write_done and arc_access with an assembler listing, find out what the code at address arc_access+0x109 does, look at which registers (variables) it uses, and figure out how those variables can have incorrect values. But I have a day job (where I do too much of this kind of stuff), so I won't do it. But a FreeBSD developer might. You should post a PR through the usual channels.

Now, if you want to get your machine back to life: My hunch is that the problem is caused by slightly corrupted ZFS metadata. Try setting up a complete new system from scratch, and copy the file system of the crashing FreeBSD machine to the new machine. You can try to do the copy with ZFS send/receive, or you can use rsync. If that works, try the operation on the new machine. If the problem goes away, then (a) you have worked around it, and (b) you have demonstrated that there is something in the metadata (or less likely data) of the old file system that causes the problem.
 
Sounds like a plan. Thanks for the suggestions. I used to do a lot of this (looking at disassembled code) in my day job (I'm retired now), but it's been a while and I'll have to shake off the cobwebs. I've always found it to be a pain to rebuild a machine from scratch and get everything the way I like it, but since the primary purpose of the destination machine is as a backup for the FreeNAS server, I've only done a little bit of tweaking.

Thanks again.
 
Trap 12 & varying addresses would indicate some bad hardware.
You probably need to find what the kernel has barfed on. Look up the instruction pointer address via nm.
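
A rough sketch of that nm lookup, assuming `nm -n` output (sorted by address, 16 hex digits per address). The helper name `nearest_symbol` is made up for illustration. It should work directly for addresses inside the base kernel, such as frame #1 of the backtrace. Addresses inside a loaded module such as zfs.ko are relocated at load time, so the module's load address (see kldstat) would have to be taken into account first.

```shell
# nearest_symbol ADDR
# Reads `nm -n` style lines ("address type name") on stdin and prints the
# last symbol whose address is <= ADDR, i.e. the function ADDR falls into.
# (Hypothetical helper, not part of FreeBSD.)
nearest_symbol() {
    # Strip a leading 0x, lowercase, and left-pad to 16 hex digits so that
    # a plain string comparison orders addresses correctly.
    addr=$(printf '%s' "$1" | sed 's/^0[xX]//' | tr 'A-F' 'a-f')
    addr=$(printf '%16s' "$addr" | tr ' ' '0')
    awk -v a="$addr" '
        # Appending "" forces a string (not numeric) comparison.
        length($1) == 16 && ($1 "") <= (a "") { best = $3 " (" $1 ")" }
        END { if (best != "") print best }
    '
}

# Example use on the destination machine:
#   nm -n /boot/kernel/kernel | nearest_symbol 0xffffffff80bd05cd
```

Because the input has to be sorted by address, the last matching line is the closest symbol below the address.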
 
Thanks for the comments/pointers, everyone.

After a little more investigation, I'm thinking it could be due to ZFS ARC parameter tuning, which I haven't touched since installation. The rsync was attempting to sync an entire pool with about 900GB of data in one go. When I split the rsync up into pieces, each piece completes without a problem.
I admit that memory issues could be at fault, though I did run a memory test before installing FreeBSD.

zpool status <pool>

reports no errors. (Earlier there were some filesystem errors which were cleared by a scrub).

Haven't been able to get a core dump.
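
For anyone wanting to try the same split, here is a minimal sketch of one way to do it: one rsync run per top-level entry of the source directory. The function name `sync_in_pieces`, the host name and the paths are placeholders, not what I actually have configured in the FreeNAS rsync task.

```shell
# sync_in_pieces SRC DEST
# Runs one `rsync -a` per top-level entry of SRC instead of a single big run.
# With DRY_RUN=1 the commands are printed instead of executed.
# Note: the * glob skips dot-files at the top level of SRC.
sync_in_pieces() {
    src=${1%/}      # drop a trailing slash from the source path
    dest=$2
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue   # empty directory: glob did not match
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf 'rsync -a %s %s\n' "$entry" "$dest"
        else
            rsync -a "$entry" "$dest" || return 1
        fi
    done
}

# Example (placeholder paths):
#   DRY_RUN=1 sync_in_pieces /mnt/tank/data backuphost:/backup/data/
```

Stopping at the first failing piece also shows which part of the data the run was working on when things went wrong.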
 
Perhaps use zfs send instead of rsync.

So much this. ZFS send/recv is what rsync tries (and does quite nicely given the constraints) to be, but with support from the filesystem. It can easily take an hour+ rsync run down to a few seconds. (Depending on the size of filesystem and amount of churn.)
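
A minimal sketch of what an incremental send looks like, assuming made-up names throughout (dataset tank/data, snapshots monday and tuesday, host backuphost, destination dataset backup/data). The helper only prints the commands so they can be reviewed before running. The very first transfer would need a full send (`zfs send tank/data@monday`) before incrementals can follow.

```shell
# incremental_send_cmds DATASET PREV_SNAP NEW_SNAP HOST DEST_DATASET
# Prints (does not run) the commands for one incremental replication step.
# All names passed in are the caller's; nothing here is site-specific.
incremental_send_cmds() {
    ds=$1; prev=$2; cur=$3; host=$4; dest=$5
    # 1. Take the new snapshot on the source.
    printf 'zfs snapshot %s@%s\n' "$ds" "$cur"
    # 2. Send only the blocks that changed between the two snapshots.
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$ds" "$prev" "$ds" "$cur" "$host" "$dest"
}

# Example:
#   incremental_send_cmds tank/data monday tuesday backuphost backup/data
```

The speedup comes from ZFS already knowing which blocks changed between the two snapshots, so nothing has to walk and compare the file tree the way rsync does.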
 
I believe the mystery is solved.

I started by re-installing FreeBSD, and re-organizing my disks from 3 drives in a striped pool to a standalone boot disk and a 2 drive mirror pool. Then tried the rsync again and it still crashed. Then I decided to do a buildworld/installworld, thinking I needed to do so to populate the kernel object modules. But the buildworld crashed.

After suggestions from mark_j and gpb, I decided to go back and re-test memory. Turns out I had a flakey memory stick. First test with 4 x 8GB sticks failed. Removed half and ran again enough times to isolate the culprit. So until I get a replacement memory stick, I'm running on half the memory (16GB), which is more than I really need anyway.

As for the suggestion that I use zfs send: I was doing this at first, but being a bit of a newbie with ZFS, it wasn't clear to me how to use the resulting backup to recover accidentally changed or deleted files, so I thought I'd use rsync instead. Now that I've got this problem sorted, I may go back and spend the time to figure out how to make use of the more efficient mechanism.
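
For the record, a sketch of how single files can be pulled back out of a received snapshot, as far as I understand it: each mounted dataset exposes its snapshots read-only under a `.zfs/snapshot/` directory (hidden by default, but reachable by typing the path). The helper name, mountpoint, snapshot name and file below are made-up examples.

```shell
# snapshot_path MOUNTPOINT SNAPSHOT RELATIVE_FILE
# Prints where a file lives inside a given snapshot of a mounted dataset.
# (Hypothetical helper; the .zfs/snapshot layout itself is standard ZFS.)
snapshot_path() {
    mnt=${1%/}      # drop a trailing slash from the mountpoint
    snap=$2
    rel=${3#/}      # drop a leading slash from the file path
    printf '%s/.zfs/snapshot/%s/%s\n' "$mnt" "$snap" "$rel"
}

# Example: restore yesterday's copy of an accidentally changed file.
#   zfs list -t snapshot                      # see which snapshots exist
#   cp "$(snapshot_path /backup/data tuesday docs/report.txt)" \
#      /backup/data/docs/report.txt
```

So recovery is an ordinary cp from the snapshot directory, with no need to roll the whole dataset back.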

Many thanks to all who commented.
 