Reboot during ZFS send/recv

I'm having a pretty repeatable problem on FreeBSD 9.1 with my servers either hanging or rebooting during a ZFS send/receive, or under heavy ZFS load.

I have a few OpenSolaris (Nexenta) servers, a few NAS4Free servers (based on FreeBSD 9.1), and a few FreeBSD 9.1 servers, all running ZFS with lots of disks behind them.

The issue I'm seeing is that when I'm sending a ZFS snapshot via SSH to, or from, one of the FreeBSD-based machines, it will either hang or reboot somewhere within the transfer. I will say that I'm using current LSI IT-mode SAS controllers with commercial Supermicro SAS expanders, and a mixture of SAS and SATA drives. The problem has presented itself both on systems with SAS drives and on systems with SATA drives. Of course, the /var/log/* files are empty of any info that would help. What should my next step be to try and troubleshoot the issue? How do I go about collecting crash-type information? I can generally crash it within a day or two.
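
My rough guess at the crash-dump side, in case it helps frame an answer: I assume the standard route is something like the sketch below, with the dump going to a swap device and savecore(8) pulling it into /var/crash on the next boot, but I haven't verified that this would actually catch a hard hang rather than a panic.
Code:
# /etc/rc.conf -- sketch only; the swap device needs to be at least RAM-sized
dumpdev="AUTO"          # or an explicit swap device
dumpdir="/var/crash"    # where savecore(8) drops the dump at next boot

# after the next crash/reboot, roughly:
#   crashinfo /var/crash/vmcore.0
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0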

I'm also getting the following on one of the larger machines that crashes when doing write I/O out to the disks:
Code:
Jun  6 03:59:46 hostname kernel: (da2:mps0:0:10:0): WRITE(10). CDB: 2a 0 20 8a ed d5 0 0 d3 0 length 108032 SMID 737 terminated ioc 804b scsi 0 state c xfer 0
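
Next I'm planning to poke at that drive and the mps(4) path to it; roughly what I have in mind (smartctl is from the smartmontools port, and da2 here is just the device from the message above):
Code:
camcontrol devlist        # confirm which expander/enclosure da2 sits behind
smartctl -a /dev/da2      # drive health and error log pages (smartmontools port)
dmesg | grep -i mps       # controller/firmware revision and any earlier errors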

Thanks!
 
Is it the sending or the receiving side that is crashing?

How much RAM in the box that's crashing?

What is the ARC set to on the box that's crashing?

Do you have dedupe or compression enabled on the box that's crashing?

Have you watched top(1) during the send/recv to see how RAM/ARC is being used?

What's the pool layout on the crashing box?

How full is the pool on the crashing box?

How large are the snapshots you are sending? Are they mostly new data, old data, full of deletes, etc?

Need more information. :)
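
For the ARC and top(1) questions, something like this run on the crashing box while a send is in flight would show what I'm after (sysctl OID names from memory on 9.x, so treat it as a sketch):
Code:
# current ARC size and the configured floor/ceiling
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_min vfs.zfs.arc_max
# FreeBSD top(1) prints an ARC line alongside the usual memory stats
top -b | head -12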
 
I have all sorts of fun info :)
  • I've seen both the sending and the receiving side crash, but it's always the FreeBSD box, never the OpenSolaris/Nexenta boxes. I thought adding swap might help on the sending embedded box that was crashing, but it still rebooted again.
  • One box running FreeBSD (NAS4Free) embedded on USB has 24 GB, another one running FreeBSD 9.1-RELEASE-p3 has 48 GB.
  • I have no special ARC settings defined in sysctl.conf.
  • I do not have dedupe enabled, but I do have compression=on.
  • I did not have top running, as this might run for two hours without issue, or for 48 hours. I have a looping ZFS send script that's transferring fulls and incrementals to get as close to real time as possible (roughly the shape of the sketch further down).
  • The pool layout on the smaller machine is 20 drives in two-drive mirror VDEVs, with a single cache SSD and two mirrored ZILs.
  • The pool layout on the larger machine is 86 drives in four-drive RAID-Z2 VDEVs, with a pair of cache SSDs and a pair of mirrored ZILs, booting from a pair of GEOM-mirrored UFS drives.
  • The pools are around 60% full on the smaller machine, 40% on the larger machine.
  • The snapshots are fulls first, then incrementals. I'd say there are ten datasets on the smaller machine and ten datasets on the larger machine. They are not necessarily 1:1, but ZFS send/recv between the two is when it fails the most.
It's definitely dying on initial sends of datasets. Sometimes it makes it through a 3 TB dataset; other times it chokes on a 250 GB one. The data itself is millions of smallish files. There's not much churn in the data, but there is definitely churn in the metadata, as files are moved around, renamed, etc. by the applications we run.
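
For reference, the looping send script is roughly this shape; the dataset name, destination host, and interval here are placeholders, and the real script does more bookkeeping around which snapshot to use as the incremental source:
Code:
#!/bin/sh
# Rough shape of the looping send script -- names and host are placeholders.
DATASET="tank/data"
DEST="backuphost"
PREV=""
while true; do
    SNAP="${DATASET}@auto-$(date +%Y%m%d%H%M%S)"
    zfs snapshot "${SNAP}"
    if [ -z "${PREV}" ]; then
        # first pass: full send
        zfs send "${SNAP}" | ssh "${DEST}" zfs recv -F "${DATASET}"
    else
        # afterwards: incremental from the previous snapshot
        zfs send -i "${PREV}" "${SNAP}" | ssh "${DEST}" zfs recv -F "${DATASET}"
    fi
    PREV="${SNAP}"
    sleep 300
done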

For another piece of information, I have another machine that's almost identical to the larger machine, and it's running 'better'. It was also crashing when running NAS4Free (based on FreeBSD 9.1-RELEASE-p3) from USB, but it seemed to stabilize after I did a full FreeBSD install on it.
 
I agree, but SAS drives on SAS expanders should be pretty straightforward. I have plenty of them running under OpenSolaris/Nexenta without issue, but am trying to get away from that platform.
 
SAS expanders have firmware, so they put another level of "interaction" between your FS and the disks. That's not good from my point of view (and the OpenIndiana forums are full of horror stories on this topic). :)
 
Agreed, and I have a couple of systems with these same SAS expanders that work fine. But then I have these that are crapping out on me. I have an RMA in with my vendor on the backplanes, so we'll see if that makes a difference. I'm also going to load FreeBSD 8.4 on one of my troubled machines and see if that helps.
 