Solved 11.3 zfs checksum errors

Recently I've begun upgrading the servers I administer to FreeBSD 11.3. I had no issues during testing, and everything seemed okay. However, I'm now seeing zpool errors after running zpool upgrade. I may need to file a bug report, but right now I'm more concerned about the state of my servers.

All the machines are Dell servers with ECC RAM. They've passed weekly ZFS scrubs for years since deployment, and are set up in a mirror configuration (non-mirror machines have no issues). I've upgraded all my servers to 11.3, but have only upgraded the zpool on two of them (both with mirrors), and those two are the ones reporting checksum errors. If it were only one machine, I'd say it was a hardware issue. But it's happening on two machines, neither has had problems in years, and they are the only two whose pools I've upgraded, which leads me to believe it's not a hardware problem. 11.3 machines with mirrored drives but without the zpool upgrade still seem fine.

I see the only update to ZFS in 11.3 is the addition of spacemap_v2, which seems like it should be pretty harmless. It also looks like activation is one-way, so I'm stuck with it. Scrubbing again doesn't help; each scrub just reports more checksum errors. I had a similar problem when testing an upgrade to FreeBSD 12 (maybe the same problem), but I found a lot of other blockers to upgrading to 12, so I haven't looked into it much. Is there anything I can do to work around this?

--

Update: I've been fighting with this for a while, and I'm now pretty sure this isn't a ZFS problem but a problem with the mfi driver used by the controller the drives are connected to. I've filed a bug report, but don't have any insights right now. I can sort of work around it by restarting the machine: the errors go away for a while, then return within a few days. The machines are Dell servers with PERC controllers configured as JBOD.

--

Update 2: I filed a bug, and the problem appears to be the use of the mfi driver on some models of Dell servers with PERC controllers. The mfi driver has been superseded by mrsas, but the servers were set up with whatever FreeBSD detected (mfi). Switching to the mrsas driver fixed the problem. This does change the device names; however, ZFS is able to figure out where the pools are despite this, so it's a relatively easy fix.
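
For anyone landing here later, a minimal sketch of how the switch to mrsas can be made. This assumes the tunable described in the mfi(4) and mrsas(4) man pages; it isn't necessarily the exact setting used on these servers, so check the man pages for your release before relying on it.

```
# /boot/loader.conf
# Assumption: this tunable lowers mfi's probe priority so that mrsas
# can attach to controllers that both drivers support.
hw.mfi.mrsas_enable="1"

# Assumption: only needed if mrsas is not already compiled into your kernel.
#mrsas_load="YES"
```

After a reboot the disks should show up under CAM (typically as daN rather than the mfi device names), which is the device name change mentioned above. Running zpool status afterwards confirms the pool came back on the new devices.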
 