Dell Perc H800 controller errors

Hi All,

I've searched around and can't seem to find anyone else with this exact issue, so I'm hoping someone here can shed some light on it...

We have a FreeBSD server running:

Code:
FreeBSD 8.2-PRERELEASE (GENERIC) #0: Thu Dec 16 14:59:46 PST 2010

It's a Dell R610. It has two MD1200 disk arrays on it, SAS chained together. The controller that manages them is a Perc H800, with the latest firmware available.

The disks are exported as JBOD from the controller and pooled into a ZFS filesystem, which is exported via NFS to the local network.
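
For concreteness, the layout is roughly like this (a sketch only: the pool name "tank" and the mfid device names are stand-ins, not copied from our actual config):

Code:
# Placeholder names throughout; layout is illustrative, not our exact setup.
zpool create tank raidz2 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5
zfs set sharenfs=on tank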

Everything works well most of the time, but every once in a while (like once every few days), the filesystem completely hangs and we see these errors on the console:

Code:
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 61793 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 61823 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 61853 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 61923 SECONDS
(and so on; this is after the filesystem has been hung for a day)

When I start poking around with mfiutil, everything looks OK: the disks are all fine, the volumes are good, and the event logs show no errors. The "Patrol Read" feature is disabled, and the battery is fine.

After running a few of these mfiutil show drives and mfiutil show [volumes|config] commands, the lockup magically frees itself. But I don't want the hangs to happen in the first place, and I certainly don't want to have to run mfiutil by hand to unwedge the box every time. Has anyone seen this before?
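
For the record, these are the kinds of mfiutil(8) queries that seem to unwedge it:

Code:
mfiutil show drives     # physical drive status
mfiutil show volumes    # logical volume status
mfiutil show config     # full controller configuration
mfiutil show events     # controller event log
mfiutil show battery    # BBU status
mfiutil show patrol     # patrol read state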

I've actually tried another H800 controller we had on the shelf as well, just to rule out a hardware problem with the first one, but we see the same behavior on both controllers.

"zpool status" also shows the disks as all OK, and a "zpool scrub" turns up no problems.

Any insight much appreciated!!

-erich
 
Crickets... Maybe I'm posting in the wrong place? Should I be contacting the person who maintains the MFI driver for FreeBSD?
 
I'm looking to do something similar with Nexenta/OpenSolaris.

I have an R710 with an H800 controller connected to a couple of MD1220 arrays. I'm kicking around the idea of creating a ZFS pool over hardware RAID1 disk pairs, hoping to take advantage of the controller cache and its write-back caching policy while keeping the flexibility of ZFS.
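
Roughly what I have in mind is sketched below. The device names are made up, and each device handed to ZFS is assumed to be a hardware RAID1 virtual disk already built on the H800 (via the controller BIOS or the LSI tools), not a bare drive:

Code:
# Each c#t#d# is an H800 RAID1 pair, not a raw disk; names are illustrative.
zpool create tank c2t0d0 c2t1d0 c2t2d0

The trade-off is that with single-device vdevs, ZFS can detect corruption through its checksums but has no redundancy of its own to repair it (short of setting copies=2), so all the rebuild logic lives in the controller.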

I'm worried about cache flush requests from ZFS causing problems with the NVRAM on the controller. At first glance I thought this could have something to do with the problem you're having, but I'm not sure how the controller cache is used when the disks are exported as JBOD.
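
If the flushes do turn out to matter, the usual OpenSolaris-era knob is zfs_nocacheflush, sketched below; it is only safe when every device in the pool sits behind battery-backed controller cache:

Code:
# /etc/system -- stop ZFS from issuing cache flush commands (reboot required).
# Only safe if all pool devices are protected by battery-backed NVRAM.
set zfs:zfs_nocacheflush = 1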
 
Yeah, from what I understand, JBOD configs don't use the controller cache. Which is fine, because ZFS does all its caching in RAM (and my server has plenty of RAM, so I think I'm safe). Not sure about RAID1; that may depend on the controller...
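
On the FreeBSD side, mfiutil(8) can at least report the per-volume cache policy, which might settle the RAID1 question ("mfid0" below is a placeholder for the real volume name):

Code:
mfiutil cache mfid0               # show the current cache settings
mfiutil cache mfid0 write-back    # example: force write-back (assumes a healthy BBU)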
 
I think my problem may be that in the server BIOS, C-states and C1E control are enabled. I've heard whispers on the net that these are bad for FreeBSD at the moment. I'll disable them and report back later. This may affect your R710/H800 setup as well.
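
For anyone following along, the same thing can be checked and pinned from within FreeBSD using the standard ACPI sysctls (C1 here is just the conservative choice):

Code:
# Show which C-states the CPUs advertise and currently allow:
sysctl dev.cpu.0.cx_supported dev.cpu.0.cx_lowest
# Restrict all CPUs to C1 for this boot:
sysctl hw.acpi.cpu.cx_lowest=C1
# To make it persistent, add to /etc/rc.conf:
# performance_cx_lowest="C1"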
 
Actually, you said you were going to use Nexenta/OpenSolaris. Never mind! My issue likely won't affect your setup.
 
Curses! I spoke too soon. It was working for a week and then the same error started popping up again (mfi0: command timed out).

Is there any way I can get in touch with the author? I'm really starting to think it may be a driver issue, perhaps related to the specific hardware I'm using. It looks like the author is Scott Long, but he doesn't seem to be responding at the email address listed in the mfi(4) man page (scottl@FreeBSD.org). If it is a bug, I'd be happy to help fix it.

Does anyone know where he hangs his hat? ;)
 