Reports of Hard Error and Vm_fault but SATA RAID (DELL CERC) reports ok

beanman · Feb 12, 2011

One FreeBSD 6.2-p11 system, which has been extremely stable for years, has started to have a problem.

DELL PE750
CERC SATA RAID
2x250GB WD SATA HD

The OS sees one hard drive, aacd0

The system has crashed a couple of times a week for the last 2 weeks. When it does, it still responds to pings and accepts some connections, but those then hang, and it has to be soft/hard rebooted (with a background fsck) to put it back into proper service.

I finally saw the console during one of these crashes. (Other times I simply had the system rebooted and followed up to find nothing of any value in the logs explaining the crash. The hd reported small corruptions that could just as easily have been attributed to writing to open files when the crashes occurred.)

Console reported (from my jotted down notes):

Code:

aacd0 hard error
vm_fault: pager read error
specifically reported that aacd0s1e had a write issue
specifically reported that aacd0s1f had a read issue
specifically reported that aacd0s1h had a read issue

and was hung.

I rebooted into single user mode, did a proper fsck, and was able to return the machine to service. Before rebooting multi-user I checked the RAID BIOS report again (everything ok, and S.M.A.R.T. was "Y" or in good shape). I did not use the BIOS tool to verify media.

So I have no way to tell if this really was a hd issue, and if so, which hard drive. I assumed that when one hard drive began to fail, I would get a report from the RAID device, swap in a new drive and resync. By the way, the two drives are mirrored (RAID 1). Instead, RAID says everything is ok, and the OS says there's an i/o or swap issue when it hangs.

Thoughts and advice?

Thank you.

beanman · Feb 15, 2011

Another Angle

Thanks to the many viewers who read my post, but obviously there wasn't enough to go on for a reply.

To look at this from another point of view, the same server just rebooted, must have been from a panic, and most probably related to the issue I first posted.

The logs (dmesg and /var/log/messages) don't show anything out of the ordinary, just a 2-3 minute gap as the server rebooted. No mention of panic or the cause.

Where else might I be able to look for some kind of indication as to the problem that caused the panic/reboot?
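One place to start is the current and rotated system logs, filtered for the messages seen on the console. The following is only a sketch: the function name `scan_disk_errors` is made up for illustration, and the pattern assumes the kernel logs the same wording that was jotted down from the console ("hard error", "vm_fault", "pager read error"), which may not match exactly.

```shell
#!/bin/sh
# scan_disk_errors (hypothetical helper): print log lines that mention the
# aac(4) controller/disk or the paging errors noted from the console.
# With file arguments it reads those files; with none it reads stdin.
scan_disk_errors() {
    pat='aacd?[0-9]+|vm_fault|pager read error|hard error'
    if [ "$#" -gt 0 ]; then
        grep -iE "$pat" "$@"
    else
        grep -iE "$pat"
    fi
}
```

Typical use would be `scan_disk_errors /var/log/messages`, or `bzcat /var/log/messages.*.bz2 | scan_disk_errors` for rotated logs, assuming they are bzip2-compressed on this system. If the disk itself was unavailable at the time of the crash, nothing may have been written at all.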

Can someone explain to me the relationship between the OS seeing the RAID as a single device and how read/write errors are handled?

Thank you

tingo · Feb 15, 2011

No core dump in /var/crash?

Most likely, your hardware (specifically, the disk drives) is starting to fail. Watch out for other things, like fans that need cleaning, or perhaps the machine is so old that the thermal paste between the cpu and the heatsink should be replaced.

Tip: run memtest86+ for 24 hours or more; it runs from memory, and will help tell you if there is something wrong with hardware other than the disk drives (ok, it probably won't help you if the disk controller is failing).

beanman · Feb 16, 2011

logging and testing, and imap

Thanks for the reply.
It appears /var/crash is disabled, as there has only ever been the minfree file in there.
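For reference, crash dumps on FreeBSD are normally switched on through rc.conf. The fragment below is a minimal sketch, assuming the `dumpdev` and `dumpdir` knobs described in rc.conf(5) behave on 6.2 as they do on later releases; if `AUTO` is not accepted, the swap device has to be named explicitly. The partition name `aacd0s1b` in the comments is a guess at the swap partition, not something confirmed in this thread.

```shell
# /etc/rc.conf fragment (sh syntax); values are assumptions to adapt.
dumpdev="AUTO"        # use a configured swap device for kernel dumps
dumpdir="/var/crash"  # where savecore(8) writes the dump at next boot

# To set the dump device without a reboot (as root, on the FreeBSD host):
#   dumpon /dev/aacd0s1b                  # hypothetical swap partition name
# After a panic, if the rc scripts did not already save the dump:
#   savecore /var/crash /dev/aacd0s1b
```

The dump device generally needs to be large enough to hold the dump, and if the panic is caused by the disk subsystem itself the dump may never make it to disk.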

I've got to figure out whether it's the RAID device that is failing or one of the drives, and if it's a drive, which one.

I guess I could always buy another matching HD, take down the server, choose one drive, rebuild the RAID and wait. If it still happens, swap in the other drive and rebuild again. But I was hoping for more than a shot-in-the-dark solution.

On a related note: IMAP (UW) connections take forever, and sometimes don't sync for a long time after reboot. Understandably there is a background fsck running, but it's really my initial indicator that something is up or about to go wrong with the server. Processes are active, but the clients spin their wheels for a long time or don't sync at all. I upgraded to imapd2007 many months ago, and that seemed to make the issue go away for a little while, but it's back with a vengeance. I presume it has something to do with the HD issue but want to float that for any responses. I have no way to tell if a file is locked or what the processes are doing during this long sync period.
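One rough way to see what the processes are doing during such a stall is to look for processes in state "D" (disk or other uninterruptible wait) in the `ps` output. This is only a sketch; the helper name `stuck_in_disk_wait` is invented here, and it assumes the state is the second column, as produced by `ps -axo pid,state,wchan,command`.

```shell
#!/bin/sh
# stuck_in_disk_wait (hypothetical helper): keep the header line plus every
# process whose state column starts with "D" (uninterruptible wait, which is
# usually disk I/O). Expects `ps -axo pid,state,wchan,command` on stdin.
stuck_in_disk_wait() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Intended use on the server:
#   ps -axo pid,state,wchan,command | stuck_in_disk_wait
# Open files of one suspect process can then be listed with fstat(1):
#   fstat -p PID
```

If the imapd processes sit in "D" state for long stretches, that would point at the storage rather than at IMAP locking, though it does not say which drive is at fault.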

kisscool-fr · Feb 16, 2011

If you think you have an issue with one hdd, you can check your drives with sysutils/smartmontools.

It is supposed to check the SMART status of drives directly connected to the motherboard. It also supports some RAID cards, but I don't know if it supports yours.

beanman · Feb 16, 2011

Smartmon and CERC/PERC on FreeBSD

I've just installed smartmon, but I don't think it supports DELL CERC/PERC.

Would anyone be able to confirm this?

Code:

smartctl -a --device=3ware,0 /dev/aacd0
smartctl 5.40 2010-10-16 r3189 [FreeBSD 6.2-RELEASE-p11 i386] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/aacd0: 3ware controller type unknown, use /dev/tweX or /dev/twaX devices
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N, cciss,N, auto, test <=======

beanman · Feb 16, 2011

Smarttools says:

Code:

LSI MegaRAID SAS RAID controller /
Dell PERC 5/i,6/i controller 
Use: -d megaraid,N

But it doesn't seem to work for me.

Code:

/dev/aacd0: Unknown device type 'megaraid,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N, cciss,N, auto, test <=======
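The "VALID ARGUMENTS" line in that output already says which `-d` types this particular smartctl build accepts, so it can be checked before guessing at further options. A small sketch follows; the function name `smartctl_accepts_type` is hypothetical, and it assumes the list arrives on a single line in the format shown above.

```shell
#!/bin/sh
# smartctl_accepts_type TYPE LINE (hypothetical helper)
# Return 0 if TYPE (e.g. "megaraid") is one of the device types listed in a
# smartctl "VALID ARGUMENTS ARE:" line, ignoring suffixes such as ",N".
smartctl_accepts_type() {
    printf '%s\n' "$2" | awk -v want="$1" '
        {
            sub(/^.*VALID ARGUMENTS ARE: */, "")  # drop the leading banner
            sub(/ *<=+.*$/, "")                     # drop the trailing arrow
            sub(/ +$/, "")                          # drop trailing spaces
            n = split($0, types, ", ")
            for (i = 1; i <= n; i++) {
                base = types[i]
                sub(/[[,].*$/, "", base)            # "3ware,N" -> "3ware"
                if (base == want) found = 1
            }
        }
        END { exit(found ? 0 : 1) }'
}
```

Fed the line quoted above, the check succeeds for `3ware` and `cciss` and fails for `megaraid`, which matches the "Unknown device type" error: this build simply has no megaraid device type.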

kisscool-fr · Feb 16, 2011

Yes, I wasn't sure but it confirms what I thought. You have an LSI card.

There is a lack of support for LSI's cards in FreeBSD's smartmontools version. I had this problem too.

Are you able to boot a LiveCD like systemrescuecd? It is based on Gentoo, so it has a Linux kernel. It has smartmontools integrated with support for LSI's cards (it should).

Just one thing I'm not sure about is whether it has the module for this card integrated/loaded by default.

beanman · Feb 16, 2011

CERC RAID Monitoring in FreeBSD

Thank you,

I was trying to avoid bringing the server down, obviously, until I knew exactly what the problem was. But I will schedule some maintenance. Upon bootup the RAID BIOS doesn't indicate a problem, but I'll come equipped with a LiveCD or similar, and maybe even run the Verify media tool in the RAID BIOS.

I'll pick up one extra hard drive, expecting to do a swap and rebuild regardless and that should cover it.

I now know that the RAID device isn't going to be reporting problems to the OS (unless there's a smartmon or equivalent). I'm just surprised that a problem is manifesting itself in this way as this is why I went with RAID on this server to begin with.

beanman · Feb 16, 2011

MFIP driver

Ok, so apparently there is FreeBSD Support for Smartmon LSI / Dell CERC/PERC via something called mfip, a pass-through driver.

From Sourceforge:
Support of LSI MegaRaid on FreeBSD is implemented with mfip.ko module and /dev/passX devices.

http://lists.freebsd.org/pipermail/freebsd-hackers/2010-January/030351.html

I don't know:
What mfip is, how to get it or configure it for (FreeBSD 6.x). I figure it gets loaded with kldload and detected with kldstat. It's the first time I would be using either of those commands.
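For anyone else new to those two commands: `kldstat` lists the kernel modules currently loaded and `kldload` loads one by name. The sketch below checks the `kldstat` listing for a module file before loading it. The helper name `kld_is_loaded` is made up, and whether an `mfip.ko` module exists at all for FreeBSD 6.x is not established here.

```shell
#!/bin/sh
# kld_is_loaded FILE (hypothetical helper): return 0 if FILE (e.g. "mfip.ko")
# appears in the Name column of `kldstat` output supplied on stdin.
kld_is_loaded() {
    awk -v name="$1" '
        NR > 1 && $NF == name { found = 1 }   # NR > 1 skips the header row
        END { exit(found ? 0 : 1) }'
}

# Intended use on the FreeBSD host, as root:
#   kldstat | kld_is_loaded mfip.ko || kldload mfip
# To load it at every boot, the usual convention is a line in
# /boot/loader.conf:
#   mfip_load="YES"
```

`kldload` fails with an error if the module file cannot be found, which would itself answer the question of whether the module is available on this release.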

beanman · Feb 16, 2011

MFI vs. AMR

Well, it looks like my CERC version 4/SC isn't supported by the mfi driver anyway; according to the mfi(4) man page, I'm supposed to use the amr(4) driver.

And smartmon doesn't work with amr so I think that's the end of the line.

If anyone knows of any other FreeBSD compatible s.m.a.r.t. drive monitoring software that works with DELL PE750 with CERC 4/SC kindly send it my way.