zpool degraded, drive faulty using lsi3008 mpr0 11.2-R latest firmware&driver

TFAiSO · Aug 21, 2018

Hi there,

This is my first post on the forum.

I have just moved from FreeNAS to FreeBSD. All is great and I simply love the flexibility of FreeBSD!

I'm having an issue and I wondered if you can spot what is wrong...
zpool status indicates the zpool is degraded due to one faulty drive with 9 read and 120 write errors.
The problem is that these drives are fine. Over and over again I remove the drive, scan it for defects with the manufacturer's scan tool, and find no problems. I reattach the drive, resilver/scrub it, and all is back to normal again.

I'm running 11.2-RELEASE and using the Avago/LSI3008 HBA in IT mode with the default mpr0 driver. This driver is the very latest driver from the manufacturer. The card is flashed with the latest firmware:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x7000-0x70ff mem 0xb8640000-0xb864ffff,0xb8600000-0xb863ffff irq 42 at device 0.0 numa-domain 0 on pci9
mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

The drives are Seagate 10 TB SATA drives. The drive currently in error is:
dmesg | grep -B1 'Serial Number'

da0p2:
serial: ZA21****
model: ATA ST10000NM0016-1T SNB0

dmesg shows this:

Code:

ums1 numa-domain 0 on uhub1
ums1: <vendor 0x0557 product 0x2419, class 0/0, rev 1.10/1.00, addr 4> on usbus0
ums1: 3 buttons and [Z] coordinates ID=0
    (da0:mpr0:0:51:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 268 Aborting command 0xfffffe0001f8e140
mpr0: Sending reset from mprsas_send_abort for target ID 51
mpr0: Unfreezing devq for target ID 51
(da0:mpr0:0:51:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da0:mpr0:0:51:0): CAM status: Command timeout
(da0:mpr0:0:51:0): Retrying command
(da0:mpr0:0:51:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da0:mpr0:0:51:0): CAM status: SCSI Status Error
(da0:mpr0:0:51:0): SCSI status: Check Condition
(da0:mpr0:0:51:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpr0:0:51:0): Error 6, Retries exhausted
(da0:mpr0:0:51:0): Invalidating pack
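
For anyone triaging a similar log, here is a minimal Python sketch (not from the original post) that tallies the "CAM status" lines per device. It may help show whether the errors follow one drive or spread across several. The function name `tally_cam_status` and the regular expression are assumptions based only on the line format in the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as "(da0:mpr0:0:51:0): CAM status: Command timeout".
# The pattern is an assumption derived from the dmesg excerpt in this post.
CAM_STATUS = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+)")


def tally_cam_status(dmesg_text):
    """Count (device, CAM status) pairs found in dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            device, _controller, status = match.groups()
            counts[(device, status.strip())] += 1
    return counts


# A few lines copied from the log above.
sample = """\
(da0:mpr0:0:51:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da0:mpr0:0:51:0): CAM status: Command timeout
(da0:mpr0:0:51:0): Retrying command
(da0:mpr0:0:51:0): CAM status: SCSI Status Error
"""

for (device, status), n in sorted(tally_cam_status(sample).items()):
    print(f"{device}: {status} x{n}")
```

If the tally shows several unrelated devices, the cause is more likely something they share than any single drive.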

Are there any settings you are aware of that can remedy the issue?
* Note: powerd is not running.
* Complete build: https://forums.freenas.org/index.php?threads/xeon-gold-6144-my-first-nas.58422/

Kind Regards,
Rick

abishai · Aug 21, 2018

This is a faulty miniSAS cable. I had the same issue. Try poking it slightly to confirm.

TFAiSO · Aug 21, 2018

Thanks Abishai, I will try removing/reattaching the cables going from the HBA to the motherboard to see if the error persists. If it does, I will replace all miniSAS cables.

Wouldn't it be weird if new original Supermicro cables are shipped bad?!

VladiBG · Aug 21, 2018

Maybe a power supply issue. Do you have another PSU to test with?

ralphbsz · Aug 22, 2018

Read the message. This is not an error per se. The drive is simply reporting that it had its power cycled, or that it was reset. Check power delivery to the drive, and check whether there is any reason some part of the stack should have reset it. The most likely cause is a power delivery problem. It might be workload-dependent: the power supply can handle the drives when they are idle, or when only a few are running, but not when all are running.
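
To make the reading of that sense line concrete, here is a small Python sketch (an illustration, not something from the thread) that pulls the sense key and the ASC/ASCQ pair out of a "SCSI sense" line and looks up the qualifier for additional sense code 0x29. The name `describe_sense` is hypothetical, the table lists only a few 0x29 qualifiers from the SCSI ASC/ASCQ assignments, and the sketch assumes the numbers in the log are hexadecimal.

```python
import re

# A few qualifiers for additional sense code 0x29 (reset-related notifications).
# This is a partial table for illustration, not a full ASC/ASCQ list.
RESET_QUALIFIERS = {
    0x00: "Power on, reset, or bus device reset occurred",
    0x01: "Power on occurred",
    0x02: "SCSI bus reset occurred",
    0x03: "Bus device reset function occurred",
    0x04: "Device internal reset",
}

# Assumed line format, taken from the dmesg excerpt in the first post:
#   "... SCSI sense: UNIT ATTENTION asc:29,0 (...)"
SENSE = re.compile(r"SCSI sense: ([A-Z ]+?) asc:([0-9a-f]+),([0-9a-f]+)")


def describe_sense(line):
    """Return (sense key, asc, ascq, meaning), or None if the line has no sense data."""
    match = SENSE.search(line)
    if match is None:
        return None
    sense_key = match.group(1)
    asc = int(match.group(2), 16)   # assumed to be printed in hex
    ascq = int(match.group(3), 16)
    if asc == 0x29:
        meaning = RESET_QUALIFIERS.get(ascq, "unlisted reset qualifier")
    else:
        meaning = "not a reset notification (outside this sketch)"
    return sense_key, asc, ascq, meaning


line = ("(da0:mpr0:0:51:0): SCSI sense: UNIT ATTENTION asc:29,0 "
        "(Power on, reset, or bus device reset occurred)")
print(describe_sense(line))
```

A UNIT ATTENTION with ASC 0x29 is the drive announcing that it was reset or power cycled, which is why the advice here points at power delivery and the path to the drive rather than at the drive's media.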

VladiBG · Aug 22, 2018

In the past I had similar problems with oxidized contact pads between the hard disk's controller board and the actuator. That caused the hard disk to power-cycle itself. The problem was solved with a pencil eraser and alcohol.

phoenix · Aug 22, 2018

As mentioned above, these kinds of spurious, random, inconsistent errors are generally caused by two things:

bad cables / bad connection to cables, or
insufficient power

Double check all the cables connected to all the drives, both power and data. Check both ends. Unplug/replug them to make sure they are connected properly and snugly. Replace any that wobble.

If the problem persists, then replace the cables plugged into the drive having issues.

If the problem persists, then check the PSU. It may not be able to provide enough power when all the drives are being accessed or when drives are spinning up from suspend/sleep modes. Consider replacing it with a better / more powerful one. Check the power cables coming off the drives to make sure you're not overloading any one connection to the PSU. Spread the drives around on as many separate connections to the PSU as possible.

If the problem persists, check the power going into the system. You may be experiencing brownouts or voltage drops or even spikes on the incoming power. Consider getting an inline UPS (meaning the power going out of the UPS comes from the batteries; power going into the UPS just charges the batteries) so that you get clean, consistent power.

TFAiSO · Aug 25, 2018

Hi there,

The incoming power is pure sine wave and the machine runs dual redundant PSUs.
The chassis is a new SuperChassis 847BE1C4-R1K23LPB with only a third of the drive bays filled. It could of course be that both PSUs are broken at the same time, but I find it more likely that there is something wrong with the cabling somewhere. I will also check for oxidation (bad connections).

As things stand, whenever there is a drive issue I restart the server. On reboot a scrub starts and resilvers the pool, and once the resilver finishes the full array is back ONLINE. The irritating part is that some data is probably recorded somewhere in the ZFS filesystem as "bad" until the disk is wiped or replaced and put back...

I will wait and watch for more issues to see whether this affects only that one drive or whether other drives also show errors.

TFAiSO · Aug 27, 2018

I just logged in to the machine only to discover that two completely different drives are marked as "FAULTY". Luckily, these two drives belong to two separate arrays. I think there is now little doubt that there is a faulty backplane, bad cables, or a broken HBA. I think it's less likely that the HBA has broken down, but... I'll check the wiring and connections first.
Server is powered off now.

SirDice · Aug 27, 2018

I've had this happen fairly recently. We bought two identical servers, and one of them had constant issues while the other ran without issues from the start. On the bad server, random disks would go into an error state, hosing the pool. It turned out to be a faulty port extender (built into the backplane).

Terry_Kennedy · Sep 1, 2018

SirDice said:
I've had this happen fairly recently. We bought two identical servers, and one of them had constant issues while the other ran without issues from the start. On the bad server, random disks would go into an error state, hosing the pool. It turned out to be a faulty port extender (built into the backplane).

I remember that... Christmas a few years ago...
