Hard drive error questions

cjr · Feb 24, 2021

Good morning. I have two hard drives logging some errors. I've done some research, but can't find exactly what the errors mean. Both drives that are logging errors are plugged into a PCI-to-SATA expansion card. The errors are below:

Code:

Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): CAM status: ATA Status Error
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): ATA status: 71 (DRDY DF SERV ERR), error: 04 (ABRT )
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): RES: 71 04 00 00 00 00 00 00 00 00 00
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): Retrying command, 0 more tries remain

I'd suspect it was the drive if it weren't both drives plugged into the same card. Thanks. --jake

SirDice · Feb 24, 2021

Check with smartctl(8) and run the tests. I'm betting they're just at the end of their lifetime.

cjr · Feb 24, 2021

They're refurbs from Amazon. I've run the SMART tests. No issues. I'm on 12.2-RELEASE btw.
Edit: Power On Time is 5943 hours, 247 days.

SirDice · Feb 24, 2021

Did you try them on another controller? I had an old Promise ATA300 card that decided to die one day, the controller chip on the card got really hot (as in can't even touch it anymore hot) and it gave all sorts of problems when there was more than one drive attached to it. So it could be the controller that's causing the problems.

cjr · Feb 24, 2021

I have not yet. I will try this weekend. Can you explain what the error codes mean? ATA status 71 for instance? I googled around a bit, but haven't found much. Thanks. --jake

SirDice · Feb 24, 2021

Don't know, it's usually just some status from the driver. More often than not the fault is actually with the drive itself.

cjr said:
Edit: Power On Time is 5943 hours, 247 days.

That's fine, it could still be broken. I have drives that worked for years on end and drives that broke within a few months, power on time is just an indication of the age of the drive. What do 197 Current_Pending_Sector and 198 Offline_Uncorrectable tell you?
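
For anyone who wants to pull those two attributes out automatically, here is a minimal Python sketch. It assumes the usual whitespace-separated attribute table that `smartctl -A` prints (ID in the first column, attribute name in the second, raw value in the tenth) and that the raw value is a plain integer. The function name `pending_and_uncorrectable` is made up for illustration and is not part of smartmontools.

```python
def pending_and_uncorrectable(smartctl_output):
    """Return {attribute_name: raw_value} for SMART attributes 197 and 198.

    Assumption: each attribute row has at least 10 whitespace-separated
    columns, with the ID first, the name second and the raw value tenth.
    """
    wanted = {197, 198}
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in wanted:
            found[fields[1]] = int(fields[9])
    return found


# Hypothetical usage (needs smartmontools installed and enough privileges):
#   import subprocess
#   text = subprocess.run(["smartctl", "-A", "/dev/ada0"],
#                         capture_output=True, text=True).stdout
#   print(pending_and_uncorrectable(text))
```

Non-zero raw values for either attribute would point at the drive rather than the cabling or controller.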

olli@ · Feb 24, 2021

First thing is to check the cables. For SATA-III drives, be sure to use SATA-III-certified cables. Also, older cables don’t have clips, and sometimes they slowly lose contact due to vibration inside the PC case. Also check the drives’ power connectors.

BTW, can you please post the output from smartctl -a /dev/ada0?

diizzy · Feb 24, 2021

SirDice said:
Did you try them on another controller? I had an old Promise ATA300 card that decided to die one day, the controller chip on the card got really hot (as in can't even touch it anymore hot) and it gave all sorts of problems when there was more than one drive attached to it. So it could be the controller that's causing the problems.

If it's PCI it's most likely ancient and pre AHCI so that wouldn't be too surprising.

SirDice · Feb 24, 2021

diizzy said:
If it's PCI it's most likely ancient and pre AHCI so that wouldn't be too surprising.

It had been running with 4 drives attached for a number of years. No stellar performance of course, but it worked without problems. Until it decided it couldn't take it any more and just burned out. Started getting a LOT of status errors and time-outs. When I touched the controller chip I knew its time was up. Bought a second-hand LSI-based SAS/SATA card to replace it. Much better performance (PCIx4 slot). It's still running with those 4 drives attached. Not the same 4 drives though, I think I replaced all of them at least once since then.

ralphbsz · Feb 25, 2021

I think "DRDY DF SERV ERR" is some sort of communications error, not a head/platter error. So all the discussion above about cables and chips seems more important than smartctl.

olli@ · Feb 25, 2021

ralphbsz said:
I think "DRDY DF SERV ERR" is some sort of communications error, not a head/platter error. So all the discussion above about cables and chips seems more important than smartctl.

Yes, the status is a bit mask: The value “71” from the driver message is a hexadecimal number (0x71) that consists of:

bit 6 (0x40 = DRDY) “drive ready” – This is normal. It means that the drive is ready to receive commands.
bit 5 (0x20 = DF) “device fault” – This is not good.
bit 4 (0x10 = SERV) “overlapped mode service request” – This flag depends on the command, in this case FLUSHCACHE48. Not sure what it means in this context.
bit 0 (0x01 = ERR) – An error condition, further details are in the error value.

The error value in this case is 0x04 = ABRT. That means, the error condition was caused by an aborted command, in this case that was a FLUSHCACHE48 command. The ABRT error can be caused by an invalid command or by a device error.
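
The decoding above can be sketched in a few lines of Python. This is only an illustration: the tables contain just the flags named in this post, so any other bit that happens to be set is printed as its hex mask instead of a name, and the helper name `decode` is made up.

```python
# Flag names as given in the explanation above; other bits are left unnamed.
STATUS_FLAGS = {0x40: "DRDY", 0x20: "DF", 0x10: "SERV", 0x01: "ERR"}
ERROR_FLAGS = {0x04: "ABRT"}


def decode(value, flags):
    """Return the names of all set bits in an 8-bit value, highest bit first.

    Bits without an entry in `flags` are reported as their hex mask.
    """
    names = []
    for bit in range(7, -1, -1):
        mask = 1 << bit
        if value & mask:
            names.append(flags.get(mask, f"0x{mask:02x}"))
    return names


# The driver prints the values as hexadecimal without a prefix ("71", "04").
print(decode(int("71", 16), STATUS_FLAGS))  # ['DRDY', 'DF', 'SERV', 'ERR']
print(decode(int("04", 16), ERROR_FLAGS))   # ['ABRT']
```

The first two bytes of the RES line in the log (71 04) are the same status and error values.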

Are these “WD Red” drives (e.g. WD40EFRX or WD60EFRX), by any chance?

cjr · Feb 27, 2021

olli@ said:
Yes, the status is a bit mask: The value “71” from the driver message is a hexadecimal number (0x71) that consists of:

bit 6 (0x40 = DRDY) “drive ready” – This is normal. It means that the drive is ready to receive commands.

bit 5 (0x20 = DF) “device fault” – This is not good.

bit 4 (0x10 = SERV) “overlapped mode service request” – This flag depends on the command, in this case FLUSHCACHE48. Not sure what it means in this context.

bit 0 (0x01 = ERR) – An error condition, further details are in the error value.

The error value in this case is 0x04 = ABRT. That means, the error condition was caused by an aborted command, in this case that was a FLUSHCACHE48 command. The ABRT error can be caused by an invalid command or by a device error.

Are these “WD Red” drives (e.g. WD40EFRX or WD60EFRX), by any chance?

They are HGST Ultrastar 7K4000s that I bought as refurbs off Amazon to, hopefully, avoid the whole SMR debacle. Also, re-seating all the cables and the PCI card seems to have done the trick. Thanks, everyone.

Snurg · Feb 27, 2021

cjr said:
re-seating all the cables and the PCI card seems to have done the trick

Hehe, I always exorcise these first, because evil demons love to hide in this area.
Saves a lot of headache and time.

aponomarenko · Mar 5, 2021

Run diagnostics with hw-probe (https://www.freshports.org/sysutils/hw-probe/) and post the returned ID here for investigation.
