camcontrol issues when initializing a drive as a ZFS pool. Kernel panics

Hello, I first hit this issue on FreeNAS-11.2-RELEASE-U1, but it can be reproduced on plain FreeBSD 11.2 and on 12.0 (where it is even worse). I also posted on the FreeNAS forum, but got no replies there.
Here are my specs:
Dell T310 (Intel(R) Xeon(R) CPU X3470 @ 2.93GHz, 8GB RAM)
Adaptec 3405 (3 Gb/s, for SAS drives in the front "bucket")
Dell Perc H200e crossflashed to IT p20 firmware
HP D2700 25 x 2.5" drive enclosure (expander) with 2 controller cards connected to 2 ports on the H200e
1x Hitachi Travelstar 5K160 (as boot drive in SATA0 on MB)
1x Seagate ST1000LM049-2GH172 1TB drive
1x HGST Travelstar Z5K1000 1TB drive
1x PNY 240G SSD
Here is my issue:
When I create a pool (via the web UI, or the CLI on plain FreeBSD) on the Seagate ST1000LM049-2GH172, I get a lot of errors in dmesg. Sometimes the system gets stuck; sometimes it finally creates the pool, but almost any activity on that disk then hangs the system for minutes. I tried the same on plain FreeBSD 12.0 and it is even worse: the system crashes. Every other drive works perfectly.
What I tried:
  • Installing FreeNAS 9, Ubuntu 18, CentOS 7, FreeBSD 11.2 and 12.0. Ubuntu and CentOS can create a ZFS pool, with good write speed and no errors or warnings.
  • Putting the drive on a motherboard SATA port (same machine): same issues, no difference.
  • Putting the drive in another machine with an onboard SATA controller: same issue with AHCI disabled, no issue with AHCI enabled. Back in the D2700 + HBA, the issues returned.
  • SMART tests on this drive complete with no issues.
Here is drive info:
Code:
=== START OF INFORMATION SECTION ===
Device Model:     ST1000LM049-2GH172
Serial Number:    WGS20TGN
LU WWN Device Id: 5 000c50 0bd9d0328
Firmware Version: SDM1
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Feb 16 20:54:43 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Here is my log (just creating a ZFS pool on the drive):
Code:
    (da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00 length 4096 SMID 391 Aborting command 0xfffffe0001023130
mps0: Sending reset from mpssas_send_abort for target ID 10
mps0: Unfreezing devq for target ID 10
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Retrying command
    (da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00 length 4096 SMID 398 Aborting command 0xfffffe0001023a60
mps0: Sending reset from mpssas_send_abort for target ID 10
mps0: Unfreezing devq for target ID 10
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Retrying command
    (da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00 length 4096 SMID 405 Aborting command 0xfffffe0001024390
mps0: Sending reset from mpssas_send_abort for target ID 10
    (pass1:mps0:0:10:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 420 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 10
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Retrying command
    (pass1:mps0:0:10:0): INQUIRY. CDB: 12 00 00 00 40 00 length 64 SMID 518 Aborting command 0xfffffe000102d7e0
mps0: Sending reset from mpssas_send_abort for target ID 10
    (da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00 length 4096 SMID 517 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 10
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: CCB request completed with an error
(da0:mps0:0:10:0): Retrying command
    (da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00 length 4096 SMID 525 Aborting command 0xfffffe000102e110
mps0: Sending reset from mpssas_send_abort for target ID 10
mps0: Unfreezing devq for target ID 10
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Error 5, Retries exhausted
(da0:mps0:0:10:0): READ(10). CDB: 28 00 74 70 6a 90 00 00 10 00
(da0:mps0:0:10:0): CAM status: SCSI Status Error
(da0:mps0:0:10:0): SCSI status: Check Condition
(da0:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mps0:0:10:0): Retrying command (per sense data)
(da0:mps0:0:10:0): READ(10). CDB: 28 00 74 70 6a 90 00 00 10 00
(da0:mps0:0:10:0): CAM status: SCSI Status Error
(da0:mps0:0:10:0): SCSI status: Check Condition
(da0:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mps0:0:10:0): Retrying command (per sense data)
GEOM_ELI: Device da0p1.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: software
 
You have communications issues with the drive; that's what your errors above show. The mps driver is sending resets because a command is in limbo, and the device driver doesn't know how to get it to finish other than by resetting some part of the communications stack (the SAS HBA, perhaps some SAS expanders, or the drive). The timeout error indicates that a command was sent and never finished. The actual errors you then get from the drive (the "check condition" entries) are not drive errors, but the drive indicating that it was reset (look at the asc/ascq, which is correctly translated to "power on, reset...").

SMART tests probably won't help. We have no reason to believe that the drive is internally faulty; all problems are communications problems. I would check all cabling and check the power supply to the drive (perhaps put it on a power bus of its own temporarily, in case the problems are caused by power supplies being too weak). The next step may have to be replacing the drive, since it is possible that it has a damaged SATA port.
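
For anyone who wants to check this reading of the log themselves, here is a rough Python sketch that tallies the "CAM status" and "SCSI sense" lines per device. The excerpt is shortened from the log posted above; the function name `summarize` and the regular expressions are my own and only assume the line format visible in that log.

```python
import re
from collections import Counter

# Shortened excerpt of the dmesg output posted above.
LOG = """\
(da0:mps0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Retrying command
(da0:mps0:0:10:0): CAM status: CCB request completed with an error
(da0:mps0:0:10:0): CAM status: Command timeout
(da0:mps0:0:10:0): Error 5, Retries exhausted
(da0:mps0:0:10:0): READ(10). CDB: 28 00 74 70 6a 90 00 00 10 00
(da0:mps0:0:10:0): CAM status: SCSI Status Error
(da0:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
"""

# "(da0:mps0:0:10:0): CAM status: Command timeout"
STATUS_RE = re.compile(r"^\s*\((\w+):[^)]*\): CAM status: (.+)$")
# "(da0:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (...)"
SENSE_RE = re.compile(
    r"^\s*\((\w+):[^)]*\): SCSI sense: (.+?) asc:([0-9a-f]+),([0-9a-f]+)"
)


def summarize(log_text):
    """Count CAM statuses and sense keys per device (hypothetical helper)."""
    statuses = Counter()
    senses = Counter()
    for line in log_text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            statuses[(m.group(1), m.group(2).strip())] += 1
            continue
        m = SENSE_RE.match(line)
        if m:
            key = (m.group(1), m.group(2), int(m.group(3), 16), int(m.group(4), 16))
            senses[key] += 1
    return statuses, senses


if __name__ == "__main__":
    statuses, senses = summarize(LOG)
    for (dev, status), n in sorted(statuses.items()):
        print(f"{dev}: {status}: {n}")
    for (dev, sense_key, asc, ascq), n in sorted(senses.items()):
        print(f"{dev}: {sense_key} asc/ascq {asc:#04x}/{ascq:#04x}: {n}")
```

On the full log, the thing to look for is whether anything other than timeouts and UNIT ATTENTION 29/00 (reset) shows up, for example a MEDIUM ERROR sense key, which would point at the drive itself rather than at communications.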
 
I'm sorry, but you probably didn't read my post. My T310 has 2 PSUs, my D2700 has 2, and the other computer (the AMD A6 one) has its own. I even tried different phases in my house, and different UPSes. And for some reason these "cabling/PSU" issues occur only on FreeBSD-based systems.
 
That may all be true ... but all the errors in the log above are communications problems between the computer and the drive.

I admit that it makes no sense that these communications problems only occur with FreeBSD. And given your hardware, it seems indeed unlikely that weak power supplies or bad cabling is the issue. But what other theories of where the problem lies can we formulate?
 
My theory is a bug/feature in camcontrol or the underlying drivers. There was a similar bug, "extended sleep", back in the day, where LSI cards couldn't see that a "super efficient" drive was asleep. Also, as we can see, FreeBSD has no trouble with this drive when it is connected to an "AMD A88X" chipset SATA controller set to AHCI mode. So perhaps some extended AHCI capability that this drive uses doesn't work with FreeBSD's non-AHCI drivers.
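
One way to narrow a driver-level theory like this down is to look at which command actually times out. Below is a minimal Python sketch that decodes the ATA PASS-THROUGH (16) CDB from the log. It assumes the byte layout from the SCSI/ATA Translation (SAT) standard as I understand it; the function name and the small lookup tables are made up for illustration and cover only the values seen in this log.

```python
# Only the values that appear in the log above; not a complete table.
ATA_COMMANDS = {0x06: "DATA SET MANAGEMENT"}
SAT_PROTOCOLS = {6: "DMA"}


def decode_ata_pass_through_16(cdb_hex):
    """Decode an ATA PASS-THROUGH (16) CDB given as hex text (hypothetical helper)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    # 48-bit LBA is interleaved across bytes 7..12 in the SAT layout.
    lba = (
        (cdb[11] << 40) | (cdb[9] << 32) | (cdb[7] << 24)
        | (cdb[12] << 16) | (cdb[10] << 8) | cdb[8]
    )
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,   # byte 1, bits 4:1
        "extend": cdb[1] & 0x01,            # 48-bit command if set
        "t_dir": (cdb[2] >> 3) & 0x01,      # 0 = data goes to the device
        "byt_blok": (cdb[2] >> 2) & 0x01,   # 1 = length counted in blocks
        "t_length": cdb[2] & 0x03,          # 2 = length is in the count field
        "features": (cdb[3] << 8) | cdb[4],
        "count": (cdb[5] << 8) | cdb[6],
        "lba": lba,
        "device": cdb[13],
        "command": cdb[14],
    }


if __name__ == "__main__":
    fields = decode_ata_pass_through_16(
        "85 0d 06 00 01 00 08 00 00 00 00 00 00 40 06 00"
    )
    cmd = ATA_COMMANDS.get(fields["command"], "unknown")
    proto = SAT_PROTOCOLS.get(fields["protocol"], "unknown")
    print(f"command {fields['command']:#04x} ({cmd}), protocol {proto}")
    print(f"features {fields['features']:#06x}, count {fields['count']}")
    if fields["command"] == 0x06 and fields["features"] & 0x01:
        print("TRIM bit is set in the features field")
```

If this decoding is right, the command that keeps timing out is ATA command 0x06 (DATA SET MANAGEMENT) with the TRIM bit set and a count of 8 blocks, which at 512 bytes each matches the "length 4096" in the log. That would suggest looking at how this drive handles TRIM requests sent through the HBA, but I have not verified that on this hardware.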
 