ZFS: VMware ESXi 6.0U2 fails during iSCSI access to an istgt-based LUN (FreeBSD 10.1)

Two days ago the ESXi host's management failed during regular operation, and the general reason for this was SCSI problems coming from one of the iSCSI-connected LUNs (we have around 10 LUNs like that presented by the FreeBSD system to the ESXi host, but only one of them caused the issue).

The host suddenly became inaccessible from a management point of view, and the management services were not restartable (they hung). VMs living on all the other iSCSI-based datastores were fortunately alive and not affected.
Code:
ESXi host - vSphere 6.0U2
SuperMicro server running FreeBSD 10.1 with istgt-20121028

The vmkernel.log on the ESXi host was flooded:
========================
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0

2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0

2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0

[...the same message repeated continuously...]
=======================================
Based on the information from vmkernel.log, we suspected that the issue was caused by the LUN below, which is backed by the FreeBSD system.
Code:
t10.FreeBSD_iSCSI_DISK______10000084________________________
vmhba32:C0:T9:L0
We disconnected only that single volume from the BSD side, and the host immediately reconnected to the cluster and to vCenter.
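For reference, this is roughly how we double-checked which device and paths the identifier maps to on the ESXi side (a sketch; both esxcli calls take the device identifier shown above):
Code:
# On the ESXi host: details for the suspect device
esxcli storage core device list -d t10.FreeBSD_iSCSI_DISK______10000084________________________

# The iSCSI path(s) behind it (vmhba32:C0:T9:L0 in our case)
esxcli storage core path list -d t10.FreeBSD_iSCSI_DISK______10000084________________________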

Our network team didn't find any network-related events during this period.

We couldn't find a clear reason on the BSD side.

We inspected /var/log/messages, where istgt logs, but had no success finding the reason.
We also didn't find anything unusual about the FreeBSD system in our monitoring system (Zabbix), where we have statistics about CPU, memory, zpool status, ZFS stats counters, etc.
We don't have any failed disks or ZFS objects; everything looks OK.
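For completeness, these are the kinds of checks we ran on the pool side (a minimal sketch; "tank" stands in for our actual pool name):
Code:
# Prints "all pools are healthy" when nothing is wrong
zpool status -x

# Full status with per-vdev read/write/checksum error counters
zpool status -v tank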
The system's dmesg shows some disk errors, but smartctl doesn't report anything wrong in the regular checks, so this disk seems OK:

===================dmesg=========
Code:
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d09
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
(da94:mps1:0:75:0): Info: 0x45dd2d38
(da94:mps1:0:75:0): Field Replaceable Unit: 6
(da94:mps1:0:75:0): Actual Retry Count: 0
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): READ(10). CDB: 28 00 02 e4 30 78 00 00 20 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
(da94:mps1:0:75:0): Info: 0x45dd2d39
(da94:mps1:0:75:0): Field Replaceable Unit: 6
(da94:mps1:0:75:0): Actual Retry Count: 0
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d94
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): READ(10). CDB: 28 00 39 fc e5 bd 00 00 20 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d95
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2dc2
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
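In case it helps someone, a simple way to watch for these CAM errors as they come in (a sketch; on our box the kernel messages land in /var/log/messages):
Code:
# Follow the log and flag the sense data we saw above
tail -F /var/log/messages | grep -E 'MEDIUM ERROR|HARDWARE FAILURE|Check Condition'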
Code:
[root@sm1 ~]# smartctl -a /dev/da94
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.1-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3600057SS
Revision:             0008
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500684c450b
Serial number:        6SL6FJNK0000N3442EQW
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Aug 15 16:28:58 2016 EEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        68 C

Elements in grown defect list: 7

Vendor (Seagate) cache information
  Blocks sent to initiator = 1097810090
  Blocks received from initiator = 2848461732
  Blocks read from cache and sent to initiator = 379368814
  Number of read and write commands whose size <= segment size = 1027613802
  Number of read and write commands whose size > segment size = 4

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 24404.17
  number of minutes until next internal SMART test = 29

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2771573510       88         0  2771573598   2771573598      92921.056           0
write:         0        0         1         1         12      14715.353          11

Non-medium error count:       43


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   24363                 - [-   -    -]
# 2  Background short  Completed                   -   24196                 - [-   -    -]
# 3  Background short  Completed                   -   24028                 - [-   -    -]
# 4  Background short  Completed                   -   23860                 - [-   -    -]
# 5  Background short  Completed                   -   23692                 - [-   -    -]
# 6  Background short  Completed                   -   23524                 - [-   -    -]
# 7  Background short  Completed                   -   23356                 - [-   -    -]
# 8  Background short  Completed                   -   23188                 - [-   -    -]
# 9  Background short  Completed                   -   20501                 - [-   -    -]
#10  Background short  Completed                   -   20333                 - [-   -    -]
#11  Background short  Completed                   -   20165                 - [-   -    -]
#12  Background short  Completed                   -   19997                 - [-   -    -]
#13  Background short  Completed                   -   19829                 - [-   -    -]
#14  Background short  Completed                   -   19661                 - [-   -    -]
#15  Background short  Completed                   -   19493                 - [-   -    -]
#16  Background short  Completed                   -   19325                 - [-   -    -]
#17  Background short  Completed                   -   19157                 - [-   -    -]
#18  Background short  Completed                   -   18989                 - [-   -    -]
#19  Background short  Completed                   -   18821                 - [-   -    -]
#20  Background short  Completed                   -   18653                 - [-   -    -]

Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
=================================================
[root@sm1 ~]# smartctl -H /dev/da94
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.1-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

VMware support was involved and the vSphere logs were uploaded. They tell us that a power-on reset happens if there are pending reservations on the LUN: the LUN cannot process any more commands from the host at that point, the host keeps waiting for the commands to complete, and it eventually exhausts the hostd thread responsible for this action.

In this scenario VMware support wants us to check on the storage side what the LUN was doing at that time, as we do not have logs after "2016-08-12T19:58".
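While waiting for their analysis we also grepped the host logs ourselves along these lines (a sketch):
Code:
# On the ESXi host: look for reservation conflicts and resets around the incident
grep -iE 'reservation|power-on reset' /var/log/vmkernel.log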

We need your help to proceed with the investigation.
 
You have "uncorrectable" errors on the disk, replace it.
 
Thank you for your reply. I wasn't sure whether I should replace it (despite the errors, the disk's smartctl status was OK, and the smartd daemon, which is enabled and configured, didn't send me any information about these errors). I am now going to replace the disk (and I also just created a trigger in Zabbix to catch such errors as they appear; see the sketch below).
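For anyone interested, the trigger is along these lines (a sketch assuming an active Zabbix agent on the host; the regexp and the 300-second window are our choices, not a recommendation):
Code:
# Item (type: Zabbix agent (active)) on host sm1
log[/var/log/messages,"(MEDIUM ERROR|HARDWARE FAILURE)"]

# Trigger: fires if a matching line arrived within the last 5 minutes
{sm1:log[/var/log/messages,"(MEDIUM ERROR|HARDWARE FAILURE)"].nodata(300)}=0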

Regarding my initial question about the host failure and the reason for it: since the istgt volumes are ZFS volumes and zpool status is OK, I think these uncorrectable errors can't be suspected as a possible reason, and we need to keep investigating the actual reason for the failure, right?
 
No, I'm thinking this is the reason. The write failures cause SCSI errors, which tend to hang the bus, causing the whole pool to slow down. It will keep doing that until the disk has been replaced. I've actually had this happen on my home server not too long ago: a bad disk caused all sorts of stuttering and general slowness in the pool. Once the disk was replaced, everything returned to normal.

I'm assuming you're using RAID-Z(2?), so replacing it shouldn't cause too many problems. Try to remove as much of the load as possible while it's resilvering; that will shorten the time needed.
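The replacement itself is the usual zpool replace dance, roughly (a sketch; "tank" and the new device name da100 are placeholders for your pool and disk):
Code:
# Swap the failing member for the new disk and watch the resilver
zpool replace tank da94 da100
zpool status -v tank    # shows resilver progress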
 
Code:
Drive Trip Temperature:        68 C

Seagate ST3600057SS
Temperature, Operating (°C) 5 to 55

If HDDs get too hot for a longer period, they will fail earlier and more often.
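You can also have smartd warn you about temperature, roughly like this (a sketch of a line in /usr/local/etc/smartd.conf; the thresholds are examples):
Code:
# -W DIFF,INFO,CRIT: log a 4C change, warn at 50C, critical at 55C
/dev/da94 -a -W 4,50,55 -m root@localhost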
 
The disks have been replaced (actually we had one more with such errors, so we replaced both of them). Thank you; we will monitor them for errors and temperature from now on.
 