Two days ago our ESXi host's management failed during regular operation. The general cause was SCSI problems coming from one of the iSCSI-connected LUNs (we have around 10 LUNs like that, presented by a FreeBSD system to the ESXi host, but only one of them caused the issue).
The host suddenly became inaccessible from a management point of view, and the management services could not be restarted (they hung). Fortunately, the VMs living on all the other iSCSI-based datastores stayed alive and were not affected.
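For completeness, this is roughly what we tried from the ESXi shell to restart the management agents (a sketch of the standard procedure, not an exact transcript of our session):
Code:
# Restart the core management agents (these hung in our case)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
# Or restart all management services at once
services.sh restart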
Based on the information from vmkernel.log, we suspected that the issue was caused by the LUN shown below, which is backed by the FreeBSD box.
We disconnected only that single volume on the FreeBSD side, and the host immediately reconnected to the cluster and to vCenter.
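Roughly how a single volume can be detached on the istgt side (a sketch; the section name below is hypothetical, and the config path assumes the FreeBSD port's default):
Code:
# Comment out or remove the LogicalUnit section for the problem LUN,
# e.g. a hypothetical [LogicalUnit9] block, in:
#   /usr/local/etc/istgt/istgt.conf
# then restart the target so only that LUN disappears from the host:
service istgt restart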
Code:
ESXi host: vSphere 6.0 U2
Storage: SuperMicro server running FreeBSD 10.1 with istgt-20121028
vmkernel.log on the ESXi host was flooded with these messages:
========================
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.352Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
2016-08-12T19:58:16.353Z cpu12:4956844)ScsiCore: 1609: Power-on Reset occurred on vmhba32:C0:T9:L0
=======================================
Code:
t10.FreeBSD_iSCSI_DISK______10000084________________________
vmhba32:C0:T9:L0
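For anyone retracing this, the device id above can be tied back to its datastore and paths from the ESXi shell with standard esxcli commands (a sketch):
Code:
# Device details for the FreeBSD-backed LUN
esxcli storage core device list -d t10.FreeBSD_iSCSI_DISK______10000084________________________
# Which VMFS datastore extents live on which devices
esxcli storage vmfs extent list
# State of the iSCSI paths to the device
esxcli storage core path list -d t10.FreeBSD_iSCSI_DISK______10000084________________________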
Our network team did not find any network-related events during this period.
We could not find a clear reason on the FreeBSD side either.
We inspected /var/log/messages, where istgt logs, but had no success finding the reason.
We also did not see anything unusual on the FreeBSD system in our monitoring (Zabbix), where we keep statistics on CPU, memory, zpool status, ZFS-STATS counters, etc.
There are no failed disks or ZFS objects; everything looks OK.
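For reference, these are the kinds of checks we ran on the FreeBSD box (a sketch, not an exact transcript; adjust the date pattern to your syslog's local timezone):
Code:
# istgt logs via syslog; look around the incident window
grep istgt /var/log/messages | grep 'Aug 12'
# Pool health: -x prints only unhealthy pools, -v would list damaged objects
zpool status -x
zpool status -v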
The system's dmesg shows some minor disk errors, but smartctl from our regular checks does not report anything wrong, and the disk seems to be OK:
===================dmesg=========
Code:
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d09
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
(da94:mps1:0:75:0): Info: 0x45dd2d38
(da94:mps1:0:75:0): Field Replaceable Unit: 6
(da94:mps1:0:75:0): Actual Retry Count: 0
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): READ(10). CDB: 28 00 02 e4 30 78 00 00 20 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
(da94:mps1:0:75:0): Info: 0x45dd2d39
(da94:mps1:0:75:0): Field Replaceable Unit: 6
(da94:mps1:0:75:0): Actual Retry Count: 0
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d94
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): READ(10). CDB: 28 00 39 fc e5 bd 00 00 20 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2d95
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
(da94:mps1:0:75:0): Retrying command (per sense data)
(da94:mps1:0:75:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da94:mps1:0:75:0): CAM status: SCSI Status Error
(da94:mps1:0:75:0): SCSI status: Check Condition
(da94:mps1:0:75:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
(da94:mps1:0:75:0): Info: 0x45dd2dc2
(da94:mps1:0:75:0): Field Replaceable Unit: 8
(da94:mps1:0:75:0): Actual Retry Count: 24
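To tie the complaining device back to a pool, we used the usual tools (a sketch with standard FreeBSD commands):
Code:
# Identify the physical disk behind da94
camcontrol devlist | grep da94
# Check whether any pool references it and with what error counters
zpool status -v | grep -B 2 -A 2 da94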
Code:
[root@sm1 ~]# smartctl -a /dev/da94
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.1-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: 0008
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500684c450b
Serial number: 6SL6FJNK0000N3442EQW
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Aug 15 16:28:58 2016 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 35 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 7
Vendor (Seagate) cache information
Blocks sent to initiator = 1097810090
Blocks received from initiator = 2848461732
Blocks read from cache and sent to initiator = 379368814
Number of read and write commands whose size <= segment size = 1027613802
Number of read and write commands whose size > segment size = 4
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 24404.17
number of minutes until next internal SMART test = 29
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2771573510       88         0  2771573598  2771573598      92921.056           0
write:           0        0         1          1          12      14715.353          11
Non-medium error count: 43
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1 Background short Completed - 24363 - [- - -]
# 2 Background short Completed - 24196 - [- - -]
# 3 Background short Completed - 24028 - [- - -]
# 4 Background short Completed - 23860 - [- - -]
# 5 Background short Completed - 23692 - [- - -]
# 6 Background short Completed - 23524 - [- - -]
# 7 Background short Completed - 23356 - [- - -]
# 8 Background short Completed - 23188 - [- - -]
# 9 Background short Completed - 20501 - [- - -]
#10 Background short Completed - 20333 - [- - -]
#11 Background short Completed - 20165 - [- - -]
#12 Background short Completed - 19997 - [- - -]
#13 Background short Completed - 19829 - [- - -]
#14 Background short Completed - 19661 - [- - -]
#15 Background short Completed - 19493 - [- - -]
#16 Background short Completed - 19325 - [- - -]
#17 Background short Completed - 19157 - [- - -]
#18 Background short Completed - 18989 - [- - -]
#19 Background short Completed - 18821 - [- - -]
#20 Background short Completed - 18653 - [- - -]
Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
=================================================
[root@sm1 ~]# smartctl -H /dev/da94
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.1-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
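Worth noting: the self-test log above contains only background short tests. We are considering running a long test in a quiet window; a sketch, in case it is useful to others:
Code:
# Kick off an extended self-test (~107 minutes on this drive per the log)
smartctl -t long /dev/da94
# ...and read the result afterwards
smartctl -l selftest /dev/da94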
VMware support was involved and the vSphere logs were uploaded. They tell us that a power-on reset happens if there are pending reservations on the LUN: the LUN cannot process any more commands from the host, while the host keeps waiting for those commands to complete and eventually exhausts the hostd thread responsible for this action.
For this scenario, VMware support wants us to check on the storage side what the LUN was doing at that time, since we do not have logs after "2016-08-12T19:58".
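On the ESXi side, reservations on the device can at least be probed, and as a last resort reset, with the documented vmkfstools --lock options (a sketch; whether istgt answers the persistent-reservation reads depends on its PR support):
Code:
# Read SCSI-3 persistent reservation keys/state on the device
vmkfstools -L readkeys /vmfs/devices/disks/t10.FreeBSD_iSCSI_DISK______10000084________________________
vmkfstools -L readresv /vmfs/devices/disks/t10.FreeBSD_iSCSI_DISK______10000084________________________
# Last resort: reset the LUN to clear a stuck reservation
vmkfstools -L lunreset /vmfs/devices/disks/t10.FreeBSD_iSCSI_DISK______10000084________________________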
We need your help to proceed with the investigation.