Hi all,
I have a FreeBSD 9.1 VM on ESXi 5.1 (with VMware Tools) with:
- 8 x 3 TB and recently added 12 x 1 TB
- 3 x IBM M1015 flashed with 2118IT.bin as PCI Device passthrough under ESXi (used exclusively for ZFS arrays since the system is on an SSD used by ESXi).
Code:
dev.mpslsi.0.firmware_version: 15.00.00.00
dev.mpslsi.0.driver_version: 15.00.00.00
dev.mpslsi.1.firmware_version: 15.00.00.00
dev.mpslsi.1.driver_version: 15.00.00.00
dev.mpslsi.2.firmware_version: 15.00.00.00
dev.mpslsi.2.driver_version: 15.00.00.00
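(Those values are just sysctl OIDs; for reference, this is how I dump them for the three HBAs in one go:)
Code:
sysctl dev.mpslsi | grep -E 'firmware_version|driver_version'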
ZFS:
- 1 x ZFS raidz1 of 5 x 3 TB drives (I want to expand to a raidz2 with 8 drives, or whatever I get as a recommendation from the other topic I started here)
- 2 x ZFS raidz1 of 6 x 1 TB drives to copy the data from the 5 x 3 TB array (not fully used today), so I can destroy and recreate a larger pool with more 3 TB drives and more resilience
- 7 x 3 TB WD RED (0.60 A at +5 V / 0.45 A at +12 V)
- 1 x 3 TB Seagate (no idea of its power requirements, but I will eventually check)
- 12 x 1 TB Seagate 2.5 inch (0.80 A at +5 V / 0.20 A at +12 V)
PSU is:
- Seasonic X Series 750 W (currently using only 2 of the 6-pin-to-Molex rails at the back for the HDDs; I think this is enough, but I have started to doubt everything)
Motherboard/CPU/Memory:
- Supermicro X9SRL-F / E5-2665 / 128 GB ECC
My question is how to verify what is failing and why, since it could be:
- one of the 3 controllers
- one line in one of the 6 SFF cables
- one of the 6 backplanes in the NORCO case (RPC4224)
- the FreeBSD mpslsi driver v15.00.00.00
- the power lines to the respective backplanes of the NORCO case
At boot, dmesg is full of entries like this:
Code:
(probe230:mpslsi2:0:232:0): INQUIRY. CDB: 12 0 0 0 24 0
(probe230:mpslsi2:0:232:0): CAM status: Invalid Target ID
(probe230:mpslsi2:0:232:0): Error 22, Unretryable error
Then checking across the controllers:
Code:
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi2 | wc -l
924
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi0 | wc -l
0
[root@softimage ~]# grep probe /var/run/dmesg.boot | grep mpslsi1 | wc -l
0
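To try to narrow this down, I am also tallying the boot-time probe errors per controller and per target, with something like this (a rough sketch; it assumes the messages keep the (probeX:mpslsiY:0:Z:0) prefix shown above):
Code:
# count probe errors per controller/target from the boot log
grep probe /var/run/dmesg.boot | \
    sed -E 's/.*\((probe[0-9]+):(mpslsi[0-9]):0:([0-9]+):0\).*/\2 target \3/' | \
    sort | uniq -c | sort -rn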
The following messages appear during disk activity:
Code:
grep ATTENTION /var/log/messages
Jun 2 10:19:48 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:19:56 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:19:57 softimage kernel: (da2:mpslsi0:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:08 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:08 softimage kernel: (da17:mpslsi2:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:08 softimage kernel: (da18:mpslsi2:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:09 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:11 softimage kernel: (da1:mpslsi0:0:21:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:11 softimage kernel: (da6:mpslsi0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:31 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:42 softimage kernel: (da4:mpslsi0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:20:43 softimage kernel: (da2:mpslsi0:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:21:51 softimage kernel: (da3:mpslsi0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:21:53 softimage kernel: (da1:mpslsi0:0:21:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:21:55 softimage kernel: (da4:mpslsi0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:22:12 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 10:22:29 softimage kernel: (da5:mpslsi0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:23:53 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:23:58 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:25:21 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:25:22 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:25:23 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:25:25 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:25:25 softimage kernel: (da13:mpslsi1:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 11:26:03 softimage kernel: (da20:mpslsi2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
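Same idea for the UNIT ATTENTION messages, to see whether they cluster on one controller/backplane or are spread everywhere (again just a sketch over the log format above):
Code:
# count UNIT ATTENTION messages per device and controller
grep 'UNIT ATTENTION' /var/log/messages | \
    sed -E 's/.*\((da[0-9]+):(mpslsi[0-9]):0:([0-9]+):0\).*/\2 \1 target \3/' | \
    sort | uniq -c | sort -rn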
And a fair bit of this:
Code:
Jun 2 11:33:48 softimage kernel: (da12:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
[root@softimage ~]# grep iuCRC /var/log/messages | wc -l
2302
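Since iuCRC is a link-level CRC error, I also want to cross-check with what the drives themselves count. A sketch with smartmontools (assuming it is installed, and that the SATA drives expose the usual interface CRC attribute, 199 UDMA_CRC_Error_Count):
Code:
# per-drive SMART CRC counters; kern.disks lists the disk devices
for d in $(sysctl -n kern.disks | tr ' ' '\n' | grep '^da' | sort); do
    echo "== $d =="
    smartctl -A /dev/$d | grep -i crc
done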
Last but not least, this type of message appears across 9 drives...
Code:
Jun 2 10:20:03 softimage kernel: (da3:mpslsi0:0:25:0): WRITE(10). CDB: 2a 0 0 76 d9 0 0 0 d8 0 length 110592 SMID 553 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:03 softimage kernel: (da2:mpslsi0:0:22:0): WRITE(10). CDB: 2a 0 0 76 e3 d8 0 0 d8 0 length 110592 SMID 599 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:04 softimage kernel: (da1:mpslsi0:0:21:0): WRITE(10). CDB: 2a 0 0 79 99 48 0 0 d8 0 length 110592 SMID 289 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:05 softimage kernel: (da17:mpslsi2:0:5:0): WRITE(10). CDB: 2a 0 0 5b 55 60 0 0 d8 0 length 110592 SMID 496 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:05 softimage kernel: (da18:mpslsi2:0:6:0): WRITE(10). CDB: 2a 0 0 5b cf 20 0 0 d8 0 length 110592 SMID 151 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:05 softimage kernel: (da4:mpslsi0:0:26:0): WRITE(10). CDB: 2a 0 0 7a 2b f8 0 0 d0 0 length 106496 SMID 653 terminated ioc 804b scsi 0 state c xfer 106496
Jun 2 10:20:05 softimage kernel: (da6:mpslsi0:0:28:0): WRITE(10). CDB: 2a 0 0 7a 26 f0 0 0 d8 0 length 110592 SMID 805 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:05 softimage kernel: (da5:mpslsi0:0:27:0): WRITE(10). CDB: 2a 0 0 7a b7 f8 0 0 d8 0 length 110592 SMID 817 terminated ioc 804b scsi 0 state c xfer 110592
Jun 2 10:20:08 softimage kernel: (da3:mpslsi0:0:25:0): WRITE(10). CDB: 2a 0 0 77 bc 80 0 0 d8 0
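To confirm the "9 drives" part and see which controllers those drives sit behind, I just list the distinct devices in the terminated-command messages (sketch):
Code:
# distinct controller/device pairs with terminated commands
grep terminated /var/log/messages | \
    sed -E 's/.*\((da[0-9]+):(mpslsi[0-9]).*/\2 \1/' | sort -u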
Either I am cursed or something is really going on, but I have to admit I have no idea where to start and certainly no confidence that I will actually isolate the issue...
Current action:
I have removed all the 1 TB drives and kept only the 8 x 3 TB drives, distributed across the 3 controllers. In other words, no more than 2 drives per backplane (the RPC4224 has 6 backplanes in total). I then launched a scrub on my 5-drive raidz1, and towards the end:
Code:
zpool status
  pool: zstuff
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun 2 11:57:30 2013
        3.74T scanned out of 5.86T at 978M/s, 0h37m to go
        141M resilvered, 63.75% done
config:

        NAME           STATE     READ WRITE CKSUM
        zstuff         ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     2  (resilvering)
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0  (resilvering)
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0

errors: No known data errors
then
Code:
zpool status -v
  pool: zstuff
 state: ONLINE
  scan: resilvered 141M in 0h19m with 0 errors on Sun Jun 2 12:17:09 2013
config:

        NAME           STATE     READ WRITE CKSUM
        zstuff         ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0
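For the re-test I basically just scrub again and watch the log live in another session, roughly like this (zpool clear first if you want the CKSUM counters back to zero):
Code:
zpool clear zstuff
zpool scrub zstuff
# in another terminal:
tail -f /var/log/messages | grep -E 'iuCRC|UNIT ATTENTION|terminated'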
I did a second scrub and got a few errors like this:
Code:
Jun 2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): READ(10). CDB: 28 0 4 89 0 a0 0 0 c0 0
Jun 2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): CAM status: SCSI Status Error
Jun 2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): SCSI status: Check Condition
Jun 2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun 2 16:19:57 softimage kernel: (da5:mpslsi1:0:25:0): Retrying command (per sense data)
then
Code:
Jun 2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): READ(10). CDB: 28 0 5 75 d1 60 0 0 40 0
Jun 2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): CAM status: SCSI Status Error
Jun 2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): SCSI status: Check Condition
Jun 2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jun 2 16:21:02 softimage kernel: (da1:mpslsi0:0:9:0): Retrying command (per sense data)
Now the interesting bit: the last time I ran a long smartmontools test (3 weeks ago), I got a clean report for all the 3 TB drives. Before that test, one of them was reporting issues when a lot of drives were present in the chassis; going back to 5 drives (at that time), I would get an error-free long test report on each drive.
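When I redo the long tests I will script it along these lines (assuming smartmontools again; nothing clever):
Code:
# start a long self-test on every da disk
for d in $(sysctl -n kern.disks | tr ' ' '\n' | grep '^da'); do
    smartctl -t long /dev/$d
done
# hours later, check the results
for d in $(sysctl -n kern.disks | tr ' ' '\n' | grep '^da'); do
    echo "== $d =="
    smartctl -l selftest /dev/$d | head -8
done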
If you are still on this thread (thanks a lot for reading): at one point I had 6 x 10k RPM 2.5" 72 GB SAS drives in this NORCO RPC4224 chassis and they were failing (the famous click-click noise not long after startup). I put them back in the DL380 they came from and they were all fine. I am more and more thinking of a power-related issue, since I use only two of the 5 dedicated 'Peripheral IDE/SATA' rails shown in this photo of my PSU.
If anybody has any idea of the best place to start to get a reliable setup, please let me know.
Thanks.