Hello all,
We have replaced the disk controller on a FreeBSD server that has been running for some time. The old controller, an Areca 1680i-16, kept getting time-out errors, and disks were unexpectedly dropping from it. The 16 disks are configured in a ZFS pool; they are Western Digital 2TB Green drives (WD20EARX). We decided to get a controller with fewer features and run it in Legacy mode (no RAID or JBOD, just pass-through), hoping to eliminate any Green drive/controller problems.
Code:
[root@storage1 ~]# uname -a
FreeBSD storage1 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r255898: Thu Sep 26 22:50:31 UTC 2013 root@bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
Code:
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WMAZA6398371
LU WWN Device Id: 5 0014ee 25c17f0ec
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Dec 18 14:53:59 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
We chose the HighPoint RocketRAID 2760. We were able to simply disconnect the Areca controller and connect the HPT card in its place; everything just worked.
We also made sure that the controller had the latest firmware installed.
Part of the upgrade process included upgrading from FreeBSD 9.1 to 9.2 and recreating the zpool.
We migrated the active virtual machines back to the pool, and when that was complete we ran a zpool scrub. A few interesting messages popped up in syslog, but the server continued to function normally.
Code:
Dec 11 08:43:13 storage1 kernel: hpt27xx: Device error information 0x8000000080000000
Dec 11 08:45:05 storage1 kernel: (da11:hpt27xx0:0:11:0): WRITE(10). CDB: 2a 00 0e 54 82 58 00 00 10 00
Dec 11 08:45:05 storage1 kernel: (da11:hpt27xx0:0:11:0): CAM status: SCSI Status Error
Dec 11 08:45:05 storage1 kernel: (da11:hpt27xx0:0:11:0): SCSI status: OK
Dec 11 08:47:42 storage1 kernel: (da10:hpt27xx0:0:10:0): WRITE(10). CDB: 2a 00 0e 56 52 4c 00 00 10 00
Dec 11 08:47:42 storage1 kernel: (da10:hpt27xx0:0:10:0): CAM status: SCSI Status Error
Dec 11 08:47:42 storage1 kernel: (da10:hpt27xx0:0:10:0): SCSI status: OK
Dec 11 08:49:18 storage1 kernel: (da13:hpt27xx0:0:13:0): WRITE(10). CDB: 2a 00 0e 4c 44 74 00 00 08 00
Dec 11 08:49:18 storage1 kernel: (da13:hpt27xx0:0:13:0): CAM status: SCSI Status Error
Dec 11 08:49:18 storage1 kernel: (da13:hpt27xx0:0:13:0): SCSI status: OK
Dec 11 08:49:43 storage1 kernel: (da10:hpt27xx0:0:10:0): WRITE(10). CDB: 2a 00 0e 58 e6 d9 00 00 10 00
Dec 11 08:49:43 storage1 kernel: (da10:hpt27xx0:0:10:0): CAM status: SCSI Status Error
Dec 11 08:49:43 storage1 kernel: (da10:hpt27xx0:0:10:0): SCSI status: OK
Dec 11 08:51:47 storage1 kernel: (da11:hpt27xx0:0:11:0): READ(10). CDB: 28 00 0a b2 63 4a 00 00 80 00
Dec 11 08:51:47 storage1 kernel: (da11:hpt27xx0:0:11:0): CAM status: SCSI Status Error
Dec 11 08:51:47 storage1 kernel: (da11:hpt27xx0:0:11:0): SCSI status: OK
Dec 11 08:52:55 storage1 kernel: (da2:hpt27xx0:0:2:0): WRITE(10). CDB: 2a 00 0f d3 a5 e4 00 00 18 00
Dec 11 08:52:55 storage1 kernel: (da2:hpt27xx0:0:2:0): CAM status: SCSI Status Error
Dec 11 08:52:55 storage1 kernel: (da2:hpt27xx0:0:2:0): SCSI status: OK
Dec 11 08:53:47 storage1 kernel: (da10:hpt27xx0:0:10:0): WRITE(10). CDB: 2a 00 0e 5d da f5 00 00 08 00
Dec 11 08:53:47 storage1 kernel: (da10:hpt27xx0:0:10:0): CAM status: SCSI Status Error
Dec 11 08:53:47 storage1 kernel: (da10:hpt27xx0:0:10:0): SCSI status: OK
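In case it is useful to anyone comparing notes, here is a rough Python sketch for tallying these messages per device, to see whether they cluster on one disk or spread across several. The helper name `tally_cam_errors` and the regular expression are my own assumptions, derived only from the log format shown above; they are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   (da11:hpt27xx0:0:11:0): CAM status: SCSI Status Error
# The pattern is an assumption based on the excerpt above.
CAM_ERROR = re.compile(r"\((da\d+):hpt27xx\d+:[\d:]+\): CAM status: (.+)$")


def tally_cam_errors(lines):
    """Count 'CAM status' lines per da device (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = CAM_ERROR.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpt above, as sample input.
    sample = [
        "Dec 11 08:45:05 storage1 kernel: (da11:hpt27xx0:0:11:0): CAM status: SCSI Status Error",
        "Dec 11 08:47:42 storage1 kernel: (da10:hpt27xx0:0:10:0): CAM status: SCSI Status Error",
        "Dec 11 08:49:43 storage1 kernel: (da10:hpt27xx0:0:10:0): CAM status: SCSI Status Error",
    ]
    for device, count in tally_cam_errors(sample).most_common():
        print(device, count)
```

Counting the full excerpt above by device gives three errors on da10, two on da11, and one each on da13 and da2, so the messages are not confined to a single disk.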
During the scrub a single checksum error appeared on da7:
Code:
[root@storage1 ~]# zpool status
  pool: export
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Dec 11 07:19:22 2013
        265G scanned out of 447G at 51.8M/s, 0h59m to go
        128K repaired, 59.34% done
config:

        NAME              STATE     READ WRITE CKSUM
        export            ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            label/disk1   ONLINE       0     0     0
            label/disk2   ONLINE       0     0     0
            label/disk3   ONLINE       0     0     0
          mirror-1        ONLINE       0     0     0
            label/disk4   ONLINE       0     0     0
            label/disk5   ONLINE       0     0     0
            label/disk6   ONLINE       0     0     1  (repairing)
          mirror-2        ONLINE       0     0     0
            label/disk7   ONLINE       0     0     0
            label/disk8   ONLINE       0     0     0
            label/disk9   ONLINE       0     0     0
          mirror-3        ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0
            label/disk12  ONLINE       0     0     0
          mirror-4        ONLINE       0     0     0
            label/disk13  ONLINE       0     0     0
            label/disk14  ONLINE       0     0     0
            label/disk15  ONLINE       0     0     0
        spares
          label/disk16    AVAIL

errors: No known data errors
We let the server run with its usual workload, and a few similar errors appeared in the syslog over time.
I also received an alert e-mail from the HPT server utility; however, neither ZFS nor syslog reported any problem corresponding to it.
Code:
Wed, 11 Dec 2013 16:52:00 GMT:
An error occured on the disk at 'WDC WD20EARX-00PASB0-WD-WMAZA8904653' at Controller1-Channel12.
Code:
Dec 12 15:31:14 storage1 kernel: (da0:hpt27xx0:0:0:0): WRITE(10). CDB: 2a 00 11 c9 fb 27 00 00 08 00
Dec 12 15:31:14 storage1 kernel: (da0:hpt27xx0:0:0:0): CAM status: SCSI Status Error
Dec 12 15:31:14 storage1 kernel: (da0:hpt27xx0:0:0:0): SCSI status: OK
Dec 16 21:15:53 storage1 kernel: (da7:hpt27xx0:0:7:0): WRITE(10). CDB: 2a 00 18 84 f4 19 00 00 10 00
Dec 16 21:15:53 storage1 kernel: (da7:hpt27xx0:0:7:0): CAM status: SCSI Status Error
Dec 16 21:15:53 storage1 kernel: (da7:hpt27xx0:0:7:0): SCSI status: OK
Dec 16 21:15:53 storage1 kernel: (da8:hpt27xx0:0:8:0): WRITE(10). CDB: 2a 00 18 84 f4 19 00 00 10 00
Dec 16 21:15:53 storage1 kernel: (da8:hpt27xx0:0:8:0): CAM status: SCSI Status Error
Dec 16 21:15:53 storage1 kernel: (da8:hpt27xx0:0:8:0): SCSI status: OK
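For anyone wanting to see which blocks these commands touched, the CDB bytes can be decoded. READ(10) and WRITE(10) use a 10-byte CDB with the opcode in byte 0 (0x28 or 0x2a), a big-endian LBA in bytes 2 to 5, and a big-endian transfer length in bytes 7 and 8. The sketch below is only illustrative: `decode_rw10_cdb` is a hypothetical helper name, and the 512-byte block size is taken from the logical sector size in the smartctl output above.

```python
def decode_rw10_cdb(cdb_hex, logical_block_size=512):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Hypothetical helper; logical_block_size defaults to the 512-byte
    logical sector size reported by smartctl for these drives.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError("not a READ(10) or WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return {
        "op": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * logical_block_size,
    }


if __name__ == "__main__":
    # CDB copied from the da7/da8 entries above.
    print(decode_rw10_cdb("2a 00 18 84 f4 19 00 00 10 00"))
```

The da7 and da8 entries carry identical CDBs at the same timestamp: a 16-block (8 KiB) write at LBA 411366425. That would be consistent with the same write going to two members of one mirror, though I have not confirmed which mirror those devices belong to.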
We also performed full SMART self-tests on each disk, and everything turned up clean.
I am trying to determine whether we need to worry about these problems and take action, or whether the WD20EARX Green drives, which lack TLER, are simply causing the alerts. This behavior is better than that of the Areca controller, which simply dropped the disk on a time-out. But if it's the missing TLER on a single disk, why would more than one disk complain?
If anyone out there has ideas or experience with this, please let me know.
Cheers!