Replacing ZFS Drive on SAS Controller

churchi · May 14, 2012

Hi all,

Quick question on my ZFS setup.

I have the following set up:

Code:

[root@server-01 /home/churchi]# zpool status
  pool: storage1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        storage1          ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE     103     1     0
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0

errors: No known data errors
[root@server-01 /home/churchi]#

As you can see, I have a drive that is starting to give errors. I believe it has not fully failed; however, it is starting to cause issues on the array and when accessing the share.

I would like to replace the drive that is starting to go faulty. I have labelled all of my drives, so I am not seeing the /dev/XXX names. Eight of the drives are connected to my SAS card, and the failing drive is one of them.

So a few questions.

Do I need to replace this drive right away?
All the current drives are 1.5 TB in capacity and the new drive I have is 2 TB. Will this cause any issues if I replace the failing drive with this one?
Is this the correct way to replace my failing drive:
1. zpool offline storage1 label/disk02
2. shut down and replace the drive
3. zpool replace storage1 label/disk02
Will this allow me to keep the current label, or will I have to recreate the label when the new disk is inserted?
Is there anything else I am missing, or will the above do the job for me?

Here is an output from the command: glabel list

Code:

[root@server-01 /home/churchi]# glabel list
Geom name: da1
Providers:
1. Name: label/disk02
   Mediasize: 1500301909504 (1.4T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 2930277167
   length: 1500301909504
   index: 0
Consumers:
1. Name: da1
   Mediasize: 1500301910016 (1.4T)
   Sectorsize: 512
   Mode: r1w1e2

I am writing here because this is the first time I have had a failed drive, and I would really like some reassurance from you guys that I have the process down pat before I replace the disk.

If there is anything else I can provide that would help, let me know.

Thank you.

churchi · May 15, 2012

Hi all,

Well it looks like the disk is fully dead today.

Code:

[churchi@server-01 ~]$ zpool status
  pool: storage1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        storage1          DEGRADED     0     0     0
          raidz2          DEGRADED     0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  UNAVAIL     23   321     0  experienced I/O failures
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0

errors: No known data errors
[churchi@server-01 ~]$

So this replacement just got a bit more serious.

Does anyone want to comment on how I am going to replace my drive and let me know if that is a good way to proceed?

Thank you.

phoenix · May 15, 2012

Your replacement process is correct:

# zpool offline storage1 label/disk02
physically remove disk
insert new disk
# glabel label disk02 <disknode>
# zpool replace storage1 label/disk02 label/disk02

You have to manually create the new label on the disk before adding it to the pool.
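For anyone following along, the sequence above can be written down as a small dry-run helper. This is only a sketch: the function name `replacement_plan` is made up for illustration, and `da1` is an assumption for the device node of the new disk (it was the node behind label/disk02 in the glabel output earlier in the thread, but confirm it on your own system). The helper only builds the command strings and does not run anything.

```python
def replacement_plan(pool, label, devnode):
    """Return the commands for swapping a glabel-labelled disk in a ZFS pool.

    Hypothetical helper for illustration; it only builds strings.
    """
    vdev = "label/" + label
    return [
        # Take the failing disk out of service before pulling it.
        "zpool offline {} {}".format(pool, vdev),
        # (physically swap the disk at this point)
        # Write a fresh GEOM label onto the new disk.
        "glabel label {} {}".format(label, devnode),
        # Rebuild onto the newly labelled disk.
        "zpool replace {} {} {}".format(pool, vdev, vdev),
    ]


if __name__ == "__main__":
    # da1 is an assumption; confirm the node of the new disk first.
    for cmd in replacement_plan("storage1", "disk02", "da1"):
        print("# " + cmd)
```

Printing the plan first makes it easy to double-check the pool name and label before typing anything as root.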

It being a 2 TB disk in a vdev with 1.5 TB disks will not affect anything. And, if you later replace the rest of the drives with 2 TB disks, you'll get a larger vdev.
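To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the usual rule of thumb that a raidz vdev treats every member as being as large as its smallest disk and gives up `parity` disks' worth of space; `raidz_usable_tb` is a made-up name, and the result ignores metadata and padding overhead, so real figures will be somewhat lower.

```python
def raidz_usable_tb(disk_sizes_tb, parity=2):
    """Rough usable capacity of one raidz vdev, in the units of the input.

    Every member counts as the size of the smallest disk, and `parity`
    disks' worth of space goes to redundancy. Overhead is ignored.
    """
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


if __name__ == "__main__":
    today = [1.5] * 9 + [2.0]   # one 2 TB disk among nine 1.5 TB disks
    later = [2.0] * 10          # every disk replaced with a 2 TB one
    print(raidz_usable_tb(today))   # 12.0
    print(raidz_usable_tb(later))   # 16.0
```

So the single 2 TB disk changes nothing by itself; the extra space only shows up once the smallest disk in the vdev is 2 TB.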

churchi · May 17, 2012

Cheers, thanks phoenix for the reply. I have kicked off the process and now the waiting game begins.

Code:

[root@server-01 /home/churchi]# zpool status
  pool: storage1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 27h36m, 74.41% done, 9h29m to go
config:

        NAME                    STATE     READ WRITE CKSUM
        storage1                DEGRADED     0     0     0
          raidz2                DEGRADED     0     0     0
            label/disk01        ONLINE       0     0     0  405M resilvered
            replacing           DEGRADED     0     0    12
              label/disk02/old  UNAVAIL      0 1.58M     0  cannot open
              label/disk02      ONLINE       0     0     0  854G resilvered
            label/disk03        ONLINE       0     0     0  405M resilvered
            label/disk04        ONLINE       0     0     1  398M resilvered
            label/disk05        ONLINE       0     0     0  405M resilvered
            label/disk06        ONLINE       0     0     0  398M resilvered
            label/disk07        ONLINE       0     0     1  405M resilvered
            label/disk08        ONLINE       0     0     0  398M resilvered
            label/disk09        ONLINE     144     0     0  407M resilvered
            label/disk10        ONLINE       0     0     0  398M resilvered

errors: No known data errors
[root@server-01 /home/churchi]#

From that output, you can see that I have relabelled the new disk as label/disk02; however, I am still seeing the old disk there. Will that old disk just disappear when this disk has been rebuilt?

Is it normal for the resilvering of a new disk to take around 35-40 hours? Seems like a very long time.

Also, I have now noticed that disk09 has some errors. Should I be worried about these? Should I just replace the disk now before it's fully gone? Or, since I have RAID 6 with ZFS, should I just wait?
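Spotting which devices have non-zero counters gets harder as the table grows, so here is a small sketch that picks them out of `zpool status` text. It assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM); `devices_with_errors` is a made-up name, and `SAMPLE` is a shortened copy of the table above, not new output. Counters are kept as the strings zpool printed, so values like 1.58M are not converted.

```python
SAMPLE = """\
        NAME                    STATE     READ WRITE CKSUM
        storage1                DEGRADED     0     0     0
          raidz2                DEGRADED     0     0     0
            replacing           DEGRADED     0     0    12
              label/disk02/old  UNAVAIL      0 1.58M     0  cannot open
              label/disk02      ONLINE       0     0     0  854G resilvered
            label/disk04        ONLINE       0     0     1  398M resilvered
            label/disk09        ONLINE     144     0     0  407M resilvered

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True      # the device table starts after this header
            continue
        if not in_config:
            continue
        if not fields:
            in_config = False     # a blank line ends the device table
            continue
        if len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            if (read, write, cksum) != ("0", "0", "0"):
                flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, counters in devices_with_errors(SAMPLE).items():
        print(name, counters)
```

On the table above this flags the replacing vdev, the old disk02, disk04, disk07 (not in the shortened sample) and disk09, which is a wider spread than a single bad disk would explain.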

gkontos · May 17, 2012

churchi said:
From that output, you can see that I have relabelled the new disk as label/disk02; however, I am still seeing the old disk there. Will that old disk just disappear when this disk has been rebuilt?

Yes, it will once resilvering has finished.

churchi said:
Is it normal for the resilvering of a new disk to take around 35-40 hours? Seems like a very long time.

It depends on how much data you have in this array! But if you look at those readings:

Code:

scrub: resilver in progress for 27h36m, 74.41% done, 9h29m to go
...
label/disk02      ONLINE       0     0     0  854G resilvered

Something is wrong. 27 hours for 854GB is a lot.
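As a quick sanity check on those two readings, the average rate works out as below. This sketch assumes the "G" in the zpool output means GiB; `resilver_rate_mib_s` is just an illustrative name.

```python
def resilver_rate_mib_s(resilvered_gib, hours, minutes):
    """Average resilver rate in MiB/s, treating zpool's 'G' as GiB."""
    seconds = hours * 3600 + minutes * 60
    return resilvered_gib * 1024 / seconds


if __name__ == "__main__":
    # 854G resilvered after 27h36m, as reported above.
    rate = resilver_rate_mib_s(854, 27, 36)
    print("{:.1f} MiB/s".format(rate))   # 8.8 MiB/s
```

Under 9 MiB/s is well below what a single healthy SATA disk of this generation should manage when writing, even allowing for the fact that a resilver is not a purely sequential job.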

churchi said:
Also, I have now noticed that disk09 has some errors. Should I be worried about these? Should I just replace the disk now before it's fully gone? Or, since I have RAID 6 with ZFS, should I just wait?

There you go, something is wrong. The good thing about RAIDZ2 is that it can withstand two disk failures. At this point you have to make sure that the problem is in your drives and not in your controller.

/var/log/messages should have logged something unusual. Checking it is a must. Don't rush into replacing the second drive before you know where the problem is.

churchi · May 17, 2012

Hey gkontos, thanks for the reply.

Here is the latest:

Code:

[root@server-01 /usr/home/churchi]# zpool status
  pool: storage1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 34h0m, 90.65% done, 3h30m to go
config:

        NAME                    STATE     READ WRITE CKSUM
        storage1                DEGRADED     0     0     0
          raidz2                DEGRADED    14     0     0
            label/disk01        ONLINE       0     0     0  524M resilvered
            replacing           DEGRADED     0     0    23
              label/disk02/old  UNAVAIL      0 2.31M     0  cannot open
              label/disk02      ONLINE       0     0     0  1.02T resilvered
            label/disk03        ONLINE       0     0     0  524M resilvered
            label/disk04        ONLINE       0     0     1  512M resilvered
            label/disk05        REMOVED      0 11.5M     0  413M resilvered
            label/disk06        ONLINE       0     0     0  512M resilvered
            label/disk07        ONLINE       0     0     1  524M resilvered
            label/disk08        ONLINE      57     0     0  512M resilvered
            label/disk09        ONLINE     144     0     0  525M resilvered
            label/disk10        ONLINE       0     0     0  512M resilvered

errors: No known data errors
[root@server-01 /usr/home/churchi]#

Not looking good. Seems as though I have lost disk05. I have no idea why it says REMOVED; however, I am guessing that's the second drive that is not online in the array.

Am I allowed to reboot my machine while I am resilvering? Will this kill the resilvering? Should I avoid rebooting during this process?

In the /var/log/messages I see a lot of things like this:

Code:

May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:03 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:03 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:03 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:03 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:07 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:07 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:07 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:07 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:11 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:11 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:11 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:11 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:15 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path=/dev/label/disk08 offset=1456926883840 size=65536 error=5
May 17 17:49:15 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:15 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:15 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:15 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:15 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path=/dev/label/disk08 offset=1456926947328 size=512 error=5
May 17 17:49:15 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path= offset=14569227530240 size=6144 error=5
May 17 17:49:15 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path=/dev/label/disk08 offset=1456926934528 size=512 error=5
May 17 17:49:18 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:18 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:18 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:18 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:22 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:22 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:22 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:22 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:26 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:26 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:26 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:26 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:30 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:30 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:30 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:30 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:33 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path=/dev/label/disk08 offset=1456926883840 size=65536 error=5
May 17 17:49:33 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:33 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:33 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:33 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:33 server-01 root: ZFS: vdev I/O failure, zpool=storage1 path=/dev/label/disk08 offset=1456926947328 size=512 error=5
May 17 17:49:40 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:40 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:40 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:40 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:44 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:44 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:44 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:44 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:48 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:48 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:48 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:48 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:51 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:51 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:51 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:51 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:54 server-01 kernel: (da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b ce 0 0 0 80 0 
May 17 17:49:54 server-01 kernel: (da7:mpt0:0:7:0): CAM status: SCSI Status Error
May 17 17:49:54 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:54 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)

The funny part is that this has nothing to do with disk05, which has also failed.

How would I know whether this is a SAS card issue or an HDD issue?

phoenix · May 17, 2012

Do not reboot during a resilver! Especially one that is over 90% complete! It will restart the resilver from 0%.

Just let this resilver finish.

Then you can try to offline/online label/disk05. And look into replacing label/disk08.

However, since you are experiencing multiple disk failures at once, I would also look into reconnecting all your cables, reseating your controller, blowing out any dust in the case, etc.
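One way to judge disk versus controller is to tally the kernel messages per device. The sketch below assumes the log format from the excerpt posted above, where each CAM message starts with a prefix like (da7:mpt0:0:7:0); `medium_errors_per_device` is a made-up name and `SAMPLE` is two lines copied from that excerpt. As a rule of thumb rather than a diagnosis: MEDIUM ERROR lines that all come from one device tend to point at that disk, while errors scattered over many devices on the same controller make the card, cables or power more suspect.

```python
import re
from collections import Counter

# Matches the "(da7:mpt0:0:7:0):" prefix of CAM kernel messages.
CAM_PREFIX = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\):")

SAMPLE = """\
May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): SCSI status: Check Condition
May 17 17:49:00 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
May 17 17:49:03 server-01 kernel: (da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
"""


def medium_errors_per_device(log_text):
    """Count 'MEDIUM ERROR' sense lines per (device, controller) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        if "MEDIUM ERROR" not in line:
            continue
        match = CAM_PREFIX.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (device, controller), n in medium_errors_per_device(SAMPLE).items():
        print(device, controller, n)
```

Feeding it the whole of /var/log/messages instead of `SAMPLE` would show whether da7 is the only device reporting medium errors.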

gkontos · May 17, 2012

Like phoenix said, do not reboot during a resilver!

I would really have to see your dmesg output, but I have already seen 3 clients with problems on 9.0-RELEASE and LSI controllers.

The similarities in all cases are that:

All of them had an LSI SAS2008 controller.
They were using 4K drives that were not properly aligned.

The solution was to:

Upgrade to FreeBSD 9.0-STABLE that includes the official LSI driver <Driver: 13.00.00.00-fbsd>
Destroy the pool and recreate it 4K aligned.

None of them, however, had to deal with the issues you are seeing. They were all experiencing very bad performance and deadlocks.

So, please report back your full system information and check your cables, and we can see how to fix this.

churchi · May 18, 2012

Hi all,

The resilver has completed. The drive labelled disk05 came back alive when I powered off the machine and powered it back on. However, when the server started back up I could instantly see three checksum errors.

Code:

[root@server-01 /home/churchi]# zpool status
  pool: storage1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        storage1          ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0     0
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     3
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0

errors: No known data errors
[root@server-01 /home/churchi]#

I have not copied anything up to the server yet so I'll run it through its paces some time today.

gkontos, I am not sure what you mean about the 4K drives; however, I am currently running FreeBSD 8.1-RELEASE. I want to upgrade to 8.3 later today since I am so far behind on the releases, although the plan is to get onto release 9 ASAP.

I run 10 disks in my raidz2 and I have been reading that this may not be the best setup. I have heard that 10 disks in one vdev is not recommended and that I should not go past 6 to 8. If that is the case, I will look into getting another SAS card and splitting up the zpool.

I do also have an LSI controller. The speed I am getting on my server is good enough for me. When I copy up a file I am maxing out the gigabit link for most of the file copy.

I will take the server apart today and blow out all the dust and reseat all the cables.

What do you guys recommend?

churchi · May 18, 2012

Here is a copy of my dmesg from a clean boot:

Code:

[root@server-01 /home/churchi]# dmesg
Copyright (c) 1992-2010 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC 2010
    root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (2402.42-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x6f6  Family = 6  Model = f  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 8589934592 (8192 MB)
avail memory = 8263286784 (7880 MB)
ACPI APIC Table: <GBT    GBTUACPI>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <GBT GBTUACPI> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, dfde0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <PCI bus> on pcib1
mpt0: <LSILogic SAS/SATA Adapter> port 0xa000-0xa0ff mem 0xf9010000-0xf9013fff,0xf9000000-0xf900ffff irq 16 at device 0.0 on pci1
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.19.0
uhci0: <Intel 82801I (ICH9) USB controller> port 0xe000-0xe01f irq 16 at device 26.0 on pci0
uhci0: [ITHREAD]
uhci0: LegSup = 0x2f00
usbus0: <Intel 82801I (ICH9) USB controller> on uhci0
uhci1: <Intel 82801I (ICH9) USB controller> port 0xe100-0xe11f irq 21 at device 26.1 on pci0
uhci1: [ITHREAD]
uhci1: LegSup = 0x2f00
usbus1: <Intel 82801I (ICH9) USB controller> on uhci1
uhci2: <Intel 82801I (ICH9) USB controller> port 0xe200-0xe21f irq 18 at device 26.2 on pci0
uhci2: [ITHREAD]
uhci2: LegSup = 0x2f00
usbus2: <Intel 82801I (ICH9) USB controller> on uhci2
ehci0: <Intel 82801I (ICH9) USB 2.0 controller> mem 0xfc104000-0xfc1043ff irq 18 at device 26.7 on pci0
ehci0: [ITHREAD]
usbus3: EHCI version 1.0
usbus3: <Intel 82801I (ICH9) USB 2.0 controller> on ehci0
pci0: <multimedia, HDA> at device 27.0 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pci3: <ACPI PCI bus> on pcib3
vgapci0: <VGA-compatible display> port 0xb000-0xb07f mem 0xf6000000-0xf6ffffff,0xe0000000-0xefffffff,0xf4000000-0xf5ffffff irq 17 at device 0.0 on pci3
pcib4: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pci4: <ACPI PCI bus> on pcib4
atapci0: <JMicron JMB363 UDMA133 controller> port 0xc000-0xc007,0xc100-0xc103,0xc200-0xc207,0xc300-0xc303,0xc400-0xc40f mem 0xfc000000-0xfc001fff irq 19 at device 0.0 on pci4
atapci0: [ITHREAD]
atapci1: <AHCI SATA controller> on atapci0
atapci1: [ITHREAD]
atapci1: AHCI v1.00 controller with 2 3Gbps ports, PM supported
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
ata4: <ATA channel 0> on atapci0
ata4: [ITHREAD]
pcib5: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci5: <ACPI PCI bus> on pcib5
re0: <RealTek 8168/8111 B/C/CP/D/DP/E PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xfb000000-0xfb000fff irq 16 at device 0.0 on pci5
re0: Using 1 MSI messages
re0: Chip rev. 0x38000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211B media interface> PHY 1 on miibus0
rgephy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
re0: Ethernet address: 00:1d:7d:ae:6e:03
re0: [FILTER]
uhci3: <Intel 82801I (ICH9) USB controller> port 0xe300-0xe31f irq 23 at device 29.0 on pci0
uhci3: [ITHREAD]
usbus4: <Intel 82801I (ICH9) USB controller> on uhci3
uhci4: <Intel 82801I (ICH9) USB controller> port 0xe400-0xe41f irq 19 at device 29.1 on pci0
uhci4: [ITHREAD]
usbus5: <Intel 82801I (ICH9) USB controller> on uhci4
uhci5: <Intel 82801I (ICH9) USB controller> port 0xe500-0xe51f irq 18 at device 29.2 on pci0
uhci5: [ITHREAD]
usbus6: <Intel 82801I (ICH9) USB controller> on uhci5
ehci1: <Intel 82801I (ICH9) USB 2.0 controller> mem 0xfc105000-0xfc1053ff irq 23 at device 29.7 on pci0
ehci1: [ITHREAD]
usbus7: EHCI version 1.0
usbus7: <Intel 82801I (ICH9) USB 2.0 controller> on ehci1
pcib6: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci6: <ACPI PCI bus> on pcib6
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci2: <Intel ICH9 SATA300 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f,0xfc00-0xfc0f at device 31.2 on pci0
ata0: <ATA channel 0> on atapci2
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci2
ata1: [ITHREAD]
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
atapci3: <Intel ICH9 SATA300 controller> port 0xe700-0xe707,0xe800-0xe803,0xe900-0xe907,0xea00-0xea03,0xeb00-0xeb0f,0xec00-0xec0f irq 19 at device 31.5 on pci0
atapci3: [ITHREAD]
ata5: <ATA channel 0> on atapci3
ata5: [ITHREAD]
ata6: <ATA channel 1> on atapci3
ata6: [ITHREAD]
atrtc0: <AT realtime clock> port 0x70-0x73 on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: [FILTER]
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppc0: [ITHREAD]
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
plip0: [ITHREAD]
lpt0: <Printer> on ppbus0
lpt0: [ITHREAD]
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
acpi_perf0: <ACPI CPU Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 927092706000927
device_attach: est1 attach returned 6
p4tcc1: <CPU Frequency Thermal Control> on cpu1

churchi · May 18, 2012

Part 2. Maybe I should pastebin it next time.

Code:

Timecounters tick every 1.000 msec
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 12Mbps Full Speed USB v1.0
usbus2: 12Mbps Full Speed USB v1.0
usbus3: 480Mbps High Speed USB v2.0
usbus4: 12Mbps Full Speed USB v1.0
usbus5: 12Mbps Full Speed USB v1.0
usbus6: 12Mbps Full Speed USB v1.0
usbus7: 480Mbps High Speed USB v2.0
ad0: 476938MB <WDC WD5000AAKS-00A7B0 01.03B01> at ata0-master UDMA100 SATA 3Gb/s
ugen0.1: <Intel> at usbus0
uhub0: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus1
ugen2.1: <Intel> at usbus2
uhub2: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
ugen3.1: <Intel> at usbus3
uhub3: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
ugen4.1: <Intel> at usbus4
uhub4: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
ugen5.1: <Intel> at usbus5
uhub5: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus5
ugen6.1: <Intel> at usbus6
uhub6: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus6
ugen7.1: <Intel> at usbus7
uhub7: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus7
ad2: 1430799MB <Seagate ST31500341AS CC1H> at ata1-master UDMA100 SATA 3Gb/s
acd0: DVDR <HL-DT-STDVD-RAM GSA-H55N/1.03> at ata4-master UDMA66 
ad12: 1430799MB <Seagate ST31500341AS CC1H> at ata6-master UDMA100 SATA 3Gb/s
uhub0: 2 ports with 2 removable, self powered
uhub1: 2 ports with 2 removable, self powered
uhub2: 2 ports with 2 removable, self powered
uhub4: 2 ports with 2 removable, self powered
uhub5: 2 ports with 2 removable, self powered
uhub6: 2 ports with 2 removable, self powered
uhub3: 6 ports with 6 removable, self powered
uhub7: 6 ports with 6 removable, self powered
da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da1 at mpt0 bus 0 scbus0 target 1 lun 0
da1: <ATA ST2000DM001-9YN1 CC4C> Fixed Direct Access SCSI-5 device 
da1: 300.000MB/s transfers
da1: Command Queueing enabled
da1: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da2 at mpt0 bus 0 scbus0 target 2 lun 0
da2: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da2: 300.000MB/s transfers
da2: Command Queueing enabled
da2: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da3 at mpt0 bus 0 scbus0 target 3 lun 0
da3: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da3: 300.000MB/s transfers
da3: Command Queueing enabled
da3: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da4 at mpt0 bus 0 scbus0 target 4 lun 0
da4: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da4: 300.000MB/s transfers
da4: Command Queueing enabled
da4: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da5 at mpt0 bus 0 scbus0 target 5 lun 0
da5: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da5: 300.000MB/s transfers
da5: Command Queueing enabled
da5: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da6 at mpt0 bus 0 scbus0 target 6 lun 0
da6: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da6: 300.000MB/s transfers
da6: Command Queueing enabled
da6: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
da7 at mpt0 bus 0 scbus0 target 7 lun 0
da7: <ATA ST31500341AS CC1H> Fixed Direct Access SCSI-5 device 
da7: 300.000MB/s transfers
da7: Command Queueing enabled
da7: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad0s1a
ZFS filesystem version 3
ZFS storage pool version 14
re0: link state changed to UP
pid 1158 (httpd), uid 0: exited on signal 11 (core dumped)
re0: watchdog timeout
re0: link state changed to DOWN
re0: link state changed to UP
pid 1308 (php), uid 107: exited on signal 11
[root@server-01 /home/churchi]#

gkontos · May 18, 2012

@churchi,

I would recommend that you dust out the server and check your SATA cables. Then you can use sysutils/smartmontools to check the status of your drives. Make sure you perform the short test on all the drives.
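As a rough sketch of what "all the drives" means here, the helper below just prints the smartctl invocations for a list of device nodes. The device list is an assumption read off the dmesg output above (da0 to da7 on the LSI card, plus ad2 and ad12 on the onboard ports), and `smart_commands` is a made-up name. Depending on the controller, smartctl may need a `-d` option to reach disks behind it, so treat the exact command lines as a starting point.

```python
def smart_commands(devices):
    """Build smartctl command lines: start a short self-test, then report.

    Hypothetical helper; it only returns strings and runs nothing.
    """
    start = ["smartctl -t short /dev/" + dev for dev in devices]
    report = ["smartctl -a /dev/" + dev for dev in devices]
    return start, report


if __name__ == "__main__":
    # Assumed pool members, taken from the dmesg output in this thread.
    pool_disks = ["da%d" % n for n in range(8)] + ["ad2", "ad12"]
    start, report = smart_commands(pool_disks)
    print("Start the short tests:")
    for cmd in start:
        print("# " + cmd)
    print("A few minutes later, read the results:")
    for cmd in report:
        print("# " + cmd)
```

The short test runs inside the drive, so the second set of commands is only useful once the tests have had time to finish.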

For the time being, FreeBSD 8.3-RELEASE will be your best choice. If you want to go with 9.0, then I would recommend following 9.0-STABLE, as it has accumulated many bug fixes.

churchi · May 18, 2012

Hi gkontos, sounds good to me. I will take the case and things apart tomorrow and blow all the dust out. I'll then run the smartmontools tests you suggested.

So you think for the moment stick with FreeBSD 8.3-RELEASE? No need to go to 9.0 yet? Is there an update to ZFS in the new FreeBSD 9.0 that would be good to run? Or do you think waiting until, say, FreeBSD 9.1 is released would be best?

gkontos · May 18, 2012

churchi said:
So you think for the moment stick with FreeBSD 8.3-RELEASE? No need to go to 9.0 yet? Is there an update to ZFS in the new FreeBSD 9.0 that would be good to run?

Only in FreeBSD 9.0-STABLE, which eventually will become FreeBSD 9.1-RELEASE. If you are experienced with the OS then you can follow that path. Otherwise stick to FreeBSD 8.3-RELEASE for the moment.

churchi · May 20, 2012

No worries at all. I upgraded to FreeBSD 8.3-RELEASE without issues this afternoon. This is the first time I have upgraded since I built the server on 8.0-RELEASE.

Interestingly, when I rebooted the system, three drives immediately started resilvering. Is this normal? It lasted about 6 hours or so, much quicker than when I replaced the drive a few days ago. It seems that disk05 was fixing itself up, since it was offline for a day or so this week. However, I still had heaps of issues with disk08.

Here are some extracts:

Code:

[root@server-01 /home/churchi]# zpool status
  pool: storage1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun May 20 13:56:38 2012
        11.1T scanned out of 11.3T at 1/s, (scan is slow, no estimated time)
        266G resilvered, 98.48% done
config:

        NAME              STATE     READ WRITE CKSUM
        storage1          ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0    11  (resilvering)
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0  (resilvering)
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0  (resilvering)
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0

errors: No known data errors
[root@server-01 /home/churchi]#

Code:

[root@server-01 /home/churchi]# zpool status
  pool: storage1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: resilvered 276G in 4h4m with 0 errors on Sun May 20 18:01:25 2012
config:

        NAME              STATE     READ WRITE CKSUM
        storage1          ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0    11
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0

errors: No known data errors
[root@server-01 /home/churchi]#

Strangely enough, disk02 is the one I replaced earlier this week, and it has checksum errors. I will clear the stats now and see how it goes over the next day or so.
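
One way to keep an eye on the counters after clearing them is to filter the status output for devices with a non-zero READ, WRITE, or CKSUM column. The following is only a sketch: `nonzero_errors` is a hypothetical helper name, and the demo input is two lines copied from the status output above rather than live data.

```shell
#!/bin/sh
# List devices whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` output on stdin; prints: name read write cksum.
nonzero_errors() {
    awk 'NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ &&
         ($3 + $4 + $5) > 0 { print $1, $3, $4, $5 }'
}

# Demo on two lines taken from the status output in this thread.
nonzero_errors <<'EOF'
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0    11
EOF

# On the live system (not run here):
#   zpool clear storage1 label/disk02
#   zpool status storage1 | nonzero_errors
```

The three digit checks are there so that header and "scan:" lines, which also have five or more fields, are not mistaken for device rows.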

I have also upgraded my ZFS pool to the latest version:

Code:

This system is currently running ZFS pool version 28.

I still haven't had a chance to take apart the case or run smartmontools; I will get to that ASAP this week.

Do you think I need to look into replacing disk08?

Code:

(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd f6 0 0 eb 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd f6 0 0 eb 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd f6 0 0 eb 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd f6 0 0 eb 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd f6 0 0 eb 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd 62 0 0 f2 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd 62 0 0 f2 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd 62 0 0 f2 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd 62 0 0 f2 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)
(da7:mpt0:0:7:0): READ(10). CDB: 28 0 a9 9b cd 62 0 0 f2 0 
(da7:mpt0:0:7:0): CAM status: SCSI Status Error
(da7:mpt0:0:7:0): SCSI status: Check Condition
(da7:mpt0:0:7:0): SCSI sense: MEDIUM ERROR info:a99bce4b asc:11,0 (Unrecovered read error)

Code:

Geom name: da7
Providers:
1. Name: label/disk08
   Mediasize: 1500301909504 (1.4T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 2930277167
   length: 1500301909504
   index: 0

gkontos · May 20, 2012

You also need to upgrade the ZFS filesystems, if you haven't done so yet, with:

# zfs upgrade storage1

Regarding your disks, label/disk08 and label/disk02 seem to have issues. The first one is also logging errors in messages.
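
The da7 messages can be read a little more closely. In a READ(10) CDB, bytes 2 to 5 hold the starting LBA (big-endian) and bytes 7 to 8 the transfer length in sectors, while the `info:` field of the sense data gives the LBA that failed. The sketch below decodes the two distinct CDBs from the log; the variable and function names are illustrative only.

```shell
#!/bin/sh
# Decode the READ(10) lines logged for da7.
# CDB layout: opcode(1) flags(1) LBA(4, big-endian) group(1) length(2) control(1)
start1=$((0xa99bcdf6)); len1=$((0xeb))   # first CDB:  28 0 a9 9b cd f6 0 0 eb 0
start2=$((0xa99bcd62)); len2=$((0xf2))   # second CDB: 28 0 a9 9b cd 62 0 0 f2 0
bad=$((0xa99bce4b))                      # "info:" field of the sense data

# in_range START LEN LBA: succeed if LBA lies inside the request.
in_range() {
    [ "$3" -ge "$1" ] && [ "$3" -lt $(( $1 + $2 )) ]
}

echo "failing LBA: $bad"
echo "offset into first read:  $((bad - start1)) of $len1 sectors"
echo "offset into second read: $((bad - start2)) of $len2 sectors"
in_range "$start1" "$len1" "$bad" && echo "first read covers the failing LBA"
in_range "$start2" "$len2" "$bad" && echo "second read covers the failing LBA"
```

Both requests cover the same failing LBA (0xa99bce4b), which suggests one unreadable sector being hit repeatedly rather than errors spread across the disk. A long SMART self-test should confirm or rule that out.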

You can use smartmontools to perform both the short and the long test. It is better for the pool to be exported while you perform those tests.
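
A possible sequence for the long test with the pool exported is sketched below. It assumes the pool name storage1 and device da7 (label/disk08) from this thread; `long_test_plan` is a hypothetical helper that only prints the steps, so nothing is exported or tested until the commands are run by hand.

```shell
#!/bin/sh
# Print the command sequence for a long self-test with the pool exported.
# Assumptions: pool storage1 and device da7 (label/disk08) as in this thread.
long_test_plan() {
    pool=$1
    dev=$2
    echo "zpool export ${pool}"
    echo "smartctl -t long /dev/${dev}"
    echo "smartctl -l selftest /dev/${dev}"
    echo "zpool import ${pool}"
}

long_test_plan storage1 da7
```

The long test on a 1.5TB disk can take several hours, so the `smartctl -l selftest` step would need repeating until the log shows the test has completed, and only then should the pool be imported again.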

I suppose that you have already changed your SATA cables. If the long test doesn't reveal any errors try changing the SATA port too.