Degraded gmirror

z662 · Aug 16, 2011

Hello,

I have a similar thread that was closed out not too long ago: http://forums.freebsd.org/showthread.php?t=25060

I am not sure why this is a recurring issue, but I really need some help troubleshooting it.

Basically, I set up a simple RAID1 solution: I bought two brand-new identical HDDs and a brand-new controller that plugs into a PCI slot (the drives are ATA and my onboard controller seems to have broken). Originally I was getting lots of WRITE_DMA errors. When I installed the new controller those errors went away, but now they are back in a slightly different form. Considering that everything except the CPU and motherboard is pretty much brand new, I don't understand why my gmirror keeps reporting 'Degraded'. I have posted some sample error messages below.

I discovered these a few minutes ago:

Code:

Aug 16 07:32:07 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237485215
Aug 16 07:32:18 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237739007
Aug 16 07:32:30 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237784319
Aug 16 07:32:41 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237789567
Aug 16 07:32:52 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237805215
Aug 16 07:33:04 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=237851039
Aug 16 07:33:16 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238120703
Aug 16 07:33:28 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238145791
Aug 16 07:33:39 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238158847
Aug 16 07:33:51 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238191391
Aug 16 07:34:01 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238195871
Aug 16 07:34:12 mercury kernel: ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=238213151

So then I ran 'sudo gmirror forget gm0' followed by 'sudo gmirror insert gm0 /dev/ad4'
(/dev/ad5 was already part of the mirror, based on the output of 'gmirror status').

After that command I was greeted with this:

Code:

Aug 16 17:56:52 mercury kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad4.
Aug 16 17:56:53 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=20224
Aug 16 17:56:54 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=106752
Aug 16 17:57:01 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=505088
Aug 16 17:57:01 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=505088
Aug 16 17:57:01 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=518656
Aug 16 17:57:02 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=568832
Aug 16 17:57:02 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=579840
Aug 16 17:57:02 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=579840
Aug 16 17:57:02 mercury kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=579840
Aug 16 17:57:02 mercury kernel: GEOM_MIRROR: Synchronization request failed (error=5). ad4[WRITE(offset=296878080, length=131072)]
Aug 16 17:57:02 mercury kernel: GEOM_MIRROR: Device gm0: provider ad4 disconnected.
Aug 16 17:57:02 mercury kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad4 stopped.

z662 · Aug 17, 2011

I find it interesting that this very same command would not work a few hours ago, yet now it is accepted quietly. It still seems like I have a rather major issue with my RAID, though.

Code:

[brad@mercury ~]$ ls /dev
acd0       audit      fd         mdctl      stdout     ttyv8      ugen3.1
acpi       bpf        fido       mem        sysmouse   ttyv9      uhid0
ad4        bpf0       geom.ctl   mirror     ttyu0      ttyva      ukbd0
ad4s1      console    io         nfslock    ttyu0.init ttyvb      urandom
ad4s1a     consolectl kbd0       null       ttyu0.lock ttyvc      usb
ad4s1b     ctty       kbd1       pci        ttyv0      ttyvd      usbctl
ad4s1d     cuau0      kbd2       pf         ttyv1      ttyve      xpt0
ad4s1e     cuau0.init kbdmux0    ppi0       ttyv2      ttyvf      zero
ad4s1f     cuau0.lock klog       ptmx       ttyv3      ufsid
ad5        dcons      kmem       pts        ttyv4      ugen0.1
agpgart    devctl     log        random     ttyv5      ugen1.1
ata        devstat    lpt0       stderr     ttyv6      ugen2.1
atkbd0     dgdb       lpt0.ctl   stdin      ttyv7      ugen2.2
[brad@mercury ~]$ gmirror status
      Name    Status  Components
mirror/gm0  COMPLETE  ad5
[brad@mercury ~]$ sudo gmirror insert gm0 /dev/ad4
Password:
Sorry, try again.
Password:
[brad@mercury ~]$ sudo gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad5
                      ad4 (0%)

z662 · Aug 17, 2011

Looks like each time I try to rebuild the mirror it fails:

Code:

[brad@mercury ~]$ gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad5
[brad@mercury ~]$

Code:

Aug 17 07:53:02 mercury kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad4.
Aug 17 07:53:03 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=56320
Aug 17 07:53:03 mercury kernel: ad4: WARNING - WRITE_DMA UDMA ICRC error (retrying request) LBA=56320
Aug 17 07:53:03 mercury kernel: ad4: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=56320
Aug 17 07:53:03 mercury kernel: GEOM_MIRROR: Synchronization request failed (error=5). ad4[WRITE(offset=28835840, length=131072)]
Aug 17 07:53:03 mercury kernel: GEOM_MIRROR: Device gm0: provider ad4 disconnected.
Aug 17 07:53:03 mercury kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad4 stopped.
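The GEOM_MIRROR lines and the ad4 lines in these logs appear to describe the same failed write. A quick check, assuming 512-byte sectors (an assumption, not something the logs state), shows that each failing LBA multiplied by the sector size equals the byte offset GEOM_MIRROR reports, and that error=5 is the generic I/O error code. The helper name `lba_to_offset` is made up for this sketch.

```python
import errno

SECTOR_SIZE = 512  # assumption: the drives use 512-byte sectors


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Convert a sector number (LBA) to a byte offset on the disk."""
    return lba * sector_size


# (failing LBA from the ad4 FAILURE line, offset from the GEOM_MIRROR line)
pairs = [
    (579840, 296878080),  # Aug 16 rebuild attempt
    (56320, 28835840),    # Aug 17 rebuild attempt
]

for lba, reported_offset in pairs:
    print(lba, lba_to_offset(lba), lba_to_offset(lba) == reported_offset)

# length=131072 bytes in the GEOM_MIRROR message is one 128 KiB sync request
print(131072 // SECTOR_SIZE, "sectors per synchronization request")

# error=5 in the GEOM_MIRROR message corresponds to EIO (input/output error)
print(errno.errorcode[5])
```

In other words, the rebuild is not failing for a separate reason: the mirror gives up on ad4 as soon as the disk driver reports a write that could not be completed after its retries.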

wblock@ · Aug 17, 2011

Have you checked the drives for grown errors with sysutils/smartmontools? Look at the Reallocated_Sector_Ct value.
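As a rough illustration of that check, the attribute table printed by `smartctl -A` can be scanned for the raw values of interest. This is only a sketch: the function name `parse_smart_attributes` is hypothetical, and the sample text below is invented to show the usual column layout; it is not output from the drives in this thread. Attribute 199 (UDMA_CRC_Error_Count) is included because many drives report interface CRC errors there, which may be relevant to the ICRC messages in the logs.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output.

    Attribute rows have ten columns: ID, name, flag, value, worst,
    threshold, type, updated, when_failed, raw value. Only the first
    token of the raw value is kept.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


# Invented sample in the usual layout; the numbers are NOT real data.
sample_output = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

attrs = parse_smart_attributes(sample_output)
for name in ("Reallocated_Sector_Ct", "UDMA_CRC_Error_Count"):
    print(name, attrs.get(name, "not reported"))
```

On the real system the text would come from running `smartctl -A /dev/ad4` (and the same for ad5) and passing the output to the function.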

redw0lfx · Aug 17, 2011

I don't believe the issue is gmirror; more likely it is a bad hard drive or bad cables. I see you mentioned you bought two identical hard drives and they originally showed that same error. If they both came from the same batch, it is possible they were both bad to begin with. We have had this issue on more than one occasion, although it is a bit rare now.

Have you checked the cables to the hard drives and made sure they are good? For example, a cable might be too close to a heat source and have been damaged, or it might have been pinched by another hardware part.

You should run badblocks in non-destructive mode (unless you have backups) on the drive and see if it reports any errors.

z662 · Aug 17, 2011

Thanks for the replies,

I have already run the Western Digital HDD health utility (last time) and it reported no errors on either drive. I will run it again, though, to see if the results are any different. I wouldn't imagine badblocks would be able to detect any problems that the WD utility could not, would you agree?

In regard to the cables, I can take a very close look tonight, but those are also brand-new cables that are not pinched. Heat shouldn't be much of an issue either, as it's in an air-conditioned room and the server has 3 fans. I'm really scratching my head on this one...

I was pretty convinced that the controller solved the problem last time, and still am since once I replaced the controller it worked fine. Is it possible that my controller already died or is that as unlikely as I would expect? Maybe I should try booting up with the controller into another OS on a different drive to see if it works as intended.

Thoughts?

Oxyd · Aug 17, 2011

I got similar DMA errors when I connected my brand-new disk. SMART also reported the disk to be fine. In the end, I discovered it was due to loose cabling.

z662 · Aug 17, 2011

That doesn't seem to be the issue this time, as it happened again, at a semi-random time, with different errors (compared to last time). Additionally, I just opened the case and made sure everything was tight and snug.

In order to troubleshoot this:

a) How would I determine which disk is /dev/ad4 and which one is /dev/ad5? Would I have to reboot, or would the primary disk be /dev/ad5?

b) Where can I find out which error is relevant? I cannot find any information regarding the various types of WRITE_DMA/READ_DMA errors or the ICRC error.

Please advise.
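On question (b), the numbers in the FAILURE lines (status=51, error=84) are hexadecimal bit fields from the ATA status and error registers, and the kernel prints the names of the set bits in angle brackets. The sketch below decodes them; the bit names follow the log output where it shows them (READY, DSC, ERROR, ICRC, ABORTED), and the remaining names are my reading of the ATA register layout, so treat those as assumptions. The function name `decode` is made up for the example.

```python
# ATA status register bits (names partly assumed, see note above)
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DEVICE_FAULT",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTED",
    0x02: "INDEX",
    0x01: "ERROR",
}

# ATA error register bits (names partly assumed, see note above)
ERROR_BITS = {
    0x80: "ICRC",           # interface CRC error during an Ultra DMA transfer
    0x40: "UNCORRECTABLE",  # uncorrectable data error on the media
    0x20: "MEDIA_CHANGED",
    0x10: "ID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",        # command aborted
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}


def decode(value_hex, table):
    """Return the names of the bits set in a hex register value."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


print(decode("51", STATUS_BITS))
print(decode("84", ERROR_BITS))
```

The ICRC bit indicates that data was corrupted in transit between the controller and the drive, which is generally associated with the cable, connectors, or controller rather than with bad sectors on the platters. That would be consistent with the cabling suggestions in this thread, though it does not by itself rule out a faulty drive interface.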

Oxyd · Aug 17, 2011

z662 said:
a) How would I determine which disk is /dev/ad4 and which one is /dev/ad5? Would I have to reboot, or would the primary disk be /dev/ad5?

Try smartctl -i /dev/ad4 and then match the displayed information with information on the stickers on the drives.