gmirror + raid = broken

z662 · Jul 15, 2011

I recently added a RAID-1 mirror to my existing web server. To do this I bought two identical, brand-new hard drives, then set up the mirror and synced the drives per the Handbook: http://www.freebsd.org/doc/handbook/geom-mirror.html

All was fine and gmirror status reported the mirror as 'Complete'. However, today (about 3 days after the setup) I found tons of error messages like the ones below in my logs:

Code:

Jul 14 12:55:50 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258096703
Jul 14 12:56:00 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040543
Jul 14 12:56:12 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040479
Jul 14 13:00:11 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040051
Jul 14 13:00:38 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1471
Jul 14 13:05:39 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1311
Jul 14 13:11:10 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040255
Jul 14 13:22:10 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040051
Jul 14 13:22:20 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040479
Jul 14 13:25:40 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040447
Jul 14 13:25:50 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040511
Jul 14 13:26:00 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=257664127
Jul 14 13:26:21 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=257663903
Jul 14 13:26:21 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040511
Jul 14 13:26:41 mercury last message repeated 2 times
Jul 14 13:26:51 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258040511
Jul 14 13:27:01 mercury kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=257664159

And of course, when I check gmirror status, I see that the mirror is 'Degraded'.

Assuming ad1 isn't bad already, what can I do? Is it likely that I bought a faulty hard drive?
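
For anyone triaging similar output, here is a minimal Python sketch that tallies timeout lines per device and per LBA, which can show whether the errors cluster on a few sectors or are spread across the disk. It assumes log lines shaped exactly like the ones quoted above; the function name `summarize_timeouts` and the regular expression are illustrative, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted above, e.g.
# "... kernel: ad1: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=258096703"
# (assumption: other kernel versions may word this message differently)
TIMEOUT_RE = re.compile(r"kernel: (\w+): TIMEOUT - (\w+) retrying .* LBA=(\d+)")


def summarize_timeouts(lines):
    """Return (count per device, count per (device, LBA)) for timeout lines."""
    per_device = Counter()
    per_lba = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            # Unrelated lines, including "last message repeated N times",
            # are skipped rather than guessed at.
            continue
        device = match.group(1)
        lba = int(match.group(3))
        per_device[device] += 1
        per_lba[(device, lba)] += 1
    return per_device, per_lba
```

Passing it the lines of the system log (for example, the contents of /var/log/messages) returns two counters; `per_lba.most_common(5)` then lists the most frequently failing addresses.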

kev4bsd · Jul 15, 2011

I believe I went through this a while back when I set up RAID 1 with gmirror as well. I would shut the system down, detach the suspect drive and test it. If it is faulty, replace it, then have gmirror forget the old component and insert the new hard drive. I read somewhere that the replacement doesn't have to be the exact same model; it can be a bigger hard drive, but the extra space will be unusable.
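
The forget-and-insert sequence described above can be sketched as a small Python helper that builds the `gmirror` commands and only prints them unless told otherwise. The mirror name `gm0` and the provider `/dev/ad1` are assumptions (the Handbook example uses `gm0`; `ad1` is the device in the log above), and the helper names are hypothetical.

```python
import subprocess


def replacement_commands(mirror="gm0", new_provider="/dev/ad1"):
    """Build the gmirror commands for swapping in a replacement drive.

    Both arguments are placeholders; substitute the real mirror name
    and device node before running anything.
    """
    return [
        ["gmirror", "forget", mirror],                # drop the missing component
        ["gmirror", "insert", mirror, new_provider],  # add the new drive; a resync follows
        ["gmirror", "status", mirror],                # check rebuild progress
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling `run_commands(replacement_commands())` prints the three commands without touching the system; passing `dry_run=False` as root would actually execute them.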

z662 · Jul 16, 2011

That is correct. However, before I do that, I'd like to confirm that this drive really is faulty, since I did not see any timeout or write errors when I first set the mirror up, and the drive is brand new.

Keeping that in mind, how should I proceed to do that in the safest/easiest manner possible? I was thinking I should just detach the drive (destroy the mirror) then proceed with the same steps I took last time after re-copying all the partitions and give it one more shot. From what I have read online, running fsck at this point and/or rebooting will in fact cause some headaches...

Thoughts??

jem · Jul 16, 2011

Install sysutils/smartmontools from ports, then use the smartctl program to check the drive's health.
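
As a rough illustration of that check, the Python sketch below builds a few common `smartctl` invocations and looks for the overall-health verdict in the output. The device path `/dev/ad1` is an assumption taken from the log earlier in the thread, the helper names are hypothetical, and the exact wording of the health line is assumed to match what smartctl prints for ATA drives.

```python
def smartctl_commands(device="/dev/ad1"):
    """Build smartctl invocations for a basic health check (run as root)."""
    return [
        ["smartctl", "-H", device],           # overall health verdict
        ["smartctl", "-a", device],           # full SMART report, incl. error log
        ["smartctl", "-t", "short", device],  # start a short self-test
    ]


def health_passed(smartctl_output):
    """Return True/False from the health line, or None if it is absent.

    Assumes a line such as
    "SMART overall-health self-assessment test result: PASSED".
    """
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment" in line:
            return line.strip().endswith("PASSED")
    return None
```

Feeding the text printed by `smartctl -H` into `health_passed` gives a quick yes/no; the full `-a` report is still worth reading by hand.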

z662 · Jul 18, 2011

Both smartmontools and Western Digital's diagnostic utility have confirmed that both drives are in fact working properly. From what I have been told, I believe I need a new controller. I plan on purchasing a PCI controller card for my IDE drives. I will report back with the results. Feel free to comment in the meantime.

z662 · Jul 21, 2011

I have just received my hard drive controller. I made sure the jumpers were set correctly (one master, one slave) on the corresponding drives. After booting, I had gmirror 'forget' the missing component and re-inserted the drive. The drives appear to be working fine and the mirror is being rebuilt as I type this. Glad it was just the controller after all.
