I have a RAID-Z1 pool of three drives running ZFS on FreeBSD 8.4, and the drives are experiencing hardware problems. I'm hoping someone can help, as I don't know how to proceed.
The machine was set up with three one-terabyte SATA hard drives, seen by the system as ad8, ad10, and ad12. There are three slices on each: the first holds the OS and is configured with gmirror so there are three copies of it; the second is swap; the third is for data - programs I've written, PC images for machines I've installed, etc. - and runs ZFS with its implementation of RAID-5, AKA RAID-Z.
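For reference, the setup was created with something like the commands below. The gmirror device name (gm0) and the exact invocations are from memory, so treat this as a sketch of the layout rather than the precise commands; the pool name really is tank, though.

# slice 1 on each drive: the OS, mirrored three ways with gmirror
gmirror label -v gm0 /dev/ad8s1 /dev/ad10s1 /dev/ad12s1
# slice 3 on each drive: the RAID-Z data pool
zpool create tank raidz ad8s3 ad10s3 ad12s3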
My PC rebooted spontaneously twice last week. That's obviously not normal, and on the second reboot the startup messages included an 'ad12 TIMEOUT' error. I ordered a new drive, figuring the third drive of the array was in the process of dying.
When the new drive arrived, I booted from a live CD and cloned the failing drive with dd - there was about a 4 GB stretch around 760 GB in that would error out, so I used the conv=noerror parameter to skip past the problem area and make a bitwise copy of everything else. Since there are no open SATA ports in the machine, I removed the old drive and plugged the new one in its place - same bus number means the same device node, and since it's a clone, I figured it should pick up right where the old drive left off.
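The clone went roughly like this; the output device node is a placeholder (I don't recall exactly how the new drive showed up under the live CD) and the block size is from memory, but conv=noerror was the relevant part:

# copy the whole failing disk, continuing past the unreadable stretch
dd if=/dev/ad12 of=/dev/da0 bs=1m conv=noerror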
The first part of that expectation held - the system gave the new drive the same device node. However, the pool showed as degraded when looked at with zpool status. This was puzzling, because the new drive should be bit-for-bit identical to the old one except for that bad 4 GB stretch around 760 GB into the drive where the I/O errors occurred on the original ad12. I figured it wasn't a problem - I'd just recreate that data from what's on the other two drives. That's why I made it a RAID 5 to begin with, after all: if any one drive fails, I don't lose any data. So I happily issued zpool replace tank ad12s3 ad12s3 and watched it start rebuilding. ('Resilvering' is the term used in the zpool output.)
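So the sequence against the pool at that point was roughly:

zpool status tank                  # pool reported as degraded
zpool replace tank ad12s3 ad12s3   # rebuild onto the cloned drive
zpool status tank                  # resilver now in progress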
However, unbeknownst to me at the time, ad8 had more serious issues than ad12 - read errors, spontaneous drive resets, etc. - even though there had been no indication of errors in the startup messages. Basically, the machine can't finish reconstructing ad12 before ad8 becomes temporarily non-operational. It's gotten as far as 210 GB before dying, but then either the resilver starts over or the machine panics. I've tried everything I know to keep ad8 running long enough to rebuild ad12, up to and including leaving the case open with a desk fan blowing across the drives, but nothing has helped.
There are two more drives on the way, to replace the failing ad8 and then the presently good ad10 once the others are done. (I no longer trust Seagate drives; they used to be good, but I've had more problems over the past couple of years than in the previous decade combined. Plus, the drives had a 5-year warranty according to Amazon when I bought them - I believe it's printed on the box as well - yet three years later they show as out of warranty; researching this uncovered that Seagate apparently decided to retroactively shorten the warranty periods on their drives.) The new drives should be here Monday, but even with them in hand, I'm not quite sure how best to proceed.
If I put the original ad12 back in the system with the intention of rebuilding ad8 first, I have to think ZFS would start reconstructing it and put the kibosh on the data it holds, because the pool is in replace mode for that device. Using zpool detach tank ad12s3/old to stop the replace doesn't work: it says there are no valid replicas of the data, even though the ad12s3/old device - which is what showed up after I issued the replace command above - shows as unavailable because it's not connected. zpool scrub -s tank does stop the resilvering, but I'm still unable to take the drive offline or detach it from the pool.
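To summarize what I've tried against the pool so far (the device argument on the offline attempt is from memory, so that line in particular is approximate):

zpool detach tank ad12s3/old   # refuses: no valid replicas
zpool scrub -s tank            # stops the resilver
zpool offline tank ad12s3      # still won't let me offline (or detach) the drive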
At this point, I'm pretty sure I'm going to lose some data; I'd just like to minimize it as much as possible. Any advice?