ZFS RAID-Z - Failing Drives

I have a RAID-Z1 pool of three drives running ZFS on FreeBSD 8.4 which are experiencing hardware difficulties. I'm hopeful someone will be able to help me as I don't know how to proceed.

The machine was set up with three one-terabyte SATA hard drives, seen by the system as ad8, ad10, and ad12. There are three slices on each. The first is for the OS, configured with gmirror to provide three copies. The second on each drive is swap. The third is for data - programs I've written, PC images for machines I've installed, etc. - and runs ZFS and its implementation of RAID-5, AKA RAID-Z.
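
For reference, the setup amounts to roughly the following commands (the gmirror label name and balance algorithm shown are illustrative - I don't have the exact history in front of me; the pool name tank comes up later):

  # Three-way gmirror for the OS on the first slices:
  gmirror label -v -b round-robin gm0 ad8s1 ad10s1 ad12s1
  # Swap lives on the second slice of each drive (normally via /etc/fstab).
  # Single-parity RAID-Z on the third slices:
  zpool create tank raidz ad8s3 ad10s3 ad12s3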

My PC rebooted spontaneously twice last week. This is obviously not normal and on the second reboot, the startup messages had one about an 'ad12 TIMEOUT' error. I ordered a new drive, knowing that the third drive of the array was in the process of dying.

When the new drive arrived, I booted from a live CD and cloned the failing drive with dd - there was about a 4 GB stretch around 760 GB in that would error out, so I used the conv=noerror parameter to bypass the problem area, making a bitwise copy of everything else. Since there are no open SATA ports in the machine, I removed the old drive and plugged the new one in its place - same bus number = same device node, and cloned = should take right off, I thought.
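
For the record, the clone command was essentially of this form (block size from memory; the device node the new drive had while cloning is a placeholder):

  dd if=/dev/ad12 of=/dev/ad14 bs=1m conv=noerror
  # conv=noerror keeps dd going past read errors. Worth noting: on its own it
  # drops unreadable blocks from the output, shifting everything after them;
  # conv=noerror,sync pads the failed reads instead, keeping offsets aligned.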

The first part of that was correct - the system gave it the same device node. However, it showed as degraded when looked at with zpool status. This was puzzling because the new drive should be bit-for-bit identical to the old one with the exception of that bad 4 GB stretch around 760 GB into the drive where the I/O errors occurred on the original ad12 drive. I figure it's not a problem - I'll just recreate the data from what's on the other two drives. That's why I made it a RAID 5 to begin with after all, so if any drive fails, I'd not lose any data. I happily issued the zpool replace tank ad12s3 ad12s3 command and watched it start rebuilding. ('Resilvering' is the term used in the zpool output.)

However, unbeknownst to me at the time, ad8 had more serious issues than ad12 - read errors, the drive resetting itself spontaneously, etc. - even though there was no indication of errors in the startup messages. Basically, the machine can't reconstruct ad12 before ad8 becomes temporarily non-operational. It's gotten as high as 210 GB before dying, but then either the resilver starts over again or the machine panics. I've tried everything I know to keep ad8 functioning long enough to rebuild ad12, up to having the case open with a desk fan blowing over the drives as it runs, but nothing has helped.

There are two more drives on the way to replace the failing ad8 and the presently-good ad10 once the others are done. (I no longer trust Seagate drives; they used to be good, but I've had more problems over the past couple years than in the previous decade combined. Plus, the drives had a 5-year warranty according to Amazon when I bought them - I believe this is printed on the box as well - but 3 years later they show as out of warranty; researching this uncovered that Seagate apparently decided to retroactively lower the warranty periods on their drives. :( ) The new drives should be here Monday, but even after I have the new drives in hand, I'm not quite sure how best to proceed.

If I put the original ad12 back in the system with the intention of rebuilding ad8 first, I have to think that ZFS would start reconstructing it and put the kibosh on the data on it because it's in replace mode for that device. Using zpool detach tank ad12s3/old to stop the replace doesn't work because it says there are no valid replicas of the data, even though the ad12s3/old device, which is what showed up after I issued the replace command above, shows as unavailable because it's not connected. zpool scrub -s tank works to stop the resilvering, but I'm still unable to take the drive offline or detach it from the pool.

At this point, I'm pretty sure I'm going to lose some data; I'd just like to minimize it as much as possible. Any advice?
 
Using this excellent thread regarding a similar condition, I discovered that ad8 is failing on the same sector each time. However, the procedure described in that thread breaks down because fsdb cannot find a superblock it understands on ad8s3. Further, it refuses to operate on the ZFS pool/tank.

Does anyone know how to figure out which file occupies a given sector/LBA on a ZFS pool? (Maybe if I delete the file occupying the sector causing the problem, I can get it to complete? It made it to 91.2% done before dying last time, so maybe the desk fan blowing over the drives is helping.)

Apologies for the lack of formatting in my original post - I forgot in my hurry to get details posted in the hopes that somebody could help. I didn't know that gig wasn't an acceptable term for gigabyte; that must be new since I was active here, but I will of course comply in the future.
 
Ruler2112 said:
I didn't know that gig wasn't an acceptable term for gigabyte; that must be new since I was active here, but I will of course comply in the future.
It's no big deal, at least not as far as I'm concerned (me being the moderator who called you on it). Most people probably know that gig means GB or GiB and it's the same with meg (for which I also have a witty remark referencing Family Guy). But it's in the rules, so there you go.

Unfortunately I'm not a ZFS expert. If I were, I would of course have tried to help.
 
Ruler2112 said:
At this point, I'm pretty sure I'm going to lose some data; I'd just like to minimize it as much as possible. Any advice?
Disclaimer: I'm not a ZFS expert (just a happy user), but unfortunately I know something about file systems and RAID. My advice: a bottle of good wine. You will probably suffer data loss, and the wine will make it more bearable.

My PC rebooted spontaneously twice last week. This is obviously not normal
Unfortunately, with SATA drives and non-hardened kernels, this is normal. With really good kernels, and SCSI/SAS drives, the drive can do nearly anything it wants, and the OS will not fail (it will report disk errors, it might even disable the whole disk, but it will keep running). With SATA, rebooting is often the indicator that something is wrong with the drive.

and on the second reboot, the startup messages had one about an 'ad12 TIMEOUT' error.
That's either a bad sign, or a really bad sign. If it means that the drive is taking very long (many seconds) to perform an IO, that's bad, but it is not catastrophic: your OS is still talking to the drive, and the drive is still trying to do work on your behalf. On the other hand, it might also mean that the OS has lost communication with the drive, in which case getting the data off it is likely to be impossible.

I ordered a new drive, knowing that the third drive of the array was in the process of dying.
What you SHOULD have done: Regularly scrub your pool. That might have helped find latent errors earlier. And: have a spare drive ready to go, the moment the first drive starts getting errors. On an enterprise-grade installation, spare drives (perhaps already powered up, or perhaps spare space on drives) are kept ready at all times, just for that reason: So a single fault can be cured really quickly, before it can escalate to a double fault.
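
As a concrete sketch of that advice (the pool name is from your post; the spare's device node and the scrub schedule are just examples):

  # Scrub on a schedule, e.g. weekly from root's crontab:
  #   0 3 * * 0  /sbin/zpool scrub tank
  # And keep a hot spare attached, so ZFS can react to the first fault on its own:
  zpool add tank spare ad14s3
  zpool status tank    # the spare shows up under its own "spares" section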

When the new drive arrived, I booted from a live CD and cloned the failing drive with dd - there was about a 4 GB stretch around 760 GB in that would error out, so I used the conv=noerror parameter to bypass the problem area, making a bitwise copy of everything else. Since there are no open SATA ports in the machine, I removed the old drive and plugged the new one in its place - same bus number = same device node, and cloned = should take right off, I thought.

To be honest, that was not a good idea, although at the time it might have been your only choice. You think you are smarter than ZFS? Why don't you let ZFS do the cloning? It knows exactly which files are allocated where on the drive, and it can do a better job moving data. Ideally, you should have put the spare drive in, and let ZFS move the data for you. Unfortunately, my suggestion may fail the reality test: First, you said you don't have enough SATA ports, so doing out-of-band cloning was your only choice. This is a case of bad planning: how were you intending to do drive replacement, if the old (still partially readable) drive and its replacement spare can't be in the system at the same time? But, to make things even worse: If reading the bad drive causes your kernel to reboot, then ZFS won't make much headway either, and manually cloning might have been the only realistic option.

The first part of that was correct - the system gave it the same device node. However, it showed as degraded when looked at with zpool status. This was puzzling because the new drive should be bit-for-bit identical to the old one with the exception of that bad 4 GB stretch around 760 GB into the drive where the I/O errors occurred on the original ad12 drive.
ZFS can use much more than just the "bits on the disk" or the device node to identify the drive. For example, it can look at the drive serial number or WWN. From what you said, it seems that ZFS knew that the (partial) clone you made was not the real thing, and ignored it.

If the one dead drive were your only problem, even that would have been fine. But ...

I figure it's not a problem - I'll just recreate the data from what's on the other two drives. That's why I made it a RAID 5 to begin with after all, so if any drive fails, I'd not lose any data. I happily issued the zpool replace tank ad12s3 ad12s3 command and watched it start rebuilding. ('Resilvering' is the term used in the zpool output.)

However, unbeknownst to me at the time, ad8 had more serious issues than ad12 - read errors, the drive resetting itself spontaneously, etc. - even though there was no indication of errors in the startup messages. Basically, the machine can't reconstruct ad12 before ad8 becomes temporarily non-operational. It's gotten as high as 210 GB before dying, but then either the resilver starts over again or the machine panics. I've tried everything I know to keep ad8 functioning long enough to rebuild ad12, up to having the case open with a desk fan blowing over the drives as it runs, but nothing has helped.

Sadly, with today's large disks, this is not unexpected: the probability of finding a fault on the second disk while rebuilding a dead first disk is high. There are even some nice quotes from high-ranking NetApp executives to that effect. Your layout is single-fault tolerant; you had a double fault.

If you had had the old ad12 disk in place, you might have gotten lucky, in that there is no single piece of the disk that has faults on *both* ad12 and ad8. But your ad12 disk is now gone (you replaced it with a clone that is not the real thing). On the other hand, even having it in the machine might not help, because the faults on ad8 keep crashing your machine, which makes recovery awfully hard.

I no longer trust Seagate drives; they used to be good, but I've had more problems over the past couple years than in the previous decade combined. Plus, the drives had a 5-year warranty according to Amazon when I bought them and I believe this is printed on the box as well, but 3 years later show as out of warranty; researching this uncovered that Seagate apparently decided to retroactively lower the warranty periods on their drives. :(
You are free to have preferences for certain drives over others. Matter-of-fact, I have preferences too. As do the Backblaze guys (and they have ample data, but it applies to a situation very different from yours or mine). But be aware of the following: you and I are using a small number of disks on our systems. While there are differences between the reliability of various vendors, on our small systems those differences are small effects. On the other hand, setting up our systems to be fault-tolerant, ideally even multiple-fault tolerant, and maintaining them well (check temperatures, scrub disks regularly, look at SMART data for advance warning, keep backups) is a huge effect, much more important than what specific brand of disk we buy.

If I put the original ad12 back in the system with the intention of rebuilding ad8 first, I have to think that ZFS would start reconstructing it and put the kibosh on the data on it because it's in replace mode for that device. Using zpool detach tank ad12s3/old to stop the replace doesn't work because it says there are no valid replicas of the data, even though the ad12s3/old device, which is what showed up after I issued the replace command above, shows as unavailable because it's not connected. zpool scrub -s tank works to stop the resilvering, but I'm still unable to take the drive offline or detach it from the pool.

You put your finger on the real problem: You need to tell ZFS to "do the right thing". Here is what I would try: Get more SATA ports into the machine. Then put the real ad12 disk back in, the partial new copy of ad12, and the old ad8, plus two spare disks, all into the system. I think a good SAS controller is likely to help, since it shields the OS from the stupidity of the SATA bus, and if a drive totally misbehaves, the SAS HBA will probably just disable that one drive and get on with life. I would look for one of the high-end LSI SAS HBAs (the 92xx and 93xx series), and temporarily use it in SATA mode. Expensive, but very good.

But: You need to see whether ZFS can rebuild two disks at once. This requires a few things that are outside my skill area with ZFS: Tell ZFS to have both the old and new incarnation of ad12 in the pool at the same time, and tell it to avoid using ad8, and only fall back to it when data is not available on either incarnation of ad12. You might get lucky and ZFS will automatically do "the right thing", but I would try to locate some ZFS expertise.
 
Thanks for the reply, Ralph. You raise some interesting points.


ralphbsz said:
When the new drive arrived, I booted from a live CD and cloned the failing drive with dd - there was about a 4 GB stretch around 760 GB in that would error out, so I used the conv=noerror parameter to bypass the problem area, making a bitwise copy of everything else. Since there are no open SATA ports in the machine, I removed the old drive and plugged the new one in its place - same bus number = same device node, and cloned = should take right off, I thought.

To be honest, that was not a good idea, although at the time it might have been your only choice. You think you are smarter than ZFS? Why don't you let ZFS do the cloning? It knows exactly which files are allocated where on the drive, and it can do a better job moving data. Ideally, you should have put the spare drive in, and let ZFS move the data for you. Unfortunately, my suggestion may fail the reality test: First, you said you don't have enough SATA ports, so doing out-of-band cloning was your only choice. This is a case of bad planning: how were you intending to do drive replacement, if the old (still partially readable) drive and its replacement spare can't be in the system at the same time? But, to make things even worse: If reading the bad drive causes your kernel to reboot, then ZFS won't make much headway either, and manually cloning might have been the only realistic option.

I use dd in this manner all the time to replace drives that are in the process of failing and the OS usually just picks up the new drive automatically. I've done it successfully with Linux a few times and with Windows more times than I care to think about. I figured that by cloning the original bad drive and doing a simple replace, the system would pick it up like all the other file systems I've done it on in the past. Either ZFS is too smart or too dumb for that... ;) It'd be much preferable IMO to simply do a bit-wise clone of the original drive as above, skipping bad sectors as needed, have ZFS pick it up as being part of the array, and scrub the data to repair what was lost from the bad sectors on the old drive.

To be honest, I planned on doing drive replacement exactly how I did - physically swap the drive and let it re-stripe from the other two. Unfortunately, the error reporting aspects of ZFS that I read so much about before deciding to use it seem to be somewhat exaggerated. There was no complaining or notifications from ZFS about a drive having issues, much less two of them. The only way I discovered it was having mysterious reboots and watching the startup messages.


ralphbsz said:
ZFS can use much more than just the "bits on the disk" or the device node to identify the drive. For example, it can look at the drive serial number or WWN. From what you said, it seems that ZFS knew that the (partial) clone you made was not the real thing, and ignored it.

The serial number of the drive is also cloned using dd in that way if I'm not mistaken. Haven't tested recently, but I remember several years ago being very surprised by the serial number of the replacement being the same as the original once cloned.


ralphbsz said:
If you had had the old ad12 disk in place, you might have gotten lucky, in that there is no single piece of the disk that has faults on *both* ad12 and ad8. But your ad12 disk is now gone (you replaced it with a clone that is not the real thing). On the other hand, even having it in the machine might not help, because the faults on ad8 keep crashing your machine, which makes recovery awfully hard.

See below about updates regarding this - managed to get the original ad12 added back in.


ralphbsz said:
You put your finger on the real problem: You need to tell ZFS to "do the right thing". Here is what I would try: Get more SATA ports into the machine. Then put the real ad12 disk back in, the partial new copy of ad12, and the old ad8, plus two spare disks, all into the system. I think a good SAS controller is likely to help, since it shields the OS from the stupidity of the SATA bus, and if a drive totally misbehaves, the SAS HBA will probably just disable that one drive and get on with life. I would look for one of the high-end LSI SAS HBAs (the 92xx and 93xx series), and temporarily use it in SATA mode. Expensive, but very good.

But: You need to see whether ZFS can rebuild two disks at once. This requires a few things that are outside my skill area with ZFS: Tell ZFS to have both the old and new incarnation of ad12 in the pool at the same time, and tell it to avoid using ad8, and only fall back to it when data is not available on either incarnation of ad12. You might get lucky and ZFS will automatically do "the right thing", but I would try to locate some ZFS expertise.

Can't add SATA ports - the machine has no room for additional cards. I have some USB SATA adapters that I've played with since my original post, but I have yet to get ZFS to accept a replacement drive through one. :(

My trying to find ZFS expertise is why this thread exists. :)



Additional details to come.
 
I realized that I had some adapters that will basically turn any SATA device into a huge USB flash drive. I plugged the original ad12 into it - the system gave it da0 just like it does any other flash drive, but ZFS did not recognize it as part of the pool. I was able to add a new drive to the pool as a spare using this adapter, but it doesn't appear that ZFS will use it. These adapters have a red LED that lights up when the drive is accessed and while it flickered once when the device was added to the pool, I never saw it light up again. I thought of trying to replace ad8 using a drive in a USB enclosure like this, but I think the below would probably happen, trapping it in an endless resilvering process, and I still haven't figured out how to stop a zpool replace once it's begun.

Belatedly, I realized that the DVD burner in this system is SATA; disconnecting that allowed me to plug the old ad12 into it. I booted up, thinking that I might be able to play around and get ZFS to recognize it as part of the pool. The system gave it ad14 and to my amazement, ZFS picked it right up and knew it was what the new ad12 should resilver off of. (This confirms what you said earlier Ralph - ZFS must look at something other than the data on the drive/device to recognize parts of a pool. I'd be very interested to know what - does anyone happen to know exactly what it does look at?) Being very optimistic at finally getting a piece of good luck in this whole ordeal, I let the new ad12 resilver with both ad8 and the original ad12 now seen as ad14 to pull from.
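
(Poking around with zdb suggests ZFS writes redundant "vdev labels" onto each member device, containing the pool GUID and a per-device GUID, which would explain how it matched the drive regardless of device node - though I'm not certain that's the whole story:)

  zdb -l /dev/ad14s3
  # dumps the vdev labels, including pool_guid and a per-device guid;
  # presumably these are what ZFS matches on, not the device node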

Believe it or not, this actually made things worse. Instead of taking just over two hours to resilver, it now takes nearly four. Plus, the resilver starts over whenever a bad sector is encountered on either ad8 or ad14. I once saw it get to 92.1% complete before restarting... was that ever frustrating! Seems silly that there's a RAID available and ZFS is unable to simply use the other devices to compensate for errors on one. Wish I could tell ZFS to just continue on after read errors and let me know what files are affected, but I don't know how to do this and have found nothing online to indicate that it's even possible. Does anyone know if this is possible and if so, how to do it?

My research has uncovered many methods of determining which file is assigned to a given block on a device, but none for ZFS - the methods and utilities for UFS, ext2, reiserfs, etc. simply do not work on ZFS. If I could figure out which files are using the bad sectors, I could simply copy or erase the files and the problem would likely disappear because it wouldn't need to access the problematic areas. I'd certainly rather lose a few files than all of them. zdb provides quite a bit of information regarding the ZFS internal structures, but nothing that would be useful for this purpose from what I can see. Anyone know how to figure out which file is in a given block on ZFS? (I thought of writing a script that would simply execute dd if=whatever of=/dev/null for each file in the pool and watch to see which generate console messages indicating a read error, but this would take a long time and I seriously question the reliability of such a method given how the system and disks buffer IO operations.)
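
The brute-force version I have in mind would look something like this (assuming the pool is mounted at /tank; as I said, I question how trustworthy the results would be given caching):

  # Read every file in the pool and log the ones that error out.
  # dd exits non-zero when it hits a read error.
  find /tank -type f | while read -r f; do
      dd if="$f" of=/dev/null bs=1m 2>/dev/null || echo "$f" >> /tmp/bad-files.txt
  done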

I found that there was a firmware update available for the particular model drives that are failing. Updated the firmware via their FreeDOS based LiveCD, hoping that they improved the error detection and recovery aspects of the drive. Everything went fine and the drives are running the new firmware, but the resilver again stops and restarts at the first bad sector it encounters. :(

Further research revealed that modern drives will automatically detect and relocate bad blocks. However, they do so only when the sector is written to, not when being read. Discovering this, I had an idea that might force the drive to detect and work around the bad sectors without destroying everything on the drive. I booted the system from a Linux LiveCD with only the 2 failing drives connected and ran badblocks -svn on each drive. This reads each sector on the drive and writes the same data back to it. (By default it does every single sector on the entire drive, but you can specify ending and starting points; this becomes important shortly.) It obviously takes an excessively long time, but should IMO allow the drive to detect the bad sectors and relocate them at a firmware level. Of course, this approach assumes that ZFS doesn't think it's smarter than the drive and bypass its firmware somehow, disregarding what the drive sees as bad sectors. Interestingly, by using this tool I was able to determine that if I access sectors between 717239661 and 717500000 on ad12, the drive starts making a very annoying high pitched whine and resets itself, disconnecting from the SATA bus until power cycled. All the sectors before that point don't cause this and sectors after don't either. I know from earlier experience that the drive disconnecting itself also makes the resilver of the new ad12 start over again. Is there a way to force the drive to not even try accessing certain blocks so the resilver won't die when it wants to read them and instead use the parity on the other drives to continue?
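
For reference, the invocations look like this (the device node under the Linux LiveCD is a placeholder; -b 512 is how I'd make badblocks' block numbers line up with 512-byte sectors, and note that badblocks(8) takes the last block before the first):

  # Non-destructive read-write pass over the whole drive:
  badblocks -svn /dev/sdb
  # The same test restricted to the suspect span:
  badblocks -b 512 -svn /dev/sdb 717500000 717239661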


My plan is to let the non-destructive read-write test complete, then whittle down the range of sectors that cause ad12 to reset in the hopes that I can figure out a way to prevent the drive from trying to access these. If the resilver again fails after this read-write test is complete and I'm unable to learn how to stop the drive from killing itself, I'm going to boot the system with only ad8 and ad10 and copy off all the files in the pool to a few large external drives. There will obviously be some that fail because of the bad sectors on ad8 and ad12 not being connected - I'll make note of these files. Then I'll power off and boot up with only ad10 and ad12 in the system and try copying the files off that failed in the first configuration, again noting which files fail. Finally, I'll boot with all three drives connected and attempt to copy the remaining files. Whatever files error this time I'm going to assume are irrevocably lost. (Basically, I'm going to be doing manually with another drive what the RAID should be doing automatically, working around the bad sectors on one device by using the information stored on the others in the array.) I'll then replace the drives, recreate the pool on the new drives, and copy everything from the external drives to the new pool.

The next step is to create a script to fire from cron that will run SMART tests and e-mail me the results nightly.
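
A first sketch of that script (drive list, wait time, and mail recipient are placeholders):

  #!/bin/sh
  # Nightly SMART check: kick off short self-tests, wait, then mail the results.
  DRIVES="ad8 ad10 ad12"
  for d in $DRIVES; do
      smartctl -t short /dev/$d
  done
  sleep 600    # give the short tests time to complete
  for d in $DRIVES; do
      echo "===== $d ====="
      smartctl -a /dev/$d
  done | mail -s "nightly SMART report" root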



Anyone know the answers to any of my above questions or see a flaw in my proposed plan?
 
sysutils/smartmontools has a daemon that can be configured to send email if there is a problem. It can also run short or long tests. Just remember that it does not do this by default.
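
A minimal /usr/local/etc/smartd.conf along these lines (the schedule shown is just an example: short self-test daily at 02:00, long test Saturdays at 03:00, mail root on trouble):

  /dev/ad8  -a -m root -s (S/../.././02|L/../../6/03)
  /dev/ad10 -a -m root -s (S/../.././02|L/../../6/03)
  /dev/ad12 -a -m root -s (S/../.././02|L/../../6/03)
  # and enable the daemon with smartd_enable="YES" in /etc/rc.conf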

I would not use dd(1) to try to sneak a new drive into a ZFS pool without ZFS noticing. That's kind of the opposite of how things are supposed to be done.

Likewise, writing to a failing drive is dangerous. It makes the heads access those bad spots multiple times with retries. Depending on the problem, that might create more bad blocks (like if more oxide flakes off the disk and splats into the heads).

ZFS does not bypass the drive firmware, it can't.

Regrettably, probably none of this helps you now. In the future, a RAIDZ2 layout gives more redundancy. I'm becoming convinced that any drive with a terabyte or more of disk space should be in at least a mirror. The bigger drives just aren't very trustworthy.
 
Ruler2112 said:
To be honest, I planned on doing drive replacement exactly how I did - physically swap the drive and let it re-stripe from the other two. Unfortunately, the error reporting aspects of ZFS that I read so much about before deciding to use it seem to be somewhat exaggerated. There was no complaining or notifications from ZFS about a drive having issues, much less two of them. The only way I discovered it was having mysterious reboots and watching the startup messages.

Two issues here. First, to get error reporting from ZFS, you need to actually scrub your file systems. That is not automatic. I scrub them twice a day, but then my machine is pretty idle.

The other issue is a fundamental one, which is really hard to deal with when using a commodity OS (FreeBSD) on commodity hardware (x86 servers with SATA). In order for a RAID layer (such as ZFS) to be able to deal with drive failures, it needs to be able to find out that drives have failed, and still be alive afterwards to be able to react to the drive failure. After all, ZFS is just a piece of software running on the CPU in the OS, and if either the whole machine reboots or the OS crashes on each drive failure, there is nothing ZFS can do. Unfortunately, with motherboard-based SATA controllers, and with the drivers for them found in commodity OSes, many things that are really just disk drive errors, or problems with the communication interface between the drive and the "computer", turn into reboots and crashes. One thing which really helps there is to put an industrial-strength HBA between the drives (and their interfaces) and the computer. I've been pretty happy with the LSI SAS controllers, but they're pretty expensive for a small system. And even they can be stumped by sufficiently badly behaving drives; I've seen situations where certain drives on the SAS network cause a 2-hour delay when booting (which in most cases de-facto prevents the machine from booting, because the human operator loses patience).

It's hard. If you want to do RAID "right", meaning survive nearly any sufficiently misbehaving hardware, you have to tune and configure hardware and OS layers significantly.

The serial number of the drive is also cloned using dd in that way if I'm not mistaken. Haven't tested recently, but I remember several years ago being very surprised by the serial number of the replacement being the same as the original once cloned.
No, it can't be. If you get the serial number from a drive (on a SCSI drive with an INQUIRY command, on ATA drives with the equivalent command), it will match the serial number that's printed on the paper label of the drive. For example, try camcontrol identify ada0, and it will tell you things like the manufacturer, model, serial number, WWN (world wide name, the unique identifier of this drive), and so on. And if you look carefully, you'll find that the values given there actually match what is printed on the outside of the drive (not all vendors print both WWN and serial number on the paper label, but usually at least one is present). And clearly, dd is not capable of changing the values on the paper label.

Furthermore, ZFS knows more about the drive than just its content. For example, have you ever noticed that ZFS remembers what device name a disk was connected at? I've had a drive fail, I physically removed it from the computer, and afterwards zpool status showed the missing drive as "offline" (correctly), and told me that it was last seen at /dev/ada2.

I have no idea what mechanism ZFS actually uses to identify drives (never looked at the ZFS source code), but I bet it's a combination of data that it writes to the drive (which WOULD be copied by dd), hardware identifying information about the drive itself (such as Seagate model 1234 serial ABCD), and the location where it is attached as a device to the host.

Not all RAID systems work this way. Some software-RAID products completely ignore the hardware identification, and go entirely by what they find on the drive. There are lots of stories of RAID systems (typically made by companies with 2- and 3-letter names such as Evil Machine Company, Immense Bowel Movement and Hubris Pachyderm) being extremely picky about hardware identification, to the point that they refuse to touch a disk if they don't recognize the firmware version loaded onto the drive.

My trying to find ZFS expertise is why this thread exists. :)
I'm just a happy user of ZFS, so all I can offer is generalities about how RAID systems usually work (and a shoulder to cry on). Let's hope this discussion wakes up some ZFS developers or experts.

Ruler2112 said:
I realized that I had some adapters that will basically turn any SATA device into a huge USB flash drive. ...I plugged the original ad12 into it - the system gave it da0 just like it does any other flash drive, but ZFS did not recognize it as part of the pool.

Depending on what technique ZFS uses to identify drives, this MIGHT make sense. For example, many USB adapters report totally fake data for manufacturer/model/serial/revision/WWN information. ZFS might be looking for "Seagate model 1234 serial ABCD", get presented with "Sabrent model USBSATA serial 0", and give up. The problem is that USB has its own storage protocol, and these adapters have to tunnel the data through it.

My research has uncovered many methods of determining which file is assigned to a given block on a device, but none for ZFS ...

People who implement RAID or file systems hate doing this kind of stuff (namely answering the question "where is file X stored right now"). Because users who employ these tricks usually end up shooting themselves in the foot, and then blame the file system. Just imagine the atomicity and correctness problems here. You might ask the question, and the answer is rather complicated (for example: one copy is on disk A sector X, another copy is on disk B sector Y, but that second copy is known to be stale or needs to be checksum-verified, and there is also a snapshot from last Thursday on disk C sector Z, plus there is an offline disk D in play). Furthermore, what you think of as a single file or track or block or strip might be split apart in many places (striping!) or merged together with other things (clustering!), and explaining this requires understanding most of the internals of the file system in question. And the locking required is yet another problem for atomicity: the answer you get is not guaranteed to stay correct, even for as long as it takes you to read it. Do you think it would help you if the answer were "file X sector 1 was last seen on disk A sector X, but we're in the process of resilvering it to disk B and haven't decided where we're going to put it, and there is a cached copy on SSD Z, and I have the feeling in my joints that I'll move it tonight to defragment that corner of the disk, which has been collecting spiderwebs recently"?

But the opposite question actually does work: if you have an error on the disk, you can find out where the problem is. If you do zpool status -v, it will actually tell you for each known problem which file it is in (and unfortunately, it will sometimes tell you that the problem is in some vitally important but anonymous piece of management information, like inside an inode or the root directory, in which case you know that you're screwed, but you don't know which way). I've seen zpool status -v work and tell me the file name of a file that had errors in it.

Unfortunately, with your unreliable disk, and with read errors that cause major damage (at least they cause the resilver to restart, which seems like a terrible mis-feature to me), the file system may not stabilize for long enough for you to be able to do zpool status -v.

Further research revealed that modern drives will automatically detect and relocate bad blocks. However, they do so only when the sector is written to, not when being read. Discovering this, I had an idea that might force the drive to detect and work around the bad sectors without destroying everything on the drive. I booted the system from a Linux LiveCD with only the 2 failing drives connected and ran badblocks -svn on each drive. This reads each sector on the drive and writes the same data back to it. (By default it does every single sector on the entire drive, but you can specify ending and starting points; this becomes important shortly.) It obviously takes an excessively long time, but should IMO allow the drive to detect the bad sectors and relocate them at a firmware level. Of course, this approach assumes that ZFS doesn't think it's smarter than the drive and bypass its firmware somehow, disregarding what the drive sees as bad sectors. Interestingly, by using this tool I was able to determine that if I access sectors between 717239661 and 717500000 on ad12, the drive starts making a very annoying high pitched whine and resets itself, disconnecting from the SATA bus until power cycled. All the sectors before that point don't cause this and sectors after don't either. I know from earlier experience that the drive disconnecting itself also makes the resilver of the new ad12 start over again. Is there a way to force the drive to not even try accessing certain blocks so the resilver won't die when it wants to read them and instead use the parity on the other drives to continue?
Disk drives have been doing revectoring on writes for several decades now (it was in SCSI manuals in the early 80s). From this viewpoint, your basic idea is good. BUT: What you're doing "manually" (using Linux badblocks), ZFS should be able to do just as well. In theory, ZFS should be able to issue the same read command to the disk that badblocks does, and when it gets an error, it should go to one of its redundant disks.

The real issue is not that badblocks or ZFS are smart or stupid. The real issue is that your disk is totally brain-damaged: on any access to sectors between 7xxx and 7yyy, it turns into a brick. Instead, it should return a clean read error.

Do I know a way to make ZFS not access a certain range of sectors? No, I said above that I'm not a ZFS developer. But here is a suggestion. If we think that ZFS is willing to accept a copy of a drive (made using dd for example) without glitching, you could obtain a blank drive (preferably with the same GPT or MBR partition map, ideally of the same manufacturer and model number), and then copy all the data to it, except that one range from 7xxx to 7yyy which is unreadable. For the sectors in that range, replace the data on the copy with a test pattern, which is guaranteed to fail ZFS checksum validation (for example, fill those sectors with 0xDEADBEEF). Then try to feed that copy of the drive to ZFS. That *might* give you the best of all possible worlds: For the parts that are readable, ZFS will do the right thing, and for the parts that used to be unusable, it should get a nice clean checksum validation error, and use the redundant copies.
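
In dd terms, the idea might look like this (device nodes are placeholders; the sector numbers are the suspect span you reported; plain zeros are enough, since any wrong bytes will fail checksum validation):

  # 1. Copy everything before the bad span (bs=512 keeps the arithmetic in
  #    sectors; larger blocks with adjusted counts would be much faster):
  dd if=/dev/ad12 of=/dev/ad14 bs=512 count=717239661
  # 2. Fill the unreadable span on the copy with data that is guaranteed to
  #    fail ZFS checksum validation (717500000 - 717239661 = 260339 sectors):
  dd if=/dev/zero of=/dev/ad14 bs=512 seek=717239661 count=260339
  # 3. Copy everything after the bad span:
  dd if=/dev/ad12 of=/dev/ad14 bs=512 skip=717500000 seek=717500000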

All this is going to be a lot of work, and time-consuming. I'm sorry to not be able to be more specific about ZFS, but I just don't know its internals, I'm just guessing how it ought to work.

The next step is to create a script to fire from cron that will run SMART tests and e-mail me the results nightly.
Excellent plan. That had always been my intention too. Except: With my host (Jetway NF99FL motherboard with built-in SATA ports), OS (FreeBSD 9.0, I know it needs to be upgraded) and disks (Seagate 1TB Barracuda, about 4.5 years old), the attempt to run smartd or smartctl causes my kernel to hang. So I never got around to using SMART, until one of the two Seagate drives croaked. I'm hoping to find some time soon to upgrade FreeBSD, and maybe then I'll enter the era of enlightenment, and start using SMART on my home server.

And please understand that SMART is not a panacea. Certainly, lots of disks that get SMART warnings about "impending failure" go down the drain soon thereafter. But there is published data (paper from a few Google people, published in a FAST conference) that says that roughly half the disks that fail never got a SMART warning before failing. And about half the disks that got a SMART warning continued to function for years thereafter. Still, any warning (even if unreliable) is better than complete ignorance.

Because of this unreliability of SMART and such technologies, and because single-fault-tolerant RAID (meaning RAID that is affordable for consumer's home machines) is too easily defeated by today's high-capacity disk drives, I instead use lots of ZFS scrubbing, and backups of all the data to separate drives.
 
wblock@ said:
Regrettably, probably none of this helps you now. In the future, a RAIDZ2 layout gives more redundancy. I'm becoming convinced that any drive with a terabyte or more of disk space should be in at least a mirror. The bigger drives just aren't very trustworthy.

About a year ago, a high-up at NetApp (their CTO or something like that) said something to the effect that with modern disk drives, selling products that are only single-fault tolerant amounts to malpractice. For enterprise data, and customers to whom their data matters (meaning they're willing to spend the money to protect it adequately), I whole-heartedly agree. All the stuff in my office is 2- or 3-fault tolerant, plus snapshots, off-site mirroring, and backups to tape.

Unfortunately, for home users it's often already a stretch to go to any redundancy. On the other hand, the damage is also much smaller. If all the cute baby pictures of my kid are forever lost because of a disk drive failure, that's not the end of the world (whereas if something like the Social Security Administration lost all the data about retirement pensions, it would be pretty ugly). Matter-of-fact, all the baby pictures of my sisters and me were actually lost when my parents' house had its basement filled with water (in the 100-year flood of the Rhine river, a decade or two ago). Compared to the risks we take with physical pictures in photo albums stored in cardboard boxes in a basement, anything that involves a computer and a few simple choices about backup is already a vast improvement.
 
Preventing the drive from accessing certain sectors would have to be done on the drive firmware level. Don't know if it's possible using something like SMART or not...

I implemented the first part of the above plan - connected ad8 and ad10 and began copying files using cp -v -R -p /home/Images /mnt/usb, directing both stdout and stderr to a file. I figured that I could grep the file for 'FAILURE' and the file listed directly above that line would be the one with the problem; I started the copy and went to bed. Unfortunately, I discovered that the files being copied are only output when it gets done with an entire directory, so it wouldn't work to locate the bad files. Plus, this morning there were several READ_DMA48 errors listed on the screen indicating that it'd encountered the bad sector on the drive, but grepping the file for the error code returned no results. It appears that the errors are not output to stdout or stderr but to the console.

I have to leave now, but my intention for later is to write a short perl script that will copy each file individually and examine either /var/log/console or /var/log/messages between each one. If it finds a read error, the script will record such, erase the target file to avoid having a broken file, and move on to the next one. Before I go through the effort of writing such a script, is there a simpler way of copying a given directory recursively and detecting read errors? Maybe something already written? (If I do end up writing it, I'll share on these forums.)
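
Here's the rough shape of what I have in mind as a plain sh sketch (maybe simpler than perl), leaning on cp(1) returning non-zero on a read error rather than watching the logs - though whether that catches everything is exactly my question:

  # Copy file-by-file, recording failures and discarding partial copies.
  cd /home/Images || exit 1
  find . -type f | while read -r f; do
      mkdir -p "/mnt/usb/$(dirname "$f")"
      if ! cp -p "$f" "/mnt/usb/$f"; then
          echo "$f" >> /tmp/failed-files.txt
          rm -f "/mnt/usb/$f"    # don't leave a broken file behind
      fi
  done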
 
For those interested, I wrote a script to extract the data and detect when the drive hits bad sectors. I was able to do manually what the RAID should've done automatically and managed to extract everything. :)
 