ZFS Predictive Failure Hard Disks

Hi all,

I have almost no experience with FreeBSD, but I'm now working on a new project that uses it in storage enclosures.

We have a Supermicro enclosure with FreeBSD 9.2, 36 disks and 3 pools. The first pool has 12 disks; some of them (7) are really old (the Power_On_Hours parameter is near 39,000).

We are experiencing problems as disks are beginning to fail. The enclosure marks them as REMOVED. We replace the removed disk and the resilver starts. While resilvering, another disk usually fails. OK, no problem, our pool supports a two-disk failure without data loss. Let's wait until the resilver finishes and we will replace the second disk. Unfortunately, it is quite common that a third disk gets marked as removed, and then the resilver gets slower and slower and never finishes. Although I have read it's not a good idea, restarting the enclosure solves the issue. The disks are marked as ONLINE again and the resilver continues and finishes properly.

All the data in the pool is double backed up on external enclosures, so we are not worried about the data. It doesn't matter if the pool is destroyed. But the thing is... why is the system marking some disks as REMOVED if they come ONLINE again when the enclosure is rebooted?

I have read in some forums that, in order to "predict" a disk failure, people usually check the smartctl attributes with IDs 5, 197, 198 and 199. If they are above 0, the disk may fail. The funny thing is that there are old disks with values greater than 500 that the enclosure does not remove, and another one with values equal to 0 that the system marks as REMOVED. And it is not always the same disks.

We assume the data is almost lost, but I would like to go a step further and be able to understand how FreeBSD decides that a disk is faulty, and whether there is any way to predict which disks are nearest to failing. We have a lot of enclosures with similar hardware and I would like to anticipate disk failures in all systems.

Please, could you help me in any way?

Thank you very much in advance.
 
Hi SirDice,

Thanks for the response. I see it's really out of support (5 years!). We should update the enclosures or migrate the data.

Anyway, would it be possible to know how newer (or supported) versions of FreeBSD decide whether a disk is faulty? It's quite strange that sometimes the OS marks a disk as faulty, and when the storage is restarted, other disks get marked as removed.

Sorry for the inconvenience.

Thank you very much and best regards.
 
It's possible the disks are old and have started to act up. Another option is the controller, or in your case the port expander that's likely in your enclosure. If the port expander starts acting up, random disks can go into failure mode, causing all sorts of havoc.

Don't worry too much about the power_on_hours; if you have enterprise-class disks it's more than common to see them run without issue for 5 or more years.
 
Maybe the issue is related to the port expander, as you propose.

I would ask you not to close the ticket for the moment, just in case another user can provide us with some extra information. If there is no answer within a few days, please feel free to close it, if the forum rules say that unanswered issues should be closed after a while.

Thanks again and best regards,
 
This isn't a "ticket", it's a forum post. And we generally don't close threads unless they spiral out of control.
 
To begin with, your FreeBSD version urgently needs upgrading, just as SirDice said. But that makes no difference for the following, because I'll talk about disks and enclosures and errors, not the OS or the file system.

The first pool has 12 disks; some of them (7) are really old (the Power_On_Hours parameter is near 39,000).
That's only 4-1/2 years. For good quality enterprise disks which are run in a sensible environment (not too hot, not too cold, not too much vibration), 5 to 7 years is a perfectly sensible time to use disks. At that point, they are economically no longer viable: It becomes much cheaper to replace them with new ones, instead of paying electricity/cooling/maintenance/hassle to continue using the old ones.

Remember, the MTBF of enterprise disks is quoted by the manufacturer as typically 10^6 hours or more. So at an age of ~50K hours, you should have no more than 1/20th of the disks failing.
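
As a rough back-of-the-envelope (assuming a constant failure rate, which is an oversimplification): 50,000 h / 1,000,000 h MTBF ≈ 5%, so out of 12 disks you would statistically expect roughly 0.6 failures over that entire period, not two or three during a single resilver.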

We are experiencing problems as disks are beginning to fail. The enclosure marks them as REMOVED.
These are SATA disks. Are they connected via SATA or SAS? How does the enclosure sense that they are removed? I'm not familiar with SuperMicro's enclosures; typically the removed state is triggered by a contact on the power connector, but some enclosures do that when the disk fails to speak the intended protocol (SATA or SAS) to the expander in the enclosure. Maybe the enclosure sets a disk to "removed" when there is any IO error?
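
If it helps, you can see how CAM views the disks and the enclosure with something like this (device names on your system will of course differ):

camcontrol devlist       # lists the disks and any ses/enclosure devices CAM sees
camcontrol devlist -v    # same, but also shows which controller/bus each device hangs off

If a ses (enclosure services) device shows up in that list, there is an expander and its firmware sitting between the HBA and the disks, and that is one more component that can decide a disk has been "removed".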

Also, when you say "enclosure", do you perhaps mean that you also have a RAID controller in the chain somewhere? Could it be that the "removed" is really a function of the controller becoming unhappy? That can happen even if the RAID controller is configured to not actually do RAID, but to present each drive as a single logical volume.

While resilvering, another disk usually fails.
That is unexpected. I would fully expect that a second disk has a single IO error (look at the uncorrectable error rates for modern disks), but complete failure is not good, nor logically explainable. Could it be that your disks are not just old, but also very sick? Maybe your disks are not long-lived enterprise grade models? Maybe they are from a faulty manufacturing batch?

Although I have read it's not a good idea, restarting the enclosure solves the issue. The disks are marked as ONLINE again and the resilver continues and finishes properly.
Ah, so the disks are not really removed, but the enclosure is confused, and restarting it de-confuses it. This is actually good, because it allows the file system to finish resilvering.

why is the system marking some disks as REMOVED if they come ONLINE again when the enclosure is rebooted?
That's not a FreeBSD question, but a question for the people who make your enclosure (and perhaps RAID controller). How exactly do you find out that it is removed? That's not something that FreeBSD or ZFS typically do.

I have read in some forums that, in order to "predict" a disk failure, people usually check the smartctl attributes with IDs 5, 197, 198 and 199. If they are above 0, the disk may fail. The funny thing is that there are old disks with values greater than 500 that the enclosure does not remove, and another one with values equal to 0 that the system marks as REMOVED. And it is not always the same disks.
It's complicated. There is no 100% accurate failure predictor. My joke (pretty close to the truth) is that the best predictor (which you sort of described above, except that the correct threshold is not 0, but growth in the numbers) is only 50% accurate: If the predictor says that the disk will fail, it will be wrong 50% of the time, and the disk will actually survive for a long time. On the other hand, if the disk actually does die, then 50% of the time the predictor has given no warning. On SAS disks, this is a lot easier, because they have predictive failure built in, and you can explicitly ask the individual disk whether it thinks it will fail or not. The advantage of doing this within the disk is: you can call the disk vendor and get a warranty claim!
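
If you want to look at those attributes yourself, something like this works (da0 is just an example device; SATA disks behind a SAS HBA may need an extra '-d sat'):

smartctl -H /dev/da0    # the drive's own overall health self-assessment
smartctl -A /dev/da0    # the attribute table, including IDs 5, 197, 198 and 199

What matters most is whether those counters keep growing from one check to the next, not whether they happen to be exactly zero.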

... understand how FreeBSD decides that a disk is faulty, and whether there is any way to predict which disks are nearest to failing.
I don't think FreeBSD does this. Not even within ZFS. It handles individual IO errors, and it handles disks that completely vanish; it doesn't try to predict whether disks will fail. You can run smartctl or smartd and do it yourself though.
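
A minimal sketch, if you want to automate it (this assumes sysutils/smartmontools from ports; the device name and mail address are placeholders):

# /usr/local/etc/smartd.conf
/dev/da0 -a -d sat -m admin@example.com

# enable and start the daemon
sysrc smartd_enable=YES
service smartd start

smartd will then poll the disks periodically, log attribute changes, and mail you when the health status, error log or self-tests indicate trouble.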

... I would like to anticipate disk failures in all systems.
You and everyone else! This is extremely difficult, and I know of no good solution that is freely available. If you look at the published research literature, you find that predicting disk failure is really hard (look for papers by Eduardo Pinheiro and Bianca Schroeder). There are better solutions in commercial products (look for disk hospital), but those are not available either on FreeBSD or for free.

Please, could you help me in any way?
Only with a bit of advice for the long term: Buy newer disks, and use lots of RAID and replication for safety. But the current "removed" state you are getting is probably not real, and fixing that is the best short-term advice.
 
Hi ralphbsz,

First of all, thank you very much for your long and very well explained response. It has clarified a lot of things for me but, unfortunately, I have realized that I have no idea about a lot of things :oops:. Never mind, I can learn (hopefully soon :rolleyes:).

The disks that are failing are SATA, and they are connected via SATA. There is a controller in the enclosure, but it's not used to make a RAID (software RAID is being used).

It has come to our attention that the disks that are getting removed from the enclosure are all Seagate Barracudas, model ST3000DM001-9YN166.

They all have a non-zero ATA error count (around 30-50 errors each) and they all show the following error: Error: UNC at LBA = 0x0fffffff = 268435455. Some of them also have non-zero smartctl ID 5, 197, 198 and 199 values. If I have understood properly, these errors are not as decisive as we supposed, but obviously these disks are not as healthy as we would like.

I'm going to investigate the Supermicro controller the enclosure uses. As you commented, it's quite strange that disks fail in such a random way. For example, a couple of weeks ago da1 and da5 were removed from the system and we thought they were damaged. We restarted the enclosure, and now both disks are working properly and da9 has failed (da9 had neither ATA errors nor ID 5, 197, 198 and 199 errors). I bet that if we restart the enclosure again, da9 will be online again and another, different Seagate will fail. Perhaps the controller becomes unhappy about a disk and informs the OS that it has to be removed, or FreeBSD decides, with the information the controller provides, that the disk has to be removed. I'm not really sure how it works. I opened this post because I thought it was a FreeBSD matter, not hardware. If I find anything I will update the thread to help others.

The pool affected by this issue has 7 disks from the same manufacturer, more or less the same age, with some ATA errors and IDs 5 and 197 above 0. We will replace those disks with WD Red disks, one by one, crossing our fingers that we don't get more than 2 disks failing at the same time. If that happens, I hope that restarting the server will let the resilver continue if it hangs. If not, we will assume the data is lost and recover from backup. Once this part is finished, we will upgrade FreeBSD as you both have proposed.
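
For reference, I expect each swap to look roughly like this (pool and device names are just placeholders for whatever ours turn out to be):

zpool offline tank da9    # take the suspect disk out of service
# physically swap the disk in the same bay
zpool replace tank da9    # resilver onto the new disk in that slot
zpool status tank         # watch the resilver progress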

Thanks and best regards.
 
Hi again,

In the end, we decided to replace all the suspicious disks, so we deleted the pool, replaced all the disks and then recovered from backup. There was one disk (da2) that, despite being new, didn't seem to work properly (a replace process executed on it didn't finish in nearly 2 weeks), so we decided to replace it as well.

We used rsync to copy the information from our backup source to the destination (one rsync for each folder) and everything seemed to be working properly. But when we tried to run an additional rsync simultaneously, the da2 disk was immediately removed from the system (it didn't even appear in /dev/; it was completely ejected from the system). It was quite strange, as da2 was completely new. As usual, a restart solved the issue. We didn't take any risks and ran the rsyncs again, but not simultaneously. All the rsyncs finished OK in a couple of days.
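
In case it helps anyone, each rsync was essentially a plain archive copy, something like this (the paths are placeholders, not our real ones):

rsync -a /backup/enclosure1/folder1/ /pool1/folder1/    # -a = archive mode (preserves permissions, times, symlinks, etc.)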

Today we are running md5deep to verify that all the copies were performed properly (a double check after rsync), and da2 is showing the following errors:

[Attached: console output showing CAM READ(10) (CDB 0x28) errors on da2 at six different LBAs, with sense data ASC/ASCQ 11/0 (unrecovered read error).]

We have investigated on other forums and everything points to a hard disk failure, but we don't think this is the problem, as we have been working with 2 different new disks and it is always the da2 bay that fails.

Based on the error above, could we rule out any FreeBSD error and be sure it is a problem with the controller or perhaps with the backplane? It looks like the error log confirms a hardware error and not an OS problem.

Note: disk da2 has not been removed from the pool yet.

Thank you very much in advance.
 
Nope, the error you are seeing comes directly from the hard drive. A controller or backplane should not create a medium error; this is a problem with the platter or the head. That disk had at least one error. It may have had more; smartctl may be able to diagnose that.
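
For example (assuming the disk is still da2):

smartctl -l error /dev/da2    # the drive's internal error log
smartctl -a /dev/da2          # everything: health, attributes, error and self-test logs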
 
Nope, the error you are seeing comes directly from the hard drive. A controller or backplane should not create a medium error; this is a problem with the platter or the head. That disk had at least one error. It may have had more; smartctl may be able to diagnose that.

Allow me to have a different opinion. I have seen this kind of error, and although they are commonly considered disk hardware/surface errors, and any professional maintainer would immediately replace the disk, I have found that many of them are not actually surface errors. My impression is that they can be anything: controller flaws, cabling flaws, power supply flaws, CMOS setting flaws, OS timing setting flaws. The final result is that garbage appears at the CAM layer, but there is still a bunch of firmware below.

Admittedly, I am running hardware that other people would put in a museum ;), so I see many things one should never see on a properly working commercial-grade system. But this story sounds somehow foul to me. The proper thing to do would be to take the new disks, attach them natively to a reliable Unix machine and run a burn-in with serious load over a few days; then you can say with high certainty that a disk is good.

The CDB 0x28 shows six different LBAs. No idea what interim translation the controller might do, but that doesn't look like a specific surface error; rather, it looks as if the read process as such got into a bad mood, or the disk got bored (for whatever environmental reason).

Another fancy thing is: why does it do a READ(10) at all? If these disks are ST3000DM001, they should be 3 TB, and READ(10) can only address up to 2 TB.
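
(The arithmetic behind that: READ(10), CDB opcode 0x28, has a 4-byte LBA field, so it can address 2^32 blocks; at 512 bytes per block that is 2 TiB. Anything beyond the first 2 TiB of a 3 TB drive needs READ(16).)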
 
You are right, but fortunately only rarely. Clearly, strongly vibrating disk mounts can create IO errors, as can faulty power supplies. Those are actually real disk/platter IO errors, although the disk is not at fault. The other thing that really annoys me is the behavior of HBAs when using T10 DIF (the checksum protocol for SCSI disks) and reporting errors: if they find a checksum error and have no sensible way to report it to the host (because the SCSI device driver has no way to encode these errors), they will create fake sense data that pretends a real IO error happened. But all these scenarios are rare.

In the other department of insane things: a few colleagues and I were working on an early prototype 6 Gb/s SAS setup that was using T10 DIF disks. We had one big SAS cable that would reliably create checksum errors, which were then misreported as IO errors. But most of the IO worked just fine (99.99% of the IOs had no problem). And the problem was really caused just by the cable; after replacing it, everything worked fine.

But in this case, it seems that the ASC/ASCQ of 11/0 very likely comes directly from the disk drive.

About the READ(10): Perhaps this IO was in the first couple TB of the disk, and the low-level CAM driver uses the shortest IO possible?
 
We have reviewed the disk logs and ralphbsz is right. The disk is completely dead. It has thousands of errors it didn't have when we replaced it. What I don't know is why the system has not ejected it like before. This issue is a pain 😞

Well, I think it's time to contact the enclosure manufacturer, as it's already clear that something is not working properly in this hardware. Perhaps the backplane, the cables or the controller; I'm not really sure. As it is always the same disk bay that is failing (the other disks in the pool are working fine), I would perhaps bet on a bad backplane.

I'll be back in a couple of weeks and will keep updating the thread. Even though it doesn't seem to be a FreeBSD matter, perhaps this could help other beginners.

Thanks to both for your help!
 
Hi everyone,

I'm only replying to this post to let anyone it may affect know what the issue we were experiencing was. In the end, it was an enclosure failure (hardware), nothing related to the OS.
In our case the problem came from a defective backplane (the front-disk backplane). We first thought it could be the controller, but after replacing it the issue happened again.
Once the backplane had been replaced, we have experienced no issues in the system. Even the disk that was marked as defective before powering off the enclosure appeared healthy again after the backplane replacement.

Thank you very much everyone for your help!
 
All the data in the pool is double backed up on external enclosures, so we are not worried about the data. It doesn't matter if the pool is destroyed. But the thing is... why is the system marking some disks as REMOVED if they come ONLINE again when the enclosure is rebooted?

...

We assume the data is almost lost, but I would like to go a step further and be able to understand how FreeBSD decides that a disk is faulty, and whether there is any way to predict which disks are nearest to failing. We have a lot of enclosures with similar hardware and I would like to anticipate disk failures in all systems.

So, the first thing you should have in your tech room is a separate platform (ideally an IBM or Dell xSeries server) on which you do only one thing: DOS/*nix testing of SATA/SAS drives.
Of course, connect it to a separate rack-mount online UPS such as a Liebert or GE with good batteries; 3 to 5 kVA would be a good size.
 
You are right, but fortunately only rarely. Clearly, strongly vibrating disk mounts can create IO errors, as can faulty power supplies. Those are actually real disk/platter IO errors, although the disk is not at fault. The other thing that really annoys me is the behavior of HBAs when using T10 DIF (the checksum protocol for SCSI disks) and reporting errors: if they find a checksum error and have no sensible way to report it to the host (because the SCSI device driver has no way to encode these errors), they will create fake sense data that pretends a real IO error happened. But all these scenarios are rare.
Absolutely agree with you, because I have the same 27+ years of experience with hardware. :)

Just to add to the above, for the topic starter:

Read the hardware maintenance documentation from the established brands like IBM, Sun and Dell carefully. Everything I write here, and much more, is already in those docs. :)

Even if all your servers are in a DC with environmental and power monitoring and control, once a year you need to pull each server out for full maintenance: clean out the dust, check and clean all contacts and cables, apply an electrical lubricant (LiquiMoly, for example, has a very good ELECTRONIC SPRAY) to the connector pins, refresh the CPU thermal grease, check the PSU output with a Fluke meter, and run the full internal server tests.

If your systems are new, update all firmware periodically, every half a year.
 
Don't worry too much about the power_on_hours; if you have enterprise-class disks it's more than common to see them run without issue for 5 or more years.

sagittarius:~# hd -e c2t4
...
9 Power on hours count 0x32 1 1 77602
...

Yep :)

77602 = 3233 days, or almost 9 years...

This is an old Sun Thumper (X4500) filled with 2TB Western Digital WD2003FYYS SATA drives. That particular model has been insanely good for us (we have around 200 of them in use, and just a handful have died on us). We're running Solaris 10 & OmniOS on those servers, though.

I tried to use FreeBSD on them a couple of years ago but it crashed - I think it had problems with the silly number of disk controllers and/or the number of PCI busses, or something, if I remember correctly.
 