Hi all,
I have almost no experience with FreeBSD, but I'm now working on a new project that uses it in storage enclosures.
We have a Supermicro enclosure running FreeBSD 9.2, with 36 disks in 3 pools. The first pool has 12 disks, and some of them (7) are really old (the Power_On_Hours SMART attribute is near 39,000).
We are experiencing problems as disks begin to fail. The enclosure marks them as REMOVED. We replace the removed disk and a resilver starts. While resilvering, another disk usually fails. OK, no problem: our pool supports a two-disk failure without data loss, so we wait until the resilver finishes and then replace the second disk. Unfortunately, it is very common that a third disk gets marked as REMOVED, and then the resilver gets slower and slower and never finishes. Although I have read it's not a good idea, restarting the enclosure solves the issue: the disks are marked as ONLINE again and the resilver continues and finishes properly.
All the data in the pool is backed up twice on external enclosures, so we are not worried about the data; it doesn't matter if the pool is destroyed. But the thing is... why is the system marking some disks as REMOVED if they come back ONLINE after rebooting the enclosure?
I have read in some forums that, in order to "predict" a disk failure, people check the smartctl attributes with IDs 5, 197, 198 and 199: if their raw values are above 0, the disk may fail. The funny thing is that there are old disks with values greater than 500 that the enclosure does not remove, and others with values equal to 0 that the system marks as REMOVED. And it's not always the same disks.
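For reference, this is roughly how I am collecting those attribute values now (a minimal sketch in Python; the da0..da35 device names are an assumption for our setup, adjust to yours):

```python
#!/usr/bin/env python
# Minimal sketch: collect the raw values of SMART attributes 5, 197, 198
# and 199 for every disk. The da0..da35 device names are an assumption
# for our Supermicro setup; adjust to match yours.
import re
import subprocess

WATCHED_IDS = (5, 197, 198, 199)  # reallocated, pending, uncorrectable, CRC

def raw_smart_values(device):
    """Return {attribute_id: raw_value} for the attributes we watch."""
    # smartctl encodes health warnings in its exit status, so a nonzero
    # exit is not necessarily fatal; just parse whatever it printed.
    proc = subprocess.Popen(["smartctl", "-A", device],
                            stdout=subprocess.PIPE, universal_newlines=True)
    out, _ = proc.communicate()
    values = {}
    for line in out.splitlines():
        # Attribute rows look like:
        #   5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
        m = re.match(r"\s*(\d+)\s+\S+.*\s(\d+)\s*$", line)
        if m and int(m.group(1)) in WATCHED_IDS:
            values[int(m.group(1))] = int(m.group(2))
    return values

if __name__ == "__main__":
    for n in range(36):
        dev = "/dev/da%d" % n
        print("%s %s" % (dev, raw_smart_values(dev)))
```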
We assume the data is as good as lost, but I would like to go a step further and understand how FreeBSD decides that a disk is faulty, and whether there is any way to predict which disks are closest to failing. We have a lot of enclosures with similar hardware, and I would like to anticipate disk failures on all of them.
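On the prediction side, what I plan to try is tracking whether those counters grow between runs, since (as far as I understand) a counter that keeps increasing is a stronger failure hint than any absolute value. A minimal sketch, reusing raw_smart_values() from the previous snippet; the state-file path is an assumption:

```python
#!/usr/bin/env python
# Minimal sketch: flag disks whose watched SMART counters grew since the
# last run. raw_smart_values() comes from the previous snippet; the
# /var/db/smart_state.json path is an assumption.
import json
import os

STATE = "/var/db/smart_state.json"

def check_trends(devices):
    old = {}
    if os.path.exists(STATE):
        with open(STATE) as f:
            old = json.load(f)
    new = {}
    for dev in devices:
        new[dev] = raw_smart_values(dev)
        for attr, value in new[dev].items():
            # JSON turns integer keys into strings, hence str(attr).
            before = old.get(dev, {}).get(str(attr), 0)
            if value > before:
                print("%s: attribute %d grew from %d to %d"
                      % (dev, attr, before, value))
    with open(STATE, "w") as f:
        json.dump(new, f)

check_trends(["/dev/da%d" % n for n in range(36)])
```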
Please, could you help me in any way?
Thank you very much in advance.