ZFS Optimization

Hello, we have a ZFS server with about 24 drives. We are burning through drives like crazy: a drive failure almost every month, and it's not that active a server. I am trying to figure out if it's just a bad batch of drives or a misconfiguration of ZFS by the previous sysadmin.

Has anyone encountered a similar situation? Can anyone post some ideas of what to look for as far as optimization and performance tuning of ZFS on FreeBSD is concerned?
 
Are the disks actually dying, or is ZFS simply reporting an offline or faulted drive that often? If the drives are physically wearing out and are all of the same brand and came from the same place, I'd suspect a bad batch. The idea that ZFS (or any filesystem) could in some way physically destroy drives is very far-fetched. It would require insanely I/O-intensive operations that were physically taxing on the drives, and you'd notice that long before the drives actually died.

If you're saying the pool is complaining about disks dropping from active pools, but the disks appear to be alright, that's something else entirely.
 
ZFS seems to be frequently reporting the drives as unavailable, removed, or faulted. Checking the drives, they seem to be working, but I don't want to put them back into the pools.
 
Use sysutils/smartmontools to check whether the disks really are bad or not. Also check the temperature inside the machine itself, especially the disks. Cheaper commercial disks really don't like heat.
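
A quick starting point, assuming the drives show up as adaN on your system (the ada0 below is only a placeholder, adjust the device names):

# Overall health verdict and current temperature for one drive
smartctl -H /dev/ada0
smartctl -A /dev/ada0 | grep -i temperature

# Run a long self-test and read back the result once it finishes
smartctl -t long /dev/ada0
smartctl -l selftest /dev/ada0

If the long self-test completes without errors on a drive ZFS kicked out, that points more toward cabling, backplane, or HBA trouble than a dead disk.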
 
Which HBA is in use? Have you already checked the cabling and backplane?
Also, which drives are they? Consumer drives running in high-vibration environments are known to develop latency spikes of up to several seconds, causing them to drop out of RAID arrays or ZFS pools.
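
If any of the drives do turn out to be desktop-class models, it's also worth checking whether they support error recovery control (TLER/ERC), which caps how long a drive retries a bad sector before giving up. A rough check with smartctl (device name is a placeholder):

# Show the current SCT error recovery control timeouts, if supported
smartctl -l scterc /dev/ada0

# Optionally cap read/write recovery at 7 seconds (70 deciseconds) on drives that allow it
smartctl -l scterc,70,70 /dev/ada0

Drives without ERC can stall for tens of seconds on a marginal sector, which is exactly the kind of latency spike that makes a controller or pool mark them as failed.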
 
Many thanks. I will try that. They are Seagate enterprise drives. They're supposed to be good-quality products, but I'm not so sure anymore.
 
Even good-quality drives can get "heat-stroke". Especially in cabinets that house a lot of disks, heat can build up inside if they aren't cooled properly, and this will severely diminish a drive's expected life-span. The disks don't need to be "chilled", but the excess heat does need to be removed so it doesn't build up inside.
 
Check /var/log/messages for any error messages from the drives that are being reported as bad.

Check dmesg output as well.
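
Something like this will pull out the usual suspects; the patterns are only examples, so adjust them to whatever device names and messages you actually see:

# CAM/disk errors logged by the kernel
grep -iE 'cam status|timeout|uncorrectable' /var/log/messages

# Recent kernel messages about retries, resets, or devices detaching
dmesg | grep -iE 'error|timeout|retry|detached'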

Check smartctl -a output when run against the "bad" drive. Pay attention to the Reallocated_Sector_Ct, Temperature_Celsius, and Current_Pending_Sector counts. If the two sector counts are high and/or increasing over time, the drive is dying. If the temp is too high, it will lead to premature drive death.
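
For example (the ada3 device name is just a placeholder for whichever drive got kicked out):

# Show only the attributes mentioned above
smartctl -a /dev/ada3 | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Temperature_Celsius'

Re-run it every few days and keep the numbers; a pending-sector count that keeps climbing is a much stronger signal than a single snapshot.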

See if a simple "zpool online" of the "bad" drive brings it back online and kicks off a resilver. If it does, then it may not be a bad drive but a bad cable, a bad connection to the HBA, or an overloaded HBA dropping drives.
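
Roughly like this, with the pool and device names as placeholders:

# Bring the dropped device back into the pool and watch what happens
zpool online tank ada3
zpool status -v tank

If the resilver finishes cleanly and the READ/WRITE/CKSUM counters stay at zero afterwards, the drive itself is probably fine.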

ZFS will drop a drive out of a pool for a variety of reasons. It's not always due to a dead hard drive.
 
Please be specific about the error messages and the HBA. These do not sound like drive failures; it sounds more like a problem with the HBA. Possibly there's a hardware RAID card sitting between ZFS and the drives. It might even be something as simple as SMART tests scheduled at the same time as scrubs, making the drives take too long to respond.
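
One way to rule out that scheduling overlap; the paths and variable names below assume the stock FreeBSD periodic scripts and the port's default smartd.conf location, so adjust if your setup differs:

# Are periodic ZFS scrubs enabled, and how often do they run?
grep -i scrub /etc/periodic.conf

# Does smartd schedule its own self-tests (the -s directive)?
grep -e '-s ' /usr/local/etc/smartd.conf

If both fire in the same window, stagger them so the drives aren't running a long self-test in the middle of a scrub.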
 