nakal said:
When you have 1 TB drives, the chance of getting a drive with an initial bad sector is around 50% (for me here).
I still think you have drives damaged in shipping or a manufacturer with poor quality control.
My most recent experiences have been with WD2003FYYS drives, where I've received over 200 of them (both directly from Western Digital and from various distributors). None of those drives had any defects when installed (both a surface scan and SMART showed no problems).
Some years ago, a number of resellers were simply putting bare drives in antistatic bags into a box and then using some air pillow packing in the hope that the drive wouldn't move around. I've learned to not buy from those resellers (hence my buying drives in OEM 20-packs for the most part).
Even longer ago when I was using Seagate (this was old-logo Seagate) and had a dedicated on-site account rep, I reported some resellers to him for poor packaging. At that time, Seagate shipped the resellers knocked-down Seagate boxes and packing material for each drive, to be used when shipping drives to end users, and some resellers weren't using it for some reason. The resellers were eventually told to either use the Seagate packaging or else Seagate wouldn't sell drives to them.
You will definitely notice this if you use RAID on such a drive. On cheap RAID implementations it will look like a failure. Better controllers will try to get the replacement mechanisms to work on this.
I guess it depends on how you define RAID. I agree that there's a great deal of quality variation in RAID implementations, ranging from the excellent (3Ware), to middle-of-the-range (discount controllers) to poor (most BIOS PseudoRAID). When you're using a raidz* in ZFS, ZFS is only going to look at the areas of the disk that you have files on. I use 3Ware 9xxx-series controllers and export each drive as an individual volume to FreeBSD (as opposed to just exporting the raw drives). This gives me full control of the drives - I can have the controller do scheduled background verification without impacting FreeBSD, request specific scan operations, etc. The controller also operates the 3 drive bay LEDs on each drive to show activity / status / errors. And it supports SMART passthru so I can use sysutils/smartmontools to monitor each individual drive.
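For anyone who wants to script that kind of per-drive monitoring, here is a minimal sketch in Python. It is not what I actually run, just an illustration. It parses the attribute table printed by `smartctl -A` and reports the sector-related attributes with a non-zero raw value. The `-d 3ware,N` device type is a real smartmontools option. The `/dev/twa0` device node, the function names and the sample table are assumptions made up for the example.

```python
import subprocess

# Standard SMART attribute IDs that usually relate to bad sectors.
SECTOR_ATTRS = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} for the rows of a `smartctl -A` table."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values

def sector_problems(text):
    """Return the sector-related attributes whose raw value is above zero."""
    attrs = parse_smart_attributes(text)
    return {name: attrs[i] for i, name in SECTOR_ATTRS.items() if attrs.get(i, 0) > 0}

def read_drive(port, device="/dev/twa0"):
    """Query one drive behind a 3ware controller (device node is an assumption)."""
    # smartctl's exit status is a bitmask, so a non-zero status is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-A", "-d", f"3ware,{port}", device],
        capture_output=True, text=True,
    )
    return result.stdout

# Made-up sample output, for illustration only (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    print(sector_problems(SAMPLE))
```

On a live system you would pass `read_drive(port)` to `sector_problems()` for each controller port instead of using the sample text.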
I have several drives which had 1-2 faulty sectors on day 1. They have all worked for years. I only watch whether the drives accumulate errors at a constant rate. That is the time when you should panic and look for a replacement.
It may be that your manufacturer doesn't do a particularly thorough surface analysis when they build the drives, and it is a simple "bad spot" on the media and not a sign of physical damage. However, as I said in my earlier reply, without OEM-type analysis you can't tell what caused the bad sectors. You've had good experience where no additional errors appear.
In my case, when a drive develops an error (which SMART will usually report as an Offline Uncorrectable), if I tell the controller to put the drive back into service, within a few days the number of bad sectors will ramp up rapidly, from the initial 1 to something like 49.
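The difference between "1-2 sectors that never change" and "1 ramping up to 49" is easy to check automatically if you log the bad-sector count over time. A rough sketch in Python follows. The function name, the labels and the `growth_limit` threshold are hypothetical and should be tuned to your own tolerance. Only the 1 and 49 endpoints in the example come from my experience above; the intermediate readings are invented.

```python
def classify_trend(counts, growth_limit=5):
    """Classify a series of bad-sector totals, oldest reading first.

    growth_limit is a hypothetical threshold: how many new bad sectors
    over the logged period are tolerated before the drive gets replaced.
    """
    if len(counts) < 2:
        return "unknown"   # not enough history to judge
    growth = counts[-1] - counts[0]
    if growth <= 0:
        return "stable"    # initial defects only, nothing new
    if growth > growth_limit:
        return "replace"   # count is ramping up
    return "watch"         # a few new sectors, keep an eye on it

if __name__ == "__main__":
    print(classify_trend([1, 1, 1, 1]))    # drive with an old, unchanging defect
    print(classify_trend([1, 3, 12, 49]))  # drive put back in service and degrading
```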
I even have 2 drives where S.M.A.R.T. is broken and shows me 2047 (highest value) faulty sectors. The drives are perfectly fine! No read errors at all.
Out of curiosity, are these drives the same brand / model / firmware as other drives that don't have the problem, or are they different? There have been a number of drives from various manufacturers where there were problems with the SMART implementation. Those are usually "mostly harmless" and just report preposterous values. There was at least one type of drive where an inopportune SMART request would cause actual data corruption on the drive.
I simply don't buy drives built by companies that don't fix this sort of thing. And when I'm planning a major buy, I'll "taste" the drive model by buying first a single drive and testing heavily, and then a single 20-pack to build a complete array and do more testing.
Of course the manufacturers will replace your drive. It's more expensive to check it thoroughly than to send you a replacement.
Trust me, the manufacturers don't simply junk the drive you send back - particularly in this economic climate, and with the recent drive shortages.
I deal almost exclusively with WD these days, but procedures at other manufacturers will be similar.
Drives that are returned are processed on test equipment to determine if there is a fault, and if a fault is found, to determine if the problem is inside the sealed environment or is a logic board problem. At that point, the path diverges depending on whether or not repair needs to be done in a clean room.
Let me give you a sample of an incoming test (with identifying numbers removed):
Code:
Category Mode Submode
----------------- ----------------- ----------------- -----------------
Complete Unable to Process
Through FSPT/FTA
Drive HSA Poor On-Track Over Write Degraded
Write
PRELIMINARY LEVEL
DRIVE LEVEL
Failure Code Observations: Customer reported failure "WD2003FYYS Failing 20%+ with
RAID drop offs." You can also see detailed customer information on attachment for
ITR # xxxx
Relolist has 8 entry, 1 relocated, glist has 1 entry. Validate reserve cyl OK. SMART
OFFLINE failed on head 5. DST test passed. POH = 3980 hours 42 minutes, 1367 active
logs, 7 ECC errors reported but may not see by Host, no errors reported to Host. Head
5 has excessive ECC errors on all logs.
Found bad head 5, OW degraded by 8DB, drop below limit.
COMPONENT LEVEL
n/a
CONCLUSION
OW degraded on head 5, drop 8DB fall below 25DB limit.
If an end user returns a bunch of drives that show no problem found on the incoming test, WD may contact the user to ask them why they feel the drives failed and needed to be returned. Depending on the outcome of that discussion, either the incoming test will be expanded to detect a missed fault, or (more likely) the user will be told that none of the drives were actually bad, and to investigate other components (controller, cables, power supply, environment, etc.). In addition to showing the customer that they're being listened to, this will save the customer time and money in not RMA-ing good drives. And, of course, it also saves WD money as they don't have to process good drives as RMAs.