Installing new hard disks into a system

I'm fairly new to FreeBSD and I would like to know how you folks out there test the health of a new hard disk before installing it in a NAS server (in order to avoid, say, starting to build a 6-HDD RAID-Z2 with 2 out of the 6 drives already on the verge of death).

I'd check the SMART output for the drive, but I don't know what else to do.

On GNU/Linux it's normally recommended to run badblocks on the drive, which reports whether any sectors are damaged.
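(For reference, on the Linux side I mean something like a read-only scan:
# badblocks -sv /dev/sdX
where /dev/sdX is just a placeholder for the drive in question.)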

Or do you just run dd on the new drive for a week or so to stress it and see if anything weird happens (noise or SMART data)?


What happens if some (not hundreds of) bad sectors begin to appear? Can you fix them like in GNU/Linux using badblocks (mark those blocks as unusable) or fsck? Or are there some other tricks?
 
luckylinux said:
What happens if some (not hundreds of) bad sectors begin to appear?
Bad blocks are mapped into a different "spare" region of the disk. This happens within the drive's firmware and you will never notice it (except by looking at SMART data). As soon as bad blocks start to appear this "spare" bit of disk is full and the drive will need to be replaced.

Can you fix them like in GNU/Linux using badblocks (mark those blocks as unusable) or fsck? Or are there some other tricks?
You can't "fix" bad blocks. I have no idea what that tool does but bad blocks indicate the drive needs to be replaced a.s.a.p.
 
Thank you for your quick reply SirDice.

Does this mean that any (even just one) bad block is enough to replace the hard drive?
This seems quite strange. I mean, there are many thousands of blocks in a hard drive and you expect all of them to be perfectly healthy?

On GNU/Linux, e2fsck with the "-c" option makes use of badblocks; from the man page:
-c
This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad blocks. If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to a file or directory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
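(In other words, something like
# e2fsck -c /dev/sdX1
or with -cc for the non-destructive read-write test, if I read the man page correctly; /dev/sdX1 is just a placeholder for an ext2/3/4 partition.)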

I'm not asking to fix bad blocks (obviously you cannot recover any data from them, since they're just bad). I'm asking if there is a way to make FreeBSD or ZFS mark them as "unsuitable for data placement" or anything like this.

By the way: have you had much experience with bad blocks in hard drives? Is it that uncommon?
 
luckylinux said:
Does this mean that any (even just one) bad block is enough to replace the hard drive?
Yes.

This seems quite strange. I mean, there are many thousands of blocks in a hard drive and you expect all of them to be perfectly healthy?
No, the minute you start noticing bad blocks it means that "spare" bit of disk is full and it cannot map out the bad blocks anymore.

I'm asking if there is a way to make FreeBSD or ZFS mark them as "unsuitable for data placement" or anything like this.
Don't go there, simply replace the disk when bad blocks start showing up.

By the way: have you had much experience with bad blocks in hard drives? Is it that uncommon?
It's more common than head crashes or other drive failures. Especially when the drive gets older. The magnetic properties of that bit of the platter just wear out.
 
SirDice said:
No, the minute you start noticing bad blocks it means that "spare" bit of disk is full and it cannot map out the bad blocks anymore.

When you say "spare bit" I think you mean a certain number of "spare sectors" such the way SSD have a certain amount of sectors used for wear-levelling.
Or you actually mean a region of the hard drive (unaccessible from the system) where there is a list of the bad blocks to NOT use (firmware managed).

SirDice said:
Don't go there, simply replace the disk when bad blocks start showing up.

Is the appearance of bad blocks "enough" to get an RMA replacement? Seems kind of expensive otherwise ... and scary :\

SirDice said:
It's more common than head crashes or other drive failures. Especially when the drive gets older. The magnetic properties of that bit of the platter just wear out.
Could you put it roughly as a percentage? Like what % of HDDs get replaced in the first 3 years of service?
 
luckylinux said:
When you say "spare bit" I think you mean a certain number of "spare sectors" such the way SSD have a certain amount of sectors used for wear-levelling.
Or you actually mean a region of the hard drive (unaccessible from the system) where there is a list of the bad blocks to NOT use (firmware managed).
A bit of both. Similar to SSDs or flash a drive has a spare region you cannot access. Bad blocks get mapped to this spare region. You will never notice this happening as it's all taken care of by the drive's firmware.

Is the appearance of bad blocks "enough" to get an RMA replacement? Seems kind of expensive otherwise ... and scary :\
Yes. Some SMART values are sometimes also accepted as pre-failure replacement. I.e. you can RMA the drive before it actually fails.

Could you put it roughly as a percentage? Like what % of HDDs get replaced in the first 3 years of service?
Didn't Google publish something like that some time ago?
 
SirDice said:
Yeah, that's the one.

About 25% failure rate in the first 3 years. I'd say not too bad.
Depends on whether they also considered bad blocks as a failure factor or "just" a head or motor crash.

EDIT: so if you can't run any command to fix these bad blocks, where do you see whether they exist? In the SMART data, or is there an additional command that does the job?

I still find it strange that such a "fix" exists on GNU/Linux while it doesn't on FreeBSD. It is a filesystem-level fix though (EXT2/3/4).
 
smartctl from sysutils/smartmontools shows the SMART data. "Reallocated sector count" is the field:
# smartctl -a /dev/ada0 | grep -i reallocated_sector
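It can also be worth glancing at the pending/uncorrectable counters at the same time; something along these lines should catch the usual suspects (exact attribute names vary a bit between drive models):
# smartctl -A /dev/ada0 | egrep -i 'reallocated|pending|uncorrect'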

Scanning a disk for bad blocks and adding them to a file was something done back in the old days before drives had bad block reallocation. Now that the drives handle it internally, it's not very useful to do the file-level version. Could be worse than nothing, in fact: if the drive is failing so badly that internal block reallocation can't keep up, mapping out blocks on a filesystem level could just hide the problem. Why do people still do it? Like a lot of things, because they always have.

These days, I write all zeros to a new disk with dd(1), run a SMART long test on it, then put it on probation.
 
Thank you for your answer wblock.

wblock@ said:
These days, I write all zeros to a new disk with dd(1), run a SMART long test on it, then put it on probation.

Would you mind explaining a bit more in detail please?
Like:
  1. How many times do you run dd on the disk?
  2. How do you run a long SMART test? I thought SMART was only a reading you usually schedule on a regular basis, and when something weird is found you set up an email notification through a script or something.
  3. What do you mean by "put it on probation"? Do you mean to actually put it to use in a real-life RAID-Z/Z2/Z3 array, just let it spin for a week or so to ensure nothing is wrong with the drive's motor, or stress test it using dd or other tools?

Thank you very much for your help.
 
luckylinux said:
How many times do you run dd on the disk?

Just once:
# dd if=/dev/zero of=/dev/ada4 bs=1m
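(While it runs, pressing Ctrl+T will make dd(1) print a progress line, since FreeBSD delivers SIGINFO on that. If you want to be extra thorough you can follow it with a read-back pass, roughly:
# dd if=/dev/ada4 of=/dev/null bs=1m
though the SMART long test below already reads the whole surface.)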

How do you run a long SMART test? I thought SMART was only a reading you usually schedule on a regular basis, and when something weird is found you set up an email notification through a script or something.

SMART has a bunch of capabilities. Some are optional. Most drives support the short and long tests. See smartctl(8). For the above drive, it would be:
# smartctl -t long /dev/ada4

Test status will be shown in the -a output. It takes about the same time as the dd(1).
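Once it's done you can also pull up the self-test log directly:
# smartctl -l selftest /dev/ada4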

What do you mean by "put it on probation"?

At that point, it's considered working, but not yet trusted. It would not hurt to run it with some real usage, say as another drive in an existing mirror, or copying large amounts of data to it.
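For example, attaching it to an existing single-disk or mirror vdev looks roughly like this (pool and device names are placeholders for your own setup):
# zpool attach mypool ada3 ada4
and it can be removed again later with zpool detach.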
 
Thank you for the other explanations. Just another question on the last point ;)

wblock@ said:
At that point, it's considered working, but not yet trusted. It would not hurt to run it with some real usage, say as another drive in an existing mirror, or copying large amounts of data to it.

In my use case I really don't think putting it in a mirror is suitable (I plan on doing only RAID-Z2 stuff, except maybe the boot drive, which may be a mirror).

By copying large amounts of data to it, do you mean "random data" (for instance from /dev/urandom or /dev/random) or some data that can be checksummed in order to verify integrity?
How long before you actually trust the drive (if ever), before you can put it into "production" in a RAID-Z2 of, say, 6 drives?
 
Adding it to the mirror does a large initial sync with the existing drives (yes, it should be possible to have more than two drives in a mirror), then has some additional usage. That would not have to be permanent; the point is to exercise the drive. If any loose flakes of rust are going to come off, or poorly-soldered components fail, better sooner than later. Another way to do it would be to create the array and use it for a while. If you have multiple redundancy, it is unlikely that enough drives would die at the same time to lose data. But still, keep good backups.
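If you go the "create the array and use it for a while" route, loading it up with data and then running a scrub gives you exactly the checksummed verification you asked about, since ZFS checksums everything it writes; roughly:
# zpool scrub mypool
# zpool status mypool
("mypool" being a placeholder for whatever you name the pool).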
 
luckylinux said:
Does this mean that any (even just one) bad block is enough to replace the hard drive?
This seems quite strange. I mean, there are many thousands of blocks in a hard drive and you expect all of them to be perfectly healthy?
To expand on the answer by others... I have never had an RMA refused for even one reallocated sector from any disk drive manufacturer. And I know for a fact that (at least) Western Digital keeps track of each customer's RMAs and the percentage of "no problem found" drives they get back.

The drive manufacturers have much more sophisticated methods for detecting bad spots on the media (remember, they have access to the raw signal coming off the read/write head, while all you can see from the operating system is if the block is good or bad), and before they ship the drive they map them out, so that the drive presents "logically perfect" media to the user. That means that any errors that appear on a disk drive either happened after the drive left the factory (and how many more might there be that just haven't been detected yet) or the factory test missed a bad spot (likewise).

In "classic" Unix, disks were prepared using the bad144(8) [note - you'll need to select "4.3BSD Reno" or similar to actually see the manpage) utility which accepted both user-added bad sectors as well as the manufacturer's list, which was usually printed on a label attached to the drive (for non-removable media) and/or stored in a reserved location on the media.
 