New hard-drive "burn-in" in FreeBSD?

Greetings all,

Since I have brainwashed myself into believing in exercising any new hard drive for several hours/days before using it, I have in the past used the Linux badblocks tool. As there does not appear to be a FreeBSD alternative, I wonder what I could use instead.

What I am looking for is something akin to the -p and -w options of badblocks, so that I can scan for bad blocks by writing patterns to every block of the hard drive, reading every block back, and comparing the contents, several times over.

Kindest regards,

M
 
I looked around for something a while ago and came up short, so I'm hoping there's a good answer to your question.

My suggestions would be:

1.) Use dd to check the drive. See "man dd" for the relevant options.
2.) Create a zfs pool with the drive and create a number of large files on the drive (say, using dd and /dev/zero); a sketch follows this list. Once there's some sizeable information on the pool, do a zpool scrub and check for errors. (Writing data to the drive might not be necessary for the zpool scrub command.)
3.) Search the ports tree and package databases for better options.
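
For option 2.), a rough sketch of what I mean (the disk name ada1 and the pool name testpool are placeholders; double-check the device name before running anything destructive):

Code:
# zpool create testpool ada1
# dd if=/dev/zero of=/testpool/fill.bin bs=1m count=10240
# zpool scrub testpool
# zpool status -v testpool
# zpool destroy testpool

The scrub re-reads and checksums what was written, and zpool status will show any read or checksum errors against the disk.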

I'm at my wits' end, too, but I'm trying to help.
 
You don't need to write "patterns" to a hard disk. It works differently from memory, where a bit is wired to an exact position. A hard disk modulates the data on top of a carrier pattern, so writing 0s won't leave the platter demagnetized (a demagnetized hard drive is not readable).

It should be enough to write 0s and read them back. You will never get write errors, only read errors. You will also not see any bit "flipped" on the disk, because lots of ECC data is written to the disk and the drive will recognize a faulty sector pretty precisely.

The badblocks program is useless in my opinion. You can simply use dd(1) and dump everything to /dev/null. If you want to be sure, overwrite with 0s multiple times.

And marking sectors as "bad" is stupid. Even if you have a few faulty sectors, the hard drive is not broken. Overwrite them a few times and read again, and they will "magically" disappear. If the hard drive is new, it's almost normal to get a few broken sectors on it (I had 4 of them when I bought my hdd last year; now it works without a problem, and will work for a loooong time, I'm sure).

Only if the hard drive accumulates more and more defects should you be careful. Use smartd to detect such anomalies, and hope that it was implemented correctly and does not lie to you.
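
As a minimal sketch of the smartd part (assuming sysutils/smartmontools from ports, which keeps its config under /usr/local/etc; the mail address is a placeholder):

Code:
# echo 'DEVICESCAN -a -m root@localhost' >> /usr/local/etc/smartd.conf
# echo 'smartd_enable="YES"' >> /etc/rc.conf
# /usr/local/etc/rc.d/smartd start

DEVICESCAN makes smartd watch every disk it can find, -a monitors all SMART attributes, and -m mails a warning when something changes for the worse.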
 
I do the same when I buy a new hard disk: I check it before using it. And if a single block is bad, I send it back, as I see this as the beginning of the end :-).

You can still use badblocks by installing emulators/linux_base-f10 from ports or you can boot from a Linux livecd.
 
ian-nai said:
2.) Create a zfs pool with the drive and create a number of large files on the drive (say, using dd and /dev/zero). Once there's some sizeable information on the pool, do a zpool scrub and check for errors. (Writing data to the drive might not be necessary for the zpool scrub command.)

Note: scrub only checks the blocks that have been written to, and does not touch unused space. So filling 10% of the drive and then running a scrub ... will only check 10% of the drive. :)

Same for resilvering (replacing) a drive in a redundant vdev.

Just something to watch out for, in case you were expecting scrub/resilver to touch every sector of the disk.


As to the original question, why not:
  1. use sysutils/smartmontools to get the error numbers (this is your baseline / starting point, and should be all 0)
  2. use dd(1) to write 0s to the entire drive (if=/dev/zero of=/dev/whatever)
  3. check SMART values again for any errors
  4. use dd(1) to read back the drive contents (if=/dev/whatever of=/dev/null)
  5. check SMART values again for any errors
  6. repeat dd / SMART process using /dev/random to write (and then read) random bits

Do that a few times, and you'll know whether or not the drive is fine.
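
Put together as a rough script, the zero pass might look like this (a sketch only; ada1 is a placeholder for the disk under test, and step 6 is the same thing with /dev/random in place of /dev/zero):

Code:
#!/bin/sh
# burn-in sketch -- DISK is a placeholder, double-check it first!
DISK=/dev/ada1
smartctl -a $DISK > smart_baseline.txt        # step 1: baseline error counters
dd if=/dev/zero of=$DISK bs=1m                # step 2: write pass
smartctl -a $DISK > smart_after_write.txt     # step 3: check again
dd if=$DISK of=/dev/null bs=1m                # step 4: read pass
smartctl -a $DISK > smart_after_read.txt      # step 5: check again
diff smart_baseline.txt smart_after_read.txt  # error counters should not move

(The diff will also show harmless changes like temperature and power-on hours; it's the error counters you care about.)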
 
Thank you all for the suggestions.

ian-nai,

I looked at dd but if I understand the man page, it will write, e.g., zeros:

Code:
if=/dev/zero of=/dev/myDrive

but not re-read the written pattern.

Or, could I read from the drive and redirect to a file?

Code:
if=/dev/myDrive of=/dev/null>myFile

and then check that myFile has all zeros in it?

If this works, I can repeat the process several times.

I am also not sure what your option 2. accomplishes. For this process, I would have to write to the entire drive, correct?

I have looked at the ports tree, but I believe the best option is to use emulation or a LiveCD, as formateur_fou suggested.

nakal,

I am not sure I understand your answer. Could you please elaborate, based on my questions below?

What does the first paragraph have to do with my question?

Why is it useless to write to and read from a hard drive to exercise it? If I understand the statistics of electronic device failures, they mostly fail shortly after being placed into service. If I can run the drive for several hours/days with random patterns and it does not fail, I, personally, will feel safer.

I do not feel that there is a need to call me stupid, especially considering that I have explained what I am trying to accomplish, i.e., that I am not trying to mark bad blocks. In my understanding, they will be re-mapped by the drive's firmware.

Again, the last paragraph does not appear to address my question. As I understand it, smartmontools monitors the drive once it is in service. This is a new drive. Did I miss something?

formateur_fou,

do you know which LiveCD has badblocks on it? Or do they all have it?

Kindest regards,

M
 
phoenix,

thank you for the reply, this is a great idea.

One question regarding:

6. repeat dd / SMART process using /dev/random to write (and then read) random bits

I understand how to write a random pattern to the drive:

Code:
if=/dev/random of=/dev/myDrive

Can I read it back in this manner:

Code:
if=/dev/myDrive of=/dev/null>readFile

But even if I can, how do I compare the random pattern that was written to the drive with what I read back and stored in readFile? I need some mechanism to write to myDrive and to writeFile simultaneously, and then to compare readFile and writeFile.

Kindest regards,

M
 
You specify the file to write to in the of= argument: $ dd if=/dev/whatever of=/path/to/some/file.txt

if= tells dd what to read from.
of= tells dd what to write to.

For working with hard drives, you'll also want to specify a block size (the bs= argument). Depending on the CPU, the hard drive, the driver, etc., the "optimal" block size will vary. I tend to use 1M everywhere just to keep it simple. Some people find 16M faster. But don't leave it at the default of 512 bytes, as that will be *VERY* slow. :)
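
As for comparing a random pattern: tee(1) is the usual tool for writing one stream to two places at once, but it's simpler to capture the pattern in a file first and then write that file to the disk, so there's something to compare against later. A sketch, with placeholder names (keep the reference file on a different disk than the one under test):

Code:
#!/bin/sh
DISK=/dev/ada1            # placeholder -- the disk under test
REF=/otherdisk/ref.bin    # placeholder -- must live on a different disk
# capture 1 GB of random data, then write that same data to the disk
dd if=/dev/random of=$REF bs=1m count=1024
dd if=$REF of=$DISK bs=1m
# read the same region back and compare byte-for-byte
dd if=$DISK bs=1m count=1024 | cmp $REF -

cmp(1) stays silent if the two match, and reports the first differing byte if they don't.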
 
formateur_fou,
do you know which LiveCD has badblocks on it? Or do they all have it?
I think this program is included in nearly all of them.
SystemRescueCd is aimed at administrating or repairing a system and has many tools.
But maybe, as suggested before, you can use FreeBSD tools instead.
 
phoenix,

sorry for being unclear. I understand the basics of the dd command. What I was inquiring about is the syntax for writing to two files simultaneously.

formateur_fou,

thank you for finding sysutils/e2fsprogs.

Kindest regards,

M
 
Back in the day, scanning a drive and marking bad sectors as used in the filesystem was a fine concept. It's outdated now.

If a modern drive has a write error, it will automatically map out the bad sector to one of the spare sectors on the drive. The remap count can be seen with smartmontools. A typical program will probably never see a write error until the drive runs out of spares and is near failure anyway. Some of us believe that bad sectors appearing after a drive has left the factory mean that the drive should be replaced.

Which brings a thought: has badblocks ever found an error on your drives?

If there's a read error, you'll see it in /var/log/messages on FreeBSD.

If you still want to do a full write test with dd, give it a larger-than-default buffer or it will take forever:
# dd if=/dev/zero of=/dev/ad4 bs=64k

Press ctrl-t while that's running and you can see how far it's gone.
 
phoenix,

I may not understand point 6 of your first post.

Let us say that I will write to the drive:

Code:
if=/dev/random of=/dev/myDrive

This will write random numbers to the drive. Now I read what is on the drive:

Code:
if=/dev/myDrive of=/path/to/readFile

Since I have not captured the random numbers written to the drive, how do I compare what I have read back with what was written to the drive?

wblock,

perhaps my English is failing me, but I have explained several times that I am not trying to find bad blocks. I was simply using the badblocks command because of its operation:

1. some patterns are written to every block of the hard drive;
2. every block is read back;
3. the contents of read and write are compared; and
4. any difference is reported.

If I repeat this sequence for hours/days and do not see any errors and the drive still runs, I believe that the drive is not likely to fail, because, if the statistics are to be believed, electronics mostly fail shortly after being placed into service.

Is this an outdated concept? Am I misinterpreting something?

Kindest regards,

M
 
Let me make this clear: it's not useless to test the hard drive. It's useless to use badblocks or any tool that assumes a hard drive is designed like system memory.

Why is it not useless to run dd(1)? Because you will sometimes discover faulty sectors on a new hard drive, and those can disturb RAID functionality. Yes, it takes a long time to zero out whole terabyte drives, but at least you get rid of all the errors that would otherwise appear when you can least afford them.

And as I said, if a drive has a few errors when it's new, that's completely normal. The error rate of having 4 faulty sectors on a 1 TB drive is about 1 in a billion. Do you know of cheap parts with that kind of quality? A normal hard drive has hidden spare sectors (2048 of them on a 500 GB Seagate drive) that are used transparently when a sector has to be replaced. As long as the drive is not accumulating faulty sectors, it should generally be OK.

I have a drive here that had faulty sectors. Overwriting the sectors multiple times with zeros and reading them again caused them to be relocated. I have a very stable PC here with RAID-1, and after this initial relocation I have never had any problems again.
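
As a sketch, rewriting one suspect sector looks like this (the disk name and LBA are placeholders; take the real LBA from the read error in /var/log/messages, and note this assumes 512-byte sectors):

Code:
# dd if=/dev/zero of=/dev/ada0 bs=512 seek=123456789 count=1
# dd if=/dev/ada0 of=/dev/null bs=512 skip=123456789 count=1

If the read still fails after a few rewrites, the drive could not relocate the sector and it's time to worry.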
 
I thought that modern drives verified data after it was written, but I can't cite any evidence of that; surely there are documents on the web describing the low-level activity of a write. If it's true, then filling the drive with any data and not getting errors would show the drive is trustworthy.
 
In my experience, hard drives do not verify data after writing it. If a sector is faulty, you will not get any feedback. That's why you need to read the sectors back.

RAID/mirror setups also fail only when read errors appear. When a write to a drive gets lost, you will not notice it. Also, one half of a mirror can have a faulty sector that is discovered very late, because the balancing algorithm may pick the healthy drive with 50% probability. The routines don't even compare the mirrored data, so it's possible you won't notice a problem with a drive.
 
It is not true that drives do not verify after write. Some do. :) You need to enable it, though, as it will make your drive slower (and manufacturers typically want to claim their drives are bigger and faster).

# camcontrol identify ada0
Code:
[...]
device model          ST3500418AS
[...]
write-read-verify              yes      no      0/0x0
# camcontrol identify ada1
Code:
[...]
device model          Hitachi HDT725050VLA380
[...]
write-read-verify              no       no
Do as phoenix suggested: overwrite/read the entire drive a few times with zeros and check the S.M.A.R.T. status. It's normal for new drives to grow defects, for example (one day powered up):

Code:
smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST2000DL003-9VT166
Serial Number:    5YD1D7YC
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb 23 12:44:58 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   100   006    Pre-fail  Always       -       121018256
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   065   060   030    Pre-fail  Always       -       3281811
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   058   045    Old_age   Always       -       40 (Min/Max 24/42)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   040   042   000    Old_age   Always       -       40 (0 24 0 0)
195 Hardware_ECC_Recovered  0x001a   033   014   000    Old_age   Always       -       121018256
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       199166223450137
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1012062049
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       10833483
[...]

It had zero Runtime_Bad_Block when first connected.
 
None of my drives supports write-read-verify; that's why I said "from my experience". I was only answering the previous post by saying that you usually don't see any write errors (I've seen write errors related to a faulty controller, though), only read errors.

SMART is not very reliable. Some (most) vendors lie in the logs, but I try to believe that the drives tell at least a bit of truth. I told you about the hard drive here with 4 or 8 faulty sectors: I saw them in the "pending" attribute, then in the "relocated" attribute, and now everything shows "0" (Samsung 1 TB drives).

Then I have 2 drives (Seagate 500 GB) which both suddenly had 2047 relocated sectors (they were rejected by the RAID controller) and haven't had any surface errors for half a year.
 
danbi said:
It's normal for new drives to grow defects, for example (one day powered up):

Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

It had zero Runtime_Bad_Block when first connected.

Reallocated_Sector_Ct is the number of sectors that have been reallocated, and Current_Pending_Sector is a count of blocks that had read errors and will be reallocated the next time they are written. I don't know what Runtime_Bad_Block is, but the good news is that your drive shows no reallocated or pending sectors.
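
If you just want to watch those two counters between test passes, something like this works (the device name is a placeholder):

Code:
# smartctl -A /dev/ada1 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'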
 
Thank you all for contributing to the discussion; I have learned quite a bit.

nakal,

thank you for the clarification. What you are describing:

Overwriting the sectors multiple times with zeros and reading them again caused them to be relocated

matches my understanding; badblocks just uses random patterns instead of zeros. I have also experienced SMART unreliability: I have a drive that does not pass the built-in test on an HP laptop, but SMART reports the drive as healthy.

Kindest regards,

M
 
wblock said:
Back in the day, the concept of scanning a drive and marking bad sectors as used in the filesystem was a fine concept. It's outdated now.
[...]
Which brings a thought: has badblocks ever found an error on your drives?

Yes, it did, many times. Once even on a netbook I had just bought.
 
wblock said:
Interesting! Did the SMART statistics agree?

I guess it would. I used to check customers' PCs with a live CD when I suspected a faulty hard drive, as chkdsk on Windows doesn't give you much information. Those computers didn't have any disk monitoring tool.

Anyway, it is amazing to see so many hard drives failing after a while.
 
nakal said:
And marking sectors as "bad" is stupid. Because even you have a few faulty sectors, the hard drive and the sector is not broken. Overwrite them a few times and read again, they will "magically" disappear. If the hard drive is new, it's almost normal that you get a few broken sectors on it (I had 4 of them and when I bought my hdd last year, now it works without a problem, and will work for a loooong time, I'm sure).
I respectfully disagree with this. See Google's disk failure paper.

After the first scan error, drives are 39 times
more likely to fail within 60 days than drives without
scan errors.
On that basis, and the Reallocation Count, Offline Reallocations, and Probational Counts, I think that the only reason not to RMA the drive is if you have so much redundancy in your system that you can afford to keep it in service. Otherwise there is a high likelihood that you have a lemon on your hands. A stitch in time saves nine.

In aggregate, it is true that any of these categories of SMART data is enough to greatly increase the risk of continued usage of the drive. For individual drives, you may of course be lucky, as your experience indicates. I think that if reliability is of high importance, you need to use copies>1, especially if you are using risky drives in your pool. If you get bad sectors affecting the same block on more drives than you have redundancy for, ZFS won't be able to repair the corruption.
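
For reference, turning on extra copies for a dataset is a one-liner (the pool/dataset name is a placeholder); note that it only applies to data written after the property is set:

Code:
# zfs set copies=2 tank/important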

Note also that while bad SMART data will predict a failing drive well enough to prompt me to RMA it, drives can fail without any bad SMART data. This does not mean "ignore SMART data as it is useless". It means ignore SMART data at your peril, and have extra redundancy/backup to catch drives that just fail out of the blue.
 