SMART Threshold exceeded

Hello all,

I have a backup server that recently wasn't performing well. I went to check it out and the last two lines on the screen read:
Code:
twa0:WARNING: (0x04: 0x000F): SMART threshold exceeded: port=0
twa0:WARNING: (0x04: 0x000F): SMART threshold exceeded: port=2
I ran fsck -y and that made it work ok for a few hours. What would you guys suggest I do next? I can get into the 3ware BIOS Manager.

It says I have 1 Exportable unit (Unit 0), with 4 ports (0,1,2,3), port 1 says not in use. It says it is 4 Drive 256K RAID 5 2.72TB - DEGRADED.

Below that, it shows Unit 1, a 4 drive 64K RAID 5 2.72 TB - MISSING DRIVES

Now, I'm not sure what this Unit 1 is. I didn't build this box, it was here when I started, and there was no documentation when I started.
 
I'm guessing the Unit 1 is the hard drive the OS is on. Is there a way to check?

Also, this is FreeBSD 8.1-RELEASE.
 
Backup of the backup server? Or backup of the files on the production server? I made a full backup of the files on Friday (Oct 12th). I'm almost thinking of wiping the box (after making another backup of the production server files) and starting over with a new installation.
 
Backup of anything on the system with the reported failing drives. They might not actually be failing, but worry about that later.
 
Naw, no backup of the system with the reported failing drives, since it was the backup. I think I'm gonna get a 3 TB external HDD and back up the production files to that in the meantime, then order a few 1 TB hard drives and rebuild the failing system. I didn't like the setup of it anyway.
 
Unit 0 is the RAID array in use. It's made up of 4 hard drives, and one of the drives is missing. This drive was dropped from the array by the RAID controller due to "SMART Threshold Exceeded". Basically, the drive died, the controller noticed and removed it from the array. Hence, the DEGRADED state.

Then the drive "came back to life" enough for the controller to notice it, and to read the RAID metadata off the drive, but it didn't read it right. Now, it thinks this 1 drive is part of a second RAID array (Unit 1), and that 3 of the drives are missing.

You need to Remove Unit 1 from the RAID controller. Then physically remove that drive from the system and replace it with a new drive. Then go into the RAID controller management tool, and Create a new Unit of type SPARE. About 30 seconds after that is done, the RAID controller will use the spare drive to rebuild the RAID array.
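If you would rather do this from the running system than from the BIOS manager, the same steps can be sketched with 3ware's tw_cli tool, assuming it is installed. The controller name c0, unit u1 and port 1 below are assumptions based on what was posted in this thread, and the two function names are made up for illustration. Check the output of "tw_cli /c0 show" before deleting anything, because deleting the wrong unit destroys the array.

```sh
#!/bin/sh
# Sketch only. c0, u1 and port 1 are assumptions taken from this thread;
# confirm them with "tw_cli /c0 show" before changing anything.
TW_CLI=${TW_CLI:-tw_cli}

# Step 1: list units and ports, then delete the phantom unit.
# WARNING: "del" destroys the named unit. Never point it at the live array.
remove_stale_unit() {
    ctl=$1; unit=$2
    "$TW_CLI" "/$ctl" show
    "$TW_CLI" "/$ctl/$unit" del
}

# Step 2 (only after the bad drive has been physically replaced):
# rescan the ports and mark the new drive as a hot spare.
add_hot_spare() {
    ctl=$1; port=$2
    "$TW_CLI" "/$ctl" rescan
    "$TW_CLI" "/$ctl" add type=spare disk="$port"
    "$TW_CLI" "/$ctl" show
}

# Usage:
#   remove_stale_unit c0 u1
#   ...power down and replace the drive on port 1...
#   add_hot_spare c0 1
```

The final "show" is there so you can see whether the degraded unit has picked up the spare and started rebuilding.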
 
Wow Phoenix, thanks a lot. That makes sense. I tried rebuilding the array, since it said drive not in use on one of them. That failed. I will try your suggestion as soon as I get a 1 TB hard drive. I will update the thread as soon as this gets done.

Thanks everyone so far.
 
Just went and checked on the box. After the line saying it was rebuilding the array (without replacing the bad HD), it now shows the same warning again, but for ports 1 and 2...
Code:
twa0:WARNING: (0x04: 0x000F): SMART threshold exceeded: port=1
twa0:WARNING: (0x04: 0x000F): SMART threshold exceeded: port=2

Is there a way I can test all my drives by plugging them into another computer? They are easily taken out of the box, and I don't know which drives are good or bad now that it is claiming another one is exceeding the SMART threshold.
 
Phoenix, I appreciate your suggestion, but I should have mentioned there really are 5 physical hard drives in the box. I found out port 0 is on my right and port 4 is on the left. I think I am just going to wipe the whole thing and do backups on an external hard drive until the new backup box is built.

But for now, I gotta replace bad HDDs. How should I go about testing the HDDs? My company is supposed to be on a spending freeze / doing cost cutting measures, but they know this repair is needed. I only want to buy enough hard drives +1 to cut down on costs.
 
Examples shown here are for ada1. Some are destructive, so be sure you have the correct device.
# diskinfo -v ada1
That will usually show the drive serial number.

Install sysutils/smartmontools on another system, connect the drives to it. See the logs with
# smartctl -a /dev/ada1 | less -RS

In particular, look at reallocated sector count and self-test logs.

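It's also worth running a long self-test on each drive. It runs in the background on the drive itself; check the result later in the self-test log (smartctl -l selftest):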
# smartctl -t long /dev/ada1

The manual version of that test is to just write zeros to the entire drive. This destroys all data on it.
# dd if=/dev/zero of=/dev/ada1 bs=64k

Note: if another FreeBSD system is not available, the SMART data is shown on one of the tabs in Defraggler on Windows. Lots of others, too, no doubt.
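To compare several drives quickly, the attribute table can be boiled down to the few counters that matter most. This is only a rough sketch: smart_summary is a made-up helper name, and it assumes the usual ten-column layout of "smartctl -A" output with the raw value in the last column. A non-zero raw value for any of these is a reason to distrust the drive.

```sh
#!/bin/sh
# Read "smartctl -A" output on stdin and print name=raw_value for the
# attributes that most often point at a dying disk.
# Assumes the standard layout: ID# NAME FLAG VALUE WORST THRESH TYPE
# UPDATED WHEN_FAILED RAW_VALUE (raw value is field 10).
smart_summary() {
    awk '
        $2 == "Reallocated_Sector_Ct"  ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            printf "%s=%s\n", $2, $10
        }'
}

# Usage, drive attached directly to another FreeBSD box (device name is an example):
#   smartctl -A /dev/ada1 | smart_summary
#
# Usage, drive still behind the 3ware controller (port 2 here), assuming the
# installed smartmontools supports the 3ware pass-through:
#   smartctl -A -d 3ware,2 /dev/twa0 | smart_summary
```

The second form may save pulling the drives at all, since it asks the controller to pass the SMART query through to the drive on the given port.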
 
Also, I found some TB hard drives that Windows cannot recognize when I plug them in. Maybe they are part of a RAID array? I plugged one into the failing box (replaced the "not in use" HD with it). Funnily enough, it said it was part of the Unit 1 RAID array. I have no idea how these HDs got here, or what they were used for. I'm the only IT guy here; think I should just wipe them? If so, what do you suggest I use to do so? I'm downloading GParted now, gonna see what that tells me.
 
The best hard disk testing program I know of is MHDD. It is not pretty, but it is effective. I boot it off a CDROM. While it can run the SMART tests, it can also ERASE and SCAN entire disks, etc. This will detect many more problems than the SMART tests. AFAIK you need the BIOS to be in IDE mode for it to work. I recommend using a spare machine and putting one drive at a time on it so you can be sure which drive you are testing. Good luck.
 
Thanks for the site, Uniballer! It's got some great tools on there!

I'm gonna scan all my drives with that, see what it says and start over again on this box. Thank you all for all your responses!
 