Crash! Is this a hardware problem?

My server crashed without a reboot. On restarting I found this in the logs:
Code:
Sep 29 01:24:11 serv kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1297757535
Sep 29 01:24:11 serv kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=1297757535
Sep 29 01:24:11 serv kernel: g_vfs_done():ad0s1f[WRITE(offset=649438314496, length=10240)]error = 5
I'd like to confirm that this indeed is a hardware problem.
 
SirDice said:
It could be a drive error. You can install sysutils/smartmontools to look at the drive's SMART parameters.

Yes, sorry. By hardware, I was including drive error. Thanks for the smartmontools tip:

I ran smartctl on the drive and got this:

Code:
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD7500AACS-00D6B1
Serial Number:    WD-WCAU48287017
Firmware Version: 01.01A01
User Capacity:    750,156,374,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Oct  1 07:02:29 2009 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   163   163   021    Pre-fail  Always       -       6808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1190
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   116   112   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

To be honest, I'm not sure how to interpret it. There's a PASSED at the top.
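As a rough aid for reading a table like the one above, here is a minimal Python sketch that parses smartctl attribute rows and flags the two things people usually look at first: a normalized VALUE at or below its THRESH, and a non-zero raw count on the sector-related attributes. The names `parse_attributes`, `find_warnings` and `SECTOR_COUNTERS` are made up for this example, and the choice of which attribute IDs to treat as sector counters is an assumption, not anything smartmontools defines.

```python
import re

# A few rows copied from the smartctl output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Assumption for this sketch: these IDs hold counts of problem sectors,
# so any non-zero raw value is worth a closer look.
SECTOR_COUNTERS = {5, 196, 197, 198}


def parse_attributes(text):
    """Return one dict per attribute row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            })
    return rows


def find_warnings(rows):
    """Flag rows at/below threshold and sector counters with a non-zero raw value."""
    out = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            out.append(f"{r['name']}: value {r['value']} <= threshold {r['thresh']}")
        if r["id"] in SECTOR_COUNTERS and r["raw"] > 0:
            out.append(f"{r['name']}: raw count {r['raw']} is non-zero")
    return out


if __name__ == "__main__":
    for warning in find_warnings(parse_attributes(SAMPLE)):
        print(warning)
```

For the rows shown in this thread the sketch prints nothing, which matches the PASSED line: no attribute is at its threshold and the reallocated and pending sector counts are all zero. That only says the drive has not logged a problem, not that it cannot have one.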
 
I'm actually hoping it IS a hard-drive problem because I've just transferred all the data to a new hard drive. The whole server runs on a single drive, so is a cable issue still a possibility? It's a rented dedicated server.
 
SMART didn't record any errors, so it's probably safe to assume the drive itself is OK. Those DMA errors normally indicate a faulty drive (bad sectors, for example). The cable could be another cause.
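For anyone wondering what the numbers in the FAILURE line of the kernel log mean: `status=51` and `error=10` are hexadecimal copies of the ATA status and error registers, and the names in angle brackets are the bits that are set. A small Python sketch of that decoding follows. The bit positions follow the usual ATA register layout; the names for bits that do not appear in the log above are my own labels and may not match what the kernel prints, and `decode` and `parse_failure_line` are hypothetical helpers written for this example.

```python
import re

# Status register bits, highest bit first. READY, DSC and ERROR appear in
# the log above; the other names are illustrative labels (assumption).
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DEVICE_FAULT",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTABLE",
    0x02: "INDEX",
    0x01: "ERROR",
}

# Error register bits. NID_NOT_FOUND appears in the log above; the other
# names are illustrative labels (assumption).
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x20: "MEDIA_CHANGED",
    0x10: "NID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}

FAILURE = re.compile(r"status=([0-9a-fA-F]+)<[^>]*> error=([0-9a-fA-F]+)<")


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


def parse_failure_line(line):
    """Extract and decode the status and error registers from a FAILURE line."""
    m = FAILURE.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    error = int(m.group(2), 16)
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)


if __name__ == "__main__":
    line = ("Sep 29 01:24:11 serv kernel: ad0: FAILURE - WRITE_DMA48 "
            "status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=1297757535")
    print(parse_failure_line(line))
    # (['READY', 'DSC', 'ERROR'], ['NID_NOT_FOUND'])
```

So 0x51 is 0x40 + 0x10 + 0x01, and 0x10 in the error register is the "ID not found" bit: the drive reported that it could not locate the requested sector. That fits either a drive fault or a corrupted command, which is why the cable can't be ruled out from this line alone.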
 
Note that S.M.A.R.T. is not *always* very reliable and can sometimes be totally crazy.

That said, some of the results don't look so good, especially the large gap between the VALUE and WORST columns for Seek_Error_Rate, Spin_Retry_Count and Calibration_Retry_Count (100 against 253 in each case). This doesn't seem very healthy.
Errors like these cause higher latencies in read/write operations (since seeks fail frequently and a lot of recalibration is needed) and possibly audible clicks under stress. These are usually symptoms of a failing head mechanism.
 
The server actually crashed twice with no immediate restart, and was unreachable. After I told the server company about the above error in the logs, they gave me a new clean install of FreeBSD 7.2 and attached the 'maybe-broken' drive unmounted. I mounted it and transferred the data over. The above test was performed after this copy, when the drive would have been under essentially zero load.

This was the sequence of events
1.
(a) I'm running FreeBSD 7.0 within VMware on my Mac Pro at home. It has never crashed and sometimes has a pretty high load with a number of websites (3GB memory)
and
(b) Running FreeBSD 7.2 on a dedicated host with a couple of websites, and all is fine (8GB memory). Little write activity, mainly mail.

2.
(a) Stop running 7.0 and transfer everything to (b), so ...
(b) now has more sites running. The load is generally fine, but it crashes twice with the above error.

So I've reverted back to situation (1) and (b) has a new disk.

What I'm trying to find out is whether the crash happened because the disk had a problem, or because there was something wrong with the setup. Occasionally my sites get hit by a large number of visitors and the load may have spiked, but my understanding is that FreeBSD shouldn't crash unrecoverably under load. But I'm not a high-level pro at these kinds of things. My worry is that if I put my sites back on the (b) server, the same will happen again. I want to either prevent it or at least have the server reboot when something like that happens.

I don't know if the load was high at the time because the server crashed. It might have been.
Ditto for the IO load.
I didn't use the diagnostic tool. The company said they found no issues with the drive, but when I showed them the error they agreed to change it without question.
 
Yes it sounds like a disk error.

I only asked for more details because I randomly got the same TIMEOUTs, without the final FAILUREs, under heavy I/O load using 6 disks (ZFS) connected to an Intel onboard SATA controller. Never mind.
 