ATA - DMA timeout ?

Hmm - bit strange but first the scenario. It's a server so I'm a bit unable booting from various analysis systems. ;)
It got 2 new HDD's installed and a fresh new installation of FreeBSD 8.1. Though after some time installing ports I the first HDD gets disconnected.
dmesg shows:

Code:
ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=463008256
ad4: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=463008256

The complete dmesg output - still a GENERIC kernel - nothing patched and yet - except sshd - nothing running.

A quick smartct test on ad4:

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE Serial ATA family
Device Model:     WDC WD2500YS-18SHB2
Serial Number:    WD-WCANY4023177
Firmware Version: 20.06C07
User Capacity:    250,000,000,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Mar 27 22:29:40 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  33)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		 (7680) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  91) minutes.
Conveyance self-test routine
recommended polling time: 	 (   6) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   187   187   021    Pre-fail  Always       -       5625
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   176   176   140    Pre-fail  Always       -       187
  7 Seek_Error_Rate         0x000e   199   199   051    Old_age   Always       -       98
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12022
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   121   093   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   166   166   000    Old_age   Always       -       34
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   196   000    Old_age   Always       -       51
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So the problem is - anything software related or hardware ? :\
 
Due of 10000 character limit per post - addition:

Code:
Copyright (c) 1992-2011 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.2-RELEASE #0: Fri Feb 18 02:24:46 UTC 2011
    root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC i386
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 Processor 3500+ (2210.20-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x70ff1  Family = f  Model = 7f  Stepping = 1
  Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  Features2=0x2001<SSE3,CX16>
  AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x11d<LAHF,SVM,ExtAPIC,CR8,Prefetch>
real memory  = 2147483648 (2048 MB)
avail memory = 2024030208 (1930 MB)
ACPI APIC Table: <Nvidia AWRDACPI>
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <Nvidia AWRDACPI> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 7bde0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory, RAM> at device 0.0 (no driver attached)
pci0: <memory, RAM> at device 0.1 (no driver attached)
pci0: <memory, RAM> at device 0.2 (no driver attached)
pci0: <memory, RAM> at device 0.3 (no driver attached)
pci0: <memory, RAM> at device 0.4 (no driver attached)
pci0: <memory, RAM> at device 0.5 (no driver attached)
pci0: <memory, RAM> at device 0.6 (no driver attached)
pci0: <memory, RAM> at device 0.7 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 3.0 on pci0
pci2: <ACPI PCI bus> on pcib2
mskc0: <Marvell Yukon 88E8056 Gigabit Ethernet> port 0x8c00-0x8cff mem 0xfddfc000-0xfddfffff irq 16 at device 0.0 on pci2
msk0: <Marvell Technology Group Ltd. Yukon EC Ultra Id 0xb4 Rev 0x02> on mskc0
msk0: Ethernet address: 00:16:e6:d3:d5:bf
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1149 Gigabit PHY> PHY 0 on miibus0
e1000phy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
mskc0: [ITHREAD]
pcib3: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci3: <ACPI PCI bus> on pcib3
vgapci0: <VGA-compatible display> mem 0xfb000000-0xfbffffff,0xe0000000-0xefffffff,0xfc000000-0xfcffffff irq 16 at device 5.0 on pci0
pci0: <memory, RAM> at device 9.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 10.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 10.1 (no driver attached)
pci0: <memory, RAM> at device 10.2 (no driver attached)
ohci0: <OHCI (generic) USB controller> mem 0xfe02f000-0xfe02ffff irq 21 at device 11.0 on pci0
ohci0: [ITHREAD]
usbus0: <OHCI (generic) USB controller> on ohci0
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xfe02e000-0xfe02e0ff irq 22 at device 11.1 on pci0
ehci0: [ITHREAD]
usbus1: EHCI version 1.0
usbus1: <EHCI (generic) USB 2.0 controller> on ehci0
atapci0: <nVidia nForce MCP51 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf400-0xf40f at device 13.0 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
atapci1: <nVidia nForce MCP51 SATA300 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xe000-0xe00f mem 0xfe02d000-0xfe02dfff
 irq 23 at device 14.0 on pci0
atapci1: [ITHREAD]
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
atapci2: <nVidia nForce MCP51 SATA300 controller> port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xcc00-0xcc0f mem 0xfe02c000-0xfe02cfff
 irq 20 at device 15.0 on pci0
atapci2: [ITHREAD]
ata4: <ATA channel 0> on atapci2
ata4: [ITHREAD]
ata5: <ATA channel 1> on atapci2
ata5: [ITHREAD]
pcib4: <ACPI PCI-PCI bridge> at device 16.0 on pci0
pci4: <ACPI PCI bus> on pcib4
nfe0: <NVIDIA nForce 430 MCP13 Networking Adapter> port 0xc800-0xc807 mem 0xfe02b000-0xfe02bfff irq 21 at device 20.0 on pci0
miibus1: <MII bus> on nfe0
e1000phy1: <Marvell 88E1116 Gigabit PHY> PHY 1 on miibus1
e1000phy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
nfe0: Ethernet address: 00:16:e6:d3:d5:be
nfe0: [FILTER]
atrtc0: <AT realtime clock> port 0x70-0x73 irq 8 on acpi0
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FILTER]
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: [FILTER]
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
uart1: [FILTER]
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppc0: [ITHREAD]
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
plip0: [ITHREAD]
lpt0: <Printer> on ppbus0
lpt0: [ITHREAD]
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xd0000-0xd17ff,0xd2000-0xd2fff pnpid ORM0000 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
powernow0: <PowerNow! K8> on cpu0
Timecounter "TSC" frequency 2210197882 Hz quality 800
Timecounters tick every 1.000 msec
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <nVidia> at usbus0
uhub0: <nVidia OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <nVidia> at usbus1
uhub1: <nVidia EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub0: 8 ports with 8 removable, self powered
ad4: 238418MB <WDC WD2500YS-18SHB2 20.06C07> at ata2-master UDMA100 SATA 3Gb/s
ad6: 238418MB <WDC WD2500YS-18SHB2 20.06C07> at ata3-master UDMA100 SATA 3Gb/s
GEOM_MIRROR: Device mirror/gm0 launched (1/2).
GEOM_MIRROR: Device gm0: rebuilding provider ad4.
GEOM: mirror/gm0s1: geometry does not match label (16h,63s != 255h,63s).
Root mount waiting for: usbus1
Root mount waiting for: usbus1
uhub1: 8 ports with 8 removable, self powered
Trying to mount root from ufs:/dev/mirror/gm0s1a
nfe0: link state changed to UP

Gmirror is running and removing ad4 via atacontrol adding it again, the rebuild process begins and stops at random.
 
That "reallocated sector count" (raw value) means the disk is failing. If you need it replaced cheap, a used one of the same size may or may not have a zero value (the last column, zero is default.). More than zero -- failing disk, although you can use it as backup target until it crashes its filesystem or worse... (I have posts which discuss rsync with the bwlimit parameter which I find useful in that case.)
(Not your primary backup by any means, just an incidental one).
 
Back
Top