SATA disks become "detached"

Prolixium · Jun 29, 2010

Hi -

Over the last couple of years, I've been having an issue every month or so (sometimes every week) where one of the two SATA disks in my FreeBSD box becomes detached. When it's the disk that contains the swap partition, the box panics and reboots. When it's the disk that contains /usr/home, I've been able to recover without a reboot by using atacontrol to reattach the ATA device and then remounting the filesystem.
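
A minimal sketch of that recovery sequence, for illustration only. It assumes static ATA device numbering (disk unit = channel * 2, plus 1 for a slave), which agrees with the kernel log in this post where ad8 errors are reported against ata4. The exact atacontrol subcommands that were used are not stated here; detach/attach, the forced unmount, and the helper names are assumptions. The function only builds the command list and runs nothing.

```python
import re


def ata_channel_for(disk: str) -> str:
    """Map an adN disk name to its ataM channel, assuming static numbering."""
    match = re.fullmatch(r"ad(\d+)", disk)
    if match is None:
        raise ValueError(f"not an adN disk name: {disk!r}")
    # Assumption: unit = channel * 2 (master) or channel * 2 + 1 (slave).
    return f"ata{int(match.group(1)) // 2}"


def recovery_commands(disk: str, partition: str, mountpoint: str) -> list:
    """Build (but do not run) a hypothetical reattach-and-remount sequence."""
    channel = ata_channel_for(disk)
    return [
        ["umount", "-f", mountpoint],        # drop the stale mount
        ["atacontrol", "detach", channel],   # assumed subcommand choice
        ["atacontrol", "attach", channel],   # re-probe the channel
        ["mount", f"/dev/{partition}", mountpoint],
    ]


if __name__ == "__main__":
    for command in recovery_commands("ad12", "ad12s1d", "/usr/home"):
        print(" ".join(command))
```

Given the "expect data loss" message the kernel prints on detach, checking the filesystem before the final mount may be prudent.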

This started around FreeBSD 6.x, and the box is now running FreeBSD 7.2-p3 with the same hardware. In fact, all hardware in the box (including disks) has been replaced since I initially thought this was hardware related, but apparently it is not. It's gone through 6.0, 6.1, 6.2, 7.0, 7.1, 7.2 upgrades, and all have experienced the same issues.

In the case where the disk holding /usr/home (ad12) detaches, here are the kernel messages:

Code:

Device ad12s1d went missing before all of the data could be written to it; expect data loss.
Jun 26 01:36:41 dax kernel: pid 57353 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
Jun 26 01:41:10 dax kernel: pid 44038 (rtorrent), uid 1000 inumber 12718081 on /usr/home: out of inodes
Jun 26 01:44:36 dax kernel: pid 8034 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
Jun 26 01:44:46 dax kernel: pid 38672 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
Jun 26 01:44:56 dax kernel: pid 21014 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
Jun 26 01:45:07 dax kernel: pid 57353 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
Jun 26 01:45:17 dax kernel: pid 8034 (httpd), uid 80 inumber 19642453 on /usr/home: out of inodes
[...]

I'm assuming the "out of inodes" errors are just the kernel becoming confused, since the filesystem is still mounted but the disk has disappeared.

When the disk holding the swap partition (ad8) detaches, the errors are a little different. Sometimes the box hangs after tons of g_vfs_done errors:

Code:

subdisk8: detached
ad8: detached
g_vfs_done():ad8s3d[READ(offset=27569928192, length=2048)]error = 6
swap_pager: I/O error - pagein failed; blkno 9694,size 4096, error 6
g_vfs_done():ad8s3d[READ(offset=27569930240, length=2048)]error = 6
vm_fault: pager read error, pid 685 (devd)
g_vfs_done():ad8s1a[READ(offset=423264256, length=32768)]error = 6
g_vfs_done():ad8s3d[READ(offset=27569932288, length=2048)]error = 6
vnode_pager_getpages: I/O read error
vm_fault: pager read error, pid 685 (devd)
g_vfs_done():ad8s1a[READ(offset=98304, length=16384)]error = 6
g_vfs_done():ata4: FAILURE - already active DMA on this device
unknown: setting up DMA failed
ata4: FAILURE - already active DMA on this device
unknown: setting up DMA failed
ad8s1a[READ(offset=192741376, length=16384)]error = 6
g_vfs_done():ad8s3d[WRITE(offset=26008551424, length=16384)]error = 6
g_vfs_done():ad8s4d[WRITE(offset=118730604544, length=12288)]error = 6
g_vfs_done():ad8s3d[READ(offset=27569934336, length=2048)]error = 6
[...]

Or it will try to write a dump file, error out, then reboot:

Code:

g_vfs_done():ad8s1f[WRITE(offset=99211411456, length=16384)]error = 6
g_vfs_done():ad8s1f[WRITE(offset=99211673600, length=16384)]error = 6
g_vfs_done():ad8s1f[WRITE(offset=99211853824, length=16384)]error = 6
g_vfs_done():ad8s1f[WRITE(offset=100559978496, length=16384)]error = 6
g_vfs_done():ad8s1f[WRITE(offset=103449427968, length=16384)]error = 6
g_vfs_done():ad8s1f[WRITE(offset=103449493504, length=16384)]error = 6
/dev: got error 6 while accessing filesystem
panic: softdep_deallocate_dependencies: unrecovered I/O error
cpuid = 1
Uptime: 2d11h2m39s
Physical memory: 999 MB
Dumping 292 MB: 277 261 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5Attempt to write outside dump device boundaries.

** DUMP FAILED (ERROR 6) **
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

The hardware in the box is fairly standard. Core 2 Duo w/Intel ICH and 2x WD SATA disks.

Recently, I was told to try switching the disks from SATA300 to SATA150 via the jumpers on the disks, because of possible SATA chipset firmware incompatibilities. I tried this, and it didn't resolve the issue.

Here's a dmesg.boot from the box, showing hardware, etc.:

http://www.prolixium.com/share/txt/freebsd/dmesg.boot.20100628.txt

Troubleshooting this is difficult because this is a dedicated server at a hosting provider a few states away. I do have serial console access (that is logged), though, and this is actually the only way I was able to see the reasons for the panics and hangs, as no logs could be written to the disk(s) when they're detached.
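
Since the serial console log is the only surviving record, one way to get more out of it is to summarize each capture automatically. The sketch below is an illustration, not an existing tool: the function name and the sample lines (copied from the logs in this post) are the only inputs it assumes. It counts g_vfs_done failures per disk, operation, and error number, and lists disks reported as detached. Error 6 on FreeBSD is ENXIO ("Device not configured").

```python
import re
from collections import Counter

# "ad8: detached" (but not "subdisk8: detached")
DETACH_RE = re.compile(r"\b(ad\d+): detached")
# The g_vfs_done prefix is not required, because console output is sometimes
# interleaved and the prefix ends up on a different line.
IO_ERROR_RE = re.compile(
    r"(ad\d+)s\d+[a-h]\[(READ|WRITE)\(offset=(\d+), length=(\d+)\)\]error = (\d+)"
)

SAMPLE = [
    "subdisk8: detached",
    "ad8: detached",
    "g_vfs_done():ad8s3d[READ(offset=27569928192, length=2048)]error = 6",
    "g_vfs_done():ata4: FAILURE - already active DMA on this device",
    "ad8s1a[READ(offset=192741376, length=16384)]error = 6",
    "g_vfs_done():ad8s3d[WRITE(offset=26008551424, length=16384)]error = 6",
]


def summarize(lines):
    """Return (detached disks in order seen, Counter of (disk, op, errno))."""
    detached = []
    errors = Counter()
    for line in lines:
        match = DETACH_RE.search(line)
        if match and match.group(1) not in detached:
            detached.append(match.group(1))
        for match in IO_ERROR_RE.finditer(line):
            disk, op, _offset, _length, err = match.groups()
            errors[(disk, op, int(err))] += 1
    return detached, errors


if __name__ == "__main__":
    disks, counts = summarize(SAMPLE)
    print("detached:", disks)
    for key, count in sorted(counts.items()):
        print(key, count)
```

Running this over each logged console session would at least show which disk detached first and whether reads or writes failed before the hang or panic.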

I submitted a bug for this quite a while back, and it hasn't been touched:

http://www.freebsd.org/cgi/query-pr.cgi?pr=129426

I suspect this is due to the lack of information I'm able to provide, and no way of reproducing the problem on demand.

This box performs a variety of tasks: web/DNS/jabber server, IPv6 router, IPv4/IPv6 firewall, VPN termination, etc.

Any pointers on where I should look next? Ideas?

Thanks in advance!

- Mark

jb_fvwm2 · Jun 30, 2010

I have a pci-to-sata controller that
cannot reliably transfer data to the
target sata drive as fast as freebsd
can write to it. The resulting data
corruption hoses the bsdlabel etc
often.
.....
In your case, a dodgy sata chipset?
.....
What I've done is use the sata disk
instead for backups (the bwlimit
parameter in rsync can throttle the
9000 to 1000 ...) which makes the
sata disk, and controller, perfectly
reliable again.
....
Maybe you want some other chipset
controller on the box (SAS card,
scsi card, ide controller) to make
the sata problem a non-issue hopefully?

Prolixium · Jun 30, 2010

The SATA controller is Intel. I doubt it's dodgy, and if it is, I wouldn't know what else to use that would be better. (If this were a Silicon Image SATA controller or something else, I'd certainly be more suspicious.)

These detaches don't seem to be correlated with high disk I/O at all. Sometimes they happen when the box is sitting idle, and sometimes just in the middle of the night (nothing cron-related happening).

- Mark

jb_fvwm2 · Jun 30, 2010

Try tuning the box for torrents? Your first
post has an indication that might be the error.
(Or tuning the torrent /port/)...
Power supply, cable... stuff running in
cron that eats inodes suddenly...

Prolixium · Jul 1, 2010

I keep rTorrent running all the time, but torrents are rarely being processed, and that was the case during the last detach event. As I stated, the inodes problem is an erroneous result of the disk being detached; the filesystem is not out of inodes. Also, if rTorrent (or anything like it) is going to tank a box like this, I think there may be bigger problems in the FreeBSD world!

I can guarantee that the power supply and cables are not the problem. I mentioned in the original post that the box had /all/ hardware replaced. (actually, the motherboard was replaced /twice/, but that was for another non-hardware related issue)

- Mark

jb_fvwm2 · Jul 1, 2010

A few sysctl's to check in
the freebsd-questions list:
Volume 285 Issue 14
Volume 286 Issues 06; 09; 14
....
Also, if you are running pf ... maybe that
needs tuning
....
from a limited search of data here locally.
All I had time for.
...

roddierod · Jul 2, 2010

Prolixium said:

Hi -

Over the last couple years or so, I've been having issues every month or so (sometimes every week) where one of the two SATA disks in my FreeBSD box becomes detached. ...

Is it possible that the disk has bad sectors, since it is a few years old? I had the same kind of issue with a SCSI disk, with these types of errors. I ran the controller's SCSI utility to verify the disk, and it found a bad sector and remapped it. I haven't had the problem since. I'm not versed in SATA technology, so I'm not sure if you can do the same.

Prolixium · Jul 2, 2010

roddierod said:
Is it possible that the disk has bad sectors, since it is a few years old? I had the same kind of issue with a SCSI disk, with these types of errors. I ran the controller's SCSI utility to verify the disk, and it found a bad sector and remapped it. I haven't had the problem since. I'm not versed in SATA technology, so I'm not sure if you can do the same.

If these counters are reliable, SMART reports no errors on either disk. Here's an example from ad12:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   223   187   021    Pre-fail  Always       -       3808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       26242
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
190 Airflow_Temperature_Cel 0x0022   058   055   045    Old_age   Always       -       42
194 Temperature_Celsius     0x0022   108   105   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

I'm not sure if there's a badblocks-type application on FreeBSD, but I suppose I can cat /dev/ad{8,12} > /dev/null to read every sector on both disks, and see if anything pops up in the kernel buffer.
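
A rough sketch of the same read-everything idea that also records where reads fail, rather than relying only on the kernel buffer. The function name, chunk size, and error cap are illustrative assumptions; it would need to run as root against a device path such as /dev/ad12. The cap matters here: if the disk detaches mid-scan, every later read fails, and without a limit the loop would never reach end of device.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024, max_errors=16):
    """Read `path` front to back; return (bytes_read, offsets_that_failed)."""
    bad_offsets = []
    bytes_read = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Stop early once max_errors chunks have failed.
        while len(bad_offsets) < max_errors:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # Record the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # end of device or file
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets


if __name__ == "__main__":
    import sys

    total, bad = scan_readable(sys.argv[1])
    print(f"read {total} bytes, {len(bad)} failing chunk(s): {bad}")
```

A read-only pass like this exercises only the read path, so it cannot surface failures that occur only on writes.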

It's also been suggested I watch the "Power_Cycle_Count" and see if it increments when there's a panic or detach event, to rule out any mains power issues. I'm currently watching for that, too.
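
Watching that counter could be scripted along these lines. This is a sketch under assumptions: the attribute table is taken to come from smartmontools (for example `smartctl -A /dev/ad12`), the sample rows are copied from the output above, and the function names are made up for illustration. It parses the table into a name-to-raw-value mapping and compares two snapshots.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       26242
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for each table row."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows have 10+ columns and start with a numeric ID; the header
        # row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attributes[fields[1]] = int(fields[9])
            except ValueError:
                continue  # some drives report non-integer raw values
    return attributes


def power_cycled(before, after):
    """True if Power_Cycle_Count rose between two parsed snapshots."""
    return after["Power_Cycle_Count"] > before["Power_Cycle_Count"]


if __name__ == "__main__":
    snapshot = parse_smart_attributes(SAMPLE)
    print("Power_Cycle_Count:", snapshot["Power_Cycle_Count"])
```

Taking one snapshot at boot and another after a detach event, then passing both to `power_cycled`, would show whether the drive itself lost power.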

- Mark

fgordon · Jul 7, 2010

Hmmm, it seems there are also write errors, so you probably won't find them when reading.

It's bad, but I've had this before: hard disks were OK according to S.M.A.R.T., but they failed when writing (e.g. with dd), while S.M.A.R.T. still sometimes reported "OK".....

Though one does expect S.M.A.R.T to work perfectly it does not on some drives - this is one reason I finally switched to ZFS....

Prolixium · Sep 26, 2010

Just in case anyone is still looking at this thread.. I've actually purchased a new dedicated server from my hosting provider. Completely different system (Xeon vs. Core 2), different chassis, different disks, and even in a different part of their DC.

I thought I was in the clear, but disks detached earlier this morning after a week or so of uptime. Same issue. I'm on 8.1-STABLE (built from sources on Tue Sep 14 01:03:07 EDT 2010) at this point.

I'm thinking of editing the ATA driver source to ignore errors that result in a disk detach. I suppose this is the worst idea ever, but I'm out of options, unless I want to switch OSes.

- Mark