FAILURE - READ_DMA48 issues with new SATA drives on FreeBSD 8.1-REL

Hello!

I have been working to get my new file server up and running this weekend. It has not been going very well. Any help you can offer is greatly appreciated!

The hardware is an Intel D945 Desktop Board with a SIGG 2-port PCI SATA controller. I have 2 IDE drives, 3 Hitachi 1TB drives (brand new), 2 Samsung 1TB drives (brand new) and 1 Seagate 1.5TB drive. I know the SIGG controller works because it was pulled from my prior server, which was working (the Seagate 1.5TB drive and the IDE drives are from that server as well).

I have tried the following items with no luck:

1. Having only 1 drive connected at a time with different cables
2. Removing the SIGG controller
3. Using only the SIGG controller
4. Different power supply
5. New SATA cables
6. AHCI (didn't detect anything)

The errors seen are:

Code:
Oct 10 22:36:33 titan kernel: ad0: 29325MB <Maxtor 6E030L0 NAR61590> at ata0-master UDMA100 
Oct 10 22:36:33 titan kernel: ad1: 190782MB <Seagate ST3200822A 3.01> at ata0-slave UDMA100 
Oct 10 22:36:33 titan kernel: ad4: 953869MB <SAMSUNG HD103SJ 1AJ10001> at ata2-master UDMA100 SATA 1.5Gb/s
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525151
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525164
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525105
Oct 10 22:36:33 titan kernel: ad6: 953869MB <SAMSUNG HD103SJ 1AJ10001> at ata3-master UDMA100 SATA 1.5Gb/s
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525151
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525164
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525105
Oct 10 22:36:33 titan kernel: ad8: 953869MB <Hitachi HDS721010CLA332 JP4OA25C> at ata4-master UDMA100 SATA
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525165
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525151
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525164
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525105
Oct 10 22:36:33 titan kernel: ad9: 953869MB <Hitachi HDS721010CLA332 JP4OA25C> at ata4-slave UDMA100 SATA
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525165
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525151
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525164
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad9: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525105
Oct 10 22:36:33 titan kernel: ad10: 953869MB <Hitachi HDS721010CLA332 JP4OA25C> at ata5-master UDMA100 SATA
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525165
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525151
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525164
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525167
Oct 10 22:36:33 titan kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=1953525105
Oct 10 22:36:33 titan kernel: ad11: 1430799MB <Seagate ST31500341AS CC1H> at ata5-slave UDMA100 SATA

Here is the full info from boot up http://pastebin.com/8WdEPrD5 (sorry, it was too long to post here).

I don't get any error messages other than what is listed. If I attempt to create a zpool or do anything else with the 1TB drives, the READ_DMA48 and READ_DMA errors show up.

Thanks for any assistance.

-Penta
 
da1 said:
Looks weird. Is the N/S bridge ok ?
Did you use this mobo on another system ?

This motherboard was used with a Windows XP and Windows Vista system. Neither had any issues.

I am half tempted to install Windows Server just to see if I encounter any issues there.

-Penta
 
I've been having the same problem. Ever since I updated from FreeBSD 8.0 to 8.1, my Samsung HDD (and only that one) times out on reads. With 8.1 head I now get a kernel panic. I have a 6-port SATA AMD SB710-based board with five 1TB HDDs running in RAID 3. The Samsung is the parity drive; the 4 others are Seagates. I tried plugging the Samsung into port 4 instead of 5, which makes no difference. This is a very annoying problem, and now, with the kernel panic, it even crashes the computer for good. I suspect some change during the complete ATA framework rewrite that happened between 8.0 and 8.1. I like how dummynet works again in 8.1, but if this goes on, I will have to go back to 8.0 or abandon FreeBSD as my server OS. And that after 10 years. I will try something else first, though:

Note that the HDD does not come back after a mere reset, only after a power cycle. Considering there was a firmware update for Samsung HDDs (which does not apply to this one), it might be the HDD, and I will swap it soon. It's still annoying.
 
pentafive said:
This motherboard was used with a Windows XP and Windows Vista system. Neither had any issues.

I am half tempted to install Windows Server just to see if I encounter any issues there.

-Penta

But then again, Windows does silently pass over many things.

Whenever I had similar problems they were because of either hdd, cable, dirty pci slot, etc. What you are experiencing sounds to me like a mobo issue.

Here is what I would do:
1) run diagnostics (smartctl, diskinfo, atacontrol cap, manufacturer hdd control apps - to get some data in order to have a basic understanding of the situation)
2) move the hdd's + controller to another (working) pc and install FreeBSD there (to try to reproduce the error)
3) check the mobo for burns, leaks, etc
4) upgrade BIOS (mobo + controller - if appropriate)
5) install other *nix OS to see if it repeats

In my opinion, there is some hw problem. Of course, there can be 2 million other reasons for this to happen, but atm I don't know ...
 
Reseat a loose cable or card? First, though, I'd check each drive with smartmontools; the Reallocated Sector Count raw value should be zero...
Code:
 smartctl -s on /dev/ad6 && smartctl -t short /dev/ad6 && sleep 60 && smartctl -a /dev/ad6
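
To repeat that check across several drives, something like the following sh sketch could pull out just the raw Reallocated Sector Count. The function name `reallocated_raw` is made up for illustration, and it assumes the usual `smartctl -A` attribute table, where the attribute name is the second column and the raw value is the last one.

```sh
#!/bin/sh
# Hypothetical helper: read "smartctl -A" output on stdin and print the raw
# value (last column) of the Reallocated_Sector_Ct attribute.
reallocated_raw() {
    awk '$2 == "Reallocated_Sector_Ct" { print $NF }'
}

# Pass the devices to check as arguments, e.g.: sh check.sh /dev/ad4 /dev/ad6
# With no arguments the loop does nothing.
for dev in "$@"; do
    raw=$(smartctl -A "$dev" | reallocated_raw)
    echo "$dev: Reallocated_Sector_Ct raw value = ${raw:-unknown}"
done
```

Anything other than 0 in that column is worth a closer look with the full `smartctl -a` output.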

I know I needed to load additional ko's (geom_label, geom_bsd, geom_mbr) between 7.0 and 8.0; I don't know if that is applicable here.
 
This sounds familiar, I had some problems which look like this when I had to resilver a disk in my ZFS storage. When I am home I can check the logfiles if it is still to be found.

The resilver stopped every few minutes with errors and continued after the timeout. It was then completed from the fixit DVD without problems. After that, I removed AHCI from the kernel, after which the problem went away. Maybe you could try that? With AHCI, it can be seen under heavy load or while scrubbing. I will write more details when I am home.

My system is 8.1-releng (sp?), 8GB, amd64 quad, HDs are Samsung.
 
OK, I was home and, as promised, had a nice time spelunking in the logfiles from last year and the current SMART information. My home server is running 8.1-STABLE. The resilver seemed to cause problems with the NCQ of the AHCI, leading to overflows and/or timeouts on slots. The disk on that channel does not show any problems. One other disk had logged errors which were caused by a displaced cable connector that was moved when I closed the casing. These reminded me of the READ_DMA48 error but were of a different kind. Since I see no real difference in performance when I compare plain ATA against AHCI, I will keep it that way for the time being.

So, whoever uses soundproofing on the case, please spare the area where the connectors are close by. You will likely press the foam against the cable and shift it when you close the side panel.
 
Original poster (pentafive), are you still here? Or I'll mark this one [solved].
 
I've picked up a 2TB Hitachi drive today and it too produces these "TIMEOUT" messages involving READ_DMA48.

The hard drive when used in the very same system under NetBSD 5.1 or Windows 7 does not produce any such (error) message.

I'm thus inclined to believe that this is a FreeBSD problem, and that the fix needs to be backported to RELENG_8 if/when it lands in HEAD.

I will further add that I was able to update the MBR on the disk from FreeBSD (using fdisk) and also read from the start using dd.

The error messages are always for an LBA number that is in the last cylinder.

size = 3,907,029,168

LBA errors at: 3907028727, 3907029151, 3907029164, 3907029165, 3907029167

If there was a hardware problem (cable, etc) then it would manifest itself at random blocks, including those at the start. That is not the behaviour that I'm observing.
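
To put numbers on that, here is a quick sh sketch that computes how far each failing LBA sits from the end of the disk, using only the size and LBAs quoted in this post. The function name `sectors_from_end` is just an illustrative helper, not part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: distance in sectors from a failing LBA to the end of
# the disk. LBAs start at 0, so LBA total-1 is the last sector and gives 1.
sectors_from_end() {
    total=$1
    lba=$2
    echo $((total - lba))
}

total=3907029168   # sector count reported for the 2TB drive
for lba in 3907028727 3907029151 3907029164 3907029165 3907029167; do
    echo "LBA $lba: $(sectors_from_end "$total" "$lba") sectors from the end"
done
```

The distances come out as 441, 17, 4, 3 and 1 sectors, so every failure falls within the last 441 sectors of the disk, and the highest failing LBA is the very last sector.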

See also:
http://www.freebsd.org/cgi/query-pr.cgi?pr=143805&cat=
http://www.freebsd.org/cgi/query-pr.cgi?pr=116270&cat=

I've opened a bug for this:

http://www.freebsd.org/cgi/query-pr.cgi?pr=151447
 
For me, this problem was resolved by upgrading VMWare Workstation from 6.5.3 to 7.1.2.

In addition, when I ran FreeBSD on the bare metal, the problem did not show up.
 
Sorry for the late reply. I have been busy getting new hardware and testing various things.

After trying several different versions of *nix and tons of tools, along with a new motherboard and controllers, it turns out the issue is a drive problem. Yes, a drive issue with ALL 5 of the new HDDs from 2 different manufacturers! It looks like all of the drives are locked with ATA Security. I don't have any clue how this happened. The last time I messed with ATA Security was with my old XBOX!

I patched FreeBSD with an atacontrol patch that lets me check ATA Security; here is the link and the output from one of my HDDs.

http://www.roe.ch/ATA_Security

Code:
titan# atacontrol security ad5 
Security supported        yes
Security enabled          yes
Drive locked              yes
Security config frozen    no
Count expired             no
Security level            high
Enhanced erase supported  yes
Erase time                152 min
Enhanced erase time       152 min
Master password rev       fffe
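
For anyone who wants to check several drives this way, a small sh sketch along these lines could flag the locked ones. It assumes the patched `atacontrol security` prints plain fields in the layout shown above; `is_locked` is a made-up helper name, and the sample text is copied from that output.

```sh
#!/bin/sh
# Hypothetical helper: read "atacontrol security <dev>" style output on stdin
# and exit 0 only if the "Drive locked" field says "yes".
is_locked() {
    awk '/^Drive locked/ { if ($NF == "yes") found = 1 } END { exit !found }'
}

# Sample lines copied from the output above. On a system with the patch
# applied, the real command could be piped in instead:
#   atacontrol security ad5 | is_locked
sample='Security supported        yes
Security enabled          yes
Drive locked              yes
Security config frozen    no'

if printf '%s\n' "$sample" | is_locked; then
    echo "drive is locked"
else
    echo "drive is not locked"
fi
```

Keying on the field name rather than the line position keeps the check working if the patch ever prints the fields in a different order.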

I spoke with Samsung and Hitachi and they issued RMAs for the HDDs. That's all I can do right now. There are several tools out there like ATAPWD, MHDD and HDDErase, but I am having issues with SATA controller support.

Anyway - there you have it!

Thanks for all of the replies!

-Pentafive
 