ICH 7 and Samsung HD103SI problems

von_Gaden · Aug 20, 2011

Hi all!

I am experiencing SATA problems with a Gigabyte G41MT-S2 motherboard and two Samsung HD103SI 1AG01118 hard drives.
Possibly under heavy load (I am not sure, it's still a test system) the SATA controller and/or the disks block, and the system either panics or just keeps logging timeouts for the SATA drives indefinitely.

This is part of my dmesg:

Code:

atapci0: <Intel ICH7 SATA300 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf800-0xf80f at device 31.2 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
.......
est0: <Enhanced SpeedStep Frequency Control> on cpu0
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr a280a2806000a28
device_attach: est0 attach returned 6
p4tcc0: <CPU Frequency Thermal Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr a280a2806000a28
device_attach: est1 attach returned 6
p4tcc1: <CPU Frequency Thermal Control> on cpu1
Timecounters tick every 1.000 msec
...........
ad0: 305245MB <Hitachi HDP725032GLA360 GM3OA52A> at ata0-master UDMA100 SATA
ad1: 953869MB <SAMSUNG HD103SI 1AG01118> at ata0-slave UDMA100 SATA
ad2: 953869MB <SAMSUNG HD103SI 1AG01118> at ata1-master UDMA100 SATA
acd0: DVDR <Optiarc DVD RW AD-7200S/1.83> at ata1-slave UDMA100 SATA
.........

I have the following file systems mounted:

Code:

/dev/ufs/root on / (ufs, local, noatime) /ad0p2/
devfs on /dev (devfs, local, multilabel)
/dev/ufs/tmp on /tmp (ufs, local, noatime, noexec, soft-updates) /ad0p4/
/dev/ufs/usr on /usr (ufs, local, noatime, soft-updates) /ad0p6/
/dev/ufs/var on /var (ufs, asynchronous, local, noatime, gjournal) (mirrored gjournal with journal on ad0, mirror ad1p3,ad2p3)
/dev/ufs/vartmp on /var/tmp (ufs, local, noatime, noexec, soft-updates) /ad0p3/

mount /dev/ufs/testufs1 /mnt/test (ad1p2) - UFS, soft updates

As I try to test transfer speed:

Code:

dd if=/dev/zero of=/var/tmp/test bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes transferred in 15.546602 secs (69066013 bytes/sec)
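
To compare this figure with other runs, the bytes and seconds that dd reports can be converted to MiB/s. A minimal sketch in sh; the helper name `rate_mib` is made up for illustration and is not part of dd or FreeBSD:

```shell
# rate_mib BYTES SECONDS
# Prints the throughput in MiB/s (1 MiB = 1048576 bytes), one decimal place.
# rate_mib is a hypothetical helper, not a standard utility.
rate_mib() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1048576 }'
}
```

With the numbers above, `rate_mib 1073741824 15.546602` prints 65.9, so this run wrote at roughly 66 MiB/s.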

The test always passes for ad0 and for the journaled /dev/ufs/var.
But when I try

Code:

dd if=/dev/zero of=/mnt/test/test bs=1G count=1

the ATA controller and/or the disks always block. It even destroyed one of the gjournal mirror providers (ad1p3).
I tried to mirror ad1p2 and ad2p2, and the result was the same: the system halted and the mirror was broken.

I saw a similar thread for Marvell SATA controller:
http://forums.freebsd.org/showthread.php?t=20412&highlight=ich7
but I am not sure: should I use FreeBSD 8-STABLE instead of 8.2-RELEASE, or do we just have a problem with the slower Samsung drives?
Some details about the drives can be found here:
http://www.samsung.com/ae/consumer/...op-sata/HD103SI/index.idx?pagetype=prd_detail

Thanks in advance!

mav@ · Aug 20, 2011

Speaking of system panics, it would be nice if you showed the error messages.

wblock@ · Aug 20, 2011

Even if you have a lot of memory, using 1G as a buffer size is probably a mistake. dd may allocate more than one buffer, and the system might start swapping. Better to use something manageable, like 1024 blocks of 1m.
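
As a rough sketch of that suggestion, the same 1 GiB can be written as 1024 blocks of 1 MiB. The function name `write_test` and its arguments are made up for illustration; the block size is given in bytes so the command works with both BSD dd (which accepts bs=1m) and GNU dd (which spells it bs=1M):

```shell
# write_test TARGET COUNT
# Writes COUNT blocks of 1 MiB of zeros to TARGET using a small buffer.
# write_test is a hypothetical wrapper; only dd itself is a real command here.
write_test() {
    target="$1"   # file on the filesystem under test
    count="$2"    # number of 1 MiB blocks (1024 blocks = 1 GiB)
    dd if=/dev/zero of="$target" bs=1048576 count="$count"
}
```

For the case in this thread that would be `write_test /mnt/test/test 1024`. dd then only needs a 1 MiB buffer instead of a 1 GiB one.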

noatime and noexec are nonstandard, but should not cause this type of problem.

von_Gaden · Aug 21, 2011

Well, I'm using dd just to test transfer speed, and I've used the same bs on a lot of systems, but none of them blocked this way.
As for the panics, I'll try to copy the message (it's not written anywhere on the disks, because they become inaccessible)...
You may find it strange that I'm using mostly AMD platforms for my servers and I've never had such problems.
An interesting fact is that with the first drive (Hitachi, 7200 rpm) everything works fine; the problem occurs with the Samsung drives (5400 rpm, I think).

wblock@ · Aug 21, 2011

Would this be a bad time to mention that of the last four drive failures I've seen, three have been Samsung? Was the Hitachi okay on the same controller? Then blame the Samsungs.

(PS: nothing strange about using AMD on servers. They are fine.)

von_Gaden · Aug 21, 2011

I would blame Samsung, but these drives are new; I unpacked them myself. And the Hitachi is 2 years old. I should probably run some Windows or DOS HDD tests. If they pass, I will try FreeBSD on a different platform.
(My last 4 HDD failures were Hitachi because I used and distributed to my customers ONLY Hitachi HDDs for the past 5-6 years...)

tingo · Aug 21, 2011

There isn't enough information provided - as a result we can't help you determine what the problem is.
In general, nothing of what you have described should give timeouts / lock up things.
If one drive works, try swapping its SATA cable onto one of the other drives; it might be a bad cable.
Or it could be bad hardware somewhere else. If possible, try the Samsung hard drives in another machine.

von_Gaden · Aug 21, 2011

Thanks for all your advice!
I've already tried most of the possible solutions; the cables are surely OK, as they are usually Suspect #1.
All components, such as the mainboard and the Samsung HDDs (except the well-working Hitachi HDD), are new, which does not necessarily mean they are working fine. Since there are no known issues with FreeBSD and ICH7 or Samsung HDDs, I'll do my best over the next few days to test everything to confirm a hardware failure or incompatibility, and I'll report back here.

I suppose the problem occurs only under "heavier" load: the gjournaled mirror, with its journal on a partition of the Hitachi drive, has NO PROBLEMS (transfer speed about 29 MB/sec). The mirror breaks only because of the SATA halt when accessing a non-journaled mirror or just a single partition on the same Samsung drives.

wblock@ · Aug 21, 2011

Samsung has test software for their drives. I couldn't find any firmware updates.
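
Besides Samsung's own tool, the drive's built-in SMART self-test can be started from FreeBSD itself. The sketch below assumes smartmontools (`smartctl`) is installed, which nobody in this thread has confirmed; the function name `smart_check` is made up, and the device name in the usage note is taken from the dmesg output earlier in the thread:

```shell
# smart_check DEVICE
# Prints the drive's overall SMART health verdict, then starts the
# extended (long) self-test. Assumes smartctl from smartmontools is
# installed; smart_check itself is a hypothetical wrapper.
smart_check() {
    dev="$1"
    smartctl -H "$dev"         # overall health self-assessment
    smartctl -t long "$dev"    # start the extended self-test on the drive
}
```

Run as root, for example `smart_check /dev/ad1`. The long test runs inside the drive and can take hours on a 1 TB disk; its result can be read afterwards with `smartctl -l selftest /dev/ad1`.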

von_Gaden · Aug 23, 2011

I've not tested on another system yet, but here is my PANIC:

Code:

unknown: TIMEOUT - WRITE_DMA retrying (1 retry left)g_vfs_done(): LBA=774946ufs
testufs1[WRITE(offset=396820480, length=131072)]
error=6
/mnt/test: got error 6 while accessing filesystem
panic: softdep deallocate dependencies: unrecovered I/O error
cpuid=1
KDB: stack backtrace:
.....

The cables are OK, and the HDD tests pass too. I think the SATA controller blocks, because after an error and a reboot the system sometimes cannot boot (the boot drive is the "well-working" Hitachi) and displays DISC BOOT FAILURE. It boots again after a power cycle or sometimes a reset.
I also found a very similar thread here:
http://forums.freebsd.org/showthread.php?t=25844

von_Gaden · Aug 30, 2011

Solved!

After a lot of testing I found that one of my new 1 TB Samsung HDDs is defective! It was difficult to prove because most of the tests and the SMART status were OK, so the drive seemed to work. I had to undo the mirror and check each drive separately to see when the panics occurred. I noticed that one of the drives passed this "FreeBSD test" and the other always failed with panics. Only a very long surface test displayed an error.
Excuse me for bothering all of you!
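
For anyone repeating the per-drive isolation described above, a read-only pass over each raw device is one way to check the drives one at a time. This is only a sketch of the idea, not the exact test used in this thread (the failures here showed up on writes), and the function name `scan_drives` is made up:

```shell
# scan_drives DEVICE...
# Reads each device from start to end and reports whether dd finished
# without an I/O error. Read-only: nothing is written to the drives.
# scan_drives is a hypothetical helper built on plain dd.
scan_drives() {
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=1048576 2>/dev/null; then
            echo "$dev: read OK"
        else
            echo "$dev: read FAILED"
        fi
    done
}
```

With the device names from the dmesg output this would be `scan_drives /dev/ad1 /dev/ad2`, run as root. A full pass over a 1 TB drive takes hours, and if the controller locks up as it did here, the machine may hang or panic before the script can report anything.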