HDD timeout on ZFS + SB7x0/SB8x0/SB9x0 SATA Controller

Beeblebrox · Nov 3, 2011

New HDD, new board (Biostar A780L). For some time the SATA HDD has been giving odd timeout and "connection lost" errors, and they have become increasingly serious. The error is:

Code:

swap_pager: indefinite wait buffer: bufobj: 0, blkno: 32262, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 66056, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 82746, size: 8192
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 44091, size: 4096
ahcich0: Timeout on slot 29 port 0
ahcich0: is 00000000 cs 000000ff ss e00000ff rs e00000ff tfd c0 serr 00000000 cmd 0004e017
ahcich0: AHCI reset...
ahcich0: SATA connect time=100us status=00000123
ahcich0: AHCI reset: device found
(ada0:ahcich0:0:0:0): Command timed out
(ada0:ahcich0:0:0:0): Retrying command
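
Editorial sketch for readers decoding the `ahcich0: is ... cs ... ss ... rs ...` line above: the hex values are 32-bit masks with one bit per command slot. The Python below (not part of the original post) lists which slots are set. It assumes that `cs` is the port's command-issue register, `ss` the SATA-active register, `rs` the driver's own record of running slots, and that the low byte of `tfd` is the ATA status byte. Those field meanings are an assumption about the driver's output and should be checked against the ahci(4) source; the function names are made up for illustration.

```python
import re

def parse_ahci_status(line):
    """Return {field: int} for an 'ahcichN: is ... cs ... cmd ...' line."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-fA-F]+)\b", line)
    return {name: int(value, 16) for name, value in pairs}

def set_slots(mask):
    """List the slot numbers (0-31) whose bit is set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]

line = ("ahcich0: is 00000000 cs 000000ff ss e00000ff rs e00000ff "
        "tfd c0 serr 00000000 cmd 0004e017")
status = parse_ahci_status(line)

# Assumed meaning: slots the driver still considers in flight.
print("running slots:", set_slots(status["rs"]))
# Assumed meaning: bit 7 of the ATA status byte is BSY (device busy).
print("device busy:", bool(status["tfd"] & 0x80))
```

For the line posted above, this reports slots 0-7 and 29-31 as set in `rs`, which includes slot 29 named in the timeout message, and the busy bit set in `tfd`.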

The controller is (from pciconf -lvc):

Code:

ahci0@pci0:0:17:0:	class=0x010601 card=0x43911002 chip=0x43911002 rev=0x00 hdr=0x00
vendor     = 'ATI Technologies Inc'
device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
class      = mass storage
subclass   = SATA
cap 01[60] = powerspec 2  supports D0 D3  current D0
cap 05[50] = MSI supports 4 messages, 64 bit 
cap 12[70] = SATA Index-Data Pair

Available BIOS settings for SATA are:

Code:

OnChip SATA Options: Native IDE (Default) / RAID / AHCI / Legacy IDE / IDE->AHCI

The previous setting was Native IDE (Default), but I changed it to AHCI after reading a thread about a timeout problem that occurs only at startup. loader.conf loads AHCI and cuse4bsd. The problem shows up especially under combined heavy load (building several ports simultaneously, especially when those ports are large builds). My memory is also a little short (1 GB), and I accept this could be a contributing factor.

Beeblebrox · Nov 10, 2011

This problem is now very serious:

I first blamed the hardware, then blamed my custom kernel. In fact, my system froze on me when I was correcting the previous post! So I decided to build a GENERIC kernel with full debug enabled. Then I started updating some ports (with MAKE_JOBS_NUMBER=3).

I now get the same timeout error (it shows up on tty0) during a port build, and to the build it looks like:

Code:

uilib/ui4.cpp:10090: internal compiler error: in ggc_set_mark, at ggc-page.c:1285

Look at what it's doing to my brand-new HDD (from SMART):

Code:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   100   051    -    0
  3 Spin_Up_Time            POS---   093   093   011    -    3150
  4 Start_Stop_Count        -O--CK   100   100   000    -    38
  8 Seek_Time_Performance   P-S--K   100   100   015    -    9871
  9 Power_On_Hours          -O--CK   100   100   000    -    256
 12 Power_Cycle_Count       -O--CK   100   100   000    -    38
195 Hardware_ECC_Recovered  -O-RC-   100   100   000    -    12555553
199 UDMA_CRC_Error_Count    -OSRCK   099   099   000    -    27
200 Multi_Zone_Error_Rate   -O-R--   100   100   000    -    5

Yesterday I rebuilt world before kernel and the severity of the timeout problem has gone down: it has now timed out only twice! Before that, I had rebuilt world maybe 5-6 days ago, and the severity had gone up so much that I was getting not only long-lasting timeouts but also 5-6 system freezes per day, most of them requiring a hard reset! Before booting into the debug kernel, I also switched the BIOS setting for the HDD back from AHCI to Native IDE.

This error is a killer! I am starting to suspect a ZFS issue and thinking I should move root to UFS... Here are the debug-enabled kernel messages about the error. From dmesg, immediately at boot:

Code:

kernel: lock order reversal:
kernel: 1st 0xfffffe0010598248 filedesc structure (filedesc structure) @ /asp/src/sys/kern/kern_descrip.c:1197
kernel: 2nd 0xfffffe001052ccf0 zfs (zfs) @ /asp/src/sys/kern/vfs_subr.c:4245
kernel: KDB: stack backtrace:
kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kernel: kdb_backtrace() at kdb_backtrace+0x37
kernel: _witness_debugger() at _witness_debugger+0x65
kernel: witness_checkorder() at witness_checkorder+0x833
kernel: __lockmgr_args() at __lockmgr_args+0xd9d
kernel: vop_stdlock() at vop_stdlock+0x39
kernel: VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
kernel: _vn_lock() at _vn_lock+0x68
kernel: knlist_remove_kq() at knlist_remove_kq+0xfc
kernel: knote_fdclose() at knote_fdclose+0x177
kernel: kern_close() at kern_close+0xe8
kernel: amd64_syscall() at amd64_syscall+0x27b
kernel: Xfast_syscall() at Xfast_syscall+0xf7
kernel: --- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8015abcdc, rsp = 0x7fffffffd868, rbp = 0x801807b20 ---

And the most recent error:

Code:

kernel: lock order reversal:
kernel: 1st 0xfffffe0018e3a448 filedesc structure (filedesc structure) @ /asp/src/sys/kern/kern_descrip.c:1197
kernel: 2nd 0xfffffe0004533cf0 devfs (devfs) @ /asp/src/sys/kern/vfs_subr.c:4245
kernel: KDB: stack backtrace:
kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kernel: kdb_backtrace() at kdb_backtrace+0x37
kernel: _witness_debugger() at _witness_debugger+0x65
kernel: witness_checkorder() at witness_checkorder+0x833
kernel: __lockmgr_args() at __lockmgr_args+0xd9d
kernel: vop_stdlock() at vop_stdlock+0x39
kernel: VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
kernel: _vn_lock() at _vn_lock+0x68
kernel: knlist_remove_kq() at knlist_remove_kq+0xfc
kernel: knote_fdclose() at knote_fdclose+0x177
kernel: kern_close() at kern_close+0xe8
kernel: amd64_syscall() at amd64_syscall+0x27b
kernel: Xfast_syscall() at Xfast_syscall+0xf7
kernel: --- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8015abcdc, rsp = 0x7fffffffd498, rbp = 0x80180e230 ---
kernel: lock order reversal:
kernel: 1st 0xfffffe0018e3a448 filedesc structure (filedesc structure) @ /asp/src/sys/kern/kern_descrip.c:1197
kernel: 2nd 0xfffffe000b4f4a78 pseudofs (pseudofs) @ /asp/src/sys/kern/vfs_subr.c:4245
kernel: KDB: stack backtrace:
kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kernel: kdb_backtrace() at kdb_backtrace+0x37
kernel: _witness_debugger() at _witness_debugger+0x65
kernel: witness_checkorder() at witness_checkorder+0x833
kernel: __lockmgr_args() at __lockmgr_args+0xd9d
kernel: vop_stdlock() at vop_stdlock+0x39
kernel: VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
kernel: _vn_lock() at _vn_lock+0x68
kernel: knlist_remove_kq() at knlist_remove_kq+0xfc
kernel: knote_fdclose() at knote_fdclose+0x177
kernel: kern_close() at kern_close+0xe8
kernel: amd64_syscall() at amd64_syscall+0x27b
kernel: Xfast_syscall() at Xfast_syscall+0xf7
kernel: --- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8015abcdc, rsp = 0x7fffffffd498, rbp = 0x80180e230 ---

jb_fvwm2 · Nov 10, 2011

Cannot suggest much, but what I can suggest...

1. Keep a log of Reallocated_Sector_Count (the last value) and the last four attributes in that smartctl list (or the last six, 195-200 in this case?), and be wary if they are increasing. (BTW, if you have any valuable data on the disk, do you have backups? Some failing disks show only one or two write timeouts before the disk is lost for good at shutdown.)

2. You can restart port builds.

Code:

cd /usr/ports/lang/gcc46
# does work/ already exist?
make build && yell || yell  # yell or ttyload or some other notification, just restart the build.  Most times it may complete without error.
# OTOH if you really like the above command, you can automate it:
....
make build && yell || yell && make build || yell && make build || yell && make build || yell && yell
# (If your motherboard supports audio/yell, or you can craft one from a snippet of mp3 file and an alias).
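
Editorial sketch of the first suggestion above (keeping a log of SMART attributes and watching for increases). The Python below parses rows in the layout posted earlier in the thread and reports which watched raw values went up between two snapshots. The function names are hypothetical, the parser assumes the raw value is a plain integer in the last column (which is not true of every attribute or drive), and the two snapshots use only figures that appear in the thread.

```python
def parse_smart_table(text):
    """Map attribute name -> raw value for rows like the smartctl table above."""
    values = {}
    for row in text.splitlines():
        cols = row.split()
        # Data rows start with a numeric attribute ID; the raw value is last.
        if len(cols) >= 3 and cols[0].isdigit() and cols[-1].isdigit():
            values[cols[1]] = int(cols[-1])
    return values

def increased(before, after, watch):
    """Return {name: (old, new)} for watched attributes whose raw value rose."""
    changes = {}
    for name in watch:
        if name in before and name in after and after[name] > before[name]:
            changes[name] = (before[name], after[name])
    return changes

earlier = parse_smart_table(
    "195 Hardware_ECC_Recovered  -O-RC-   100   100   000    -    29365"
)
later = parse_smart_table(
    "195 Hardware_ECC_Recovered  -O-RC-   100   100   000    -    12555553\n"
    "199 UDMA_CRC_Error_Count    -OSRCK   099   099   000    -    27"
)
print(increased(earlier, later,
                ["Hardware_ECC_Recovered", "UDMA_CRC_Error_Count"]))
```

Attributes missing from either snapshot are skipped rather than reported, so a new attribute appearing in the later snapshot does not count as an increase.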

Beeblebrox · Nov 10, 2011

@ jb_fvwm2: Hey, thanks for posting.

2. The port build broke because of HDD timeout. There is no internal compiler error; it just looks that way to the make process.

1. The ECC error count (195) keeps changing but is generally increasing: it is now at 1124369 (posted above as 12555553), and 2 days ago it was 29365.

I have already replaced both the first board and the first disk under warranty; as soon as I got these errors on the first set, I changed BOTH. Now I get the same errors on completely new hardware.

Mods: This is a bug / deficiency / or whatever. Advise how I can help debug. Maybe I need to change the board to another model or move root from ZFS. This is just INSANE and I really need some help here...

Beeblebrox · Jan 4, 2012

can't make buildworld

Unless anyone has a better suggestion, I am preparing to file a bug report (PR) about this problem because:
1. I wanted to solve my GPU-related issues first, in the (unlikely) case that a badly configured GPU was placing undue strain on the CPU/HDD. This has been solved.
2. After the latest update to /usr/src, buildworld breaks with a "seg.fault 11" message, but the real cause is the swap_pager timeout and "indefinite wait buffer". So I will no longer be able to update world or kernel.
3. I monitored CPU and memory usage during the buildworld in #2 and found that:
- CPU usage is not very heavy before the system freezes.
- The GPU fixes have indeed had an effect, as the freezes are shorter.
- Memory/swap info: RAM 1 GB / swap 2 GB. Usage: max RAM 65% / max swap 58%.

EDIT: PR filed: http://www.freebsd.org/cgi/query-pr.cgi?pr=163815

peetaur · Jan 5, 2012

Do you use expanders, or just 1 SATA port per disk?

What is the disk? (Show the output of # smartctl -i.)

throAU · Jan 5, 2012

Wouldn't the increase in SMART errors for ECC and UDMA CRC tend to indicate faulty hardware? Just because a drive is new, it doesn't mean it isn't DOA.

Faulty cabling? Now I'm not a hardware guru, but I didn't think that SMART counters could be written to by the OS, and I didn't think that software could cause ECC or CRC errors in the drive, as they are lower level than the OS?

Just throwing it out there - perhaps your supplier has had a dodgy batch of drives?

throAU · Jan 5, 2012

Further to the above - you note that problems have increased since you updated your source. Perhaps you've simply reached an area of the disk that has more defects than earlier parts?

I guess if someone can confirm my suspicions in the previous post regarding the ability of the OS to affect SMART counters, the possibility of software issue could be ruled out.

I gather you're running ZFS on a single drive?

Crivens · Jan 5, 2012

I did not find what kind of disk you are using. I had problems with a certain kind of Samsung drive that has firmware problems. Also, the cables that come with some hardware can turn out to be the problem. You may want to try another cable and make sure the case is not putting pressure on the connections when closed.

peetaur · Jan 5, 2012

I found his HDD on his PR page.

http://www.freebsd.org/cgi/query-pr.cgi?pr=163815

- HDD is SAMSUNG HD322HJ, 320GB, ATA-8-ACS revision 3b, all FS on ZFS.

So it seems to be a Samsung. Crivens, is that the same type of disk you had an issue with? Does that mean he should upgrade the disk firmware, or just replace the disk?

Beeblebrox · Jan 5, 2012

Hello gents & sorry for the late response, did not expect an answer so quickly. As to Q's:

@peetaur: Only one HDD (no RAID etc.); it is a desktop system. The layout is GPT with 2 GB of swap at the beginning, then partitions using ZFS, and some ext4 slices at the end. The swap slice format is linux-swap. Disk info:

Code:

Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD322HJ
LU WWN Device Id: 5 0000f0 00b0c1257
Firmware Version: 1AC01118
User Capacity:    320,072,933,376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b

@throAU:

Perhaps you've simply reached an area of the disk that has more defects than earlier parts?

1. I had an MSI board before this one and had to change it because it was faulty. Even though it was faulty, it never gave me such an error.
2. Because the MSI board was frying my HDDs, I also sent my HDD to be exchanged under warranty. I also have a second HDD of the exact same model, but from a completely different importer.
3. After receiving this new board + HDD I started the install and hit this problem within 5-6 hours of operation. I disassembled everything and sent the board and the HDD for replacement under warranty.
4. After the replacements arrived, I took a copy of Inquisitor and ran it on the new setup. The run gave a SMART error and ended testing there. I examined the SMART output after the Inquisitor run, and the HDD showed 5 errors in that short time. The HDD looked a bit used (recycled), so I sent it in under warranty again. I received a very clean HDD when it came back.
5. My assumption with step 4 (maybe wrong) was that Linux and BSD share the HDD controller driver source, so the problem would replicate across OSes.
6. The timeout errors list blocks randomly; there is really no pattern. A surface scan with MHDD gives no errors, so the disk is in fact clean.
7. My system uses tmpfs (which writes to swap). Tested without tmpfs, with /tmp on / (root): same errors observed.

@crivens: Sata cable has clip to secure the connection. I have tested the problem with an open case and the HDD lying on the side close to me so I could hear what the HDD did when it timed-out.

My Conclusions:
Since the MSI board did not have this problem with the same model HDD, and since I have thoroughly checked my HDDs, the problem is one of incompatibility between the controller on the board and this model of HDD. My power supply and RAM are clean, and the CPU passes Inquisitor burn tests. I also thought that maybe the "linux-swap" format could cause problems, but then how do you explain the Inquisitor results? Also, the timeouts are not random: they do not happen when I am, say, using my browser or doing regular work on the PC. They happen only under heavy HDD usage (compiling).

One last bit of info: The HDD is not immediately recognized at boot-up and needs some time for the system to find it - a time-out error at start-up (but I need to find & post exact message).

I still have my full-debug kernel (from Nov-11) and a completely clean HDD of the same model. Willing to run any tests you can suggest.

Crivens · Jan 5, 2012

peetaur said:
I found his HDD on his PR page.

http://www.freebsd.org/cgi/query-pr.cgi?pr=163815

So it seems to be a Samsung. Crivens, is that the same type of disk you had an issue with? Does that mean he should upgrade the disk firmware, or just replace the disk?

I would vote for replacing.

In the last 2 years, 2 of these disks have pulled up their bits, curled up their spindles and gone to Texas in my server, which does not see too much usage.
I replaced the ZFS RAID built from 1.5TB Samsungs with a lot of 500GB 2.5" disks. They provide more bandwidth and do not take 6 hours for a resilver in case something happens. Also, they are now from 2 different manufacturers and different vendors, bought in different shops, and coming from different batches. I do not like the thought of a manufacturing problem killing a second drive while a resilver is in progress and trashing the complete pool.

Gladly, each time one of these disks developed problems, ZFS told me about it way before SMART did, leaving me with no data loss and only mild inconveniences (replacing drives, copying one or two files back from backup).

Crivens · Jan 5, 2012

Beeblebrox said:
@crivens: Sata cable has clip to secure the connection. I have tested the problem with an open case and the HDD lying on the side close to me so I could hear what the HDD did when it timed-out.

The clips do not mean much, but they are better than nothing.
I have a bunch of cables that came with the disk mountings and are real, well, crap. I replaced them and the SMART errors went down. Also, I once had the problem that the soundproofing of the case pushed a bit against a cable, causing it to twist slightly in the drive-side connector and producing CRC errors that way.
But if you use good cables and have the complete setup open and free while doing the tests, then these causes may not apply.

Beeblebrox · Jan 5, 2012

I changed the sata cable to my veteran cable from the MSI board and ran
# make -j4 buildworld
At first it looked very promising and I was completely surprised. Then 2 timeouts. First one:

Code:

Jan  5 19:07:46 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136385, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 47774, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 113829, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 101127, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 24993, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 39516, size: 8192
Jan  5 19:08:26 kernel: ahcich0: Timeout on slot 24 port 0
Jan  5 19:08:26 kernel: ahcich0: is 00000000 cs f80007ff ss ff0007ff rs ff0007ff tfd c0 serr 00000800 cmd 0004fb17
Jan  5 19:08:26 kernel: ahcich0: AHCI reset...
Jan  5 19:08:26 kernel: ahcich0: SATA connect time=100us status=00000123
Jan  5 19:08:26 kernel: ahcich0: AHCI reset: device found
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Request requeued
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Command timed out
<MANY REPEATS>
Jan  5 19:08:26 kernel: ahcich0: AHCI reset: device ready after 100ms
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 132983, size: 12288
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 133530, size: 12288
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 34081, size: 4096
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 113722, size: 8192
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 78475, size: 8192
Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 34847, size: 4096
Jan  5 19:08:26 kernel: ahcich0: Timeout on slot 29 port 0
Jan  5 19:08:26 kernel: ahcich0: is 00000000 cs 00000007 ss e0000007 rs e0000007 tfd c0 serr 00000000 cmd 0004e017
Jan  5 19:08:26 kernel: ahcich0: AHCI reset...
Jan  5 19:08:26 kernel: ahcich0: SATA connect time=100us status=00000123
Jan  5 19:08:26 kernel: ahcich0: AHCI reset: device found
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Request requeued
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jan  5 19:08:26 kernel: (ada0:ahcich0:0:0:0): Command timed out
<MANY REPEATS>
Jan  5 19:08:26 kernel: ahcich0: AHCI reset: device ready after 100ms
Jan  5 19:10:56 kernel: ahcich0: Timeout on slot 12 port 0
Jan  5 19:10:56 kernel: ahcich0: is 00000000 cs ffff8001 ss fffff001 rs fffff001 tfd c0 serr 00000000 cmd 0004ef17
Jan  5 19:10:56 kernel: ahcich0: AHCI reset...
Jan  5 19:10:56 kernel: ahcich0: SATA connect time=100us status=00000123
Jan  5 19:10:56 kernel: ahcich0: AHCI reset: device found
Jan  5 19:10:56 kernel: (ada0:ahcich0:0:0:0): Command timed out
Jan  5 19:10:56 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jan  5 19:10:56 kernel: (ada0:ahcich0:0:0:0): Request requeued
Jan  5 19:10:56 kernel: (ada0:ahcich0:0:0:0): Retrying command
<MANY REPEATS>
Jan  5 19:10:56 kernel: ahcich0: AHCI reset: device ready after 100ms

Second Timeout:

Code:

Jan  5 19:30:52 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 76407, size: 4096
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 152288, size: 4096
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 34588, size: 4096
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 153628, size: 8192
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 102305, size: 4096
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37906, size: 4096
Jan  5 19:30:53 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 53022, size: 4096
Jan  5 19:30:53 kernel: ahcich0: Timeout on slot 2 port 0
Jan  5 19:30:53 kernel: ahcich0: is 00000000 cs 007fffe0 ss 007ffffc rs 007ffffc tfd c0 serr 00000800 cmd 0004e517
Jan  5 19:30:53 kernel: ahcich0: AHCI reset...
Jan  5 19:30:53 kernel: ahcich0: SATA connect time=100us status=00000123
Jan  5 19:30:53 kernel: ahcich0: AHCI reset: device found
Jan  5 19:30:53 kernel: (ada0:ahcich0:0:0:0): Request requeued
Jan  5 19:30:53 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jan  5 19:30:53 kernel: (ada0:ahcich0:0:0:0): Command timed out
<MANY REPEATS - like above>
Jan  5 19:30:53 kernel: ahcich0: AHCI reset: device ready after 100ms

The second freeze lasted about 2 minutes and ended with the system rebooting itself. I would like to draw your attention to the fact that the timestamp for when the HDD is "back online" is the same as when it went offline. In my opinion this means a total system lock, as the system also loses track of time. I also have these in my settings:

Code:

/boot/loader.conf:  hint.ahci.0.msi=0
/etc/sysctl.conf:  vfs.read_max=32
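
Editorial sketch related to the timestamp observation above: since many messages in the posted logs share a single timestamp, the jumps between consecutive timestamps are one rough way to see how long each stall lasted. The Python below is a minimal sketch under stated assumptions: the function names are hypothetical, the year is assumed to be 2012 because syslog lines omit it, month names are assumed to parse in an English locale, and the timestamps record when a message was logged, which is not necessarily when the event happened.

```python
from datetime import datetime

def event_times(lines, year=2012):
    """Yield a datetime for each line shaped like 'Jan  5 19:07:46 kernel: ...'."""
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # e.g. the '<MANY REPEATS>' placeholder
        month, day, clock = parts[:3]
        try:
            yield datetime.strptime(f"{year} {month} {day} {clock}",
                                    "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line

def gaps(lines, min_seconds=30):
    """Return (start, end, seconds) for jumps of at least min_seconds."""
    times = list(event_times(lines))
    found = []
    for earlier, later in zip(times, times[1:]):
        seconds = (later - earlier).total_seconds()
        if seconds >= min_seconds:
            found.append((earlier, later, seconds))
    return found

log = [
    "Jan  5 19:07:46 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136385, size: 4096",
    "Jan  5 19:08:26 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 47774, size: 4096",
    "Jan  5 19:08:26 kernel: ahcich0: Timeout on slot 24 port 0",
    "<MANY REPEATS>",
    "Jan  5 19:10:56 kernel: ahcich0: Timeout on slot 12 port 0",
]
for start, end, seconds in gaps(log):
    print(start.time(), "->", end.time(), f"{seconds:.0f} s")
```

On the excerpt above this reports two jumps, of 40 and 150 seconds.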

However, in my mind I keep coming back to the results I got from my Inquisitor run on the hardware, and the fact that my MSI board did not do this. Also, from the sysrescd package, AIDA is unable to find the controller and stalls indefinitely, while MHDD shows a number of controller errors before finding the disk (this works only in IDE mode; MHDD cannot find the HDD at all when the BIOS setting is SATA).

EDIT: I have decided to take the issue up with the board manufacturer. Will post any answer/solution from them.

Beeblebrox · Jan 17, 2012

SOLVED - well, sort of...

in /etc/sysctl.conf:

Code:

vm.defer_swapspace_pageouts: 1

This makes swap a lower priority than normal memory (like swappiness in Linux). My test builds did quite well and gave no timeouts even when RAM usage went up to nearly 80%. In other words: don't run to swap every 10 seconds; do bulk writes when you run out of RAM.

However, during some other, harder builds (with -j4, for example) I again got timeouts (fewer timeouts, and fewer error messages per timeout). I also noticed that in this type of test normal RAM was not being used and the system was hitting swap and the HDD pretty hard. In this case HDD traffic is tripled, I guess, because the system writes to swap, then reads from swap and writes to /usr.

It seems that the system keeps a reserve of free memory for contingencies. If there is a better way to decrease HDD swap use under multi-threaded loads, I think my error will be cleared for good.

peetaur · Jan 19, 2012

If there is a problem with the disk, avoiding usage of the disk doesn't fix the root cause of the problem. You should think about the disk too, rather than working around it.

I vaguely remember someone saying something about some specific Samsungs (probably the green ones) and bad firmware. I don't know if it applies to your disk. So I have 2 suggestions.

I have a Crucial SSD that has some timeouts that are clearly the SSD's fault (as a root cause, although FreeBSD handles it badly, including a panic I can cause with 2 different commands). Instead of running heavy I/O to cause the problem, I can also just mount the disk, unplug it, and plug it back in. I get SCSI / SMP timeouts in /var/log/messages and on tty1, and the disk becomes unreadable afterwards, even if I move it to another disk bay, while plugging a different disk into that bay shows that the bay works fine. Rebooting while the disk is plugged in fixes everything (I didn't try rebooting with the disk removed). So I suggest you try this as a test (assuming you have a backup of the data or accept the risk of data corruption) to see if it is just the disk. (However, if this does not cause the problem, it does not mean it is not the disk.)

And the second suggestion is to try another HDD; maybe try attaching it to the mirror, or replace enough disks in your raidz to keep enough redundancy to run degraded when one fails. Then you will see whether both disks fail, or only that particular Samsung.

peetaur · Jan 19, 2012

Here is a similar thread.
The OP in that thread is running Samsung Spinpoint F4EG HD204UI disks.
You are running Samsung Spinpoint F1 DT HD322HJ disks, which may or may not be similar enough to say the issue is related.

Beeblebrox · Feb 12, 2012

I found some old HDDs I had kept in storage the other day. And I do mean OLD! A Seagate ST3660A, 545 MB, circa 1996 I think, but it still works. I connected it to the mobo and made the whole thing swap. Guess what? No timeouts on either HDD, even though I threw 2 separate large builds at it at the same time, both with -j3 enabled. Hmmm, mystery...

peetaur · Feb 12, 2012

You actually have a motherboard that supports 545 MB disks? Wasn't that back when the disks had no controller and you needed different IDE controller cards?

But I think this matches my guess that it is the disks. Others have timeout problems with Seagate Greens, and Samsung Spinpoints.

Beeblebrox · Feb 12, 2012

@peetaur

You actually have a motherboard that supports 545 MB disks?

Back then my boards and HDD were high quality but I had a shit OS. Now my boards and HDD are shit but I have a high quality OS. Life is like a box of chocolates...

But I think this matches my guess that it is the disks.

I don't understand how you reach this conclusion. I have 2 swaps - 1 on the sata, 1 on the IDE.

olav · Feb 15, 2012

So basically the property

Code:

vm.defer_swapspace_pageouts = 1

Will increase stability in low memory condition, but slightly decrease the performance?

Beeblebrox · Feb 15, 2012

In that post, what I was looking for was something similar to swappiness in Linux. As an example of the usage: use RAM up to 90%, then fall back to swap as a last resort.

Unfortunately that code did not work because dmesg gave:

Code:

WARNING: sysctl vm.defer_swapspace_pageouts: does not exist

Sorry about that, bro. I let it slip by, but I am sure something similar exists in FreeBSD; I just have not had the time to search. Also, a setting similar to swappiness would probably work best in combination with a "bulk write" setting, allowing swap writes in large chunks instead of small writes that keep the HDD busy unnecessarily.
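
Editorial sketch: sysctl.conf(5) documents entries in the form `name=value`, while the setting was posted earlier in the thread with a colon. The thread does not establish why the sysctl was reported as missing, so the Python below is only a hedged syntax check, not a diagnosis. The function name is hypothetical, and the check deliberately flags anything that is not in the compact `name=value` form, including entries with spaces around the equals sign.

```python
import re

# Compact form assumed here: name=value with no spaces around '='.
_ENTRY = re.compile(r"^[\w.%-]+=\S")

def check_sysctl_conf(text):
    """Return (line_number, line) for entries not in compact name=value form."""
    bad = []
    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if line and not _ENTRY.match(line):
            bad.append((number, raw))
    return bad

conf = """\
# tuning from this thread
vfs.read_max=32
vm.defer_swapspace_pageouts: 1
"""
for number, line in check_sysctl_conf(conf):
    print(f"line {number}: not in name=value form: {line!r}")
```

A check like this only validates the file's syntax; whether a given sysctl name exists on a particular kernel still has to be confirmed on that system.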
