Random freezes on new system

Pawtuxet · Apr 25, 2014

Hi everyone,

I have assembled a new system that has turned out to have a recurring problem; every so often it will lock up completely. All connections time out, the console is unresponsive and if the zfs pool was busy, the disk activity LEDs will remain in the state they were in when it happened. As far as I can tell, it is frozen through and through.

I made a post on the "Base System / Storage" board, since I was convinced this all started when I attached the Supermicro AOC-USAS2-L8e card and configured the ZFS pool - the system seemed to run just fine for two weeks prior to this. I've now verified that it will also happen without that card connected, and that the issue may in fact have been present from the beginning.

I can't seem to pin down a trigger. Moving data around seems to bring a freeze on sooner. I can fairly reliably provoke a freeze within a couple of hours by repeatedly copying a 50GB file over a Samba share, but it can also happen when the system is just sitting more or less idle.
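For anyone wanting to reproduce this kind of test, here is a minimal sketch of a copy loop. It assumes the share is already mounted on the client; the function name, file names and paths are placeholders, not what was actually used here. Each pass is logged and followed by sync, so the last completed pass should still be in the log after a hard freeze.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC into DST_DIR COUNT times, logging each pass.
# The log is appended to and flushed with sync(1) after every pass, so the
# last line shows roughly when the machine stopped responding.
copy_loop() {
    src=$1; dst_dir=$2; count=$3; log=$4
    i=1
    while [ "$i" -le "$count" ]; do
        # Overwrite the same target each time to keep disk usage constant.
        cp "$src" "$dst_dir/copy_test.bin" || return 1
        printf '%s pass %d ok\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$i" >> "$log"
        sync
        i=$((i + 1))
    done
}

# Example (placeholder paths):
#   copy_loop /data/big.bin /mnt/share 100 /var/tmp/copy_loop.log
```

Keeping the log on a different machine or disk than the one under test makes it more likely to survive the freeze.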

I've tried to rule out a few things.

Memory: I've tried two separate sets of memory, 2x4 GB and 4x8 GB. No difference.
Network: Using a PCI-E 1x Marvell NIC instead of the Realtek 8111F on the motherboard made no difference.
Overheating: I suspected overheating at first and still haven't ruled it out, even though healthd -d has yet to report anything higher than 36 degrees. Last time I tried scrubbing the zpool, it ran for 2.5 hours before locking up and then immediately locked up another 3 times, minutes after rebooting, before I aborted it.
HBA card: Still freezes when not present.
Firmware: The motherboard is running its latest firmware and so is the SSD system drive. I don't think there's anything else with updateable firmware, other than the HBA card, which is running its latest IT firmware from Supermicro (16), even if it isn't the latest firmware for the LSI 2008 controller (18).
FreeBSD version: Started out with 10.0-RELEASE and switched to 10.0-STABLE. No difference.

What I haven't tried yet.

Buying a different motherboard.
Moving the system to a regular 2.5" hard drive instead of the SSD (I've seen some weird behavior from SSDs. Even though it ran just fine as the old server's system drive, it's now managed by a different and faster controller and it might be worth a shot).
Temporarily installing Windows to see if it happens regardless of the OS.

Hardware:
PSU: Corsair RM450
Motherboard: ASUS P8H77-M Pro
CPU: Intel Core i3-3250
RAM: Corsair XMS3

Any ideas or suggestions would be welcome.

Code:

# uname -a
FreeBSD dingo.pawtuxet.dk 10.0-STABLE FreeBSD 10.0-STABLE #0 r264493: Tue Apr 15 12:42:40 CEST 2014     dingo@dingo.pawtuxet.dk:/usr/obj/usr/src/sys/GENERIC  amd64

Code:

# dmesg
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-STABLE #0 r264493: Tue Apr 15 12:42:40 CEST 2014
    dingo@dingo.pawtuxet.dk:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.4 (tags/RELEASE_34/final 197956) 20140216
CPU: Intel(R) Core(TM) i3-3250 CPU @ 3.50GHz (3500.07-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x306a9  Family = 0x6  Model = 0x3a  Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x3d9ae3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,POPCNT,TSCDLT,XSAVE,OSXSAVE,AVX,F16C>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Standard Extended Features=0x281<GSFSBASE,SMEP,ENHMOVSB>
  TSC: P-state invariant, performance statistics
real memory  = 34359738368 (32768 MB)
avail memory = 32979369984 (31451 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 SMT threads
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
ioapic0 <Version 2.0> irqs 0-23 on motherboard
Cuse4BSD v0.1.33 @ /dev/cuse
kbd1 at kbdmux0
random: <Software, Yarrow> initialized
acpi0: <ALASKA A M I> on motherboard
acpi0: Power Button (fixed)
acpi0: reservation of 67, 1 (4) failed
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7800000-0xf7bfffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
agp0: <IvyBridge desktop GT1 IG> on vgapci0
agp0: aperture size is 256M, detected 262140k stolen memory
vgapci0: Boot video device
xhci0: <Intel Panther Point USB 3.0 controller> mem 0xf7d00000-0xf7d0ffff irq 16 at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: <simple comms> at device 22.0 (no driver attached)
ehci0: <Intel Panther Point USB 2.0 controller> mem 0xf7d17000-0xf7d173ff irq 23 at device 26.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
hdac0: <Intel Panther Point HDA Controller> mem 0xf7d10000-0xf7d13fff irq 22 at device 27.0 on pci0
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci3: <ACPI PCI bus> on pcib3
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0xf0004000-0xf0004fff,0xf0000000-0xf0003fff irq 16 at device 0.0 on pci3
re0: Using 1 MSI-X message
re0: Chip rev. 0x48000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Ethernet address: d8:50:e6:41:59:24
pcib4: <ACPI PCI-PCI bridge> irq 18 at device 28.6 on pci0
pci4: <ACPI PCI bus> on pcib4
atapci0: <Marvell ATA controller> port 0xd040-0xd047,0xd030-0xd033,0xd020-0xd027,0xd010-0xd013,0xd000-0xd00f mem 0xf7c10000-0xf7c101ff irq 18 at device 0.0 on pci4
ata2: <ATA channel> at channel 0 on atapci0
ata3: <ATA channel> at channel 1 on atapci0
ehci1: <Intel Panther Point USB 2.0 controller> mem 0xf7d16000-0xf7d163ff irq 23 at device 29.0 on pci0
usbus2: EHCI version 1.0
usbus2 on ehci1
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel Panther Point SATA300 controller> port 0xf110-0xf117,0xf100-0xf103,0xf0f0-0xf0f7,0xf0e0-0xf0e3,0xf0d0-0xf0df,0xf0c0-0xf0cf irq 19 at device 31.2 on pci0
ata4: <ATA channel> at channel 0 on atapci1
ata5: <ATA channel> at channel 1 on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
atapci2: <Intel Panther Point SATA300 controller> port 0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf090-0xf097,0xf080-0xf083,0xf070-0xf07f,0xf060-0xf06f irq 19 at device 31.5 on pci0
ata6: <ATA channel> at channel 0 on atapci2
ata7: <ATA channel> at channel 1 on atapci2
acpi_button0: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
ppc1: <Parallel port> port 0x378-0x37f irq 5 on acpi0
ppc1: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc1
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
orm0: <ISA Option ROM> at iomem 0xc0000-0xce7ff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
p4tcc1: <CPU Frequency Thermal Control> on cpu1
est2: <Enhanced SpeedStep Frequency Control> on cpu2
p4tcc2: <CPU Frequency Thermal Control> on cpu2
est3: <Enhanced SpeedStep Frequency Control> on cpu3
p4tcc3: <CPU Frequency Thermal Control> on cpu3
Timecounters tick every 1.000 msec
hdacc0: <Realtek ALC892 HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek ALC892 Audio Function Group> at nid 1 on hdacc0
pcm0: <Realtek ALC892 (Rear Analog 7.1/2.0)> at nid 20,22,21,23 and 24,26 on hdaa0
pcm1: <Realtek ALC892 (Front Analog)> at nid 27 and 25 on hdaa0
pcm2: <Realtek ALC892 (Rear Digital)> at nid 30 on hdaa0
pcm3: <Realtek ALC892 (Onboard Digital)> at nid 17 on hdaa0
hdacc1: <Intel Panther Point HDA CODEC> at cad 3 on hdac0
hdaa1: <Intel Panther Point Audio Function Group> at nid 1 on hdacc1
pcm4: <Intel Panther Point (HDMI/DP 8ch)> at nid 5 on hdaa1
pcm5: <Intel Panther Point (HDMI/DP 8ch)> at nid 7 on hdaa1
random: unblocking device.
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
usbus2: 480Mbps High Speed USB v2.0
ugen1.1: <Intel> at usbus1
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
ugen0.1: <0x8086> at usbus0
uhub1: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen2.1: <Intel> at usbus2
uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2
uhub1: 8 ports with 8 removable, self powered
uhub0: 2 ports with 2 removable, self powered
uhub2: 2 ports with 2 removable, self powered
ugen1.2: <vendor 0x8087> at usbus1
uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
ugen2.2: <vendor 0x8087> at usbus2
uhub4: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus2
uhub3: 6 ports with 6 removable, self powered
uhub4: 8 ports with 8 removable, self powered
ugen2.3: <Logitech> at usbus2
uhub5: <Logitech Logitech BT Mini-Receiver, class 9/0, rev 2.00/49.00, addr 3> on usbus2
uhub5: 3 ports with 1 removable, bus powered
ugen2.4: <Logitech> at usbus2
ukbd0: <Logitech Logitech BT Mini-Receiver, class 0/0, rev 2.00/49.00, addr 4> on usbus2
kbd2 at ukbd0
ugen2.5: <Logitech> at usbus2
ada0 at ata4 bus 0 scbus2 target 0 lun 0
ada0: <Samsung SSD 840 PRO Series DXM06B0Q> ATA-9 SATA 3.x device
ada0: Serial Number S12RNEAD401503J
ada0: 600.000MB/s transfers (SATA 3.x, UDMA5, PIO 8192bytes)
ada0: 244198MB (500118192 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad8
ugen2.6: <vendor 0x2548> at usbus2
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
Timecounter "TSC-low" frequency 1750035404 Hz quality 1000
Trying to mount root from ufs:/dev/ada0p2 [rw]...
WARNING: / was not properly dismounted
ums0: <Logitech Logitech BT Mini-Receiver, class 0/0, rev 2.00/49.00, addr 5> on usbus2
ums0: 14 buttons and [XYZT] coordinates ID=2
ums0: 8 buttons and [XYZT] coordinates ID=5
umodem0: <vendor 0x2548 product 0x1002, class 0/0, rev 1.10/10.00, addr 6> on usbus2
umodem0: data interface 1, has CM over data, has break
ums1: <vendor 0x2548 product 0x1002, class 0/0, rev 1.10/10.00, addr 6> on usbus2
ums1: 3 buttons and [XY] coordinates ID=0
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to deny, logging disabled
pid 830 (xfsettingsd), uid 1001: exited on signal 11 (core dumped)
info: [drm] Initialized drm 1.1.0 20060810
drmn0: <Intel IvyBridge> on vgapci0
info: [drm] MSI enabled 1 message(s)
info: [drm] AGP at 0xe0000000 256MB
iicbus0: <Philips I2C bus> on iicbb0 addr 0xff
iic0: <I2C generic I/O> on iicbus0
iic1: <I2C generic I/O> on iicbus1
iicbus2: <Philips I2C bus> on iicbb1 addr 0x0
iic2: <I2C generic I/O> on iicbus2
iic3: <I2C generic I/O> on iicbus3
iicbus4: <Philips I2C bus> on iicbb2 addr 0x0
iic4: <I2C generic I/O> on iicbus4
iic5: <I2C generic I/O> on iicbus5
iicbus6: <Philips I2C bus> on iicbb3 addr 0x0
iic6: <I2C generic I/O> on iicbus6
iic7: <I2C generic I/O> on iicbus7
iicbus8: <Philips I2C bus> on iicbb4 addr 0x0
iic8: <I2C generic I/O> on iicbus8
iic9: <I2C generic I/O> on iicbus9
iicbus10: <Philips I2C bus> on iicbb5 addr 0x0
iic10: <I2C generic I/O> on iicbus10
iic11: <I2C generic I/O> on iicbus11
iicbus12: <Philips I2C bus> on iicbb6 addr 0x0
iic12: <I2C generic I/O> on iicbus12
iic13: <I2C generic I/O> on iicbus13
iicbus14: <Philips I2C bus> on iicbb7 addr 0x0
iic14: <I2C generic I/O> on iicbus14
iic15: <I2C generic I/O> on iicbus15
info: [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
info: [drm] Driver supports precise vblank timestamp query.
drmn0: taking over the fictitious range 0xe0000000-0xf0000000
info: [drm] GMBUS timed out, falling back to bit banging on pin 7 [gmbus bus dpd]
info: [drm] Initialized i915 1.6.0 20080730

wblock@ · Apr 25, 2014

If it's hardware, it might be possible to trigger with another operating system, or memtest.

For software, I'd start with making sure that any UFS filesystems have SUJ disabled. Soft updates alone are fine.

Pawtuxet · Apr 26, 2014

wblock@ said:
If it's hardware, it might be possible to trigger with another operating system, or memtest.

For software, I'd start with making sure that any UFS filesystems have SUJ disabled. Soft updates alone are fine.

Thanks for the suggestion! I've disabled Journaling on the system partition:

Code:

# mount
/dev/ada0p2 on / (ufs, local, soft-updates)

It seems to be largely unnecessary for SSD drives anyway, so I'll leave it off even if it doesn't make a difference.
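As far as I know, journaling is switched off with tunefs -j disable on the device while the filesystem is unmounted or mounted read-only. To double-check afterwards, here is a small sketch that scans mount output for UFS filesystems still showing "journaled soft-updates". The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `mount` output on stdin and print the device of
# every UFS filesystem that still has soft-updates journaling (SUJ) enabled.
# Prints nothing when no journaled UFS mounts are found.
suj_mounts() {
    awk '/\(ufs,/ && /journaled soft-updates/ { print $1 }'
}

# Example:
#   mount | suj_mounts
```

With the mount line shown above, this prints nothing, which is the expected result once SUJ is off.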

Pawtuxet · Apr 26, 2014

Disabling Journaling sadly did not seem to make a difference, but thanks for the suggestion.

chrbr · Apr 26, 2014

My reply is just guessing, but since you have already tried a lot to fix the issue it might be a lucky guess.

In the past I have had SATA drives suddenly stop working. That was under Linux, but maybe it would have happened with FreeBSD as well. The crashes happened at unpredictable intervals, as in your case. The root cause turned out to be the positioning of the SATA cables. Since I rearranged them so that they touch the metal chassis of the computer as little as possible, everything has been fine.

I am not sure if smartctl is applicable in your case. If so, the output of smartctl -a /dev/xxx should give some information about the health of the drives. On my system, part of the output is as below:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   143   140   021    Pre-fail  Always       -       3841
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       893
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1661
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       891
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       155
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       737
194 Temperature_Celsius     0x0022   106   100   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       48
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

As far as I have read in different discussions, the UDMA_CRC_Error_Count attribute should be related to cable issues. It may be worth checking the cables and the output of smartctl on your system as well. I wish you good luck and success in fixing the issue!
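To pull a single raw value out of that table without reading it by eye, something like the following sketch might help. It assumes the ten-column attribute layout shown above and matches on the attribute ID (199 here), since the attribute name differs between vendors. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# RAW_VALUE column for the attribute with the given ID.
# A line only counts as an attribute row if its third field looks like a
# flag (0x....), which skips the header and other sections of the output.
smart_raw() {
    awk -v id="$1" '$1 == id && $3 ~ /^0x/ && NF >= 10 { print $10 }'
}

# Example (device name is a placeholder):
#   smartctl -a /dev/xxx | smart_raw 199
```

A non-zero value that keeps growing between checks would point at the cable or connector rather than the drive itself.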

Pawtuxet · Apr 27, 2014

The CRC_Error_Count attribute is still sitting at 0 for the system drive, so it's probably not the same issue. The rest of the SMART values also seem fairly unremarkable. Except POR_Recovery_Count which, depressingly, is likely the number of times the system has frozen so far.

Code:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       7100
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       140
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       54
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   056   000    Old_age   Always       -       30
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       76
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9420914763

But your suggestion has given me a couple of ideas as to what else I could try, such as a different SATA cable or a different controller on the motherboard. And it just occurred to me that at the same time I installed the HBA card, I also moved the system drive into a hotswap bay using a 3.5" adapter, where previously it was just lying around and not sharing power with 8 other drives. I'll try going back to the former configuration.

And thanks! I'm about ready for some success, here.

ralphbsz · Apr 27, 2014

Pawtuxet said:
... the disk activity LEDs will remain in the state they were in when it happened. ...

That is really weird. If one takes that disk activity light literally, it would say that if the system crashes while a disk IO is in progress, the disk IO never actually finishes. And that is nonsense ... unless you have a broken disk drive, all IOs will finish within a few seconds, and most in a dozen milliseconds.

This makes me suspect that the problem is the disk interface. Bug? Unlikely, the normal SATA interfaces are thoroughly tested, since most people have them. Bizarre incompatibility between motherboard / your SSD / FreeBSD? Possible. Hardware damage on the motherboard, the SSD, or the SATA cable? More likely.

You say that your SSD has to share power with 8 hard drives. That's a pretty hefty load on the power supply 12V and 5V rails, although much less than modern power supplies (300-500W) are rated for. Can you move the hard drives temporarily to a separate power supply, and see whether the problem goes away?

Pawtuxet · Apr 28, 2014

ralphbsz said:
You say that your SSD has to share power with 8 hard drives. That's a pretty hefty load on the power supply 12V and 5V rails, although much less than modern power supplies (300-500W) are rated for. Can you move the hard drives temporarily to a separate power supply, and see whether the problem goes away?

I picked a PSU that would cover about 160% of the expected maximum draw of all components together - it really should be sufficient. But it's something I hadn't tried, so I separated the SSD and gave it its own, entirely separate power supply while I was doing the other tests (different controller and SATA cables). If anything, this made it lock up much, much faster. Within minutes, every time, regardless of which PSU it was connected to.
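As a rough check of that headroom figure, here is the arithmetic as a sketch. It assumes the RM450 is rated at 450 W and that "160%" means rated capacity divided by expected peak draw. Both are assumptions about how the number was arrived at.

```shell
#!/bin/sh
# Hypothetical helper: expected peak draw in whole watts, given a PSU rating
# and a headroom percentage (rating / expected draw * 100).
# Integer arithmetic, so the result is rounded down.
expected_draw() {
    rated_w=$1; headroom_pct=$2
    echo $(( rated_w * 100 / headroom_pct ))
}

# Example:
#   expected_draw 450 160    # prints 281
```

So the components together would peak at roughly 280 W under these assumptions, leaving about 170 W spare.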

ralphbsz said:
This makes me suspect that the problem is the disk interface. Bug? Unlikely, the normal SATA interfaces are thoroughly tested, since most people have them. Bizarre incompatibility between motherboard / your SSD / FreeBSD? Possible. Hardware damage on the motherboard, the SSD, or the SATA cable? More likely.

Could it be RF or EM interference? The components are all mounted on a board, sitting in a closet under the TV, so there's no actual case to shield it. There are two RF transmitters nearby, a Sonos PlayBar and a WiFi router, which was actually sitting right up against the wall in the next closet. I moved the router a little farther away - this was about 14 hours ago, and despite my efforts during the day, I have not been able to make it lock up. I did not think a wireless router would be able to affect other devices so drastically, if that actually has been the cause. Testing continues!

ralphbsz · Apr 28, 2014

Pawtuxet said:
Could it be RF or EM interference? The components are all mounted on a board, sitting in a closet under the TV, so there's no actual case to shield it. There are two RF transmitters nearby, a Sonos PlayBar and a WiFi router, which was actually sitting right up against the wall in the next closet. I moved the router a little farther away - this was about 14 hours ago, and despite my efforts during the day, I have not been able to make it lock up. I did not think a wireless router would be able to affect other devices so drastically, if that actually has been the cause. Testing continues!

Yikes. You have massive EMI going both ways. On one hand, you have a 2.4 GHz radio transmitter (a.k.a. wireless router), which sends out very pulsed and chopped stuff, going right into the motherboard and SATA cables. On the other hand, you have a ~GHz system with lots of PC board traces that act like antennas. And remember: internal SATA cables are NOT shielded (but at least they are differential), and they run at GHz frequencies.

I would definitely put the computer in a case. They make some very compact and space-efficient cases. If that's impossible, try using eSATA cables (they are shielded), or putting some sheet metal around the board.

Pawtuxet · Apr 29, 2014

A case may end up being necessary, but I'm going to try shielding the inside of the cabinet first. I haven't been able to find a Micro-ATX case that holds 8 3.5" drives as well as the system drive, or has 6 (or 2x3) 5.25" expansion slots, or I would've just gone with that to begin with.

Anyway, simply moving the router away didn't make any difference. EMI is probably the problem, but I think the source is something cruder. Like the nearby refrigerator or the front-door intercom buzzers that exist in all surrounding apartments, along with my own.

It's my understanding that, as long as I'm not concerned about interference emitted from the server, an ungrounded shield is adequate. I'd like to limit the number of connectors I have to remember disconnecting, in order to access the computer.

wblock@ · Apr 29, 2014

I really doubt this is due to noise. Notebooks have internal antennas close to their circuitry, and it's not a problem. Motherboards are designed to avoid this. Case shielding is generally desirable to keep systems from broadcasting RF noise.

Proving that the router is innocent should be easy. Move it somewhere else in the house. My main suspect would be the unusual disk controller. After that, a power supply problem. Does it have a UPS?

Pawtuxet · Apr 29, 2014

wblock@ said:
I really doubt this is due to noise. Notebooks have internal antennas close to their circuitry, and it's not a problem. Motherboards are designed to avoid this. Case shielding is generally desirable to keep systems from broadcasting RF noise.

That was generally what I came away with, after researching a caseless build. There are many, many computers running in wooden or acrylic cases, built into desks and cabinets. Some have been shielded, but most haven't, and when they are, the reasoning is usually so that they won't interfere with other gear - not for their own protection. But I'm going to give it a try. There might be a significant source of noise in one of the other apartments, that I don't know about.

wblock@ said:
Proving that the router is innocent should be easy. Move it somewhere else in the house. My main suspect would be the unusual disk controller. After that, a power supply problem. Does it have a UPS?

I had the machine running for a day without the disk controller card attached, and it still locked up a couple of times. Unless the card has somehow damaged the motherboard, it can probably be ruled out. Unless you meant that it's one of the built-in controllers that is unusual.

I do not have a UPS - they are rarely needed here. But I can try moving it to a different outlet.

And thanks! I appreciate all the suggestions.

Pawtuxet · May 4, 2014

I've given up trying to pinpoint a reason for the recurring lockups and implemented a blanket "fix". The motherboard has been replaced with an ASRock Z77 Pro4, a UPS has been attached and everything moved into a Lian-Li PC-A10 case. The dream of the invisible server has thusly died, but at least it now seems to bloody work, having run for just short of two days.

Code:

# uptime
 4:16pm  up 1 day, 18:26, 3 users, load averages: 0,23 0,25 0,28

I'm inclined to blame the motherboard for the trouble I've experienced. Perhaps a fault developed after it was put to use.
Interference can't be ruled out, but then again, it ran just fine the first two weeks.

Either way, thanks for all the suggestions!

ralphbsz · May 6, 2014

Seems to be a good, but painful and expensive solution. I use a smaller Lian-Li case myself; they seem very well made and easy to use. And our UPS was put to good use this morning, when a tree took out power at 3AM (and the house was on generator from 6AM to 7AM so we could get ready for work and school).