Diagnosing a hanging system

Hi all,

I've just built a 6x3TB raidz2 FreeBSD NAS as a project to learn about FreeBSD, full specs below. The system hangs seemingly randomly and I can't figure out why, or even whether it's a hardware or software problem, so I would be very grateful for any advice on how to diagnose it.

When I say "hang", I mean the terminal doesn't echo keystrokes and network connections die (ssh and ping fail). The first few times, it hung within about 48 hours of boot. The last time it happened (yesterday) it had been running for a little over 8 days.

I log CPU and hard drive temps every 5 minutes and there's no unusual activity there. I then set up a script to log the output of top to a new file every second so I could get a second-by-second snapshot of everything and again, nothing unusual there that I could see -- the last successful output is pasted below. I've run memtest on the memory and full SMART tests on the hard drives, no problems reported.

The only clue I have so far is in my second-by-second logging of top. As I say, I ran:

while sleep 1; do top -b > `date "+File%M%S.txt"`; done

to save the output round-robin style. The last file with a successful top output is File3737.txt, at 11:37:37 (pasted below). After that it creates File3738.txt to File3745.txt, but with zero file size (so I assume 'top' has failed but the shell loop is still running). It skips File3746.txt to File3804.txt entirely (i.e. the files from the previous hour are still there), and the last zero-byte file it creates is File3810.txt.
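The gap between any two of these file names can be worked out mechanically. Below is a minimal sketch, assuming both stamps fall within the same hour; the function names mmss_to_seconds and elapsed are made up for illustration.

```sh
#!/bin/sh
# Convert an MMSS stamp (as used in the FileMMSS.txt names) to seconds.
mmss_to_seconds() {
    mm=${1%??}      # first two digits: minutes
    ss=${1#??}      # last two digits: seconds
    # Strip one leading zero so the shell does not parse "08" as octal.
    mm=${mm#0}
    ss=${ss#0}
    echo $(( mm * 60 + ss ))
}

# Seconds elapsed between two MMSS stamps within the same hour.
elapsed() {
    echo $(( $(mmss_to_seconds "$2") - $(mmss_to_seconds "$1") ))
}

# Last good file vs. last zero-byte file from the description above.
elapsed 3737 3810
```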

So this means it takes about 33 seconds to crash (from 11:37:37 to 11:38:10) -- it's not instant. The only thing I can think of is a race condition that quickly uses up some system resource, but I don't know how to diagnose this further.
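For anyone repeating this kind of logging, the empty snapshots can be picked out without eyeballing a directory listing. This is only a sketch: list_empty_snapshots is a hypothetical helper name, and the File*.txt pattern simply mirrors the naming used by the loop above.

```sh
#!/bin/sh
# List zero-byte snapshot files in a directory, sorted by name.
list_empty_snapshots() {
    dir=${1:-.}
    # -size 0 matches files that occupy zero blocks, i.e. empty files.
    find "$dir" -type f -name 'File*.txt' -size 0 | sort
}

list_empty_snapshots .
```

Run in the logging directory, this prints one path per empty file, which makes gaps in the sequence easier to spot.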

Hardware is:
Asrock C2550D4I
16GB Kingston ECC RAM (2x KVR16E11/8)
6x3TB WD Reds
1x120GB Kingston V300 SSD
Seasonic SS-300SFD 80 Plus PSU

Last top file is

Code:
last pid: 81320;  load averages:  0.12,  0.08,  0.08  up 8+14:40:08    11:37:37
26 processes:  1 running, 25 sleeping
Mem: 1692K Active, 75M Inact, 13G Wired, 374M Buf, 2943M Free
ARC: 11G Total, 2283M MFU, 9415M MRU, 18K Anon, 27M Header, 20M Other
Swap: 3881M Total, 3881M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
18043 adam          1  52    0 13144K  2884K wait    2  10:38   0.00% sh
17068 root          1  20    0 20120K  3520K select  1   6:54   0.00% top
14910 root         32  20    0  8304K  2576K rpcsvc  2   1:01   0.00% nfsd
  805 root          1  20    0 20600K  6244K select  0   0:19   0.00% sendmail
  812 root          1  20    0 12564K  2452K nanslp  2   0:03   0.00% cron
  657 root          1  20    0 10472K  2404K select  2   0:03   0.00% syslogd
 1168 root          1  20    0 10424K  2312K select  3   0:01   0.00% rpcbind
  407 root          1  20    0  9512K  4992K select  0   0:00   0.00% devd
  808 smmsp         1  20    0 20600K  5928K pause   1   0:00   0.00% sendmail
14909 root          1  20    0 14448K  3944K select  0   0:00   0.00% nfsd
 1184 root          1  20    0 16612K  4336K select  1   0:00   0.00% mountd
  802 root          1  20    0 55676K  7032K select  2   0:00   0.00% sshd
  859 root          1  20    0 43732K  2944K wait    1   0:00   0.00% login
18033 root          1  20    0 43732K  2968K wait    1   0:00   0.00% login
  500 root          1  46    0 10592K  2368K select  0   0:00   0.00% dhclient
16955 root          1  20    0 19600K  3584K pause   0   0:00   0.00% csh
  569 _dhcp         1  20    0 10592K  2488K select  3   0:00   0.00% dhclient
81320 adam          1  72    0 20120K  3000K CPU3    3   0:00   0.00% top

/boot/loader.conf

Code:
coretemp_load="YES"

/var/log/dmesg.today

Code:
Copyright (c) 1992-2016 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz (2400.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,AESNI,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 17179869184 (16384 MB)
avail memory = 16554504192 (15787 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <INTEL  TIANO   >
WARNING: L1 data cache covers less APIC IDs than a core
0 < 1
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-23 on motherboard
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff8101d970, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <ALASKA A M I > on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: _OSC returned error 0x10
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> mem 0xdf740000-0xdf75ffff irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> mem 0xdf720000-0xdf73ffff irq 20 at device 3.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> mem 0xdf500000-0xdf51ffff irq 22 at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <PCI-PCI bridge> irq 23 at device 1.0 on pci3
pci4: <PCI bus> on pcib4
ahci0: <Marvell 88SE9172 AHCI SATA controller> port 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xdf410000-0xdf4101ff irq 23 at device 0.0 on pci4
ahci0: AHCI v1.00 with 2 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
pcib5: <PCI-PCI bridge> irq 23 at device 5.0 on pci3
pci5: <PCI bus> on pcib5
pcib6: <PCI-PCI bridge> irq 23 at device 0.0 on pci5
pci6: <PCI bus> on pcib6
vgapci0: <VGA-compatible display> port 0xb000-0xb07f mem 0xde000000-0xdeffffff,0xdf000000-0xdf01ffff irq 23 at device 0.0 on pci6
vgapci0: Boot video device
pcib7: <PCI-PCI bridge> irq 21 at device 7.0 on pci3
pci7: <PCI bus> on pcib7
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xa000-0xa01f mem 0xdf300000-0xdf37ffff,0xdf380000-0xdf383fff irq 21 at device 0.0 on pci7
igb0: Using MSIX interrupts with 5 vectors
igb0: Ethernet address: d0:50:99:c0:e9:d9
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
pcib8: <PCI-PCI bridge> irq 23 at device 9.0 on pci3
pci8: <PCI bus> on pcib8
igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0x9000-0x901f mem 0xdf200000-0xdf27ffff,0xdf280000-0xdf283fff irq 23 at device 0.0 on pci8
igb1: Using MSIX interrupts with 5 vectors
igb1: Ethernet address: d0:50:99:c0:e9:da
igb1: Bound queue 0 to cpu 0
igb1: Bound queue 1 to cpu 1
igb1: Bound queue 2 to cpu 2
igb1: Bound queue 3 to cpu 3
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
pcib9: <ACPI PCI-PCI bridge> mem 0xdf700000-0xdf71ffff at device 4.0 on pci0
pci9: <ACPI PCI bus> on pcib9
ahci1: <Marvell 88SE9230 AHCI SATA controller> port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0xdf610000-0xdf6107ff irq 23 at device 0.0 on pci9
ahci1: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci1: quirks=0x900<NOBSYRES,ALTSIG>
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich4: <AHCI channel> at channel 2 on ahci1
ahcich5: <AHCI channel> at channel 3 on ahci1
ahcich6: <AHCI channel> at channel 4 on ahci1
ahcich7: <AHCI channel> at channel 5 on ahci1
ahcich8: <AHCI channel> at channel 6 on ahci1
ahcich9: <AHCI channel> at channel 7 on ahci1
pci0: <base peripheral, IOMMU> at device 15.0 (no driver attached)
ehci0: <Intel Avoton USB 2.0 controller> mem 0xdf763000-0xdf7633ff irq 23 at device 22.0 on pci0
usbus0: EHCI version 1.0
usbus0 on ehci0
ahci2: <Intel Avoton AHCI SATA controller> port 0xe0d0-0xe0d7,0xe0c0-0xe0c3,0xe0b0-0xe0b7,0xe0a0-0xe0a3,0xe040-0xe05f mem 0xdf762000-0xdf7627ff irq 19 at device 23.0 on pci0
ahci2: AHCI v1.30 with 4 3Gbps ports, Port Multiplier not supported
ahcich10: <AHCI channel> at channel 0 on ahci2
ahcich11: <AHCI channel> at channel 1 on ahci2
ahcich12: <AHCI channel> at channel 2 on ahci2
ahcich13: <AHCI channel> at channel 3 on ahci2
ahci3: <Intel Avoton AHCI SATA controller> port 0xe090-0xe097,0xe080-0xe083,0xe070-0xe077,0xe060-0xe063,0xe020-0xe03f mem 0xdf761000-0xdf7617ff irq 19 at device 24.0 on pci0
ahci3: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich14: <AHCI channel> at channel 0 on ahci3
ahcich15: <AHCI channel> at channel 1 on ahci3
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart2: <16550 or compatible> port 0x248-0x24f irq 3 on acpi0
orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: <Enhanced floppy controller> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
ppc0: cannot reserve I/O port range
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
coretemp2: <CPU On-Die Thermal Sensors> on cpu2
est2: <Enhanced SpeedStep Frequency Control> on cpu2
coretemp3: <CPU On-Die Thermal Sensors> on cpu3
est3: <Enhanced SpeedStep Frequency Control> on cpu3
usbus0: 480Mbps High Speed USB v2.0
Timecounters tick every 1.000 msec
nvme cam probe device init
ugen0.1: <Intel> at usbus0
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada0: Serial Number WD-WCC4N6FRJHAP
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: quirks=0x1<4K>
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada1: Serial Number WD-WCC4N6APRXTD
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors)
ada1: quirks=0x1<4K>
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada2: Serial Number WD-WCC4N4FCNAPP
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 2861588MB (5860533168 512 byte sectors)
ada2: quirks=0x1<4K>
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada3: Serial Number WD-WCC4N7NEC83H
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors)
ada3: quirks=0x1<4K>
ada4 at ahcich4 bus 0 scbus4 target 0 lun 0
ada4: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada4: Serial Number WD-WCC4N3HS636U
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 2861588MB (5860533168 512 byte sectors)
ada4: quirks=0x1<4K>
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada5: Serial Number WD-WCC4N5KC95K6
ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 2861588MB (5860533168 512 byte sectors)
ada5: quirks=0x1<4K>
ada6 at ahcich15 bus 0 scbus15 target 0 lun 0
ada6: <KINGSTON SV300S37A120G 60AABBF0> ATA8-ACS SATA 3.x device
ada6: Serial Number 50026B766C038E30
ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada6: Command Queueing enabled
ada6: 114473MB (234441648 512 byte sectors)
pass6 at ahcich9 bus 0 scbus9 target 0 lun 0
pass6: <Marvell Console 1.01> Removable Processor SCSI device
pass6: Serial Number HKDP221516WL
pass6: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #2 Launched!
Timecounter "TSC-low" frequency 1200028860 Hz quality 1000
Trying to mount root from ufs:/dev/ada6s1a [rw]...
WARNING: / was not properly dismounted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
uhub0: 8 ports with 8 removable, self powered
ugen0.2: <vendor 0x8087> at usbus0
uhub1: <vendor 0x8087 product 0x07db, class 9/0, rev 2.00/0.02, addr 2> on usbus0
uhub1: 4 ports with 4 removable, self powered
ugen0.3: <Apple, Inc.> at usbus0
uhub2: <Apple, Inc. Keyboard Hub, class 9/0, rev 2.00/96.15, addr 3> on usbus0
uhub2: 3 ports with 2 removable, bus powered
ugen0.4: <Apple, Inc> at usbus0
ukbd0: <Apple, Inc Apple Keyboard, class 0/0, rev 2.00/0.71, addr 4> on usbus0
kbd2 at ukbd0
ugen0.5: <American Megatrends Inc.> at usbus0
uhub3: <7-port Hub> on usbus0
uhub3: 5 ports with 5 removable, self powered
ugen0.6: <American Megatrends Inc.> at usbus0
ukbd1: <Keyboard Interface> on usbus0
kbd3 at ukbd1
uhid0: <Apple, Inc Apple Keyboard, class 0/0, rev 2.00/0.71, addr 4> on usbus0
ums0: <Mouse Interface> on usbus0
ums0: 3 buttons and [Z] coordinates ID=0
igb1: link state changed to UP
Limiting closed port RST response from 327 to 200 packets/sec
Limiting closed port RST response from 328 to 200 packets/sec
Limiting closed port RST response from 326 to 200 packets/sec
Limiting closed port RST response from 328 to 200 packets/sec

I'm out of ideas, and don't know what else to try. My next move was going to be trying a previous version of FreeBSD or another operating system entirely. That should at least help determine if it's hardware or software, but it could take weeks of running it until I'm convinced the problem has gone away. Can anyone else think of what I might be able to look at?

Thanks in advance!
Adam
 
Code:
Limiting closed port RST response from 327 to 200 packets/sec 
Limiting closed port RST response from 328 to 200 packets/sec 
Limiting closed port RST response from 326 to 200 packets/sec 
Limiting closed port RST response from 328 to 200 packets/sec
I'm wondering if these are symptoms or an indication of the cause. You can get the same messages when you're being DoS'ed for example. But they can also be the result of the machine hanging.
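Each of those lines carries two numbers: the rate the kernel observed and the limit it clamped the responses to. To see how often and how hard the limit is being hit, the pairs can be pulled out of a saved dmesg. A rough sketch follows; rst_rates is a hypothetical helper name, and the pattern assumes the message wording shown above.

```sh
#!/bin/sh
# Print "observed limit" for each RST rate-limit line read from stdin.
rst_rates() {
    sed -n 's/^Limiting closed port RST response from \([0-9][0-9]*\) to \([0-9][0-9]*\) packets\/sec.*/\1 \2/p'
}

# Sample input; on the real system this would be fed from the dmesg file.
printf '%s\n' \
    'Limiting closed port RST response from 327 to 200 packets/sec' \
    'igb1: link state changed to UP' | rst_rates
```

Lines that do not match the message are ignored, so the whole log can be piped through unchanged.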
 
So that dmesg.today file was last edited at 3.01am on 15 March, which would have been about two days into its uptime, and it hung 6 days later. Could be a symptom but I don't think it was the proximate cause.

I only have three other devices on my network (android phone, a desktop linux PC, and macbook pro laptop), none of which ought to be doing anything odd. The only thing I can think of which may have caused that is around that time I was playing with various ways of creating log files every second, which involved creating a folder with hundreds/thousands of files in. I then browsed to that folder on my mac over samba and nfs. I don't know how samba works but if it encounters a folder with a large number of files I can imagine it could flood the network with requests like that...?

I might try and recreate that later today. I also noticed some odd behaviour across samba and ZFS when using files that took their names directly from the output of date which I've been meaning to try and recreate, but that is entirely unrelated to my hanging problem so is a lower priority!
 
I don't know how samba works but if it encounters a folder with a large number of files I can imagine it could flood the network with requests like that...?
Possible but I would not expect the entire machine to lock up. It would be slow and Samba may throw some errors but it shouldn't lock everything up.
 
This is a tough one for sure. Here is how I would handle it.

First I would google the heck out of that specific motherboard, because even though things should just work... sometimes they don't. Maybe you can find a hint about a BIOS problem, recommended BIOS settings, etc. (You may have chosen this board after a ton of compatibility research, in which case never mind.)

Next I would flog the machine hard to see if I could cause the hang. Maybe something like make buildworld? Maybe a different test to get the network interfaces working hard? If you can trigger the hang with load, future testing would go faster.

Even though the RAM has passed memtest, I would pull one of the sticks, then the other. This doesn't really smell like a RAM issue, but...

I'd consider completely disabling all the zfs features and seeing how that goes. Misbehaving zfs can cause a lockup... at least I think it can based on posts I have seen here, I've only dabbled in it!

Good luck and please let us know how it goes.
 
Quick update, I've found a bunch of errors being reported on the 1.0V sensor when I went into the motherboard's IPMI web interface. Things like "Lower Critical - Going Low", "Lower Non-Recoverable - Going Low" and "Lower Non-Critical - Going Low". All of those sound bad but I have no idea what they mean!

The errors seem to be occurring after the system halts rather than before, from looking at the times they are logged, so again, might be a symptom/clue rather than a cause. I've contacted ASRock so will see what they say.
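As far as I understand IPMI threshold sensors, the three "Lower" events are progressively more severe limits, and "Going Low" means the reading dropped below that limit. A minimal sketch of the idea is below. The threshold values are HYPOTHETICAL, picked only for illustration; the real ones for the 1.0V sensor would have to be read from the board itself.

```sh
#!/bin/sh
# Classify a reading (in millivolts) against three lower thresholds.
# All three values are HYPOTHETICAL, not taken from this board.
LNC_MV=950   # lower non-critical (hypothetical)
LC_MV=900    # lower critical (hypothetical)
LNR_MV=850   # lower non-recoverable (hypothetical)

classify_mv() {
    if [ "$1" -lt "$LNR_MV" ]; then
        echo "lower non-recoverable"
    elif [ "$1" -lt "$LC_MV" ]; then
        echo "lower critical"
    elif [ "$1" -lt "$LNC_MV" ]; then
        echo "lower non-critical"
    else
        echo "ok"
    fi
}

classify_mv 1000
classify_mv 920
```

With these made-up limits, a healthy 1000 mV reading is "ok" and a sagging 920 mV reading only trips the mildest threshold.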
 
Did you compile the kernel with ALTQ, the QoS framework for PF?
Something like this?

Code:
options         ALTQ            # ALTQ can be used with PF to provide Quality of Service (QOS)
options         ALTQ_CBQ        # Class Based Queuing (CBQ)
options         ALTQ_RED        # Random Early Detection (RED)
options         ALTQ_RIO        # RED In/Out
options         ALTQ_HFSC       # Hierarchical Packet Scheduler (HFSC)
options         ALTQ_PRIQ       # Priority Queuing (PRIQ)
 
Hi IPTrace,

I just downloaded and installed the most recent memstick installer image for amd64, so I'm not sure exactly how the kernel was compiled. Can you explain a bit more about what those options mean and how I might be able to check how the kernel is configured?
 
These options are added manually to the kernel configuration file /usr/src/sys/amd64/conf/GENERIC, or to a custom file in the same directory.
So if you don't know about them, I'm sure you don't have them.

If you used a clean memstick installer, no ALTQ options are compiled in.
ALTQ provides Quality of Service (QoS) for Packet Filter (PF, the firewall derived from OpenBSD).
It means you can "sort" traffic based on IP, protocol (TCP/UDP...) etc., and speed up some traffic or slow it down.

https://www.freebsd.org/doc/handbook/firewalls-pf.html
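To check how a kernel was configured, one option is to search its configuration text for ALTQ lines. The sketch below reads the text from stdin so it can be tried on any config file; has_altq is a hypothetical helper name. On a running system, the text should be available from the kern.conftxt sysctl, assuming the kernel was built with INCLUDE_CONFIG_FILE.

```sh
#!/bin/sh
# Report whether a kernel config text (read from stdin) contains ALTQ options.
has_altq() {
    if grep -q '^options[[:space:]]*ALTQ'; then
        echo "ALTQ options present"
    else
        echo "no ALTQ options"
    fi
}

# On a running FreeBSD system (assumes INCLUDE_CONFIG_FILE was enabled):
#   sysctl -n kern.conftxt | has_altq
# Sample input without any ALTQ lines:
printf 'options\tSCHED_ULE\noptions\tINET\n' | has_altq
```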
 
When I say "hang", I mean the terminal doesn't echo keystrokes and network connections die (ssh and ping fail). The first few times, it hung within about 48 hours of boot. The last time it happened (yesterday) it had been running for a little over 8 days.
I go into some first-level hang troubleshooting here.
I then set up a script to log the output of top to a new file every second so I could get a second-by-second snapshot of everything and again, nothing unusual there that I could see -- the last successful output is pasted below.
As mentioned in my linked post above, I'd suggest running top(1) or systat(1) on the console. If your hang involves the disk subsystem, the console output should continue even though the system can no longer read or write to the disk.
Code:
WARNING: L1 data cache covers less APIC IDs than a core
0 < 1
This would seem to indicate that something is getting mis-reported to the kernel. I don't know if it is related to the issue you're seeing, but you should see if the latest BIOS update fixes it. Also, check to see if this motherboard is affected by the C2000 fault and if so, get the manufacturer to replace it for you.
 
Quick update: shortly after this, the system refused to even POST.

IPMI still worked and it was still showing occasional errors on the 1.0V rail. I contacted ASRock, who told me to update the BIOS and BMC, but by then, that function wasn't working. And shortly after that, the 1.0V rail had failed entirely and was reporting 0.08V. I'm currently waiting to go through the RMA process.

That board has two 'known' issues, mentioned in these forums and elsewhere on the internet. One is a BIOS/BMC Watchdog Timer issue, whereby the watchdog keeps writing to the firmware every 10 seconds, and after about 2 years uses up all the firmware write-cycles. When that happens, the system won't boot anymore. This has, I think, been fixed by the latest BIOS/BMC updates.
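To put the watchdog issue in perspective, here is the rough number of firmware writes that one write every 10 seconds adds up to. It is plain arithmetic and says nothing about the actual endurance of the flash part, which I don't know; writes_in_years is just an illustrative name.

```sh
#!/bin/sh
# Number of writes at one write every 10 seconds, sustained for N years.
writes_in_years() {
    echo $(( $1 * 365 * 24 * 3600 / 10 ))
}

writes_in_years 2   # prints 6307200
```

So two years of continuous uptime works out to a little over six million writes.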

The other is the C2000 fault Terry pointed out. There's less information out about it, but it's related to a hardware defect in a clock signal component, causing it to physically wear out after a "long" period of use, where long can be 18-40 months depending on use.

My board was bought new, but had a manufacturing date of around June/July 2014, so it is physically old enough to suffer from either of these issues, especially if I had actually received a refurbished/used board that has been reset/wiped. Or it could be a completely separate hardware/manufacturing error. I have no idea if either of those two known issues would cause the 1.0V rail to fail like that.

In any case, ASRock technical support were okay: not very fast at replying to email and not massively helpful, but they weren't dismissive and usually replied within 24 hours. They didn't mention either of those two issues, for example, or how they might present themselves. The vendor was a bit more friendly and helpful, although no more knowledgeable. They've arranged for a courier to collect the board and will deliver a new one to me as soon as they are back in stock. Although given all this, I'm a bit tempted to use a different board.

Oh, and the weird network flooding thing appears to be a dodgy netgear switch on my network. If I plug and un-plug network cables too much (as I was doing whilst trying to diagnose things) then it hangs and needs to be unplugged from the power.

I'm happy that FreeBSD doesn't seem to have been the problem! From what little time I spent with it, I really like its simplicity. Can't wait to get a new board so I can continue to set up my NAS and play!

TL;DR: Hardware problem, replacing a faulty motherboard.

Many thanks to everyone who offered advice!

Adam
 