Hi all,
I've just built a 6x3TB raidz2 FreeBSD NAS as a project to learn about FreeBSD, full specs below. The system hangs seemingly randomly and I can't figure out why, or even whether it's a hardware or software problem, so I would be very grateful for any advice on how to diagnose it.
When I say "hang", I mean the terminal doesn't echo keystrokes and network connections die (ssh and ping fail). The first few times, it hung within about 48 hours of boot. The last time it happened (yesterday) it had been running for a little over 8 days.
I log CPU and hard drive temps every 5 minutes and there's no unusual activity there. I then set up a script to log the output of top to a new file every second so I could get a second-by-second snapshot of everything and again, nothing unusual there that I could see -- the last successful output is pasted below. I've run memtest on the memory and full SMART tests on the hard drives, no problems reported.
The only clue I have so far is in my second-by-second logging of top. As I say, I ran:
while sleep 1; do top -b > "$(date "+File%M%S.txt")"; done
to save the output round-robin style. The last file with a successful top output is File3737.txt, at 11:37:37 (pasted below). After that it creates File3738.txt to File3745.txt, but with zero file size (so I assume 'top' has failed but the shell loop is still running). It skips File3746.txt and File3804.txt entirely (i.e. the files from the previous hour are still there), and the last zero-byte file it creates is File3810.txt.
So this means it takes about 33 seconds to crash, counting from the last good snapshot (11:37:37) to the last file created (11:38:10). It's not instant. The only thing I can think of is a race condition that quickly uses up some system resource, but I don't know how to diagnose this further.
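In case it helps anyone check the timing, here is a rough sketch of how the gap between two of those file names can be worked out. The function names (mmss_to_seconds, elapsed_mmss) are made up for this post, and it assumes the MMSS naming from the loop above, which wraps every hour.

```sh
#!/bin/sh
# Hypothetical helpers: not part of the logging loop, just for reading its file names.

# Convert an MMSS stamp (as in FileMMSS.txt) to seconds past the hour.
mmss_to_seconds() {
    mm=${1%??}   # first two digits: minutes
    ss=${1#??}   # last two digits: seconds
    # Strip one leading zero so values like 08 are not parsed as octal.
    mm=${mm#0}
    ss=${ss#0}
    echo $(( mm * 60 + ss ))
}

# Seconds from the first stamp to the second, allowing for the hourly wrap.
elapsed_mmss() {
    start=$(mmss_to_seconds "$1")
    end=$(mmss_to_seconds "$2")
    diff=$(( end - start ))
    # A negative gap means the second stamp is in the next hour.
    if [ "$diff" -lt 0 ]; then
        diff=$(( diff + 3600 ))
    fi
    echo "$diff"
}
```

Running elapsed_mmss 3737 3810 gives the number of seconds between the last good snapshot and the last file created.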
Hardware is:
ASRock C2550D4I
16GB Kingston ECC RAM (2x KVR16E11/8)
6x3TB WD Reds
1x120GB Kingston V300 SSD
Seasonic SS-300SFD 80 Plus PSU
Last top file is
Code:
last pid: 81320; load averages: 0.12, 0.08, 0.08 up 8+14:40:08 11:37:37
26 processes: 1 running, 25 sleeping
Mem: 1692K Active, 75M Inact, 13G Wired, 374M Buf, 2943M Free
ARC: 11G Total, 2283M MFU, 9415M MRU, 18K Anon, 27M Header, 20M Other
Swap: 3881M Total, 3881M Free
PID   USERNAME THR PRI NICE   SIZE   RES STATE  C  TIME  WCPU COMMAND
18043 adam       1  52    0 13144K 2884K wait   2 10:38 0.00% sh
17068 root       1  20    0 20120K 3520K select 1  6:54 0.00% top
14910 root      32  20    0  8304K 2576K rpcsvc 2  1:01 0.00% nfsd
805   root       1  20    0 20600K 6244K select 0  0:19 0.00% sendmail
812   root       1  20    0 12564K 2452K nanslp 2  0:03 0.00% cron
657   root       1  20    0 10472K 2404K select 2  0:03 0.00% syslogd
1168  root       1  20    0 10424K 2312K select 3  0:01 0.00% rpcbind
407   root       1  20    0  9512K 4992K select 0  0:00 0.00% devd
808   smmsp      1  20    0 20600K 5928K pause  1  0:00 0.00% sendmail
14909 root       1  20    0 14448K 3944K select 0  0:00 0.00% nfsd
1184  root       1  20    0 16612K 4336K select 1  0:00 0.00% mountd
802   root       1  20    0 55676K 7032K select 2  0:00 0.00% sshd
859   root       1  20    0 43732K 2944K wait   1  0:00 0.00% login
18033 root       1  20    0 43732K 2968K wait   1  0:00 0.00% login
500   root       1  46    0 10592K 2368K select 0  0:00 0.00% dhclient
16955 root       1  20    0 19600K 3584K pause  0  0:00 0.00% csh
569   _dhcp      1  20    0 10592K 2488K select 3  0:00 0.00% dhclient
81320 adam       1  72    0 20120K 3000K CPU3   3  0:00 0.00% top
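For comparing the memory figures across all the saved snapshots rather than just the last one, something like the following could be used. This is only a sketch: mem_free is a name invented for this post, and it assumes every snapshot has a "Mem:" line laid out like the one above, with each figure followed by its label.

```sh
#!/bin/sh
# Hypothetical helper: print the "Free" figure from the Mem: line
# of a single top snapshot read on stdin.
mem_free() {
    awk '/^Mem:/ {
        for (i = 2; i <= NF; i++) {
            # Each figure is followed by its label, e.g. "2943M Free".
            if ($i ~ /^Free/) { print $(i - 1); exit }
        }
    }'
}

# Example use: one line per non-empty snapshot in the current directory.
# for f in File*.txt; do
#     [ -s "$f" ] && printf '%s %s\n' "$f" "$(mem_free < "$f")"
# done
```

The zero-byte files are skipped by the -s test, so only snapshots where top actually produced output are listed.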
/boot/loader.conf
Code:
coretemp_load="YES"
/var/log/dmesg.today
Code:
Copyright (c) 1992-2016 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: Intel(R) Atom(TM) CPU C2550 @ 2.40GHz (2400.06-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x406d8 Family=0x6 Model=0x4d Stepping=8
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,AESNI,RDRAND>
AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x101<LAHF,Prefetch>
Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
TSC: P-state invariant, performance statistics
real memory = 17179869184 (16384 MB)
avail memory = 16554504192 (15787 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <INTEL TIANO >
WARNING: L1 data cache covers less APIC IDs than a core
0 < 1
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-23 on motherboard
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff8101d970, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <ALASKA A M I > on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: _OSC returned error 0x10
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> mem 0xdf740000-0xdf75ffff irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> mem 0xdf720000-0xdf73ffff irq 20 at device 3.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> mem 0xdf500000-0xdf51ffff irq 22 at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <PCI-PCI bridge> irq 23 at device 1.0 on pci3
pci4: <PCI bus> on pcib4
ahci0: <Marvell 88SE9172 AHCI SATA controller> port 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xdf410000-0xdf4101ff irq 23 at device 0.0 on pci4
ahci0: AHCI v1.00 with 2 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
pcib5: <PCI-PCI bridge> irq 23 at device 5.0 on pci3
pci5: <PCI bus> on pcib5
pcib6: <PCI-PCI bridge> irq 23 at device 0.0 on pci5
pci6: <PCI bus> on pcib6
vgapci0: <VGA-compatible display> port 0xb000-0xb07f mem 0xde000000-0xdeffffff,0xdf000000-0xdf01ffff irq 23 at device 0.0 on pci6
vgapci0: Boot video device
pcib7: <PCI-PCI bridge> irq 21 at device 7.0 on pci3
pci7: <PCI bus> on pcib7
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xa000-0xa01f mem 0xdf300000-0xdf37ffff,0xdf380000-0xdf383fff irq 21 at device 0.0 on pci7
igb0: Using MSIX interrupts with 5 vectors
igb0: Ethernet address: d0:50:99:c0:e9:d9
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
pcib8: <PCI-PCI bridge> irq 23 at device 9.0 on pci3
pci8: <PCI bus> on pcib8
igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0x9000-0x901f mem 0xdf200000-0xdf27ffff,0xdf280000-0xdf283fff irq 23 at device 0.0 on pci8
igb1: Using MSIX interrupts with 5 vectors
igb1: Ethernet address: d0:50:99:c0:e9:da
igb1: Bound queue 0 to cpu 0
igb1: Bound queue 1 to cpu 1
igb1: Bound queue 2 to cpu 2
igb1: Bound queue 3 to cpu 3
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
pcib9: <ACPI PCI-PCI bridge> mem 0xdf700000-0xdf71ffff at device 4.0 on pci0
pci9: <ACPI PCI bus> on pcib9
ahci1: <Marvell 88SE9230 AHCI SATA controller> port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0xdf610000-0xdf6107ff irq 23 at device 0.0 on pci9
ahci1: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci1: quirks=0x900<NOBSYRES,ALTSIG>
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich4: <AHCI channel> at channel 2 on ahci1
ahcich5: <AHCI channel> at channel 3 on ahci1
ahcich6: <AHCI channel> at channel 4 on ahci1
ahcich7: <AHCI channel> at channel 5 on ahci1
ahcich8: <AHCI channel> at channel 6 on ahci1
ahcich9: <AHCI channel> at channel 7 on ahci1
pci0: <base peripheral, IOMMU> at device 15.0 (no driver attached)
ehci0: <Intel Avoton USB 2.0 controller> mem 0xdf763000-0xdf7633ff irq 23 at device 22.0 on pci0
usbus0: EHCI version 1.0
usbus0 on ehci0
ahci2: <Intel Avoton AHCI SATA controller> port 0xe0d0-0xe0d7,0xe0c0-0xe0c3,0xe0b0-0xe0b7,0xe0a0-0xe0a3,0xe040-0xe05f mem 0xdf762000-0xdf7627ff irq 19 at device 23.0 on pci0
ahci2: AHCI v1.30 with 4 3Gbps ports, Port Multiplier not supported
ahcich10: <AHCI channel> at channel 0 on ahci2
ahcich11: <AHCI channel> at channel 1 on ahci2
ahcich12: <AHCI channel> at channel 2 on ahci2
ahcich13: <AHCI channel> at channel 3 on ahci2
ahci3: <Intel Avoton AHCI SATA controller> port 0xe090-0xe097,0xe080-0xe083,0xe070-0xe077,0xe060-0xe063,0xe020-0xe03f mem 0xdf761000-0xdf7617ff irq 19 at device 24.0 on pci0
ahci3: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich14: <AHCI channel> at channel 0 on ahci3
ahcich15: <AHCI channel> at channel 1 on ahci3
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart2: <16550 or compatible> port 0x248-0x24f irq 3 on acpi0
orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: <Enhanced floppy controller> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
ppc0: cannot reserve I/O port range
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
coretemp2: <CPU On-Die Thermal Sensors> on cpu2
est2: <Enhanced SpeedStep Frequency Control> on cpu2
coretemp3: <CPU On-Die Thermal Sensors> on cpu3
est3: <Enhanced SpeedStep Frequency Control> on cpu3
usbus0: 480Mbps High Speed USB v2.0
Timecounters tick every 1.000 msec
nvme cam probe device init
ugen0.1: <Intel> at usbus0
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada0: Serial Number WD-WCC4N6FRJHAP
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: quirks=0x1<4K>
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada1: Serial Number WD-WCC4N6APRXTD
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors)
ada1: quirks=0x1<4K>
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada2: Serial Number WD-WCC4N4FCNAPP
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 2861588MB (5860533168 512 byte sectors)
ada2: quirks=0x1<4K>
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada3: Serial Number WD-WCC4N7NEC83H
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors)
ada3: quirks=0x1<4K>
ada4 at ahcich4 bus 0 scbus4 target 0 lun 0
ada4: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada4: Serial Number WD-WCC4N3HS636U
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 2861588MB (5860533168 512 byte sectors)
ada4: quirks=0x1<4K>
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada5: Serial Number WD-WCC4N5KC95K6
ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 2861588MB (5860533168 512 byte sectors)
ada5: quirks=0x1<4K>
ada6 at ahcich15 bus 0 scbus15 target 0 lun 0
ada6: <KINGSTON SV300S37A120G 60AABBF0> ATA8-ACS SATA 3.x device
ada6: Serial Number 50026B766C038E30
ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada6: Command Queueing enabled
ada6: 114473MB (234441648 512 byte sectors)
pass6 at ahcich9 bus 0 scbus9 target 0 lun 0
pass6: <Marvell Console 1.01> Removable Processor SCSI device
pass6: Serial Number HKDP221516WL
pass6: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #2 Launched!
Timecounter "TSC-low" frequency 1200028860 Hz quality 1000
Trying to mount root from ufs:/dev/ada6s1a [rw]...
WARNING: / was not properly dismounted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
uhub0: 8 ports with 8 removable, self powered
ugen0.2: <vendor 0x8087> at usbus0
uhub1: <vendor 0x8087 product 0x07db, class 9/0, rev 2.00/0.02, addr 2> on usbus0
uhub1: 4 ports with 4 removable, self powered
ugen0.3: <Apple, Inc.> at usbus0
uhub2: <Apple, Inc. Keyboard Hub, class 9/0, rev 2.00/96.15, addr 3> on usbus0
uhub2: 3 ports with 2 removable, bus powered
ugen0.4: <Apple, Inc> at usbus0
ukbd0: <Apple, Inc Apple Keyboard, class 0/0, rev 2.00/0.71, addr 4> on usbus0
kbd2 at ukbd0
ugen0.5: <American Megatrends Inc.> at usbus0
uhub3: <7-port Hub> on usbus0
uhub3: 5 ports with 5 removable, self powered
ugen0.6: <American Megatrends Inc.> at usbus0
ukbd1: <Keyboard Interface> on usbus0
kbd3 at ukbd1
uhid0: <Apple, Inc Apple Keyboard, class 0/0, rev 2.00/0.71, addr 4> on usbus0
ums0: <Mouse Interface> on usbus0
ums0: 3 buttons and [Z] coordinates ID=0
igb1: link state changed to UP
Limiting closed port RST response from 327 to 200 packets/sec
Limiting closed port RST response from 328 to 200 packets/sec
Limiting closed port RST response from 326 to 200 packets/sec
Limiting closed port RST response from 328 to 200 packets/sec
I'm out of ideas, and don't know what else to try. My next move was going to be trying a previous version of FreeBSD or another operating system entirely. That should at least help determine if it's hardware or software, but it could take weeks of running it until I'm convinced the problem has gone away. Can anyone else think of what I might be able to look at?
Thanks in advance!
Adam