Solved FreeBSD 10.3 kernel panic / Intel NUC 6th gen

Hello all,

One month ago, I bought an Intel NUC 6th gen (NUC6i5SYH) with a 2.5" 2 TB hard drive and 8 GB of RAM.
I installed a Plex Media Server to host my digital life :)
Everything went fine for weeks, but suddenly, my FreeBSD started to have kernel panics.

I've been happily using FreeBSD for years now, and I know how to deal with most of the problems I run into, but I confess I don't have the skills to debug something like a kernel panic...
I don't know where to start to understand what's going on.

Which files would be useful, and where should I look to narrow things down and find what is making my FreeBSD go crazy?

Thanks in advance,

--
Léo.
 
If nothing has changed (updates, etc.) it's likely caused by a hardware failure.
 
Hi SirDice,

Nothing has really changed, and my FreeBSD is up to date (via freebsd-update).
I thought the same, so I've replaced the hard drive with a new one.
I've replaced the RAM with another module.
I reinstalled FreeBSD from scratch.

Same problem...

Do you think it could be the NUC unit itself?
Is there a way to get any clue about where to look?
Maybe it's hardware, or maybe it's a BIOS setting?
And if I have to exchange it, the vendor will ask for more details about why I want it replaced.

Regards,

--
Léo.
 
I read on a few forums that I could find more information about a crash in /var/crash.

Inside /var/crash/info.core... I have:
Code:
Unread portion of the kernel message buffer:
panic: handle_written_inodeblock: Invalid link count 65535 for inodedep 0xfffff80007a64200
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff80b91632 at softdep_disk_write_complete+0x1902
#4 0xffffffff809e1053 at bufdone_finish+0x33
#5 0xffffffff809e0eb7 at bufdone+0x77
#6 0xffffffff808b00f4 at g_io_deliver+0x244
#7 0xffffffff808b00f4 at g_io_deliver+0x244
#8 0xffffffff808addbb at g_disk_done+0xfb
#9 0xffffffff8030dafc at adadone+0x45c
#10 0xffffffff802ec59d at xpt_done_process+0x5ad
#11 0xffffffff8039b5b5 at ahci_ch_intr_direct+0x105
#12 0xffffffff80397bbd at ahci_intr_one+0x2d
#13 0xffffffff8091c99b at intr_event_execute_handlers+0xab
#14 0xffffffff8091cde6 at ithread_loop+0x96
#15 0xffffffff8091a4ea at fork_exit+0x9a
#16 0xffffffff80d3be0e at fork_trampoline+0xe
Uptime: 1m32s
Dumping 446 out of 8055 MB:..4%..11%..22%..33%..44%..51%..61%..72%..83%..94%

Reading symbols from /boot/kernel/fdescfs.ko.symbols...done.
Loaded symbols for /boot/kernel/fdescfs.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219  pcpu.h: No such file or directory.
  in pcpu.h
(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80950cc2 in kern_reboot (howto=260)
  at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff809510a5 in vpanic (fmt=<value optimized out>,
  ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80950f33 in panic (fmt=0x0)
  at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff80b91632 in softdep_disk_write_complete (
  bp=<value optimized out>) at /usr/src/sys/ufs/ffs/ffs_softdep.c:11442
#5  0xffffffff809e1053 in bufdone_finish (bp=0xfffffe01e7e43170) at buf.h:421
#6  0xffffffff809e0eb7 in bufdone (bp=<value optimized out>)
  at /usr/src/sys/kern/vfs_bio.c:3834
#7  0xffffffff808b00f4 in g_io_deliver (bp=0xfffff801321652e8,
  error=<value optimized out>) at /usr/src/sys/geom/geom_io.c:681
#8  0xffffffff808b00f4 in g_io_deliver (bp=0xfffff801321651f0,
  error=<value optimized out>) at /usr/src/sys/geom/geom_io.c:681
#9  0xffffffff808addbb in g_disk_done (bp=0xfffff801321653e0)
  at /usr/src/sys/geom/geom_disk.c:254
... ... ...

That's a bunch of lines, but I don't know if there's anything useful in them.

I feel quite helpless faced with this much information... :)
If someone understands this better than I do, don't hesitate!

--
Léo.
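
For anyone reading a dump like the one above: the two parts that usually matter most are the panic: line and the chain of function names under "KDB: stack backtrace". Here is a minimal sketch of a helper that prints only those, assuming a POSIX shell and awk. The name panic_summary is made up for illustration, and the file to pass is whichever text summary sits next to the dump in /var/crash (the exact file name differs per dump).

```shell
#!/bin/sh
# panic_summary FILE  (hypothetical helper, not part of FreeBSD)
# Print the panic message and the function names of the KDB backtrace frames
# found in a saved crash summary text file.
panic_summary() {
    awk '
        /^panic:/ { print; next }            # one-line reason for the panic
        /^#[0-9]+ 0x[0-9a-f]+ at / {         # frame format: #3 0x... at func+0x1902
            split($4, f, "[+]")              # strip the +offset after the function name
            print "  " $1 " " f[1]
        }
    ' "$1"
}

# Example call; the path is an assumption, adjust it to the real file:
#   panic_summary /var/crash/<summary file>
```

Read bottom to top, the frames in this thread go from the AHCI interrupt handler (ahci_intr_one) through CAM and GEOM (adadone, g_disk_done, g_io_deliver) into softdep_disk_write_complete, so the panic fires when a disk write completes and the soft updates code finds an inode with an impossible link count.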
 
Code:
handle_written_inodeblock: Invalid link count 65535 for inodedep 0xfffff80007a64200
Looks like filesystem corruption that's so bad the system panics. Boot to single user mode and run fsck -y.

Do you regularly turn this machine off? Make sure you do a proper shutdown(8). Just turning it off the hard way can and will cause filesystem errors.
 
Hi SirDice,

Thanks for the feedback.
I booted into single user mode and ran fsck -y successfully.
I was able to boot again and the system ran well for 10 hours or so.

Then it crashed again.

I never turn off this machine, but if I had to, I would absolutely use shutdown to properly shut it down.
Nevertheless, each kernel panic causes filesystem corruption because I have to power it off the hard way.

Are there any tests I can run, or tools that could help me spot what's causing the panic?

Thanks,

--
Léo.
 
Yeah, if it regularly panics it'll definitely screw up the filesystem. But it does raise the question: are the crashes the result of filesystem corruption, or is the filesystem corruption the result of the crashes? Or maybe a bit of both: a crash caused filesystem corruption, which in turn causes more crashes and more corruption.

I would definitely try sysutils/smartmontools to see if the disk is still good. If you have lots of "uncorrectable" errors it might be time to replace the disk. I'd also try something like dd if=/dev/ada0 of=/dev/null. This will read the entire disk, end to end, and copy it to /dev/null. Basically a no-op, but it does read everything and will croak if there are read errors.
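
To make the "uncorrectable errors" check concrete, here is a rough sketch of a filter for the attribute table that smartctl -A prints. It assumes a POSIX shell and awk; the name smart_flags and the choice of four attributes are my own. A non-zero raw value on the first three is commonly read as a media problem, and a rising UDMA_CRC_Error_Count as a cabling or interface problem, but treat it as a hint rather than a verdict.

```shell
#!/bin/sh
# smart_flags  (hypothetical helper)
# Read `smartctl -A` output on stdin and report a few attributes whose raw value
# (10th column) should normally stay at zero on a healthy drive.
smart_flags() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
            status = ($10 + 0 > 0) ? "CHECK" : "ok"
            printf "%-24s raw=%s %s\n", $2, $10, status
        }
    '
}

# On the affected machine, as root, with sysutils/smartmontools installed:
#   smartctl -A /dev/ada0 | smart_flags
```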
 
Before posting my message here, I replaced the HDD with a brand new one and did a fresh FreeBSD 10.3 install.
As the problem still occurs after a clean reinstall on a brand new HDD, I suppose the panic is related to the hardware, and not to the filesystem corruption.

I followed your advice, installed sysutils/smartmontools, and ran smartctl -a /dev/ada0:

Code:
[root@nikky /var/crash]# smartctl -a /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Seagate Samsung SpinPoint M9T
Device Model:  ST2000LM003 HN-M201RAD
Serial Number:  S377J9DGB00142
LU WWN Device Id: 5 0004cf 2111f1ab8
Firmware Version: 2BE10001
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Form Factor:  2.5 inches
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Wed Jul 13 18:00:25 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
           was never started.
           Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0)   The previous self-test routine completed
           without error or no self-test has ever
           been run.
Total time to complete Offline
data collection:      (22200) seconds.
Offline data collection
capabilities:         (0x5b) SMART execute Offline immediate.
           Auto Offline data collection on/off support.
           Suspend Offline collection upon new
           command.
           Offline surface scan supported.
           Self-test supported.
           No Conveyance Self-test supported.
           Selective Self-test supported.
SMART capabilities:  (0x0003)   Saves SMART data before entering
           power-saving mode.
           Supports SMART auto save timer.
Error logging capability:  (0x01)   Error logging supported.
           General Purpose Logging supported.
Short self-test routine
recommended polling time:     (  1) minutes.
Extended self-test routine
recommended polling time:     ( 370) minutes.
SCT capabilities:     (0x003f)   SCT Status supported.
           SCT Error Recovery Control supported.
           SCT Feature Control supported.
           SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  100  100  051  Pre-fail  Always  -  0
  2 Throughput_Performance  0x0026  252  252  000  Old_age  Always  -  0
  3 Spin_Up_Time  0x0023  091  091  025  Pre-fail  Always  -  2799
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  5
  5 Reallocated_Sector_Ct  0x0033  252  252  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  252  252  051  Old_age  Always  -  0
  8 Seek_Time_Performance  0x0024  252  252  015  Old_age  Offline  -  0
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  79
 10 Spin_Retry_Count  0x0032  252  252  051  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  5
191 G-Sense_Error_Rate  0x0022  252  252  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0022  252  252  000  Old_age  Always  -  0
194 Temperature_Celsius  0x0002  064  060  000  Old_age  Always  -  34 (Min/Max 26/44)
195 Hardware_ECC_Recovered  0x003a  100  100  000  Old_age  Always  -  0
196 Reallocated_Event_Count 0x0032  252  252  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  252  252  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  252  252  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0036  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x002a  100  100  000  Old_age  Always  -  1
223 Load_Retry_Count  0x0032  252  252  000  Old_age  Always  -  0
225 Load_Cycle_Count  0x0032  099  099  000  Old_age  Always  -  12452
241 Total_LBAs_Written  0x0032  097  095  000  Old_age  Always  -  4201588
242 Total_LBAs_Read  0x0032  095  095  000  Old_age  Always  -  7537553

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Completed [00% left] (0-65535)
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Seems everything's fine...

I also started dd if=/dev/ada0 of=/dev/null; it seems to be quite a long process, so I'll post the results later.

--
Léo.
 
What's the idea behind that dd if=/dev/ada0 of=/dev/null? :cool:

You should run smartctl -t long /dev/ada0 and see if it completes.
 
You should run smartctl -t long /dev/ada0 and see if it completes.

There's nothing wrong with doing that, but that command does not necessarily test the entire drive. The dd (which I'd be quite tempted to add bs=1M to, to speed things up) verifies that 100% of the drive is readable. The smartctl test does not do that, although it may well do something else useful which a simple dd test does not do.

Additionally, dd is the traditional standard way of testing any form of block device, dating back to the origins of Unix, and it works for all block devices, not just drives with SMART support.
 
AFAIK long test examines the whole disk surface.
It might cover 100% of the blocks, it might not. Being an opaque test provided by a service which is less standards-based and more vendor-choice, you can't easily be certain exactly what it has tested, only the status it chooses to return. The dd test is guaranteed to cover 100% of the currently mapped blocks (drives have unmapped spare blocks as well).

In electronics manufacturing, it is not unusual to do less than fully comprehensive testing based on what is statistically sufficient to catch errors up to a chosen level of reliability. A vendor could easily decide to implement the long test without testing 100% of the blocks, but instead perform a randomised sampling which they believe will catch errors 99.99% of the time, for example. They may decide that the drive's built in bad block remapping will be sufficient to cover most of the cases which might be missed. That decision could be made as a result of a desire to speed up factory testing and reduce costs. This is very much more likely to be true of drives which are optimised to be cheap, rather than for performance or reliability.

When odd things are happening, it's better to do both any available SMART tests (which may well test more things than can be done via normal ATA/SCSI commands), and the basic traditional dd test. Additionally, the dd test may expose problems outside the drive itself (e.g. if there are cabling or controller issues, basically anything between the CPU and the drive's controller).
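
Put together, running both kinds of test could look roughly like the sketch below. The wrapper name read_all is made up. The block size is spelled out in bytes (1 MiB) so it does not depend on which size suffixes a given dd accepts, and /dev/ada0 is simply the device name used earlier in this thread.

```shell
#!/bin/sh
# read_all TARGET  (hypothetical wrapper)
# Sequentially read a whole device or file and discard the data.
# dd exits non-zero if a read fails, which is the point of the test.
read_all() {
    dd if="$1" of=/dev/null bs=1048576
}

# On the affected machine, as root:
#   read_all /dev/ada0                # full read through controller, cable and drive
#   smartctl -t long /dev/ada0        # start the drive's extended self-test
#   smartctl -l selftest /dev/ada0    # later: show the self-test log and its result
```

The extended self-test runs inside the drive, so only the dd pass exercises the path between the CPU and the drive.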
 
Hi there,

The test has finally finished.
Here is the result:

Code:
[root@nikky /var/crash]# dd if=/dev/ada0 of=/dev/null
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes transferred in 164576.526038 secs (12154825 bytes/sec)
You have new mail in /var/mail/root
[root@nikky /var/crash]#

Amazing that the server didn't panic for nearly two days during that test.
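
For reference, the numbers in that dd output can be checked with a little arithmetic. This is only a sketch (dd_stats is a made-up name; POSIX shell and awk assumed). It shows that each record was 512 bytes, dd's default block size, which is probably a large part of why the run took so long and why a larger bs was suggested earlier.

```shell
#!/bin/sh
# dd_stats BYTES SECONDS RECORDS  (hypothetical helper)
# Turn the figures from a dd summary into hours, MiB/s and bytes per record.
dd_stats() {
    awk -v bytes="$1" -v secs="$2" -v recs="$3" 'BEGIN {
        printf "%.1f hours, %.1f MiB/s, %d bytes per record\n",
               secs / 3600, bytes / secs / 1048576, bytes / recs
    }'
}

# Figures copied from the dd output above:
dd_stats 2000398934016 164576.526038 3907029168
# prints: 45.7 hours, 11.6 MiB/s, 512 bytes per record
```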

The last time it panicked, it was during the start of net-p2p/transmission-daemon or news/sabnzbdplus (both download tools).
And the time before that, it was during a movie transcode with multimedia/plexmediaserver (home media center).

It seems that heavy file I/O triggers the crash.
Could the SATA driver in FreeBSD 10.3 be lacking in some way?

--
Léo.
 
Hi tingo,

Thanks for the reply.
The 2.5" hard drive is in an Intel NUC (NUC6i5SYH), so I'm not sure I can replace anything... can I?

--
Léo.
 
I don't know Intel NUCs personally, but machines in the same size class from other vendors have a proprietary cable to connect the hard drive. On the machines I know, this cable can be replaced.
 
Is the BIOS / UEFI on the machine updated to the latest one available? If not, a BIOS update is cheaper than a new cable...
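
To see which BIOS version is currently installed without rebooting into setup, the SMBIOS values that the FreeBSD loader exports can be read with kenv. This is a small sketch: bios_info is a made-up name, and the variables may be empty on some boot setups, in which case sysutils/dmidecode is another option.

```shell
#!/bin/sh
# bios_info  (hypothetical helper)
# Print BIOS vendor, version and release date from the kernel environment.
bios_info() {
    if ! command -v kenv >/dev/null 2>&1; then
        echo "kenv not found (not a FreeBSD system?)" >&2
        return 1
    fi
    for key in smbios.bios.vendor smbios.bios.version smbios.bios.reldate; do
        # -q suppresses the warning if a variable is not set
        printf '%s=%s\n' "$key" "$(kenv -q "$key" 2>/dev/null)"
    done
}

# Usage:
#   bios_info
```

Compare the reported version and date with the latest release listed by the vendor for the board.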
 
Tingo,

Indeed!
I hadn't thought about updating it; my BIOS was the original one, dating back to November 2015.
I've just updated it, we'll see if it crashes again.

Thanks :)

--
Léo.
 
Hello all,

After all this time, no new kernel panics have occurred.
Yay!

It seems that the BIOS update has fixed the problem.
Thanks for helping me track it down.

--
Léo.
 