Had a few crashes

azathoth · Oct 26, 2017

I think it's because my second internal 500 GB drive, formatted with UFS, is the storage for transmission-qt...

Box would just go down hard.

I ran fsck and it says the secondary GPT table is corrupt, which is weird because I did a gpart destroy before formatting with newfs -U

I think maybe fsck wrote some files as root that my user g couldn't read, causing a soft updates panic?

I thought it also might be a bad RAM chip.....but stable so far after I ran fsck twice and ran chown -R g: /mnt/a where I mount the drive....

transmission-qt now happy.....

Was just weird because I ran fsck originally when it faulted, but it went down hard again.....

and I think I may have fixed it with the chown, fingers crossed no more crashes.

FreeBSD 11.1, amd64

aragats · Oct 27, 2017

I would advise installing sysutils/smartmontools and checking its report, in particular:

Code:

# smartctl -a /dev/ada0 | grep Reallocated_Sector

If the last number (the raw value) is not zero, the disk has already remapped bad sectors; a count that keeps growing usually means the disk is on its way out. You can also run a test:

Code:

# smartctl -t short /dev/ada0
# smartctl -l selftest /dev/ada0

(Replace /dev/ada0 with your device name)
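
A minimal sketch for checking several attributes at once, assuming the usual smartctl attribute table layout where the raw value is the tenth column. The function name smart_raw is made up for this example; attributes 5, 197 and 198 are the ones most commonly watched for failing sectors.

```shell
#!/bin/sh
# smart_raw NAME: read `smartctl -A` output on stdin and print the raw value
# (10th column) of the attribute called NAME. Prints nothing if it is absent.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example use on a live system (the device name is an assumption, adjust it):
#   for a in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable; do
#       printf '%s: %s\n' "$a" "$(smartctl -A /dev/ada0 | smart_raw "$a")"
#   done
```

Running the loop from time to time and comparing the numbers says more than a single reading, since a rising count is the real warning sign.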

azathoth · Oct 27, 2017

So far so good, running solid since yesterday.

Code:

root@nofapp:~ # smartctl -a /dev/ada0 | grep Reallocated_Sector
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
root@nofapp:~ # smartctl -a /dev/ada1 | grep Reallocated_Sector
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0

oh wow ada0 might fail soon?

Code:

root@nofapp:~ # smartctl -a /dev/ada0 > test
root@nofapp:~ # less test
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: HITACHI HDS725050KLA360
Serial Number: ZBHVW3NH
LU WWN Device Id: 5 000cca 20eda4ef3
Firmware Version: K2AOAB0A
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA/ATAPI-7 T13/1532D revision 1
Local Time is: Fri Oct 27 18:36:06 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (10419) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 174) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 159 159 050 Pre-fail Offline - 207
3 Spin_Up_Time 0x0007 110 110 024 Pre-fail Always - 591 (Average 677)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 1130
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 136 136 020 Pre-fail Offline - 31
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 27549
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1119
192 Power-Off_Retract_Count 0x0032 099 099 050 Old_age Always - 1485
193 Load_Cycle_Count 0x0012 099 099 050 Old_age Always - 1485
194 Temperature_Celsius 0x0002 141 141 000 Old_age Always - 39 (Min/Max 15/57)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

(END)

azathoth · Oct 28, 2017

Crap, just had 3 crashes!
Back up after going to single-user mode and running fsck -y on /dev/ada1 and ada0.
Then after doing it twice I couldn't reboot.
Had some weird failure.
Rebooted, got 1 more crash.
Then rebooted, came up, and ran chown -R g: on /mnt/a and /usr/home/g.
Now top as root shows some fsck thing happening...
Phew.
Jeez.
I wonder if ada0 is getting flaky?

azathoth · Oct 28, 2017

Can someone give a bit of guidance as to what to do here? should I dump data and reinstall to ada1 as the os drive?

azathoth · Oct 28, 2017

Code:

root@nofapp:~ # smartctl -l selftest /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27550

azathoth · Oct 28, 2017

Moving files, gonna reinstall to ada1 as the OS disk.... Now when I do gpart destroy, that cleans all the stuff off the disk, right?
I did that before doing newfs -U on the disk, per the handbook, but I still get this crap about the secondary GPT table being corrupt, so I'm not sure how to sanitize the disk completely before doing the fresh install....

me oh my!

azathoth · Oct 28, 2017

I dunno howto move this thread to storage...

azathoth · Oct 28, 2017

Is a full install onto /dev/ada1 the best way to switch the OS and the boot code to the second disk? Or can I, like, copy stuff over and install a bootloader from where I am now?

_martin · Oct 28, 2017

Do you have a crash dump from that crash? Or does "crash hard" mean it just powered off?
It could all be hardware related, so it's hard to say. If you can afford to have the machine down (or even better, if you have easy access to the box), run a battery of tests on that machine and try to isolate the issue.

I suggest (not including SMART as it was already suggested):

a) test RAM - Memtest86+ or others (i.e. boot straight into Memtest86+)
b) test CPU - some math, e.g. finding MD5 collisions, spread across all CPUs/threads
c) use a different disk and test again
d) PSU - if you have another power supply at hand, swap it with the current one and test again

I would rather boot manually from the other disk, especially if you are just testing. But you can install a bootloader on the secondary disk, remove the first one and let it boot.

azathoth · Oct 28, 2017

Where would I find the crash dump?
How do I test the CPU?
I think I remember memtest is an ISO image I can burn to a USB key and boot from?

azathoth · Oct 28, 2017

Code:

root@nofapp:~ # ls /var/crash/
minfree

root@nofapp:~ # less /var/crash/minfree
2048

_martin · Oct 28, 2017

Yes, /var/crash is the default path for crash dumps. Check that core dumps are enabled:

 # sysctl kern.coredump

kern.coredump: 1

Also check if you have dumpdev set in /etc/rc.conf, at least to "AUTO":

 # grep dumpdev /etc/rc.conf

dumpdev="AUTO"

This option assumes you have at least one swap device defined that the system can dump to.
To activate it without rebooting, specify that device with dumpon(8); if you are not comfortable doing that, you can simply reboot the machine.
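
A small sketch of checking that setting from a script, assuming the value lives in an rc.conf-style file as a plain dumpdev= line. The helper name dumpdev_value is invented for this example.

```shell
#!/bin/sh
# dumpdev_value FILE: print the last dumpdev= setting found in an
# rc.conf-style FILE, with surrounding quotes removed.
# Prints nothing if the file does not set dumpdev at all.
dumpdev_value() {
    sed -n 's/^[[:space:]]*dumpdev=//p' "$1" | tail -n 1 | tr -d "\"'"
}

# On the box itself this could be used as:
#   case "$(dumpdev_value /etc/rc.conf)" in
#       "")         echo "dumpdev is not set in /etc/rc.conf" ;;
#       [Nn][Oo])   echo "crash dumps are switched off" ;;
#       *)          echo "dumpdev is $(dumpdev_value /etc/rc.conf)" ;;
#   esac
#   dumpon -l       # shows the dump device currently in use
```

The last matching line wins here because rc.conf is a shell fragment, so a later assignment overrides an earlier one.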

You can test the CPU by running some stress test on it. As I mentioned: some cracking, computing something, etc. The benchmarks category in ports may have some tools too, but I have never used them.
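
A rough sketch of that idea in plain sh, as a stand-in for heavier math: every worker checksums the same data, so all results must be identical, and a differing result would point at hardware. The function name cpu_burn and the round count are assumptions for illustration; this is a light load compared to a real stress tool.

```shell
#!/bin/sh
# cpu_burn WORKERS ROUNDS: start WORKERS background jobs that each checksum
# the same generated data ROUNDS times, then compare the results.
# Every worker does identical work, so any differing result is suspicious.
cpu_burn() {
    workers=$1
    rounds=$2
    dir=$(mktemp -d) || return 2
    w=1
    while [ "$w" -le "$workers" ]; do
        (
            r=1
            while [ "$r" -le "$rounds" ]; do
                # Checksum a deterministic 1 MB stream of zeros.
                dd if=/dev/zero bs=1024 count=1024 2>/dev/null | cksum
                r=$((r + 1))
            done | cksum > "$dir/$w"
        ) &
        w=$((w + 1))
    done
    wait
    # Count distinct results; exactly one means all workers agreed.
    distinct=$(cat "$dir"/* | sort -u | wc -l | tr -d ' ')
    rm -rf "$dir"
    if [ "$distinct" -eq 1 ]; then
        echo "all $workers workers agree"
    else
        echo "MISMATCH"
        return 1
    fi
}

# One worker per CPU on FreeBSD:
#   cpu_burn "$(sysctl -n hw.ncpu)" 2000
```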

azathoth · Oct 28, 2017

Code:

root@nofapp:~ # sysctl kern.coredump
kern.coredump: 1

azathoth · Oct 28, 2017

Ah, dumpdev was set to NO. OK, I've set it to AUTO now.

_martin · Oct 28, 2017

OK, so after this, if your system panics you should be able to get the dump (if a panic actually happened in your case).
You can verify that you have set it up properly with:
dumpon -l
You should see a device listed. Also verify that /var/crash has enough free space left, at least a few gigabytes in your case.
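
The free-space check can be scripted along these lines. The helper names and the 4 GB figure in the usage comment are assumptions for this example; how much room a dump really needs depends on the amount of RAM and the dump type.

```shell
#!/bin/sh
# free_kb DIR: print the free space, in kilobytes, of the filesystem holding DIR.
# -P keeps each filesystem on one line so the column position is stable.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_for_dump DIR NEED_KB: succeed if DIR has at least NEED_KB free.
enough_for_dump() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example: warn if /var/crash has less than about 4 GB free.
#   enough_for_dump /var/crash $((4 * 1024 * 1024)) || echo "/var/crash is low on space"
```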

And I still recommend doing those tests I mentioned earlier. Bad RAM and bad disks are usually the issue, sometimes CPU and rarely PSU. Unless it's something different, which may also be the case.

azathoth · Oct 28, 2017

ok just got a crash!!

looking up on how to analyze!!

azathoth · Oct 28, 2017

Code:

root@nofapp:/var/crash # ls
bounds          info.0          info.last       minfree         vmcore.0        vmcore.last
root@nofapp:/var/crash # kgdb kernel.debug vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...kernel.debug: No such file or directory.

Can't open a vmcore without a kernel
(kgdb)

azathoth · Oct 28, 2017

Code:

Oct 28 17:19:39 nofapp savecore: reboot after panic: ufs_dirbad: /mnt/a: bad dir ino 61074817 at offset 0: mangled entry
Oct 28 17:19:39 nofapp savecore: writing core to /var/crash/vmcore.0
Oct 28 17:20:59 nofapp kernel: info: [drm] Initialized drm 1.1.0 20060810

Why would the whole box go down just because of a problem with the second internal drive???

Code:

root@nofapp:/var/crash # mount
/dev/ada0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/ada1 on /mnt/a (ufs, local, soft-updates)
root@nofapp:/var/crash # cat /etc/fstab
# Device        Mountpoint      FStype  Options Dump    Pass#
/dev/ada0p2     /               ufs     rw      1       1
/dev/ada0p3     none            swap    sw      0       0
/dev/ada1       /mnt/a          ufs     rw      2       2

azathoth · Oct 28, 2017

azathoth · Oct 28, 2017

Ran fsck -y on /dev/ada1 two times.

Did this before, though.

azathoth · Oct 28, 2017

azathoth · Oct 28, 2017

Code:

SALVAGE? yes

SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

14212 files, 103268609 used, 14993626 free (3330 frags, 1873787 blocks, 0.0% fragmentation)

***** FILE SYSTEM STILL DIRTY *****

***** FILE SYSTEM WAS MODIFIED *****

***** PLEASE RERUN FSCK *****
root@nofapp:/var/crash # fsck -y /dev/ada1
** /dev/ada1
** Last Mounted on /mnt/a
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
14212 files, 103268609 used, 14993626 free (3330 frags, 1873787 blocks, 0.0% fragmentation)

***** FILE SYSTEM MARKED CLEAN *****

root@nofapp:/var/crash # df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/ada0p2    447G    181G    231G    44%    /
devfs          1.0K    1.0K      0B   100%    /dev
/dev/ada1      451G    394G     21G    95%    /mnt/a
root@nofapp:/var/crash # chown -R g: /mnt/a

ran this last time tho

_martin · Oct 29, 2017

Meh, I think you should take those pictures down. They are really distracting and I'm not sure they comply with the rules.

Since you now have a crash dump and a panic string, you can rule out the PSU as the issue. A filesystem inconsistency caused the panic. So the main suspects are RAM and the disk (though you still can't rule out the CPU).

You didn't run the kernel debugging command properly, but for the time being it doesn't matter. Do that RAM check with Memtest86+ and test with another disk (or stop using the suspicious one).
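
For reference, kgdb wants the path to the kernel that crashed as its first argument, and there is no kernel.debug file in /var/crash. A minimal sketch, assuming the stock kernel at /boot/kernel/kernel and the vmcore.last link that savecore left in the directory listing above; the helper name kgdb_command is made up for this example.

```shell
#!/bin/sh
# kgdb_command CRASHDIR [KERNEL]: print the kgdb command line for the newest
# dump in CRASHDIR, or fail if there is no vmcore.last there.
kgdb_command() {
    crashdir=$1
    kernel=${2:-/boot/kernel/kernel}   # stock kernel path, an assumption
    core="$crashdir/vmcore.last"
    if [ ! -e "$core" ]; then
        echo "no dump found in $crashdir" >&2
        return 1
    fi
    echo "kgdb $kernel $core"
}

# Example:
#   kgdb_command /var/crash     # prints the command to run
# At the (kgdb) prompt, "bt" prints the backtrace of the panic.
```

The info.last file next to the dump is plain text and already contains the panic string, so it is worth reading before starting the debugger.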

You can also do a check where you leave the big disk out and start writing data to temp for example (assuming sh):

 while true ; do for i in `seq 8`; do echo creating blob$i; dd if=/dev/zero of=/tmp/blob${i} bs=1024M count=16; done; done

This creates eight 16 GB files in /tmp and keeps rewriting them, to stress writes a bit. Make sure /tmp has about 128 GB free, and stop it with Ctrl-C.
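
A bounded variant of the same idea, sketched under the assumption that you also want to confirm the data reads back intact. The function name write_and_verify and its arguments are invented for this example; note that a read-back shortly after the write may be served from cache instead of the disk.

```shell
#!/bin/sh
# write_and_verify DIR COUNT SIZE_MB: write COUNT files of SIZE_MB megabytes
# of zeros into DIR, read each one back, and compare its checksum with that
# of the same amount of zeros taken straight from /dev/zero.
write_and_verify() {
    dir=$1
    count=$2
    size_mb=$3
    want=$(dd if=/dev/zero bs=1048576 count="$size_mb" 2>/dev/null | cksum)
    i=1
    while [ "$i" -le "$count" ]; do
        f="$dir/blob$i"
        dd if=/dev/zero of="$f" bs=1048576 count="$size_mb" 2>/dev/null || return 2
        got=$(cksum < "$f")
        rm -f "$f"
        if [ "$got" != "$want" ]; then
            echo "blob$i: checksum mismatch"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$count blobs verified"
}

# Example: eight 16 GB files, one at a time, in /tmp.
#   write_and_verify /tmp 8 16384
```

Because each file is removed after it is checked, this needs room for only one blob at a time.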

azathoth · Nov 4, 2017

pics for halloween

Hmmm, everything has been stable for the last week or so... I wonder if soft updates on the second, non-problem drive caused some problem when combined with transmission-qt, which does some kind of recheck after a fault.