System Crash - what to do?

balanga · Feb 5, 2018

My system has just crashed and on reboot it now crashes on logon.

I've gone into single user mode and run fsck -y -t /dev/ada0S3 but that hasn't cleared up the problem. What should I look at?

Connecting the disk to another system to check it out can be done easily enough, but what should I look for?

chrbr · Feb 5, 2018

You could check the health of the disk using sysutils/smartmontools.

ralphbsz · Feb 5, 2018

We can not debug "crash". We need way more data. What exactly does the console display? Are there any IO error messages in the log? Any crash dumps collected?

How does /dev/ada0S3 relate to the file systems? What file system and device is /home on (or wherever the home directories for login are)? What happens if you log on as root after going multi-user?

balanga · Feb 5, 2018

ralphbsz said:
We can not debug "crash". We need way more data. What exactly does the console display? Are there any IO error messages in the log? Any crash dumps collected?

How does /dev/ada0S3 relate to the file systems? What file system and device is /home on (or wherever the home directories for login are)? What happens if you log on as root after going multi-user?

As soon as I press enter after the initial password prompt there is a console display which flashes past very quickly, and then the system reboots. I don't know anything about crash dumps, but since I can't mount the filesystem I wouldn't be able to find them. The system boots from /dev/ada0s3. It doesn't crash if I boot single user, but I am unable to clean the filesystem with fsck... maybe there are some other parameters I should use...

ShelLuser · Feb 5, 2018

So what are you using? What FreeBSD version, what filesystem, what kind of hardware, did you compile the system yourself or did you use binaries, are you using any specific shells, does this reboot happen for every user that logs on or just one (and if so: do the users share the same shell or not)?

That's just the tip of the iceberg. You're not giving us enough information to even make wild guesses.

(edit)

But to answer the main question ("what to do"): Narrow down possible causes. I gave a brief example above. For all I know this could be a shell going haywire.

balanga · Feb 5, 2018

chrbr said:
You could check the health of the disk using sysutils/smartmontools.

This is after booting FreeBSD from another disk [/dev/da0]:-

Code:

root@FreeBSD:~ # smartctl -a /dev/ada0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.0-RELEASE-p8 i386] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Travelstar Z5K320
Device Model:     Hitachi HTS543232A7A384
Serial Number:    120602E2M3421L167KJP
LU WWN Device Id: 5 000cca 706d0ee21
Firmware Version: ES2OA70K
User Capacity:    320,072,933,376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Feb  5 22:06:09 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
                   was never started.
                   Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)   The previous self-test routine completed
                   without error or no self-test has ever
                   been run.
Total time to complete Offline
data collection:        (   45) seconds.
Offline data collection
capabilities:             (0x5b) SMART execute Offline immediate.
                   Auto Offline data collection on/off support.
                   Suspend Offline collection upon new
                   command.
                   Offline surface scan supported.
                   Self-test supported.
                   No Conveyance Self-test supported.
                   Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
                   power-saving mode.
                   Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
                   General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     (  93) minutes.
SCT capabilities:           (0x003d)   SCT Status supported.
                   SCT Error Recovery Control supported.
                   SCT Feature Control supported.
                   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   211   211   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1299
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       2529
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1123
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0012   070   070   000    Old_age   Always       -       309456
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 3/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               60%      1158         -
# 2  Extended offline    Completed without error       00%       409         -
# 3  Short offline       Completed without error       00%       407         -
# 4  Short offline       Completed without error       00%       381         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Not sure how to interpret any of this.

topcat · Feb 5, 2018

balanga said:
I've gone into single user mode and run fsck -y -t /dev/ada0S3 but that hasn't cleared up the problem.

Did fsck succeed? A screenshot perhaps?

balanga · Feb 6, 2018

topcat said:
Did fsck succeed? A screenshot perhaps?

After several attempts I have been able to mount the filesystem when booting from a different disk. I don't know if fsck creates a log anywhere. I'll attempt to boot from the original disk after doing a backup.

balanga · Feb 6, 2018

fsck succeeded when booting from a different disk, but when I booted from the original disk it crashed as soon as I attempted to log in.

I have mounted it again and would like to see whether any evidence of the crash has been left on the disk. Is there a boot option that provides a verbose boot log? Or is there any way to pause the system when it crashes?
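On the question of evidence left on disk: savecore(8) writes kernel crash dumps to /var/crash as numbered vmcore.N files with matching info.N summaries, but only if a dump device was configured before the panic. A minimal Python sketch for checking a mounted copy of the disk follows. The function name list_crash_dumps and the /mnt mount point in the usage line are assumptions for illustration, and compressed dumps (for example vmcore.0.gz) are deliberately not matched.

```python
from pathlib import Path


def list_crash_dumps(crash_dir="/var/crash"):
    """Return the sorted dump numbers N for which a vmcore.N file exists."""
    root = Path(crash_dir)
    if not root.is_dir():
        # No crash directory at all: nothing was ever saved here.
        return []
    numbers = []
    for entry in root.iterdir():
        prefix, _, suffix = entry.name.partition(".")
        # Skips info.N, bounds, minfree and the vmcore.last symlink.
        if prefix == "vmcore" and suffix.isdigit():
            numbers.append(int(suffix))
    return sorted(numbers)


if __name__ == "__main__":
    # Hypothetical path: the damaged root file system mounted under /mnt.
    print(list_crash_dumps("/mnt/var/crash"))
```

An empty list means no dump was saved, which is the expected result when no dump device was set up.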

After recording the problem on my phone I see this:-

Code:

panic: ufs_dirbad: /: bad dir ino 28410624 at offset 2048: mangled entry

How can I find out what this indicates?

Maelstorm · Feb 6, 2018

I'm just jumping in this thread so here's what I have.

The HDD itself is good. All the raw values are either 0 or below threshold.
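As a rough illustration of how that table can be read, here is a minimal Python sketch (not part of smartmontools) that parses attribute rows copied from the output above. It assumes the usual interpretation that an attribute is failing when its normalized VALUE is at or below a nonzero THRESH. It also watches the raw counts of attributes 5, 197 and 198, which is a common rule of thumb rather than something smartctl enforces. The names parse_attributes, find_concerns and WATCH_RAW are made up for this example.

```python
import re

# Rows copied from the smartctl output earlier in the thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0012   070   070   000    Old_age   Always       -       309456
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)

# Assumption: raw counts above zero here are worth a closer look.
WATCH_RAW = {5, 197, 198}


def parse_attributes(text):
    """Turn attribute table rows into dicts; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def find_concerns(rows):
    """Return human-readable notes for attributes that look unhealthy."""
    concerns = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            concerns.append(f"{r['name']}: VALUE {r['value']} <= THRESH {r['thresh']}")
        if r["id"] in WATCH_RAW and r["raw"] > 0:
            concerns.append(f"{r['name']}: raw count {r['raw']}")
    return concerns


if __name__ == "__main__":
    print(find_concerns(parse_attributes(SAMPLE)))  # prints []
```

For the rows above nothing is flagged, which matches the reading that the disk itself looks healthy.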

As for this:

Code:

panic: ufs_dirbad: /: bad dir ino 28410624 at offset 2048: mangled entry

That indicates a file system problem. It means that an entry in the partition that has / as a mount point is corrupted to the point that the kernel can't handle it and so it causes the kernel to panic. In other words, it is reporting the mount point of the file system that has the problem. In this case, it's the / file system. It is also giving you the inode number of 28410624 which is critical to fixing the system.
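To make the fields in that message explicit, here is a small Python sketch that splits the panic line into the parts described above. The regular expression is based only on the single line quoted in this thread, so treat the pattern and the name parse_ufs_dirbad as assumptions, not as a complete description of what the kernel can print.

```python
import re

# Pattern derived from the one panic line posted in this thread.
PANIC = re.compile(
    r"panic: ufs_dirbad: (?P<mount>\S+): bad dir ino (?P<inode>\d+) "
    r"at offset (?P<offset>\d+): (?P<reason>.+)"
)


def parse_ufs_dirbad(line):
    """Return mount point, inode, offset and reason, or None if no match."""
    m = PANIC.search(line)
    if m is None:
        return None
    return {
        "mount_point": m["mount"],    # file system that has the problem
        "inode": int(m["inode"]),     # inode of the damaged directory
        "offset": int(m["offset"]),   # byte offset of the bad entry
        "reason": m["reason"].strip(),
    }


if __name__ == "__main__":
    line = "panic: ufs_dirbad: /: bad dir ino 28410624 at offset 2048: mangled entry"
    print(parse_ufs_dirbad(line))
    # prints {'mount_point': '/', 'inode': 28410624, 'offset': 2048, 'reason': 'mangled entry'}
```

The inode number is the value to note down before starting any repair.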

So how do you fix it? You use fsdb(8), the UFS file system debugger. You are not the only one to have this problem. A quick google search turned up this:

http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/

One thing that I would change in the blog's instructions: do not use -y as an option to fsck. Let it ask you questions. I have found that when using -y, fsck will use the journal to try to repair the file system, and it may miss errors. So when it asks whether to use the journal, answer NO.

As with any disk utility, take a backup image of the drive/partition before you use a tool that can destroy data. In the worst case, you may end up having to reformat the disk/partition to recover it. If the corruption is widespread, you could have a hardware problem. I've seen faulty memory do strange things.

I hope this helps.

chrbr · Feb 6, 2018

balanga said:
Not sure how to interpret any of this.

This was already answered one post above. It is especially good to see zeros for attributes 1 and 5. Please follow the advice of Maelstorm. I hope you will manage to fix the issue soon!

topcat · Feb 6, 2018

Maelstorm said:
The HDD itself is good. All the raw values are either 0 or below threshold.
I've seen faulty memory do strange things.

This. FS problems can be caused by bad memory or other hardware issues, not necessarily a bad disk.

Maelstorm · Feb 7, 2018

topcat said:
This. FS problems can be caused by bad memory or other hardware issues, not necessarily a bad disk.

I've seen some crazy things with a faulty PATA cable.

balanga · Feb 7, 2018

Thanks to all those who helped. I am now able to log in without crashing, but I can't help wondering if FreeBSD left any evidence of the crash... I was only able to find out what the problem was by recording the boot process on my phone.

Having returned to what I thought was a normal system, I now see that there is no /root directory and there are lots of files under /lost+found.

Now... where did I put that backup?

balanga · Feb 7, 2018

Backup does not boot up.... Halts after

Code:

Trying to mount root from ufs:/dev/ada0s3a [rw]...
mountroot: unable to remount devfs under /dev (error 2)
mountroot: unable to unlink /dev/dev (error 2)

???