Rebuild seems to have broken my system

Lorem-Ipsum · Mar 23, 2012

I just spent the morning moving my machine to a new case and replacing the power supply.
It's a fairly old machine and still uses IDE. On the first boot attempt it got as far as remounting the root filesystem, failed to find it, and I had to reboot. On the second attempt it hung for 20 minutes on:

Code:

FreeBSD/x86 bootstrap loader, Revision 1.1
(root@obrian.cse.buffalo.edu, Tue Jan 3 06:40:01 UTC 2012)
Loading /boot/defaults/loader.conf
/boot/kernel/kernel data=0xb49073 data=0xf54f8+0xbda74 syms=[0x4+0xaffd0-

Then it boots as far as:

Code:

Mounting from ufs:/dev/ada0p2 failed with error 19.

Loader variables:
vfs.root.mountfrom=ufs:/dev/ada0p2
vfs.root.mountfrom.options=rw

Manual root filesystem specification:
<fstype>:<device> [options]
Mount <device> using filesystem <fstype> 
and with the specified (optional) option list.

### EXAMPLE SHOWN ###

mountroot>

I'm fairly new to FreeBSD and am a bit apprehensive of doing anything that could compromise the integrity of the data on the disk as my backups are a bit old.

I have a second spare hard drive that I could potentially put a fresh install on and transfer settings and data across but I'd really like to solve this.

I would boot into single user mode and see what I can do but it takes 20 mins or so to get to the bootloader.

Any ideas as I've run out?

kpa · Mar 23, 2012

Boot from the install cd or usb stick and post listing of # gpart show.

Beeblebrox · Mar 23, 2012

what does this show?
mountroot> ?
It should give you a list of mountable devices. Does the list show your HDD+partition? Is the name different? Let's assume the device name is changed and shows ada1p2 then go:
mountroot> ufs:/dev/ada1p2

SirDice · Mar 23, 2012

It's possible you connected the drive to the wrong SATA or IDE port. That would change the device number.

Make sure the drive is master on the primary IDE or on the first SATA port.

Lorem-Ipsum · Mar 23, 2012

kpa said:
Boot from the install cd or usb stick and post listing of # gpart show.

Thanks, I'll give that a shot.

Beeblebrox said:
what does this show?
mountroot> ?
It should give you a list of mountable devices. Does the list show your HDD+partition? Is the name different? Let's assume the device name is changed and shows ada1p2 then go:
mountroot> ufs:/dev/ada1p2

Scarily the output was blank. I'm guessing it's to do with the IDE cable.

SirDice said:
It's possible you connected the drive to the wrong SATA or IDE port. That would change the device number.

Make sure the drive is master on the primary IDE or on the first SATA port.

That's my thinking. I just had to pop out to get a longer IDE cable as the new case sites the HDDs a bit further away.

EDIT: Seems to be solved. The BIOS battery seems to be dead and the motherboard had reverted to jumper configuration over cable select.

EDIT2: Not so solved, just got an IO error, kernel panic and now my drive can't be found. Fingers crossed the drive isn't dead.

The live CD cannot mount it or even list its partitions in /dev. The drive itself is listed as ada0, but fdisk reports: fdisk: could not detect sector size.

# gpart show is blank.

I'm starting to think the drive is dead.

SirDice · Mar 23, 2012

If it's IDE, make sure it's one of those 80 wire cables.

Beeblebrox · Mar 23, 2012

OK, you have too much going on at once. I suggest you STOP and take a 1-2 hour break. Go watch TV or wash your dog or something.

When you get back you should unplug the HDD, place it somewhere safe and run full diagnostics on your hardware: memtest for RAM, inspect all your cables, clean the internals of your power supply and fan, and even read some of this. Let's leave debugging your HDD until the VERY last, because that's where your data is and you don't want to trash that by pushing too hard to solve this.

Take a look at inquisitor for stress test, and you should always have a copy of mfsBSD at hand for emergency recovery. After inquisitor results come back clean (remember, no HDD) we can discuss how to use smartctl from mfsBSD or how to just back-up your data and run MHDD for surface scanning.

Lorem-Ipsum · Mar 23, 2012

Beeblebrox said:
OK, you have too much going on at once. I suggest you STOP and take a 1-2 hour break. Go watch TV or wash your dog or something.

When you get back you should unplug the HDD, place it somewhere safe and run full diagnostics on your hardware: memtest for RAM, inspect all your cables, clean the internals of your power supply and fan, and even read some of this. Let's leave debugging your HDD until the VERY last, because that's where your data is and you don't want to trash that by pushing too hard to solve this.

Take a look at inquisitor for stress test, and you should always have a copy of mfsBSD at hand for emergency recovery. After inquisitor results come back clean (remember, no HDD) we can discuss how to use smartctl from mfsBSD or how to just back-up your data and run MHDD for surface scanning.

Thanks, took the dog out for a walk and am a bit more relaxed now.

I've removed the FreeBSD drive and booted from an Arch Linux drive that I used to use in this machine. I've run memtest and a few other diagnostic tools and they all come back fine, so I think the drive has failed.

I'm not really surprised as it's an old drive, so I'm going to install FreeBSD on the drive I'm currently running Arch Linux on and then think about getting my data back over the weekend. I'm not too concerned as the only things I'll have to rebuild from scratch are easily done.

EDIT: Downloading a copy of mfsBSD now.

Beeblebrox · Mar 23, 2012

Good to hear!

1. Once you have an up-and-running system, connect your HDD and review the SMART values. In FreeBSD it's smartctl(8) (adjust according to the distro that's running). Run this to view the existing log:
# smartctl -a /dev/ada0
and this to run a (non-destructive) HDD test:
# smartctl -t <type> /dev/ada0
But at least look through the man page for details of those specific flags. Of course, turn on SMART first, before the test, with
# smartctl --smart=on /dev/ada0

2. Recover your data (with mfsBSD preferably)

3. Once you are safe with that HDD, start running destructive tests on it like MHDD (part of sysresccd or available on its own) or the manufacturer's own "surface scan" utilities (e.g. Seagate has a good scan program)

The surface scan results will tell you whether your HDD is trashed or not.
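
Step 1 above can be wrapped in a couple of small shell functions. This is only a sketch under assumptions: the function names and the SMARTCTL override variable are made up for illustration, and the only smartctl options used are --smart=on, -t and -l selftest.

```shell
#!/bin/sh
# SMARTCTL can be overridden (e.g. SMARTCTL=echo) to preview the commands
# instead of running them. The variable name is hypothetical.
: "${SMARTCTL:=smartctl}"

# start_selftest DEVICE [KIND]: enable SMART, then queue a self-test.
# KIND is passed to -t; "short" and "long" are the usual choices.
start_selftest() {
    dev=$1
    kind=${2:-short}
    "$SMARTCTL" --smart=on "$dev" &&
    "$SMARTCTL" -t "$kind" "$dev"
}

# show_selftest_log DEVICE: print the self-test log once the test has finished.
show_selftest_log() {
    "$SMARTCTL" -l selftest "$1"
}
```

Typical use would be start_selftest /dev/ada0 short, waiting for the polling time that smartctl -a reports for the drive, then show_selftest_log /dev/ada0. The self-test runs inside the drive, so the log is the place to look for the result.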

Lorem-Ipsum · Mar 23, 2012

Thanks.

The first command returned:

Code:

smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-RELEASE i386] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 5400.1
Device Model:     ST340015A
Serial Number:    5LA6ASWG
Firmware Version: 3.01
User Capacity:    40,020,664,320 bytes [40.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Mar 23 22:54:51 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  420) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  28) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   059   048   025    Pre-fail  Always       -       101190228
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       58
  7 Seek_Error_Rate         0x000f   068   058   030    Pre-fail  Always       -       154836132419
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4479
 10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2523
194 Temperature_Celsius     0x0022   022   048   000    Old_age   Always       -       22
195 Hardware_ECC_Recovered  0x001a   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   094   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       502
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 464 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 464 occurred at disk power-on lifetime: 4479 hours (186 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 22 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000022 = 34

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 22 00 00 e0 00      00:00:00.453  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:29.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:29.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:03.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:00:00.039  READ DMA

Error 463 occurred at disk power-on lifetime: 4479 hours (186 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 22 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000022 = 34

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 22 00 00 e0 00      00:00:00.449  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:29.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:03.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:00:00.039  READ DMA
  c8 00 01 23 00 00 e0 00      00:06:29.000  READ DMA

Error 462 occurred at disk power-on lifetime: 4479 hours (186 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 22 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000022 = 34

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 22 00 00 e0 00      00:00:00.447  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:03.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:00:00.039  READ DMA
  c8 00 01 23 00 00 e0 00      00:06:29.000  READ DMA
  c8 00 01 23 00 00 e0 00      00:00:00.018  READ DMA

Error 461 occurred at disk power-on lifetime: 4479 hours (186 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 22 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000022 = 34

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 22 00 00 e0 00      00:00:00.449  READ DMA
  c8 00 01 22 00 00 e0 00      00:00:00.039  READ DMA
  c8 00 01 23 00 00 e0 00      00:06:29.000  READ DMA
  c8 00 01 23 00 00 e0 00      00:00:00.018  READ DMA
  e7 00 01 00 00 00 e0 00      00:06:02.000  FLUSH CACHE

Error 460 occurred at disk power-on lifetime: 4479 hours (186 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 22 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000022 = 34

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 23 00 00 e0 00      00:00:26.000  READ DMA
  c8 00 01 23 00 00 e0 00      00:00:00.018  READ DMA
  e7 00 01 00 00 00 e0 00      00:06:02.000  FLUSH CACHE
  c8 00 01 22 00 00 e0 00      00:00:26.000  READ DMA
  c8 00 01 22 00 00 e0 00      00:06:37.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1401         -

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

# gpart show produces:

Code:

=>      34  78165293  ada0  GPT  (37G)
        34       128     1  freebsd-boot  (64k)
       162  73400192     2  freebsd-ufs  (35G)
  73400354   3907584     3  freebsd-swap  (1.9G)
  77307938    857389        - free -  (418M)

Mounting showed it's not clean, so I'm running an interactive fsck now; if I can get the filesystem clean I'll mount it and rsync the data.

If not I'll look at other options.

wblock@ · Mar 23, 2012

Given the errors and reallocated sector count, copying the data off of that drive ought to be the first step. It could fail at any time.
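
For reference, the figures behind that advice can be pulled out of the attribute table with a small filter. A minimal sketch, assuming the column layout shown in the report above (raw value in the tenth field); the function name and the choice of attributes are illustrative, not an official health check:

```shell
#!/bin/sh
# flag_smart_attrs: read `smartctl -A` style output on stdin and print the
# name and raw value of a few attributes when their raw value is non-zero.
flag_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct"  ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable"  ||
        $2 == "UDMA_CRC_Error_Count" {
            # Field 10 is RAW_VALUE in the layout shown above.
            if ($10 + 0 > 0) print $2, $10
        }
    '
}
```

It would be used as smartctl -A /dev/ada0 | flag_smart_attrs. On the report above it would pick out Reallocated_Sector_Ct (58) and UDMA_CRC_Error_Count (502). The second one, together with the ICRC entries in the error log, usually points at the cable or connector rather than the platters, which fits the earlier advice about using an 80-wire cable.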

Lorem-Ipsum · Mar 23, 2012

wblock@ said:
Given the errors and reallocated sector count, copying the data off of that drive ought to be the first step. It could fail at any time.

How would you recommend I do that?

I'm still trying to work out where this drive came from. I'm starting to worry that it may have been a HDD I was recovering, left in the machine, forgot about, installed FreeBSD to play about on and then left it in there when I started using the machine as a firewall/murmur server.

EDIT: I've managed to force it to mount the filesystem read only and I'm currently rsyncing the data I need to my main machine.

I'm going to try to reinstall FreeBSD on my spare drive and have the server back up by morning...... I can see it's going to be a long night.

wblock@ · Mar 24, 2012

If a drive is failing, I'd use dd(1) to copy it, or maybe mount it read-only. But a block copy gives some other possibilities, and doesn't require running the drive while working on the data. If the data is copied to a file, it can be mounted with mdconfig(8).
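
A rough sketch of that approach follows. The helper names, the DRY_RUN switch, the image path and the md unit number are all assumptions made for illustration; the underlying commands are dd(1) with conv=noerror,sync, mdconfig(8) with a vnode-backed read-only device, and a read-only mount.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command so it can be
# reviewed first; set DRY_RUN=0 to actually execute.
: "${DRY_RUN:=1}"

run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# image_disk DISK IMAGE: block-copy a whole disk into an image file.
# conv=noerror,sync carries on past unreadable blocks and pads them with
# zeros so that offsets inside the image still line up.
image_disk() {
    run dd if="/dev/$1" of="$2" bs=64k conv=noerror,sync
}

# attach_image IMAGE UNIT: expose the image read-only as /dev/mdUNIT.
attach_image() {
    run mdconfig -a -t vnode -o readonly -f "$1" -u "$2"
}

# mount_image_ro UNIT PARTITION MOUNTPOINT: mount one GPT partition of
# the attached image read-only.
mount_image_ro() {
    run mount -o ro "/dev/md${1}p${2}" "$3"
}
```

For the disk in this thread that might be image_disk ada0 /backup/ada0.img, then attach_image /backup/ada0.img 0 and mount_image_ro 0 2 /mnt, where /backup is a hypothetical location with at least 40 GB free. Once the image exists the failing drive can be powered down and all further work done on the copy.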

Lorem-Ipsum · Mar 24, 2012

wblock@ said:
If a drive is failing, I'd use dd(1) to copy it, or maybe mount it read-only. But a block copy gives some other possibilities, and doesn't require running the drive while working on the data. If the data is copied to a file, it can be mounted with mdconfig(8).

I worked out what data was most important, rsynced that across and then rsynced most of my configs.

I've installed FreeBSD on my spare drive and I'm just installing some of the bare essentials.

I've put the dying drive in a storage case and I'm going to see how much data I can recover from it when I get the time. Could be good practice.

EDIT: I've got most of the data I needed back so I'm going to mark this solved.

Can FreeBSD be configured to warn me the next time my HDD starts to die?

DutchDaemon · Mar 24, 2012

Add

Code:

daily_status_smart_enable="YES"

to /etc/periodic.conf and watch your daily (nightly) output, or run /usr/local/etc/periodic/daily/smart from cron more frequently, or script something around smartctl(8).
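
Put together, the configuration might look like the sketch below. It assumes the smartmontools port is installed (that is where the periodic script path above comes from). periodic(8) takes its settings from /etc/periodic.conf. The device list variable is an assumption based on that port's periodic script, so check the script itself before relying on it.

```shell
# /etc/periodic.conf
# Variable from the post above.
daily_status_smart_enable="YES"
# Assumption: the periodic script shipped with the smartmontools port reads
# its device list from this variable; check the script to confirm.
daily_status_smart_devices="/dev/ada0"

# /etc/crontab entry to run the same script every 6 hours (shown as a
# comment because it is crontab syntax, not shell):
#   0  */6  *  *  *  root  /usr/local/etc/periodic/daily/smart
```

Whatever the script prints when run from cron is mailed to the job's owner, so make sure root's mail is actually read or forwarded somewhere.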

Lorem-Ipsum · Mar 24, 2012

DutchDaemon said:
Add

Code:

daily_status_smart_enable="YES"

to /etc/periodic.conf and watch your daily (nightly) output, or run /usr/local/etc/periodic/daily/smart from cron more frequently, or script something around smartctl(8).

Great, thanks.
