fsck / icheck -b ?

monkeyboy

Excuse my ignorance, but what is the replacement for the old
icheck -b BN function, which let you find out which inode (file) contains a particular block?

When a disk throws a block error (sector error), I would like to know which file it is in...

I don't see that fsck has that option... it would also be nice to have an ncheck -i INUM replacement as well, although that can be done with find (but very slowly)...
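For the record, find can stand in for ncheck, slow as it is. A minimal sketch (the demo directory and file names here are made up just so the example is self-contained; against a real filesystem you would run find from the mount point):

```shell
# Slow stand-in for ncheck -i: list every pathname that refers to a
# given inode. Inode numbers are only unique per filesystem, so keep
# find on one filesystem (-xdev with GNU find, -x with FreeBSD find).
dir=$(mktemp -d)                # throwaway demo directory
touch "$dir/file"
ln "$dir/file" "$dir/hardlink"  # a second name for the same inode
inum=$(ls -i "$dir/file" | awk '{print $1}')
find "$dir" -xdev -inum "$inum" -print   # prints both pathnames
rm -rf "$dir"
```

Because hard links share an inode, the lookup can legitimately return more than one pathname.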
 
monkeyboy

vivek said:
Maybe:
Code:
ls -i
???
All ls -i does is list the inumber for each file.

The question is: if you have a BAD SECTOR number, how do you find out which file is using that sector?

The original procedure was icheck -b BNUMBER, which gives you the inumber, then ncheck -i INUMBER to find the file pathname(s) (the pathname may not be unique because of links).
 

SirDice


I did some searching, and it seems some of the info you want can be traced using fsdb(8). Not sure how to use it, though; I've never needed to.
 

sprewell


I just ran into this problem because of some bad sectors, so I thought I'd write up what I did to recover. What follows is based on this howto for Linux and the equivalent for FreeBSD, expanded with my specific commands and output.

I woke up this morning to find my system had crashed and rebooted overnight. I have a 30 GB UFS/FFS root partition for most system software, and the remainder of the disk is a ZFS pool for all user data; this ZFS pool would not mount. Looking in /var/log/messages, I found the following lines:

Code:
Dec  5 09:16:45 anonymized kernel: ZFS filesystem version 6
Dec  5 09:16:45 anonymized kernel: ZFS storage pool version 6
Dec  5 09:16:45 anonymized kernel: ad4: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=548191
Dec  5 09:16:45 anonymized kernel: g_vfs_done():ad4s1a[READ(offset=280641536, length=16384)]error = 5
Dec  5 09:16:45 anonymized kernel: ad4: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=548191
Dec  5 09:16:45 anonymized kernel: g_vfs_done():ad4s1a[READ(offset=280641536, length=16384)]error = 5
Dec  5 09:16:45 anonymized kernel: ad4: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=548223
Dec  5 09:16:45 anonymized kernel: g_vfs_done():ad4s1a[READ(offset=280657920, length=4096)]error = 5
Dec  5 09:16:45 anonymized kernel: vnode_pager_getpages: I/O read error
Dec  5 09:16:45 anonymized kernel: vm_fault: pager read error, pid 119 (zfs)
Searching online for info about this error, I found that smartctl will give you better info about your hard disks, so I installed the sysutils/smartmontools port and ran # /usr/local/sbin/smartctl -a /dev/ad4 to get the following output (with some irrelevant lines snipped to fit into this forum's post character limit):

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   172   159   021    Pre-fail  Always       -       4358
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       77
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       2966
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       75
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       74
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       77
194 Temperature_Celsius     0x0022   095   079   000    Old_age   Always       -       52
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Error 7 occurred at disk power-on lifetime: 2955 hours (123 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 84 5d 08 e0  Error: UNC at LBA = 0x00085d84 = 548228

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 7f 5d 08 00 00      04:09:15.429  READ DMA
  c8 00 20 5f 5d 08 00 00      04:09:13.530  READ DMA
  c8 00 20 5f 5d 08 00 00      04:09:10.888  READ DMA

Error 6 occurred at disk power-on lifetime: 2955 hours (123 days + 3 hours)
  40 51 00 74 5d 08 e0  Error: UNC at LBA = 0x00085d74 = 548212

Error 5 occurred at disk power-on lifetime: 2955 hours (123 days + 3 hours)
  40 51 00 74 5d 08 e0  Error: UNC at LBA = 0x00085d74 = 548212

Error 4 occurred at disk power-on lifetime: 2951 hours (122 days + 23 hours)
  40 51 00 84 5d 08 e0  Error: UNC at LBA = 0x00085d84 = 548228

Error 3 occurred at disk power-on lifetime: 2951 hours (122 days + 23 hours)
  40 51 00 74 5d 08 e0  Error: UNC at LBA = 0x00085d74 = 548212
The useful info is Current_Pending_Sector with a RAW_VALUE of 2, and the two bad sectors repeatedly listed in the subsequent error log at LBA locations 548212 and 548228. The disk can't read those sectors for whatever reason, and that's causing ZFS to fail.

What people apparently normally do in this situation is back up all their data and write zeros to the entire partition; this triggers the disk firmware to reallocate the bad sectors, which it can only do on a write to a bad sector, not on a read. I didn't want to go to all that trouble, so I found the above FreeBSD steps after much searching and set about finding the relevant file. If I could just write to the files containing the bad sectors, that would trigger the disk firmware to clean up the sectors.

I found the relevant slice using # fdisk /dev/ad4
Code:
******* Working on device /dev/ad4 *******
parameters extracted from in-core disklabel are:
cylinders=1240341 heads=16 sectors/track=63 (1008 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 62914257 (30719 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 1023/ head 15/ sector 63
The data for partition 2 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 62914320, size 104857200 (51199 Meg), flag 0
        beg: cyl 1023/ head 255/ sector 63;
        end: cyl 1023/ head 15/ sector 63
The data for partition 3 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 167771520, size 838860624 (409599 Meg), flag 0
        beg: cyl 1023/ head 255/ sector 63;
        end: cyl 1023/ head 15/ sector 63
The data for partition 4 is:
<UNUSED>
The two sector numbers were less than 62914320, so I knew the problem was in the first slice. bsdlabel gave me the relevant partition: # bsdlabel /dev/ad4s1
Code:
# /dev/ad4s1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:  1048576        0    4.2BSD        0     0     0 
  b:  4088480  1048576      swap                    
  c: 62914257        0    unused        0     0         # "raw" part, don't edit
  d:  4141056  5137056    4.2BSD        0     0     0 
  e:  1048576  9278112    4.2BSD        0     0     0 
  f: 52587569 10326688    4.2BSD        0     0     0
Both sector numbers were less than 1048576 + 63 (the slice starts at sector 63, and partition a's 1048576 sectors are offsets within it), so I knew the problem was in /dev/ad4s1a. df told me this was the root partition:
Code:
Filesystem          1K-blocks      Used   Avail Capacity  Mounted on
/dev/ad4s1a            507630    283440  183580    61%    /
devfs                       1         1       0   100%    /dev
/dev/ad4s1e            507630    101188  365832    22%    /tmp
/dev/ad4s1f          25464900  20797632 2630076    89%    /usr
/dev/ad4s1d           1999598    175910 1663722    10%    /var
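For what it's worth, the slice and partition bookkeeping above reduces to two range checks. A sketch in sh, with the offsets hard-coded from my fdisk and bsdlabel output (adjust for your own disk):

```shell
LBA=548223           # failing sector from /var/log/messages
S1_START=63          # slice 1 start (from fdisk)
S1_SIZE=62914257     # slice 1 size  (from fdisk)
A_OFFSET=0           # partition a offset within the slice (from bsdlabel)
A_SIZE=1048576       # partition a size (from bsdlabel)

if [ "$LBA" -ge "$S1_START" ] && [ "$LBA" -lt $((S1_START + S1_SIZE)) ]; then
    echo "LBA $LBA is in slice 1 (ad4s1)"
fi
REL=$((LBA - S1_START))              # sector offset within the slice
if [ "$REL" -ge "$A_OFFSET" ] && [ "$REL" -lt $((A_OFFSET + A_SIZE)) ]; then
    echo "slice-relative sector $REL is in partition a (ad4s1a)"
fi
```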
Next, I had to find the inode that used these sectors, using # fsdb -r /dev/ad4s1a. As the fsdb manpage notes, you need to use the right offset to find the owner of a "block" (much experimentation showed that what you actually need is a sector number), so I took the numbers 548191 and 548223 from /var/log/messages above and subtracted 63 for the slice offset, giving 548128 and 548160. Running # findblk 548128 548160 in the fsdb console returned:
Code:
548128: data block of inode 33025
548160: data block of inode 33025
You can check the inode with # inode 33025 and # blocks, which returned 137032 and 137040: the numbers above divided by 4. (The sector size is 512 bytes and the fragment size is 2048 bytes; divide the two to get the factor of 4. The naming for all this is really confusing and threw me off for a while.) Since both sectors had the same inode, I ran % find / -inum 33025 to at last find the corrupted file, /lib/libgeom.so.4.

Finally, I downloaded base from ftp.freebsd.org and extracted a fresh libgeom.so.4 to overwrite the corrupted one. After a bit of moving around and renaming the old library, the smartctl output showed the Current_Pending_Sector RAW_VALUE reset to 0! fsdb showed the same fragments, 137032 and 137040, in use for the inode, implying only the bad sectors were mapped out by the hard drive firmware. I then rebooted and everything was back to normal. :)
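To spell the arithmetic out as a sketch (numbers hard-coded from this thread; the 512 comes from fdisk's media sector size and the 2048 from dumpfs's fsize):

```shell
LBA=548191           # failing sector from /var/log/messages
SLICE_START=63       # slice offset from fdisk
SECTOR=512           # media sector size
FSIZE=2048           # fragment size (dumpfs calls it fsize)

REL=$((LBA - SLICE_START))           # the number fsdb's findblk wants
FRAG=$((REL / (FSIZE / SECTOR)))     # the number fsdb's blocks prints
echo "findblk $REL -> fragment $FRAG"
# -> findblk 548128 -> fragment 137032
```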

I spent most of a day tracking this down and fixing it, so I thought I'd write it up for others. I'm new to this low-level disk stuff, so someone can correct me if I got any details wrong. I wonder how other operating systems handle disk corruption like this; it certainly doesn't seem right that I have to do all this by hand. I've also read that if many more sectors go bad like this, it probably means the disk is failing, but a few bad sectors don't necessarily mean anything.
 

sprewell


Aah, you're right, I forgot to include one useful command that I used to figure out the fragment size: % dumpfs /dev/ad4s1a | more. The info you're looking for is near the top; the fragment size is called fsize.
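If you only want the number, something like this pulls it out. The heredoc stands in for real dumpfs /dev/ad4s1a output here so the sketch is self-contained; the exact layout is assumed and may differ between FreeBSD versions:

```shell
# Extract the fragment size (fsize) from dumpfs-style output.
# The heredoc mimics the relevant lines of `dumpfs /dev/ad4s1a`.
fsize=$(awk '$1 == "fsize" { print $2; exit }' <<'EOF'
bsize	16384	shift	14	mask	0xffffc000
fsize	2048	shift	11	mask	0xfffff800
EOF
)
echo "fsize = $fsize"
# -> fsize = 2048
```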
 