[FreeNAS] replacing a drive under ZFS after a total failure

Discussion in 'Storage' started by joefish75, Jan 13, 2012.

  1. joefish75

    joefish75 New Member

    I have replaced a failed drive in my NAS and am having real difficulty getting ZFS to recognise this. There have been a few stages to this problem, so I'll describe them here:

    I set up my FreeNAS box with 8.0.0 a few months ago, using 3x Samsung 2 TB F4EG (HD204UI) drives in a raidz1 configuration. A few weeks ago some S.M.A.R.T. errors appeared (197/0xC5 Current_Pending_Sector had a raw value of 2). A few days later the drive failed altogether, first slowing a scrub right down, with multiple read and write errors appearing on that drive.

    I exported the pool, shut down, removed the failed drive and installed 8.0.2 onto a new USB stick, which I booted and imported the freenas-1.db into. The NAS found the (degraded) pool and all was as I expected. I ran a scrub on the degraded pool just to make sure the two remaining drives were fine.

    This is where the confusion starts. I replaced the failed drive, popped in an identical Samsung drive, and noticed in the GUI that I had two ada2 drives. On the terminal, I 'offline'd the failed drive. In the web GUI I tried to replace the drive with the new one, but when the resilvering was done, it still had the failed drive in the list, stayed as 'degraded' and had the status 'replacing', even though it had finished.

    Now when I did a scrub, I got some fresh problems:

    Code:
    zpool status -v
      pool: pool
     state: DEGRADED
    status: One or more devices has experienced an error resulting in data
    	corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
    	entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: scrub completed after 10h9m with 1 errors on Mon Jan  9 19:38:49 2012
    config:
    
    	NAME                                            STATE     READ WRITE CKSUM
    	pool                                            DEGRADED     2     0     0
    	  raidz1                                        DEGRADED     2     0     0
    	    gptid/41542893-cfe7-11e0-8f22-78acc0f799d0  ONLINE       0     0     0
    	    3150351496029849676                         UNAVAIL      0     0     0  was /dev/gpt/ada1
    	    ada1p2                                      ONLINE       2     0     0
    
    errors: List of errors unavailable (insufficient privileges)
    


    How do I get rid of 3150351496029849676 forever, and make the new drive (in the ada1 slot now) resilver properly? I can live with losing one file, but I don't want to lose more!

    Does ada1p2 also appear to have a real failure, or could this have happened some other way?
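
    Since the numeric ID of the missing member is long and easy to mistype, here is a minimal sketch that pulls any member that is not ONLINE straight out of `zpool status` output. The function name `non_online_members` is made up for illustration, and the parsing is a heuristic that assumes the column layout shown above, not a complete parser.

```shell
# List vdev members whose state is not ONLINE, reading `zpool status` text on stdin.
# $1 is the pool name; the pool row and raidz/mirror rows are skipped by name.
# Hypothetical helper: it assumes the NAME/STATE/READ/WRITE/CKSUM layout shown above.
non_online_members() {
    awk -v pool="$1" '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }   # start of the config table
        $1 == "errors:"               { in_cfg = 0 }         # end of the config table
        in_cfg && NF >= 5 && $1 != pool && $1 !~ /^(raidz|mirror)/ && $2 != "ONLINE" { print $1 }
    '
}

# On the NAS itself (not run here):
#   zpool status pool | non_online_members pool
```

    The printed ID can then be pasted into whatever command needs it instead of being retyped.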
     
  2. soulreaver1

    soulreaver1 New Member

    You should try using this procedure for replacing devices (quote from zfs admin manual by Oracle)
    You've already inserted the new disk, so you just need to do the last two steps.
     
  3. joefish75

    joefish75 New Member

    Excellent, but won't I run the risk of replacing it with one of the old drives? The original failed drive was ada1, but ada1p2 now shows as online in the status output, so maybe that entry is really the old ada2.

    I could do:

    Code:
    zpool replace pool 3150351496029849676 /dev/ada2
    


    But I definitely don't want to do this if I am replacing the drive with one that is actually one of the only working old drives!
     
  4. phoenix

    phoenix Moderator Staff Member Moderator

    Add the following to /boot/loader.conf:
    Code:
    kern.geom.label.gptid.enable="0"                # Disable the auto-generated GPT UUIDs for disks
    kern.geom.label.ufsid.enable="0"                # Disable the auto-generated UFS UUIDs for filesystems


    Then reboot the system.

    That will remove the gptid/blahblahblah entry from the status output, and show you the actual device node for it instead. Then you'll know which device to use in the replace command. Most likely it'll be ada2.
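
    If a reboot is inconvenient, `glabel status` also lists which provider each GEOM label sits on (columns Name, Status, Components), so the gptid entry can be mapped to a device node by hand. A minimal sketch, where the helper name `label_to_dev` is made up for illustration:

```shell
# Print the provider (for example ada0p2) behind a GEOM label.
# Reads `glabel status` output on stdin; $1 is the label exactly as zpool shows it.
# Hypothetical helper, assuming the three-column Name/Status/Components layout.
label_to_dev() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# On the NAS itself (not run here):
#   glabel status | label_to_dev gptid/41542893-cfe7-11e0-8f22-78acc0f799d0
```

    Whatever device that prints is one of the working members, so it is not the one to pass to the replace command.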

    Note: ada(4) devices are enumerated (named) numerically starting at 0, based on the order in which they are detected by the OS. It doesn't matter which slot they are plugged into; the numbering follows detection order only. For example, connect disks to SATA ports 0, 1 and 2, and you get ada0, ada1, ada2. Unplug port 1 and reboot, and you get ada0 and ada1 (as in, the disk that was ada2 now shows up as ada1). You cannot count on device nodes remaining the same.

    This is why it's usually a good idea to label the disks via either glabel(8) (for labelling the entire disk) or gpart(8) (for labelling partitions), and then use the label devices when creating the vdevs. For example, in a 16-bay chassis where columns are letters and rows are numbers, the following makes it very easy to tell which disk is which:
    Code:
    $ zpool status
      pool: storage
     state: ONLINE
     scan: scrub repaired 0 in 30h34m with 0 errors on Tue Jan 17 03:12:28 2012
    config:
    
            NAME             STATE     READ WRITE CKSUM
            storage          ONLINE       0     0     0
              raidz2-0       ONLINE       0     0     0
                gpt/disk-a1  ONLINE       0     0     0
                gpt/disk-a2  ONLINE       0     0     0
                gpt/disk-a3  ONLINE       0     0     0
                gpt/disk-a4  ONLINE       0     0     0
                gpt/disk-b1  ONLINE       0     0     0
              raidz2-1       ONLINE       0     0     0
                gpt/disk-b2  ONLINE       0     0     0
                gpt/disk-b3  ONLINE       0     0     0
                gpt/disk-b4  ONLINE       0     0     0
                gpt/disk-c1  ONLINE       0     0     0
                gpt/disk-c2  ONLINE       0     0     0
              raidz2-2       ONLINE       0     0     0
                gpt/disk-c3  ONLINE       0     0     0
                gpt/disk-c4  ONLINE       0     0     0
                gpt/disk-d1  ONLINE       0     0     0
                gpt/disk-d2  ONLINE       0     0     0
                gpt/disk-d3  ONLINE       0     0     0
            cache
              gpt/cache      ONLINE       0     0     0
    
    errors: No known data errors
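
    A minimal sketch of how a labelled replacement could be set up with gpart(8). It only prints the commands so they can be reviewed before anything is run. The function name `print_label_plan`, the device `ada2` and the label `disk-a2` are placeholders, not values confirmed anywhere in this thread, and on FreeNAS the web GUI normally does its own partitioning, so treat this as plain-FreeBSD practice.

```shell
# Print (do not run) the commands to GPT-label a new disk and swap it into a pool.
# Arguments: device, GPT label, pool name, old member (name or numeric ID).
# All four values are supplied by the caller; nothing here detects them.
print_label_plan() {
    dev=$1 label=$2 pool=$3 old=$4
    printf 'gpart create -s gpt %s\n' "$dev"                        # new GPT partition table
    printf 'gpart add -t freebsd-zfs -l %s %s\n' "$label" "$dev"    # one labelled ZFS partition
    printf 'zpool replace %s %s gpt/%s\n' "$pool" "$old" "$label"   # replace via the label device
}

# Example with placeholder values:
#   print_label_plan ada2 disk-a2 pool 3150351496029849676
```

    `gpart create` refuses to run on a disk that already has a partition table, which gives a small safety net against pointing it at a disk that is already in use.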
     
  5. soulreaver1

    soulreaver1 New Member

    Are you sure about that? In my home NAS I have two SATA devices, which are discovered as ad4 and ad6. If I disconnect ad4 and then reboot, the second device is still present as ad6. The only way to change their IDs is to move the SATA cables to different ports.
     
  6. phoenix

    phoenix Moderator Staff Member Moderator

    ad(4) is not the same as ada(4).

    IDE ports (and SATA ports accessed via the ata(4) subsystem) are, by default, always numbered the same, as you've noticed. ad0 is the primary master, ad1 is the primary slave, ad2 is the secondary master, ad3 is the secondary slave, ad4 is the first SATA port, ad6 is the second SATA port, ad8 is the third SATA port, and so on.

    SATA ports accessed via the cam(4) layer (whether via the ATA_CAM shims or ahci(4)), are enumerated in numerical order, just like SCSI devices always have been. ada0 is the first detected SATA device, ada1 is the second detected SATA device, ada2 is the third detected SATA device, regardless of which SATA port they are connected to.
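
    To make the difference concrete, here is a small sketch that prints both names for a set of populated SATA ports. It assumes the layout described above (first SATA port is ad4, then ad6, ad8, and so on) and that ports are probed in ascending order; the function name `show_names` is made up, and real hardware may number things differently.

```shell
# Print the ad(4)-style and ada(4)-style name for each populated SATA port.
# Arguments: populated port numbers, 1-based, in the order they are probed.
# Assumes the fixed ad numbering from the post above: port k maps to ad(4 + 2*(k-1)).
show_names() {
    i=0
    for port in "$@"; do
        printf 'port %s: ad%s / ada%s\n' "$port" "$((4 + 2 * ($port - 1)))" "$i"
        i=$((i + 1))
    done
}

# show_names 1 2 3
#   port 1: ad4 / ada0
#   port 2: ad6 / ada1
#   port 3: ad8 / ada2
# show_names 1 3    (disk on port 2 removed)
#   port 1: ad4 / ada0
#   port 3: ad8 / ada1
```

    The ad name stays tied to the port, while the ada name of the third disk shifts from ada2 to ada1 once the second disk is gone.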