[FreeNAS] replacing a drive under ZFS after a total failure


Post by joefish75 » 13 Jan 2012, 11:49

I have replaced a failed drive in my NAS and am having real difficulty getting ZFS to recognise this. There have been a few stages to this problem, so I'll describe them here:

I set up my FreeNAS box with 8.0.0 a few months ago, using 3x Samsung 2 TB F4EG (HD204UI) drives in a raidz1 configuration. A few weeks ago some S.M.A.R.T. errors appeared (197/0xC5 Current_Pending_Sector had a raw value of 2). A few days later the drive failed altogether, first slowing a scrub right down, with multiple read and write errors appearing on that drive.

I exported the pool, shut down, removed the failed drive and installed 8.0.2 onto a new USB stick, which I booted and imported the freenas-1.db into. The NAS found the (degraded) pool and all was as I expected. I did a scrub (on the degraded pool) just to make sure the two remaining drives were fine.

This is where the confusion starts. I put an identical Samsung drive in place of the failed one, and noticed in the GUI that I had two ada2 drives. On the terminal, I 'offline'd the failed drive. In the web GUI I tried to replace the drive with the new one, but when the resilvering was done, the failed drive was still in the list, and the pool stayed 'degraded' with the status 'replacing', even though it had finished.

Now when I did a scrub, I got some fresh problems:

Code:
zpool status -v
  pool: pool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 10h9m with 1 errors on Mon Jan  9 19:38:49 2012
config:

   NAME                                            STATE     READ WRITE CKSUM
   pool                                            DEGRADED     2     0     0
     raidz1                                        DEGRADED     2     0     0
       gptid/41542893-cfe7-11e0-8f22-78acc0f799d0  ONLINE       0     0     0
       3150351496029849676                         UNAVAIL      0     0     0  was /dev/gpt/ada1
       ada1p2                                      ONLINE       2     0     0

errors: List of errors unavailable (insufficient privileges)


How do I get rid of 3150351496029849676 forever, and make the new drive (in the ada1 slot now) resilver properly? I can live with losing one file, but I don't want to lose more!

Does ada1p2 also appear to have a real failure, or could this have happened some other way?
joefish75
Junior Member
 
Posts: 2
Joined: 09 Jan 2012, 22:45

Post by soulreaver1 » 16 Jan 2012, 10:23

You should try this procedure for replacing devices (quoted from the Oracle ZFS administration guide).
The following are the basic steps for replacing a disk:
■ Offline the disk, if necessary, with the zpool offline command.
■ Remove the disk to be replaced.
■ Insert the replacement disk.
■ Run the zpool replace command. For example: [CMD="zpool"] replace pool_name old_device new_device[/CMD] (the new device should be given as /dev/name).
■ Bring the disk online with the zpool online command.



You've already inserted the new disk, so you only need to do the last two steps.
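As a rough illustration of those last two steps, here is a minimal shell sketch. It assumes the names used elsewhere in this thread (the pool called pool, the numeric ID from the status output, and /dev/ada2 as the new disk, which has not been confirmed). The helpers run and replace_failed_disk are made-up names, not part of ZFS, and by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# run and replace_failed_disk are hypothetical helpers, not ZFS tools.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run: just show the command
    else
        "$@"           # real run: execute it
    fi
}

# usage: replace_failed_disk POOL OLD_DEVICE NEW_DEVICE
replace_failed_disk() {
    run zpool replace "$1" "$2" "$3"   # resilver onto the new disk
    run zpool online "$1" "$3"         # bring the new disk online
    run zpool status -v "$1"           # check resilver progress
}
```

Calling replace_failed_disk pool 3150351496029849676 /dev/ada2 prints the three commands so they can be checked before running them for real with DRY_RUN=0.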
soulreaver1
Junior Member
 
Posts: 37
Joined: 05 Oct 2011, 18:35
Location: Warsaw, Poland

Post by joefish75 » 18 Jan 2012, 11:25

Excellent. The problem is, won't I run the risk of replacing it with one of the old drives? The original drive was [FILE]ada1[/FILE], but now it appears that [FILE]ada1p2[/FILE] is still online (so maybe that's the old [FILE]ada2[/FILE]).

I could do:

Code:
zpool replace pool 3150351496029849676 /dev/ada2


But I definitely don't want to do this if I am replacing the drive with one that is actually one of the only working old drives!
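Since the long numeric ID is easy to mistype, one option is to pull it straight out of the status output instead of copying it by hand. The following is a minimal sketch; list_missing_members is a made-up helper name, and it simply prints the first column of any line whose state column is UNAVAIL, FAULTED, OFFLINE or REMOVED. It does not tell you which device node is the new disk; it only identifies the member that needs replacing.

```shell
#!/bin/sh
# Sketch only: list_missing_members is a hypothetical helper.
# Reads `zpool status` output on stdin and prints the name of every
# entry whose STATE column shows it is not available.

list_missing_members() {
    awk '$2 ~ /^(UNAVAIL|FAULTED|OFFLINE|REMOVED)$/ { print $1 }'
}
```

Typical use would be zpool status pool | list_missing_members, which for the status output earlier in this thread prints the ID of the UNAVAIL member.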

Post by phoenix » 18 Jan 2012, 21:44

Add the following to [file]/boot/loader.conf[/file]:
Code:
kern.geom.label.gptid.enable="0"                # Disable the auto-generated GPT UUIDs for disks
kern.geom.label.ufsid.enable="0"                # Disable the auto-generated UFS UUIDs for filesystems


Then reboot the system.

That will remove the gptid/blahblahblah entry from the status output, and show you the actual device node for it instead. Then you'll know which device to use in the replace command. Most likely it'll be ada2.
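If you would rather script the loader.conf change than edit the file by hand, something along these lines might do. It is only a sketch: set_loader_tunable is a made-up helper name, it rewrites the file so that each tunable appears exactly once, and it drops any trailing comment on a line it replaces. On FreeNAS the root filesystem may be mounted read-only, in which case it would have to be remounted writable first; that is not shown here.

```shell
#!/bin/sh
# Sketch only: set_loader_tunable is a hypothetical helper.
# usage: set_loader_tunable FILE NAME VALUE

set_loader_tunable() {
    file=$1; name=$2; value=$3
    tmp=$(mktemp "${TMPDIR:-/tmp}/loader.conf.XXXXXX") || return 1
    # Keep every line except an earlier assignment of the same tunable.
    awk -v n="$name=" 'index($0, n) != 1' "$file" > "$tmp"
    # Append the new assignment in loader.conf syntax: name="value"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, set_loader_tunable /boot/loader.conf kern.geom.label.gptid.enable 0, and the same again for kern.geom.label.ufsid.enable. Running it twice leaves a single line per tunable.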

Note: [man=4]ada[/man] devices are enumerated (named) numerically starting at 0, based on the order that they are detected by the OS. It doesn't matter which slot they are plugged into; the numbering is based on the order that they are used. For example, connect SATA ports 0, 1, 2, you get ada0, ada1, ada2. Unplug port 1 and reboot, and you get ada0 and ada1 (as in, ada2 is now showing as ada1). You cannot count on device nodes remaining the same.
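To make the renumbering concrete, here is a toy model of it in shell. It does not talk to any hardware; enumerate_ada is a made-up function that just hands out ada numbers in the order the populated ports are listed, which is the behaviour described above.

```shell
#!/bin/sh
# Toy model only: enumerate_ada is hypothetical and touches no hardware.
# usage: enumerate_ada PORT...   (ports with a disk attached, in detection order)

enumerate_ada() {
    unit=0
    for port in "$@"; do
        echo "SATA port $port -> ada$unit"
        unit=$((unit + 1))      # next detected disk gets the next number
    done
}
```

enumerate_ada 0 1 2 maps port 2 to ada2, while enumerate_ada 0 2 (port 1 unplugged) maps port 2 to ada1.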

This is why it's usually a good idea to label the disks via either [man=8]glabel[/man] (for labelling the entire disk) or [man=8]gpart[/man] (to label partitions), and then use those labels when creating the vdevs. For example (a 16-bay chassis, where columns are letters and rows are numbers), the following makes it very easy to tell which disk is which in the chassis:
Code:
$ zpool status
  pool: storage
 state: ONLINE
 scan: scrub repaired 0 in 30h34m with 0 errors on Tue Jan 17 03:12:28 2012
config:

        NAME             STATE     READ WRITE CKSUM
        storage          ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            gpt/disk-a1  ONLINE       0     0     0
            gpt/disk-a2  ONLINE       0     0     0
            gpt/disk-a3  ONLINE       0     0     0
            gpt/disk-a4  ONLINE       0     0     0
            gpt/disk-b1  ONLINE       0     0     0
          raidz2-1       ONLINE       0     0     0
            gpt/disk-b2  ONLINE       0     0     0
            gpt/disk-b3  ONLINE       0     0     0
            gpt/disk-b4  ONLINE       0     0     0
            gpt/disk-c1  ONLINE       0     0     0
            gpt/disk-c2  ONLINE       0     0     0
          raidz2-2       ONLINE       0     0     0
            gpt/disk-c3  ONLINE       0     0     0
            gpt/disk-c4  ONLINE       0     0     0
            gpt/disk-d1  ONLINE       0     0     0
            gpt/disk-d2  ONLINE       0     0     0
            gpt/disk-d3  ONLINE       0     0     0
        cache
          gpt/cache      ONLINE       0     0     0

errors: No known data errors
Freddie
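For the pool in this thread, a label-based replacement could look roughly like the sketch below. Everything specific in it is an assumption: plan_labelled_replace is a made-up helper, the new disk is assumed to be ada2 (not confirmed), and disk-a3 is just an example label. It gives the whole disk to a single freebsd-zfs partition; the existing members here appear to use a second partition (ada1p2), so the real layout may need to differ. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: prints a plan, runs nothing.
# plan_labelled_replace is a hypothetical helper.
# usage: plan_labelled_replace POOL OLD_MEMBER DISK LABEL

plan_labelled_replace() {
    echo "gpart create -s gpt $3"                 # new GPT on the blank disk
    echo "gpart add -t freebsd-zfs -l $4 $3"      # one labelled ZFS partition
    echo "zpool replace $1 $2 gpt/$4"             # replace using the label
}
```

plan_labelled_replace pool 3150351496029849676 ada2 disk-a3 prints three commands, the last of which refers to the new disk as gpt/disk-a3 rather than by device node.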

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.
phoenix
MFC'd
 
Posts: 3349
Joined: 17 Nov 2008, 05:43
Location: Kamloops, BC, Canada

Post by soulreaver1 » 19 Jan 2012, 08:36

phoenix wrote:
Note: [man=4]ada[/man] devices are enumerated (named) numerically starting at 0, based on the order that they are detected by the OS. It doesn't matter which slot they are plugged into; the numbering is based on the order that they are used. For example, connect SATA ports 0, 1, 2, you get ada0, ada1, ada2. Unplug port 1 and reboot, and you get ada0 and ada1 (as in, ada2 is now showing as ada1). You cannot count on device nodes remaining the same.


Are you sure about that? In my home NAS I have two SATA devices, which are discovered as ad4 and ad6. If I disconnect ad4 and then reboot, the second device is still present as ad6. The only way to change their IDs is to swap the SATA cables (change ports).

Post by phoenix » 19 Jan 2012, 15:30

[man=4]ad[/man] is not the same as [man=4]ada[/man].

IDE ports (and SATA ports accessed via the [man=4]ata[/man] subsystem) are, by default, always numbered the same, as you've noticed. ad0 is the primary master, ad1 is the primary slave, ad2 is the secondary master, ad3 is the secondary slave, ad4 is the first SATA port, ad6 is the second SATA port, ad8 is the third SATA port, and so on.
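The fixed numbering in that list works out to two unit numbers per channel. As a small arithmetic sketch (ad_unit is a made-up function, and it assumes the two legacy IDE channels are channels 0 and 1, with the SATA ports following as channels 2, 3, 4 and so on, each with only a master):

```shell
#!/bin/sh
# Arithmetic sketch only: ad_unit is hypothetical.
# usage: ad_unit CHANNEL DRIVE   (DRIVE: 0 = master, 1 = slave)

ad_unit() {
    echo "ad$(( $1 * 2 + $2 ))"   # two unit numbers reserved per channel
}
```

ad_unit 1 1 gives ad3 (secondary slave) and ad_unit 3 0 gives ad6 (second SATA port), whether or not anything is attached to the earlier channels.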

SATA ports accessed via the [man=4]cam[/man] layer (whether via the ATA_CAM shims or [man=4]ahci[/man]) are enumerated in numerical order, just like SCSI devices always have been. ada0 is the first detected SATA device, ada1 is the second detected SATA device, ada2 is the third detected SATA device, regardless of which SATA port they are connected to.
Freddie


