Solved Zpool REMOVED status

One of two hard discs (ada1) in a zpool mirror started to fail today, so I tried to replace it with a spare disc. I used gpart to give the spare partitions of the same sizes as those on the OK disc (ada0), including a boot partition, and successfully added it with zpool attach. In the midst of the resilvering, the "good" disc also seems to have failed, with the status "REMOVED", as can be seen in the following output of zpool status:
Code:
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jul  1 14:46:44 2022
    253G scanned at 17.0M/s, 222G issued at 14.9M/s, 1.94T total
    185G resilvered, 11.18% done, 1 days 09:32:38 to go
config:

    NAME                     STATE     READ WRITE CKSUM
    zroot                    DEGRADED  527K     0     0
      mirror-0               DEGRADED 1.03M     0     6
        ada1p3               FAULTED      0     4     0  too many errors
        1457454098546718061  REMOVED      0     0     0  was /dev/ada0p3
        ada2p3               ONLINE       0     0 1.03M
After trying to su, the computer seems to have bricked. Does anyone have any suggestions for course of action?

This is what the console looks like:

View: https://www.youtube.com/watch?v=lbRbTny768s


I suspect I should just power this down physically for a start. I'm not sure what else I can do.
 
So ada1 is the "new" disk and ada0 the old one?
Did you create a 3-way mirror with ada2 during the replace operation?
Are they on physically different connectors in the box?

You're going to need to physically power down at some point. In theory ada2 should be fine and still able to boot the box.
Make sure you have good quality data cables for the drives (I'm assuming SATA here) and that they support whatever the theoretical bandwidth is for your system.
I would power down, unplug from the wall, give it a few minutes, then open the case up (skip this if it's already open).
Then unplug the power and data cables from ada0, and reseat the power and data cables on ada1 (both ends of the data cable).
Use compressed air to blow out all the dust in the box, concentrating on the power supply and the CPU fans.
Leaving ada0's power and data unplugged, power the box back on and see what happens.

Sometimes the errors you see are "false positives" from a bad data cable or slightly overloaded power supply, but often they indicate the drive is failing. smartmontools is your friend here.

Dust creates heat, heat is bad if it gets too high.
 
If he had a mirror with 2 disks, attach is the correct command to add a 3rd disk and make it a 3-way mirror, which is handy if you want to move to bigger devices.
I think replace also works.
Rereading the original, I think ada2 and ada1 were the original mirror, and ada0 was then attached to make a 3-way mirror. After the resilver was done he would have detached ada1, leaving ada0 and ada2 in the mirror.
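
For reference, here is a minimal sketch of the two routes mentioned above. It only prints the commands rather than running them. The function name is made up, and the pool and device arguments are placeholders chosen to match the zpool status output in the first post, not the poster's actual shell history.

```shell
# Dry run: print the zpool commands for swapping a mirror member.
# Arguments (all placeholders): pool, healthy member, failing member, new disk.
plan_mirror_swap() {
    pool=$1 good=$2 bad=$3 new=$4

    echo "# Route A: grow to a 3-way mirror, then drop the failing member"
    echo "zpool attach $pool $good $new"
    echo "zpool detach $pool $bad"

    echo "# Route B: swap the failing member in one step"
    echo "zpool replace $pool $bad $new"
}

plan_mirror_swap zroot ada0p3 ada1p3 ada2p3
```

With route A the detach should only be run once the resilver has finished. Note that zpool add is a different operation: it adds a new top-level vdev to the pool rather than another side of an existing mirror.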

Hard to tell but it's possible the errors on ada1 caused the resilver to fail on ada0.
 
After trying to su, the computer seems to have bricked.
Yes, because ada1 is faulted, ada0 is removed and ada2 is an incomplete copy.

Does anyone have any suggestions for course of action?
If I get this all correct, then: ada0 and ada1 were the original mirror, then ada1 developed errors, and ada2 was attached to replace it. During the resilver, while the new copy was being written to ada2, ada0 decided to go offline and was removed. So now there is no valid copy at all.

A disk is usually removed when it sends a disconnect. This may happen when it gets too hot, when the power is unstable, or similar things. It does not necessarily mean that the disk is lost.
What I would do is, first, get a rescue stick that can independently boot the system without touching this pool (it's not cool to repair a pool while the OS runs from it). Then, before booting, disconnect ada1 and ada2, and see if ZFS can recognize and accept ada0 again.
If so, it then should be a complete valid copy that can work. Then reboot again with ada2 also attached and see if the resilver can be completed.

If ada0 cannot be made working again, then it gets ugly. Then I would disable ZFS, try to low-level read the raw image from ada1 with dd and write it to ada2. Then start zfs and see if that written ada2 can be used. It will have some data loss.
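
A sketch of the raw-copy fallback described above, assuming dd's conv=noerror,sync is acceptable here: noerror keeps going past unreadable blocks, and sync pads each short or failed block with zeroes so that offsets stay aligned. The function name is made up, and the partition names in the comment simply follow the naming used in this thread. Whatever is given as the destination is overwritten.

```shell
# Copy a source image or device to a destination block by block,
# continuing past read errors (noerror) and zero-padding bad blocks (sync).
raw_copy() {
    src=$1 dst=$2
    dd if="$src" of="$dst" bs=64k conv=noerror,sync
}

# Hypothetical use from a rescue system, with the pool NOT imported:
#   raw_copy /dev/ada1p3 /dev/ada2p3
```

A smaller bs loses less data around each bad sector, at the cost of speed.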
 
Sorry about my confusing post. I'll just summarise to reply, rather than respond to the individual messages above.

ada0 and ada1 are the two original discs. ada2 is the new disc. After ada1 failed, ada2 was partitioned and attached to the pool. During resilvering, with about 1/5 of it complete, ada0 seems also to have failed, with the REMOVED status as shown in my original post. I was not sure whether to leave the machine powered on because the resilvering was still in progress, or to switch it off. In the end I decided that it could not do much good to leave it running and powered the computer off with the physical switch.
If I get this all correct, then: ada0 and ada1 were the original mirror, then ada1 developed errors, and ada2 was attached to replace it. During the resilver, while the new copy was being written to ada2, ada0 decided to go offline and was removed. So now there is no valid copy at all.

Yes, exactly this.

A disk is usually removed when it sends a disconnect. This may happen when it gets too hot, when the power is unstable, or similar things. It does not necessarily mean that the disk is lost.
What I would do is, first, get a rescue stick that can independently boot the system without touching this pool (it's not cool to repair a pool while the OS runs from it). Then, before booting, disconnect ada1 and ada2, and see if ZFS can recognize and accept ada0 again.
If so, it then should be a complete valid copy that can work. Then reboot again with ada2 also attached and see if the resilver can be completed.

I'll try this and report back.

If ada0 cannot be made working again, then it gets ugly. Then I would disable ZFS, try to low-level read the raw image from ada1 with dd and write it to ada2. Then start zfs and see if that written ada2 can be used. It will have some data loss.
Thank you for two useful ideas for resolving the problem.
 
Using the information in this post, I created a bootable USB for FreeBSD 12.3 and started the machine up. First I tried with everything except /dev/ada0 disconnected, but nothing showed in /dev/. Then I tried with /dev/ada0, /dev/ada1, and /dev/ada2 all connected, and this time ada0 and ada1 showed in /dev/ but not ada2.
[Attached image: internals.jpg]

To check the disc, I first of all tried fsck /dev/ada0 and variations, but the result of this was the error "No such file or directory". A Google search led me here, where I found the suggestion to run file -s /dev/ada0 on the devices. I got
Code:
/dev/ada0: DOS/MBR boot sector; partition 1 : ID=0xee, start-CHS (0x0,0,2), end-CHS (0x3ff,255,63), startsector 1, 4294967295 sectors
but file -s /dev/ada0p3 printed a result of "data". Then I tried zdb -l /dev/ada0p3, as also suggested in that thread, and got a lot of information, which I apologise for quoting in full since I'm not sure what the important parts are:
Code:
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 32486097
    pool_guid: 9802395473873768945
    hostid: 2994395337
    hostname: 'orange'
    top_guid: 1164000935590157935
    guid: 1457454098546718061
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 1164000935590157935
        metaslab_array: 38
        metaslab_shift: 34
        ashift: 12
        asize: 2998439247872
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2120269406060973466
            path: '/dev/ada1p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01/p3'
            whole_disk: 1
            DTL: 194
            create_txg: 4
            aux_state: 'err_exceeded'
        children[1]:
            type: 'disk'
            id: 1
            guid: 1457454098546718061
            path: '/dev/ada0p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@1/elmdesc@Slot_00/p3'
            whole_disk: 1
            DTL: 193
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 3177278176522888156
            path: '/dev/ada2p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p3'
            whole_disk: 1
            DTL: 349
            create_txg: 4
            resilver_txg: 32486092
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 32486097
    pool_guid: 9802395473873768945
    hostid: 2994395337
    hostname: 'orange'
    top_guid: 1164000935590157935
    guid: 1457454098546718061
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 1164000935590157935
        metaslab_array: 38
        metaslab_shift: 34
        ashift: 12
        asize: 2998439247872
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2120269406060973466
            path: '/dev/ada1p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01/p3'
            whole_disk: 1
            DTL: 194
            create_txg: 4
            aux_state: 'err_exceeded'
        children[1]:
            type: 'disk'
            id: 1
            guid: 1457454098546718061
            path: '/dev/ada0p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@1/elmdesc@Slot_00/p3'
            whole_disk: 1
            DTL: 193
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 3177278176522888156
            path: '/dev/ada2p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p3'
            whole_disk: 1
            DTL: 349
            create_txg: 4
            resilver_txg: 32486092
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 2
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 32486097
    pool_guid: 9802395473873768945
    hostid: 2994395337
    hostname: 'orange'
    top_guid: 1164000935590157935
    guid: 1457454098546718061
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 1164000935590157935
        metaslab_array: 38
        metaslab_shift: 34
        ashift: 12
        asize: 2998439247872
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2120269406060973466
            path: '/dev/ada1p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01/p3'
            whole_disk: 1
            DTL: 194
            create_txg: 4
            aux_state: 'err_exceeded'
        children[1]:
            type: 'disk'
            id: 1
            guid: 1457454098546718061
            path: '/dev/ada0p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@1/elmdesc@Slot_00/p3'
            whole_disk: 1
            DTL: 193
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 3177278176522888156
            path: '/dev/ada2p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p3'
            whole_disk: 1
            DTL: 349
            create_txg: 4
            resilver_txg: 32486092
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 3
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 32486097
    pool_guid: 9802395473873768945
    hostid: 2994395337
    hostname: 'orange'
    top_guid: 1164000935590157935
    guid: 1457454098546718061
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 1164000935590157935
        metaslab_array: 38
        metaslab_shift: 34
        ashift: 12
        asize: 2998439247872
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2120269406060973466
            path: '/dev/ada1p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01/p3'
            whole_disk: 1
            DTL: 194
            create_txg: 4
            aux_state: 'err_exceeded'
        children[1]:
            type: 'disk'
            id: 1
            guid: 1457454098546718061
            path: '/dev/ada0p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@1/elmdesc@Slot_00/p3'
            whole_disk: 1
            DTL: 193
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 3177278176522888156
            path: '/dev/ada2p3'
            phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p3'
            whole_disk: 1
            DTL: 349
            create_txg: 4
            resilver_txg: 32486092
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
I'm not sure what direction to take at this point. There seems to be some data left on the disc, but I'm not sure of the procedure to recover it.
 
Oh my gods. :)

Some helpful commands.
This shows what the system has detected (disks and partitions):
Code:
$ ls -l /dev/ada*
crw-r-----  1 root  operator   0x8b Jul  1 23:08 /dev/ada0
crw-r-----  1 root  operator   0x9c Jul  1 23:08 /dev/ada0p1
crw-r-----  1 root  operator   0x9e Jul  2 04:24 /dev/ada0p2
crw-r-----  1 root  operator   0xa0 Jul  2 04:24 /dev/ada0p3
[...]
This shows who is who:
Code:
# camcontrol devlist
<WDC WD5000AAKS-00A7B2 01.03B01>   at scbus2 target 0 lun 0 (ada0,pass7)
<SanDisk SDSSDA120G Z22000RL>      at scbus3 target 0 lun 0 (pass8,ada1)
<ST3000DM008-2DM166 CC26>          at scbus4 target 0 lun 0 (pass9,ada2)
<HP SSD S700 250GB S0704A1>        at scbus5 target 0 lun 0 (pass10,ada3)
[...]
So if your disks aren't exactly identical models, you can recognize them from this list and see which ada* name was given to each.
If they are exactly identical, you also need the serial number, like so:
Code:
# camcontrol identify ada0
[...]
device model          WDC WD5000AAKS-00A7B2
firmware revision     01.03B01
serial number         WD-WCASY7821919
[...]

Using the information in this post, I created a bootable USB for FreeBSD 12.3 and started the machine up. First I tried with everything except /dev/ada0 disconnected, but nothing showed in /dev/. Then I tried with /dev/ada0, /dev/ada1, and /dev/ada2 all connected, and this time ada0 and ada1 showed in /dev/ but not ada2.
On every boot the disks are given numbers anew, beginning with 0. So what actually showed up might be what formerly was ada1 and ada2, now as ada0 and ada1 - because the former ada0 wasn't detected. Without checking the real model names of the disks (as described above), one cannot know.
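
To make the who-is-who check less error-prone after a renumbering, the devlist output can be reduced to one "device, model" pair per line. This is only a sketch: the function name is made up, and it assumes the line format shown in the camcontrol devlist listings in this thread.

```shell
# Read "camcontrol devlist" lines on stdin and print "<device><TAB><model>",
# skipping the passN names. Assumes lines look like:
#   <MODEL REV>   at scbusN target 0 lun 0 (pass0,ada0)
devlist_map() {
    sed -n 's/^<\(.*\)> *at scbus.*(\(.*\))[[:space:]]*$/\2 \1/p' |
    awk '{
        dev = ""
        n = split($1, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] !~ /^pass/) dev = names[i]
        $1 = ""              # drop the name list, keep the model string
        sub(/^ /, "")
        print dev "\t" $0
    }'
}

# Hypothetical use: camcontrol devlist | devlist_map
```

Saving this list once while all disks are healthy gives something to compare against when a disk later goes missing and the remaining ones shift down.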

Then I tried zdb -l /dev/ada0p3, as also suggested in that thread, and got a lot of information,

Thank You, this one looks good. Comparing the guid it seems we are actually reading the good disk:
Code:
    guid: 1457454098546718061
        children[0]:
            guid: 2120269406060973466
            path: '/dev/ada1p3'
            aux_state: 'err_exceeded'
        children[1]:
            guid: 1457454098546718061
            path: '/dev/ada0p3'
        children[2]:
            guid: 3177278176522888156
            path: '/dev/ada2p3'
            resilver_txg: 32486092

Now see what zpool import says about it. If it suggests importing, try it, and then the resilver should continue.
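
The guid comparison above can also be done mechanically. The following is a sketch, assuming the zdb -l layout shown in this thread: the first guid line before vdev_tree belongs to the disk being read, and each child lists its guid followed by its path. The function name is made up.

```shell
# Read "zdb -l" output on stdin and print the path that the label
# records for the disk it was read from (child guid == label guid).
own_vdev_path() {
    awk -v q="'" '
        $1 == "vdev_tree:"      { tree = 1 }
        $1 == "guid:" && !tree  { own = $2 }   # guid of the disk being read
        $1 == "guid:" && tree   { cur = $2 }   # guid of the current vdev
        $1 == "path:" && cur == own {
            gsub(q, "", $2)                    # strip the quotes around the path
            print $2
            exit                               # the first label is enough
        }
    '
}

# Hypothetical use: zdb -l /dev/ada0p3 | own_vdev_path
```

For the label quoted in this thread this would print /dev/ada0p3, the name the disk had when the label was written, which can differ from the name it has on the current boot.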

And, btw: read the manuals. ;) One can nowadays google everything, but 85% of that stuff is misinformation and piecemeal. There are fine manuals on the system, and I got all my knowledge from reading just these, I never bought a book.
 
Oh my gods. :)

Some helpful commands.
This shows what the system has detected (disks and partitions):
Code:
$ ls -l /dev/ada*
crw-r-----  1 root  operator   0x8b Jul  1 23:08 /dev/ada0
crw-r-----  1 root  operator   0x9c Jul  1 23:08 /dev/ada0p1
crw-r-----  1 root  operator   0x9e Jul  2 04:24 /dev/ada0p2
crw-r-----  1 root  operator   0xa0 Jul  2 04:24 /dev/ada0p3
[...]

Currently it's showing /dev/ada0 and friends, and /dev/ada1, but ada2 did not appear.

[...]
This shows who is who:
Code:
# camcontrol devlist
<WDC WD5000AAKS-00A7B2 01.03B01>   at scbus2 target 0 lun 0 (ada0,pass7)
<SanDisk SDSSDA120G Z22000RL>      at scbus3 target 0 lun 0 (pass8,ada1)
<ST3000DM008-2DM166 CC26>          at scbus4 target 0 lun 0 (pass9,ada2)
<HP SSD S700 250GB S0704A1>        at scbus5 target 0 lun 0 (pass10,ada3)
[...]

I got this. The TDKMedia is the USB stick:
Code:
<ST3000DM008-2DM166 CC26>          at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD30EZRZ-00Z5HB0 80.00A80>    at scbus2 target 0 lun 0 (pass1,ada1)
<HL-DT-ST DVD-ROM DH60N BF01>      at scbus3 target 0 lun 0 (pass2,cd0)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus4 target 0 lun 0 (pass3,ses0)
<TDKMedia Transit PMAP>            at scbus5 target 0 lun 0 (da0,pass4)
The ST is the working old disc from 2017, and the WDC is the new disc from 2019. I'm not sure what AHCI is, something or another.

On every boot the disks are given numbers anew, beginning with 0. So what actually showed up might be what formerly was ada1 and ada2, now as ada0 and ada1 - because the former ada0 wasn't detected. Without checking the real model names of the disks (as described above), one cannot know.

Oh, I see, so ada1 was ada2.

Thank You, this one looks good. Comparing the guid it seems we are actually reading the good disk:
...
Now see what zpool import says about it. If it suggests importing, try it, and then the resilver should continue.

It says this:

Code:
   pool: zroot
     id: 9802395473873768945
  state: DEGRADED
 status: One or more devices were being resilvered.
 action: The pool can be imported despite missing or damaged devices.  The
    fault tolerance of the pool may be compromised if imported.
 config:

    zroot                              DEGRADED
      mirror-0                         DEGRADED
        2120269406060973466            FAULTED  corrupted data
        ada0p3                         ONLINE
        diskid/DISK-WD-WCC4N7PFNX2Lp3  ONLINE

I tried zpool import zroot but I got an error message from that, so I've tried zpool import -f zroot as the error message suggested. Things are looking like this at the moment (output of zpool status):
Code:
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jul  1 05:46:44 2022
    588G scanned at 3.56G/s, 258G issued at 52.3M/s, 1.94T total
    189G resilvered, 13.00% done, 0 days 09:22:53 to go
config:

    NAME                     STATE     READ WRITE CKSUM
    zroot                    DEGRADED     0     0     0
      mirror-0               DEGRADED     0     0     0
        2120269406060973466  FAULTED      0     0     0  was /dev/ada1p3
        ada0p3               ONLINE       0     0    12
        ada1p3               ONLINE       0     0    10

errors: 796868 data errors, use '-v' for a list

I guess I just have to keep my fingers crossed and turn on the air conditioning in this room. I'm located in Japan and we're in the midst of a heat wave, which may explain the original failure.

And, btw: read the manuals. ;) One can nowadays google everything, but 85% of that stuff is misinformation and piecemeal. There are fine manuals on the system, and I got all my knowledge from reading just these, I never bought a book.
Believe it or not, George isn't at home, and I've spent most of the last three days reading FreeBSD manuals. The manual pages by themselves are often quite cryptic in terms of knowing what you should do in a certain situation. For example, I got confused about whether to use zpool add or zpool attach and so on, and I had to Google things to find out what to do.

Thank you for your assistance.
 
<AHCI SGPIO Enclosure 2.00 0001> at scbus4 target 0 lun 0 (pass3,ses0)
This is enclosure management. Sometimes found on a disk backplane.
It allows you to eject disks and blink disk lights. Some show cage temps too.
I'm not sure what AHCI is, something or another.
So your disk enclosure or backplane has an AHCI SGPIO connection to your motherboard.
It can be controlled with sesutil(8).
 
So your disk enclosure or backplane has an AHCI SGPIO connection to your motherboard.
It can be controlled with sesutil(8).
Ahh, so these are integrated in the usual Intel chipset AHCI controllers (and probably others).
Then this is another way to quickly get the required useful information in a properly ordered fashion:
Code:
# sesutil -u /dev/ses0 show
ses0: <AHCI SGPIO Enclosure 2.00>; ID: 3061686369656d30
Desc     Dev     Model                     Ident                Size/Status
Slot 00  ada0    WDC WD5000AAKS-00A7B2     WD-WCASY7821919      500G
Slot 01  ada1    SanDisk SDSSDA120G        162020405512         120G
Slot 02  ada2    ST3000DM008-2DM166        Z500NLSN             3T
Slot 03  ada3    HP SSD S700 250GB         HBSA20072400611      250G

Not so bad... obviously there are no blinkenlights or temperature display on this system, because there is nothing other than the SATA plugs present on the board. My issue was mainly how to get a dual board plus 15 disks plus proper cooling into a standard tower case, so a real enclosure was not an option.
 
The disc is doing the same thing as before: it starts resilvering, gets a large part of the way there, then errors start arising. Then I reboot and start again, but the resilvering done in the previous round has all been forgotten, so after reaching 600 GB before the errors start, it starts again at around 190 GB. This has happened four times now. The error produced is "cam status: ata status error".

I've also tried changing the SATA cables.

Code:
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jul  1 05:46:44 2022
    790G scanned at 126M/s, 639G issued at 90.6M/s, 1.94T total
    571G resilvered, 32.25% done, 0 days 04:12:44 to go
config:

    NAME                     STATE     READ WRITE CKSUM
    zroot                    DEGRADED     0     0     0
      mirror-0               DEGRADED     0     0     0
        2120269406060973466  FAULTED      0     0     0  was /dev/ada1p3
        ada0p3               ONLINE       0     0     0
        ada1p3               ONLINE       0     0     1

errors: 1303748 data errors, use '-v' for a list
 
It could be helpful to know if your disks are connected via ses0. Please give the output of sesutil map.
Could you also give the output of:
gpart show ada0
gpart show ada1
gpart show ada2
 
The disc is doing the same thing as before: it starts resilvering, gets a large part of the way there, then errors start arising. Then I reboot and start again, but the resilvering done in the previous round has all been forgotten, so after reaching 600 GB before the errors start, it starts again at around 190 GB.
This is normal. The progress is recorded only every few hours, so the last couple hours of work will be repeated.
(Configurable as vfs.zfs.scan_checkpoint_intval in 13.1, but I think not in 12.3. The default might be 7200 seconds.)
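
As a rough sanity check on that, assume the 7200-second default mentioned above really applies here, and take the issue rates from the two status outputs in this thread (52.3M/s and 90.6M/s, rounded down). The amount of work that can be lost between checkpoints is then simply interval times rate. The function name is made up for illustration.

```shell
# Upper bound on resilver work repeated after a crash, in GiB:
# checkpoint interval (seconds) * issue rate (MiB/s) / 1024.
redo_gib() {
    echo $(( $1 * $2 / 1024 ))
}

redo_gib 7200 52   # prints 365
redo_gib 7200 90   # prints 632
```

Under those assumptions, falling back from about 600 GB to about 190 GB is within the range one would expect.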

This has happened four times now. The error produced is "cam status: ata status error".
This is not normal. It has nothing to do with ZFS. It concerns basic reading/writing of data onto a disk, and usually points to hardware problems or configuration parameters.

In earlier times when we added a disk to a unix system, we did a "surface analysis" first. That basically means, write the entire disk with different patterns and read them back. Nobody has time for such nowadays. But when there are problems, there is probably not much other choice.

In this case, more information would also be useful. If You can get the port sysutils/smartmontools installed onto the stick, run smartctl -x /dev/adaX. This shows health data which a disk stores within its own controller.

In the initial FreeBSD boot menu (handbook chapter 2.4.3), behind "More Options" is a switch "Verbose". With that activated, the system will spit out a lot more of debug information.

And then there is the surface analysis:
  1. Read the entire disk with dd if=/dev/adaX of=/dev/null bs=64k. The problem is that this takes hours to complete. But it is important to see whether the dd ends without error after reading the full expected size of the disk, or whether it reports an I/O error somewhere along the way. In the latter case there are bad sectors.
  2. Write the entire disk with zeroes. This will destroy all current data and the partitioning on the disk, which has to be created anew afterwards. So be careful to pick the right disk! dd if=/dev/zero of=/dev/adaX bs=64k. After this completes cleanly, return to step 1.
  3. More elaborate tests could be crafted, e.g. write certain bit patterns to the disk, or use multiple dd commands at different seek offsets to force disk seek activity.
My old Western Digital Blue is always very delicate about bad sectors: whenever there is a power fluctuation, it may write a bad sector - which then is not really a bad sector, just an incomplete write operation. But the disk will treat that as a bad sector and report I/O errors - until I search them, overwrite them with zero, and let ZFS fix the content from the mirror.
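
Step 1 of the list above can be wrapped in a small function so that the outcome is stated explicitly at the end of a run that takes hours. This is a sketch with a made-up function name. dd's own summary goes to stderr and is left visible, so the byte count can be compared against the expected size of the disk.

```shell
# Read everything from the given device or file, discard the data,
# and report whether dd finished without an I/O error.
read_test() {
    if dd if="$1" of=/dev/null bs=64k; then
        echo "read completed: $1"
    else
        echo "read FAILED: $1"
        return 1
    fi
}

# Hypothetical use: read_test /dev/ada0
```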
 
I'll just post a quick status update here and I'll get back again later about the questions. My big priority with these failed discs was to recover some unduplicated log files, so before anything else I copied those with
Code:
zpool import -f zroot
zfs set mountpoint=/tmp zroot/usr/home
zfs mount zroot/usr/home
I used /tmp because nowhere else on the USB stick operating system seems to allow writing. I was then able to access the log files and copy them elsewhere.
 
It could be helpful to know if your disks are connected via ses0. Please give the output of sesutil map.

Code:
ses0:
    Enclosure Name: AHCI SGPIO Enclosure 2.00
    Enclosure ID: 3061686369656d30
    Element 0, Type: Array Device Slot
        Status: Unsupported (0x00 0x00 0x00 0x00)
        Description: Drive Slots
    Element 1, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot 00
    Element 2, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot 01
        Device Names: ada0,pass0
    Element 3, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot 02
        Device Names: ada1,pass1
    Element 4, Type: Array Device Slot
        Status: Unknown (0x06 0x00 0x00 0x00)
        Description: Slot 03
    Element 5, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot 04
        Device Names: cd0,pass2
    Element 6, Type: Array Device Slot
        Status: Unknown (0x06 0x00 0x00 0x00)
        Description: Slot 05

Could you also give the output of:
gpart show ada0
Code:
=>        40  5860533088  ada0  GPT  (2.7T)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  5856335872     3  freebsd-zfs  (2.7T)
  5860532224         904        - free -  (452K)
gpart show ada1
Code:
=>        40  5860533088  ada1  GPT  (2.7T)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  5856335872     3  freebsd-zfs  (2.7T)
  5860532224         904        - free -  (452K)
gpart show ada2
Code:
gpart: no such geom: ada2
 
These disks have now been replaced and I was able to get the small amount of unduplicated data off the disks using the above advice, so I am marking this as "Solved".

Thank you everyone.
 