Solved ZFS cannot repair faulted USB drive setup in a mirror

Relictrix

Member

Reaction score: 1
Messages: 66

Hi all,

## What's my setup?
A laptop runs FreeBSD with ZFS in VirtualBox. I have a mirrored ZFS pool in which one drive is the laptop's internal drive and the other is an external USB drive that I connect regularly to sync. It's a backup in case something goes wrong. So far the mirrored USB drive has come in handy once, when my VM broke completely. When the VM breaks, it is mostly because of issues between VirtualBox and the host OS it runs on.

## What goes wrong now?
There were too many errors, so the disk could not be resilvered. I decided to remove the da0 device completely and recreated all the partitions, with the goal of replacing the faulted disk. However, this fails again.

Code:
  pool: zroot
 state: DEGRADED
  scan: resilvered 90.8G in 0 days 00:29:46 with 0 errors on Mon Feb 22 10:52:58 2021
config:

        NAME                       STATE     READ WRITE CKSUM
        zroot                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            ada0p3.eli             ONLINE       0     0     0
            replacing-1            OFFLINE      0     0     0
              5406864750556332175  OFFLINE      0     0     0  was /dev/da0p3.eli/old
              2171267551982000851  OFFLINE      0     0     0  was /dev/da0p3.eli

errors: No known data errors

Code:
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 1d cf 32 70 00 00 08 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 1d cf 32 70 00 00 08 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 1d cf 32 70 00 00 08 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 1d cf 32 70 00 00 08 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 0 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 1d cf 32 70 00 00 08 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Error 5, Retries exhausted
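
For reference, the CDB that repeats in the log above can be decoded by hand. In a READ(10) command, byte 0 is the opcode 0x28, bytes 2 to 5 hold the logical block address (big-endian) and bytes 7 to 8 hold the number of blocks to read. A minimal sketch in POSIX shell; the function name decode_read10 is made up for illustration and is not part of any tool:

```sh
#!/bin/sh
# Decode a SCSI READ(10) CDB passed as ten hex bytes.
# Bytes 2-5 (arguments 3-6) are the big-endian LBA,
# bytes 7-8 (arguments 8-9) the big-endian transfer length in blocks.
decode_read10() {
    [ "$#" -eq 10 ] || { echo "need 10 bytes" >&2; return 1; }
    [ "$1" = "28" ] || { echo "not a READ(10) CDB" >&2; return 1; }
    lba=$(( 0x$3$4$5$6 ))
    blocks=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$blocks"
}

# The CDB from the log: 28 00 1d cf 32 70 00 00 08 00
decode_read10 28 00 1d cf 32 70 00 00 08 00
# prints: lba=500118128 blocks=8
```

So every retry asks for the same 8 blocks starting at LBA 500118128, and every attempt fails.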

## What is the problem?
1. I am trying to find out what exactly is wrong.
2. I made some attempts to use smartctl, but it is not built for USB drives.
3. Any help is appreciated.

Thanks in advance,

Best Regards,

R
 

SirDice

Administrator
Staff member
Moderator

Reaction score: 10,185
Messages: 35,689

The disk itself looks like it's gone bad. No matter how many times you recreate partitions and try to attach it, it's not going to resolve the errors that are on the physical disk. Look with smartctl(8) at the SMART data.
 

im

Member

Reaction score: 18
Messages: 54

Assuming da0 is your USB flash drive, I think it is completely broken.
USB flash drives have no SMART, so it is impossible to use smartctl to evaluate their health.

You can do a simple check of your flash drive using dd(1).
Detach da0 from any storage pool,
then try overwriting the drive with zeros. Check for similar errors during the process.
dd if=/dev/zero of=/dev/flashdevicehere bs=131072
Be careful with dd's of= parameter. It will overwrite whatever target device you specify.
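
Because a mistyped of= is destructive, one option is to wrap the command in a small guard. The sketch below is only an illustration: the function names are hypothetical, and it assumes the FreeBSD naming seen in this thread, where the USB disk is da0 and the laptop's own disk is ada0. It prints the dd command instead of running it.

```sh
#!/bin/sh
# Hypothetical guard: accept only a whole da(4) disk such as /dev/da0,
# never an ada(4) disk (/dev/ada0) or a partition (/dev/da0p3).
is_usb_whole_disk() {
    case "$1" in
        /dev/da[0-9]|/dev/da[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the dd command for an accepted target; refuse anything else.
# Remove the "echo" in front of dd only when you are sure of the target.
zero_fill() {
    if is_usb_whole_disk "$1"; then
        echo dd if=/dev/zero of="$1" bs=131072
    else
        echo "refusing to write to $1" >&2
        return 1
    fi
}

zero_fill /dev/da0
# prints: dd if=/dev/zero of=/dev/da0 bs=131072
```

Calling zero_fill /dev/ada0 or zero_fill /dev/da0p3 prints a refusal and returns a non-zero status.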
 

ralphbsz

Son of Beastie

Reaction score: 1,886
Messages: 2,874

Yes, da0 has failed. You can try overwriting it with zeroes. If the device is a spinning disk (behind a USB-to-SATA converter or something similar), that might help, or it might hurt even more (if the disk has failing sectors). If the disk is a USB stick, it is probably done for.
 
OP

Relictrix

Member

Reaction score: 1
Messages: 66

Hi,

Thanks for the replies! Since I can already start the dd that im suggested, it's running now. It's a USB SSD from FREECOM, 256 GB, of which only about 150 GB should be in use. I wonder whether it really is the drive, so I am now using another USB port. It does not seem completely broken, because it starts resilvering, but at a certain moment, bang, it stops. Writing this, I thought: was ZFS not marking these sectors as bad? Perhaps just speculation; I guess I have to find out exactly what the error means.

While this runs I will check smartctl further ...

Best Regards,

R
 
OP

Relictrix

Member

Reaction score: 1
Messages: 66

Hi,

And I am learning here; things go faster than I thought. The report is attached. It seems to be a Toshiba then, huh. I was not even aware.

I stopped the dd for now.

I did not mention this at first: I can still create a gpart partitioning on it, I can still encrypt it with geli, and I can attach it. But then the replace fails. Perhaps that does not matter, and some areas of the SSD are simply still broken?

I am doing a replace now on another USB port, and meanwhile reading the smartctl report.

Best Regards,

R
 

Attachments

  • smartctl_report.txt
    9.7 KB · Views: 10

SirDice

Administrator
Staff member
Moderator

Reaction score: 10,185
Messages: 35,689

was ZFS not marking these sectors as bad, perhaps just speculation
No filesystem does this. Back in the olden days (with old MFM and early SCSI disks) you had to keep track of bad sectors and mark them. Modern (IDE/SATA/SAS/SCSI) harddisks do this automatically in the firmware and remap bad sectors to a spare bit of space. When this spare bit of space is full it's time to replace the disk.

This all goes out the window with SSDs of course. There's no such thing as a sector or a track any more since it's all stored in flash memory chips. The firmware just makes it look like there are tracks and sectors. SSDs do something called wear leveling to ensure each chip has about the same number of write cycles, this is to ensure better longevity. But at some point in time an SSD is going to break, and it usually does this catastrophically.

Looking at the SMART report, it does report a bunch of errors. Its lifetime doesn't seem very long, only 982 hours (around 41 days). Its erase_count looks absurdly high though. Compare this with an SSD I have:
Code:
  9 Power_On_Hours_and_Msec 0x0032   028   028   000    Old_age   Always       -       63510h+45m+50.890s
That's well over 7 years. That's almost double its expected lifetime and I still have zero issues with this drive.
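
The conversions are easy to check. A small sketch in POSIX shell, assuming 365-day years (8760 hours); hours_to_age is a made-up helper name, not a smartctl feature:

```sh
#!/bin/sh
# Convert a power-on hour count to years, days and hours.
# Integer arithmetic only, with a year taken as 8760 hours.
hours_to_age() {
    h=$1
    years=$(( h / 8760 ))
    days=$(( (h % 8760) / 24 ))
    hours=$(( h % 24 ))
    echo "${years}y ${days}d ${hours}h"
}

hours_to_age 982      # the FREECOM drive: 0y 40d 22h
hours_to_age 63510    # the comparison SSD: 7y 91d 6h
```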
 

im

Member

Reaction score: 18
Messages: 54

About your smartctl report:
The drive definitely has errors.
Look at these lines of the report:
Code:
Error 69 occurred at disk power-on lifetime: 978 hours (40 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.
  04 51 00 08 74 ad 40  Error: ABRT at LBA = 0x00ad7408 = 11367432

SMART has at least one bad value:
Code:
169 Bad_Block_Count         0x0013   100   100   010    Pre-fail  Always       -       100
Some values seem too low:
Code:
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       15463
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       17262
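
When reading these lines, it helps to keep the smartctl column order in mind: ID, name, flag, normalized VALUE, WORST, THRESH, type, updated, when_failed, and the RAW value last. The sketch below (POSIX shell with awk; smart_check is a hypothetical helper) prints the normalized value, threshold and raw value per attribute, and flags an attribute only when the normalized value has dropped to a non-zero threshold.

```sh
#!/bin/sh
# Read smartctl attribute lines on stdin.
# Fields: $1 ID, $2 name, $4 VALUE, $6 THRESH, $10 RAW_VALUE.
smart_check() {
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ {
        status = ($6 + 0 > 0 && $4 + 0 <= $6 + 0) ? "FAILING" : "ok"
        printf "%s %s value=%d thresh=%d raw=%s %s\n", $1, $2, $4, $6, $10, status
    }'
}

smart_check <<'EOF'
169 Bad_Block_Count         0x0013   100   100   010    Pre-fail  Always       -       100
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       15463
EOF
```

By that criterion Bad_Block_Count has not tripped its threshold (100 against 10); it is the raw count of 100 that stands out.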
 

PMc

Daemon

Reaction score: 502
Messages: 1,062

Overheated?

194 Temperature_Celsius 0x0022 048 010 000 Old_age Always - 52 (Min/Max 16/90)
 
OP

Relictrix

Member

Reaction score: 1
Messages: 66

Hi all,
I double-checked whether TRIM is enabled, and it is. I checked the same report for another FREECOM SSD I have, a 128 GB one. It has low values throughout. The temperature could be a factor, because the drive heats up a lot. It also heats up more intensely when I do, for example, a zpool replace.

Tomorrow I will check whether I can somehow reduce the speed of the zpool replace. I could also enable only USB 2, so the speed goes down automatically, but I have no idea whether that helps or whether it would also reduce the temperature.
Best Regards,
R
 

SirDice

Administrator
Staff member
Moderator

Reaction score: 10,185
Messages: 35,689

Bad chips tend to heat up more too. It's an easy way of checking a faulty circuit if you don't have schematics. Power it up and physically feel each chip, the chip that's hot after a few seconds of being turned on is usually faulty. I've used this back in the day when I was working at a company that made modems. Still have "fond" memories of having an imprint of a DIL 14 chip burned onto my fingertip for a few weeks because one got SUPER hot.
 
OP

Relictrix

Member

Reaction score: 1
Messages: 66

Yes, I think it's clear that it's not a good chip. I am trying again now that it's cold. What I do is use CTRL-P, which pauses the VM, or I touch the drive regularly, because that transfers the heat to my hand or to the cold desk it's lying on. :):)
 
OP

Relictrix

Member

Reaction score: 1
Messages: 66

Ok now it succeeded:

Code:
root@vm1:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 98.6G in 213503982334601 days 06:29:13 with 0 errors on Tue Feb 23 07:45:45 2021
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada0p3.eli  ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0

errors: No known data errors
root@vm1:~ #

Damn, 213503982334601 days, and I am even older than my SSD :) And touching your SSD helps too, give it a hug ;)
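
The odd duration has a tidy arithmetic explanation, offered here as a guess rather than anything confirmed in this thread: 2^64 seconds is 213503982334601 days plus 25216 seconds, so the printed value sits just below 2^64 and looks like a small negative elapsed time shown as an unsigned 64-bit number. A sketch in POSIX shell that undoes the wrap (wrapped_seconds is a made-up name):

```sh
#!/bin/sh
# 2^64 seconds = Q days + R seconds, with 86400 seconds per day.
Q=213503982334601
R=25216

# Reinterpret "<days> days H:M:S" as a signed count of seconds, assuming
# the printed duration is an unsigned 64-bit value that wrapped around.
# Subtracting Q from the day count first keeps the shell's own signed
# arithmetic from overflowing. Pass H, M and S without leading zeros,
# since the shell would read a leading zero as octal.
wrapped_seconds() {
    days=$1 h=$2 m=$3 s=$4
    echo $(( (days - Q) * 86400 + h * 3600 + m * 60 + s - R ))
}

# "213503982334601 days 06:29:13" from the zpool status output
wrapped_seconds 213503982334601 6 29 13
# prints: -1863
```

Read that way, the resilver appears to have finished about 31 minutes before it started, which would fit a clock that stepped backwards while it was running.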

Once again thanks for your help, remarks and comments!
 