[zfs] zpool status: repaired

Hi Forums!
I've got a problem with my ZFS partition.
I'm in the middle of updating my FreeBSD installation and migrating to a ZFS-only setup.
So I've created a zpool on a pretty fresh HDD, and the system is being compiled now.
But # zpool status gives me this output:
Code:
takino# zpool status amdz
  pool: amdz
 state: ONLINE
 scrub: scrub stopped after 0h0m with 11263671994483776 errors on Thu Jun 18 08:25:36     223139722
config:

        NAME        STATE     READ WRITE CKSUM
        amdz        ONLINE       0     0     0
          gpt/amdz  ONLINE       0     0     0  6.25P repaired

What does 6.25P repaired mean? And why was there a scrub on Thu Jun 18 when the zpool was created on Feb 15?

Some time ago (well... a long time ago, probably in 2010) I tested a ZFS-only setup on this HDD with a different zpool name. After that I ran # dd if=/dev/zero of=/dev/ad0, which as I understand should delete all info about the zpools on that hard drive. zpool.cache in /boot/zfs was deleted after that as well.
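(As far as I understand, ZFS keeps four copies of its vdev label - two in the first 512 KB of the device and two in the last 512 KB - so a full-disk dd should have destroyed them all. If I had only wanted to kill the labels, something like the sketch below would have been enough; device name assumed, arithmetic in sh syntax, and I believe newer ZFS versions also have a zpool labelclear command for this.)
Code:
# clear the front labels (first 512 KB)
dd if=/dev/zero of=/dev/ad0 bs=512k count=1
# clear the back labels (last 512 KB);
# diskinfo's third field is the media size in bytes (sh syntax below)
dd if=/dev/zero of=/dev/ad0 bs=512k count=1 \
    oseek=$(($(diskinfo ad0 | awk '{print $3}') / 524288 - 1))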
What the hell happened?
 
ZFS is awesome, because it checksums everything it writes, so if the HDD somehow returns faulty data, ZFS can detect it (and, given redundancy, repair it) on its own. I would say you are lucky that you caught this problem BEFORE you had your system up and running, rather than finding out later that your data was corrupted.
What does 6.25P repaired mean?
That's not your real problem; this is your real problem:
scrub stopped after 0h0m with 11263671994483776 errors
The scrub was not even able to complete, and it reported an absurd number of errors. Your HDD does not look very reliable from that message. I suggest you stress-test your system with Inquisitor, or at least re-check your HDD with smartmontools. Also, run MHDD on the disk.
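Once the disk checks out (or you've swapped it), you can re-run the verification yourself with something like this (assuming the pool is still named amdz):
Code:
# start a full verification pass over the pool
zpool scrub amdz
# watch its progress and see per-device error counts
zpool status -v amdz
# reset the error counters once you trust the disk again
zpool clear amdz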
 
Beeblebrox said:
I suggest you stress-test your system with Inquisitor, or at least re-check your HDD with smartmontools.

The date:
Thu Jun 18 08:25:36
looks strange to me.
I'm also installing smartmontools now to run the test.
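For reference, I'm going to run something like this (short test first, then dump everything):
Code:
# run a short offline self-test (takes a couple of minutes)
smartctl -t short /dev/ad0
# print health status, the attribute table and the self-test log
smartctl -a /dev/ad0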


EDIT: Could you give me a hand with analyzing this?
Code:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   056   051   006    Pre-fail  Always       -       152102233
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       170
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       92743741
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21322
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   096   096   020    Old_age   Always       -       4965
194 Temperature_Celsius     0x0022   042   057   000    Old_age   Always       -       42
195 Hardware_ECC_Recovered  0x001a   056   050   000    Old_age   Always       -       152102233
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   182   000    Old_age   Always       -       36
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
If I understand correctly, none of the values have crossed their failure thresholds, and the HDD looks alive.
 
The date: Thu Jun 18 08:25:36 looks strange to me.
Ah! Sorry my friend, I overlooked that.
Maybe destroy the old partition table and create a new one with gpart? Even if you are using the raw HDD (no table) under ZFS, creating a table might overwrite the old pool settings. There are some cases where GPT/MSDOS partition tables conflict with a ZFS partition, if ZFS was given the whole disk raw.
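Roughly like this (assuming the disk is ad0 - and of course this destroys everything on it):
Code:
# wipe the existing partition table, if any
gpart destroy -F ad0
# create a fresh GPT scheme
gpart create -s gpt ad0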
 
There was never a non-GPT partition table on it, and I recreated the partitions anyway just before # zpool create.

So that's not an error, and this is just a 'wtf' thread, as I understand it now. :)


Code:
takino# gpart show -l ad0
=>       34  117231341  ad0  GPT  (56G)
         34        256    1  newboot  (128K)
        290  117231085    2  amdz  (56G)
 
@nekoexmachina

If I understand correctly, none of the values have crossed their failure thresholds, and the HDD looks alive.
We obviously have different opinions about how bad "bad" is...

The drive is about 2.5 years old and has a gazillion read errors, and you think it's all hunky-dory?

Have you ever run:
# smartctl -t long /dev/ad0
?
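A long test takes a couple of hours on a drive that size; read the verdict afterwards with:
Code:
smartctl -l selftest /dev/ad0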

/Sebulon
 
Code:
  1 Raw_Read_Error_Rate     152102233
  7 Seek_Error_Rate          92743741
195 Hardware_ECC_Recovered  152102233
All I can say is "Oh Funk!". I thus stand by my statement: ZFS saved your ass.
You have to run MHDD or a similar low-level surface scan (HDD manufacturers also have their own tools for this). MHDD is not part of FreeBSD; it's DOS-based. Under certain circumstances the manufacturer's diagnostics do a better job than MHDD (e.g. Seagate's tool is good, while Samsung's is not, AFAIK).
 
Before you conclude that your HDD is crap, you should eliminate a few other possibilities:
1. Make absolutely sure the power supply cable to the HDD is tight. A loose power connector can make the HDD phase in and out like some drunk guy trying to stay awake.
2. Make sure the problem is not with your mobo - you could have a damaged controller. That is why you should also run Inquisitor or a similar total-system check. (Your non-zero UDMA_CRC_Error_Count would also fit a cable or controller problem rather than dying platters.)
3. You can also try attaching the HDD to another system and running MHDD there.
 