ZFS Issues - Permanent data errors

About six weeks ago, I purchased and built a moderate-sized fileserver with 8x1.5TB drives (plus an 80GB IDE drive for the OS). I installed FreeBSD on it and put the eight drives into a single zpool, split into two raidz1 vdevs. The create command looked like:

[CMD=zpool] create bulk raidz1 /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad16 raidz1 /dev/ad12 /dev/ad14 /dev/ad18 /dev/ad20[/CMD]

Now I've added quite a few ZFS filesystems to it and have used it to back up pretty much everything. I've made it accessible to friends, so it gets almost constant access.

A few days ago, the place I'm living in had a power outage. The server is connected through a cheap-ish UPS, which could only keep it running for the first 15 minutes of the 2-hour outage. There doesn't seem to be any hardware damage, but a couple of days later, I noticed 7 permanent file errors when I ran zpool status.

I typed in zpool status -v to see the files and noticed that none of them were particularly important, so I deleted them to prevent others from trying to access them and failing. However, I couldn't get ZFS to remove the errors - it still said that there were 7 file errors and listed things like "bulk/Landing:<0x86db>" when I tried to see what the files were.

I eventually found the command to initiate a scrub of the pool and ran one. It found 132 more errors, again in relatively unimportant files. I assumed it was due to the power outage and thus a one-time thing, so I just deleted those files as well. However, it still said there were 139 errors and listed 139 entries like "bulk/Landing:<0x86db>" (with different numbers, of course). I still haven't found a way to clear them.

Now, I typed in zpool status again this morning just to check on it, and it found another error in a file that was written during a daily backup early this morning. This led me to believe it was a hardware issue, because there has been no other power outage here. The server has been on and running fine the whole time.

The 8 data hard drives, 1.5 TB each, are all SATA with SMART enabled. I ran smartctl -H /dev/ad4 (and likewise for each of the other data drives), and each of them reported "PASSED". I assume there's more I can do with SMART, but I haven't found anything yet.

So, basically, does this look familiar to anyone? Does anyone know of some more diagnostics I can run to help pin down the problem, or know of any solutions? Is there any way to clear the errors in deleted files from zpool status? (I tried zpool clear, but that didn't do it).

The exact zpool status error message reads:

One or more devices has experienced an error resulting in data corruption. Applications may be affected. Restore the file in question if possible. Otherwise, restore the entire pool from backup.

Thanks for your help in advance,

-- Ethan
 
Whoops, forgot to mention - ever since the data errors came up, the server as a whole has been sluggish. I can only get about 15-25 MB/s out of it, whereas before, it could easily max out its gigabit uplink.
 
Searing said:
However, it still said it had 139 errors and listed 139 entries of "bulk/Landing:<0x86db>" (different numbers, of course). I still haven't found a way to clear them.

On the systems where I run ZFS, I have seen a message like this only once. One file was reported corrupt after a scrub and I removed it. I scrubbed again and a message like yours was shown. (Edit) I figured out that the problem persisted because I had some snapshots containing the corrupted file. After removing the snapshots that held this specific file, everything was OK.
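
If you are in the same boat, finding and destroying the snapshots that still reference a corrupted file is only a couple of commands; something like this (the snapshot name here is just an example):

Code:
# zfs list -t snapshot -r bulk/Landing
# zfs destroy bulk/Landing@2010-08-01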

Please post your complete
Code:
zpool status
output.
Code:
zpool iostat -v
could be interesting too.

Searing said:
ever since the data errors have come up, the server as a whole has been running sluggish. I can only get about 15-25 MB/s out of it

This sounds like one of your drives has failed - or at least ZFS thinks it has.
 
Code:
[root@aluminum /]# zpool status
  pool: bulk
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 11h18m with 133 errors on Thu Sep  9 20:10:39 2010
config:

        NAME        STATE     READ WRITE CKSUM
        bulk        ONLINE       0     0     2
          raidz1    ONLINE       0     0     4
            ad4     ONLINE       0     0     0  23.6M repaired
            ad6     ONLINE       0     0     0  33.5M repaired
            ad10    ONLINE       0     0     0  19.4M repaired
            ad16    ONLINE       0     0     0  33.5M repaired
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0  22.4M repaired
            ad14    ONLINE       0     0     0  34.6M repaired
            ad18    ONLINE       0     0     0  24.3M repaired
            ad20    ONLINE       0     0     0  36.4M repaired

errors: 140 data errors, use '-v' for a list

Those CKSUM errors came up around the same time that I found the new corrupted file (this morning). As you can see, ZFS reports all of the devices as online. In addition, they all passed the SMART test. Of course, there could still be something wrong with one or more of them, even if they haven't completely failed.

I should note that all eight drives are not from the same place, or even the same age. Two of them are about a year old. The rest are new (less than two months old).

Code:
[root@aluminum /]# zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
bulk        7.57T  3.31T    432     45  52.2M  4.07M
  raidz1    3.99T  1.45T    213     21  25.8M  1.95M
    ad4         -      -    122     10  7.10M   668K
    ad6         -      -    123     10  7.10M   668K
    ad10        -      -    128     10  7.15M   668K
    ad16        -      -    127     11  7.14M   666K
  raidz1    3.58T  1.86T    218     23  26.4M  2.13M
    ad12        -      -    132     11  7.17M   728K
    ad14        -      -    132     11  7.17M   728K
    ad18        -      -    133     11  7.23M   728K
    ad20        -      -    131     12  7.20M   726K
----------  -----  -----  -----  -----  -----  -----

As you can see, the two vdevs were actually created at different times. The reason is that two of the drives (ad18 and ad20) were being used in my desktop and had data on them that I wanted to put on the server. I set up the first four drives in a raidz1, copied the data over, then put the other four drives in. Most of the data has since been recopied to other ZFS filesystems, but about 400-500GB remains on the original. I could recopy it as well, for good measure, but that shouldn't cause these issues.
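
(If anyone is curious, expanding the pool that way is a single command once the new drives are attached; it would have been something along the lines of:)

Code:
# zpool add bulk raidz1 /dev/ad12 /dev/ad14 /dev/ad18 /dev/ad20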

Oh, and about the snapshots - it is true that some of the files that were corrupted are also stored away in snapshots, and I have yet to track those snapshots down and delete them. However, most of the files that I deleted had no snapshot. It is possible, I suppose, that ZFS won't clear any errors until it can clear them all, but that doesn't seem terribly likely.
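
(In case it helps anyone later: I believe the way to track them down is through each filesystem's hidden .zfs/snapshot directory. The paths below are made up and assume the default mountpoints under /bulk:)

Code:
# zfs set snapdir=visible bulk/Landing
# ls /bulk/Landing/.zfs/snapshot/
# ls /bulk/Landing/.zfs/snapshot/*/path/to/suspect/file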
 
Just asking, but... have you done an extensive memory test?

MemTest86 for 24 hours should tell you whether your system is stable. The amount of corruption seems too high to be accounted for by the HDDs' BER (bit error rate), and I suspect your system may not be stable. I also see the checksum errors are counted against your raidz vdev rather than against individual disks, which I find highly suspicious.

So I would advise running Memtest86+ for 24 hours to see whether your system really is stable.
 
No, I haven't done an extensive memory test yet. I have a Hiren's BootCD lying around, so I could do that (I could also test the drives individually). However, my fileserver doesn't have any sort of graphics output, so I'd have to swap in the graphics card from my desktop, which is a real pain in the ass and renders my desktop unusable for the duration of maintenance, which could take over 24 hours. Is there anything I can do under FreeBSD to test the memory or hard drives?

Oh, and none of the parts of the server are overclocked - they're all running stock speeds.
 
Searing said:
Is there anything I can do under FreeBSD to test the memory or hard drives?

Replace the RAM with known-good modules, or take some modules out and see if it still happens. That's far more reliable than a software-based memory tester such as memtest86, which, even after all this time, is not a conclusive utility for verifying that a DIMM is good. If it says a module is bad, though, you can trust it.
 
Sounds like you have either a failing drive or possibly bad RAM. You should not be repairing that much data in a scrub, nor should you be getting that many damaged files (meaning the file and the parity were corrupt).

Don't just run the SMART health check. Look at the SMART attribute output to see whether you have lots of reallocated sectors on one (or more) of the drives. The slowdown could be from a drive re-mapping bad sectors, causing ZFS to re-send read/write requests, slowing the committing of transaction groups, and so on. Consumer drives tend to spend forever retrying a bad sector, causing ZFS to time out.
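
Something along these lines will pull the interesting counters out of a drive (repeat for ad6 through ad20), and you can also kick off the drive's own long self-test and read the result back once it finishes:

Code:
# smartctl -A /dev/ad4 | egrep 'Realloc|Pending|Uncorrect|CRC'
# smartctl -t long /dev/ad4
# smartctl -l selftest /dev/ad4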
 
Here's the full SMART readout for one of the drives. They're all pretty much the same.

Code:
[root@aluminum /]# smartctl -a /dev/ad4
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.0-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD154UI
Serial Number:    S1XWJDWZ208089
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sat Sep 11 04:26:26 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (19424) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  34) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       8090
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   071   000    Old_age   Always       -       23 (Lifetime Min/Max 21/29)
194 Temperature_Celsius     0x0022   077   070   000    Old_age   Always       -       23 (Lifetime Min/Max 20/30)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1351503532
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The Hardware_ECC_Recovered field looks scary. It's about 1.3 billion. The rest of the drives have similar levels.

Another interesting thing is with /dev/ad18:

Code:
[root@aluminum /]# smartctl -A /dev/ad18
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.0-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   075   075   011    Pre-fail  Always       -       8250
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       293
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       7089
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       127
 13 Read_Soft_Error_Rate    0x000e   100   099   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       20
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   064   000    Old_age   Always       -       24 (Lifetime Min/Max 22/27)
194 Temperature_Celsius     0x0022   076   062   000    Old_age   Always       -       24 (Lifetime Min/Max 21/30)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1248794960
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

I'm not sure what Reported_Uncorrect means, but it doesn't sound good, and the other drives all have 0's in that field.
 
It is normal for Hardware_ECC_Recovered to be very high; HDDs rely on ECC, or else they wouldn't be able to read many sectors without corruption. This is nothing to worry about.

I can see nothing wrong with the SMART output; in particular, UDMA_CRC_Error_Count would indicate cabling errors, but you don't have any. I think you should focus on your memory instead.
 
Searing said:
Here's the full SMART readout for one of the drives. They're all pretty much the same.

Code:
[root@aluminum /]# smartctl -a /dev/ad4

190 Airflow_Temperature_Cel 0x0022   077   071   000    Old_age   Always       -       23 (Lifetime Min/Max 21/29)
194 Temperature_Celsius     0x0022   077   070   000    Old_age   Always       -       23 (Lifetime Min/Max 20/30)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1351503532

Code:
[root@aluminum /]# smartctl -A /dev/ad18

190 Airflow_Temperature_Cel 0x0022   076   064   000    Old_age   Always       -       24 (Lifetime Min/Max 22/27)
194 Temperature_Celsius     0x0022   076   062   000    Old_age   Always       -       24 (Lifetime Min/Max 21/30)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1248794960

I'm not sure what Reported_Uncorrect means, but it doesn't sound good, and the other drives all have 0's in that field.

Hi Searing

I've only had a skim of the thread - it dropped me to the end of the post, I noticed what I noticed, quickly skimmed the other posts, and I'm writing this as I'm about to head out for the day - so please forgive me if I've missed something above.

I think the TEMPERATURE of the HDDs could be the problem. I couldn't find the specs for your particular model on Samsung's (useless) site, but I did find a few for some other Samsung drives, and they listed 60°C as the MAX OPERATING TEMPERATURE.

That's the same temperature spec as my old 7200.7 40GB & 120GB Seagates, though I've reliably had them running at 62°C in some rather hot summer months without issue for around the last 8 years. Usually they're sitting (the main drive, at least) at 50-56°C (they were in a Vortex fan cradle, but the fan's been dead for years, so the case doesn't even absorb heat away, since they're mounted on 4 little bits of rubber - oh well).

Anyhow, I'm only 2°C over spec, which should still be within the design margin - BUT YOU'RE LIKE 16-17°C OVER SPEC at times with your gear.

You need some drive cooling, methinks.

Hope this helps

Later, E_v-O
 
I have fans blowing directly on all of the drives, in addition to just having a lot of fans in the system. Either way, I'm pretty sure that the raw_value is the temperature of the drives in Celsius - in this case, just slightly above room temperature, which makes sense. After all, I can't imagine so few drives (only nine in the system total, spread over the entire front of a mid-size tower) generating anywhere near enough heat to reach 75 C with even mediocre cooling.
 
Yes, a maximum of 30 degrees, so nothing to worry about. ;-)

Have you done a memory test yet? You can find one on an Ubuntu Linux live CD or the Ultimate Boot CD, or download the .iso separately.
 
I'll run them tomorrow evening. As I said, it involves removing the graphics card from my desktop, so I can't do my homework while it runs. Unfortunately, I have quite a bit of it due next week, so until I finish the homework, I'm going to simply not use my server. On the plus side, there have been no more errors since I restarted my server - I still don't want to leave the problem alone, though, because it's bound to come back.

On a side note, should I be concerned about the Reported_Uncorrect field on one of the drives not being 0?
 
Don't rely on SMART values too much. Several disk vendors use the counters in non-standard, non-documented ways and "standard" SMART tools may give you the false impression that your disks are failing, when they're actually fine.
 
# zpool clear bulk

should clear your error status message and error counters. Then, do

# zpool scrub bulk

to check if you still have data errors.
 
Hey, I hate to bring a dead thread back to life, but I'm having the exact same problem as before.

I eventually ran that memory test - all my memory is fine. I also ran HDD Regenerator on all 9 of my hard drives (including my OS drive, which is not part of ZFS) and there were no bad sectors. As far as I can tell, all of my hardware is fine.

Anyways, it stayed at 139 data errors for a while - until another power outage occurred and my server abruptly lost power. Now I have the same issue as before, but the number of data errors has escalated to 614! Just as before, when I remove a corrupted file, it simply replaces the file path with the dataset name and a hex object ID, like this:

bulk/Landing:<0xdcb3>

Note that this is happening to all of my filesystems, not just the bulk/Landing one.

Also, I had some errors in backup snapshots. The zpool status -v output line would look like:

bulk/Backups/Hydrogen@2010:08:18:16:57:05:/Users/Ethan/AppData/blahblahfile.ext

They were old snapshots, so I decided to just remove the entire snapshot when it happened. Now the entries look like this:

<0x157>:<0x6b1df>

Another note: I'm not entirely sure that all of these files with permanent errors actually have errors. Many of the video files, for instance, play back just fine. It may be the case that there is some error resilience built into the file format so I don't notice it, but unfortunately there were no trivially corrupt files (like text files) that I could check definitively. I suppose I could download a program to scan the video files for hidden errors if need be.
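
(If I get around to it, I'd probably just grab ffmpeg from ports and have it decode the suspect files to nowhere - decode errors end up on stderr. The filename below is made up:)

Code:
# ffmpeg -i suspect_video.avi -f null /dev/null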

Anyways, like before, zpool clear has no effect at all. Also, ALL of the errors were CKSUM errors - there were no READ or WRITE errors. Running additional scrubs results in it reporting that it has repaired a significant amount of data (as much as 100 MB), but no additional "permanent errors" are found.

Does anyone have any insight into what could be going wrong? I can't trust my server to reliably store data, which is a big problem for me. I'm planning on switching operating systems soon (to Ubuntu Server with ZFS-Fuse), but I don't want to try exporting/importing my data (or upgrading the ZFS version) if it's so riddled with errors. That being said, this may be caused by a bug in FreeBSD's ZFS implementation, at least in the version I have (FreeBSD 8.0).
 
I have had one such situation. The drives had previously been used in a 3ware RAID without any worries, never experienced errors, and S.M.A.R.T. did not indicate anything unusual either. But connecting the drives to the motherboard ports with ZFS on top of them resulted in what you describe.

Replacing the drives one by one (starting with those that had the most checksum errors) solved it for me. The drives do not indicate problems (I haven't had time to test them thoroughly yet), but I am wary of ever using them again. Apparently they do not read back what was written...
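
For reference, replacing a disk in a raidz1 does not require rebuilding the pool; with the new disk attached it is a single command, and the vdev resilvers itself (device names here are just examples):

Code:
# zpool replace bulk ad4 ad22
# zpool status bulk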

Have you tried a more recent version of FreeBSD, such as 8-stable? You do not need to upgrade the zpool or zfs versions. 8-stable also has much improved ZFS performance.
 
I believe right now 8-stable is pretty... "unstable", as it's now 8.2-prerelease. But as long as you use this only as a file server, it should not be a problem. Not that it is broken; there are just very frequent changes.

You may wish to first try a snapshot, e.g. downloading the latest 8-stable memstick image from ftp://ftp.freebsd.org/pub/FreeBSD/snapshots, writing it to a USB stick, and booting from there. You should be able to import the ZFS pool (do not upgrade it!), run a scrub, remove the bad files, etc.
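
From the memstick environment it would look roughly like this; a bare zpool import lists the pools it can see, and -f is only needed if the pool was not cleanly exported:

Code:
# zpool import
# zpool import -f bulk
# zpool scrub bulk
# zpool status -v bulk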

I would just rebuild the OS, following this procedure:

Edit /etc/make.conf to uncomment the CVSUP entries. Only the following is relevant:

Code:
SUP_UPDATE=
#
SUP=            /usr/bin/csup
SUPFLAGS=       -g -L 2
SUPHOST=        cvsup.uk.FreeBSD.org
SUPFILE=        /usr/share/examples/cvsup/standard-supfile

You need to have the source loaded before doing the following:

Code:
# cd /usr/src
# make update
# make cleanworld
# make buildworld
# make buildkernel
# make installkernel

Boot single-user, to be safe

Code:
# mergemaster -p
# make installworld
# mergemaster -Fi

More details are in /usr/src/UPDATING. Keeping FreeBSD up to date is very easy. Just make sure you have good backups. ZFS snapshots might help a great deal here, but are of course not a replacement for backups. :)
The ease of managing and upgrading the OS is what keeps me firmly with FreeBSD, whatever features the alternatives have.
 