God I love ZFS!

During my weekly scrub I got this e-mail from my server.

Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 1h55m with 0 errors on Fri Dec  3 03:55:15 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada1    ONLINE       0     0     1  32K repaired
            ada3    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: No known data errors
Good to know that my data is safe and has been healed :)
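If the error doesn't reappear on the next scrub I'll probably just clear it, as the status message suggests, and keep an eye on the drive. Something along these lines (ada1 and tank are taken from the output above; smartctl assumes the sysutils/smartmontools port is installed):

Code:
# clear the logged checksum error now that the scrub has repaired the data
zpool clear tank ada1

# check the drive's own SMART error counters (needs smartmontools)
smartctl -a /dev/ada1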
 
It's not always so great. I recently had an issue where the disks in a pool lost power while the server was running. This resulted in corruption in two files; unfortunately they were metadata files, which cannot be deleted or restored from backup, so I had to destroy the pool and recreate it :(. Pretty crappy!
 
hopla said:
Interesting, what is the importance of that file? How would it have helped him in his situation?

I don't think it would have helped at all, is the answer. It seems to be a cache of info about the pool, such as disk device names etc. The issue I had was corruption of metadata on the disks, i.e. metadata of the actual data, not metadata of the disk pool (if that makes sense ;) ).
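If you're curious what's actually in it, and assuming the file being discussed is FreeBSD's /boot/zfs/zpool.cache (my assumption, not confirmed in this thread), zdb should be able to print the cached configuration. Just a quick sketch, run on the machine that owns the pool:

Code:
# dump the cached pool configuration (device names, vdev layout, etc.)
zdb -C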

Andy.
 
carlton_draught said:
Did you actually lose any data? Or was it just something that righted itself when you destroyed and recreated the pool?

Well, isn't destroying the pool losing all my data?? If I hadn't had a backup copy I would have lost data, and it would have been very hard to identify exactly which data was corrupt, as insufficient info was provided by the ZFS error.

FYI, the error was ZFS-8000-8A (corrupted data); the description of this is:

Damaged files may or may not be able to be removed depending on the type of corruption. If the corruption is within the plain data, the file should be removable. If the corruption is in the file metadata, then the file cannot be removed, though it can be moved to an alternate location. In either case, the data should be restored from a backup source. It is also possible for the corruption to be within pool-wide metadata, resulting in entire datasets being unavailable. If this is the case, the only option is to destroy the pool and re-create the datasets from backup.
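For anyone who hits the same error: [CMD=""]zpool status -v[/CMD] is meant to list the objects with permanent errors. The output below is only a hypothetical sketch, not what I actually saw; when the damage is in metadata the entries tend to be cryptic object references rather than file names, which would explain why it was so hard to tell exactly what was affected:

Code:
# zpool status -v tank
...
errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        /tank/some/ordinary/file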
 
AndyUKG said:
Well, isn't destroying the pool losing all my data??
An unreadable pool would be considered loss of data. Sometimes ZFS corruption simply results in a pool being unable to write new data. Ultimately this type of situation requires a new pool, but wouldn't be considered loss of data for obvious reasons. Since you gave very few details as to the nature of your reported corruption, the question you replied to seems appropriate and your supercilious response does not.
 
If the terms "corruption" (which I used in my first post) and "destroying" don't imply loss of data, I don't know what does...
 
Galactic_Dominator said:
An unreadable pool would be considered loss of data. Sometimes ZFS corruption simply results in a pool being unable to write new data. Ultimately this type of situation requires a new pool, but wouldn't be considered loss of data for obvious reasons. Since you gave very few details as to the nature of your reported corruption, the question you replied to seems appropriate and your supercilious response does not.

The reality is that, as I had a backup copy of the data, I didn't do any detailed analysis of what might have been affected data-wise; it would have been an interesting exercise but impractical, as the pool contained millions of files. If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced, fault-tolerant file system.

Is it really necessary to come onto a thread and label people with insulting names when they are trying to share experiences and knowledge??
 
AndyUKG said:
If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced, fault-tolerant file system.

Busy building an 8 TB NAS at the moment, and this is quite a big worry for me. It's not all that easy to back up 8 TB of data either, so... *cringe*
 
aragon said:
Busy building an 8 TB NAS at the moment, and this is quite a big worry for me. It's not all that easy to back up 8 TB of data either, so... *cringe*

Well, I've had UFS(2) fail well enough to hose data three or four times; FAT[12|16|32] has failed more times than I can count; NTFS is as fault-tolerant as the 880 (warning! California joke, sorry); and a wayward sand particle made a backup on CD rather . . . unreadable.

I'm going to patent/copyright/trademark a ZenFS, for when your data are unimportant as individual bits. The fact that you are tied to the idea of discrete information retention is why you fail at enlightenment, my child. When all of your partitions are copies of /dev/urandom you will know true freedom.
 
AndyUKG said:
The reality is that, as I had a backup copy of the data, I didn't do any detailed analysis of what might have been affected data-wise; it would have been an interesting exercise but impractical, as the pool contained millions of files. If I hadn't had a backup copy to restore from, or even to compare the data against, I would have felt extremely nervous about the integrity of the data (i.e. if I'd had to copy the data off, destroy the pool and recreate it). This seems to me a pretty piss-poor result from a simple power outage on a supposedly advanced, fault-tolerant file system.

It's both well documented and common knowledge that certain types of hardware failures can cause corruption regardless of your filesystem. Specifically in ZFS's case, these failures tend to cause errors exactly as you have stated. See here for some information:

http://docs.sun.com/app/docs/doc/819-5461/gavwg?a=view

Essentially, these issues come down to several forms of hardware trouble:
  • Cable cross-talk
  • Faulty hardware, e.g. RAM
  • Sudden power failure

Since you report sudden power failure, I'll tell you exactly why it happened and why it's your fault and not ZFS's.

ZFS, like any FS, trusts a flush request. Because of ZFS's COW design and transaction grouping this is generally not a problem, but there is one (fortunately rare) situation that results in corruption even on ZFS. Hard drives have a feature called write caching, which greatly increases performance at the risk of possible corruption.

ZFS guarantees "good" data is not overwritten until the entire write is complete, but that guarantee comes with a caveat some do not realize. If the hard drive "lies" to ZFS that one portion of the write is complete, ZFS will go ahead with committing the transaction group and updating the uberblock. Say you lose power at this point, with the disk having issued the cached writes out of order and not yet finished them all. You're stuck with new COW data but the wrong uberblock, so the COW differentials are unable to track changes to specific files, and you end up with your exact scenario. This is simplified a bit, but you should get the idea. Remember that in ZFS, redundancy and consistency are different things, and one doesn't always guarantee the other. The consistency portion is what went wrong for you, so the redundancy doesn't help.
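As an aside, if you want to see which uberblock ZFS currently considers active, zdb can show it read-only. A rough sketch, with the pool and disk names made up:

Code:
# print the active uberblock (transaction group number, timestamp, guid sum)
zdb -u tank

# print the vdev label stored on one of the disks (the on-disk pool config)
zdb -l /dev/ada1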

You can find plenty more reports like this, but this link shows my explanation playing out in the real world:

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January/035740.html

Every reliability document worth its weight advises you to disable drive write caching. Here's just one example:

http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks

There are ZFS and hardware methods you can use to reduce the performance impact of disabling this, but for a lot of setups it's going to be more effective to simply keep full backups, as this is generally a rare issue.
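One ZFS-side example (a sketch only, the device names are hypothetical) is to give the pool a dedicated intent-log device on something with a battery- or capacitor-backed cache, so synchronous writes don't have to wait on slow, cache-disabled data disks:

Code:
# add a separate log (SLOG) device to the pool; ada5/ada6 are made up
zpool add tank log ada5

# or mirror it, if you don't want the log to be a single point of failure
zpool add tank log mirror ada5 ada6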

Because you didn't take adequate measures to ensure ZFS could operate reliably, you are at fault here. You compounded the issue by blaming ZFS and spreading FUD. I don't think it was intentional on your part, but it was nevertheless potentially harmful. Please don't take this as another personal assault, as I'm sure you're a fine person. You also seem like an intelligent person and a decent sysadmin with some room for improvement. I believe you're a decent sysadmin since you had backups.

This explanation was brought to you by your more detailed problem description. HIH.

AndyUKG said:
Is it really necessary to come onto a thread and label people with insulting names when they are trying to share experiences and knowledge??
Maybe you're confusing me with someone else, as I'm quite sure I never called anyone here an insulting name. When I do ask questions, I really dislike getting misleading responses, FUD, or answers that do nothing but serve to inflate the responder's post count. So what I was pointing out to you is that there are more details in ZFS than are dreamt of in your philosophy, and you need not get snippy when someone asks for a clarification.

I'm all for you sharing your experiences though, as we're all in this ZFS boat together, and hopefully reports like yours (the detailed version, not the original) can help both awareness and resolution for everyone.
 
Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html
The Solaris ZFS file system is safe with disk write-cache enabled because it issues its own disk cache flush commands.
I guess that's not quite true then?

Anyway, I tested with write cache disabled here. For sequential data transfer the speed on my ZFS pool is pretty much the same as with write cache enabled. However, the OS disk is now 5-10x slower. The only thing I did was add this to /boot/loader.conf:
[CMD=""]hw.ata.wc=0[/CMD]
 
Ack, the
[CMD=""]hw.ata.wc=0[/CMD]
tunable doesn't work with AHCI drives, only with regular IDE.

Code:
[olav@zpool ~]$ sudo camcontrol identify ada0
pass0: <SAMSUNG HD203WI 1AN10003> ATA-8 SATA 2.x device
pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 2.x
device model          SAMSUNG HD203WI
firmware revision     1AN10003
serial number         S1UYJDWZ725504
WWN                   50024e903c88923
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       3907029168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 
media RPM             5400

Feature                      Support  Enable    Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  yes      no      0/0x00  254/0xFE
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no      0/0x0
unload                         yes      yes
free-fall                      no       no
data set management (TRIM)     no

No wonder my ZFS pool didn't show a speed decrease. How can I disable the write cache for AHCI-enabled devices?
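From a bit of reading it looks like the ada(4) driver has its own loader tunable for this, though I haven't verified it on my release, so treat this as a guess rather than a confirmed answer:

Code:
# /boot/loader.conf -- CAM/ada equivalent of hw.ata.wc (unverified assumption)
kern.cam.ada.write_cache=0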
 
fronclynne said:
Well, I've had UFS(2) fail well enough to hose data three or four times; FAT[12|16|32] has failed more times than I can count; NTFS is as fault-tolerant as the 880 (warning! California joke, sorry); and a wayward sand particle made a backup on CD rather . . . unreadable.
I've lost some data on UFS too, but never an entire file system in the more than 10 years I've been using FreeBSD (OK, barring the odd total disk failure). With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.

Won't stop me trying it though. Too many awesome features to pass up. :)
 
aragon said:
Busy building an 8 TB NAS at the moment, and this is quite a big worry for me. It's not all that easy to back up 8 TB of data either, so... *cringe*

If you don't already have some enterprise backup with an LTO library or something similar, the easiest way is probably going to be to keep a duplicate copy of your pool which you replicate via zfs send/receive. I too have an 8 TB pool and am using this method. Of course, you could still be vulnerable to some ZFS bug that renders both pools useless!
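Roughly what I mean, with the pool, dataset and host names as placeholders only:

Code:
# initial full copy to the second pool
zfs snapshot tank@replica1
zfs send tank@replica1 | ssh backuphost zfs receive -F backup/tank

# later runs only send what changed since the previous snapshot
zfs snapshot tank@replica2
zfs send -i tank@replica1 tank@replica2 | ssh backuphost zfs receive -F backup/tank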
 
olav said:
Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html

I guess that's not quite true then?

Anyway, I tested with write cache disabled here. For sequential data transfer the speed on my ZFS pool is pretty much the same as with write cache enabled. However, the OS disk is now 5-10x slower. The only thing I did was add this to /boot/loader.conf:
[CMD=""]hw.ata.wc=0[/CMD]

Just googled this; from all the info I've seen it would seem that ZFS is safe to use with write cache enabled, BUT this obviously depends on the disks you are using behaving as they should (honouring flush requests from ZFS). Cheap consumer-grade drives are the most likely not to behave well, and expensive SAS disks the most likely to be good. This could explain my issue...
All in all, it seems to leave us in a bit of a lottery where it's impossible to know if the drive you have will work reliably :( Unless someone has compiled a list of good drives somewhere. On all my systems I am using cheap SATA drives for ZFS.
 
aragon said:
With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.

To clarify what happened in my case: ZFS reported unrecoverable corruption in two metadata files and gave as the corrective action "destroy the pool and recreate from backup". However, the pool was still mounted and the data readable. So it wasn't a case where all data would have been lost if I hadn't had a backup.
 
AndyUKG said:
To clarify what happened in my case: ZFS reported unrecoverable corruption in two metadata files and gave as the corrective action "destroy the pool and recreate from backup". However, the pool was still mounted and the data readable. So it wasn't a case where all data would have been lost if I hadn't had a backup.
Good to know, thanks!
 
aragon said:
I've lost some data on UFS too, but never an entire file system in the more than 10 years I've been using FreeBSD (OK, barring the odd total disk failure). With ZFS though, I get the impression it's relatively easy to lose an entire pool from the slightest mishap in setup or environment.
One detail I forgot to mention earlier is that ZFS reported the corruption; on a different FS this type of thing would have been silent. So, because of ZFS's design, the recovery process sucks, but at least you were aware of the issue and could take action to resolve it.
 
olav said:
Aha, I thought ZFS was designed to be safe with write cache enabled. From here: http://www.postgresql.org/docs/current/static/wal-reliability.html

I guess that's not quite true then?

It is true. ZFS sends cache flush commands as needed. However, not all hard drives obey the command. Some will respond with "flush complete" even though they have only written the data to the cache and not to the platters. As far as ZFS is concerned, the data is on the platters (the disk told it so), so it continues on with the next transaction.
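If you want to confirm the flushes are actually being issued, I believe the FreeBSD port exposes a sysctl for it (0, the default, means flush commands are sent; I'd only ever change it for testing):

Code:
# check whether ZFS cache flushes are enabled (0 = flushes are sent)
sysctl vfs.zfs.cache_flush_disable

The [CMD=""]camcontrol identify[/CMD] output earlier in the thread also shows whether a given drive advertises "flush cache" support at all.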

It's always a trade-off between "pure speed" and "total data security". So long as you have a good, working UPS properly configured to issue an ordered shutdown of the box, have good hard drives that don't lie about "flush complete", and don't mind the slim chance of a drive dying with data in the cache, then you can run with the disk caches enabled. If you are absolutely paranoid about data safety and don't mind sacrificing a lot of write throughput, then run with all caches (including controller caches) disabled.

It's up to you to decide what's important, and configure the system to match. :)
 