ZFS dedup and copies

Hi,

I'm at the end of a backup/archiving project I started some time ago. The storage pool is 2 x 6-disk raidz2 vdevs with 2 TB drives, for a total of 14 TB of free space. The system is amd64 FreeBSD 8.2-RELEASE.

About the archiving process: there will be some large files that will be backed up every day, for example Outlook .pst files, and a certain number of them can be 2 GB or more. With a daily incremental backup, space will disappear quickly.

A solution I thought of for this problem is deduplication. I know it consumes a lot of RAM and some loss of performance is to be expected. That's not a problem.

To make things more secure, I thought about the copies property in ZFS, but from what I understand, with deduplication set to on and copies greater than 1, what gets copied is the reference to the block and not the block itself. The problem then is that if there is an error in the referenced block, all data that refers to it will be unrecoverable.

I can't find a lot of information about this on the net. The only thing I've found talks about setting dedupditto to 2, but this is not possible with version 28 of ZFS; the lowest possible value is 100.

So my question is: has anybody already dealt with a situation like this and/or have any information about combining ZFS dedup and copies?

Thanks,
 
By default, once a deduped block is referenced 1000 times, ZFS will add a second copy of the block on the disk. Once it's referenced another 1000 times, ZFS will add a third copy of the block on disk. And so on.

So, blocks that are referenced a lot will have extra redundant copies in the pool.

There's a ZFS pool property for this (dedupditto), so you can actually tune the "keep another copy" threshold.
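For example, assuming a pool simply called "tank":

zpool get dedupditto tank        # show the current auto-ditto threshold
zpool set dedupditto=100 tank    # 100 is the lowest non-zero value accepted

The per-dataset copies property is a separate, unconditional "store N copies" setting on top of that.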
 
kisscool-fr said:
To make things more secure, I thought about the copies property in ZFS, but from what I understand, with deduplication set to on and copies greater than 1, what gets copied is the reference to the block and not the block itself.
Did you test it?
I believe I've read the ZFS devs discussing the topic and saying that when you set copies to X, there will be no fewer than X of them regardless of deduplication. Dedupditto can increase the number further.
 
phoenix said:
By default, once a deduped block is referenced 1000 times, ZFS will add a second copy of the block on the disk. Once it's referenced another 1000 times, ZFS will add a third copy of the block on disk. And so on.

So, blocks that are referenced a lot will have extra redundant copies in the pool.

There's a ZFS pool property for this (dedupditto), so you can actually tune the "keep another copy" threshold.

Thanks for your input. It doesn't really match what I've found with further searching. I don't have the link anymore, but to summarize: one guy had checksum errors on his deduped pool, and the errors occurred on referenced blocks. He could repair the majority of his data, but for the rest it was impossible. He had to destroy the pool and restore from backup. That's what I want to avoid, because this is a backup server.

From what you are saying, the redundant copies are full copies, right?


Slurp said:
Did you test it?
I believe I've read the ZFS devs discussing the topic and saying that when you set copies to X, there will be no fewer than X of them regardless of deduplication. Dedupditto can increase the number further.

No, I have not tested it yet, and even if I wanted to I wouldn't really know how to verify it. I'm not a ZFS expert, I just learn a little bit every day. That's why I read a lot on mailing lists, forums and blogs, but even on the mailing lists there are contradictory answers.

It was on a mailing list (I think) that I read that with copies greater than 1, what is copied is the reference to the block and not the referenced block.
I've also found something talking about ditto blocks, but like I said earlier, the value mentioned there was 2 and it seems impossible to set that value for ditto blocks; the minimum is 100. Did I miss something?


Crest said:
FreeBSD 8.2 does not yet include dedup support. You need ZFS v28 for dedup.

Yes, I know. I have patched the system with ZFS v28 support.
 
The following blog post from Sun/Oracle suggests that by default ZFS stores one copy of the data and two copies of the metadata. With copies (n) > 1, metadata is stored 3 times and the actual data is stored n times.

https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection

I can't say for certain, but I see no reason why this should change with deduplication. The first time a record is stored, 2 copies of the data will be stored and 3 of the metadata (with copies=2). The next time you store the same record, it will increase the ref count but still have 2 full copies on disk. I can't see why or how it would work any other way.
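If you want to try it yourself, setting both on a dataset is just (the dataset name below is made up):

zfs set dedup=on tank/backups      # enable deduplication for new writes
zfs set copies=2 tank/backups      # keep two full copies of each data block
zfs get dedup,copies tank/backups  # confirm the settings

Keep in mind that both properties only affect data written after they're set.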

The whole point of copies is to provide redundancy on single disk pools (such as laptops or HW raid arrays as mentioned in the blog). Without copies you can scrub a single disk pool but not fix any records that fail checksum. With copies>1 ZFS can fix the corrupt data.

In your case I don't see much advantage to copies. You already have dual redundancy in the vdevs. Obviously you immediately halve the amount of data you can store with copies=2, bringing the usable space of your pool down to 8TB (((6-2)x2TB + (6-2)x2TB) / 2 copies). If you're ok with that amount of space you may as well just set up a stripe of 3-way mirrors, which provides great resilience, probably higher performance, and I think mirrors are much simpler and easier to manage. They may rebuild quicker as well, since a resilver only has to read from another disk in the same mirror rather than from all the other disks in the vdev.
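As a sketch, with your twelve 2TB drives (device names below are invented), a stripe of four 3-way mirrors giving roughly the same 8TB would be:

zpool create tank \
    mirror da0 da1  da2 \
    mirror da3 da4  da5 \
    mirror da6 da7  da8 \
    mirror da9 da10 da11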

Of course we've all seen or read about ZFS pools going corrupt, even with redundancy, quite often ending with a rebuild. Unfortunately, the complexity of ZFS's inner workings means that one of its downsides is that recovery from errors is nearly impossible if ZFS can't fix the problem itself (there are blogs on the net about recovering data from the few people who really know ZFS, but it's a hell of a lot of work just to find and pull out one <=128k record). The ideal solution in any scenario (especially with critical data) is to have 2 independent backups, so that the loss of any one system still leaves you with a backup. If you're considering copies=2, you could also just create 2 separate pools with one vdev each and zfs send data from one to the other. A corrupt, unfixable vdev will bring down a pool with 2 vdevs, whereas with separate pools the second should hopefully survive unless something catastrophic has gone wrong with the system.
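Roughly like this (pool, dataset and snapshot names are only placeholders):

zpool create pool1 raidz2 da0 da1 da2 da3 da4 da5
zpool create pool2 raidz2 da6 da7 da8 da9 da10 da11
zfs snapshot -r pool1/backups@daily
zfs send -R pool1/backups@daily | zfs recv -F pool2/backups

After the first full copy you'd switch to incremental sends (zfs send -R -i @previous pool1/backups@daily ...) so only the changed blocks cross over.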

Just going back to dedupe for a second: if you ever plan to destroy a large file system, it may be worth clearing it out manually first. The posts below suggest it can be a real problem, and that one company ended up getting a loaner system from Oracle with 128GB of RAM just so the destroy could finish. (This may have been fixed; I've not had any involvement with dedupe or very large systems, but it seems to just be a side effect of lots of deduped data and its well-documented tendency to use lots of RAM.)

http://lists.freebsd.org/pipermail/freebsd-fs/2012-August/014904.html
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg47526.html
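By "clearing it out manually" I mean something roughly like the following, deleting in small steps instead of relying on one big recursive destroy (the dataset name is invented):

# get rid of the snapshots one by one first
zfs list -H -t snapshot -o name -r tank/archive | while read snap; do
    zfs destroy "$snap"
done
# then remove the file contents in manageable chunks with plain rm,
# and only destroy the (now nearly empty) dataset at the end
zfs destroy tank/archive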

I'm also not sure how wise it is to patch a system with v28 support. Would it not make more sense to just upgrade to 8.3?
 
Sigh...dedup

Yes, and let that be a cautionary example to funking always have a backup! It's so silly that something I thought was a simple maintenance command left the entire system of about 10-14TB completely fubar, just because of dedup. If you read on in that thread, I also tried hooking up just the disks to a server with even double that RAM, 64GB, and it was still fubar. I'm so pissed off at dedup right now, words can't even describe it. I abandoned the rescue operation after well over a month. There was 2TB worth of data that was really important (and not backed up yet), but... And I really did try everything imaginable.

Rookie mistake. But I thought to myself "Yeah, I'll just delete these old filesystems (and underlying snapshots) first and back up the important stuff right after that, that's just ordinary maintenance.", but "right after that" never got the chance to come :( And I call myself a "Storage technician", gah!

/Sebulon
 
http://arc.opensolaris.org/caselog/PSARC/2009/571/mail
The per-dataset ncopies property is obeyed in that if one were to dedup many
blocks with ncopies set to 2 there would result in 2 copies total after
deduplication.

The 'dedupditto' property guides what we're calling auto-ditto in which ZFS
chooses to store an additional copy once some threshold is reached. This is
independent of the per-dataset 'ncopies' property.
Unless BSD changed the behaviour (doubtful), it does what I said: it keeps at least N copies regardless of how many blocks are deduped. If you have a mail saying otherwise, please share the link.
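If you want to convince yourself, I think you can write a small file on a dataset with copies=2 and dedup=on and then look at its block pointers with zdb; a block stored twice should show two DVA entries (the names below are only examples):

ls -i /tank/test/somefile         # on ZFS the inode number is the object id
zdb -ddddd tank/test <object-id>  # each L0 block should list DVA[0] and DVA[1]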

As an aside, like usdmatt I'm surprised that you prefer patching ZFS, whose codebase you don't know, over just updating the OS.
 
Hi,

Sorry for the late reply, I was a little bit busy. I'll try to reply point by point.

usdmatt said:
The following blog post from Sun/Oracle suggests that by default ZFS stores one copy of the data and two copies of the metadata. With copies (n) > 1, metadata is stored 3 times and the actual data is stored n times.

https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection

I can't say for certain, but I see no reason why this should change with deduplication. The first time a record is stored, 2 copies of the data will be stored and 3 of the metadata (with copies=2). The next time you store the same record, it will increase the ref count but still have 2 full copies on disk. I can't see why or how it would work any other way.

You're right. I just misunderstood it the first time I read about it. I found the article I was talking about; it's from OpenSolaris: http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup. The relevant part is "How does the dedup property interact with the copies property?"
Reading it a few more times made me understand how it works. English is not my native language, so sometimes it's better to reread it again and again :)

usdmatt said:
The whole point of copies is to provide redundancy on single disk pools (such as laptops or HW raid arrays as mentioned in the blog). Without copies you can scrub a single disk pool but not fix any records that fail checksum. With copies>1 ZFS can fix the corrupt data.

I agree with that.

usdmatt said:
In your case I don't see much advantage to copies. You already have dual redundancy in the vdevs. Obviously you immediately halve the amount of data you can store with copies=2, bringing the usable space of your pool down to 8TB (((6-2)x2TB + (6-2)x2TB) / 2 copies). If you're ok with that amount of space you may as well just set up a stripe of 3-way mirrors, which provides great resilience, probably higher performance, and I think mirrors are much simpler and easier to manage. They may rebuild quicker as well, since a resilver only has to read from another disk in the same mirror rather than from all the other disks in the vdev.

Yes, in my configuration up to 4 disks can fail (2 disks in each vdev) and the pool will still be available, but without copies there isn't any further redundancy for the data.
Imagine a case where one disk fails, I replace it, another one fails during the resilver, and before the resilvering completes a third and a fourth fail in the same vdev (yes, it's hypothetical, but these are 2 TB drives from the same series and resilvering can take some time). That leaves me with 3 good drives and 3 bad ones in the same vdev; that vdev is down, and the pool with it. With a lot of luck (yes, I said it's hypothetical) I could use dd or dd_rescue to clone one of the failing drives, reinsert the clone into the vdev and continue my resilvering, as sketched below. Without copies I'm not sure I could get all my data back; with copies the chances are better.
You said copies=2 leaves me with only 8 TB of space. That's not a problem for me, because it's 8 TB of non-deduped data; once deduped it can hold much more.
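The rescue I have in mind would be roughly this (device names are just examples, and I have not actually tried it):

dd_rescue /dev/da3 /dev/da12   # copy as much as possible from the failing drive to a spare
zpool offline tank da3         # take the failing drive out of service
# physically swap the clone in where the failing drive was, then:
zpool online tank da3
zpool scrub tank               # let ZFS repair whatever dd_rescue could not read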

Your point about the 3-way mirror is something I had not thought about. It is interesting put like that: management and rebuild time are better, OK; performance is higher, OK, although that is not really an issue for me because I don't need extra performance, just something reasonable. What stops me is resilience: if we reuse the scenario above and the 3 failing drives are in the same mirror vdev, everything is down. I don't know why, but in that situation having copies reassures me. And if I use copies on top of the mirrors, the space left is 4 TB of non-deduped data, which is not enough for me.

usdmatt said:
Of course we've all seen or read about ZFS pools going corrupt, even with redundancy, quite often ending with a rebuild. Unfortunately, the complexity of ZFS's inner workings means that one of its downsides is that recovery from errors is nearly impossible if ZFS can't fix the problem itself (there are blogs on the net about recovering data from the few people who really know ZFS, but it's a hell of a lot of work just to find and pull out one <=128k record). The ideal solution in any scenario (especially with critical data) is to have 2 independent backups, so that the loss of any one system still leaves you with a backup. If you're considering copies=2, you could also just create 2 separate pools with one vdev each and zfs send data from one to the other. A corrupt, unfixable vdev will bring down a pool with 2 vdevs, whereas with separate pools the second should hopefully survive unless something catastrophic has gone wrong with the system.

Having two independent pools (one vdev each) in the same machine is interesting; I had not thought about that either. In this case, which configuration would be better: two raidz pools (10 TB non-deduped) or two raidz2 pools (8 TB non-deduped)?
With the configuration I have now ("raid60"), performance is acceptable for me, but I'm not sure what it would be with a single raidz or raidz2 vdev per pool. I don't need anything monstrous, just something acceptable.


usdmatt said:
Just going back to dedupe for a second: if you ever plan to destroy a large file system, it may be worth clearing it out manually first. The posts below suggest it can be a real problem, and that one company ended up getting a loaner system from Oracle with 128GB of RAM just so the destroy could finish. (This may have been fixed; I've not had any involvement with dedupe or very large systems, but it seems to just be a side effect of lots of deduped data and its well-documented tendency to use lots of RAM.)

http://lists.freebsd.org/pipermail/freebsd-fs/2012-August/014904.html
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg47526.html

Thanks for the advice. I did some tests with dedup some time ago and, yes, I had to recreate the pool after destroying a file system. I plan to add two 128 GB SSDs as cache devices to this configuration; I hope that will be enough.
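Adding them should be as simple as (device names are examples):

zpool add tank cache ada1 ada2   # two SSDs as L2ARC cache devices

From what I've read, the L2ARC can also hold parts of the dedup table, so it should at least take some pressure off RAM.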

usdmatt said:
I'm also not sure how wise it is to patch a system with v28 support. Would it not make more sense to just upgrade to 8.3?

It's just that for the moment I can't update the entire system to 8.3, but it will be updated soon.
The patch I found came from the FreeBSD lists; I can't find the thread anymore, but the patch itself is available here: http://people.freebsd.org/~mm/patches/zfs/v28/.



Sebulon said:
Sigh...dedup

Yes, and let that be a cautionary example to funking always have a backup! It's so silly that something I thought was a simple maintenance command left the entire system of about 10-14TB completely fubar, just because of dedup. If you read on in that thread, I also tried hooking up just the disks to a server with even double that RAM, 64GB, and it was still fubar. I'm so pissed off at dedup right now, words can't even describe it. I abandoned the rescue operation after well over a month. There was 2TB worth of data that was really important (and not backed up yet), but... And I really did try everything imaginable.

Rookie mistake. But I thought to myself "Yeah, I'll just delete these old filesystems (and underlying snapshots) first and back up the important stuff right after that, that's just ordinary maintenance.", but "right after that" never got the chance to come :( And I call myself a "Storage technician", gah!

/Sebulon

Sorry to hear that, but thanks for the warning. I try to take care of as many things as possible up front so I end up with something without problems.

Slurp said:
http://arc.opensolaris.org/caselog/PSARC/2009/571/mail

Unless BSD changed the behaviour (doubtful), it does what I said: it keeps at least N copies regardless of how many blocks are deduped. If you have a mail saying otherwise, please share the link.

As an aside, like usdmatt I'm surprised that you prefer patching ZFS, whose codebase you don't know, over just updating the OS.

Like I said before, it was just my misunderstanding of what I read. Rereading the link I provided above, it says exactly what you guys told me ;)
 
Sebulon said:
Sigh...dedup

Yes, and let that be a cautionary example to funking always have a backup! It's so silly that something I thought was a simple maintenance command left the entire system of about 10-14TB completely fubar, just because of dedup. If you read on in that thread, I also tried hooking up just the disks to a server with even double that RAM, 64GB, and it was still fubar. I'm so pissed off at dedup right now, words can't even describe it. I abandoned the rescue operation after well over a month. There was 2TB worth of data that was really important (and not backed up yet), but... And I really did try everything imaginable.

Rookie mistake. But I thought to myself "Yeah, I'll just delete these old filesystems (and underlying snapshots) first and back up the important stuff right after that, that's just ordinary maintenance.", but "right after that" never got the chance to come :( And I call myself a "Storage technician", gah!

/Sebulon

I had a similar issue a year ago. There was a very good thread about this on one of the Solaris mailing lists, now gone (thanks, Oracle), with an ugly fix for this problem: starting the server up, letting it load, letting it try to import/scrub the ZFS pool, having it crash when it runs out of memory, then rebooting and repeating. After 6 solid days I managed to reimport my 28TB (12TB of data) pool. Long and painful, but I recovered all the data and haven't touched dedup since.
 