Question about dedup with ZFS v28

I am running FreeBSD 8.2/amd64; I applied the ZFS v28 patch for 8.2-RELEASE and ran zpool upgrade / zfs upgrade on my pools.

I was previously using Gallery 2 to host my family photos and recently updated to Gallery 3. Due to limitations in the way the G2 import/migration process works, it essentially had to create a copy of every one of the source JPEGs.

Since I want to keep the same Gallery 2 album structure (for the time being) in case I find a problem or a missing feature in Gallery 3, I did the following:

1. moved /data/pictures to /data/pictures.old

2. created a new dataset on my data zpool for the pictures (data/pictures) and enabled dedup and compression:
[CMD="zfs create -o dedup=on -o compression=on data/pictures"]zfs create -o dedup=on -o compression=on data/pictures[/CMD]

3. copied /data/pictures.old/gallery3 to /data/pictures/gallery

4. copied /data/pictures.old/gallery to /data/pictures/gallery.old

I was expecting the second copy (step 4 above) to run much faster and/or consume only the disk space for the files from the old gallery that were not in the new gallery. In reality it took just as long to copy (i.e. as if I were copying to the zpool without dedup enabled), and df reports ~169 GB used instead of roughly half of that.

Did I not correctly enable dedup? Do I need to enable dedup for the parent zpool also? It shows as enabled:

Code:
root@pflog:/data# zfs get all data/pictures
NAME           PROPERTY              VALUE                  SOURCE
data/pictures  type                  filesystem             -
data/pictures  creation              Thu Jun  2  9:10 2011  -
data/pictures  used                  165G                   -
data/pictures  available             1.32T                  -
data/pictures  referenced            165G                   -
data/pictures  compressratio         1.02x                  -
data/pictures  mounted               yes                    -
data/pictures  quota                 none                   default
data/pictures  reservation           none                   default
data/pictures  recordsize            128K                   default
data/pictures  mountpoint            /data/pictures         default
data/pictures  sharenfs              on                     inherited from data
data/pictures  checksum              on                     default
data/pictures  compression           on                     local
data/pictures  atime                 on                     default
data/pictures  devices               on                     default
data/pictures  exec                  on                     default
data/pictures  setuid                on                     default
data/pictures  readonly              off                    default
data/pictures  jailed                off                    default
data/pictures  snapdir               hidden                 default
data/pictures  aclinherit            restricted             default
data/pictures  canmount              on                     default
data/pictures  xattr                 off                    temporary
data/pictures  copies                1                      default
data/pictures  version               5                      -
data/pictures  utf8only              off                    -
data/pictures  normalization         none                   -
data/pictures  casesensitivity       sensitive              -
data/pictures  vscan                 off                    default
data/pictures  nbmand                off                    default
data/pictures  sharesmb              off                    default
data/pictures  refquota              none                   default
data/pictures  refreservation        none                   default
data/pictures  primarycache          all                    default
data/pictures  secondarycache        all                    default
data/pictures  usedbysnapshots       0                      -
data/pictures  usedbydataset         165G                   -
data/pictures  usedbychildren        0                      -
data/pictures  usedbyrefreservation  0                      -
data/pictures  logbias               latency                default
data/pictures  dedup                 on                     local
data/pictures  mlslabel                                     -
data/pictures  sync                  standard               default

Did I misunderstand dedup and incorrectly expect it not to store two copies of the same data? Does the ctime/mtime of the extra data come into play? After the G2 import, I wrote a script to go back and change the mtime/ctime of the files in the new gallery directory to match those of the old gallery files.

Thanks in advance!
 
You should be able to enable dedup for individual filesystems.
The ctime/mtime of files also has no bearing on whether a file is deduped or not.
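
If you want to check where the property is actually coming from, something like this (using your dataset names) will show the SOURCE column for both the pool's root dataset and the child:

Code:
# zfs get dedup data data/pictures

A SOURCE of "local" on data/pictures means it's set on that dataset itself; it does not need to be enabled on the parent as well.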

ZFS keeps a dedup table (DDT) which contains the checksum of every record stored on the filesystem. When a new record is written, its checksum is compared against the table to see if it's a duplicate. It is possible for two identical files to end up stored twice if they are split across records differently, especially with compression on, but I would have expected a half-decent dedup ratio for two identical copies of ~80GB of images. I have not tried dedup yet though, so I'm no expert.
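
If you're curious how well data would dedup before committing to it, I believe zdb can simulate a DDT for an existing pool - something along these lines (pool name from your setup):

Code:
# zdb -S data

It should walk the pool and print a simulated dedup table histogram, with an estimated dedup ratio at the end.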

Do you see the dedup ratio if you do a [CMD="zpool"]list[/CMD]?
If it's exactly 1.00x then I would be inclined to think dedup is not enabled/working for some reason.
 
usdmatt said:
Do you see the dedup ratio if you do a [CMD="zpool"]list[/CMD]?
If it's exactly 1.00x then I would be inclined to think dedup is not enabled/working for some reason.

Yes, I do see a dedup factor:

Code:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
backup  1.09T   297G   815G    26%  1.00x  ONLINE  -
data    2.72T   714G  2.02T    25%  2.04x  ONLINE  -

But df reports 165 G used on the data/pictures mount:

Code:
Filesystem         1G-blocks Used Avail Capacity  Mounted on
data                    1742  394  1348    23%    /data
data/pictures           1513  165  1348    11%    /data/pictures

I was expecting it to show ~half of the 165. So perhaps it's working, but df still reports it as used space for some reason? Is there a way to check the details of the dedup table to see if the pictures are in there?

FWIW, I generated md5 checksums for the files in both filesystems and most of them match, as I expected. So my questions are basically:

1. Does df report dedup'd data or not? E.g. suppose I had a dataset with dedup enabled and I copied a 10 GB file to it. I then copied the same file. Would df show 10 G or 20 G used?

2. Should dedup have made copying the data much, much faster? I was expecting it to be much faster since it wouldn't really have to "copy" most of it, just calculate and compare the checksum and then, if it's already in the table, create an internal pointer/reference to the data rather than actually duplicating it on disk.
 
Well, the dedup factor itself seems to suggest you are using half the raw storage space you would have been using had you not used dedup.

It's entirely possible that df may show the wrong usage. This dedup article mentions that on Solaris, df reports incorrect size/used information, so FreeBSD is probably the same:
http://blogs.oracle.com/jsavit/entry/deduplication_now_in_zfs

Of course you'd expect there to be a way to actually see how much space the data/pictures filesystem is using, other than just looking at the overall pool size.
What does [CMD="zfs"]list[/CMD] show?

I would generally expect writing duplicate data to be quicker. It is possible that you have quick disks and a slower processor. When writing data, if compression and checksumming are causing a bottleneck, you won't see much of an improvement, since ZFS still has to do those things with dedup enabled. You may notice a much bigger performance difference when writing duplicate data with compression off. I'm pretty sure the 4 x 2TB raidz array in my home NAS can write just as fast as, if not faster than, my Atom processor can compress data.

Another recommendation is to always make sure you have a decent amount of RAM; you don't really want ZFS having to go to disk for DDT lookups because the table won't fit in memory.
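
As a very rough rule of thumb (treat this as an estimate only), each in-core DDT entry is usually quoted at around 320 bytes, with one entry per unique record. For ~165GB at the default 128KB recordsize that's roughly 1.35 million records:

Code:
# echo "165 * 1024 * 1024 / 128 * 320 / 1024 / 1024" | bc
412

So a bit over 400MB of RAM just for the dedup table in this case; not a problem on a well-specced box, but worth keeping in mind on smaller systems.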
 
I am currently using dedup on my 9-CURRENT desktop, on /usr/src (with compression) and /virtualbox. I have noticed a significant speed increase during compiles and read operations, but there is always some sort of delay during heavy writes. I also tried using dedup on my jails, but that tended to slow things down terribly.

I think disk performance is even more critical for dedup than the CPU, despite the calculations needed, and since my disks are very old SATA drives I imagine I have a bottleneck there. You can get some nice statistics on dedup by issuing:

[CMD=""]# zdb -DD data/pictures[/CMD]
 
usdmatt said:
What does [CMD="zfs"]list[/CMD] show?

Here's the output:

Code:
NAME            USED  AVAIL  REFER  MOUNTPOINT
backup          198G   531G   198G  /backup
data            560G  1.32T   394G  /data
data/pictures   165G  1.32T   165G  /data/pictures

usdmatt said:
I would generally expect writing duplicate data to be quicker

As would I. This is a Core i7 2600k with 16 GB of RAM, so it shouldn't have been CPU or memory bound.

I found some mentions of dedup and df being inaccurate, recommending that people use zpool instead of df to look at usage. So I think you're right, df is just not dedup-aware. That doesn't explain why it was so slow to copy, though. :/

It appears that what happens is the size of the filesystem "grows". For example, I did a df before and after I created a 2G file from /dev/urandom:

before:
Code:
Filesystem    1M-blocks   Used   Avail Capacity  Mounted on
data/pictures   1549992 169010 1380981    11%    /data/pictures

dd:
Code:
# dd if=/dev/urandom of=2G bs=1024 count=2M
2097152+0 records in
2097152+0 records out
2147483648 bytes transferred in 29.533158 secs (72714325 bytes/sec)

after:
Code:
Filesystem    1M-blocks   Used   Avail Capacity  Mounted on
data/pictures   1549988 171058 1378930    11%    /data/pictures

Note that the amount used did increase from 169010 to 171058. I'm not sure why the total size actually went down though?

But if I then copy that 2G file to a new file, notice the df output:

after copying 2G file:
Code:
Filesystem    1M-blocks   Used   Avail Capacity  Mounted on
data/pictures   1552002 173105 1378896    11%    /data/pictures

Note that used went up by ~2G, but the total filesystem size also grew by nearly that much! So effectively it only used: (173105-171058)-(1552002-1549988) = 33 MB.

If I do the same experiment with a 2G file of zeros (note: I removed the files from the previous 2G test, so comparisons between that df output and this one don't make sense):

before:
Code:
Filesystem    1M-blocks   Used   Avail Capacity  Mounted on
data/pictures   1549987 169010 1380976    11%    /data/pictures

after (2G file from /dev/zero):
Code:
Filesystem    1M-blocks   Used   Avail Capacity  Mounted on
data/pictures   1549992 169010 1380982    11%    /data/pictures

So the total size increased by 5 MB, but the total used didn't change.
 
I don't think it's a good idea to use compression if you know the filesystem is just going to be storing JPEGs. JPEGs are already highly compressed, so any further gains from compression will be very marginal. You will force your CPU to do a lot of work for no benefit, perhaps bottlenecking the system (you'd need to test it, though).
 
Dedup performance will likely depend mostly on available RAM (or rather, ARC size), then CPU, then disks. Actually, dedup, if implemented properly, will save disk I/O and thus improve performance in typical cases.

Dedup is not always the best choice, however.
The original question would be better served by using ZFS snapshots and cloning. This does "duplicate" the data without using any additional disk space, and it will not create a DDT that uses additional disk space AND memory. Any modification to the files afterwards will result in allocation of new space. There is very little chance that different JPEG files have anything in common, so dedup will not help much; dedup works at the filesystem block level, rather than the file level.
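
For example (the snapshot/clone names here are just illustrative), something like this would have given a second, writable copy of the pictures that costs no extra space until the two copies diverge:

Code:
# zfs snapshot data/pictures@gallery2
# zfs clone data/pictures@gallery2 data/pictures_old

The clone shares all its blocks with the snapshot; only files that are modified or added afterwards consume new space, and there is no DDT to maintain.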

One additional note: for dedup to work, you need to enable it before writing the first set of (duplicate) data. If you have existing data, enable dedup, and then copy that data within the same dataset, you will not see any dedup effect, because there are no DDT entries for the older objects. If you then create a third copy, it will be deduplicated against the second copy, but not the first (you will get 2/3 of the storage used instead of 1/3).
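
To illustrate the ordering (the file names here are made up): suppose bigfile.dat was already in /data/pictures before dedup was enabled.

Code:
# zfs set dedup=on data/pictures
# cp /data/pictures/bigfile.dat /data/pictures/copy1
# cp /data/pictures/bigfile.dat /data/pictures/copy2
# zpool list data

copy1 is written out in full, because the original's blocks were never entered into the DDT, but copy1's own blocks are. copy2 then deduplicates against copy1, so you end up with two copies' worth of allocated space for three copies of the file.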

There may not be much improvement by using both compression and dedup.
 
jkcarrol said:
Yes, I do see a dedup factor:

Code:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
backup  1.09T   297G   815G    26%  1.00x  ONLINE  -
data    2.72T   714G  2.02T    25%  2.04x  ONLINE  -

But df reports 165 G used on the data/pictures mount:

Code:
Filesystem         1G-blocks Used Avail Capacity  Mounted on
data                    1742  394  1348    23%    /data
data/pictures           1513  165  1348    11%    /data/pictures

I was expecting it to show ~half of the 165. So perhaps it's working, but df still reports it as used space for some reason? Is there a way to check the details of the dedup table to see if the pictures are in there?

FWIW, I generated the md5s for the list of files in both file systems and they match for most, as I expected. So my questions are basically:

1. Does df report dedup'd data or not? E.g. suppose I had a dataset with dedup enabled and I copied a 10 GB file to it. I then copied the same file. Would df show 10 G or 20 G used?

Most command-line tools that show disk usage will show "apparent disk usage", meaning the total, raw amount of disk space that would be needed to store that data. IOW, df is showing that you are storing ~165 GB of data, which is correct: you have two copies of 80+ GB of data. This is what the end user sees, since they don't need to know whether compression, dedupe, or redundancy is in effect; they just need to know how much storage they have asked to use.

zfs list will show the amount of space in use for each filesystem after compression is applied; dedupe and redundancy are accounted for at the pool level, so they do not show up here.

zpool list will show the actual amount of physical storage in use, as in, the number of actual physical disk sectors written to.

Note: the three values shown above will be different!
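
For example, comparing the three views side by side (pool/dataset names from this thread):

Code:
# df -h /data/pictures
# zfs list data/pictures
# zpool list data

df and zfs list will both still show the full 165G for the dataset, while the ALLOC and DEDUP columns of zpool list reflect what was actually written to the pool after dedupe.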

2. Should dedup have made copying the data much, much faster? I was expecting it to be much faster since it wouldn't really have to "copy" most of it, just calculate and compare the checksum then if it's already in the table, just create some internal pointer/reference to the data, rather than actually duplicate it on disk.

Writes will be slower with dedupe enabled, as ZFS has to:
  • split the data into blocks
  • calculate the checksum for each block
  • compare the checksum for each block to the DDT to see if it's a duplicate block
  • if it's a new (unique) block, write the block to disk, update the DDT with the new entry, write the DDT entry to disk
  • if it's a duplicate block, update the DDT reference count, write the DDT entry to disk
  • probably some other stuff

Without dedupe, ZFS only has to split the data into blocks, calculate the checksum for each block, and write the blocks to disk.
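
If you want to measure that overhead yourself, one rough way (the dataset and file names here are only examples) is to time the same copy into a deduped and a non-deduped scratch dataset:

Code:
# zfs create -o dedup=on data/testdedup
# zfs create data/testplain
# time cp /path/to/bigfile /data/testdedup/
# time cp /path/to/bigfile /data/testplain/
# zfs destroy data/testdedup
# zfs destroy data/testplain

The numbers are only approximate since writes are cached; using a file considerably larger than RAM gives a fairer comparison.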

Dedupe is not something you enable for performance reasons. The only reason to enable dedupe is to save on physical storage space.
 
danbi said:
Dedup performance will likely depend mostly on available RAM (or rather, ARC size), then CPU, then disks. Actually, dedup, if implemented properly, will save disk I/O and thus improve performance in typical cases.

I have plenty of RAM (for example, the ARC is regularly up near 12 GB since I use very little memory for processes), and the CPU is an i7 2600K, which I think should be more than a match for the task :)

Dedup is not always the best choice, however.
The original question would be better served by using ZFS snapshots and cloning.

Yeah, I think you're right that I would have been better off with snapshots. I ended up moving all the data off the "data/pictures" dataset and then removed the dataset. I have rsnapshot doing backups to my backup zpool without the delete option enabled, so I waited to run the backup until I had the new Gallery 3 directory structure in place in the same path as the old Gallery 2 albums. That way, when the backup ran, it was able to avoid duplicating the data, and I just have one rsnapshot directory with the old Gallery 2 data and a new rsnapshot directory with the Gallery 3 data. With 99% of it being the same, this did the trick and saved me from keeping the Gallery 2 albums lying around.

This does "duplicate" the data, without using any additional disk space. It will also not create DDT that use additional disk space AND memory. Any modification to the files afterwards will result in allocation of new space. There is very small chance that JPEG files have anything in common, so dedup will not help much. Dedup works on filesystem block level, rather than on file level.

Ahh, I see, so dedup operates at the block level, not the file level? In that case, why wouldn't it be able to deduplicate each block of each JPEG, since there was a second copy? Per your other comment, I did copy the data for the first time after turning dedup on, so both copies were done with dedup enabled. But if JPEG files, even two copies of all of them, won't benefit from dedup, that would explain what I saw :)

As for compression, I'm not sure why I enabled that; it doesn't make much sense for data that's already highly compressed, like JPEG or AVC (h264/aac).

Anyway, thanks very much for the explanation!
 
So, on what types of files would dedup be best used? Would it be beneficial to use dedup on filesystems without compression enabled (such as movies, music, pictures, etc.)? I guess it makes sense since it's block-based rather than file-based, but I'm not sure...
 
I believe files that contain a lot of the same metadata will benefit greatly from dedup, like Word/Excel files. Correct me if I'm wrong.
 
bbzz said:
So, on what types of files would dedup be best used? Would it be beneficial to use dedup on filesystems without compression enabled (such as movies, music, pictures, etc.)? I guess it makes sense since it's block-based rather than file-based, but I'm not sure...
Music, video, and pictures contain highly random data.
There's absolutely nothing to gain from any kind of dedup with those kinds of files.
 
Yeah, that makes sense.

@olav
That also makes sense - those files also benefit from heavy compression. It would be interesting to see whether there are performance issues with both compression and dedup enabled on those files.

regards
 
I'm not sure Word/Excel documents will dedup very well. You have to remember ZFS dedups by the record (which is variable, up to 128KB by default), not by the block on your hard disk.
If you write two Word documents under 128KB each, they will each be stored in a single record and so would need to be identical to be deduped. Documents over 128KB will be split into 128KB records, and again I think there's little chance of finding exact duplicates there.
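
The record size is just a per-dataset property if you want to check it (using the dataset from earlier in the thread as an example):

Code:
# zfs get recordsize data/pictures

Files smaller than the recordsize are stored as a single, smaller record, which is why two small documents would have to be identical to dedup at all.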

As jalla said, media files are extremely random and aren't going to dedup very well.

I'd like to hear any other use suggestions, but I can't think of many places dedup will work well other than those where exact duplicate files are likely, such as storage servers used by many users. Some areas where dedup might come in useful in my line of work are as follows:

1) Web Storage
We store about 150GB of web files on our storage server and I've no doubt there's quite a few copies of Wordpress/Joomla/etc on there.

2) VM Storage
This is an awkward one, as two copies of an OS (e.g. Windows files) copied to a ZFS datastore will obviously dedup very well. However, without trying it out I'm not sure how well it would dedup when those files are written into a zvol or a vmdk/vhd file on top of ZFS.

Using dedup on an NFS VM store would give you an interesting clone feature, though. In order to use ZFS's built-in clone command, each VM would need to be on a separate dataset, and mounting a share for every VM isn't really feasible. However, if you have one NFS-shared dataset for your VMs, with dedup on, just doing a
[CMD="cp"]templateVM/disk.vmdk newVM/disk.vmdk[/CMD] should give you a new copy of the template VM without using any additional space. It will only start using more space when changes are made. (If you're using thin provisioning, you may need to use rsync --sparse instead, unless ZFS makes the files sparse automatically.)
 
usdmatt said:
2) VM Storage
This is an awkward one, as two copies of an OS (e.g. Windows files) copied to a ZFS datastore will obviously dedup very well. However, without trying it out I'm not sure how well it would dedup when those files are written into a zvol or a vmdk/vhd file on top of ZFS.
From my personal experience under VirtualBox, there is a significant performance increase when using dedup on cloned virtual machines.
 
One of my first experiments with dedup was on a diskless boot server, where I was keeping different FreeBSD versions for different clients (such as 7-stable-i386, 8-stable-i386, 8-stable-amd64, etc.). This all deduped very well, because many of the (non-executable) files are the same anyway.

It is similar with jails. With many jails it is probably best to use cloning to create new jail instances, but that means an identical base system for all jails. With dedup, you will get about the same savings even if you re-install the OS within jails, use slightly different revisions, etc.
 