ZFS write performance with dedup

Hi,

I am using FreeBSD 9.0 and I am having trouble understanding ZFS write performance with dedup enabled. The setup is as follows:

Code:
test# zpool status
  pool: mainpool
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mainpool    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada4p1  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
        logs
          mirror-1  ONLINE       0     0     0
            ada5p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
        cache
          ada0p2    ONLINE       0     0     0

errors: No known data errors

dedup and compression are both ON, with a 128k record size. ada5p2 and ada1p2 are 45G SSD partitions (the log mirror). ada0p2 is a 45G SSD partition used as the secondary cache (L2ARC), which is set to metadata. ada3p3, ada4p1 and ada2p1 are regular hard drives. The computer has 16GB of RAM. The primary cache is also set to metadata. The main pool has 1.3TB of data in it. There is nothing running on this computer apart from what I run.
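For reference, the relevant dataset properties look roughly like this when queried (output reproduced from memory, so the exact values and formatting may be slightly off):

Code:
test# zfs get dedup,compression,recordsize,primarycache,secondarycache mainpool
NAME      PROPERTY        VALUE     SOURCE
mainpool  dedup           on        local
mainpool  compression     on        local
mainpool  recordsize      128K      default
mainpool  primarycache    metadata  local
mainpool  secondarycache  metadata  local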

I reboot the computer to clear all caches. I use an arc_summary.pl script to get the kernel memory used:

Code:
Kernel Memory:                                 728.27M

I repeatedly write 5000M from /dev/random to the pool and check the kernel memory size after each write:
Code:
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 775.14M
dd if=/dev/random of=/mainpool/test.100.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 567.303935 secs (9241748 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3333 1043  2289    31%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 1438.59M
dd if=/dev/random of=/mainpool/test.101.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 604.442479 secs (8673911 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3333 1048  2285    31%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 2093.17M
dd if=/dev/random of=/mainpool/test.102.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 336.765594 secs (15568336 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3333 1052  2280    32%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 2491.40M
dd if=/dev/random of=/mainpool/test.103.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 671.916393 secs (7802876 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3333 1057  2275    32%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 3185.06M
dd if=/dev/random of=/mainpool/test.104.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 1151.609028 secs (4552656 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3333 1062  2270    32%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 4244.58M

The kernel memory keeps filling up. Performance is poor, but I'm assuming that is because the dedup table (DDT) is not yet fully in cache. I carry on doing this for an hour:
Code:
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 13001.54M
dd if=/dev/random of=/mainpool/test.141.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 905.066026 secs (5792815 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3332 1243  2089    37%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 13258.20M
dd if=/dev/random of=/mainpool/test.142.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 726.570730 secs (7215925 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3332 1248  2084    37%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 13054.64M
dd if=/dev/random of=/mainpool/test.143.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 735.318146 secs (7130084 bytes/sec)
df -g /mainpool
Filesystem 1G-blocks Used Avail Capacity  Mounted on
mainpool        3332 1252  2079    38%    /mainpool
( /root/arc_summary.pl | grep 'Kernel Memory' ) 2>/dev/null
Kernel Memory:                                 13006.72M

It levels off at about 13G; I guess the OS is holding onto the rest. So pretty much all of the kernel memory is used, yet performance has not gotten any better - it hovers around 7MB/s. I want to check that the problem is actually deduplication, so I turn off dedup on mainpool and do the same write:

Code:
test# zfs set dedup=off mainpool
test# dd if=/dev/random of=/mainpool/test.145.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 61.958975 secs (84618572 bytes/sec)

That's 84MB/s, which is fast, so this confirms the problem is the deduplication. I have the following questions:
  1. The performance of the writes fluctuates, sometimes 5MB/s and sometimes 15MB/s. Given that I am writing random data I would have expected things to be highly consistent: each 128k block is checksummed, ZFS looks the checksum up in the DDT (which will hit or miss), then writes the data. The point is, each 5000M of random data I write should be approximately the same amount of work for the computer.
  2. I don't understand why ZFS gobbled up all the RAM. I'm assuming the RAM usage went up to 13G because the primary cache was filling up. How is this possible? I don't understand how the DDT can be that big for 1.3TB of data.
    • Is there a way of seeing the size of the DDT? (My best guess at how to check this is sketched after this list.)
    • What exactly does primarycache=metadata mean? What else is stored apart from the DDT?
  3. When the mainpool was empty the random-data writes were fast, at about 64MB/s; as more random data was added to mainpool they got slower (all of the data comes from /dev/random). I can understand that to some extent because the DDT is getting bigger, and the only additional overhead I can think of is the DDT read. I was hoping that with it all in RAM the process would be very fast - is it normal to get this level of performance degradation?
  4. Are there any tools I can use to get to the bottom of what is going on? Are there ZFS logs? Can the code be compiled and instrumented?
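Regarding the DDT size question above: the closest I have found so far is zdb / zpool status. I have not yet verified these on this pool, so treat the invocations below as my best guess from the man pages rather than a confirmed answer:

Code:
test# zpool status -D mainpool   # should print a one-line "dedup: DDT entries ..." summary
test# zdb -DD mainpool           # should print the DDT histogram with in-core/on-disk entry sizes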
Get back to me if you need any clarification on my setup.

Many thanks.
 
Try your dd benchmarks without dedupe, with a total amount of data that comfortably exceeds your physical RAM. So, if you have 16GB RAM, output 20 to 32 GB to disk. That will force heavy writes to disk, which should give a more realistic upper limit on your ZFS pool's sustained throughput.

ZFS with compression and/or dedupe performs pretty poorly no matter what. The "ZFS Evil Tuning Guide" says pretty firmly not to use either unless you really need it for some reasonable scenario. After months of trying to be stingy with disk consumption, trying various permutations of compression and dedupe, I just gave up and bought more disks. The performance penalty was just not worth the gains (for me, at least).

I'm afraid I cannot answer your specific questions, as they're a little too deep for my understanding of ZFS. However, I will pass on the often-quoted advice about compression/dedupe: don't use them.

They're really cool technologies, but you pay a high price for them. I know a few NetApp storage admins, and even they won't use either unless there's a really good reason. I know ZFS is not WAFL, but you don't get something for nothing in storage.
 
I'm trying to theorise in my mind why the ZFS write speed with dedup becomes slower and slower. I'm thinking something like this is going on under the hood:

In the scenario where you have a lot of data and the DDT table is large but completely cached in ARC:

If your block checksum matches a DDT entry in RAM, the block does not need to be written to disk - just a single DDT entry update on disk. :)

However, if the block checksum does not match a DDT entry in RAM:

a) The new DDT entry is added to the ARC cache (fast)
b) The data block itself needs to be written to disk (slow)
c) The new DDT entry is added to disk (slow)
d) The tree structure of the DDT needs to be updated on disk, i.e. every node on the path down to the new entry must be rewritten on disk. (slow) :(

As the DDT gets bigger, step d) takes longer and longer.

L2ARC does not help: all caches (ARC + L2ARC) are temporary, and the DDT must still be maintained on disk.

Does this sound right? If anyone can confirm or deny it, that would be helpful. Also, am I on the best forum for ZFS postings? I see some other people posting but not that many.
 
The more unique blocks on the filesystem, the larger the DDT gets. Each and every block ever written needs to be checked against the DDT for either a hit-and-add-new-reference or a miss-and-add-new-record. This of course takes time.

Your particular benchmark (dd if=/dev/random ...) guarantees a worst-case scenario: every single block written will be unique (and incompressible, if you're benchmarking compression on that filesystem as well), the DDT's worst-case search (in the "Big-O notation" sense) will be realized, and every single block will result in a new entry in the DDT.
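If you want to see the other end of the spectrum, write the same data twice and compare. A rough sketch (file names and sizes are just examples):

Code:
# worst case: unique, incompressible data - every block is a DDT miss
dd if=/dev/random of=/mainpool/unique.out bs=1M count=1000

# best case: write identical data twice - the second copy should be nearly all DDT hits
dd if=/dev/random of=/mainpool/seed.out bs=1M count=1000
cp /mainpool/seed.out /mainpool/seed.copy

# the pool-wide ratio should move visibly after the copy
zpool get dedupratio mainpool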
 
sembsl said:
I'm trying to theorise in my mind why the ZFS write speed with dedup becomes slower and slower. I'm thinking something like this is going on under the hood:

In the scenario where you have a lot of data and the DDT table is large but completely cached in ARC:

If your block checksum matches a DDT entry in RAM, the block does not need to be written to disk - just a single DDT entry update on disk. :)

However, if the block checksum does not match a DDT entry in RAM:

a) The new DDT entry is added to the ARC cache (fast)
b) The data block itself needs to be written to disk (slow)
c) The new DDT entry is added to disk (slow)
d) The tree structure of the DDT needs to be updated on disk, i.e. every node on the path down to the new entry must be rewritten on disk. (slow) :(

As the DDT gets bigger, step d) takes longer and longer.

Pretty much. There's more to it than that behind the scenes, but that's the general (worst-case) path.

If you have all unique (random) data in the pool, then dedupe is worthless, and performance will tank as every write to the pool requires at least 3 writes (data, metadata, DDT entry).

If you have highly duplicated data, then writes are about normal (metadata + DDT entry update instead of metadata + data).

L2ARC does not help: all caches (ARC + L2ARC) are temporary, and the DDT must still be maintained on disk.

L2ARC *does* help, as the DDT can be cached in the L2ARC, so finding existing entries in the table (for reads and writes) is much faster than reading a small chunk of the DDT into the ARC, flushing that chunk, reading the next chunk of the DDT into the ARC, flushing that chunk, etc.

Just try to use dedupe on a pool with only 16 GB of RAM and no L2ARC of any kind, and see just how poorly things run. :) You'll notice *A LOT* of IOps, with very low MBps, as the pool thrashes continuously trying to keep the DDT in the ARC.
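If you want to watch that happening while one of those dd runs is in flight, something like this (interval in seconds, adjust to taste) makes the high-IOps/low-MBps pattern obvious:

Code:
zpool iostat -v mainpool 5   # per-vdev operations/sec and bandwidth, every 5 seconds
gstat -a                     # per-disk busy%, ops/s and KBps at the GEOM level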

Also, am I on the best forum for ZFS postings? I see some other people posting but not that many.

This is one of the better places for ZFS-on-FreeBSD discussions. For more general ZFS discussion, the zfs-discuss mailing list may be better (although the members on that list tend to lean more toward Solaris-based OSes than FreeBSD).
 
One does not use dedupe technologies to increase performance. One uses them to save disk space and storage costs, hopefully without impacting performance too much.

For example, we use ZFS dedupe on our backups servers. The rsync write performance is good enough to back up all our servers between 5 pm and 7 am, with several hours to spare / grow into in the future. Moreover, one backup server has a (just over) 5x dedupe/compress ratio. It's using 34 TB of disk space to store over 150 TB of data. There's no way we could afford a box with 150 TB of storage in it. :)
 
Using dd with /dev/random isn't really a fair test of a filesystem that has dedup enabled :) You're essentially feeding it "worst possible case" data, where there is basically no duplication. Conversely, neither is using /dev/zero :) - which would be very much an "ideal case" (all data is duplicate).

Have you tried it with some real-world test data?
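For example, something along these lines (the paths are just examples - use whatever is actually representative of your workload):

Code:
# copy a realistic data set (source trees, home dirs, mail spools, ...) instead of random bytes
tar -cf - /usr/src /usr/ports | dd of=/mainpool/realworld.tar bs=1M
# then see what you actually gained
zpool get dedupratio mainpool
zfs get compressratio mainpool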

I don't run ZFS in production at the moment, but I DO run a Netapp FAS2240 and have both compression and dedup enabled on various shares.

Netapp works a bit differently as it doesn't do dedup inline (it's a scheduled job) - and on the Netapp it can actually help read performance in certain situations as dedup blocks share cache entries - you make better use of your cache (assuming you have duplicated data to dedup).


At the end of the day, as with any storage subsystem knob, it pays to verify first whether you'll actually get any benefit (and do it with example real-world data - /dev/random is not real world!). As above, nothing comes for "free".
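If I remember correctly, zdb can even estimate in advance whether dedup would pay off for data already sitting in a non-dedup pool, by simulating the DDT (this can take a while and use a fair amount of RAM, so run it during a quiet period):

Code:
zdb -S mainpool   # simulate dedup and print the DDT histogram plus an estimated dedup ratio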

But - using random data you're getting all of the cost with none of the benefit - no duplicate data! If this is representative of your real world data (my bet is that it isn't, but if it is) - then common sense dictates you should be turning dedup OFF.

Ditto for compression - random data does not compress very well at all - there's no point burning RAM and CPU on it.
 
We use EMC Data Domain 'deduplication boxes' for backups; currently the data stored on them is about 110 TB but the actual physical usage is less than 7 TB, so the dedup ratio is about 15:1. Data Domain is very similar to ZFS here: first it deduplicates the data, then it compresses it. We store backups of 100+ FreeBSD/AIX/Linux systems, another 100+ Windows systems, 30+ Oracle/DB2/MSSQL databases and so on.

So as You see, if You put enough similar data on the 'deduplication box' it dedupes quite nicely.
 
I've read that in order to use dedup you need lots of memory. I've used it for 1.5TB (the dedup ratio was x3) with only 8GB of RAM and it worked very slowly. In my case I've read I would need 24GB of RAM for 1.5 TB of data with a high dedup ratio. So with 110 TB of stored data, how much memory do you have?
 
overmind said:
I've read that in order to use dedup you need lots of memory. I've used it for 1.5TB (the dedup ratio was x3) with only 8GB of RAM and it worked very slowly. In my case I've read I would need 24GB of RAM for 1.5 TB of data with a high dedup ratio. So with 110 TB of stored data, how much memory do you have?

We have the D860 model with 36GB RAM and dual quad-core Intel Xeon CPUs; here are some details if You want:
http://open-systems.ufl.edu/static/asr-pub/DD670_DD860_DD890_HW_Overview_775-0186-0001.pdf

We have the Data Domain (the 2U unit in the middle of the picture below, yes, covered by a dummy cover) with two expansion enclosures, 3U and 24TB each (30TB RAW capacity). The Data Domain D860 itself does not have any storage; it's just a 'compute box'.

[Image: datadomain.jpg]


About RAM for ZFS: ZFS keeps all DDT entries in RAM (if possible) because it deduplicates online, and this is why ZFS needs that much RAM for deduplication. Compare that to DragonFly BSD's HAMMER, which does offline deduplication, which means You do not have to keep all this stuff in RAM - just run a batch job nightly to deduplicate content. Sometimes You can get away with even 1GB of RAM for a lot of data ;).

IMHO it would be great if ZFS allowed offline deduplication which would later be 'applied' in the zpool scrub process, but that is only my pipe dream.
 
@vermaden: Thank you! I've asked about RAM because I've tried dedup and it worked badly for me (of course, I've tested only with 8G RAM, but the pool was not big then), and I've also read this:

http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

"For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K. This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS."

So for example if my pool is 10TB, I would need 200GB RAM ?!
This is too much I think...
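If I understand where that 5 GB per TB figure comes from, it is roughly this arithmetic (assuming something like 320 bytes of in-core DDT entry per unique block - a number I have seen quoted elsewhere, not measured myself):

Code:
# unique 64K blocks in 1 TB of data
echo $(( 1024 * 1024 * 1024 / 64 ))                       # 16777216 blocks
# in-core DDT size for those blocks, in MB
echo $(( 1024 * 1024 * 1024 / 64 / 1024 / 1024 * 320 ))   # 5120 MB, i.e. ~5 GB per TB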
 
overmind said:
So for example if my pool is 10TB, I would need 200GB RAM ?!
This is too much I think...

Depends on the data. If You would like to keep a lot of small files with, for example, a 1K block size, then even that may be too little for 10TB; but if You have more large files than small ones and You use a 128K block size, then You can get away with, for example, 4GB of RAM per 1TB.

You can also use a 128GB SSD for ZFS L2ARC, so if the DDT does not fit in RAM it will be kept in the L2ARC on a fast SSD; that way You will limit the performance hit.
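Adding the SSD as L2ARC and steering it toward metadata (so it mostly holds the DDT) is just something like this (pool and device names are examples):

Code:
zpool add mypool cache ada6p1
zfs set secondarycache=metadata mypool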
 
I had 1.5 TB of data on the pool with dedup (in fact it was 3.5TB of real data), mostly source code and small files. When I copied (wrote) to that pool I got a 2-3 MB/s transfer rate, on an 8GB RAM machine. And since I did not have an SSD at that time, I tried to use ZFS L2ARC on a USB flash drive, but it did not help a lot (a 2-3MB/s increase in speed).

Maybe the best approach is to force a big block size from the start? I'll try the same setup I had, but with 32GB RAM and a 7TB pool, to see if it works. If not, I'll add an SSD, but I am curious whether it will burn out over time, because I think the L2ARC will write to the SSD a lot.
 
overmind said:
Maybe the best approach is to force a big block size from the start?
Well, the big block (128k) will be used for big files, but when a file is only 1k in size it will use a 1k block. Even if You set a 128k default block size, small files will use smaller block sizes adjusted to their size.
 
overmind said:
@phoenix:

What is the amount of RAM installed in that backup server?

Server for non-school backups: 64 GB
Server for school backups: 28 GB (with 32 GB on order, to bring it to 48 GB)
Server for groupware backups: 64 GB (dedup disabled)
Server for off-site replication: 128 GB

The non-school backups server backs up 62 servers (including VMs).

The school backups server backs up 73 servers (all remote).

The groupware backup server backs up 1 server (it's separate due to the "multiple files per e-mail, plus databases" setup, which makes it extremely I/O intensive).

And then ZFS snapshots are sent from each of those 3 to the off-site replication server each day.
 
overmind said:
@vermaden: Thank you! I've asked about RAM because I've tried dedup and it worked badly for me (of course, I've tested only with 8G RAM, but the pool was not big then), and I've also read this:

http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

"For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K. This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS."

So for example if my pool is 10TB, I would need 200GB RAM ?!
This is too much I think...

Don't know where he's pulling his numbers from. Every discussion about dedup on the ZFS mailing lists ends with various different calculations, always coming out with the same rule of thumb: 1 GB of ARC for every 1 TB of unique data.
 
overmind said:
I had 1.5 TB of data on the pool with dedup (in fact it was 3.5TB of real data), mostly source code and small files. When I copied (wrote) to that pool I got a 2-3 MB/s transfer rate, on an 8GB RAM machine. And since I did not have an SSD at that time, I tried to use ZFS L2ARC on a USB flash drive, but it did not help a lot (a 2-3MB/s increase in speed).

Maybe the best approach is to force a big block size from the start? I'll try the same setup I had, but with 32GB RAM and a 7TB pool, to see if it works. If not, I'll add an SSD, but I am curious whether it will burn out over time, because I think the L2ARC will write to the SSD a lot.

Dedup works best with larger files, or faster storage, as the amount of random I/O required for dedup is HUGE! And if you're using raidz vdevs, random I/O is slowed down even more.

To get the absolute best performance out of ZFS dedupe:
  • fill the box with as much RAM as it can hold
  • add a super-fast SSD for L2ARC
  • use mirror vdevs
  • use the fastest harddrives you can get (or use all SSDs)
  • fill the pool with as much identical data as possible to keep the DDT small

Deviating from any of those (not enough RAM, no L2ARC, raidz vdevs, "green" HDs, all unique data, etc) will slow things down.

Dedupe access is all random I/O, so you need to build a pool that can handle lots of random I/O.
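A pool built along those lines would look something like this (device names are placeholders; size the cache SSD so it can comfortably hold your DDT):

Code:
zpool create fastpool \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    cache da6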
 
I've recently assembled a backup machine for work to hold our weekly backups of the primary fileserver. We're backing up two file-systems totalling around 15 TB.

The original configuration I used:
  • 1x X5650 CPU
  • 8x 2 TB green drives (raidz on top of GELI encryption)
  • 48 GB RAM
  • 120 GB Vertex 2E L2ARC
  • compression=gzip, dedup=on, primarycache=metadata, secondarycache=metadata
The zfs send of the first file-system (about 7 TB) was fine since, as you have probably guessed, the DDT fit into RAM. While sending the second file-system we noticed massive slow-downs (around 7 MB/s). Since this meant backups would take weeks, I revised the configuration.

The new configuration:
  • 2x X5650 CPUs
  • 160 GB RAM
  • 8x 2 TB green drives (raidz on top of GELI encryption)
  • 120 GB Vertex 2E L2ARC
  • compression=gzip, dedup=on, primarycache=metadata, secondarycache=metadata, vfs.zfs.arc_max=128G
This configuration can happily fill the disks at a constant 70 MB/sec. I think the 5 GB of RAM per 1 TB of storage rule is roughly right, but for file-systems with many small files (such as ours, which include home directories and svn/git checkouts) I find 8 GB is closer to the truth. Due to encryption plus dedup plus compression the machine does run at a system load of ~32, but that is to be expected.
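For reference, the settings above boil down to roughly the following (the dataset name is a placeholder, and I am quoting the loader tunable from memory):

Code:
zfs set compression=gzip backup
zfs set dedup=on backup
zfs set primarycache=metadata backup
zfs set secondarycache=metadata backup
# /boot/loader.conf
vfs.zfs.arc_max="128G"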

Currently the dedup ratio is about 2.5:1 but this should improve as we send further snapshots to it.
 
You are using 8 x 2 TB green drives, but did not mention the manufacturer ;) From the Western Digital website:
Recommended use.

WD Green hard drives are tested and recommended for use in PCs as secondary storage, external enclosures and other applications that require cool and quiet operation.*

*Desktop drives are not recommended for use in RAID environments, please consider using WD Red hard drives for home and small office 1-5 bay NAS systems and WD Enterprise hard drives for rackmount and >5 bay NAS systems.
 
They are indeed WD green drives :)

We do weekly backups onto a set of eight drives then send them off-site. We have four sets in rotation so no single set gets pounded continuously. If they were constantly in use I'd prefer NL SAS drives but for backups these seem to be fine.

I'll have to see how easy it is to get LZ4 support into our FreeBSD install; we're using 9.1-RELEASE. I'm guessing it's just a kernel patch? LZJB gives us a ~2.5:1 compression ratio on the main file server and GZIP gives us ~3.7:1 on the backup server. I'll try setting GZIP-9 next time I create a disk set to see if it makes much of a difference.
 
J65nko said:

This is due to lack of TLER support, which I believe ZFS does not use.

But yes, I had seen varying degrees of RAID-related failures (with the built-in Intel fake-RAID under Windows) on a number of WD Blacks before I discovered this issue.

The same Blacks have been flawless for 18 months so far with ZFS.
 