My ZFS V28 benchmarks

hi all

I spent the last few months building and refining a home server running ZFS V28 on 8.2-STABLE. I previously used Windows Home Server, where I enjoyed the ability to store a large number of backups with deduplication. After the decision to drop Drive Extender from the new Home Server, I decided to go for FreeBSD and ZFS, which gave me similar features. The ability to extend a pool was important, since it allows huge single Samba shares (e.g. a movies directory of 5 TB).

The system has an Asus P8P67 motherboard with an Intel i5-2500 CPU @ 3.30GHz and 16 GB RAM. I started with 8 GB but would recommend 16 GB if you have a large dedup table (e.g. I back up several different PCs to a directory on this server that has dedup and compression enabled).

The system boots off a 1 TB Seagate drive (ST31000524AS JC45) on UFS. The performance of this HDD is as follows:
Code:
# diskinfo -c -t -v ada1
ada1
        512             # sectorsize
        1000204886016   # mediasize in bytes (931G)
        1953525168      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        1938021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        5VP7VE4K        # Disk ident.

I/O command overhead:
        time to read 10MB block      0.082120 sec       =    0.004 msec/sector
        time to read 20480 sectors   2.698682 sec       =    0.132 msec/sector
        calculated command overhead                     =    0.128 msec/sector

Seek times:
        Full stroke:      250 iter in   5.819074 sec =   23.276 msec
        Half stroke:      250 iter in   3.890680 sec =   15.563 msec
        Quarter stroke:   500 iter in   6.128329 sec =   12.257 msec
        Short forward:    400 iter in   2.114499 sec =    5.286 msec
        Short backward:   400 iter in   2.666716 sec =    6.667 msec
        Seq outer:       2048 iter in   0.105536 sec =    0.052 msec
        Seq inner:       2048 iter in   0.280450 sec =    0.137 msec
Transfer rates:
        outside:       102400 kbytes in   0.820216 sec =   124845 kbytes/sec
        middle:        102400 kbytes in   0.971817 sec =   105370 kbytes/sec
        inside:        102400 kbytes in   1.592233 sec =    64312 kbytes/sec

I then created a raidz of 5 x 2 TB Seagate drives (ST2000DL003-9VT166 CC32). I aligned these drives to 4K (just in case) and chose the 4 data + 1 parity layout for optimization reasons I read about in this thread.
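
For reference, this is roughly how such a 4K-aligned raidz can be created with gnop(8); the device names and the pool name "tank" below are just placeholders:
Code:
# gnop create -S 4096 ada3 ada4 ada5 ada6 ada7       # temporary providers reporting 4K sectors
# zpool create tank raidz ada3.nop ada4.nop ada5.nop ada6.nop ada7.nop
# zpool export tank
# gnop destroy ada3.nop ada4.nop ada5.nop ada6.nop ada7.nop
# zpool import tank                                  # the pool keeps ashift=12 on the raw disks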

The performance of the 2 TB Seagates is as follows:
Code:
# diskinfo -c -t -v ada3
ada3
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        5YD4E2K9        # Disk ident.

I/O command overhead:
        time to read 10MB block      0.092707 sec       =    0.005 msec/sector
        time to read 20480 sectors   2.618028 sec       =    0.128 msec/sector
        calculated command overhead                     =    0.123 msec/sector

Seek times:
        Full stroke:      250 iter in   6.634518 sec =   26.538 msec
        Half stroke:      250 iter in   4.936454 sec =   19.746 msec
        Quarter stroke:   500 iter in   7.973672 sec =   15.947 msec
        Short forward:    400 iter in   2.845359 sec =    7.113 msec
        Short backward:   400 iter in   2.793776 sec =    6.984 msec
        Seq outer:       2048 iter in   0.123256 sec =    0.060 msec
        Seq inner:       2048 iter in   0.108371 sec =    0.053 msec
Transfer rates:
        outside:       102400 kbytes in   0.721418 sec =   141943 kbytes/sec
        middle:        102400 kbytes in   0.849894 sec =   120486 kbytes/sec
        inside:        102400 kbytes in   1.562353 sec =    65542 kbytes/sec

Finally, I recently got a Corsair 60 GB SSD (Corsair CSSD-F60GB2-A 2.1b). Its performance is as follows:
Code:
# diskinfo -c -t -v ada2
ada2
        512             # sectorsize
        60022480896     # mediasize in bytes (55G)
        117231408       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        116301          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        1111650335000462002D    # Disk ident.

I/O command overhead:
        time to read 10MB block      0.042269 sec       =    0.002 msec/sector
        time to read 20480 sectors   1.353079 sec       =    0.066 msec/sector
        calculated command overhead                     =    0.064 msec/sector

Seek times:
        Full stroke:      250 iter in   0.029324 sec =    0.117 msec
        Half stroke:      250 iter in   0.034488 sec =    0.138 msec
        Quarter stroke:   500 iter in   0.105246 sec =    0.210 msec
        Short forward:    400 iter in   0.084277 sec =    0.211 msec
        Short backward:   400 iter in   0.081533 sec =    0.204 msec
        Seq outer:       2048 iter in   0.074574 sec =    0.036 msec
        Seq inner:       2048 iter in   0.073741 sec =    0.036 msec
Transfer rates:
        outside:       102400 kbytes in   0.453352 sec =   225873 kbytes/sec
        middle:        102400 kbytes in   0.490449 sec =   208788 kbytes/sec
        inside:        102400 kbytes in   0.409787 sec =   249886 kbytes/sec

The SSD, on the same controller, provides a 2 GB ZIL and a 40 GB L2ARC. I also reserved some space for a swap partition, but am not using it yet. I aligned the partitions according to this thread.

Code:
# gpart show ada2
=>        0  117231408  ada2  BSD  (55G)
          0     129024        - free -  (63M)
     129024    4194304     1  freebsd-zfs  (2.0G)
    4323328   83886080     2  freebsd-zfs  (40G)
   88209408   27262976     4  freebsd-swap  (13G)
  115472384    1759024        - free -  (858M)
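
Attaching the two SSD partitions to the pool is then one command each. A minimal sketch, assuming the pool is called tank and the ZIL and L2ARC partitions show up as ada2a and ada2b:
Code:
# zpool add tank log ada2a      # the 2 GB partition becomes the dedicated ZIL (slog)
# zpool add tank cache ada2b    # the 40 GB partition becomes the L2ARC
# zpool status tank             # the log and cache vdevs should now be listed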

I then used bonnie++ version 1.96 and these are the results:
Code:
Configuration               Block Output    Rewrite    Block Input     Random Seeks
                                MB/s          MB/s         MB/s               /s 
UFS                             89.8          38.1         97.8             171.2
ZFS                            185.5         118.0        275.4             186.7
ZFS+dedup                       25.1          15.6        248.8             288.0
ZFS+gzip                       282.5         212.1        953.6             450.7
ZFS+gzip+dedup                  21.5          16.7        650.7             333.9
ZFS+ZIL+L2ARC                  231.8         129.0        256.8             371.2
ZFS+ZIL+L2ARC+dedup             21.2          18.3        318.8             568.6
ZFS+ZIL+L2ARC+gzip             289.3         233.4        903.4             530.2
ZFS+ZIL+L2ARC+gzip+dedup        41.6          25.9        818.9             723.0

Dedup was done with SHA256. As can be seen, dedup has a large performance impact on writes but not as much on reads. The L2ARC seems to have a nice positive impact on read speed.
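
For the dedup and compression runs the properties were simply set per dataset, roughly like this (pool and dataset names are placeholders):
Code:
# zfs create tank/backups
# zfs set dedup=sha256 tank/backups        # block-level dedup using SHA256 checksums
# zfs set compression=gzip tank/backups    # gzip compression (default level 6)
# zfs get dedup,compression tank/backups   # verify the settings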

Any comments?

regards
Malan
 
aragon said:
You created your pool using gnop(8)? Does your ashift show a value of 12?

hi

Yes, I created it using gnop(8). I now see the ashift is 9 on the SSD and 12 on the pool HDDs.

Any ideas on the dedup results would be interesting; for the rest I think the performance is OK.

regards
Malan
 
To get good performance out of dedupe you absolutely NEED tonnes of RAM. You need to keep the DDT (the dedup table) in RAM (as part of the ARC). If you don't have enough RAM to keep the DDT in the ARC, then you need to configure an L2ARC device. The DDT will "spill over" into the L2ARC, which will be faster than pulling the DDT (in chunks) from the pool.

You can see just how much ARC space you need for the DDT by looking at the output of # zdb -DD <poolname> and multiplying the total number of unique blocks (last line, first column) by 276 bytes (I think; it might be 376 bytes).
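
As a rough worked example (the block count here is invented): if # zdb -DD tank reports about 20 million unique blocks in the DDT, then:
Code:
20,000,000 blocks x 276 bytes = ~5.5 GB of ARC needed for the DDT
20,000,000 blocks x 376 bytes = ~7.5 GB of ARC needed for the DDT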
 
Goose997 said:
The dedup was done with SHA256. As can be seen, there is a large performance impact during writing but not so much during reading. It seems L2ARC does have a nice positive impact on read speed.

Any comments?

regards
Malan
In v28 you can put the ZIL on a RAM disk without risk. Why don't you do it?
 
That works until your server crashes: then you lose the data in RAM, the pool can't read the ZIL on boot, and any data that was in there is gone.

There's a reason everyone suggests using enterprise-grade SSDs with onboard capacitors for ZIL.
 
phoenix said:
That works until your server crashes: then you lose the data in RAM, the pool can't read the ZIL on boot, and any data that was in there is gone.

There's a reason everyone suggests using enterprise-grade SSDs with onboard capacitors for ZIL.

SSDs are very unreliable too...
 
hi

My understanding from the ZFS Best Practices Guide is that for ZFS V28 mirrored or enterprise-grade ZILs are not a strict requirement any more:
  • Mirroring the log device is recommended. Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool might be lost or you might lose several seconds of unplayed writes, depending on the failure scenario.
  • In current releases, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command.


I don't use the ZIL much - all my daily access is over Samba, which does not use the ZIL. Up to now I have been too lazy to set up NFS access.

I guess the unmirrored ZIL is a bit of a calculated risk - does anybody have more information specific to V28?
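
For reference, the graceful removal (and later re-adding) of a dedicated log device is a single command each way; the pool and device names below are just placeholders:
Code:
# zpool remove tank ada2a     # detach the SSD log; the ZIL falls back to the main pool
# zpool add tank log ada2a    # re-attach it later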

Thanks
Malan
 
I remember reading a Solaris blog a while ago that suggested 5 GB of RAM per 1 TB of storage for dedupe. I can't find it today, but the FreeBSD wiki now actually suggests the same value: http://wiki.freebsd.org/ZFSTuningGuide

For 8-10 TB of disks you'll want around 48 GB for it to really run well. As mentioned on the wiki above, you're better off using compression if you can.

The idea of using RAM as a ZIL device is nonsense. The whole point of the ZIL is to record sync writes (i.e. those that ZFS has guaranteed to the applications as written) so that in the event of a power failure they can be replayed. If you're using a RAM disk, you'd be better off forgetting the ZIL and setting sync=disabled on your datasets.
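
A minimal sketch of that alternative (the dataset name is a placeholder); it carries the same data-loss risk on power failure as a RAM-backed ZIL would:
Code:
# zfs set sync=disabled tank/scratch    # sync writes are acknowledged immediately and only live in RAM
# zfs get sync tank/scratch             # verify; set back to "standard" to restore normal behaviour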

I'm surprised that ZFS can supposedly withstand the loss of a log device with no ill effect, but it is plausible. The current transaction is stored in RAM. Sync writes are written to the ZIL but only read after a power failure. If the ZIL fails while the server is up, all the records waiting to be written should still be in RAM, so it should be possible to carry on and create a new ZIL on disk to record future transactions. I'd definitely have to try it before trusting it, though, and remember this relies on the server staying up. I would expect ZFS to complain if your server crashes with a ZIL and comes back up without one (as with a RAM disk), because it will most likely have lost writes that it has guaranteed to clients. If you're lucky you'll be able to manually import the pool and rewind to an earlier transaction group.
 
It's X GB (I've seen 1, 2, and 5 bandied about) of ARC space per 1 TB of unique data. The more unique data you have (and so the more DDT entries), the more ARC space you need. If you have extremely dedupable data, then you don't need as much ARC space.

ARC space includes L2ARC. So you don't need 48 GB of RAM, you need 48 GB of ARC+L2ARC. The DDT will be stored (well, cached) on the cache device for speedier access than reading it from disk. Of course, you can't just stuff a 500 GB cache device into a system with 8 GB of RAM and expect dedupe to work, since you need ARC space to track the contents of L2ARC. But you can stuff a 100 GB cache device into a system with 16-32 GB of RAM, instead of using 48-64 GB of RAM.

For example, we have a storage server with a 3.68x dedupe ratio and a 1.55x compression ratio, for a combined disk saving of over 5.33x. It only has 20 GB of RAM (16 GB for ARC), with a 32 GB L2ARC, and over 13.0 TB of storage space in use with 24.5 TB of free space. The ARC size rarely goes above 14 GB.

Our other storage server has a 2.13x dedupe ratio and a 1.40x compression ratio, for a combined disk saving of 2.72x. It has 24 GB of RAM (20 GB for ARC), with a 32 GB L2ARC, just under 6 TB of storage space allocated, and 27.2 TB free. ARC use on this system rarely goes above 8 GB.

I find a good rule of thumb to start with is "1 GB of ARC per TB of data". Then you monitor your system to see whether it needs more RAM or not. We started with 8 GB in each of the servers above, then expanded them to 16 GB, then to 20 GB and 24 GB (the most we could get into 8 RAM slots based on the RAM sticks we had on hand). Down the road we may need to push that to 32 GB or 64 GB (the most we can put in with a single CPU socket in use). If we find we need more than that, we'll need to get another CPU. :)
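
A quick way to keep an eye on this on FreeBSD is the ZFS kstat sysctls, for example:
Code:
# sysctl kstat.zfs.misc.arcstats.size         # current ARC size in bytes
# sysctl kstat.zfs.misc.arcstats.c_max        # configured ARC maximum
# sysctl kstat.zfs.misc.arcstats.l2_size      # amount of data cached on the L2ARC device
# sysctl kstat.zfs.misc.arcstats.l2_hdr_size  # RAM spent just tracking the L2ARC contents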
 
Hi all,

I'm posting in this thread since it's about bonnie tests.

FS2-7:
Code:
[B][U]HW[/U][/B]
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
10x SAMSUNG HD204UI (in a raidz2 zpool)
1x  OCZ Vertex 3 240GB (L2ARC)

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
[CMD="#"]zpool get version pool1[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default

[CMD="#"]bonnie++ -d . -s 16g[/CMD]
Write  Rewrite  Read
34     44       413  MB/s


MAIN:
Code:
[B][U]HW[/U][/B]
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
8x  SAMSUNG HD103SI (in a raidz2 zpool)

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
[CMD="#"]zpool get version pool1[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default

[CMD="#"]bonnie++ -d . -s 16g[/CMD]
Write  Rewrite  Read
316    156      290  MB/s


I'm shocked to see the difference in write speed between these two machines, which are so similar in hardware. MAIN scores nearly ten times better write performance, even though it has fewer spindles in the pool. I'm getting the feeling I must have missed something big on FS2-7. Any thoughts?

/Sebulon
 
@Sebulon

Check the output of this command on both systems: % zdb | grep ashift

SAMSUNG HD204UI are 4k (4096B) drives, which should have ashift=12
SAMSUNG HD103SI are 512B drives, which should have ashift=9

I have also read that if the ZFS pool is not aligned properly (ashift=9 on 4k drives), then these two settings in the /boot/loader.conf file help a lot:
[CMD=""]vfs.zfs.vdev.min_pending=1
vfs.zfs.vdev.max_pending=1[/CMD]
 
@vermaden

FS2-7
Code:
[CMD="#"]zdb | grep ashift[/CMD]
            ashift: 12

MAIN
Code:
[CMD="#"]zdb | grep ashift[/CMD]
            ashift: 9

/Sebulon
 
@vermaden

I fail to understand. Are you saying that a (read) cache device has that big a negative impact on write performance?

/Sebulon
 
@vermaden

FS2-7
Code:
[CMD="#"]bonnie++ -d . -s 16g[/CMD]
Write  Rewrite  Read
34     44       413  MB/s
[CMD="#"]zpool remove pool1 gpt/cache1[/CMD]
[CMD="#"]bonnie++ -d . -s 16g[/CMD]
Write  Rewrite  Read
17     34       317  MB/s

The results actually became worse when removing the cache device.

/Sebulon
 
I thought the additional SSD L2ARC was on the faster system ... I would dig deeper into that problem; it's definitely not normal to see such degraded performance. What is the space utilization on these pools?
 
@vermaden

FS2-7
Code:
[CMD="#"]zpool list[/CMD]
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool1  18.1T  14.3T  3.87T    78%  1.59x  ONLINE  -

MAIN
Code:
[CMD="#"]zpool list[/CMD]
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool1  7.25T  4.16T  3.09T    57%  1.00x  ONLINE  -

There are file systems with dedup=on on FS2-7, but the bonnie tests were made on file systems without it.

/Sebulon
 