Benchmark results: 20 TB RAID-Z3

I have been running combinatoric disk performance tests on my system, exploring performance as a function of RAID-Z level and the number of disks allocated to the pool, based on observations by throAU in this thread. The testing methodology was inspired by the analysis done at calomel.org, and my results (as measured by bonnie++) are very similar to what they found:
  1. Increasing the pool size has significant performance benefits, particularly for reads
  2. Increasing the RAID-Z level has a modest performance cost on writes
[Chart: bonnie++ write, rewrite, and read throughput (MB/sec) for each RAID-Z level and number of data disks]


Calomel lists extensive tweaks for both ZFS and the LSI controllers. The only modification I have made to my system is zfs set atime=off. This is partly because I have too many other things to do, but also a philosophy of leaving alone things that are functioning satisfactorily; for my needs the performance is more than adequate, so I would rather not tinker too heavily with the system. I have configured my primary pool as RAID-Z3 with 11 x 3 TB drives, which provides 20 TB of usable storage.
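For completeness, that single tweak is just the following (abyss being the pool name used below); the second command merely confirms the property took:
Code:
zfs set atime=off abyss
zfs get atime abyss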

However, I have also been trying to implement geli on top of ZFS. Judging from the majority of postings I have found, most users first provision all drives with geli and then put ZFS on top of the encrypted devices. I do not want to encrypt the entire pool, however, and have instead been testing geli on top of ZFS (a GELI provider on a zvol carved from the pool). The benchmarks there are significantly worse: roughly a threefold hit for a UFS provision, and up to sixfold for an encrypted zpool on top of a zvol. The results below are on top of the RAID-Z3 11-drive pool:

Creating an encrypted UFS file system: Write/Rewrite/Read = 90 / 30 / 150 MB/sec
Code:
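# Create a 2 TB zvol, layer GELI on it, then put UFS on the encrypted device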
zfs create -V 2TB abyss/shadow
geli init -s 4096 /dev/zvol/abyss/shadow
geli attach /dev/zvol/abyss/shadow
newfs /dev/zvol/abyss/shadow.eli
mkdir /private
chmod 0777 /private/
mount /dev/zvol/abyss/shadow.eli /private
Creating an encrypted zpool (per matoatlantis): Write/Rewrite/Read = 71 / 26 / 78 MB/sec
Code:
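# Same zvol + GELI layering, but with a nested zpool on the encrypted device instead of UFS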
zfs create -V 2TB abyss/shadow2
geli init -s 4096 /dev/zvol/abyss/shadow2
geli attach  /dev/zvol/abyss/shadow2
zpool create private2 /dev/zvol/abyss/shadow2.eli

The Xeon E5606 supports AES-NI, so I had hoped for better performance. If anyone has thoughts on improving performance of an encrypted provision on top of ZFS, I would be very interested in hearing it. Also, if this approach is flat-out a Bad Idea, I would be interested to know that, too.
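In the meantime, for reference, the quick checks I am using to see whether GELI picked up hardware crypto are along these lines (geli list prints a Crypto: line for each attached provider):
Code:
geli list | grep -E 'Geom name|Crypto'
dmesg | grep -i aesni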

The combinatoric data were generated by a randomization script I will post below. All data points have three replicates. The standard deviation was quite tight for writes and rewrites (generally +/- 1-5 MB/sec), a little looser for reads. Raw data:
Code:
Pool	Write	Rewrite	Read	WriteSD	RewriteSD	ReadSD
Z1+2	194	91	183	8	3	10
Z1+4	372	139	322	1	1	9
Z1+8	400	177	377	5	2	5
Z2+2	182	90	201	1	2	10
Z2+4	356	152	396	2	0	22
Z2+8	396	203	479	0	4	5
Z3+2	173	111	361	7	3	41
Z3+4	333	149	378	7	3	5
Z3+8	332	210	465	5	2	7
 
Here is the Perl script I used to help automate the process, if anyone else wishes to perform the tests on their own system.

CAUTION: Running the Perl script by itself is not destructive; it merely generates a shell script. When that shell script is run, however, it will perform destructive operations on your system! If you find either script confusing, do not run them!

Code:
#!/usr/bin/perl -w

# Copyright Charles Tilford, 2013
# May be freely used or modified under the Perl Artistic License 2.0

# Yes, I know this might be more elegantly done with a shell script.
# However, whenever I write more than 15 lines in bash somewhere a child
# drops their ice cream cone or a major economy collapses.

use strict;

# Lines you need to change, or may wish to:

my $user      = "someUser"; # You WILL need to change to your user name
my $poolName  = "testPool"; # You may want to change, but not necessary
my $volName   = "testVol";  # You may want to change, but not necessary
my $devPrfx   = "ZfsDrive"; # Needs to match the device labels you used
my $devNumMin = 0;          # Will be appended to $devPrfx
my $devNumMax = 11;         # Will be appended to $devPrfx

my $caution = <<EOF;

This Perl script will not actually run any commands. It will generate
a shell script file that, when run, will execute the commands.
 
However, you will need to first change a few parameters and
then uncomment the "safety" line that is preventing the file from
being generated. Please alter this code, generate the file, verify
that it looks good, and happy benchmarking!
 
The shell script will contain additional comments on how to prepare
your devices to have the geometry and labels set. 
  
EOF

########################## SAFETY ####################################
#                                                                    #
# The line below is the "safety". Comment it out to run the program. #
#                                                                    #
########################## SAFETY ####################################

&msg(split(/[\n\r]+/, $caution)); exit;





my $sudo      = 0;          # Set to 1 if you plan to run as not-root
# I have not tested running with sudo; if commands take too long it is
# possible that sudo credentials may expire, requiring you to
# repeatedly re-enter them

my $reps      = 3;
my $cmdFile   = "ZfsTestingCommands.sh";
my $tsvFile   = "ZfsTestResults.tsv";
my $statFile  = "ZfsStatus.txt";
my $bash      = `which bash`;
chomp($bash); # strip the trailing newline so the generated shebang is clean

srand(time() ^ ($$ + ($$ << 15)));

my $totDevs   = $devNumMax - $devNumMin + 1;
# You can replace this with a literal array:
my @numRange  = ($devNumMin..$devNumMax);
# You could replace this with a literal array, too, if you did not have
# a sequential order of disks, eg
# @devices = ('abc0', 'abc1', 'xyz4', 'xyz5');
my @devices   = map { $devPrfx . $_ } @numRange;

$sudo         = $sudo ? "sudo " : "";

# Can alter the following settings if you wish to test different combinations
my @raidzLvls = (3,2,1); 
my @diskLvls  = (8,4,2);

my $rzTxt     = join(',', @raidzLvls);
my $dlTxt     = join(',', @diskLvls);
my $combTot   = $reps * scalar(@raidzLvls) * scalar(@diskLvls);
my $runHours  = int(10 * $combTot / 4) / 10;


open(CMDFILE, ">$cmdFile") || 
    die "Failed to write command file:\n  $cmdFile\n  $!\n  ";

my $intro = <<EOF;

This file is designed to exhaustively test RAIDZ configurations on a FreeBSD
system. RUNNING THIS FILE IS DESTRUCTIVE. It will create and destroy zpools,
so under no circumstances should it be run unless you are ABSOLUTELY SURE
you know what it is doing. The file requires the zfs service to be enabled
and started, and it requires the 'bonnie++' package.
 
The file has been generated presuming that there are $totDevs devices that have
been prepared with '$devPrfx' labels. The method I used to prepare the drives
was taken from http://savagedlight.me/2012/07/15/freebsd-zfs-advanced-format/
where for each disk 'xyz#' I ran (as root):
 
gpart create -s gpt xyz5
gpart add -a 1m -t freebsd-zfs -l ZfsDrive5 xyz5
 
The commands below will establish zpools for testing, and configure them to
use 4 kb sectors as described in the savagedlight.me posting. These pools
are then tested using bonnie++ to generate disk IO statistics for each
configuration. A total of $reps bonnie++ runs will be performed for all
combinations of:
 
RAIDZ : $rzTxt
Data Disks : $dlTxt (the number of non-parity disks in the zpool)
 
This is a total of $combTot bonnie++ tests; on my system each run
takes about 15 minutes, so expect roughly $runHours hours to complete.
The disks used are randomized to exercise as much of your hardware as
possible.
 
I took the idea for this level of testing from calomel.org:
https://calomel.org/zfs_raid_speed_capacity.html
 

EOF

if ($sudo) {
    $intro .= "The commands assume you are running as non-root; they will\nneed sudo to perform zpool operations.";
} else {
    $intro .= "The commands assume you are running as root\n".
        "It is a very bad idea to run bonnie as root, so they will run\n".
        "instead as user '$user'. Obviously you should change that as needed.";
}
$intro .= "";

&docmd("#!$bash");
&cmdmsg(split(/[\n\r]+/, $intro));

&docmd("echo Commands will be written to  : $cmdFile");
&docmd("echo Benchmarks will be wriiten to: $tsvFile");
&docmd("echo ZFS status will be wriiten to: $statFile");
&cmdmsg("");

my $barLen = 60;
my $bar    = '#' x $barLen;
my $boxFmt = '# ' . '%-'.($barLen-4).'s'.' #';

&docmd("echo -e \"".join("\\t", qw(RAIDZ DISKS REP WRITE REWRITE READ))."\" >> $tsvFile");
foreach my $raidz (@raidzLvls) {
    foreach my $dataDisks (@diskLvls) {
        my $totalDisks = $raidz + $dataDisks;
        &cmdmsg("",$bar,sprintf($boxFmt,"RAID-Z$raidz with $dataDisks non-parity disks ($totalDisks total)"), $bar);
        my $msg = "STARTING: RAID-Z$raidz with $dataDisks data disks";
        &docmd("echo $msg");
        # This randomizes the disks:
        my @scramble = map { $_->[0] } sort 
        { $a->[1] <=> $b->[1] } map { [ $_, rand(1) ] } @devices;

        # Pull off the ones we are going to use:
        my @using;
        foreach my $disk (splice(@scramble, 0, $totalDisks)) {
            # I just like having abc11 and abc3 sort nicely...
            my ($alpha, $num) = ("", 0);
            if ($disk =~ /(\D+)/) { $alpha = $1; }
            if ($disk =~ /(\d+)/) { $num   = $1; }
            push @using, [$disk, $alpha, $num];
        }
        @using = map {$_->[0]} sort { $a->[1] cmp $b->[1] || $a->[2] <=> $b->[2] } @using;

        my $dir = &create_pool( \@using, $raidz );
        &cmdmsg("", "Running $reps disk tests");
        foreach my $rep (1..$reps) {
            &bonnie($raidz, $dataDisks, $rep, $dir);
            &cmdmsg("") unless ($rep == $reps);
        }
        &destroy_pool($raidz, $dataDisks);
    }
}

close CMDFILE;

&msg("Finished. Please read, modify if needed, and then run:", $cmdFile);


sub create_pool {
    my ($disks, $rz) = @_;
    # I am following guidance for enforcing 4kb sectors from:
    # http://savagedlight.me/2012/07/15/freebsd-zfs-advanced-format/
    my @devs;
    &cmdmsg("", "Preparing virtual 4kb devices");
    foreach my $disk (@{$disks}) {
        my $cmd = sprintf("%sgnop create -S 4k gpt/%s", $sudo, $disk);
        push @devs, "gpt/$disk.nop";
        &docmd($cmd);
    }

    &cmdmsg("", "Creating pool, exporting, and importing");
    my $makeCmd = sprintf("%szpool create %s raidz%s", $sudo, $poolName, $rz);
    for my $d (0..$#devs) {
        $makeCmd .= " \\\n   " unless ($d % 3);
        $makeCmd .= " /dev/$devs[$d]";
    }
    &docmd($makeCmd);
    &docmd("zfs set atime=off $poolName");
    &docmd("${sudo}zpool export $poolName");
    foreach my $disk (@devs) {
        &docmd("${sudo}gnop destroy $disk");
    }
    &docmd("${sudo}zpool import -d /dev/gpt $poolName");
    &docmd("${sudo}zfs create $poolName/$volName");
    my $testDir = "/$poolName/$volName/bonnieData";
    &docmd("${sudo}mkdir $testDir");
    &docmd("${sudo}chmod 0777 $testDir");
    return $testDir;
}

sub bonnie {
    my ($rz, $dd, $r, $dir) = @_;
    my $bonTok = sprintf("RAIDZ%d_%d", $rz, $dd);
    my $bfile = sprintf("Bonnie_%s-Data_Rep%d.csv", $bonTok, $r);
    my $cmd   = "bonnie++ -r 8192 -s 81920 -f -b -n 1 -q -m $bonTok ";
    $cmd     .= "-u $user " unless ($sudo);
    $cmd     .= sprintf("-d \\\n    %s > %s", $dir, $bfile);
    &docmd("date +'   [Repitition $r] %H:%M:%S'");
    &docmd($cmd);
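    # Columns 10, 12 and 16 of the bonnie++ CSV are the block write, rewrite
    # and block read figures (K/sec) that end up in the WRITE/REWRITE/READ
    # columns of the TSV file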
    &docmd('awk -F "," \'{ print "'."$rz\\t$dd\\t$r\\t".
           '"$10"\t"$12"\t"$16 }\' \\'. "\n  $bfile >> $tsvFile");
}

sub destroy_pool {
    my ($rz, $dd) = @_;
    &cmdmsg("", "Destroying pool");
    &docmd("date >> $statFile");
    &docmd("echo Status of RAIDZ$rz with $dd data disks: >> $statFile");
    &docmd("zpool status >> $statFile");
    &docmd("${sudo}zpool destroy $poolName");

}

sub msg {
    warn join("\n", @_) ."\n";
}

sub cmdmsg {
    foreach my $line (@_) {
        if (!defined $line || $line =~ /^\s*$/) {
            print CMDFILE "\n";
        } else {
            print CMDFILE "# $line\n";
        }
    }
}

sub docmd {
    my $cmd = shift;
    print CMDFILE "$cmd\n";
}
Here is a sample of the bash code that it generates; this is one test iteration with three replicates:
Code:
# ############################################################
# # RAID-Z1 with 4 non-parity disks (5 total)                #
# ############################################################
echo STARTING: RAID-Z1 with 4 data disks

# Preparing virtual 4kb devices
gnop create -S 4k gpt/ZfsDrive0
gnop create -S 4k gpt/ZfsDrive3
gnop create -S 4k gpt/ZfsDrive6
gnop create -S 4k gpt/ZfsDrive7
gnop create -S 4k gpt/ZfsDrive11

# Creating pool, exporting, and importing
zpool create testPool raidz1 \
    /dev/gpt/ZfsDrive0.nop /dev/gpt/ZfsDrive3.nop /dev/gpt/ZfsDrive6.nop \
    /dev/gpt/ZfsDrive7.nop /dev/gpt/ZfsDrive11.nop
zfs set atime=off testPool
zpool export testPool
gnop destroy gpt/ZfsDrive0.nop
gnop destroy gpt/ZfsDrive3.nop
gnop destroy gpt/ZfsDrive6.nop
gnop destroy gpt/ZfsDrive7.nop
gnop destroy gpt/ZfsDrive11.nop
zpool import -d /dev/gpt testPool
zfs create testPool/testVol
mkdir /testPool/testVol/bonnieData
chmod 0777 /testPool/testVol/bonnieData

# Running 3 disk tests
date +'   [Repetition 1] %H:%M:%S'
bonnie++ -r 8192 -s 81920 -f -b -n 1 -q -m RAIDZ1_4 -u someUser -d \
    /testPool/testVol/bonnieData > Bonnie_RAIDZ1_4-Data_Rep1.csv
awk -F "," '{ print "1\t4\t1\t"$10"\t"$12"\t"$16 }' \
  Bonnie_RAIDZ1_4-Data_Rep1.csv >> ZfsTestResults.tsv

date +'   [Repetition 2] %H:%M:%S'
bonnie++ -r 8192 -s 81920 -f -b -n 1 -q -m RAIDZ1_4 -u someUser -d \
    /testPool/testVol/bonnieData > Bonnie_RAIDZ1_4-Data_Rep2.csv
awk -F "," '{ print "1\t4\t2\t"$10"\t"$12"\t"$16 }' \
  Bonnie_RAIDZ1_4-Data_Rep2.csv >> ZfsTestResults.tsv

date +'   [Repetition 3] %H:%M:%S'
bonnie++ -r 8192 -s 81920 -f -b -n 1 -q -m RAIDZ1_4 -u someUser -d \
    /testPool/testVol/bonnieData > Bonnie_RAIDZ1_4-Data_Rep3.csv
awk -F "," '{ print "1\t4\t3\t"$10"\t"$12"\t"$16 }' \
  Bonnie_RAIDZ1_4-Data_Rep3.csv >> ZfsTestResults.tsv

# Destroying pool
date >> ZfsStatus.txt
echo Status of RAIDZ1 with 4 data disks: >> ZfsStatus.txt
zpool status >> ZfsStatus.txt
zpool destroy testPool
 
listentoreason said:
The Xeon E5606 supports AES-NI, so I had hoped for better performance. If anyone has thoughts on improving performance of an encrypted provision on top of ZFS, I would be very interested in hearing it. Also, if this approach is flat-out a Bad Idea, I would be interested to know that, too.
FreeBSD (at least in 8.x) won't use the CPU's crypto capabilities unless you have the appropriate modules loaded. On my systems, I have:
Code:
crypto_load="yes"
cryptodev_load="yes"
aesni_load="yes"
Which gives me this:
Code:
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS> on motherboard
 
Terry_Kennedy said:
FreeBSD (at least in 8.x) won't use the CPU's crypto capabilities unless you have the appropriate modules loaded. On my systems, I have:

Interesting. A quick look at the online SVN browser suggests that crypto and aesni aren't in GENERIC by default in 9.1/9.2, so loading those modules could be a major gain.

Also, the SVN logs suggest that some performance improvements might have been made recently, although by the look of it you may need to go to HEAD or possibly the 10.0 betas to test them.

http://svnweb.freebsd.org/base/head/sys/crypto/aesni/aesni.c?view=log
On my machine, pulling the code to userland I saw performance go from
~150MB/sec to 2GB/sec in XTS mode. GELI on GNOP saw a more modest
increase of about 3x due to other system overhead (geom and
opencrypto)...


One thing that intrigues me: is there a specific reason why your script creates a new filesystem, /testPool/testVol, rather than just using /testPool?
 
usdmatt said:
One thing that intrigues me: is there a specific reason why your script creates a new filesystem, /testPool/testVol, rather than just using /testPool?

Yes, actually! It's due to an amoeba-level comprehension of the distinctions between zvols and directories, which results in a fair bit of Frankenstein patchwork when I build workflows from scattered examples I find on the net. I've been trying to expand my understanding of the structure and functionality of ZFS, but I am still baffled by what a zvol represents, or when it is appropriate to make one.

I feel relatively comfortable with zpools. I have the vague sense that a zvol can allow segregation beyond what a directory structure might allow (quotas, for example?), but remain 98%+ ignorant on their proper application.

I infer from the question that it was perhaps an odd thing to do? Any insight (either direct, or via helpful links) would be greatly appreciated. In particular, does creating a zvol add overhead to the system?
 
Hmm. I had hoped by this time that I would be able to post tasty benchmarks with AES-NI accelerated encryption. However, I am failing to get aesni recognized. Per @Terry_Kennedy's suggestion, I have modified /boot/loader.conf to look like this:
Code:
crypto_load="yes"
cryptodev_load="yes"
aesni_load="yes"
geom_eli_load="yes"
(I'm not sure I need the geom_eli line there; geli fired up fine before, so I presume it auto-loads when needed, or is it built into 9.1 by default?)

The relevant modules appear to have loaded:
Code:
# kldstat
Id Refs Address            Size     Name
 1   30 0xffffffff80200000 1323388  kernel
 2    1 0xffffffff81524000 1f9a0    geom_eli.ko
 3    4 0xffffffff81544000 2b4a8    crypto.ko
 4    3 0xffffffff81570000 dde0     zlib.ko
 5    1 0xffffffff81580000 4cd0     cryptodev.ko
 7    1 0xffffffff81612000 13436d   zfs.ko
 8    1 0xffffffff81747000 2fb1     opensolaris.ko
 9    1 0xffffffff8174a000 3dff     linprocfs.ko
10    1 0xffffffff8174e000 1f417    linux.ko
11    1 0xffffffff8176e000 1a9f     aesni.ko
but I am not getting the system to recognize the presence of a (putatively) AES-NI capable CPU:
Code:
# dmesg | egrep -i '(aes|crypt|\.eli)'
cryptosoft0: <software crypto> on motherboard
aesni0: No AESNI support.
aesni0: No AESNI support.
aesni0: No AESNI support.
GEOM_ELI: Device zvol/abyss/shadow.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: software
GEOM_ELI: Device zvol/abyss/shadow2.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: software
Multiple aesni0 entries are present because I tried kldunload aesni / kldload aesni a couple times, hoping that I could, uh... catch it with its guard down and get it to work? The shadow and shadow2 GELI provisions are the ones mentioned above (UFS and zpool, respectively).

Does anyone see anything obvious that I'm missing? I've read Sebulon's benchmarks and holy cow, his encrypted pools are outperforming my plaintext ones; I'm not going to be able to get to that throughput, but it would be nice to do better than 33% of what my raw zpool can do. The aesni man page implies that I should be good to go by just including aesni_load="YES". I'm concerned that the E5606 AES-NI capability just isn't being recognized for some reason.
Code:
# uname -a
FreeBSD citadel 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
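One more check, for the record: the CPU feature flags in the boot dmesg should include AESNI if the processor (and BIOS) are actually exposing the instructions; given the "No AESNI support" messages above, I presume mine currently does not list it:
Code:
grep Features2 /var/run/dmesg.boot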
 
Regarding AESNI, the manual for your board suggests there's a BIOS option for it, so you may be lucky and it might simply be disabled by default:

Intel AES-NI

Select Enable to use the Intel Advanced Encryption Standard (AES) New Instructions (NI) to ensure data security. The options are Disabled and Enabled.

listentoreason said:
Yes, actually! It's due to an amoeba-level comprehension of the distinctions between zvols and directories, which results in a fair bit of Frankenstein patchwork when I build workflows from scattered examples I find on the net. I've been trying to expand my understanding of the structure and functionality of ZFS, but I am still baffled by what a zvol represents, or when it is appropriate to make one.

When you create a pool, a single ZFS dataset is created. That dataset is mounted on /poolname, and supports all the functionality of any other ZFS dataset. (The FreeBSD docs used to contain a terrible line that suggested you needed to create sub-datasets to make use of ZFS features, and that by using the root mount you were somehow storing files directly on the pool itself).

Edit: It's still there...
Files may be created on it and users can browse it, as seen in the following example.
However, this pool is not taking advantage of any ZFS features. To create a dataset on this pool...

A ZFS dataset is just like any other filesystem. It's mounted on a mount point; you can put files on it, set permissions, etc., just like a UFS filesystem. On a single pool, you can create as many datasets as you want. The main reasons I can come up with for separate datasets are:

  • Each dataset can have different options. You might want readonly=on on /pool/readonly, but not on /pool/readwrite
  • You can easily see used space for each dataset in zfs list, and can snapshot the datasets independently.
  • Separate datasets can be mounted in different locations on the system. For example, you could create a pool called storage, mounted on /storage, then create a second dataset, storage/home, mounted on /home.

Code:
zpool import -d /dev/gpt testPool
zfs create testPool/testVol

This code does not create a ZVOL; it just creates a second dataset, mounted on /testPool/testVol. Barring any weird quirks, I wouldn't expect performance to be any different whether you test on /testPool or /testPool/testVol.
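To make the dataset side concrete, here is a minimal sketch (assuming a pool called storage already exists and is mounted on /storage):
Code:
zfs create storage/readonly
zfs set readonly=on storage/readonly
zfs create -o mountpoint=/home storage/home
zfs list -o name,used,avail,mountpoint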

A ZVOL is created using the -V option. A ZVOL is a virtual 'volume' with an associated device node in /dev. Thus, the reason for creating ZVOLs is that you want to do something that needs a real block device, such as running UFS on top of it. ZVOLs can also be useful if you want to export storage via iSCSI, or as the backing disk for bhyve VMs.

When you created the encrypted UFS filesystem / ZFS pool you did, correctly, create a ZVOL.

Note that the zfs create command also has a -s option to create a sparse volume (zfs create -sV 2TB pool/sparseVol); otherwise you may have noticed that the ZVOL immediately reserved 2 TB of your pool. The interesting thing about this is that you can actually create a device bigger than your pool. For example, you could create a 10 TB UFS file system on top of a 2 TB ZFS pool, then buy disks to increase the pool size as/when needed, without having to worry about resizing the UFS filesystem.
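A quick way to see the difference (pool and volume names are just placeholders):
Code:
zfs create -V 10G pool/thick
zfs create -sV 10G pool/thin
zfs get volsize,refreservation pool/thick pool/thin
# the thick volume reserves roughly its full size; the sparse one reports refreservation = none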
 
listentoreason said:
However, I have also been trying to implement geli on top of ZFS. Judging from the majority of postings I have found, most users first provision all drives with geli and then put ZFS on top of the encrypted devices. I do not want to encrypt the entire pool, however, and have instead been testing geli on top of ZFS (a GELI provider on a zvol carved from the pool). The benchmarks there are significantly worse: roughly a threefold hit for a UFS provision, and up to sixfold for an encrypted zpool on top of a zvol.

I have observed the same thing on 9.1-RELEASE with UFS-on-ZVOL. There are some parameters that could be tweaked, such as the block size of the ZVOL vs the block size of the UFS volume, whether the zpools do sync or async writes (sync parameter), etc.
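For example, the kind of thing I mean (values purely illustrative, and the volume name here is hypothetical):
Code:
# create the ZVOL with a block size matching what newfs will use
zfs create -V 2T -o volblocksize=64K abyss/shadow3
newfs -b 65536 -f 8192 /dev/zvol/abyss/shadow3
# and/or relax sync semantics on the volume (with the usual data-safety trade-off)
zfs set sync=disabled abyss/shadow3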

Another interesting question would be whether this same slowdown occurs on Solaris. Up until 9.2-RELEASE, FreeBSD wasn't able to do ZPOOL-on-ZVOL at all, even though Solaris has always been able to, so the two don't always match up exactly in corner cases.

Terry_Kennedy said:
Code:
crypto_load="yes"
cryptodev_load="yes"
aesni_load="yes"

I have always just used
Code:
aesni_load="yes"
At startup, geli indicates hardware crypto.
 
Nutshell: Thanks to @usdmatt, I was able to get AES-NI activated and recognized. However, my GELI benchmarks actually got worse. I am planning more benchmarks, but would appreciate any insights.

Details:
usdmatt said:
Regarding AESNI, the manual for your board suggests there's a BIOS option for it, so you may be lucky and it might simply be disabled by default

Yes, it was! All seems to be in order now, thank you!
Code:
# dmesg | egrep -i '(aes|crypt|\.eli)'
Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS> on motherboard
GEOM_ELI: Device zvol/abyss/shadow.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device zvol/abyss/shadow2.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware

Sadly, hardware cryptographic support does not appear to have helped my I/O results. Initial rates for the encrypted UFS provision were 90 / 30 / 150 MB/sec (Write/Rewrite/Read, copied from the top of this thread); after enabling hardware support they actually fell to 67 / 30 / 102 MB/sec:
Code:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
citadel         80G           68740   7 30910  20           105048   9  57.1  12
Latency                        1561ms   13590ms               776ms    1974ms
Version  1.96       ------Sequential Create------ --------Random Create--------
citadel             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1  1406   5 +++++ +++  1936   3  1730   4 +++++ +++ +++++ +++
Latency             26074us      21us     193ms    1045us      10us     385us
The encrypted-zpool-on-zvol test is worse than UFS, at 37 / 25 / 87 MB/sec, and notably worse at writes than before I enabled hardware encryption (71 / 26 / 78 MB/sec above):
Code:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
citadel         80G           37670  12 25203   8           88426  13  49.8   3
Latency                        6217ms    6529ms               823ms    1803ms
Version  1.96       ------Sequential Create------ --------Random Create--------
citadel             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                93us     132us     137us      87us      48us      71us
I suspect I'm doing something very wrong, since these values are so far removed from what others have achieved. I was worried that in my tinkerings to activate hardware encryption I perhaps managed to break something else, so I rechecked my main zpool and it looks fine (298 / 242 / 718 MB/sec):
Code:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
citadel         80G           305568  72 247557  64           735151  83 143.1   5
Latency                        1551ms    1572ms               161ms     827ms

I currently have a spare disk sitting idle in the case, so I am going to try the traditional "GELI on the bottom" arrangement with a simple UFS volume and see how it performs. That should at least give me raw GELI benchmarks. I'll also run some bonnie++ tests on a plaintext UFS device; I should have done that anyway as a control for the benchmarks above.
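Roughly what I have in mind for that control, with the device name as a placeholder for whatever the spare shows up as (and the bonnie++ flags copied from the script above):
Code:
geli init -s 4096 /dev/da12
geli attach /dev/da12
newfs /dev/da12.eli
mkdir /gelitest
mount /dev/da12.eli /gelitest
chmod 0777 /gelitest
bonnie++ -r 8192 -s 81920 -f -b -n 1 -q -u someUser -d /gelitest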

Hmmm, a thought: this machine has 8 cores available, but I wonder if AES-NI is "one per chip". That can't explain my current conundrum, but I do wonder whether hardware encryption will scale less well if each chip has only one AES-NI 'unit' rather than one per core.

@usdmatt, thanks very much for the description of ZVOLs and datasets, it was very helpful. I had monkey-copied the -V flag but had not recognized its significance. My lexicon (which now includes "pool", "device", "volume", "file system", "provision" and "dataset") is still uncomfortably ahead of my comprehension but I'm closing the gap.

@bthomson, if you have any links to the tweaks you've seen, I'd be interested in reading them. I'm generally loath to do too much tinkering, but I'd be curious to see which components people believe play a role in this.
 
From WD:

http://www.wdc.com/en/products/products.aspx?id=810 said:
WD Red drives are designed and tested for small office and home office, 1-5 bay NAS systems, small media servers and PCs with RAID.

This implies that WD Red would be unsuitable for a system with twelve drives (as you have created). I would like to put eighteen of these into a new system. Should I look into another drive?

RAID = redundant array of inexpensive drives

Thank you,

Chris
 
tegaP0PwkubtXdsK said:
From WD:

http://www.wdc.com/en/products/products.aspx?id=810 said:
WD Red drives are designed and tested for small office and home office, 1-5 bay NAS systems, small media servers and PCs with RAID.

This implies that WD Red would be unsuitable for a system with twelve drives (as you have created). I would like to put eighteen of these into a new system. Should I look into another drive?

RAID = redundant array of inexpensive drives

At the time I bought them they seemed reasonably priced. There was a modest premium compared to other drives, but they also had a longer warranty, if I recall. I tend, perhaps naively, to view longer warranties as a sign that a manufacturer stands behind its hardware. I thought it was a five-year warranty, but it's listed as three years now at NewEgg; I don't know whether that is my poor memory or whether they have reduced the warranty. I was also willing to spend more on disks because I had decided to make the build a full-fledged server as well as a NAS, and had already ended up with a modestly expensive motherboard to get ECC RAM, so the relative cost of the drives compared to the rest of the hardware was reduced.

My prior FreeNAS zpool had lasted quite some time; it had one drive fail, and replacing that was a pain (the 512-byte vs. 4 KB sector issue). I built a large array because I want to just leave it alone for many years before it fills up, so disk longevity is a bigger concern for me. I bought a total of 15 drives for this build in order to have spares, and I also presumed that at least a few would be duds. Out of those 15, 11 have been in active use in the zpool for three months, with zero errors on scrub. One drive was 100% dead-on-arrival. The other three I stress-tested briefly to verify that they were not also DOA. So I had an out-of-the-box failure rate of 1 in 15.

Anyway, I agree the "I" is an important selling point here, and I'd definitely look for cheap drives. As the metrics above show, you can get stunning performance from consumer-grade drives in a zpool. I'd be very interested to see what kind of metrics you get with an 18-drive pool. Many people would recommend populating the pool with multiple brands and/or models, to minimize the risk of a "bad lot" effect that simultaneously wipes out several drives (and your redundancy). With that approach you could bargain shop, picking up a few different drives as they go on sale.

I presume that your eighteen drives will be 16 + 2 in a RAID-Z2, to keep with the powers-of-two rule for ZFS. You may wish to consider 19 in a RAID-Z3; from what I've read (comments here and elsewhere), RAID-Z3 was introduced for large pools because they come under much more strain (and/or take longer?) during a resilver, increasing the risk that one or more "marginal" drives would fail while adding in a new disk. It would increase your hard disk expense by only a small amount with a pool of that size.

-CAT
 
CAT,

My intention was to create two RAIDZ3 vdevs of 9 drives each. However, just today in the FreeNAS forums somebody sat me down and explained that RAIDZ3 vdevs should contain 4, 5, 7, 11, or 19 drives following the 2^N+3 rule. So I will probably be going with two 11-disk RAIDZ3 vdevs. However, I first need to understand why I was under the impression that a ZFS vdev should never have more than 9 drives in it...

However, in this particular thread I am curious about whether or not I can disregard WD's guidance that the "WD Red" line of drives should not be used in groups larger than 5. I suspect it's just their attempt at getting me to spend more money on fancier NAS drives for a RAED setup (redundant array of Expensive drives).

Chris
 
tegaP0PwkubtXdsK said:
My intention was to create two RAIDZ3 vdevs of 9 drives each. However, just today in the FreeNAS forums somebody sat me down and explained that RAIDZ3 vdevs should contain 4, 5, 7, 11, or 19 drives following the 2^N+3 rule. So I will probably be going with two 11-disk RAIDZ3 vdevs. However, I first need to understand why I was under the impression that a ZFS vdev should never have more than 9 drives in it...

However, in this particular thread I am curious about whether or not I can disregard WD's guidance that the "WD Red" line of drives should not be used in groups larger than 5. I suspect it's just their attempt at getting me to spend more money on fancier NAS drives for a RAED setup (redundant array of Expensive drives).

When I was starting to plan out the server I called WD tech support to see what advice they had for ZFS. 90% of the conversation was the tech person saying "Z what?", consulting with a similarly confused manager, and me trying to politely hang up. They are marketing to traditional RAID users, and all the home consumer RAID "solutions" I've seen involve a hardware RAID controller. I'm sure such systems end up with hardware constraints (technical or marketing) on the size of a RAID pool. I am far from an expert here, but my perception is that from the perspective of an individual drive, within a zpool it's a lonely little device getting and sending an isolated stream of data. It has no concept that it is part of a grander picture; the pool is entirely managed in software, each drive is internally a little island with no self-knowledge of the others. So I can't believe that the pool size would be relevant to any hardware "group limits."

In theory you could make gargantuan ZFS file systems with millions of drives, each believing it was alone in the universe (and you would still not be close to tapping the design limits of ZFS). Practically, one of the more pressing issues I had with a modest 11-drive pool was finding a case to hold all the drives; I went with a cheap 4U rack that could hold 15. That may have been pound-foolish; I worry about hardware failures from rickety connections (the molex connectors for the fans are atrocious). Having enough SATA connections was also an issue. I was concerned that I was going overboard with a higher-end HBA, but I am glad I did. I suspect my performance would have eventually (quickly?) been bottlenecked by IO through a cheaper SATA card.

I too have seen various advice about not making the pool too large. There are some suggestions that you pay a performance penalty, but in my circumstance I saw significant performance gains in most metrics with an expanded pool. I have also seen cautions about issues in replacing drives in a large pool, as mentioned above (that resilvering so many drives puts a strain on the pool). In my case I wanted a single, monolithic virtual device so it would be many years before I had to think about building a new pool (since it can't be expanded). Also, managing backups is, for me at least, easier with one large pool. I still have to consider how I distribute files onto my external storage media (USB hard drives in my case) but I don't have to worry about that on the primary pool ("oh, /media/photos/raw/ is at 98%...").
 
Just to note: RAID solutions for home consumers are almost 100% BIOS-based software RAID, also known as "pseudo RAID". Such solutions use the CPU for the RAID functions. It's very rare to see a real hardware RAID controller on an average fileserver intended for personal or even small-business use.
 
Agree with kpa: Many low-cost RAID cards use the main CPU to perform RAID functions. In and of itself, that is not a bad thing: on modern machines, the CPU is perfectly powerful enough to perform RAID (including parity calculation and checksumming), for dozens or hundreds of disks. Which is why some file systems (for example ZFS) now include the RAID functionality themselves. To quote a friend (who is a professional file system person): "RAID is too important to leave it to the disk controller guys". And I whole-heartedly agree with that sentiment (which is why I no longer own any hardware RAID controllers).

The problem is that many low-cost RAID cards have very badly implemented software stacks; often it's a CD with a Windows driver. While many low-cost RAID implementations are pretty awful, the ones that require installing Windows drivers tend to be the worst. Judging by what I hear, the better option is to find a RAID card that does all the RAID functionality in firmware on the card. Here is a way to figure out which RAID cards are independent of crappy Windows drivers: look for models that support both strange operating systems and strange instruction sets (FreeBSD is only a little strange; really strange ones include HP-UX on Itanium, AIX on Power, and Solaris on SPARC). Cards that can work in some of these environments tend to be industrial strength. But I would still go for software RAID, if implemented well.
 
listentoreason said:
I am far from an expert here, but my perception is that from the perspective of an individual drive, within a zpool it's a lonely little device getting and sending an isolated stream of data. It has no concept that it is part of a grander picture

Can't argue with that!

Chris
 
ralphbsz said:
Agree with kpa: Many low-cost RAID cards use the main CPU to perform RAID functions. In and of itself, that is not a bad thing: on modern machines, the CPU is perfectly powerful enough to perform RAID (including parity calculation and checksumming), for dozens or hundreds of disks. Which is why some file systems (for example ZFS) now include the RAID functionality themselves. To quote a friend (who is a professional file system person): "RAID is too important to leave it to the disk controller guys". And I whole-heartedly agree with that sentiment (which is why I no longer own any hardware RAID controllers).
...
But I would still go for software RAID, if implemented well.
Hear, hear. There are many reasons that I decided to be "that guy at work that goes around trying to convince everyone to use ZFS." One of my biggest reasons is that it's FOSS, available on multiple free platforms, and completely implemented in software. This means that my zpool is simply the hard drives; if any part of my system fails, I just have to build a new system, with arbitrary hardware, and move the drives over. I don't have to worry whether UberRaidMaster™ is still in business, whether they still support the OS I want to use, whether they are still making UberRaid™ v3.8.4-compatible cards, or whether they have decided to increase the cost of the needed hardware fivefold.

I work in an industry that consumes very large amounts of disk (I think we're over a petabyte now). Some of that needs to be very high performance, but a lot of it would do fine with a zpool built on cheap disk (we've got a fair bit of data hosted on pricey "Enterprise" drives with SquashFS, for example). All of my colleagues are comfortable in BSD and/or Linux systems, and many maintain significant servers at home. I remain baffled that I have garnered so few adherents (I think I convinced one to go with ZFS on FreeNAS). Maybe I should offer free T-shirts? I can't afford free "product information seminars" in Aruba, and I suspect I'm not quite as pleasing to the eye as the vendor reps :OO .
 
listentoreason said:
When I was starting to plan out the server I called WD tech support to see what advice they had for ZFS. 90% of the conversation was the tech person saying "Z what?", consulting with a similarly confused manager, and me trying to politely hang up. They are marketing to traditional RAID users, and all the home consumer RAID "solutions" I've seen involve a hardware RAID controller. I'm sure such systems end up with hardware constraints (technical or marketing) on the size of a RAID pool.
You have to realize that retail drives are a rather small part of WD's business. Most of their production goes to OEMs who buy in bulk and have dedicated sales / support reps. And almost all of the "BYOD" NAS boxes are using some sort of Linux filesystem, so the end-user support people simply don't get a lot of calls about ZFS. So there's no point in training reps to answer a question they may never get asked, particularly when ZFS is evolving and the answer may be different next month.
listentoreason said:
I am far from an expert here, but my perception is that from the perspective of an individual drive, within a zpool it's a lonely little device getting and sending an isolated stream of data. It has no concept that it is part of a grander picture; the pool is entirely managed in software, each drive is internally a little island with no self-knowledge of the others. So I can't believe that the pool size would be relevant to any hardware "group limits."
Given that Backblaze has been using regular consumer drives in their Pods since they started, it shows that if you're creative enough, you can get away with things like that. The Red and RE drives are simply versions that don't try error recovery "forever" - the firmware in them assumes that a host RAID controller will deal with hard-to-recover errors. That is exactly the opposite of the requirement for a single drive in a PC, where you want the drive to keep trying until it's sure it can't get any of the data back.

I've commented before in a number of forums that rather than offering a drive in every color of the rainbow, I'd like to be able to specify a drive by capacity, spindle speed, and cache (in that order) and then pick a warranty (1, 3, or 5 years). And give me a mode-setting tool to configure spindown / standby, error-recovery preference, and so on. This would cut down dramatically on the number of SKUs the manufacturer, distributor, and retailers need to deal with.
 