RaidZ2 performance with GELI

Inspired by the well-written posts here, I decided to post the performance of my raidz2 vdev and ask a couple of questions.

Code:
	NAME                 STATE     READ WRITE CKSUM
	storage              ONLINE       0     0     0
	  raidz2-0           ONLINE       0     0     0
	    label/disk0.eli  ONLINE       0     0     0
	    label/disk1.eli  ONLINE       0     0     0
	    label/disk2.eli  ONLINE       0     0     0
	    label/disk3.eli  ONLINE       0     0     0
	    label/disk4.eli  ONLINE       0     0     0
	    label/disk5.eli  ONLINE       0     0     0

The disks are Samsung SpinPoint F4 2TB each, and as far as I know they are Advanced Format drives, which means the pool should be created with ashift=12 (4K sectors). That is what I did at the time.

Code:
zdb | grep ashift
            ashift: 12

The disks were fed to ZFS whole; they were not partitioned to start at a 1MB boundary. In one of his previous posts (which eludes me right now), phoenix suggested that the disks should nonetheless be partitioned rather than given to ZFS whole.

I did say before that this is just a storage server which only needs to saturate a GigE link, but I'm curious whether the performance could be better.

So here are the results:
# bonnie++ -d /storage/test -u 0:0 -s 24g
Code:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zhenbox.home.do 24G   140  99 97651  17 60965  16   347  89 215720  18 126.9   4
Latency               361ms     875ms    2083ms     350ms     204ms     890ms
Version  1.96       ------Sequential Create------ --------Random Create--------
zhenbox.home.domain -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 17048  95 +++++ +++  9910  93 10899  98 20674 100  6289  95
Latency              9517us     176us   44776us   12602us     111us   61520us
1.96,1.96,zhenbox.home.domain,1,1335966952,24G,,140,99,97651,17,60965,16,347,89,215720,18,126.9,4,16
,,,,,17048,95,+++++,+++,9910,93,10899,98,20674,100,6289,95,361ms,875ms,2083ms,350ms,204ms,890ms,9517us
,176us,44776us,12602us,111us,61520us

The configuration is an Intel i7 920 with 12GB RAM. Also, very important: I use geli AES-XTS with a 256-bit key on all disks. I noticed during the test that CPU usage goes to 90%, which makes me think encryption is the limiting factor here. Still, I'd like to know whether performance could be better despite the encryption.
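
One thing I could try (just a sketch; the memory disk size and unit number are arbitrary) is benchmarking geli on its own, without ZFS in the way, by writing through a throw-away onetime provider on a memory disk:

Code:
# mdconfig -a -t swap -s 1g -u 9
# geli onetime -e AES-XTS -l 256 -s 4096 md9
# dd if=/dev/zero of=/dev/md9.eli bs=1m count=1000
# geli detach md9.eli
# mdconfig -d -u 9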

Is it a mistake to feed whole disks to ZFS rather than partitions? And would a separate SSD for the ZIL and L2ARC make a significant difference here, or is performance limited purely by geli?
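
For reference, if I were to add an SSD later, I guess attaching the log and cache devices would just be something like this (gpt/slog0 and gpt/cache0 are placeholder labels for partitions on the SSD):

Code:
# zpool add storage log gpt/slog0
# zpool add storage cache gpt/cache0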

Any opinion welcomed.
 
@bbzz

Of course geli can be limiting. The partitioning as well, but mostly geli, I'd say. Have a look at the benchmarking I did:
GELI Benchmarks

It shows very well the differences between geli's many encryption algorithms and key lengths, and whether the en/decryption is hardware accelerated or not.

Even so, I still agree with phoenix; it is always best to partition at -b 2048. That way, you also have plenty of room for a small boot partition at the beginning.
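
Roughly like this, as a sketch (ada0, the label and the boot partition size are just examples; the geli flags match what you said you use):

Code:
# gpart create -s gpt ada0
# gpart add -t freebsd-boot -s 512k ada0
# gpart add -b 2048 -t freebsd-zfs -l disk0 ada0
# geli init -e AES-XTS -l 256 -s 4096 /dev/gpt/disk0
# geli attach /dev/gpt/disk0

Repeat per drive, then build the raidz2 from the gpt/diskN.eli providers.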

/Sebulon
 
So unfortunately there's no way to know exactly what the performance hit is from not starting at -b 2048. I'm tempted to recreate the pool to test this, but there's no easy way to do it without getting new disks. Unless I intentionally degrade the vdev by two disks, copy from the degraded pool to those two disks, and recreate. Ouch.

It's interesting, but I had never heard about a possible issue with not starting at -b 2048, even for "whole" disks. Isn't ashift=12 doing its job? I don't really need a boot partition, as I boot from another disk.

Thanks!
 
Maybe, while running benchmarks/bonnie++, you could also run the following command:

$ gstat -f disk0

The output should contain two lines with label/disk0 and label/disk0.eli in the last column. Compare the second-to-last column, labelled %busy. That should show you whether geli is indeed the limiting factor.

From memory, running benchmarks/bonnie++ on my pool, which consists of the following disks,

Code:
$ camcontrol devlist
<ATA Hitachi HDS72101 A3EA>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA Hitachi HDS72101 A3EA>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD10EADS-00M 0A01>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD10EADS-00M 0A01>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA Hitachi HDS72101 A3EA>        at scbus0 target 4 lun 0 (pass4,da4)
<ATA WDC WD1000FYPS-0 1B02>        at scbus0 target 5 lun 0 (pass5,da5)

gives a write speed of 200+ MB/s and a read speed of 300+ MB/s.
 
Yeah, there's quite a performance hit without hardware acceleration:

# gstat -I 5s

Code:
dT: 5.004s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     12      0      0    0.0     11    119    0.7    1.9| ada0
    0     12      0      0    0.0     11    119    4.7    3.3| label/zpool
    0    256      0      0    0.0    254  27576    2.2   26.0| ada1
    0    254      0      0    0.0    252  27409    2.2   28.2| ada2
    0    261      0      0    0.0    259  28120    2.2   28.9| ada3
    0    259      0      0    0.0    257  27997    2.0   29.2| ada4
    0    256      0      0    0.0    254  27313    2.1   29.3| ada5
    0    261      0      0    0.0    260  27988    2.2   27.3| ada6
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zpool/swap
    0     12      0      0    0.0     11    119    8.1    3.7| label/zpool.eli
    2    256      0      0    0.0    254  27576   11.7   64.2| label/disk0
    3    254      0      0    0.0    252  27409   10.2   64.0| label/disk1
    2    261      0      0    0.0    259  28120   12.5   68.1| label/disk2
    3    259      0      0    0.0    257  27997    9.9   70.7| label/disk3
    2    256      0      0    0.0    254  27313   11.0   68.7| label/disk4
    6    261      0      0    0.0    260  27988   12.4   66.7| label/disk5
    0      0      0      0    0.0      0      0    0.0    0.0| da0
   10    256      0      0    0.0    254  27576   27.5   97.5| label/disk0.eli
   10    254      0      0    0.0    252  27409   28.2  101.3| label/disk1.eli
    9    261      0      0    0.0    259  28094   25.9   96.4| label/disk2.eli
    8    259      0      0    0.0    257  27997   25.6  101.2| label/disk3.eli
   10    256      0      0    0.0    254  27313   27.3  100.3| label/disk4.eli
   10    261      0      0    0.0    260  27988   25.4   94.4| label/disk5.eli
    0      0      0      0    0.0      0      0    0.0    0.0| da0s1
    0      0      0      0    0.0      0      0    0.0    0.0| da0s1a

Still, I'm curious whether recreating the pool at a -b 2048 boundary would make a difference.

Since I only have a little more than 2TB filled up, I think I'll pull one disk out of the raidz2 vdev, copy what I can to that 2TB disk, recreate the degraded vdev and copy back.

The question is: is it possible to make a degraded raidz2 vdev with 5 disks and then add one later?
 
The question is: is it possible to make a degraded raidz2 vdev with 5 disks and then add one later?

Adding one later, no. The only way of adding just one drive later is by forcing it into the pool as a single-drive vdev without any fault tolerance. From then on, if anything happens to that drive, the entire pool is heading down shit creek without a paddle...

But it is very possible to create a degraded 6-disk raidz2 pool with only four drives, or five for that matter.

Four drives:
Code:
[CMD="#"]diskinfo -v da0[/CMD]
	512         	# sectorsize
	2000398934016	# mediasize in bytes (1.8T)
	3907029168  	# mediasize in sectors
        ...

[CMD="#"]echo "2000398934016 / 1024000 - 1" | bc[/CMD]
1953513
# dd if=/dev/zero of=/tmp/tmpdsk0 bs=1024000 seek=1953513 count=1
# dd if=/dev/zero of=/tmp/tmpdsk1 bs=1024000 seek=1953513 count=1
# mdconfig -a -t vnode -f /tmp/tmpdsk0 md0
# mdconfig -a -t vnode -f /tmp/tmpdsk1 md1
# gnop create -S 4096 md0

# zpool create -O mountpoint=none -o autoexpand=on pool1 raidz2 md0.nop md1 gpt/disk2 gpt/disk3 gpt/disk4 gpt/disk5
# zpool offline pool1 md0.nop
# zpool offline pool1 md1

# mdconfig -d -u 0
# mdconfig -d -u 1
# rm /tmp/tmpdsk0
# rm /tmp/tmpdsk1


Five drives:

# dd if=/dev/zero of=/tmp/tmpdsk0 bs=1024000 seek=1953513 count=1
# mdconfig -a -t vnode -f /tmp/tmpdsk0 md0
# gnop create -S 4096 md0

# zpool create -O mountpoint=none -o autoexpand=on pool1 raidz2 md0.nop gpt/disk1 gpt/disk2 gpt/disk3 gpt/disk4 gpt/disk5
# zpool offline pool1 md0.nop

# mdconfig -d -u 0
# rm /tmp/tmpdsk0

And when the data is over, you just:
# zpool replace pool1 md0.nop gpt/disk0
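
And in the four-drive case, presumably also the second placeholder the same way:
# zpool replace pool1 md1 gpt/disk1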

This also allows you to, say, create a 2-disk mirror pool with only one real drive...
Create a 6-disk raidz3 pool with three real drives, perhaps...
Or to go from a 2-disk mirror pool to a 3-disk raidz pool with only three real drives to work with.

Trixy hobbitses!:)

/Sebulon
 
Ah yes, that's what I was thinking of! Many thanks. Let's hope that one drive doesn't die while I recreate the pool. :e
 
Sorry for nitpicking now, just curious. Why do you subtract 1 in this:
Code:
echo "2000398934016 / 1024000 - 1" | bc

Can that virtual disk just be arbitrarily larger than the other physical drives, since the vdev size will be limited by them anyway?
 
Code:
# diskinfo -v ada1
ada1
	512         	# sectorsize
	2000397852160	# mediasize in bytes (1.8T)
	3907027055  	# mediasize in sectors
	...

# diskinfo -v ada2
ada2
	512         	# sectorsize
	2000398934016	# mediasize in bytes (1.8T)
	3907029168  	# mediasize in sectors
	...

Even though these two disks are from the same series and the same manufacturer, they are slightly different sizes. Could this affect performance, maybe? Would it be better to make the GPT partitions the same size on both disks, and also a couple of MB smaller than the total size, to account for a possible size mismatch when upgrading later (possibly with a drive from another manufacturer altogether)? So maybe something like this:

# gpart add -t freebsd-zfs -b 20480 -a 4k ada0
 
bbzz said:
Sorry for nitpicking now, just curious. Why do you subtract 1 in this:
Code:
echo "2000398934016 / 1024000 - 1" | bc

Can that virtual disk just be arbitrarily larger than the other physical drives, since the vdev size will be limited by them anyway?

Because -b 2048 subtracts 1MB.
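
If I have the arithmetic right, with the da0 numbers from above it works out roughly like this (1024000 is just the block size used in the dd commands):

Code:
2048 sectors * 512 bytes           = 1 MiB, which is what -b 2048 reserves at the start
2000398934016 / 1024000            = 1953514 (integer division); minus 1 gives 1953513
bs=1024000 seek=1953513 count=1    -> sparse file of 1953514 * 1024000 = 2000398336000 bytes,
                                      just under the 2000398934016-byte raw disk
(without the "- 1" it would be 1953515 blocks = 2000399360000 bytes, larger than the disk)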

OK, check diskinfo against the diskX partitions on your hard drives. Their sizes should match. You won't need to subtract anything with that, but even if you did, it wouldn't hurt.

Subtracting more from the partition size should also work, I guess. Never tried that one though, but I can't see anything wrong with it.

The virtual disks must be smaller than the real ones. Otherwise you won't be able to replace them with real disks later, because "Specified device too small" something, something, dark side...

/Sebulon
 
Because -b 2048 subtracts 1MB.
A "duh" moment here.

I decided to switch to a mirror configuration because of its many advantages and recommendations. Since I won't need all the space for a while, I'll make two three-way mirrors, and then, when I do need it, switch to three mirror pairs and get an extra hot spare. I think it makes sense to keep the extra disks synced as part of the mirrors rather than as hot spares, since otherwise they aren't doing anything. Maybe reads would be faster too. I'll test.
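
Roughly what I have in mind, sketched with placeholder pool name and labels:

Code:
# zpool create tank mirror gpt/disk0.eli gpt/disk1.eli gpt/disk2.eli \
                    mirror gpt/disk3.eli gpt/disk4.eli gpt/disk5.eli

Later the third disk can be split off each mirror with zpool detach and re-added as another mirror vdev or as a spare.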

I took two disks off the raidz2 vdev last night, made them into a mirror vdev and copied the data over. It was horrendously slow (~50MB/sec), but I guess it's either geli, or the fact that one pool was degraded, or maybe something else.

Anyway, these are the two mirrored disks in the vdev. I started at -b 2048 and didn't do anything else (the sector size is from geli):
Code:
diskinfo -v /dev/gpt/disk0.eli 
/dev/gpt/disk0.eli
	4096        	# sectorsize
	2000396779520	# mediasize in bytes (1.8T)
	488378120   	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	484502      	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	S2H7J9CB314062s0	# Disk ident.

diskinfo -v /dev/gpt/disk1.eli
/dev/gpt/disk1.eli
	4096        	# sectorsize
	2000397860864	# mediasize in bytes (1.8T)
	488378384   	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	484502      	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	S2H7J9CB314061s0	# Disk ident.

There was never any complaint, but the size difference is still about 1MB, which I found interesting since they are the same disks and should be from the same series.
I don't know if this matters at all performance-wise.

Also, I found that if I explicitly specify the partition size in gpart, it always starts at sector 40 rather than 2048, no matter what. Like so (just an example):
Code:
# gpart add -t freebsd-zfs -b 2048 -a 4k -s 1800gb ada1
ada1p1 added
# gpart show ada1
=>        34  3907026988  ada1  GPT  (1.8T)
          34           6        - free -  (3.0k)
          40  3774873600     1  freebsd-zfs  (1.8T)
  3774873640   132153382        - free -  (63G)

It won't start at 2048 if I specify a size. I tried pretty much everything.
 
Yeah, I knew about that. I'll probably need to look up which crypto cards are available for FreeBSD, as getting a new CPU would mean a new motherboard as well, etc.
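
To double-check the hardware side first (just a sketch; as far as I know the i7 920 predates AES-NI anyway), I guess I can look for the AESNI CPU flag and the aesni(4) driver, which geli uses through crypto(9) when it is loaded:

Code:
# grep -c AESNI /var/run/dmesg.boot   (non-zero means the CPU advertises AES-NI)
# kldstat | grep aesni                (is the aesni(4) driver loaded?)
# kldload aesni                       (if the CPU supports it, geli will use it via crypto(9))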

What do you think about the different disk sizes? Does that matter? (I don't really care anymore whether it does; just curious for future reference.)

Thanks again Sebulon for your help! :r
 
bbzz said:
Also, I found that if I explicitly specify the partition size in gpart, it always starts at sector 40 rather than 2048, no matter what. Like so (just an example):
Code:
# gpart add -t freebsd-zfs -b 2048 -a 4k -s 1800gb ada1
ada1p1 added
# gpart show ada1
=>        34  3907026988  ada1  GPT  (1.8T)
          34           6        - free -  (3.0k)
          40  3774873600     1  freebsd-zfs  (1.8T)
  3774873640   132153382        - free -  (63G)

It won't start at 2048 if I specify a size. I tried pretty much everything.

-a 4k overrides -b, starting the partition at the next multiple of 4k rather than at the given beginning.
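
One workaround (untested in this thread) would be to make the alignment itself 1 MiB; anything aligned to 1m also starts 4K-aligned, so something like this should land on sector 2048 even with an explicit size:

# gpart add -t freebsd-zfs -a 1m -s 1800g ada1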
 
Ah, right.
And again, would this slight size difference between the partitions matter?

Thanks.
 
bbzz said:
What do you think about the different disk sizes? Does that matter? (I don't really care anymore whether it does; just curious for future reference.)

ZFS is built to tolerate it. It wasn't at first, but the founders soon realized it was going to be necessary, simply because no two drives are exactly the same size, not even drives of the same model from the same series, as you have also seen. I don't know exactly how big the tolerance is, though; a few MB at most, it doesn't take much before it whines about a drive being too small. And the total available space per drive is calculated from the smallest drive in each vdev.

/Sebulon
 