NFS write performance with mirrored ZIL

AndyUKG

Well-Known Member

Reaction score: 22
Messages: 463

Sebulon said:
@AndyUKG

And the operative word in that sentence is "should":) I mean, what good does a cache flush command do when there's no power to flush its caches with?
Well, when there's no power, it isn't flushed and the ZFS write transaction isn't complete, end of story. ZFS is still in a consistent state but doesn't have any transactions relating to the last flush that didn't happen. Same idea as a database.
As you said, battery backup is a must for RAM disks or RAM write-back cache. It's not technically required for disks (spinning or flash).

Sebulon said:
How come everyone else, including SUN/Oracle, uses battery-backing for their logs? Well, in most cases it's about battery-backed RAM, which would be useless otherwise, but if a consumer-grade SSD would do the same job, even without battery-backing, then I'm having a hard time understanding why they don't just use that instead. Imagine the savings for Oracle if they ditched the ZEUS drives for ordinary Vertex drives, while getting better performance at the same time.

I have thought that you definitely needed battery-backing to maintain a consistent ZIL. Say you build a database NAS, or a VMware datastore, export over NFS, and once you've gone into production and that database/datastore has grown to 50TB, then you have a power outage, which does happen to everyone from time to time. What happens then?

For everyone that wants to build mission critical systems based on FreeBSD and ZFS, is it really enough to rely on a non battery-backed SSD?

/Sebulon
Haven't read up enough on it to know why Oracle use what they use; I guess reliability. I'd still expect the Vertex to work correctly most of the time, but if in 1 in 1,000 or 10,000 power outages it doesn't store the data correctly, that's enough for Oracle to choose a different, more reliable technology. But that also explains why you can't intentionally create a data error by turning the power off on your test rig...

cheers Andy.
 

peetaur

Active Member

Reaction score: 17
Messages: 167

Instead of dd, use gdd with conv=sync option. Normally the "sync" doesn't happen until the end of the file. Without sync happening, the client never discards uncleanly sent data, so it is ready to resend whatever is lost when the server reappears.
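
A note on the flag, offered as a hedged sketch rather than a correction of anyone's results: in GNU dd, `conv=sync` pads short input blocks with NUL bytes up to the input block size and does not by itself request synchronous writes. The options that do are `oflag=sync` or `oflag=dsync` (open the output with O_SYNC or O_DSYNC) and `conv=fsync` (one fsync at the end). The helper below assumes GNU dd is available either as `gdd` or as plain `dd`; the function name and the path in the usage note are made up for illustration.

```shell
#!/bin/sh
# Prefer gdd (GNU dd from the coreutils package on FreeBSD); fall back to
# dd, which is GNU dd on most Linux systems. Assumption: one of them is GNU dd.
if command -v gdd >/dev/null 2>&1; then DD=gdd; else DD=dd; fi

# sync_write TARGET [COUNT]
# Writes COUNT blocks of 4 KiB from /dev/zero to TARGET with O_SYNC, so
# every block has to be committed before the next one is issued.
sync_write() {
    target=$1
    count=${2:-1000}
    "$DD" if=/dev/zero of="$target" bs=4k count="$count" oflag=sync
}
```

Something like `sync_write /mnt/nfs/ddtest 25000` (hypothetical mount point) would then exercise the sync write path; dd reports the throughput on stderr when it finishes.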

Or you could try mounting NFS with the "intr", "timeout" and "deadthresh" options, so the client drops when the server resets.

Code:
export PAGER=less
(how I hate how the FreeBSD version of "more" quits when it hits the end of the document...)
Code:
man mount_nfs
Code:
deadthresh=<value>
        Set the ``dead server threshold'' to the specified number
        of round trip timeout intervals before a ``server not
        responding'' message is displayed.

intr    Make the mount interruptible, which implies that file
        system calls that are delayed due to an unresponsive
        server will fail with EINTR when a termination signal is
        posted for the process.

timeout=<value>
        Set the initial retransmit timeout to the specified
        value.  May be useful for fine tuning UDP mounts over
        internetworks with high packet loss rates or an over-
        loaded server.  Try increasing the interval if nfsstat(1)
        shows high retransmit rates while the file system is
        active or reducing the value if there is a low retransmit
        rate but long response delay observed.  (Normally, the
        dumbtimer option should be specified when using this
        option to manually tune the timeout interval.)
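
For illustration only, here is how the three options quoted above could be combined on one command line. The function only prints the command, it does not mount anything. The server name, paths and the values 30 and 3 are placeholders, not tuned recommendations; check mount_nfs(8) on your release for the units and defaults.

```shell
#!/bin/sh
# nfs_mount_cmd SERVER EXPORT MOUNTPOINT
# Prints (does not run) a mount_nfs command line that uses the intr,
# timeout and deadthresh options. NFS_TIMEOUT and NFS_DEADTHRESH are
# hypothetical environment overrides for the placeholder values.
nfs_mount_cmd() {
    printf 'mount_nfs -o intr,timeout=%s,deadthresh=%s %s:%s %s\n' \
        "${NFS_TIMEOUT:-30}" "${NFS_DEADTHRESH:-3}" "$1" "$2" "$3"
}

nfs_mount_cmd filer.example.org /storage/archives /mnt/archives
```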
 

TheBang

New Member


Messages: 3

Sebulon, if I could ask, where did you purchase the Deneva 2 R Series? I've been looking to get one for a ZIL SLOG, but no one seems to have them in stock. Thanks!
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@TheBang

Of course you can:)

Correct, no one has them currently. We asked Dustin, a reseller in Sweden, to order one in from the US, so it was a special order. But if more people start asking, they might order more to have in stock.

/Sebulon
 

ghandalf

New Member


Messages: 3

Hi,

This is a very nice and informative thread!

I have one question: did you ever try over-provisioning the SSDs and retesting them? For the slog device, there is no need for more than 8-10GB. With over-provisioning, you could maybe gain some improvement in write throughput. If you still have the OCZ Deneva, I would be very interested in the over-provisioned performance!

Maybe, you could read these reviews:
http://www.storagereview.com/smart_storage_systems_xceedstor_500s_enterprise_ssd_review
http://www.storagereview.com/intel_ssd_520_enterprise_review
In both reviews, they test the performance with and without over-provisioning and the influence is enormous.

Regards ghandalf
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@ghandalf

Nice tip, thanks!

It didn't make any difference in performance compared to what I had the last time I benchmarked it (67MB/s), but perhaps it'll help maintain it. I ran dd from /dev/zero over the whole drive, repartitioned it with the same start boundary but made only a 48GB partition for the SLOG and left the rest empty. Why 48GB? I provisioned it for 4x10GbE (a guy can dream, right:)).
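
The post does not say how the 48GB figure was derived. One common rule of thumb, shown here purely as an assumption, is to size a SLOG for the amount of data that could arrive over the network within a short window. The function name and the 10-second window below are illustrative choices, not values from the thread.

```shell
#!/bin/sh
# slog_bytes LINKS GBIT_PER_LINK SECONDS
# Bytes that could arrive over LINKS network links of GBIT_PER_LINK Gbit/s
# each during SECONDS seconds, ignoring protocol overhead.
slog_bytes() {
    echo $(( $1 * $2 * 1000000000 / 8 * $3 ))
}

slog_bytes 4 10 10    # 4 x 10GbE over an assumed 10-second window
```

With these inputs the result is 50000000000 bytes, roughly 50 GB, which is in the same ballpark as the 48GB partition.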

/Sebulon
 

ghandalf

New Member


Messages: 3

Sebulon said:
@ghandalf

Nice tip, thanks!

It didn't make any difference in performance compared to what I had the last time I benchmarked it (67MB/s), but perhaps it'll help maintain it. I ran dd from /dev/zero over the whole drive, repartitioned it with the same start boundary but made only a 48GB partition for the SLOG and left the rest empty. Why 48GB? I provisioned it for 4x10GbE (a guy can dream, right:)).

/Sebulon
Hi,

How did you do the over-provisioning? I read that it is not enough to just make a smaller partition. I read some articles about OP that describe how to do it with Linux, but unfortunately they are in German. You can do this with hdparm, but I don't know if hdparm is available in FreeBSD!

Code:
root@ubuntu-10-10:~# hdparm -N /dev/sdb

/dev/sdb:
 max sectors   = 312581808/312581808, HPA is disabled
root@ubuntu-10-10:~#
Here you can see that the HPA (Host Protected Area) is disabled.

With this command, you can enable it:
Code:
root@ubuntu-10-10:~# hdparm -Np281323627 /dev/sdb

/dev/sdb:
 setting max visible sectors to 281323627 (permanent)
Use of -Nnnnnn is VERY DANGEROUS.
You have requested reducing the apparent size of the drive.
This is a BAD idea, and can easily destroy all of the drive's contents.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.
root@ubuntu-10-10:~# hdparm -Np281323627 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb:
 setting max visible sectors to 281323627 (permanent)
 max sectors   = 281323627/312581808, HPA is enabled
Real enterprise SSDs use approx. 28% OP, and in the benchmarks they use up to 90% OP.
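
As a rough sketch of the arithmetic behind the `-Np` value: the example above exposes 281323627 of 312581808 sectors, which hides 10% of the native capacity. The helper below computes the visible sector count for a given hidden percentage. The function name is made up, and "percent of native capacity hidden" is only one way to express over-provisioning; vendors often quote spare area relative to usable capacity instead.

```shell
#!/bin/sh
# visible_sectors NATIVE_SECTORS HIDDEN_PERCENT
# Sector count to expose so that HIDDEN_PERCENT of the native capacity
# ends up behind the HPA (integer arithmetic, rounded down).
visible_sectors() {
    echo $(( $1 * (100 - $2) / 100 ))
}

visible_sectors 312581808 10    # prints 281323627, the value used above
visible_sectors 312581808 28    # same drive with 28% hidden
```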

Maybe you can retest the SSD?!

Regards ghandalf
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@ghandalf

OMG that has to be the coolest command-flag EVER!:)

Okay, yeah, I understand, just partitioning less might not do the trick. I'll try installing the drive in a Linux box and running hdparm from there. Just a question about the hdparm command: if I wanted to have only 48GB of usable space on it afterwards, would this command be correct?

# hdparm -Np49152000 --yes-i-know-what-i-am-doing /dev/sdX

/Sebulon
 

ghandalf

New Member


Messages: 3

@Sebulon,

I think it is calculated this way:

48GB -> Byte = 51539607552 Byte

You need sectors:
51539607552 Byte / 512 Byte/Sector = 100663296 Sectors.

BUT: I don't know if the SSD has 512 Byte or 4096 Byte sectors.
When you issue the command:
Code:
hdparm -N /dev/sdX
You will see how many sectors you have.
An example calculation:
A 160GB Intel 320 SSD has 312581808 sectors.
So 312581808 Sectors * 512 Byte/Sector = 160041885696 Byte => 149.05 GiB usable space!

You should also note the original max sector count, so that you can reset the drive to factory defaults.
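
The same calculation as a small shell sketch, in case it helps to try other sizes. The function name is made up; the sector size defaults to 512 bytes and can be overridden, since the thread leaves open which sector size the drive reports.

```shell
#!/bin/sh
# sectors_for_gib SIZE_GIB [SECTOR_BYTES]
# Number of sectors needed to hold SIZE_GIB gibibytes.
sectors_for_gib() {
    echo $(( $1 * 1024 * 1024 * 1024 / ${2:-512} ))
}

sectors_for_gib 48         # 100663296 with 512-byte sectors
sectors_for_gib 48 4096    # 12582912 with 4096-byte sectors
```

The first result is the number that would go after `-Np` for a 48GB visible area on a drive with 512-byte sectors.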

I really hope that there is a gain in performance!

Regards ghandalf
 

TheBang

New Member


Messages: 3

t1066

Active Member

Reaction score: 85
Messages: 227

Just found the following article. Basically, it says that sync writes to a separate ZIL are done at a queue depth of 1 (ZFS sends sync write requests to the log drive one at a time, waiting for each write to finish before sending the next one. Also, it works in a round-robin way similar to how the cache works, hence striped log devices would not help). So the relevant figure is the IOPS at queue depth 1 only, not the maximum IOPS.
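
To make the queue depth 1 point concrete: with only one write in flight, IOPS is simply one divided by the per-write latency, so throughput is bounded by block size divided by latency. The sketch below does that arithmetic; the function name and the 100 microsecond latency are hypothetical values, not measurements from this thread.

```shell
#!/bin/sh
# qd1_kib_per_s LATENCY_US BLOCK_KIB
# Upper bound on throughput in KiB/s when writes of BLOCK_KIB KiB are
# issued strictly one at a time and each takes LATENCY_US microseconds.
qd1_kib_per_s() {
    iops=$(( 1000000 / $1 ))
    echo $(( iops * $2 ))
}

qd1_kib_per_s 100 4    # hypothetical 100 us latency, 4 KiB writes
```

With those inputs the bound is 10000 IOPS, or 40000 KiB/s, no matter how many IOPS the drive can reach at higher queue depths.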
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@ghandalf+TheBang

Somewhat delayed, here is the result: it didn't make any difference. Just wanted you to know that.

I still have my SLOG HPA'd down to 48GB and even though there wasn't any difference in performance, I'm going to keep it there since I don't need more space on that disk anyway. Nice tip though, it was definitely worth a shot.

/Sebulon
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@peetaur

I have actually. We were fortunate to have a SUN 7300 with two STEC Zeus IOPS installed in one of its JBODs for our VMware storage. Later, when we were about to decommission it, I pulled one out and installed it in the same server I have used to test the rest of the drives. If you read the high score back in #1, it was actually bested by quite a lot, both locally and remotely.

Perhaps they should have a sticker saying "Results may vary"?:)

But there may be quite big differences in HW and networking that could affect the outcome. I have performed all my tests with 1GbE, whereas you mentioned 10GbE. I'm guessing that the Zeus performs very differently depending on the controller and driver used, due to its special nature.

/Sebulon
 

peetaur

Active Member

Reaction score: 17
Messages: 167

The Zeus IOPS is a flash array based SSD... I'm talking about a RAM based one.
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@peetaur

Oh sorry, I must have read you wrong. In that case no, I have not tested with a STEC RAM based SSD:)

I have however bought an ACARD 5.25" SATA-II SSD RAM DDR2 - ANS-9010BA, and it worked horribly, haha:) It was really crappy. After starting a write to the device with dd or something, the messages log was flooded with write DMA errors and then the device vanished from the OS.

/Sebulon
 

peetaur

Active Member

Reaction score: 17
Messages: 167

Sebulon said:
@peetaur
...I have not tested with a STEC RAM based SSD:)

...ACARD 5.25" SATA-II SSD RAM DDR2 - ANS-9010BA, and it worked horribly, ...DMA errors and then it vanished...
I kind of think you should test your ACARD RAM with memtest, or return it and get a different one. :D

And I can't wait to hear your test results with the Zeus RAM based SSD. I am tempted to buy one for my vm datastore server but first I'm testing some NFS kernel tuning stuff. See: http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013994.html

So far, I've achieved 79 MB/s read, 75.8 MB/s write, and 51 sync writes from the guest OS. With the defaults, it was under 40 MB/s. But in all cases, it was above 100 without virtual machines. (all sequential with dd)

And unlike in the mailing list post, my NFS client doesn't fail, so I don't need a Linux client. Possibly this is because I upped the memory buffers, like in this guide: https://calomel.org/network_performance.html

Soon I'll test that on a 10 Gbps link, just need to move some hardware around.
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

peetaur said:
And I can't wait to hear your test results with the Zeus RAM based SSD.
No problem, if you're paying for it :p Because I don't have anywhere near that dough, I can tell you for sure. In case you thought differently, the money to test all this comes from my own pockets (except for the Zeus of course, got lucky with that one), so saving that up is going to take a while, haha!

The next disk I'm eager to get my hands on is the OCZ Vertex 4 (either 128 or 256GB, not sure), which sadly comes without a supercap, but I'd like to test it anyway to see what kind of performance to expect from the next generation Deneva.

Now, it's not at all certain that the Deneva 3 is going to have the same specs as the Vertex 4, but that's what they did with the previous generation: they let the Vertex 3 live in the real world for a while, and when all the teething problems were cured, they took the Vertex 3's HW and FW, added a supercap and called it Deneva 2, so that's probably what they'll do with the Deneva 3 as well.

/Sebulon
 

kvolaa

New Member


Messages: 1

Best cheap profi SSD for ZIL

Hi to all,

Do you know the OWC Mercury Pro 6G SSDs? OWC is the only supplier I know of that shows its enterprise SSD prices publicly :). They sell them through their own web shop (maybe exclusively).

I'm using an OWC Mercury Pro 6G 50GB as ZIL. You can get one for $400. It's fantastic.
It does over 60MB/s with sequential 4KB writes (qd=1), and its endurance is 730 TB. It has OP (over-provisioning) set to 28%. Of course, you can use 'hdparm -N' to set it higher and get longer endurance. It has capacitors and a 7-year warranty.

So for ZIL it really is a dream (and relatively cheap).

It's here: http://eshop.macsales.com/shop/SSD/OWC/Mercury_6G/Enterprise

Reviews are here:
http://thessdreview.com/our-reviews/owc-mercury-enterprise-pro-6g-6gbps-ssd-review-owc-and-lsi-combine-for-a-great-enterprise-entry/
http://www.storagereview.com/owc_mercury_enterprise_pro_6g_ssd_review

Another good drive (same price tag as the OWC Mercury Pro) is the Intel 710. It has capacitors too, but it is slower than the Mercury and has about 1/3 of the Mercury's endurance (it is 100GB with 500 TB endurance; the Mercury is 50GB with 730 TB).

In my opinion it's the best option if you can't use a DDRDRIVE or ZeusRAM, i.e. a RAM based SSD. With OP, it can last for years. Mine is set to 20GB (I am still testing the capabilities of my second Mercury drive, so maybe I can raise the OP to make it a 10GB drive and get more endurance).

You can't stripe the ZIL for higher speed, because of qd=1, but you can for longer endurance, and it is often used that way.

In tests you must use dd(1) with the conv=sync flag so that the writes behave like ZIL writes. Look at the ZFS source code, it's free (the best documentation :)).

Our use of ZFS in production right now is primarily for PostgreSQL Plus databases, e.g. caching zpools for it. I can't test any NFS numbers now. But PostgreSQL shines on ZFS!

Right now I'm building a new filer for our ESXi and RHEV (Red Hat KVM hypervisor) machines. We plan to use VDI next year, so we are testing cheap technology (relative to ... Oracle ... gold is cheaper :)).
It's a dual Xeon E5-2600 class machine with 256GB (RAM is cheap and we plan to use deduplication heavily). Mainboard and case are from SuperMicro. The processors are E5-2650s, 8 cores (maybe overkill, but there is dedup, compression, ... and we have the same model in the VM hosts too, so we want a single type).
A bunch of LSI92xx SAS2 HBAs, dual-port 10GbE (the board itself has Intel quad 1GbE).
Only the boot, system and swap drives are internal (SSD).

I want all the other drives in external (SuperMicro) JBOD cases (for ease of service, server replacement and so on). Many SSDs, Mercury Pro 6G 100-400GB; it's the best and relatively cheap.

Why JBOD? Because I hate RAID. Hehe, not really, but classic hardware RAID, especially RAID5/6, is completely dead. ZFS is here. Because we plan to buy SATA drives for budget reasons, there can be SATA/SAS expanders too (so I can add more JBOD cases in the future without running out of SAS2/SATA HBAs).

For the SAN we are using SAS2 switches, a nice little-known thing: "from DAS (direct attached storage) to SAN". It's better, much cheaper and much faster than 8G FC, 10GbE or FCoE. The switch is from LSI, the LSISAS 6160. Dual (HA config) multipath 24Gbit/s connection. Super. Cheap. Flexible. Simple. No additional protocol layers on the way (no encapsulation/translation: SCSI commands/FC/eth/switch/eth/FC/SCSI commands; just SAS/switch/SAS, commands flowing on wires :)). Try it.
The only con is the 2m range (10m cables exist, but they must be active and are more expensive).

So I have no recent experience of my own with practical daily use of a ZFS-based NAS (with NFS, iSCSI, CIFS) in the enterprise. Our ESXi hosts are fed through SAS2 directly. So one of my planned tests will be using a SAS2 HBA as a target, i.e. I want to make my own ZFS-based SAS2 storage (a SAS2 "RAID storage box") and test ZFS speeds.
As the operating system, maybe Linux (preferably RHEL), Solaris|OpenSolaris|OpenIndiana|..., or FreeBSD, whichever wins the competition in speed and reliability.

I want to test FCoE, iSCSI, NFS and CIFS too, to benchmark all of them against the SAS2 fabric/network.
So next time I hope I can present some numbers, especially about NFS and iSCSI speeds.

Cheers
 

TheBang

New Member


Messages: 3

Thanks for the pointer to the OWC drive. It looks like it has pretty good specs for a ZIL SLOG device (especially the all-important supercap). Nice to have an alternative out there. We've been using the Deneva 2 R Series, which has worked pretty well, and was readily available for a while. It seems to be difficult to source again now though.

However, for the future, I think we will be standardizing on the Intel SSD DC S3700. It is the true successor to the Intel 710, and from the early review, it looks like one of the best, most consistent, best-performing, and most affordable enterprise SSDs out there.

http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review

It has the battery backup necessary for the ZIL, very good sequential write speed, consistent performance over its life, a 5-year warranty, and it's only $235 MSRP for a 100 GB drive. Sounds like a winner to me. This should be the drive of choice for affordable SATA SSD ZIL SLOG devices.
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@kvolaa

Hi, thanks for sharing! The specs seem rather nice. I see it's using more or less the same SandForce controller as the Deneva 2, so performance should be comparable. And thanks for the tip about "conv=sync". I'll make sure to use that in future benchmarking.

Cool to hear you talking about RHEV also. Are you paying RH for RHEV-M or are you running oVirt? I've done some benchmarking from inside different guests with VirtIO HW and I've experienced some kind of logical boundary on writes. Both hosts and storage are connected via 2x1GbE LACP, and when the hosts are doing storage-related tasks, they can easily saturate that connection at 2Gbps read and write. But when I installed an Ubuntu guest and ran bonnie++, write IO was less than 1Gbps, while read IO was a full 2Gbps. Please share with us any tests you make from inside a guest, since that is the experience a "customer" will receive from the system as a whole.

@TheBang

That is a drive that has been on my radar as well; it looks very promising. I don't remember if I found any benchmark of 4k write IOPS at QD=1 or of write latency on it though... Either go with that, or maybe wait for a Deneva 3.

If one is worried about the budget, maybe a Vertex 4 without disabling cache flushing would also work.

/Sebulon
 

jrm@

Daemon
Developer

Reaction score: 473
Messages: 1,205

IOR benchmarks

Code:
% ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Tue Dec 18 16:09:32 2012
Command line used: ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
Machine: FreeBSD awarnach.mathstat.dal.ca

Summary:
        api                = MPIIO (version=2, subversion=2)
        test filename      = /mnt/archives/
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          60.95      60.95       60.95      0.00      60.95      60.95       60.95      0.00  33.59947   EXCEL
read           87.60      87.60       87.60      0.00      87.60      87.60       87.60      0.00  23.37930   EXCEL

Max Write: 60.95 MiB/sec (63.91 MB/sec)
Max Read:  87.60 MiB/sec (91.85 MB/sec)

Run finished: Tue Dec 18 16:10:29 2012
Hardware/Configuration
  • Asus RS300-E7-PS4 1U Server
  • E3-1230V2 Xeon CPU
  • 32GB Memory
  • 4 x Intel 60GB SSD (520 Series)
  • LSI 9205-8e SAS Controller
  • Supermicro SC847E16-RJB0D1
  • 10 x WD30EFRX 3TB Hard Drive

FreeBSD 9.3-RC3 is installed on a ZFS mirror using two of the SSDs.
One SSD is used for a ZIL (I will eventually mirror the ZIL as well) and the other SSD is used for L2ARC.
I created a raidz3 zpool with nine of the 3TB drives.
The file system I used to test with was created with:
# zfs create -o sharenfs="root,network 192.168.0.0,mask 255.255.255.0" -o atime=off -o compression=on -o setuid=off storage/archives

The NFS share was mounted on an 8.3 system. IOR complained without the nolockd option.
# mount_nfs -o nolockd 192.168.101:/storage/archives /mnt/archives

I used IOR 2.10.3.
To compile IOR I had to add these lines to src/C/Makefile.config
Code:
####################
# FreeBSD SETTINGS #
####################
CC.FreeBSD = mpicc
CCFLAGS.FreeBSD = -g -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
LDFLAGS.FreeBSD = -lmpich
HDF5_DIR.FreeBSD =
NCMPI_DIR.FreeBSD =
and I had to modify src/C/utilities.c to replace #include <sys/statfs.h> with
Code:
#include <sys/param.h>
#include <sys/mount.h>
 
Sebulon

Aspiring Daemon

Reaction score: 128
Messages: 709

@jrm

Thank you for these numbers! I'm wondering about your choice of tool for benchmarking: is IOR any "better" or "worse" than the more commonly used bonnie++? What does the network between server and client look like? Any jumbo frames, lagg?

Another fun thing I usually do while benchmarking is to watch gstat to see how stressed the SLOG gets. And how was the pool set up with regards to partition alignment and ashift?

/Sebulon
 

jrm@

Daemon
Developer

Reaction score: 473
Messages: 1,205

Hello @Sebulon;

The reason I used IOR was that a system administrator I know set up a similar storage system, ran his benchmarks with IOR, and I wanted to compare results. I think he mentioned he went with IOR because of the MPI support, but I have little experience with either benchmark program. The differences with his setup are that he is mounting from a GNU/Linux box and on the storage side he chose FreeBSD 8.3 / NFSD (v3) with async and no ZIL. Here are his benchmark results.

Code:
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Mon Dec 10 16:35:30 2012
Command line used: IOR-gcc -a MPIIO -t 1M -b 2G -i 1 -F -o /scratch1/test
Machine: Linux

Summary:
        api                = MPIIO (version=1, subversion=2)
        test filename      = /scratch1/test
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          95.26      95.26       95.26      0.00      95.26      95.26       95.26      0.00  21.50000   EXCEL
read           75.19      75.19       75.19      0.00      75.19      75.19       75.19      0.00  27.23828   EXCEL

Max Write: 95.26 MiB/sec (99.88 MB/sec)
Max Read:  75.19 MiB/sec (78.84 MB/sec)

Run finished: Mon Dec 10 16:36:19 2012
I haven't tinkered with the network yet. Right now there is a single gigabit line going through two switches. The Asus RS300-E7-PS4 has four gigabit ports, so I plan to play with the network. I'll post more benchmark results before/after network tweaks with IOR and bonnie++.

I did pay attention to alignment when I set things up (both the storage pool and the pool the OS is installed on). Here is how I created the pool.

Code:
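# Build the list of data disks: da0 .. da8
DISKS=""
for i in $(seq 0 8); do DISKS="$DISKS da$i"; done

# Label each disk as storage_disk<N>, where N is the da unit number
for I in ${DISKS}; do
        NUM=$( echo ${I} | tr -c -d '0-9' )
        glabel create storage_disk${NUM} /dev/${I}
done
glabel create spare_drive0 /dev/da9

# Temporary 4k-sector gnop provider so the vdev is created with ashift=12
gnop create -S 4096 /dev/label/storage_disk0

zpool create storage raidz3 /dev/label/storage_disk0.nop /dev/label/storage_disk1 ... /dev/label/storage_disk8

# Drop the gnop layer and re-import the pool from the plain labels
zpool export storage
gnop destroy /dev/label/storage_disk0.nop
zpool import -d /dev/label storage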
I see that the ashift for the pool is indeed 12, but, now that you have caused me to take another look, I see the ashift for the ZIL is 9. Hopefully fixing the alignment on the ZIL will give a performance bump. I don't think I paid attention to alignment with the L2ARC either.
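
For anyone wanting to check the same thing, a small sketch: ashift is the base-2 logarithm of the allocation unit, so 9 means 512 bytes and 12 means 4096 bytes. The helper names are made up, and the filter assumes that `zdb -C <pool>` prints lines of the form `ashift: 12`, which may differ between ZFS versions.

```shell
#!/bin/sh
# ashift_to_bytes ASHIFT
# ZFS allocates in units of 2^ashift bytes on a vdev.
ashift_to_bytes() {
    echo $(( 1 << $1 ))
}

# ashift_values
# Reads "zdb -C <pool>" style output on stdin and prints every ashift
# value found, one per line, together with its size in bytes.
ashift_values() {
    awk '$1 == "ashift:" { print $2 }' | while read -r a; do
        echo "ashift=$a ($(ashift_to_bytes "$a") bytes)"
    done
}
```

Usage would be along the lines of `zdb -C storage | ashift_values`, which should list one line per top-level vdev, including the log device.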
 

jrm@

Daemon
Developer

Reaction score: 473
Messages: 1,205

I 4k aligned the ZIL and L2ARC and the performance, according to IOR, actually dropped a little.

Code:
zpool remove storage label/zil
zpool remove storage label/l2arc
gnop create -S 4k label/zil
gnop create -S 4k label/l2arc
zpool add storage log label/zil.nop
zpool add storage cache label/l2arc.nop
zpool export storage
gnop destroy label/zil.nop
gnop destroy label/l2arc.nop
zpool import storage

zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        storage                  ONLINE       0     0     0
          raidz3-0               ONLINE       0     0     0
            label/storage_disk0  ONLINE       0     0     0
            label/storage_disk1  ONLINE       0     0     0
            label/storage_disk2  ONLINE       0     0     0
            label/storage_disk3  ONLINE       0     0     0
            label/storage_disk4  ONLINE       0     0     0
            label/storage_disk5  ONLINE       0     0     0
            label/storage_disk6  ONLINE       0     0     0
            label/storage_disk7  ONLINE       0     0     0
            label/storage_disk8  ONLINE       0     0     0
        logs
          label/zil              ONLINE       0     0     0
        cache
          label/l2arc            ONLINE       0     0     0

errors: No known data errors
Code:
% IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/            
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Thu Dec 20 00:54:57 2012
Command line used: ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
Machine: FreeBSD awarnach.mathstat.dal.ca

Summary:
        api                = MPIIO (version=2, subversion=2)
        test filename      = /mnt/archives/
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          57.83      57.83       57.83      0.00      57.83      57.83       57.83      0.00  35.41387   EXCEL
read           86.74      86.74       86.74      0.00      86.74      86.74       86.74      0.00  23.61122   EXCEL

Max Write: 57.83 MiB/sec (60.64 MB/sec)
Max Read:  86.74 MiB/sec (90.95 MB/sec)

Run finished: Thu Dec 20 00:55:57 2012
Am I missing something? Why would the performance go down compared to the results when the SSDs for the ZIL and L2ARC weren't 4k aligned?
 