Best use of SSDs in ZFS?

Imagine I have a storage server; we'll assume a mixed workload (random and sequential in equal portions, working dataset always smaller than the ARC). The storage controller is the ubiquitous M1015/9240i SAS-6. Currently it has magnetic media in a mirror setup: 2x1TB for the root pool, 2x2TB for the data pool. The root pool is, however, only 220GB (it started on smaller disks and wasn't expanded, so it's effectively short-stroked). The data pool is over 1TB of used space.

If I were to add 2x256GB SSDs, would it be better to simply migrate the root pool onto a mirror of SSDs? Or to use 8GB of each SSD as a mirrored ZIL and gstripe the rest for use as an L2ARC?

I'm aware that L2ARC is not persistent, but gets repopulated after boot from the slower media. I'm also aware that a separate ZIL (SLOG) mainly affects sync (O_SYNC) writes. So obviously boot will be faster just by migrating the root pool to SSD. The machine stays on for a good period of time, so I assume the L2ARC would eventually fill, but until then I'd be stuck with low random IOPS and lowish sequential writes (seq. reads are already OK at 200-300MB/s).

I'm aware I'll only get TRIM support as a vdev (I think?)

My question is simply whether there is any reason to use SSDs as L2ARC/ZIL rather than as vdevs assuming the root pool fits and that's where most IO is happening.
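
For concreteness, the second option would look something like this (device names, sizes and the pool name "data" are just placeholders; as far as I understand, the L2ARC halves don't even need gstripe, since zpool accepts multiple cache devices and stripes across them by itself):
Code:
# assumed: ada2/ada3 are the two SSDs, "data" is the existing data pool
gpart create -s gpt ada2
gpart add -t freebsd-zfs -a 4k -s 8G -l log0   ada2
gpart add -t freebsd-zfs -a 4k        -l cache0 ada2
# (repeat for ada3 with labels log1/cache1)

zpool add data log mirror gpt/log0 gpt/log1    # mirrored SLOG
zpool add data cache gpt/cache0 gpt/cache1     # L2ARC spread across both SSDs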
 
m6tt said:
My question is simply whether there is any reason to use SSDs as L2ARC/ZIL rather than as vdevs assuming the root pool fits and that's where most IO is happening.

I would suggest that you redo your entire setup. I'm having a very hard time understanding why you would want two pools, one for the OS and another for your data. Instead you should have one pool for everything, and use the SSDs as SLOG and L2ARC. I would suggest laying the pool out like this:
Code:
  pool: pool1
 state: ONLINE
 scan: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	pool1            ONLINE       0     0     0
	  raidz2-0       ONLINE       0     0     0
	    gpt/disk1    ONLINE       0     0     0
	    gpt/disk2    ONLINE       0     0     0
	    gpt/disk3    ONLINE       0     0     0
	    gpt/disk4    ONLINE       0     0     0
	logs
	  gpt/log1       ONLINE       0     0     0
	cache
	  gpt/cache1     ONLINE       0     0     0
Be sure to follow proper guides to configure and optimize for 4k. You would be able to boot from any of those four 1TB drives, giving you quadruple protection against boot failures. And the SSDs are used where they shine: IOPS and low latency.
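
A rough sketch of the creation, using the labels from the status above (the gnop step is just the usual trick at the moment to force ashift=12; treat this as an outline and follow a proper guide for the details):
Code:
# force 4k sectors on one member so the vdev is created with ashift=12
gnop create -S 4096 /dev/gpt/disk1
zpool create pool1 raidz2 gpt/disk1.nop gpt/disk2 gpt/disk3 gpt/disk4
zpool add pool1 log gpt/log1
zpool add pool1 cache gpt/cache1
# drop the temporary nop provider; the ashift sticks
zpool export pool1
gnop destroy /dev/gpt/disk1.nop
zpool import pool1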

The filesystems should be laid out like this:
Code:
FS                                   MOUNTPOINT
pool1                                none
pool1/ROOT                           none
pool1/ROOT/default                   /
pool1/ROOT/default/usr               /usr
pool1/ROOT/default/usr/local         /usr/local
pool1/ROOT/default/usr/obj           /usr/obj
pool1/ROOT/default/usr/ports         /usr/ports
pool1/ROOT/default/var               /var
pool1/DATA                           none
pool1/DATA/home                      /usr/home
pool1/DATA/home/joe                  /usr/home/joe
Follow vermaden's guide on setting up the filesystems for use with Boot Environments:
https://forums.freebsd.org/showthread.php?t=31662
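
Roughly, the datasets would be created like this (only a sketch, done from the install environment rather than a running system; vermaden's guide above is the authoritative walkthrough, especially for the boot environment details):
Code:
zfs set mountpoint=none pool1
zfs create -o mountpoint=none pool1/ROOT
zfs create -o mountpoint=/ -o canmount=noauto pool1/ROOT/default
zfs create pool1/ROOT/default/usr        # children inherit /usr, /usr/local, ...
zfs create pool1/ROOT/default/usr/local
zfs create pool1/ROOT/default/var
zfs create -o mountpoint=none pool1/DATA
zfs create -o mountpoint=/usr/home pool1/DATA/home
zpool set bootfs=pool1/ROOT/default pool1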

Have fun!

/Sebulon
 
I'm not sure at all why I'd want to use RAIDZ for any reason.

If we set the SSDs aside, I will lose IOPs and redundancy for capacity.
Capacity is cheap, if I valued it I would buy more disks.
IOPs and reliability are harder to get, so I prefer a mirrored setup (as well as the performance therein).

The reason I use two pools is simple: if I ever want to move the (largely static) data pool, I can export/import just that pool and not pull an OS along as well, for instance to access the same data on Solaris, OpenIndiana or Linux. I have plenty of sources indicating that for N disks I will always get more random IOPS from mirrors. If I need more space, I would start zpool add'ing more mirrors, a la RAID10.
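
(Growing it later is just a matter of adding another mirror vdev; device and pool names below are hypothetical:)
Code:
# ZFS stripes across the vdevs, RAID10-style
zpool add data mirror gpt/disk5 gpt/disk6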

Secondly, although this is just a home hobby machine, it's a holdover from more enterprise-oriented setups. I always create RAID volumes for specific purposes, not one giant RAID5. This protects portions of the system from volume failure and ensures that you use the right kind of RAID for each application. For example, in an app+database server you may eventually scale and want to split the app and the database onto separate servers; this is harder if you've mixed them together onto a glorified RAID5. Or perhaps a scratch/temp volume is fine as a RAID0, but the database prefers RAID10 for reliability/performance, and the backups are fine on RAID5 because we don't care about performance there.

Finally, I'd rather have a storage "partition" where heavy IO on one pool does not affect the IO capabilities of the other pool...I don't want a sequential read from the second pool affected by random IO on the OS pool.

I could be missing something here, but I don't see what I'd gain under RAIDZ.
 
m6tt said:
I'm not sure at all why I'd want to use RAIDZ for any reason.

If we set the SSDs aside, I will lose IOPs and redundancy for capacity.
Capacity is cheap, if I valued it I would buy more disks.
IOPs and reliability are harder to get, so I prefer a mirrored setup (as well as the performance therein).
Just look at the numbers. An ordinary HDD is good for about 100 random IOPS, while your SSDs are good for 100,000 IOPS. So no matter how many more HDDs you put into the pool, they are never going to handle as much stress as the SSDs can.

m6tt said:
The reason I use two pools is simple: if I ever want to move the (largely static) data pool, I can export/import just that pool and not pull an OS along as well, for instance to access the same data on Solaris, OpenIndiana or Linux. I have plenty of sources indicating that for N disks I will always get more random IOPS from mirrors. If I need more space, I would start zpool add'ing more mirrors, a la RAID10.
With one RAID1, you can survive one disk failure. With your four disks configured as a striped mirror (RAID10), you have to lose the "right" two disks, because if two disks go out of the same vdev, you are toast. With the sizes of today's HDDs, the chance of suffering another disk failure while rebuilding has become too great. But with four disks configured as RAID6 (raidz2), you can stand to lose any two disks. That's reliability you can count on.

m6tt said:
Secondly, although this is just a home hobby machine, it's a holdover from more enterprise-oriented setups. I always create RAID volumes for specific purposes, not one giant RAID5. This protects portions of the system from volume failure and ensures that you use the right kind of RAID for each application. For example, in an app+database server you may eventually scale and want to split the app and the database onto separate servers; this is harder if you've mixed them together onto a glorified RAID5. Or perhaps a scratch/temp volume is fine as a RAID0, but the database prefers RAID10 for reliability/performance, and the backups are fine on RAID5 because we don't care about performance there.
Your performance isn't in the HDDs, it's the SSDs that make the difference. You prioritized 1) IOPS and 2) reliability, so that's what you get: the best of both. :)

m6tt said:
Finally, I'd rather have a storage "partition" where heavy IO on one pool does not affect the IO capabilities of the other pool...I don't want a sequential read from the second pool affected by random IO on the OS pool.
With a SLOG, random IO becomes sequential, as ZFS logs all the random traffic, buffers it, and then commits everything in one large transaction. Random reads hit the L2ARC, which can serve them up a thousand times faster than any normal HDD.

The point of ZFS is "hybrid storage": using every component for what it is good at. Configure the SSDs for performance, so that you can design your pool for maximum 1) redundancy and 2) storage.

/Sebulon
 
Sebulon said:
Just look at the numbers. An ordinary HDD is good for about 100 random IOPS, while your SSDs are good for 100,000 IOPS. So no matter how many more HDDs you put into the pool, they are never going to handle as much stress as the SSDs can.

Probably more like 10,000 IOPS max, but I agree.

Sebulon said:
With one RAID1, you can survive one disk failure. With your four disks configured as a striped mirror (RAID10), you have to lose the "right" two disks, because if two disks go out of the same vdev, you are toast. With the sizes of today's HDDs, the chance of suffering another disk failure while rebuilding has become too great. But with four disks configured as RAID6 (raidz2), you can stand to lose any two disks. That's reliability you can count on.

But as the number of disks increases, so does the probability of failure. The percentage of disks that must fail for complete volume destruction in a two-disk mirror is the same as in a four-disk raidz6. A triple mirror can easily be created, pushing the "safe fail" percentage to 66%. Further, recovery on striped-type volumes can harm additional disks due to the heavy IO required to rebuild a disk. ZFS may mitigate this, as it's block-aware, but recovery time is geometrically higher with a striped array.
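
(For example, turning an existing two-way mirror into the three-way one mentioned above is a single attach; names are hypothetical:)
Code:
# gpt/disk1 is an existing member of the mirror vdev, gpt/disk3 the new third side
zpool attach data gpt/disk1 gpt/disk3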


Sebulon said:
With a SLOG, random IO becomes sequential, as ZFS logs all random traffic, buffers it, and then commits everything in one large transaction. Random reads hits the L2ARC and can serve that up a thousand times faster than any normal HDD.
Am I perhaps wrong in thinking that the ZIL/SLOG is only for *sync* writes, and that async writes are cached in memory until they can be committed to the main disks, which in this case would be a raidz6 with the performance of a single disk?
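
(If I wanted to check that, i.e. whether only sync writes hit the SLOG, I could force sync writes on a scratch dataset and watch whether the log vdev sees the traffic; dataset and pool names are made up:)
Code:
zfs create -o sync=always data/synctest   # every write on this dataset goes through the ZIL/SLOG
zpool iostat -v data 1                    # the log device line should only move for sync traffic
zfs set sync=standard data/synctest       # back to normal: only O_SYNC/fsync goes to the log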

I will keep two pools, and set up an L2ARC on the large 2TB pool (I don't have 4x 1TB, not sure where that came from, read OP). This is fine with me (SSD mirror, TRIM on):

Code:
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
flatline.local  32G   113  99 356273  90 195778  51   306  99 802983  90  6545 179
Latency               691ms     783ms    1841ms   36062us   12921us    2887us
Version  1.97       ------Sequential Create------ --------Random Create--------
flatline.local      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22900  98 +++++ +++ 20837  98 22663  97 +++++ +++ 22703  98
Latency             13091us     529us    2972us   28357us      40us      81us
1.97,1.97,flatline.local,1,1358527355,32G,,113,99,356273,90,195778,51,306,99,802983,90,6545,179,16,,,,,22900,98,+++++,+++,20837,98,22663,97,+++++,+++,22703,98,691ms,783ms,1841ms,36062us,12921us,2887us,13091us,529us,2972us,28357us,40us,81us

I may try the old 1TB mirror with L2ARC and ZIL after the L2ARC warms up if I get a chance.
 
That looks very fast, although yet again I find myself wondering why bonnie++ output is so difficult to read. Could you give more details on the hardware that test was run on?
 
Bonnie++ is the worst example of a text mode interface I have ever seen, aside from using "eliza" as your default shell.

Sorry if this is too many details!

10-CURRENT amd64, 2x quad-core 41xx Opterons, 16GB RAM, no swap (at the moment; usually 32G), lots of PCIe lanes, MALLOC_PRODUCTION is on, most debugging stuff is off.

Controller is the classic IBM M1015 flashed to HBA mode.
SSDs are Samsung 830 256GB...I like these, having used one in my laptop for a while.

GPT, partitions start at 2048 (1M) and end 4k-aligned; the pool was gnop'd to set the ashift.

Here's all the ZFS info. Low amount of ghosting. A lot of the cache stats are 0 or 100 percent...I wonder how I turn off just "colinear" prefetch (see the note after the output).

Code:
------------------------------------------------------------------------
ZFS Subsystem Report				Fri Jan 18 13:11:35 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				1000026 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		5000
	ZFS Filesystem Version:			5

FreeBSD 10.0-CURRENT #0 r245590: Fri Jan 18 07:34:28 PST 2013 root
 1:11PM  up  4:26, 1 user, load averages: 0.15, 0.20, 0.17

------------------------------------------------------------------------

System Memory:

	0.45%	72.00	MiB Active,	0.77%	121.27	MiB Inact
	12.09%	1.87	GiB Wired,	0.05%	7.72	MiB Cache
	86.64%	13.41	GiB Free,	0.01%	1.16	MiB Gap

	Real Installed:				16.00	GiB
	Real Available:			99.85%	15.98	GiB
	Real Managed:			96.90%	15.48	GiB

	Logical Total:				16.00	GiB
	Logical Used:			15.39%	2.46	GiB
	Logical Free:			84.61%	13.54	GiB

Kernel Memory:					1.26	GiB
	Data:				98.27%	1.24	GiB
	Text:				1.73%	22.28	MiB

Kernel Memory Map:				20.67	GiB
	Size:				4.97%	1.03	GiB
	Free:				95.03%	19.64	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				699.56k
	Recycle Misses:				705
	Mutex Misses:				318
	Evict Skips:				402

ARC Size:				11.02%	1.10	GiB
	Target Size: (Adaptive)		100.00%	10.00	GiB
	Min Size (Hard Limit):		80.00%	8.00	GiB
	Max Size (High Water):		1:1	10.00	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	77.29%	7.73	GiB
	Frequently Used Cache Size:	22.71%	2.27	GiB

ARC Hash Breakdown:
	Elements Max:				155.93k
	Elements Current:		99.98%	155.90k
	Collisions:				333.04k
	Chain Max:				6
	Chains:					31.96k

------------------------------------------------------------------------

ARC Efficiency:					14.16m
	Cache Hit Ratio:		95.74%	13.56m
	Cache Miss Ratio:		4.26%	602.58k
	Actual Hit Ratio:		93.10%	13.18m

	Data Demand Efficiency:		99.89%	11.76m
	Data Prefetch Efficiency:	22.46%	667.19k

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		2.67%	362.00k
	  Most Recently Used:		34.60%	4.69m
	  Most Frequently Used:		62.64%	8.49m
	  Most Recently Used Ghost:	0.04%	4.92k
	  Most Frequently Used Ghost:	0.06%	7.51k

	CACHE HITS BY DATA TYPE:
	  Demand Data:			86.63%	11.74m
	  Prefetch Data:		1.11%	149.86k
	  Demand Metadata:		10.61%	1.44m
	  Prefetch Metadata:		1.66%	224.58k

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			2.17%	13.06k
	  Prefetch Data:		85.85%	517.33k
	  Demand Metadata:		8.17%	49.25k
	  Prefetch Metadata:		3.81%	22.94k

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					60.13m
	Hit Ratio:			95.13%	57.20m
	Miss Ratio:			4.87%	2.93m

	Colinear:				2.93m
	  Hit Ratio:			0.00%	95
	  Miss Ratio:			100.00%	2.93m

	Stride:					56.56m
	  Hit Ratio:			100.00%	56.56m
	  Miss Ratio:			0.00%	46

DMU Misc:
	Reclaim:				2.93m
	  Successes:			0.10%	3.01k
	  Failures:			99.90%	2.93m

	Streams:				645.47k
	  +Resets:			0.00%	11
	  -Resets:			100.00%	645.45k
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
	kern.maxusers                           1358
	vm.kmem_size                            25769803776
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        329853485875
	vfs.zfs.arc_max                         10737418240
	vfs.zfs.arc_min                         8589934592
	vfs.zfs.arc_meta_used                   689803880
	vfs.zfs.arc_meta_limit                  2684354560
	vfs.zfs.l2arc_write_max                 8388608
	vfs.zfs.l2arc_write_boost               8388608
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.anon_size                       32768
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.mru_size                        626270720
	vfs.zfs.mru_metadata_lsize              152293888
	vfs.zfs.mru_data_lsize                  225083392
	vfs.zfs.mru_ghost_size                  10109991936
	vfs.zfs.mru_ghost_metadata_lsize        9859072
	vfs.zfs.mru_ghost_data_lsize            10100132864
	vfs.zfs.mfu_size                        401924608
	vfs.zfs.mfu_metadata_lsize              19728384
	vfs.zfs.mfu_data_lsize                  268651008
	vfs.zfs.mfu_ghost_size                  61257216
	vfs.zfs.mfu_ghost_metadata_lsize        394752
	vfs.zfs.mfu_ghost_data_lsize            60862464
	vfs.zfs.l2c_only_size                   0
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.nopwrite_enabled                1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.no_write_throttle               0
	vfs.zfs.write_limit_shift               3
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_max                 2144155648
	vfs.zfs.write_limit_inflated            51459735552
	vfs.zfs.write_limit_override            0
	vfs.zfs.prefetch_disable                0
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.top_maxinflight                 32
	vfs.zfs.resilver_delay                  2
	vfs.zfs.scrub_delay                     4
	vfs.zfs.scan_idle                       50
	vfs.zfs.scan_min_time_ms                1000
	vfs.zfs.free_min_time_ms                1000
	vfs.zfs.resilver_min_time_ms            3000
	vfs.zfs.no_scrub_io                     0
	vfs.zfs.no_scrub_prefetch               0
	vfs.zfs.mg_alloc_failures               12
	vfs.zfs.check_hostid                    1
	vfs.zfs.recover                         0
	vfs.zfs.space_map_last_hope             0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.trim_on_init               1
	vfs.zfs.vdev.max_pending                10
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.time_shift                 6
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.vdev.bio_delete_disable         0
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.cache_flush_disable             0
	vfs.zfs.trim_disable                    0
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.zio.exclude_metadata            0
	vfs.zfs.sync_pass_deferred_free         2
	vfs.zfs.sync_pass_dont_compress         5
	vfs.zfs.sync_pass_rewrite               2
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.super_owner                     0
	vfs.zfs.debug                           0
	vfs.zfs.version.acl                     1
	vfs.zfs.version.spa                     5000
	vfs.zfs.version.zpl                     5
	vfs.zfs.trim_txg_limit                  10

------------------------------------------------------------------------
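
Regarding the colinear prefetch question above: I don't see a knob for just the colinear half; the only prefetch-related tunables I can find are the global disable and the zfetch settings from the dump:
Code:
# global file-level prefetch switch
vfs.zfs.prefetch_disable=1
# zfetch (file-level prefetcher) tuning, defaults as in the dump above
vfs.zfs.zfetch.max_streams=8
vfs.zfs.zfetch.min_sec_reap=2
vfs.zfs.zfetch.block_cap=256
vfs.zfs.zfetch.array_rd_sz=1048576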
 
No hard drives in that test.

I could possibly try the original mirror pool with an L2ARC and ZIL. I could break the SSD mirror, thrash one of the disks as a ZIL/L2ARC, and then reformat/resilver it I suppose.

I don't have 4 identical disks, so any raidz or mirror+stripe discussion is somewhat academic, although I could treat the 2TB disks as 1TB disks with a 2TB penalty...
 
Continuing forward, booted from the old disk-based mirror.

I've attached a 200G L2ARC using a partition on one of the Samsung 830s, aligned to 4K (it starts at 16G, as I reserved some space for a ZIL)...except:

Code:
cannot add to 'system': root pool can not have multiple vdevs or separate logs

OK, no separate ZIL (SLOG) possible on the operating system pool...perhaps an argument for a secondary pool?
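
(For reference, the cache device isn't stuck there: it can be removed and re-added to the data pool together with a log later. The labels and the data pool name below are placeholders:)
Code:
zpool remove system gpt/l2arc0        # cache vdevs can be removed at any time
zpool add data cache gpt/l2arc0
zpool add data log   gpt/slog0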
So I will run bonnie++ on the OS pool with just an L2ARC, and then attach it to my secondary pool along with the ZIL and run it again (those are faster disks as well; the 7k3000s are *great*).

Code:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
flatline.local  32G   124  99 96131  28 60073  21   317  99 221937  24  3955  85
Latency               575ms     212ms    1190ms     103ms     216ms     200ms
Version  1.97       ------Sequential Create------ --------Random Create--------
flatline.local      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 11054  56 +++++ +++ 13873  99 22150  99 +++++ +++ 21357  99
Latency               220ms   78463us     421us   32203us     205us     788us
1.97,1.97,flatline.local,1,1358671830,32G,,124,99,96131,28,60073,21,317,99,221937,24,3955,85,16,,,,,11054,56,+++++,+++,13873,99,22150,99,+++++,+++,21357,99,575ms,212ms,1190ms,103ms,216ms,200ms,220ms,78463us,421us,32203us,205us,788us

That's two Seagates in a mirror with an L2ARC...read performance in bonnie++ is about the same as without the L2ARC.
Here's the zfs-stats -a
Code:
------------------------------------------------------------------------
ZFS Subsystem Report				Sun Jan 20 00:42:38 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				1000026 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		5000
	ZFS Filesystem Version:			5

FreeBSD 10.0-CURRENT #3 r245340: Sat Jan 12 13:14:21 PST 2013 root
12:42AM  up 44 mins, 1 user, load averages: 0.70, 0.75, 0.78

------------------------------------------------------------------------

System Memory:

	1.56%	247.57	MiB Active,	0.98%	155.80	MiB Inact
	5.99%	949.89	MiB Wired,	0.01%	1.34	MiB Cache
	91.45%	14.16	GiB Free,	0.01%	1004.00	KiB Gap

	Real Installed:				16.00	GiB
	Real Available:			99.85%	15.98	GiB
	Real Managed:			96.90%	15.48	GiB

	Logical Total:				16.00	GiB
	Logical Used:			10.57%	1.69	GiB
	Logical Free:			89.43%	14.31	GiB

Kernel Memory:					373.21	MiB
	Data:				93.92%	350.51	MiB
	Text:				6.08%	22.70	MiB

Kernel Memory Map:				14.18	GiB
	Size:				1.92%	278.81	MiB
	Free:				98.08%	13.91	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				630.05k
	Recycle Misses:				1.19k
	Mutex Misses:				3.04k
	Evict Skips:				350

ARC Size:				1.46%	149.58	MiB
	Target Size: (Adaptive)		100.00%	10.00	GiB
	Min Size (Hard Limit):		80.00%	8.00	GiB
	Max Size (High Water):		1:1	10.00	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	90.99%	9.10	GiB
	Frequently Used Cache Size:	9.01%	922.49	MiB

ARC Hash Breakdown:
	Elements Max:				272.73k
	Elements Current:		99.46%	271.25k
	Collisions:				297.77k
	Chain Max:				6
	Chains:					72.35k

------------------------------------------------------------------------

ARC Efficiency:					9.72m
	Cache Hit Ratio:		95.20%	9.25m
	Cache Miss Ratio:		4.80%	466.21k
	Actual Hit Ratio:		94.47%	9.18m

	Data Demand Efficiency:		99.91%	8.39m
	Data Prefetch Efficiency:	0.00%	449.87k

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		0.63%	58.53k
	  Most Recently Used:		2.39%	221.23k
	  Most Frequently Used:		96.84%	8.96m
	  Most Recently Used Ghost:	0.10%	9.13k
	  Most Frequently Used Ghost:	0.04%	3.78k

	CACHE HITS BY DATA TYPE:
	  Demand Data:			90.63%	8.39m
	  Prefetch Data:		0.00%	6
	  Demand Metadata:		8.60%	795.92k
	  Prefetch Metadata:		0.77%	71.44k

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			1.69%	7.87k
	  Prefetch Data:		96.49%	449.86k
	  Demand Metadata:		1.34%	6.25k
	  Prefetch Metadata:		0.48%	2.23k

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
	Passed Headroom:			48.06k
	Tried Lock Failures:			29.14k
	IO In Progress:				65
	Low Memory Aborts:			0
	Free on Write:				61
	Writes While Full:			3.72k
	R/W Clashes:				13
	Bad Checksums:				0
	IO Errors:				0
	SPA Mismatch:				43.70k

L2 ARC Size: (Adaptive)				32.27	GiB
	Header Size:			0.12%	40.60	MiB

L2 ARC Breakdown:				460.17k
	Hit Ratio:			2.07%	9.52k
	Miss Ratio:			97.93%	450.65k
	Feeds:					5.05k

L2 ARC Buffer:
	Bytes Scanned:				1.91	TiB
	Buffer Iterations:			5.05k
	List Iterations:			306.98k
	NULL List Iterations:			8.60k

L2 ARC Writes:
	Writes Sent:			100.00%	4.06k

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					48.12m
	Hit Ratio:			99.56%	47.91m
	Miss Ratio:			0.44%	209.33k

	Colinear:				209.33k
	  Hit Ratio:			0.01%	17
	  Miss Ratio:			99.99%	209.32k

	Stride:					47.35m
	  Hit Ratio:			100.00%	47.35m
	  Miss Ratio:			0.00%	14

DMU Misc:
	Reclaim:				209.32k
	  Successes:			0.47%	976
	  Failures:			99.53%	208.34k

	Streams:				558.89k
	  +Resets:			0.00%	8
	  -Resets:			100.00%	558.88k
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
	kern.maxusers                           1358
	vm.kmem_size                            25769803776
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        329853485875
	vfs.zfs.arc_max                         10737418240
	vfs.zfs.arc_min                         8589934592
	vfs.zfs.arc_meta_used                   115336200
	vfs.zfs.arc_meta_limit                  2684354560
	vfs.zfs.l2arc_write_max                 8388608
	vfs.zfs.l2arc_write_boost               8388608
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.anon_size                       46592
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.mru_size                        79300608
	vfs.zfs.mru_metadata_lsize              23968768
	vfs.zfs.mru_data_lsize                  41249792
	vfs.zfs.mru_ghost_size                  84169216
	vfs.zfs.mru_ghost_metadata_lsize        26213888
	vfs.zfs.mru_ghost_data_lsize            57955328
	vfs.zfs.mfu_size                        5024256
	vfs.zfs.mfu_metadata_lsize              1552384
	vfs.zfs.mfu_data_lsize                  274944
	vfs.zfs.mfu_ghost_size                  10646612480
	vfs.zfs.mfu_ghost_metadata_lsize        2743808
	vfs.zfs.mfu_ghost_data_lsize            10643868672
	vfs.zfs.l2c_only_size                   23872720896
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.nopwrite_enabled                1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.no_write_throttle               0
	vfs.zfs.write_limit_shift               3
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_max                 2144156160
	vfs.zfs.write_limit_inflated            51459747840
	vfs.zfs.write_limit_override            0
	vfs.zfs.prefetch_disable                0
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.top_maxinflight                 32
	vfs.zfs.resilver_delay                  2
	vfs.zfs.scrub_delay                     4
	vfs.zfs.scan_idle                       50
	vfs.zfs.scan_min_time_ms                1000
	vfs.zfs.free_min_time_ms                1000
	vfs.zfs.resilver_min_time_ms            3000
	vfs.zfs.no_scrub_io                     0
	vfs.zfs.no_scrub_prefetch               0
	vfs.zfs.mg_alloc_failures               12
	vfs.zfs.check_hostid                    1
	vfs.zfs.recover                         0
	vfs.zfs.space_map_last_hope             0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.trim_on_init               1
	vfs.zfs.vdev.max_pending                10
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.time_shift                 6
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.vdev.bio_delete_disable         0
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.cache_flush_disable             0
	vfs.zfs.trim_disable                    0
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.zio.exclude_metadata            0
	vfs.zfs.sync_pass_deferred_free         2
	vfs.zfs.sync_pass_dont_compress         5
	vfs.zfs.sync_pass_rewrite               2
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.super_owner                     0
	vfs.zfs.debug                           0
	vfs.zfs.version.acl                     1
	vfs.zfs.version.spa                     5000
	vfs.zfs.version.zpl                     5
	vfs.zfs.trim_txg_limit                  64

------------------------------------------------------------------------
 
And here's the 7k3000 mirror with an L2ARC and ZIL.

First though, let me express the following flaws in this methodology!
-Bonnie++ uses random data equal to 2x RAM amount by default. L2ARC is optimized for frequently used data. In essence, this test is similar to the creation of an iSCSI volume and immediately accessing it. Subsequent access should improve in read speed (it's too large for ARC, but would make it into the L2ARC which is much larger, until cache pressure forced it out).
-These disks are highly dissimilar to the Seagates used in the previous test...they have much better performance
-These disks *only* contain data (as opposed to both previous tests, where the operating system was running from the pool being tested). This means OS-related IO is not occurring on these disks, leaving bonnie++ free from disk access that may have affected the previous tests.
-Only one SSD was applied to the disk tests, although it does not appear that it would help to use two.

Without further ado, here are the bonnie++ numbers:
Code:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
flatline.local  32G   139  99 110579  29 62766  21   326  99 229942  34  4623 131
Latency               504ms     185ms    1311ms   32392us     407ms   29119us
Version  1.97       ------Sequential Create------ --------Random Create--------
flatline.local      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 25858  99 +++++ +++ 23348  99 23557 100 13882  33 14571  98
Latency             16242us     128us     187us   33427us     201ms   12057us
1.97,1.97,flatline.local,1,1358669768,32G,,139,99,110579,29,62766,21,326,99,229942,34,4623,131,16,,,,,25858,99,+++++,+++,23348,99,23557,100,13882,33,14571,98,504ms,185ms,1311ms,32392us,407ms,29119us,16242us,128us,187us,33427us,201ms,12057us

and zfs-stats -a
Code:
------------------------------------------------------------------------
ZFS Subsystem Report				Sun Jan 20 01:10:55 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				1000026 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		5000
	ZFS Filesystem Version:			5

FreeBSD 10.0-CURRENT #3 r245340: Sat Jan 12 13:14:21 PST 2013 root
 1:10AM  up  1:12, 1 user, load averages: 0.44, 0.60, 0.68

------------------------------------------------------------------------

System Memory:

	1.54%	243.67	MiB Active,	1.10%	173.95	MiB Inact
	6.36%	1008.80	MiB Wired,	0.01%	1.34	MiB Cache
	90.99%	14.08	GiB Free,	0.01%	1004.00	KiB Gap

	Real Installed:				16.00	GiB
	Real Available:			99.85%	15.98	GiB
	Real Managed:			96.90%	15.48	GiB

	Logical Total:				16.00	GiB
	Logical Used:			10.90%	1.74	GiB
	Logical Free:			89.10%	14.26	GiB

Kernel Memory:					364.67	MiB
	Data:				93.78%	341.97	MiB
	Text:				6.22%	22.70	MiB

Kernel Memory Map:				14.15	GiB
	Size:				1.87%	271.12	MiB
	Free:				98.13%	13.88	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				1.34m
	Recycle Misses:				2.38k
	Mutex Misses:				4.58k
	Evict Skips:				350

ARC Size:				1.41%	144.34	MiB
	Target Size: (Adaptive)		100.00%	10.00	GiB
	Min Size (Hard Limit):		80.00%	8.00	GiB
	Max Size (High Water):		1:1	10.00	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	92.10%	9.21	GiB
	Frequently Used Cache Size:	7.90%	808.85	MiB

ARC Hash Breakdown:
	Elements Max:				351.40k
	Elements Current:		86.71%	304.70k
	Collisions:				686.83k
	Chain Max:				9
	Chains:					82.62k

------------------------------------------------------------------------

ARC Efficiency:					19.62m
	Cache Hit Ratio:		95.28%	18.69m
	Cache Miss Ratio:		4.72%	925.83k
	Actual Hit Ratio:		93.22%	18.29m

	Data Demand Efficiency:		99.92%	16.73m
	Data Prefetch Efficiency:	0.00%	898.94k

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		2.04%	381.74k
	  Most Recently Used:		2.42%	452.54k
	  Most Frequently Used:		95.41%	17.83m
	  Most Recently Used Ghost:	0.09%	17.27k
	  Most Frequently Used Ghost:	0.03%	6.06k

	CACHE HITS BY DATA TYPE:
	  Demand Data:			89.43%	16.72m
	  Prefetch Data:		0.00%	6
	  Demand Metadata:		8.40%	1.57m
	  Prefetch Metadata:		2.17%	405.07k

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			1.48%	13.67k
	  Prefetch Data:		97.10%	898.94k
	  Demand Metadata:		1.05%	9.73k
	  Prefetch Metadata:		0.38%	3.49k

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
	Passed Headroom:			91.31k
	Tried Lock Failures:			58.76k
	IO In Progress:				65
	Low Memory Aborts:			0
	Free on Write:				147
	Writes While Full:			7.19k
	R/W Clashes:				22
	Bad Checksums:				0
	IO Errors:				0
	SPA Mismatch:				1.97m

L2 ARC Size: (Adaptive)				32.05	GiB
	Header Size:			0.14%	47.23	MiB

L2 ARC Breakdown:				919.78k
	Hit Ratio:			2.00%	18.43k
	Miss Ratio:			98.00%	901.35k
	Feeds:					9.57k

L2 ARC Buffer:
	Bytes Scanned:				3.47	TiB
	Buffer Iterations:			9.57k
	List Iterations:			578.62k
	NULL List Iterations:			69.31k

L2 ARC Writes:
	Writes Sent:			100.00%	7.82k

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					95.99m
	Hit Ratio:			99.72%	95.72m
	Miss Ratio:			0.28%	270.70k

	Colinear:				270.70k
	  Hit Ratio:			0.01%	31
	  Miss Ratio:			99.99%	270.67k

	Stride:					94.61m
	  Hit Ratio:			100.00%	94.61m
	  Miss Ratio:			0.00%	51

DMU Misc:
	Reclaim:				270.67k
	  Successes:			0.45%	1.22k
	  Failures:			99.55%	269.45k

	Streams:				1.10m
	  +Resets:			0.00%	9
	  -Resets:			100.00%	1.10m
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------
... trimmed, same as above (I'm over the 10000 limit)

My conclusion is that if I were running an iSCSI backend or an NFS server and space was a priority, I would run a triple (or quadruple) mirror or raidz6 with L2ARC and ZIL. Although the percentage of disks that must fail before the array is compromised is the same as for a single mirror, I must acknowledge Sebulon's comment: the *probability* of failure is higher for a single mirror. The reason is that the probability of three disks failing before replacement (assumed to happen within 72 hours) is much lower than the probability of two disks failing within 72 hours. Both are quite low, however. In the raidz6 case specifically, I would also want to test recovery time after the maximum permitted failure, as this changes the probabilities somewhat (the remaining disks would be under much heavier than normal IO). The reason I would choose this setup in those cases is that it is economically much easier to hit 10TB+ of storage while retaining reasonable write speeds, and read speeds for non-random data.
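
(A rough back-of-the-envelope illustration, assuming a made-up 1% chance that any given surviving disk dies during a 72-hour rebuild:)
Code:
2-way mirror, one disk already dead:
  P(loss) = P(the one remaining disk fails)                = 0.01             = 1%

4-disk raidz2, one disk already dead:
  P(loss) = P(2 of the 3 remaining disks fail) ~ C(3,2) * 0.01^2 = 0.0003     = 0.03%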

In my particular case, I don't need space at all. It makes more sense for me to accept the minuscule additional risk of failure in exchange for the performance. If I ever need to scale storage without exponential expense, however, I would have to consider something like the above, tuned to my working set. If I ever need to scale reliability, I would add a third SSD from a different manufacturer but with similar performance (unfortunately mine have sequential SN#...).
 
m6tt said:
Probably more like 10,000 IOPS max, but I agree.
Just for the record, I googled a bit to get the most "true" numbers I could find, and you were closer in your guess. These are the numbers that Samsung has posted for the 830:
http://www.anandtech.com/show/4863/the-samsung-ssd-830-review
Random Read: Up to 80k
Random Write: Up to 36k
But when actually tested, here's what it really turns out like:
http://www.custompcreview.com/reviews/samsung-830-revisited-256gb-ssd-review/15023/6/
Random Read: 4828
Random Write: 21355

You see, I knew I had seen numbers for the drive that I recalled were up around 100k, but I never took the time to see it really get tested. Well, now we know. :)

m6tt said:
I will keep two pools, and set up an L2ARC on the large 2TB pool (I don't have 4x 1TB, not sure where that came from, read OP).
My bad, I completely misread that. Well, in that case I agree, keep it like that.

And about bonnie++, yes, you won't see the L2ARC helping you out at all, since it generates unique random data to test with every time. That is also true for any other benchmarking tool you may use. But you could think of bonnie++ as giving you the "worst case" numbers; they will only get better with time. Also note that ZFS doesn't cache streaming data (e.g. big files) in L2ARC by default. If you want that, you can insert this:
/etc/sysctl.conf:
Code:
vfs.zfs.l2arc_noprefetch=0
This had an extremely positive effect on our workload (virtual stores over NFS, application data over iSCSI and user data over SMB).
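
If the sysctl is writable at runtime on your build, you can also just flip it on the fly to try it out before making it permanent:
Code:
sysctl vfs.zfs.l2arc_noprefetch=0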

PS. There's no such thing as "raidz6". That would imply a vdev with 6x parity. What you are referring to is dual-parity raidz2, RAID6, or RAID-DP (NetApp).

/Sebulon
 
Sebulon said:
Just for the record, I googled a bit to get the most "true" numbers I could find, and you were closer in your guess. These are the numbers that Samsung has posted for the 830:
http://www.anandtech.com/show/4863/the-samsung-ssd-830-review

But when actually tested, here's what it really turns out like:
http://www.custompcreview.com/reviews/samsung-830-revisited-256gb-ssd-review/15023/6/


You see, I knew I had seen numbers for the drive that I recalled were up around 100k, but I never took the time to see it really get tested. Well, now we know. :)

I have seen the same thing with SandForce etc...they were claiming 20K IOPS on the first generation and getting about 2K. 100K IOPS would be great, but I never actually read the marketing and assumed something around 10K would be a good guess.

Sebulon said:
And about bonnie++, yes, you won't see the L2ARC helping you out at all, since it generates unique random data to test with every time. That is also true for any other benchmarking tool you may use. But you could think of bonnie++ as giving you the "worst case" numbers; they will only get better with time. Also note that ZFS doesn't cache streaming data (e.g. big files) in L2ARC by default. If you want that, you can insert this:
/etc/sysctl.conf:
Code:
vfs.zfs.l2arc_noprefetch=0
This had an extremely positive effect on our workload (virtual stores over NFS, application data over iSCSI and user data over SMB).
Yes, bonnie++ is a worst-case test. In real life on a file server (and in my own, unfortunately undocumented, tests with L2ARC) performance increases greatly, especially seek times for frequently accessed files. I wish I had known about that sysctl; since I think part of the caching methodology is "recently accessed", I wonder whether it would still have accelerated bonnie++'s sequential performance.


Sebulon said:
PS. There's no such thing as "raidz6". That would imply a vdev with 6x parity. What you are referring to is dual-parity raidz2, RAID6, or RAID-DP (NetApp).
Sorry, I misread something earlier! I haven't run RAIDZ at all, so my knowledge of it is quite lacking :) You can mentally transpose my silly "raidz6" to "raidz where n/2 disks may fail without compromising the array".
 