The best SAN

Hi

I will soon be building a server to act as a SAN and had thought of using ZFS, but I do not know if the performance will be up to par, as other people have recommended Linux-based options to me.

This SAN will serve virtual machines running under Xen and VMware. I had intended to use 2 SSDs as cache and log devices, an Intel E5-based processor with 16GB of memory, 16 × 1TB SATA HDDs, and iSCSI connectivity over 10GbE cards. I would like to know whether there would be an I/O bottleneck or anything similar. Although I had clearly settled on FreeBSD 9.2 with ZFS, several people have recommended Linux-based systems, so I do have doubts :(.

Does anyone have benchmarks? Any tuning recommendations for best performance? Has anyone run a similar scenario? Thanks!

Best regards
 
What is your IO requirement and what is your storage capacity requirement?

Performance of the above will vary wildly depending on whether you use mirrors, RAIDZ, RAIDZ2, etc.

FreeBSD 9.2 is not out yet, perhaps you mean FreeBSD 8.2?
 
For iSCSI, use jumbo frames (MTU 9000 instead of ~1500).
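A minimal sketch of how that is usually set on FreeBSD, assuming the 10GbE interface is ix0 and the address is made up (adjust both to your setup, and make sure the switch ports also allow jumbo frames):
Code:
# one-off change on the storage interface
ifconfig ix0 mtu 9000

# persistent setting in /etc/rc.conf
ifconfig_ix0="inet 10.0.0.10 netmask 255.255.255.0 mtu 9000"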

RAID10 will give high read and write speeds; RAID5/6 will be too slow (in terms of IOPS) for that scenario. Use 2 SSDs for the ZIL (mirrored) and another one for L2ARC.
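Attaching those to an existing pool is just two commands; the pool and device names below are placeholders:
Code:
# mirrored SLOG from two SSDs, plus a single SSD for L2ARC
zpool add tank log mirror da1 da2
zpool add tank cache da3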

Also check HAST with CARP or uCARP to make that storage highly available.
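For reference, a minimal /etc/hast.conf looks roughly like the sketch below (hostnames, addresses and the provider are made up); the same file goes on both nodes, and the pool is then built on /dev/hast/<resource> on whichever node is primary:
Code:
resource shared0 {
        on san-a {
                local /dev/da0
                remote 10.0.0.2
        }
        on san-b {
                local /dev/da0
                remote 10.0.0.1
        }
}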
 
A note on all of the above: VMs are a good candidate for dedup.

Beware though that dedup is very memory dependent. A pair of striped SSDs for CACHE and a mirror for LOG is highly recommended.
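If you want to know up front whether dedup is worth it, zdb can simulate it against an existing pool; a commonly quoted rule of thumb is to budget on the order of 320 bytes of ARC/L2ARC per allocated block for the dedup table. The pool name below is a placeholder:
Code:
# prints a simulated DDT histogram and the dedup ratio you would get
zdb -S tank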
 
Do not use ZFS dedup as long as you're concerned about speed. Even if you have 1TB of memory, it's SLOW. If you need dedup, I highly recommend using DragonFly BSD and HAMMER, as the dedup implementation in HAMMER is a lot more optimized for speed.

I have no problems maxing out a gigabit line with /usr/ports/net/istgt.
Will it be faster than Linux? Probably not. Will it be more reliable? Not sure; that depends on istgt, and there just aren't many people with good, long production experience with it yet. ZFS alone is very reliable though, not to mention all the other good things you get by using it. ZFS is not slow on FreeBSD at all. If it's slow, you or your hardware is doing something really wrong.
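For reference, exporting a zvol through istgt boils down to something like the fragment below; the dataset and target names are made up, and the sample istgt.conf shipped with the port has the full set of sections (portal groups, initiator groups, and so on):
Code:
# back the LUN with a zvol
zfs create -V 500G tank/vm-disk01

# fragment of /usr/local/etc/istgt/istgt.conf
[LogicalUnit1]
  TargetName   vm-disk01
  Mapping      PortalGroup1 InitiatorGroup1
  UnitType     Disk
  LUN0 Storage /dev/zvol/tank/vm-disk01 Auto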
 
olav said:
Do not use ZFS dedup as long as you're concerned about speed. Even if you have 1TB of memory, it's SLOW. If you need dedup, I highly recommend using DragonFly BSD and HAMMER, as the dedup implementation in HAMMER is a lot more optimized for speed.

I don't want to start a flame war here, but if dedup is used properly it:

a) Increases effective storage capacity
b) Can increase read performance on deduped data.

There is a big myth regarding dedup and FreeBSD.
 
olav said:
Do not use ZFS dedup as long as you're concerned about speed. Even if you have 1TB of memory, it's SLOW. If you need dedup, I highly recommend using DragonFly BSD and HAMMER, as the dedup implementation in HAMMER is a lot more optimized for speed.

Considering we can saturate a gigabit link (915 Mbps so far) doing a "zfs send|zfs recv" between two systems (both with dedupe enabled); and we can hit just shy of 400 MBps of writes to a ZFS pool with dedupe enabled; and considering we can do rsync backups of 50 remote school servers in under 5 hours writing to a ZFS pool with dedupe enabled, I would have to say that the dedupe implementation in ZFS is not slow across the board.

Is there a performance hit? Of course, as every write now includes an update to the DDT. Is it horribly slow? Of course not. Can it be horribly slow? Sure, if you don't have enough RAM to hold the entire DDT, and you don't have enough L2ARC to hold the entire DDT.

But, there are ways to mitigate the performance hit.

And, not every workload suits a dedupe setup. Sometimes, just compression works better than dedupe or dedupe+compression. It all depends on your workload.
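Both are per-dataset properties, so you can mix and match instead of deciding pool-wide; the dataset names below are just examples:
Code:
zfs set compression=lzjb tank/vms       # cheap on CPU, usually a safe default
zfs set dedup=on tank/backups           # only where the data actually dedupes well
zfs get compressratio,dedup tank/backups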

ZFS is not slow on FreeBSD at all. If it's slow, you or your hardware is doing something really wrong.

Exactly. ;)
 
This is the main backups server for all non-school sites, including web servers, groupware servers, database servers, SMTP servers, file/print servers, etc. There are about 50-odd servers here, with only about 20 installed directly on physical hardware. The rest are VMs (using LVM for disk storage, not filesystem image files).

Code:
[fcash@alphadrive  ~]$ zdb -DD storage
DDT-sha256-zap-duplicate: 39820749 entries, size 1166 on disk, 188 in core
DDT-sha256-zap-unique: 84371551 entries, size 1194 on disk, 193 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    80.5M   8.18T   5.28T   5.61T    80.5M   8.18T   5.28T   5.61T
     2    22.8M   2.52T   1.80T   1.88T    50.0M   5.51T   3.95T   4.12T
     4    9.10M    836G    604G    639G    46.1M   4.10T   2.97T   3.14T
     8    2.52M    230G    144G    156G    27.7M   2.47T   1.52T   1.65T
    16    1.67M    180G   91.9G    101G    37.6M   4.00T   2.03T   2.22T
    32     981K    109G   66.2G   70.5G    41.7M   4.63T   2.84T   3.02T
    64     329K   37.0G   24.4G   25.7G    28.6M   3.22T   2.07T   2.19T
   128     468K   57.3G   35.3G   37.0G    85.1M   10.4T   6.38T   6.70T
   256     158K   18.2G   10.6G   11.2G    43.4M   4.92T   2.92T   3.10T
   512    17.3K   1.46G    884M    973M    11.0M    873G    518G    576G
    1K    2.38K    165M   66.1M   82.4M    3.13M    208G   84.7G    106G
    2K      737   22.9M   11.3M   16.1M    2.17M   65.5G   31.4G   45.8G
    4K      431   16.4M   7.78M   10.8M    2.10M   79.1G   39.1G   53.9G
    8K      310   1.55M    700K   2.93M    3.66M   19.2G   9.42G   36.4G
   16K       28    261K     31K    240K     570K   5.71G    612M   4.76G
   32K        7    260K   11.5K   71.9K     278K   8.90G    419M   2.72G
   64K        4    257K   12.5K   48.0K     329K   21.7G   1.09G   3.91G
  128K        3   1.50K   1.50K   24.0K     510K    255M    255M   3.98G
 Total     118M   12.1T   8.04T   8.50T     464M   48.7T   30.6T   32.6T

dedup = 3.83, compress = 1.59, copies = 1.06, dedup * compress / copies = 5.73
Code:
[fcash@alphadrive  ~]$ zfs list
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
storage                                 34.9T  4.78T   288K  none
IOW, ~150 TB of data stored in just 35 TB of disk. :)
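As a side note for anyone sizing RAM for a table like that: the "in core" figure in the zdb -DD header lines above is (roughly) bytes per entry, so for alphadrive the whole DDT costs approximately:
Code:
39820749 entries * 188 bytes ~  7.0 GiB   (duplicate DDT)
84371551 entries * 193 bytes ~ 15.2 GiB   (unique DDT)
                       total ~ 22 GiB of ARC + L2ARC to hold the entire DDT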

And this is the main backups server for the schools. There's 1 Linux server per school that is the "jack-of-all-trades" server handling proxy, cache, web, mail, packet filtering (in the elementary schools only), file/print, and diskless booting duties. Secondary schools also have separate webmail and firewall boxes.

There's approx 100-odd physical servers being backed up every night, in under 5 hours. Gotta love rsync. :)

Code:
[fcash@betadrive  ~]$ zdb -DD storage
DDT-sha256-zap-duplicate: 18470415 entries, size 784 on disk, 177 in core
DDT-sha256-zap-unique: 60064456 entries, size 751 on disk, 169 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    57.3M   5.11T   3.54T   3.67T    57.3M   5.11T   3.54T   3.67T
     2    11.2M   1.14T    935G    955G    24.8M   2.53T   2.04T   2.08T
     4    3.72M    403G    261G    270G    18.1M   1.89T   1.24T   1.28T
     8    1.11M   90.9G   54.8G   58.5G    11.6M    960G    583G    622G
    16     447K   42.7G   20.2G   21.6G    8.92M    853G    405G    434G
    32     809K   27.3G   18.2G   22.3G    38.5M   1.33T    922G   1.09T
    64     293K   23.4G   11.3G   12.7G    26.7M   2.14T   1.03T   1.15T
   128    50.8K   1.80G    782M   1.04G    8.06M    266G    119G    164G
   256    10.4K    229M    132M    192M    3.55M   84.4G   49.0G   69.2G
   512    5.05K    196M    126M    152M    3.48M    137G   90.1G    108G
    1K    1.40K   42.6M   24.4M   32.2M    1.87M   59.3G   33.8G   44.2G
    2K    1.03K   32.6M   13.1M   19.0M    2.77M   85.2G   34.6G   50.4G
    4K      407   14.8M   6.50M   8.85M    2.26M   77.7G   33.6G   47.1G
    8K      183   4.83M   1.68M   2.74M    2.08M   54.4G   18.2G   30.5G
   16K      143   2.63M    930K   1.77M    3.27M   56.6G   19.3G   39.6G
   32K      263   1.76M    716K   2.36M    9.15M   77.7G   33.0G   92.1G
   64K       11    145K     12K   85.2K     867K   9.90G    898M   6.50G
  128K        8   4.50K      4K   56.8K    1.41M    799M    723M   10.0G
  256K        6   11.5K   6.50K   42.6K    2.01M   3.18G   1.90G   14.3G
 Total    74.9M   6.82T   4.81T   4.98T     227M   15.7T   10.1T   11.0T

dedup = 2.20, compress = 1.55, copies = 1.08, dedup * compress / copies = 3.15
Code:
[fcash@betadrive  ~]$ zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
storage                         11.6T  10.2T   256K  none
So, approximately 35 TB of data stored in 11 TB of disk.

The off-site backups server currently has a dead OS drive, so I can't connect to it to grab stats for that one. But it receives ZFS sends from the above servers across a gigabit link, and iftop shows it hitting peaks of 915 Mbps, with the 40s average hovering above 600 Mbps.

All of the above have LZJB compression and dedupe enabled.

Forgot to mention that all three pools are configured the same:
  • 4 raidz2 vdevs with 6 drives each
  • 32 GB cache vdev
  • 2 GB log vdev (although this is currently only for testing)
  • 16 GB UFS for OS
  • 8 GB of swap (the last 4 items are all on the same SSD)
  • 20, 24, 32 GB of RAM (alpha/beta/omegadrive respectively)
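For anyone wondering what that layout looks like at pool-creation time, a rough sketch is below; the device names and GPT labels are made up, and in practice the cache and log partitions come from the shared SSD described above:
Code:
zpool create storage \
    raidz2 da0  da1  da2  da3  da4  da5  \
    raidz2 da6  da7  da8  da9  da10 da11 \
    raidz2 da12 da13 da14 da15 da16 da17 \
    raidz2 da18 da19 da20 da21 da22 da23
zpool add storage cache gpt/cache
zpool add storage log gpt/log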
 
phoenix said:
Forgot to mention that all three pools are configured the same:
  • 4 raidz2 vdevs with 6 drives each
  • 32 GB cache vdev
  • 2 GB log vdev (although this is currently only for testing)
  • 16 GB UFS for OS
  • 8 GB of swap (the last 4 items are all on the same SSD)
  • 20, 24, 32 GB of RAM (alpha/beta/omegadrive respectively)

Excellent setup! I really love the part where the last four items all live on the same SSD. You forgot to mention the version this is running on, although I assume you are following 8.2-STABLE.

Everyone, can you think of a better solution on this budget?
 
Hi,

I had thought of this configuration:

Code:
20 HDD SATA 3 (6Gbps) 2TB (Storage)
2 SSD OCZ Vertex3 MAX IOPS 240GB or SSD OCZ Vertex4 128GB (Cache)
2 SSD OCZ Vertex3 120GB (Log)
48GB or 96GB DIMM
Dual Intel Xeon 5620
LSI hard disk controller

I've been looking through the forum for configuration examples to use in setting up my SAN. I had planned to use FreeBSD 9.0, raidz2 (or something similar: raidz? none?), ucarp and HAST for high availability.

The biggest question I have is whether the SAN will deliver enough performance (IOPS) for the Xen/VMware virtual machines connected via iSCSI.

I also have doubts about how to create the pools: whether to create one per VPS or one common to the entire system, etc., for the best performance of the machine, and whether or not to enable compression (I guess it penalizes performance).

Do I need to create a hardware RAID? I think not, because that is handled through ZFS, right?

I will read through all your comments and report back on whether this is a viable option. Thanks!

Regards.
 
SacamantecaS said:
The biggest question I have is whether the SAN will deliver enough performance (IOPS) for the Xen/VMware virtual machines connected via iSCSI.

Really depends on how I/O intensive your VMs are. If they do very little I/O but need lots of space then you'd go RAIDZ or RAIDZ2.

If they need lots of write I/O (e.g., MS Exchange) then you'd go for striped mirrors to get more VDEVs. In ZFS, write performance is the same as the write performance of one drive per VDEV (in terms of I/Os per second, not throughput - throughput being megabytes per second).

So with six drives, RAIDZ2 would give you the write performance of one drive as you only have one VDEV. With three striped mirrors, you'd get the write performance of three drives (forgetting the effects of caching, etc here for a moment).

Unless you really need the space, go for striped mirrors. You can also expand a striped mirror set by adding another mirror VDEV with two drives. To expand a RAIDZ2, you need to create another VDEV to stripe across. If that VDEV is also going to be RAIDZ2, you need another six drives.

With the above config I'd go for ten mirror VDEVs.
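Something along these lines, sketched with made-up device names for the 20 data disks:
Code:
# ten 2-way mirror vdevs, striped
zpool create tank \
    mirror da0 da1   mirror da2 da3   mirror da4 da5   mirror da6 da7 \
    mirror da8 da9   mirror da10 da11 mirror da12 da13 mirror da14 da15 \
    mirror da16 da17 mirror da18 da19

# growing it later is just one more mirror vdev
zpool add tank mirror da20 da21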

Do I need to create a hardware RAID? I think not, because that is handled through ZFS, right?

I will read through all your comments and report back on whether this is a viable option. Thanks!

Regards.

No hardware RAID. If you have a RAID controller that won't present single disks, you can set them up as individual RAID0 LUNs with a single disk each (this is what I ended up doing with a Dell PERC 6: six drives = six RAID0 arrays of one drive each).
 
SacamantecaS said:
2 SSD OCZ Vertex3 120GB (Log)
That is mirrored, I presume, since you won't gain any performance by striping them. And I can definitely say that the 128GB is only half as good as the 240GB. For a SLOG you want as much performance as you can get, and in this case, bigger is better.

The SandForce SF-2281-controlled Vertex3 is worse at handling data that is non-compressible (which is everything going through the ZIL). The new Indilinx Everest 2-controlled Vertex4 does not suffer from that weakness. Look at the difference between the Vertex3 240GB and the Vertex4 256GB:

Anandtech - OCZ Vertex 4 Review (256GB, 512GB) QD=3


NFS write performance with mirrored ZIL:
t1066 said:
Just found the following article. Basically, it says that sync writes to a separate ZIL are done at a queue depth of 1 (ZFS sends out sync write requests one at a time to the log drive and waits for each write to finish before sending the next one. It also works in a round-robin fashion, similar to how cache devices work; hence striped log devices would not help). So the relevant figure is the IOPS at queue depth 1 only, not the maximum IOPS.
Even though this test is at QD=3 rather than QD=1, we can clearly see that the Vertex4 would be a serious ZIL boost compared to the Vertex3.

But while I know that the Vertex3 obeys cache-flush commands sent on power loss, and I hope the Vertex4 does as well, I went with OCZ's Deneva 2 R-series for the SLOG, since it has a built-in supercapacitor that ensures all in-flight writes reach stable storage.

/Sebulon
 
These are simple consumer SSDs (Kingston SSDNow V 64 GB):
Code:
[root@omegadrive ~]# gpart show -l ada0
=>       34  125045357  ada0  GPT       (59G)
         34        256     1  boot      (128k)
        290       1758        - free -  (879k)
       2048   33554432     2  root      (16G)
   33556480   16777216     3  swap      (8.0G)
   50333696    4194304     4  log       (2.0G)
   54528000   67108864     5  cache     (32G)
  121636864    3408527        - free -  (1.6G)
 
Hi,

So I will discard the hardware RAID option (as I had thought of at first), which would have forced me to create the RAID up front (I think). I will still use the controller, because it is necessary for that number of disks. When not using hardware RAID, could options such as a battery-backed controller still improve anything?

With two replicated SANs, I think raidz would be enough, right?

I have read all your advice and I still have some doubts, but in the coming weeks I will have all the hardware, so I would rather try everything you have suggested and perhaps clear up some of my doubts before bothering you again with this.

I also attach some links I have read about this:

http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance
http://www.zfsbuild.com/2010/06/03/howto-our-zpool-configuration/
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
http://icesquare.com/wordpress/how-to-improve-zfs-performance/
http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/filesystems-zfs.html

I promise to write up the conclusions of my tests :). Thanks!

Best regards
 