Failover cluster. CARP + LAGG + HAST + iSCSI

Hello.

I'm new to the forum, and I'm also new to FreeBSD. I have good experience with Linux systems, but for this system I think FreeBSD is what I want.
I have tried searching, but I cannot, for the life of me, wrap my head around a solid setup.

What I'm trying to achieve is a storage cluster with automatic failover for use as storage for virtualization hosts.
If a storage server dies, there should be only minimal downtime. The virtualization hosts accept iSCSI and NFS.
Currently, I have set up two storage servers with CARP, in my lab.

The new storage servers will have:
  • 8 × 3 TB disks each
  • USB stick with the OS
  • 6 NICs

I have some questions about further setup:
  1. How should I arrange my disks for best performance and redundancy?
  2. Should I create a striped mirror on each storage server and then put HAST on top of that?
  3. Should I only create HAST volume on the master node?
  4. How should I configure iSCSI when I use two hosts?
  5. Can I use LAGG + CARP?

Would the performance of the storage cluster be usable for a 5-host virtualization cluster?
I am more than willing to make a blog post or something similar to help others if my plan is even possible :)
 
Have a look at this thread, it is a bit old but the principles remain the same.

Regarding your questions, briefly:

  • Striped mirrors offer better performance.
  • HAST works at the vdev level, meaning that your pool will sit on top of HAST.
  • I prefer NFS over iSCSI because the latter sucks in FreeBSD.
  • You can combine LAGG + CARP (see the sketch below).
  • Keep 2 interfaces with LAGG for HAST synchronization.
  • Using SSDs for LOG will improve writes in your situation.
Depending on the controller, expect to max out at around 250-300 MB/s of internal throughput, assuming you don't start reaching 80% of your pool's capacity.
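
For the LAGG + CARP combination, a minimal rc.conf sketch (the igb2/igb3 interface names, addresses and password are placeholders only):

Code:
# two NICs aggregated with LACP, with a CARP virtual IP on the same subnet
cloned_interfaces="lagg0 carp0"
ifconfig_igb2="up"
ifconfig_igb3="up"
ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 inet 192.168.1.6 netmask 255.255.255.0"
ifconfig_carp0="inet 192.168.1.10 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"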
 
eg1l said:
Currently, I have set up two storage servers with CARP, in my lab.
I would rather use uCARP (net/ucarp) for flexibility.

8 3TB Disks
Do not use RAIDZ/RAID5 with that (at least RAID6 or RAID10) ;)
USB stick with OS
Make that two with ZFS mirror or gmirror.

How should I arrange my disks for best performance and redundancy?
I would go for RAID10.

Should I create a striped mirror on each storage server and then put HAST on top of that?
Generally it's layered this way: RAW DISKS --> HAST --> ZFS --> pool/datasets --> /mount/points

Should I only create HAST volume on the master node?
HAST means two nodes; it's useless on one node. If You come from a Linux background, it's like DRBD.

How should I configure iSCSI when I use two hosts?
You can use iSCSI to share that to other hosts.

Can I use LAGG + CARP?
Yes ... and You probably should ;)

Would the performance of the storage cluster be usable for a 5-host virtualization cluster?
Should do; it also depends on that cluster's demands/load.

I am more than willing to make a blog post or something similar to help others if my plan is even possible :)
Sharing experiences and thoughts is always nice ;)

You should also add 2 * SSD (fast write speed) in a mirror for the ZIL and another 1-2 SSDs (fast read) for L2ARC.

Something like that:

2 * USB @ MIRROR = FreeBSD OS
2 * SSD @ MIRROR = ZFS ZIL
2 * SSD = ZFS L2ARC (each SSD configured as separate L2ARC)
N * HDD = ZFS pool
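
A rough set of zpool commands for that layout (just a sketch; the daN device names are placeholders, and with HAST the data vdevs would be built from the /dev/hast/* providers instead of raw disks):

Code:
# 8 data disks as striped mirrors (RAID10-style)
zpool create tank \
  mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7
# 2 SSDs mirrored for the ZIL (SLOG)
zpool add tank log mirror da8 da9
# 2 SSDs as separate L2ARC cache devices
zpool add tank cache da10 da11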


gkontos said:
I prefer NFS over iSCSI because the latter sucks in FreeBSD.
I recently played with istgt, worked OK for me, what's wrong with it?
 
Make that two with ZFS mirror or gmirror.
Brilliant :)

I would go for RAID10.
If I were to just have one server without HAST, I would:
  • (Mirror two disks as one vdev) * 4
  • Create a stripe across those vdevs
This would translate to RAID10?

How would I do this, when creating a pool on top of HAST?

HAST means two nodes, its useless on one node, if You come from Linux background then its like DRBD.
Yes, but when creating the pool, should this only be done on the master node?

What happens in a failover situation?

You can use iSCSI to share that to other hosts.
Yes, I have done this for one node. Like this:
Code:
Master----\
----------------> Virtual IP (iSCSI target from master) -> Virtualization cluster
Slave-----/
Should I create a target also on the slave?

ZIL and L2ARC sound really cool, but how does this work for, say, a virtualized guest on the virtualization cluster?
 
vermaden said:
I recently played with istgt, worked OK for me, what's wrong with it?

I had performance issues and occasional timeouts with some Linux clients. NFS just worked out of the box.
 
eg1l said:
If I were to just have one server without HAST, I would:
  • (Mirror two disks as one vdev) * 4
  • Create a stripe across those vdevs
This would translate to RAID10?
Yes.

How would I do this, when creating a pool on top of HAST?
Create a HAST device for each disk on the two nodes. For example, You have disks ada0 and ada1 in each nodeN: create hast0 as replication between ada0@node0 <-> ada0@node1, and hast1 as ada1@node0 <-> ada1@node1. Then create the ZFS pool on top of the hastN devices.
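
A minimal sketch of that layering, assuming the resources are named hast0/hast1 and the pool tank:

Code:
# on both nodes, after defining hast0 and hast1 in /etc/hast.conf:
hastctl create hast0
hastctl create hast1
service hastd onestart
# on node0 (the node that will own the pool):
hastctl role primary hast0
hastctl role primary hast1
# on node1:
hastctl role secondary hast0
hastctl role secondary hast1
# the providers appear as /dev/hast/* on the primary only -- build the pool there,
# adding more mirror vdevs the same way if you have more disks:
zpool create tank mirror /dev/hast/hast0 /dev/hast/hast1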

It would be good to use separate networks for the iSCSI/NFS export and for HAST replication/synchronization. I would also suggest jumbo frames (MTU 9000) for both of these networks if possible, at least for the HAST network.
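
For example, the MTU for a dedicated HAST interface can be set straight in rc.conf (the interface name and address are placeholders only):

Code:
# dedicated HAST replication/synchronization network with jumbo frames
ifconfig_bge0="inet 192.168.2.6 netmask 255.255.255.0 mtu 9000"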

Yes, but when creating the pool, should this only be done on the master node?
HAST is always master-slave, so You always do operations on one node only.

What happens in a failover situation?
On the failed node, switch all hast devices to 'secondary'; on the standby node, put all hast devices into the 'primary' role, probably with the FORCE option.
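
Roughly like this (just a sketch, assuming resources named hast0/hast1 and a pool named tank):

Code:
# on the node being demoted (if it is still reachable):
hastctl role secondary hast0
hastctl role secondary hast1
# on the node taking over:
hastctl role primary hast0
hastctl role primary hast1
zpool import -f tank   # forced, because the pool remembers the other node's hostid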

Yes, I have done this for one node. Like this:
Master----\
----------------> Virtual IP (iSCSI target from master) -> Virtualization cluster
Slave-----/
Yes, like that.

Should I create a target also on the slave?
You need to create a cluster service that consists of:
- the hast devices
- import/export of the zfs pool
- the iscsi target daemon
That is why uCARP is better here: it not only switches the IP but also executes scripts upon the IP change.
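
A minimal sketch of such a uCARP invocation (the interface, addresses, password and script paths are placeholders only):

Code:
# advertise VIP 192.168.1.10 and call a script on every master/backup transition
ucarp -i igb0 -s 192.168.1.6 -v 1 -p mypassword -a 192.168.1.10 \
      --upscript=/usr/local/bin/failover-master.sh \
      --downscript=/usr/local/bin/failover-slave.sh -B

The up/down scripts are then responsible for adding/removing the VIP on the interface, the hastctl role switch, the zpool import/export, and starting/stopping the iSCSI target.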

ZIL and L2ARC sound really cool, but how does this work for, say, a virtualized guest on the virtualization cluster?
This works everywhere. The ZIL is for faster writes, L2ARC is for faster (cached) reads and for the DDT if You run out of RAM when using deduplication.

gkontos said:
I had performance issues and occasional timeouts with some Linux clients. NFS just worked out of the box.
Ok, thanks for sharing.
 
Create a HAST device for each disk on the two nodes. For example, You have disks ada0 and ada1 in each nodeN: create hast0 as replication between ada0@node0 <-> ada0@node1, and hast1 as ada1@node0 <-> ada1@node1. Then create the ZFS pool on top of the hastN devices.

Say I have a failover situation where only one server is active.
If one of the disks fails on that server, will the pool be destroyed?

Would I be better off creating four mirrored vdevs (4 * 2 disks) on each node and then creating hast0..3? Is that possible?

It would be good to use separate networks for the iSCSI/NFS export and for HAST replication/synchronization. I would also suggest jumbo frames (MTU 9000) for both of these networks if possible, at least for the HAST network.
Noted.

On the failed node, switch all hast devices to 'secondary'; on the standby node, put all hast devices into the 'primary' role, probably with the FORCE option.

So it seems I have to create a script, triggered by uCARP, that:
  • Switches HAST roles
  • Exports/Imports zfs pool
  • Initializes a new iSCSI target on the slave node
 
I'll be very interested to see your script for the uCARP.

I'm trying to do an almost identical thing here but using pure SSD for the pool. I've just got two in each at the moment to prove the concept. I've had it working fine in the past using NFS after following the other thread by gkontos.

I'm slightly worried about using SSDs, as I lost 4 out of 8 when I tried putting them in a server using DRBD, though they were only Crucial V4 drives.

I'm trying this with Samsung 840 Pros at the moment but will probably try the Intel 520s in production, or even the DC S3700s.

I'm assuming that if you use spinning disks in the pool, you are also going to need to include ZIL and L2ARC drives.

My setup is for VoIP PBXs on ESXi, so I don't need much in the way of storage. I'm hoping that at most 4 × 480 GB Intel 520s per box will be sufficient to handle around three hosts. I figure that will give me about 850 GB of usable pool space.

I'm wondering though if the HAST setup slows the whole thing down significantly enough to waste the benefits of even bothering with a pure SSD pool.
 
DigitalDaz said:
I'm wondering though if the HAST setup slows the whole thing down significantly enough to waste the benefits of even bothering with a pure SSD pool.

In terms of throughput, yeah, probably a slight penalty. But SSDs will always run loops around an ordinary HDD when it comes to random access and latency.

/Sebulon
 
Curious about the results. HAST seems nice, but it's still not very 'strong', as in, it's very easy to fsck the system up by pulling cables or cutting power, i.e. simulating hardware failure.
 
gkontos said:
I had performance issues and occasional timeouts with some Linux clients. NFS just worked out of the box.

FYI, the Native iSCSI Target project:
The goal of this project is to create a native, high performance, iSCSI target facility for FreeBSD. While configuration and connection setup and teardown are handled by a userland daemon, unlike previous target frameworks, all data-movement is performed in the kernel. The iSCSI target is fully integrated with the CAM Target Layer meaning that volumes can be backed by files or any block device. The hardware offload capabilities of modern network adapters will also be supported.
 
I now have this setup working, but using the regular CARP rather than uCARP. What problems, if any, am I likely to face? The current setup only uses a single interface for iSCSI; for testing I'm using two HP N40L MicroServers, so I'm a little limited.

I'm a little unsure about any potential split-brain scenario, simply because at the moment I can't seem to simulate one :) One of the boxes always becomes the master and just assumes the primary role. I can happily pull the cables from either box and they both switch over as expected.

Anyone familiar with FreeBSD should find this all fairly straightforward as I know next to nothing.

I just used the @gkontos guide and slightly modified things. That guide deals with NFS, so the first thing was to install the iSCSI target. I didn't want the target to start automatically, so I did not include it in the rc.conf file. IIRC I have only changed two lines in the default failover script.

At the beginning of the slave section I have put:

Code:
/usr/local/etc/rc.d/istgt stop

Without that, the pool will not export.

Similarly, at the end of the master section, after the pool is imported:

Code:
/usr/local/etc/rc.d/istgt onestart

I'm sure I will come a cropper, but now I want to get more interfaces going for redundancy.
 
Already I'm thinking bigger and about different technologies, though I don't know much about each. I have a few 4Gb FC cards. I'm thinking that, in the same way I can use devd and CARP to start and stop iSCSI targets, I should be able to do the same with FC targets. I'm thinking here of a couple of 16/24-bay boxes, maybe with a 10GbE card in each for the HAST replication.

Then I could have 4Gb FC in each hypervisor. In my head it doesn't look that difficult; is this easily achievable? It would still be tons cheaper than HA EQL SAN kit.
 
Sure, I'll share them.

The devd.conf is the default file with this tagged at the end:

Code:
notify 30 {
	match "system" "IFNET";
	match "subsystem" "carp0";
	match "type" "LINK_UP";
	action "/usr/local/bin/failover master";
};

notify 30 {
	match "system" "IFNET";
	match "subsystem" "carp0";
	match "type" "LINK_DOWN";
	action "/usr/local/bin/failover slave";
};

The failover script:
Code:
#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>
# Modified by Darren Williams <darren@directvoip.co.uk>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="disk1 disk2"

# delay in mounting HAST resource after becoming master
# make your best guess
delay=3

# logging
log="local0.debug"
name="failover"
pool="tank"

# end of user configurable stuff

case "$1" in
	master)
		logger -p $log -t $name "Switching to primary provider for ${resources}."
		sleep ${delay}

		# Wait for any "hastd secondary" processes to stop
		for disk in ${resources}; do
			while $( pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1 ); do
				sleep 1
			done

			# Switch role for each disk
			hastctl role primary ${disk}
			if [ $? -ne 0 ]; then
				logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
				exit 1
			fi
		done

		# Wait for the /dev/hast/* devices to appear
		for disk in ${resources}; do
			for I in $( jot 60 ); do
				[ -c "/dev/hast/${disk}" ] && break
				sleep 0.5
			done

			if [ ! -c "/dev/hast/${disk}" ]; then
				logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
				exit 1
			fi
		done

		logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."


		logger -p $log -t $name "Importing Pool"
		# Import ZFS pool. Do it forcibly as it remembers hostid of
                # the other cluster node.
                out=`zpool import -f "${pool}" 2>&1`
                if [ $? -ne 0 ]; then
                    logger -p local0.error -t hast "ZFS pool ${pool} import failed: ${out}."
                    exit 1
                fi
                logger -p local0.debug -t hast "ZFS pool ${pool} imported."
                # Start the iscsi target
                /usr/local/etc/rc.d/istgt onestart 

	;;

	slave)
		logger -p $log -t $name "Switching to secondary provider for ${resources}."
                # Stop the iscsi target otherwise the pool will not export
                /usr/local/etc/rc.d/istgt stop
		# Switch roles for the HAST resources
		zpool list | egrep -q "^${pool} "
        	if [ $? -eq 0 ]; then
                	# Forcibly export file pool.
                	out=`zpool export -f "${pool}" 2>&1`
               		 if [ $? -ne 0 ]; then
                        logger -p local0.error -t hast "Unable to export pool ${pool}: ${out}."
                        exit 1
                	fi
                	logger -p local0.debug -t hast "ZFS pool ${pool} exported."
        	fi
		for disk in ${resources}; do
			sleep $delay
			hastctl role secondary ${disk} 2>&1
			if [ $? -ne 0 ]; then
				logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
				exit 1
			fi
			logger -p $log -t $name "Role switched to secondary for resource ${disk}."
		done
	;;
esac

The hast.conf:
Code:
resource disk1 {
        on san1 {
                local /dev/ada0
                remote 192.168.2.7
        }
        on san2 {
                local /dev/ada0
                remote 192.168.2.6
        }
}
resource disk2 {
        on san1 {
                local /dev/ada1
                remote 192.168.2.7
        }
        on san2 {
                local /dev/ada1
                remote 192.168.2.6
        }
}

The istgt.conf:
Code:
[Global]
  Comment "Global section"
  NodeBase "iqn.2007-09.jp.ne.peach.istgt"

  # files
  PidFile /var/run/istgt.pid
  AuthFile /usr/local/etc/istgt/auth.conf

  # syslog facility
  LogFacility "local7"

  # socket I/O timeout sec. (polling is infinity)
  Timeout 30
  # NOPIN sending interval sec.
  NopInInterval 20

  # authentication information for discovery session
  DiscoveryAuthMethod Auto

  # reserved maximum connections and sessions
  # NOTE: iSCSI boot is 2 or more sessions required
  MaxSessions 32
  MaxConnections 8

  # iSCSI initial parameters negotiate with initiators
  # NOTE: incorrect values might crash
  FirstBurstLength 65536
  MaxBurstLength 262144
  MaxRecvDataSegmentLength 262144

[UnitControl]
  Comment "Internal Logical Unit Controller"
  AuthMethod Auto
  #AuthMethod CHAP Mutual
  AuthGroup AuthGroup1
  # this portal is only used as controller (by istgtcontrol)
  # if it's not necessary, no portal is valid
  #Portal UC1 [::1]:3261
  Portal UC1 127.0.0.1:3261
  # accept IP netmask
  #Netmask [::1]
  Netmask 127.0.0.1

# You should set IPs in /etc/rc.conf for physical I/F
[PortalGroup1]
  Comment "T1 portal"
  Portal DA1 192.168.1.10:3260
# for dhcp clients use 0.0.0.0 (max. 1 declaration!)
#  Portal DA1 0.0.0.0:3260

[InitiatorGroup1]
  Comment "V1 group"
#  InitiatorName "iqn.1993-08.org.debian:01:16498f7229"
  InitiatorName "ALL"
  Netmask 192.168.1.0/24

[LogicalUnit1]
  TargetName disk0 
  Mapping PortalGroup1 InitiatorGroup1
  AuthGroup AuthGroup1
  UnitType Disk
  QueueDepth 255
#  QueueDepth 0
  LUN0 Storage /dev/zvol/tank/vmware Auto

The rc.conf:
Code:
hostname="san1.local"
keymap="uk.iso.kbd"
ifconfig_igb0="inet 192.168.1.6 netmask 255.255.255.0"
ifconfig_bge0="inet 192.168.2.6 netmask 255.255.255.0"
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 192.168.1.10 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"
hastd_enable="YES"
defaultrouter="192.168.1.1"
sshd_enable="YES"
ntpd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"

The rc.conf is obviously slightly different on the second box because of the IP.

I haven't really looked at tuning these yet; I'm sure some settings can be tweaked, but they work.
 
I have tried to set up two hosts with Nas4Free.
I am nearly there, but iSCSI is not starting.

hastctl status
Code:
nas1:~# hastctl status
hast0:
  role: primary
  provname: hast0
  localpath: /dev/ada0
  extentsize: 2097152 (2.0MB)
  keepdirty: 64
  remoteaddr: 129.241.104.224
  replication: fullsync
  status: degraded
  dirty: 10485760 (10MB)
  statistics:
    reads: 133
    writes: 185
    deletes: 0
    flushes: 14
    activemap updates: 5

zpool status:
Code:
nas1:~# zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME          STATE     READ WRITE CKSUM
	tank          ONLINE       0     0     0
	  hast/hast0  ONLINE       0     0     0

errors: No known data errors

CARP + HAST failover works fine, but when trying to start istgt, this happens:

Code:
Starting istgt.
istgt version 0.5 (20121123)
normal mode
using kqueue
using host atomic
LU1 HDD UNIT
LU1: LUN0 file=/dev/hast/hast0, size=1000204822016
LU1: LUN0 1953525043 blocks, 512 bytes/block
istgt_lu_disk.c: 642:istgt_lu_disk_init: ***ERROR*** LU1: LUN0: open error(errno=17)
istgt_lu.c:2091:istgt_lu_init_unit: ***ERROR*** LU1: lu_disk_init() failed
istgt_lu.c:2166:istgt_lu_init: ***ERROR*** LU1: lu_init_unit() failed
istgt.c:2803:main: ***ERROR*** istgt_lu_init() failed
/etc/rc.d/iscsi_target: WARNING: failed to start istgt

Have I missed an obvious step?
 
eg1l said:
CARP + HAST failover works fine, but when trying to start istgt, this happens:

Code:
Starting istgt.
istgt version 0.5 (20121123)
normal mode
using kqueue
using host atomic
LU1 HDD UNIT
LU1: LUN0 file=/dev/hast/hast0, size=1000204822016
LU1: LUN0 1953525043 blocks, 512 bytes/block
istgt_lu_disk.c: 642:istgt_lu_disk_init: ***ERROR*** LU1: LUN0: open error(errno=17)
istgt_lu.c:2091:istgt_lu_init_unit: ***ERROR*** LU1: lu_disk_init() failed
istgt_lu.c:2166:istgt_lu_init: ***ERROR*** LU1: lu_init_unit() failed
istgt.c:2803:main: ***ERROR*** istgt_lu_init() failed
/etc/rc.d/iscsi_target: WARNING: failed to start istgt

Have I missed an obvious step?

Yep. You are pointing istgt directly at /dev/hast/hast0, which is already in use by your ZFS pool. Create a zvol on the pool and use that as the LUN backend instead:

# zfs create -V <size> tank/LUN0
Then, in the istgt.conf file, set /dev/zvol/tank/LUN0 Auto as the LUN0 backend.
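
For reference, that line in istgt.conf would look roughly like this (same format as the [LogicalUnit1] section posted earlier in the thread):

Code:
  LUN0 Storage /dev/zvol/tank/LUN0 Auto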
 