Looking for advice on active-active storage cluster

I've been daydreaming again and I need some help ironing out the details of this little adventure. I've looked at HAST and I don't think it's applicable, the reasons for which should become apparent.

What I'm thinking here is a three-box setup containing two FreeBSD servers (node1 and node2) connected to a shared DAS JBOD. Both servers can see, and access, all disks at the same time. For the sake of argument the DAS has 16 disks. node1 uses disks 0-7 to create pool1 and node2 uses disks 8-15 to create pool2. CARP is in play here and there are two virtual IPs: ip1, for which node1 is the primary, and ip2, for which node2 is the primary. The kernel iSCSI target is used to present a LUN on each node on its virtual IP. So far, at least to me, this seems pretty straightforward.
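For illustration, node1's half might look something like this in /etc/ctl.conf (the target name, portal address, and zvol path are just placeholders I made up):

```
# pg1 listens only on ip1, the CARP virtual IP that node1 masters
portal-group pg1 {
        listen 10.0.0.1
}

target iqn.2012-06.org.example:pool1 {
        portal-group pg1
        lun 0 {
                path /dev/zvol/pool1/lun0
                size 100G
        }
}
```

node2 would carry the mirror-image stanza for pool2 on ip2.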

Now in the event of a graceful reboot of a node the hand off would look something like this:
1. node1 stops ctld(8).
2. node1 does zpool export pool1.
3. node1 shuts down.
4. CARP updates ip1; node2 is now going to receive traffic.
5. node2 becomes aware node1 is gone.
6. node2 does zpool import pool1.
7. node2 adds the configuration for the LUN on pool1 and does a ctld(8) reload.

Any holes in this so far? I'm not sure the best way to achieve #5. Perhaps heartbeat?
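Spelled out as commands, steps 5-7 on node2 might look like this (the pool name and the idea of keeping a per-pool config fragment in /etc/ctl.conf.pool1 are assumptions on my part; the wrapper only prints each command so the sequence can be eyeballed safely):

```shell
#!/bin/sh
# Dry-run sketch of node2 taking over pool1.
# run() prints each command instead of executing it; drop the echo to
# turn this into the real thing.
run() { echo "$@"; }

run zpool import pool1                                  # step 6: take over the pool
run sh -c 'cat /etc/ctl.conf.pool1 >> /etc/ctl.conf'    # step 7a: append pool1's LUN config
run service ctld reload                                 # step 7b: have ctld re-read ctl.conf
```

I'd have to double-check that the ctld rc script supports reload; if not, a SIGHUP to ctld(8) makes it re-read its configuration.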

In the event of an ungraceful disappearance of node1, I would think things would play out as above, but without steps 1-3, and the import on node2 might take a little longer because the contents of the ZIL may need to be replayed, or something else.

Now, when node1 comes back, how do I control the start of ctld(8) (because node2 has to remove the LUN from ctl.conf and reload first) and the import of the pool (because node2 has to export it first)? Again, perhaps this is best achieved through heartbeat? I would also assume that ZFS and ctld(8) should not be enabled in rc.conf so that neither starts automatically.
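Concretely, I'm thinking of something like this in /etc/rc.conf on both nodes (one thing I'll have to verify: a pool still listed in /boot/zfs/zpool.cache may get imported at boot anyway, though a clean export removes it from the cache):

```
# Nothing storage-related starts by itself;
# the CARP up/down scripts handle pool import and ctld.
zfs_enable="NO"
ctld_enable="NO"
```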

So I said that I didn't think HAST was appropriate. The two main reasons are that it has the notion of local and remote disks, and that it wants to replicate data from the active node to the passive one. Neither applies here. None of the disks are remote; they are all local and it's just a question of who is using them. And there is no replication of data, just a (hopefully) seamless hand-off of the pool and service configuration.

I have set up a pair of FreeBSD VMs with shared disks so I should be able to test this pretty well. Any thoughts on the matter?
 
If you're doing it seriously and have the money, look at iXsystems' TrueNAS kit. As far as I'm aware that functions similarly, but they hired FreeBSD src committers to make it work.

Now, when node1 comes back how do I control the start of ctld (because node2 has to remove the LUN from ctld conf and reload first)

This bit sounds unnecessary to me. This is an Active-Standby setup. When node2 is Active, node1 can just boot up and be the Standby node. If node2 fails, you just perform the same steps you already listed (with the nodes reversed). I see no reason for node1 to take over if node2 is working.

I would also try not to go to the effort of modifying ctld.conf all the time unless it's the only way it will work. You want to start ctld when a node becomes Active, and in a graceful node shutdown, stop it before exporting the pool.

Yes, you'll almost certainly want to control zfs/ctld startup manually, probably in the CARP up/down scripts.

Of course this is the sort of stuff that takes a bit of work to get running, but a lot of work to get right.
 
This bit sounds unnecessary to me. This is an Active-Standby setup. When node2 is Active, node1 can just boot up and be the Standby node. If node2 fails, you just perform the same steps you already listed (with the nodes reversed). I see no reason for node1 to take over if node2 is working.

Well, it's active-passive at the pool level but it's active-active at the service level. So under normal operation each node has its pool and is servicing iSCSI requests. There are two iSCSI target IPs, two LUNs, two physical machines making use of their RAM and CPU. In this sense I consider it active-active; perhaps I am wrong? If one node fails the other takes over the pool and a CARP update takes place. There are still two iSCSI target IPs, two LUNs, but now there is one physical machine handling it all. When the failed node comes back it should reclaim its pool and begin servicing requests once more.

Of course this is the sort of stuff that takes a bit of work to get running, but a lot of work to get right.

Oh I know. I'm under no illusions that this will be trivial or problem free. I'm just an enthusiast that likes to experiment and this seems like a fun project.
 
Thecus has HAST running on the NSeries NAS.

Works very well!

http://thecus.com/product.php?PROD_ID=85

And it won't destroy your budget.

Interesting! Is that Linux based? I can't go cheating on FreeBSD with Linux for this project :)

Ah sorry, I completely missed the bit about both being used at the same time by different front ends.
Have to think about it a bit more...

No worries. Any thoughts, no matter how crazy, are welcome. In the meantime I'm going to look at CARP up/down scripts as you mentioned, as well as sysutils/heartbeat and possibly devd(8). The more checks/control the better; the last thing I want is both nodes attempting to use the same pool at the same time. Actually, now that I think about it, won't you get an error if you try to import a pool that wasn't cleanly exported? I'll have to check into that, but it could provide a little protection right there.

Edit: Sorry mods, I'll format my posts properly.
 
Does Linux actually have HAST or is it using DRBD?

the last thing I want is both nodes attempting to use the same pool at the same time. Actually now that I think about it, won't you get an error if you try to import a pool that wasn't cleanly exported? I'll have to check into that, but it could provide a little protection right there.

Unfortunately you have to forcefully import the pool when going active, otherwise automatic failover after a node failure will not work (Which is what we're after...). Getting this to happen when needed, and only when needed, is the hard bit.

Servers can fail in thousands of ways, and what happens if a node disappears from the network briefly (but still has the pool) and then comes back up in the middle of the other node going active? That's where you have to start looking at stuff like "shoot the other node in the head" (STONITH) logic: when node2 decides to make the jump and take over node1's devices, node1 should just stay down. Once node2 is active for node1's services, you can have some periodic job that checks node1 to see if it's up, and when it is up and stable, kick off a process to export its LUN/pool gracefully and tell it to go active again.
 
Servers can fail in thousands of ways, and what happens if a node disappears from the network briefly (but still has the pool) and then comes back up in the middle of the other node going active? That's where you have to start looking at stuff like "shoot the other node in the head" (STONITH) logic: when node2 decides to make the jump and take over node1's devices, node1 should just stay down. Once node2 is active for node1's services, you can have some periodic job that checks node1 to see if it's up, and when it is up and stable, kick off a process to export its LUN/pool gracefully and tell it to go active again.

All good points. I think you're right about a failed node staying failed. Make fail over automatic and fast, but don't fail back automatically. Give someone a chance to inspect the system, see what went wrong, and then explicitly run a command to fail back. The good news is this approach actually simplifies things greatly.
 
Really interesting stuff, looks like they've put a lot of work into it. Is it really Active/Active though? To me Active/Active means that either controller can handle I/O for the same LUN, requiring basically no switchover on failure. I'm not even sure that's possible on ZFS with it being tied to the hostid of one system and storing a lot of in-flight writes in RAM. If any switchover needs to happen then really it's Active/Passive, regardless of whether the other controller is Active for other LUNs. (You've just got multiple Active/Passive pairs with a different Active node for each LUN)

Looking into it more though, it looks like even enterprise kit claiming to be Active/Active isn't really. The closest most get is "asymmetric Active/Active", or ALUA, whereby the Standby controller can proxy I/O for LUNs owned by the other controller and take over if it doesn't respond, which to be fair is close enough.

Not that I'm knocking the software, it looks really good and I think we'd be hard pressed to do better without off the shelf proprietary hardware/software.
 
Cool, I like the multiple heartbeat methods. It works a bit like I was thinking, with services failing over automatically, the original node being blocked, and the switch back being done later so you don't have to worry about something like an intermittent server causing havoc. Locking the disk itself so the faulty node won't try to import it is clever.
 
I suppose it doesn't have to be FreeBSD, it's just what I'm most familiar with currently. Is clustering under Solaris easier to achieve?
 
First of all, you have to remember that ZFS in Solaris != (open)ZFS in FreeBSD. Second, there are costs if you want to use it in production (OS + clustering software), and I for one would compare those to the prices offered by high-availability.com. Third, you will most likely need three boxes just for the cluster itself (costs), since I do not believe you have Oracle hardware at hand :); with Oracle hardware you need a minimum of 2x nodes and 1x SCSI device for the quorum part.

Whether it's easier to achieve on Solaris or not depends on your definition of "easy". Still, be warned: things are quite different on Solaris. What I can say is that it does offer, out of the box, pretty much what you desire (an active-active setup - http://www.oracle.com/technetwork/a...ge-admin/o11-088-zfs-nas-cluster-1371042.html), iSCSI, disk locking when one node goes down (at least in active/passive mode - you would need to double-check for the other scenario), dual network connections between the nodes (something that would be nice to have in HAST as well), a quorum server, and other things.


LE: Give Solaris a spin in a VM, see what you think of it. Another alternative would be OmniOS with Corosync and Pacemaker (http://blog.zhaw.ch/icclab/use-pace...mos-omnios-to-run-a-ha-activepassive-cluster/), but I would highly recommend you use something that knows disk/node locking.
 
LE2:
Any holes in this so far? I'm not sure the best way to achieve #5. Perhaps heartbeat?
No, you can use devd (https://www.freebsd.org/doc/handbook/disks-hast.html). Pay attention to section "18.14.2.1. Failover Configuration".
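For your CARP-only setup (no HAST), the same idea applies: devd(8) can run a script on CARP state transitions. Something like this in /etc/devd.conf, with the vhid@interface pair and script path being examples only:

```
notify 30 {
    match "system" "CARP";
    match "subsystem" "1@em0";
    match "type" "MASTER";
    action "/usr/local/sbin/carp_failover.sh master";
};

notify 30 {
    match "system" "CARP";
    match "subsystem" "1@em0";
    match "type" "BACKUP";
    action "/usr/local/sbin/carp_failover.sh slave";
};
```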

As an example, here is my modified carp_failover.sh script:
Code:
#!/bin/sh
# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>
# and by Claudiu Vasadi <claudiu.vasadi@gmail.com>

# The names of the HAST resources, as listed in /etc/hast.conf.
# Use a space-delimited list if there is more than one resource.
resources="disk1 disk2"

# Delay (in seconds) before mounting the HAST resources after becoming
# master. Make your best guess.
delay=10

# Scripts to be started/stopped when we have a failover.
scripts="\
/usr/local/etc/rc.d/tcpserver.sh
/usr/local/etc/rc.d/qmail.sh
/usr/local/etc/rc.d/sa-spamd
/usr/local/etc/rc.d/clamav-clamd
/usr/local/etc/rc.d/clamav-freshclam
/usr/local/etc/rc.d/courier-authdaemond
/usr/local/etc/rc.d/courier-imap-imapd
/usr/local/etc/rc.d/courier-imap-imapd-ssl
/usr/local/etc/rc.d/mysql-server"

# Logging
log="local0.debug"
name="carp_failover.sh"
pool="qmail"

# Peer
peer="mail2.local"
# End of user-configurable stuff.

#
# Functions
#

check_peer() {
    if ping -c 1 "$peer" > /dev/null 2>&1; then
        for i in $resources; do
            nc -z "$peer" 22 > /dev/null 2>&1 || {
                echo "ssh port 22 not open"
                exit 1
            }
            hast_status_peer=$(ssh "$peer" hastctl status | grep "$i" | awk '{print $3}')
            if [ "$hast_status_peer" = "master" ]; then
                echo "Peer $peer has resource $i role as master. Make sure all resources are set to secondary or init before proceeding."
                exit 1
            fi
        done
    fi
}

start_rcd_scripts() {
    for i in $scripts; do
        $i onerestart && logger -p local0.debug -t hast "Starting/restarting $i"
    done
}

stop_rcd_scripts() {
    for i in $scripts; do
        $i onestop && logger -p local0.debug -t hast "Stopping $i"
    done
}

hast_switch_to_primary() {
    logger -p $log -t $name "Switching to primary provider for ${resources}."
    sleep ${delay}

    # Wait for any "hastd secondary" processes to stop.
    for disk in ${resources}; do
        while pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1; do
            sleep 1
        done

        # Switch the role for each disk.
        if ! hastctl role primary "${disk}"; then
            logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
            exit 1
        fi
    done

    # Wait for the /dev/hast/* devices to appear.
    for disk in ${resources}; do
        for I in $(jot 60); do
            [ -c "/dev/hast/${disk}" ] && break
            sleep 0.5
        done

        if [ ! -c "/dev/hast/${disk}" ]; then
            logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
            exit 1
        fi
    done

    logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."
}

hast_switch_to_secondary() {
    logger -p $log -t $name "Switching to secondary provider for ${resources}."
    for disk in ${resources}; do
        sleep $delay
        if ! hastctl role secondary "${disk}" 2>&1; then
            logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
            exit 1
        fi
        logger -p $log -t $name "Role switched to secondary for resource ${disk}."
    done
}

import_zfs_pool() {
    logger -p $log -t $name "Importing pool ${pool}."
    # Import the ZFS pool. Do it forcibly, as it remembers the hostid
    # of the other cluster node.
    out=$(zpool import -f "${pool}" 2>&1)
    if [ $? -ne 0 ]; then
        logger -p local0.error -t hast "ZFS pool import of ${pool} failed: ${out}."
        exit 1
    fi
    logger -p local0.debug -t hast "ZFS pool ${pool} imported."
}

export_zfs_pool() {
    logger -p $log -t $name "Exporting pool ${pool}."
    if zpool list | egrep -q "^${pool} "; then
        # Forcibly export the pool.
        out=$(zpool export -f "${pool}" 2>&1)
        if [ $? -ne 0 ]; then
            logger -p local0.error -t hast "Unable to export pool ${pool}: ${out}."
            exit 1
        fi
        logger -p local0.debug -t hast "ZFS pool ${pool} exported."
    fi
}

#
# Execute script
#
case "$1" in
    master)
        check_peer
        hast_switch_to_primary
        import_zfs_pool
        start_rcd_scripts
        ;;
    slave)
        stop_rcd_scripts
        export_zfs_pool
        hast_switch_to_secondary
        ;;
esac

It's quite easy to see that I use it for a mail server with local disks.

Have phun :)
 
LE3: There is a thread here somewhere where, if I remember correctly, HAST was not presenting a failed disk to ZFS, or was not presenting it as failed/missing for quite some time. Not sure if this is still the case.
 