Other Design of a Highly Available SAN (ZVOLs - iSCSI - Jails - gmirror)

Hello everyone,

I see FreeBSD as a very powerful option for designing a formidable high-availability SAN. The storage options in FreeBSD are amazing, yet there are so many ways of doing this that my head is spinning, and I am trying to find what could be considered good practices. Here is the scenario I have in mind:
  • As an example, the data to be stored in the SAN is general storage, think of a company's messy fileserver. Random files of random sizes being accessed at random times.
  • At least 2 servers are dedicated purely to hosting large zpools: a pool of mirrors (RAID10-style) for the best performance, or RAID-Z2 for a balance of redundancy and storage space. Eventually another server can be added to scale horizontally.
  • Another server will be dedicated to hosting jails containing a basic userland, with the jails acting as clients of the storage servers. Assume the network is perfect and not a bottleneck.
  • For simplicity, let's consider a "Samba fileserver" jail on the dedicated jail server.
    • Each storage server shares, via iSCSI, a ZVOL of the same size with the Samba jail. We assume the jail is properly configured to have its own device nodes.
    • The Samba jail uses gmirror to create a mirror device out of the two ZVOLs, so the jail has a mirrored block device available. Should encryption be needed, a GEOM GELI layer can be added. (Decryption key kept on the jailed userland?)
    • The Samba jail partitions the networked mirrored disk and uses UFS as its filesystem. (I don't think the jail should use ZFS, considering the storage servers already use ZFS underneath the ZVOLs.)
    • The Samba jail keeps all the Samba configuration on its jailed userland and uses the mirrored iSCSI disk as its data share.
  • Writes are a bit slower since the data must reach both storage servers, but reads should be faster.
  • If one storage server goes down, the Samba jail still has its data available, albeit in a degraded state since one block device is missing. Once the storage server comes back up, the jailed gmirror should copy the missing data back to the ZVOL that fell behind.
I see this as the most basic design of a somewhat efficient SAN. I know there is a lot more configuration to detail, and performance tests are needed to see how much writes are affected; a rough sketch of the commands I have in mind is below.
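For concreteness, this is the untested sketch of what I mean. The hostnames, target names, ZVOL name, device nodes and paths (storage1/storage2, tank/samba-lun, da1/da2, san0, the mount point) are just placeholders, and the jail would additionally need devfs rules so it can see the iSCSI and mirror devices:

  # /etc/ctl.conf on storage1 (storage2 would be analogous with its own target name)
  portal-group pg0 {
      discovery-auth-group no-authentication
      listen 0.0.0.0
  }
  target iqn.2024-01.lan.storage1:samba {
      auth-group no-authentication
      portal-group pg0
      lun 0 {
          path /dev/zvol/tank/samba-lun
      }
  }

  # On each storage server: create the ZVOL and start the target daemon
  zfs create -V 500G tank/samba-lun
  service ctld onestart

  # On the client side: attach both LUNs, mirror them, and put UFS on top
  iscsictl -A -p storage1 -t iqn.2024-01.lan.storage1:samba
  iscsictl -A -p storage2 -t iqn.2024-01.lan.storage2:samba
  gmirror label -v -b round-robin san0 /dev/da1 /dev/da2
  # optional GELI layer: geli init + geli attach on /dev/mirror/san0, then newfs san0.eli instead
  newfs -U /dev/mirror/san0
  mount /dev/mirror/san0 /usr/jails/samba/data

The round-robin balance algorithm is what I am counting on for the extra read throughput.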
I know that FreeBSD also comes with the hastd daemon, HAST being short for "Highly Available STorage", but I think it works more as a failover setup, and having the secondary servers sit idle waiting for a failure to happen feels like a waste.
With iSCSI and mirroring, the secondary servers can serve extra read operations, storage servers can be added for horizontal scalability, and a jail could have more ZVOLs attached and expand the mirror into a RAID10-style layout (maybe, just thinking out loud).
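If it ever got that far, my naive idea for the RAID10-style layout would be a stripe over two mirrors, something like this (untested, device names are placeholders, and I don't think an existing mirror could be converted in place without recreating it):

  # two mirrors, each made of one LUN per storage server
  gmirror label -b round-robin m0 /dev/da1 /dev/da2
  gmirror label -b round-robin m1 /dev/da3 /dev/da4
  # stripe the two mirrors together (needs geom_stripe.ko)
  gstripe label san0 /dev/mirror/m0 /dev/mirror/m1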

Does anybody see any problems with this design, or things to keep in mind? I still have no clue how to make the jails themselves redundant as well; maybe HAST can be useful there to keep the jails' userland redundant.

I would love to hear your experiences with designing a HA-SAN :)
 

Attachments

  • HA-SAN.png
Just so that I understand correctly:

You want to make a gmirror out of two iSCSI devices?

You would be one of the few people on Earth doing RAID on top of network devices. Not much experience out there.
 
You want to make a gmirror out of two iSCSI devices?

You would be one of the few people on Earth doing RAID on top of network devices
Yes, I want to make a gmirror out of two iSCSI devices. FreeBSD is so great because it gives you the power to make strange decisions.

My idea is to try it out and see how (and if) it works and if it's also useful. Do you have any experience in managing a SAN?
 
Yes, I want to make a gmirror out of two iSCSI devices. FreeBSD is so great because it gives you the power to make strange decisions.

My idea is to try it out and see how (and if) it works and if it's also useful. Do you have any experience in managing a SAN?

I run several fileservers with their own redundancy and NFS and Samba. No jailing, no network devices (iSCSI).
 
a project called "The BeaST storage"
Thanks for your message, I did see the threads you posted, but only now did I read them more thoroughly. The "BeaST classic" seems like an apt name for it, haha. It feels as if all the parts are there for FreeBSD to be a powerful SAN solution, just a little more digging is needed. I will see what I can find about this project, and if I find something useful I will link it in this thread.

I definitely want to experiment on my own to have more solutions, but I don't have enough physical servers in my homelab to run enough tests. Doing them with VMs might work, but I wonder how much real scenarios with real data would differ.

Each time I think about this topic I get more questions in my mind. One lately is: if I expose two ZVOLs, create a mirror out of them, and then create a zpool on top of that mirror device, would it actually work as expected? I want to get the benefits of ZFS on the server that imports the two LUNs, mainly ZFS snapshots and ZFS replication.
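Something like this is what I have in mind, as an untested sketch (da1/da2, san0 and jailpool are placeholder names); the alternative would be to drop gmirror entirely and let ZFS mirror the two LUNs itself:

  # option A: ZFS on top of a gmirror of the two imported LUNs
  gmirror label -b round-robin san0 /dev/da1 /dev/da2
  zpool create jailpool /dev/mirror/san0

  # option B: skip gmirror and let ZFS do the mirroring itself
  zpool create jailpool mirror /dev/da1 /dev/da2

Option B would let ZFS use its checksums to repair a bad copy from the other LUN, whereas with option A ZFS only sees a single device and gmirror handles the redundancy.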

EDIT: It seems this is the latest BeaST project paper https://mezzantrop.wordpress.com/wp-content/uploads/2016/04/the_beast_grid_raid_ctlha_bq.pdf from 2020.08.28.
 
I definitely want to experiment on my own to have more solutions, but I don't have enough physical servers in my homelab to run enough tests. Doing them with VMs might work, but I wonder how much real scenarios with real data would differ.

Each time I think about this topic I get more questions in my mind. One lately is: if I expose two ZVOLs, create a mirror out of them, and then create a zpool on top of that mirror device, would it actually work as expected? I want to get the benefits of ZFS on the server that imports the two LUNs, mainly ZFS snapshots and ZFS replication.

EDIT: It seems this is the latest BeaST project paper https://mezzantrop.wordpress.com/wp-content/uploads/2016/04/the_beast_grid_raid_ctlha_bq.pdf from 2020.08.28.
I have done all my testing in a virtual environment, and it has worked perfectly.

Yes, you can do that: the volumes you expose can be mirrored.

For example, in my thread I am exposing two ZVOLs, one on each server. Those ZVOLs live inside a pool of mirrors.

That is, each ZVOL is already redundant inside a pool of mirrors, and the client that imports those two ZVOLs as LUNs creates a mirror with gmirror. If the pool on one of the servers goes down, the mirror I created from those two LUNs is still available.

When the failed pool is back online and you rejoin its LUN with the active one, they will be synchronized again.
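Roughly, on the client side I picture the recovery like this (an untested outline; san0, da2 and the target name are placeholders, and depending on how the device reappears gmirror may even reconnect it automatically):

  gmirror status san0            # the mirror keeps running DEGRADED while one LUN is gone
  # once the failed server is reachable again, re-attach its LUN
  iscsictl -A -p storage2 -t iqn.2024-01.lan.storage2:samba
  gmirror forget san0            # drop the stale, disconnected component
  gmirror insert san0 /dev/da2   # re-add it; gmirror resynchronizes in the background
  gmirror status san0            # shows SYNCHRONIZING until the copy completes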

I don't know if I have explained myself well, or whether I have already read that PDF you mentioned.

But I have opted to use Solaris; in the thread I mentioned I discuss whether it is viable to use Solaris, since that system is becoming more and more forgotten every day. I have set it aside a bit since I am interested in other things at the moment, but I am having problems installing the Solaris cluster software.
 
This entire design makes no sense to me.
Everything in a storage array except the backplane is N+1. Each array has dual PSUs, dual controllers (active/standby), dual SAS ports to each disk, and multipath to each application server.
When you want HA storage, you put the arrays in different DCs and replicate them with synchronous or asynchronous replication. It's not the job of the application server to take care of the redundancy of the storage layer. You need to minimize complexity and administrative overhead, and you are doing just the opposite. Take some time and read about best practices, for example from Dell PowerVault or HPE MSA.
 
If the pool on one of the servers goes down, the mirror I created from those two LUNs is still available.

When the failed pool is back online and you rejoin its LUN with the active one, they will be synchronized again.
I see that as the main benefit of this complex storage schema, plus having the gmirror provide higher read throughput. Still, there are many questions left open, so I guess there's nothing else to do but start experimenting. Good to know that it worked in your virtualized environment! It won't be useful for measuring performance improvements, but it does work for testing the many scenarios where a storage server fails.
It's not the job of the application server to take care of the redundancy of the storage layer. You need to minimize complexity and administrative overhead, and you are doing just the opposite
I agree that the design seems overly complex and that it adds complexity on the application server. Having multiple storage servers replicate among themselves with HAST is simple and should work straight out of the box, but I am trying to think of a solution without a failover mechanism. Having multiple iSCSI drives across storage servers and mirroring them on the application server (or an intermediate control server) might be interesting too, even if a lot more complex.

I will be checking out the best practices from Dell and HPE to keep learning. Thanks for the suggestion.
 
I agree with VladiBG that the design is very complex and not the most suitable for real infrastructures. I understand your point, since I am in the same situation: at work we use HP 3PAR and IBM Storwize, but I have always wanted to use clustered storage with ZFS. I have not seen a system that can do this out of the box other than Solaris with its Solaris Cluster software. Alternatively, if you want, you can look at this demo:

https://www.oracle.com/docs/tech/simulator-guide.pdf

It would be like using an HP 3PAR system, etc., but with ZFS. The ZFS Storage Appliance, which I have used, is not like using ZFS on FreeBSD or other operating systems, since that Oracle system has its own software.

You won't be able to create a cluster, as the demo doesn't allow it, but you can see the options for doing so. You could with a complete, licensed system, but those are expensive just for practicing...

I'm about to abandon the Solaris cluster thread; I think the most advisable thing would be to do specific training on some storage solution. If it's with ZFS, the only ones I know of are the ZFS Storage Appliance and NetApp.
 
I agree with VladiBG that the design is very complex and not the most suitable for real infrastructures.
I know the complexity is high, but the benefits might be worth it, especially for an infrastructure that requires very high availability.

Having HAST/CARP across multiple DCs with either synchronous replication (data must be written to disk on all HAST nodes before the write is confirmed) or asynchronous replication (data is written to disk on the primary node and only to memory on the secondary nodes) is the simplest approach, but there may be a failover window during which the storage servers are not available. Asynchronous replication can also be dangerous since some writes can be lost, meaning synchronous replication is pretty much the only safe way of using HAST, and that introduces a noticeable amount of latency.
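For reference, if I remember hast.conf(5) correctly, the fully synchronous case is the fullsync replication mode, and a minimal config would look roughly like this (untested sketch; the resource name, hostnames, addresses and /dev/da0 are placeholders):

  # /etc/hast.conf on both nodes
  resource samba_data {
      # fullsync: a write is acknowledged only after both nodes have it on disk
      replication fullsync
      on storage1 {
          local /dev/da0
          remote 10.0.0.2
      }
      on storage2 {
          local /dev/da0
          remote 10.0.0.1
      }
  }

  # then, roughly: hastctl create samba_data; service hastd onestart;
  # and hastctl role primary samba_data on the active node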

I guess the only thing to do is to start testing things out and see if the benefits of iSCSI and ZVOL mirroring outweigh the complexity.
 
You can post your progress or questions in this thread if you want; I could help you. On the other hand, this thread has made me want to take up this topic again.

I may take up the Solaris topic again and update the process here in case someone else is interested in these topics.
 
This couldn't have come at a better time!!
I've been thinking about redundant storage for a while now, and this https://forums.freebsd.org/threads/ha-zfs.92452/ gives me the skeleton to build upon!
Hello! While I didn't continue much further on this topic, I still keep an interest in it. I've rebuilt my home server's ZFS setup a few times, but it's not highly available.

If you want to post updates here, or have some questions, I am happy to think with you :)
 
Hello! While I didn't continue much further on this topic, I still keep an interest in it. I've rebuilt my home server's ZFS setup a few times, but it's not highly available.

If you want to post updates here, or have some questions, I am happy to think with you :)
Great!!
I have one question at the moment (I'm currently setting up the test environment): what's this arbiter?
Might you have a clue as to what it is or what it does/how it works?
 
I have one question at the moment (I'm currently setting up the test environment): what's this arbiter?
I'm not sure I follow you, what arbiter are you talking about?

I recall reading in the documents on the BeaST storage system that there must be a driver/arbiter that makes sure the HA storage connected to the client stays available and does not cause data corruption. That's what I remember at a glance; you can read more here: https://github.com/mezantrop/BeaST
 
While I'd love to have FreeBSD as an option for HA storage, unfortunately for most of my clients it just did not meet the requirements. If you plan to use it in a corporate environment, I'm afraid I have to disappoint you and suggest using one of the various Linux solutions out there. I moved those systems mostly to Linux + DRBD (great), some to Linux + Ceph (good), and some to Linux + GlusterFS (should be avoided). DRBD is the more advanced equivalent of FreeBSD's HAST and works really well: it is not limited to two nodes, and performance stays at a very high level should one node fail (contrary to my experience with HAST).

If you do not need HA at the filesystem/block-storage level, I recommend having a look at Garage or MinIO; both S3 options are available on FreeBSD. And if you are satisfied with an HA database, you also have very good options with FreeBSD.
 
I'm afraid I have to disappoint you and suggest using one of the various Linux solutions out there. I moved those systems mostly to Linux + DRBD (great)
That was what I was afraid to hear, but I am slowly coming to accept it. FreeBSD + ZFS + HAST sounds so good, but it seems it isn't there yet. Plus, the latest hardware always seems to be 100% compatible with Linux, while FreeBSD is just a step behind.

Of course, we need HA at all the layers. At the application level it mostly depends on whether the app supports it, plus a good Nginx proxy configuration; HA at the database level is mostly handled by Postgres clustering, so at least two layers can be hosted on FreeBSD. The only thing that remains is HA at the storage level.

I will check out Linux + DRBD. It's good to hear that you tried it out and can compare it with real experience on HAST. Thanks for your comment.
 
In case you're interested, the only operating system I've found that offers native support for creating a ZFS cluster is Oracle Solaris, with its 'Solaris Cluster 4.4' solution. You can take a look regardless of your operating system; it only requires an Oracle account, as the license is free of charge.

On the other hand, keep in mind that Solaris seems to have its days numbered, from what I've seen, and you may encounter some difficulties since it's a different operating system with a different way of administration.
 