Other One SAN, two machines, sharing storage without NFS

Nasrudin · Jul 12, 2019

The subject provides a brief summary. Essentially we have two machines connected using gmultipath to the same SAN target. If I mount this target on both machines naively (via gpart on both and newfs on one or the other), I'm pretty sure this won't end well. What software or alternatives may I use to achieve this goal? NFS is only an option of absolute last resort.

Thanks in advance.

ralphbsz · Jul 12, 2019

Nearly all file systems expect to have exclusive control of the underlying block device. As a matter of fact, many file systems will actually refuse to mount the same device on the second node, because they will see the device as currently mounted.

The easiest answer is to give up on your dream, have only one server active at any given moment, and then use that server as a NAS server. Why are you so dismissive about using NFS? It can work very well in this environment. This is particularly true since modern network interfaces are both inexpensive and very fast (10gigE or better, Infiniband), and often have latency and throughput that are as good as or better than those of disk interfaces. For example, if you have one or two disks, connected via 3gig SATA (300MByte/s at most, in practice more likely 100-150MByte/s), and your nodes are connected via 10gigE, then your network is no longer the bottleneck.
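To put rough numbers on that comparison, here is a minimal sketch. It assumes 8b/10b encoding on the SATA link (10 line bits per payload byte) and ignores Ethernet protocol overhead, so the network figure is an upper bound; the function names are made up for illustration.

```python
# Back-of-the-envelope comparison of the link speeds mentioned above.
# Assumptions: SATA 3 Gbit/s uses 8b/10b encoding; Ethernet overhead is ignored.

def sata_mbyte_per_s(line_rate_gbit: float) -> float:
    """Payload rate of a SATA link with 8b/10b encoding, in MByte/s."""
    return line_rate_gbit * 1000 / 10

def ethernet_mbyte_per_s(line_rate_gbit: float) -> float:
    """Raw Ethernet line rate in MByte/s (8 bits per byte, no overhead)."""
    return line_rate_gbit * 1000 / 8

sata_link = sata_mbyte_per_s(3)       # 300.0 MByte/s per SATA link
network = ethernet_mbyte_per_s(10)    # 1250.0 MByte/s on 10gigE
disks_in_practice = 2 * 150           # two disks at the optimistic 150 MByte/s

print("SATA link:", sata_link, "MByte/s")
print("10gigE:", network, "MByte/s")
print("network is the bottleneck:", network < disks_in_practice)
```

Even with both disks streaming at the optimistic end of the range, the 10gigE link has roughly four times the headroom.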

If you want the availability of having two nodes, you can use HAST; it is described in the handbook.

If you absolutely insist on having the storage on the SAN and two servers, then what you need is a SAN file system or cluster file system. Such things exist. In the FOSS world, I think the best known ones are Gluster, Ceph and Lustre. I don't know whether Lustre is available on FreeBSD, but Ceph is at least in theory. Warning: Setting up a Ceph cluster is not easy under the best of circumstances. To my knowledge, the Isilon cluster file system is not freely available, even though it uses FreeBSD underneath. If you are willing to go to commercial products (often very expensive, quite a few require a minimum investment that's tens or hundreds of thousands, and some require significant administration effort), there are quite a few options, but I don't know which ones support FreeBSD.

In reality, we're discussing an XY problem: You are asking exactly how to do something. In order to get a sensible answer, you first need to explain what problem you are really trying to solve. What are you trying to build? What is the intended workload and usage? What hardware are you intending to use?

Nasrudin · Jul 12, 2019

ralphbsz said:
Nearly all file systems expect to have exclusive control of the underlying block device. As a matter of fact, many file systems will actually refuse to mount the same device on the second node, because they will see the device as currently mounted.

Because I was curious, I actually tested this since I had the testbed and it wasn't in production. Both systems can mount a UFS2 filesystem on the same block device. They can't seem to see the changes the other did though. I'm fairly sure there are other issues in using it this way; e.g. just how does the OS on the second server know that a write has finished on the first one?

ralphbsz said:
The easiest answer is to give up on your dream, have only one server active at any given moment, and then use that server as a NAS server.

Actually, the two servers are both initiators. We have a target that is a third machine.

ralphbsz said:
Why are you so dismissive about using NFS? It can work very well in this environment.

Mostly to get alternative answers, not to reject the solution out of hand. When I originally worded the question, I noted that I would probably answer "NFS" myself. I wanted to see what other options are available. Next, because of conditions in the environment, setting up NFS is a lot of extra work and some expense. NFS also has locking and edge case issues that the users in question would be unfamiliar with. (I also considered HAST: splitting the SAN into two partitions, and using HAST to mirror both partitions using a one-to-one mapping of partition to server. I'm not sure if that's viable.) Finally, I'm sorry I came across as dismissive about it; I should learn to be clearer. NFS is my solution of last resort because I know it will work. I just find it odd that we can have things like a block disk device attachable to more than one machine without the actual software to make it a shared file system between N servers (where N > 1) being readily available and easily found.

ralphbsz said:
If you absolutely insist on having the storage on the SAN and two servers, then what you need is a SAN file system or cluster file system. Such things exist. In the FOSS world, I think the best known ones are Gluster, Ceph and Lustre. I don't know whether Lustre is available on FreeBSD, but Ceph is at least in theory. Warning: Setting up a Ceph cluster is not easy under the best of circumstances. To my knowledge, the Isilon cluster file system is not freely available, even though it uses FreeBSD underneath. If you are willing to go to commercial products (often very expensive, quite a few require a minimum investment that's tens or hundreds of thousands, and some require significant administration effort), there are quite a few options, but I don't know which ones support FreeBSD.

Thank you, that's the kind of information I am looking to find.

ralphbsz · Jul 12, 2019

Nasrudin said:
Both systems can mount a UFS2 filesystem on the same block device.

That's odd, I would expect the second one to see that the file system is already mounted, and claim that it wasn't properly dismounted.

Nasrudin said:
They can't seem to see the changes the other did though.

Welcome to caching. Both file systems will be caching the content of the disk, and they have no idea when the cache needs to be invalidated.
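A toy model of that effect, as a sketch only: `SharedDisk` and `Host` are hypothetical stand-ins, and a real buffer cache is far more involved. Each host keeps its own copy of any block it has read, and nothing tells it when the other host has changed the block on disk.

```python
class SharedDisk:
    """The one block device that both hosts are attached to."""
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data


class Host:
    """A host with a private block cache and no invalidation protocol."""
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.cache = {}

    def read(self, n):
        # Serve from RAM if we have ever read this block before.
        if n not in self.cache:
            self.cache[n] = self.disk.read(n)
        return self.cache[n]

    def write(self, n, data):
        # Write-through: update our own cache and the disk.
        self.cache[n] = data
        self.disk.write(n, data)


disk = SharedDisk(nblocks=8)
disk.write(0, b"old")
host_a, host_b = Host("a", disk), Host("b", disk)

host_b.read(0)             # host B caches block 0
host_a.write(0, b"new")    # host A changes it on disk
stale = host_b.read(0)     # host B still answers from its cache

print("on disk:", disk.read(0), "host B sees:", stale)
```

Host B keeps returning the old contents until something evicts the block from its cache, which is why the changes appear to be invisible.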

Nasrudin said:
I'm fairly sure there are other issues in using it this way; e.g. just how does the OS on the second server know that a write has finished on the first one?

That's only a small part of the problem. Imagine what happens when both servers want to create a new file. Both will see that there is some free space, and allocate it, and both will write in the same space; whoever writes second will win. Then both will update the directory to show where the new file is, and their two directory writes will also end up on top of each other. Unfortunately, the question of "who wins" depends solely on the luck of timing, so on disk you will probably end up with a salad: finely chopped bits of data. As long as nobody has to read back from disk and just works out of cached content in RAM, this will actually seem to work reasonably well. The moment disk reads happen, all hell will break loose.
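The file-creation scenario can be sketched as a toy simulation. The `Disk` and `Mount` classes are invented for illustration and look nothing like real UFS structures; the only point is that each mount allocates from its own stale copy of the free-block map and then writes its own copy of the directory back.

```python
class Disk:
    """Toy on-disk layout: a free-block map, one directory, data blocks."""
    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks   # True means "block in use"
        self.directory = {}               # file name -> block number
        self.data = [None] * nblocks


class Mount:
    """One host's mounted view: metadata is copied into RAM at mount time."""
    def __init__(self, disk):
        self.disk = disk
        self.bitmap = list(disk.bitmap)
        self.directory = dict(disk.directory)
        self.cache = {}                   # block number -> cached contents

    def create(self, name, content):
        block = self.bitmap.index(False)  # first block that looks free to us
        self.bitmap[block] = True
        self.directory[name] = block
        self.cache[block] = content
        # Flush data and metadata, overwriting whatever is on disk.
        self.disk.data[block] = content
        self.disk.bitmap = list(self.bitmap)
        self.disk.directory = dict(self.directory)

    def read(self, name):
        block = self.directory[name]
        if block not in self.cache:
            self.cache[block] = self.disk.data[block]
        return self.cache[block]


disk = Disk(nblocks=4)
host_a, host_b = Mount(disk), Mount(disk)

host_a.create("a.txt", "from A")   # allocates block 0
host_b.create("b.txt", "from B")   # also allocates block 0

print("host A, from RAM:", host_a.read("a.txt"))
fresh = Mount(disk)                # what a reboot or third host would see
print("directory on disk:", fresh.directory)
print("block 0 on disk:", disk.data[0])
```

Host A still reads its file happily out of RAM, but on disk the directory entry for `a.txt` is gone and its block holds host B's data.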

Nasrudin said:
I just find it odd that we can have things like a block disk device attachable to more than one machine without the actual software to make it a shared file system between N servers (where N > 1) being readily available and easily found.

Well, early on, disks were connected to exactly one machine. Connecting them to another one required physically moving cables (and if you have ever worked with bus+tag cables, you know that this is not easy: the cables are 1" in diameter, and the connectors are as big as a brick). So the whole tradition of file systems over the last ~60 years grew up around single-attach disks. Actually, this is not quite true: Even in the 70s and 80s quite a few disk drive models could be connected to two hosts, and moved over from one to the other with a front panel switch; this was for active/passive standby configurations: one computer goes down, switch all the disks over and toggle to the second one.

SANs are a relatively new invention. Excluding the Digital CI and its first distributed file system / cluster file system technology (from 1983, extremely early), multi-accessor disk only became popular with Fibre Channel in the late 90s (theoretically, it was possible to use parallel SCSI with multiple initiators, but with the cabling limitations it was impractical with rare exceptions, like Sequent). And then, as soon as Fibre Channel was broadly available, SAN file systems sprang up like mushrooms after a rain. I worked on a few of them (my signature is on the very first shipping box that SAN-FS was delivered in). But building a shared-disk file system is surprisingly difficult, because you need *all* the computers to exactly agree on who is in charge of what. That is quite the task, given that communication between the computers can be disrupted at any moment. It also comes with significant penalties; for example, caching things is so hard that SAN file systems initially struggled to get the same performance as single-node file systems for single-node workloads. And the software complexity is overwhelming; getting this to work bug-free and efficiently takes an enormous amount of effort, which is why there are few and only expensive commercial offerings, while the free software is stuck in niches (like Ceph and Lustre serving the supercomputer market, and surviving on very lucrative support contracts from large customers, given that installing and operating a cluster file system is very hard).
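To give a flavor of the "who is in charge of what" problem, here is a deliberately simplified sketch of one common idea, time-limited leases. It is not how any particular SAN file system does it: `LeaseManager` is hypothetical, and it assumes a single lock manager and a shared clock, which are exactly the things a real cluster cannot take for granted.

```python
class LeaseManager:
    """Grants time-limited ownership of a resource to one node at a time."""
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.owner = {}   # resource -> (node, expiry time)

    def acquire(self, resource, node, now):
        holder = self.owner.get(resource)
        if holder is not None:
            holder_node, expiry = holder
            if holder_node != node and now < expiry:
                return False              # someone else holds a live lease
        # Free, expired, or a renewal by the current holder.
        self.owner[resource] = (node, now + self.lease_seconds)
        return True

    def may_write(self, resource, node, now):
        holder = self.owner.get(resource)
        return holder is not None and holder[0] == node and now < holder[1]


mgr = LeaseManager(lease_seconds=10)

got_a = mgr.acquire("free-block-map", "A", now=0)    # A takes ownership
got_b = mgr.acquire("free-block-map", "B", now=5)    # B is refused
# A loses contact with the manager and cannot renew its lease.
got_b_later = mgr.acquire("free-block-map", "B", now=12)
a_still_owner = mgr.may_write("free-block-map", "A", now=12)

print(got_a, got_b, got_b_later, a_still_owner)
```

The hard part is everything this sketch leaves out: node A has to notice by itself that its lease has run out and stop writing, even though it cannot reach anyone to ask.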

The thing is: Today, with SAS, it is nearly trivial to build a 2- or 4-initiator SAN; most JBOD disk enclosures have enough connectors to attach 4 initiators with no problem. But using that hardware resource is very hard. Nearly all users are better off setting up a single server, and then using a network protocol.

If you think NFS has problems (you mention locking), then just switch to a modern protocol, like NFSv4 or CIFS. It is quite possible to build high-performing and efficient clusters on top of NAS technology.