I'm under the impression that you can take a bunch of computers, hook them all up together, and have one big computer running a single instance of FreeBSD, with everything essentially running as a single machine.
My personal definition of "operating system instance" is: you can run a single multi-threaded process (the thing you would see in commands like ps or top) which can use the resources (memory, CPU cores, IO) of all the hardware, and do so within the CPU's normal instruction set: each load and store instruction can access the whole memory directly. The threads of that one process can run on all CPU cores in the hardware, and can use the normal thread-to-thread communication primitives (a shared address space, atomic memory operations such as CAS, locks, condition variables) to communicate and synchronize. IO (usually through the file system; other hardware resources are trickier) can use all storage devices attached to all nodes.
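To make that definition concrete, here is a minimal sketch in C with POSIX threads (a sketch only; the thread count and iteration count are made up, and you'd build it with something like cc -pthread): one process whose threads share a single address space and synchronize with an atomic counter plus a mutex and condition variable, exactly the primitives that only work when all cores see one coherent memory.

    /* Sketch of "one OS instance" from a program's point of view:
     * one process, several threads, one shared address space. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4

    static atomic_long counter;          /* shared counter, updated with atomic operations */
    static int done;                     /* number of finished workers, protected by the mutex */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            atomic_fetch_add(&counter, 1);   /* ordinary loads/stores over one shared memory */

        pthread_mutex_lock(&lock);
        done++;                              /* classic lock + condition variable hand-off */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);

        pthread_mutex_lock(&lock);
        while (done < NTHREADS)
            pthread_cond_wait(&cv, &lock);   /* only meaningful if all cores are cache-coherent */
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", (long)atomic_load(&counter));
        return 0;
    }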
Note that today, in a multi-socket machine, memory is already "NUMA", or Non-Uniform Memory Access: memory DIMMs are typically attached to one CPU chip. A thread running on a core in one socket has better access times to memory physically attached to that socket than to memory attached to the chip in the other socket. This is in addition to the L1/L2/L3 caches. For things like locks and condition variables to work, and for multi-threaded processes not to go crazy, the caches on the various CPU chips of such a machine need to be coherent, which is why the technical term for these machines is ccNUMA, or cache-coherent NUMA.
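As a hedged illustration of what NUMA looks like to a program (this assumes a Linux box with libnuma installed and linking with -lnuma; FreeBSD exposes similar information through its own domain-aware interfaces), a process can ask which node it is running on and deliberately place memory on a near or a far node:

    /* Sketch: query NUMA placement and allocate memory on specific nodes.
     * The 64 MiB size and the choice of node 0 as "far" are arbitrary;
     * if the local node happens to be node 0, both buffers are local. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        int cpu   = sched_getcpu();              /* core this thread happens to run on */
        int local = numa_node_of_cpu(cpu);       /* the node whose DIMMs are "close" */
        int nodes = numa_num_configured_nodes();
        printf("running on CPU %d, local node %d of %d nodes\n", cpu, local, nodes);

        /* Accesses to the remote buffer show the higher latency described above. */
        size_t sz = 64UL << 20;
        void *near_buf = numa_alloc_onnode(sz, local);
        void *far_buf  = numa_alloc_onnode(sz, 0);
        if (!near_buf || !far_buf)
            return 1;

        numa_free(near_buf, sz);
        numa_free(far_buf, sz);
        return 0;
    }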
As machines get bigger, it gets harder to provision the required memory bandwidth and stay cache coherent. That's why consumer machines top out at two sockets, and most enterprise machines at four: the hardware needed to keep all memory accesses coherent gets more and more elaborate, and it slows memory accesses down more and more.
There still are real supercomputers that are single-instance machines with very large CPU counts. All these machines can be configured to be partitioned into more or fewer "nodes" or instances. The two examples I know of are HP's Superdome (which can have hundreds of Itanium CPUs, and I think up to 32 CPU chips can form one instance), and the IBM P7-IH (which is built from drawers containing 32 CPU chips each, where multiple drawers can be linked into one cache-coherent NUMA machine with a de facto unlimited CPU count). Since each CPU chip has dozens of cores, the parallelism of these machines is pretty astounding. I remember logging in to one of them (running Linux), and the file /proc/cpuinfo showed 1024 CPUs present.
The problem with these ccNUMA supercomputers is that memory accesses get slower the farther away the memory is: the fastest memory (not counting cache) is attached to the CPU chip itself; it's a little slower if attached to a bridge chip that's perhaps shared by multiple CPU chips, slower still if attached to a CPU chip in the same module or building block or blade, slower again if attached to a different blade, and even slower if accesses have to go over a proprietary network within the computer. For this reason, most of these supercomputers are actually partitioned into many relatively small nodes; it's quite rare to run a single instance over a large share of the hardware, since keeping all memory accesses cache-coherent across the whole machine slows every node down.
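This memory hierarchy is even exposed to software as a distance table. Here is a hedged sketch (again assuming Linux with libnuma, linked with -lnuma) that prints it; by convention 10 means "local" and larger values mean farther away, so on a big ccNUMA machine the off-diagonal entries grow noticeably:

    /* Sketch: print the NUMA distance matrix the firmware reports. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        int n = numa_num_configured_nodes();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%4d", numa_distance(i, j));   /* 10 = local, larger = farther */
            printf("\n");
        }
        return 0;
    }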
The alternative to running a single OS instance over many chips is to run a cluster with many "computers" or nodes, and then run software that uses networking protocols (such as MPI, OpenSHMEM or MapReduce) to communicate with processes on other computers within the cluster. As described above, that can be done with free software. Whether this works well or badly depends on balance. On one hand, if the program can be partitioned so that most memory accesses are local and there is little need for communication between nodes, this works well. On the other hand, if very fast networking hardware with very low latency (in particular InfiniBand) is available, one can afford more communication between nodes. So this balance between the communication needs of the program and the communication capabilities of the hardware ultimately determines performance. Note that this balance is the same whether one is running a cluster of commodity computers or a ccNUMA supercomputer: both have a program with communication needs and hardware with communication capabilities. The real difference is that the ccNUMA supercomputers have much faster internal networking, custom-built for the CPUs involved, and heinously expensive. In addition, the real supercomputers employ whatever hardware "excesses" can be used to make them faster: water cooling, fiber-optic internal networking, DC power distribution (where power supplies are physically separate from the compute modules), unusual form factors (the components are much larger than standard 19" rackmount), and massive IO bandwidth (for example, one of those machines has 16 PCIe slots per compute module).
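For contrast with the shared-memory sketch above, here is a minimal MPI example (illustrative only; it would typically be built with mpicc and started with mpirun): each process keeps its own private memory, and the only way results are combined is by explicit messages over the network, rather than by loads and stores into a shared address space.

    /* Sketch of the cluster model: one process per node, private memory,
     * explicit messages instead of shared loads/stores. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank computes on its own local data... */
        double local = (double)rank;

        /* ...and combining the results requires going over the network. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", size, total);

        MPI_Finalize();
        return 0;
    }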
An interesting alternative is to run ScaleMP, a virtualization technology (like Xen or VMware) that pretends to create a single multi-core computer using the CPUs and memory of all computers in the cluster. The problem with ScaleMP is that it can't beat the fundamental limitation of the balance between communication need and hardware: even with 100-gigabit InfiniBand, accessing (virtualized) memory at the other end of a network cable is much slower than accessing local memory. This is the same effect that causes the big supercomputers to usually be partitioned into smaller nodes.
Feasible: Yes, see above. Desirable: For most problems, massively parallel single-instance machines are not desirable, because they are too slow; cluster technologies are more sensible today, but require a large investment in software, and in rewriting the code that actually performs the calculations. In some cases, massively parallel machines are still used.