I'm under the impression that you can take a bunch of computers, hook them all up together, and have one big computer running a single instance of FreeBSD, with everything essentially running as a single machine.
My personal definition of "operating system instance" is: you can run a single multi-threaded process (the thing you would see in commands like ps or top) which can use the resources (memory, CPU cores, IO) of all the hardware, and do so within the CPU's normal instruction set: each load and store instruction can access the whole memory directly. The threads of that one process can run on all CPU cores in the hardware, and can use the normal thread-to-thread communication primitives (a shared address space, atomic memory operations such as CAS, locks, condition variables) to communicate and synchronize. IO (usually through the file system; other hardware resources are trickier) can use all storage devices attached to all nodes.
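To make that definition concrete, here is a minimal sketch in C with POSIX threads (a sketch only; the thread count and iteration count are made up, and you'd build it with something like cc -pthread): one process whose threads share a single address space and synchronize with an atomic counter plus a mutex and condition variable, exactly the primitives that only work when all cores see one coherent memory.

    /* Sketch of "one OS instance" from a program's point of view:
     * one process, several threads, one shared address space. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4

    static atomic_long counter;          /* shared counter, updated with atomic operations */
    static int done;                     /* number of finished workers, protected by the mutex */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            atomic_fetch_add(&counter, 1);   /* ordinary loads/stores over one shared memory */

        pthread_mutex_lock(&lock);
        done++;                              /* classic lock + condition variable hand-off */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);

        pthread_mutex_lock(&lock);
        while (done < NTHREADS)
            pthread_cond_wait(&cv, &lock);   /* only meaningful if all cores are cache-coherent */
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", (long)atomic_load(&counter));
        return 0;
    }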
Note that today, in a multi-socket machine, memory is already "NUMA", or Non-Uniform Memory Access: memory DIMMs are typically attached to one CPU chip. A thread running on a core in one socket has better access times to memory physically attached to that socket than to memory attached to the chip in the other socket. This is in addition to the L1/L2/L3 caches. For things like locks and condition variables to work, and for multi-threaded processes not to go crazy, the caches on the various CPU chips of such a machine need to be coherent, which is why the technical term for these machines is ccNUMA, or cache-coherent NUMA.
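As a hedged illustration of what NUMA looks like to a program (this assumes a Linux box with libnuma installed and linking with -lnuma; FreeBSD exposes similar information through its own domain-aware interfaces), a process can ask which node it is running on and deliberately place memory on a near or a far node:

    /* Sketch: query NUMA placement and allocate memory on specific nodes.
     * The 64 MiB size and the choice of node 0 as "far" are arbitrary;
     * if the local node happens to be node 0, both buffers are local. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        int cpu   = sched_getcpu();              /* core this thread happens to run on */
        int local = numa_node_of_cpu(cpu);       /* the node whose DIMMs are "close" */
        int nodes = numa_num_configured_nodes();
        printf("running on CPU %d, local node %d of %d nodes\n", cpu, local, nodes);

        /* Accesses to the remote buffer show the higher latency described above. */
        size_t sz = 64UL << 20;
        void *near_buf = numa_alloc_onnode(sz, local);
        void *far_buf  = numa_alloc_onnode(sz, 0);
        if (!near_buf || !far_buf)
            return 1;

        numa_free(near_buf, sz);
        numa_free(far_buf, sz);
        return 0;
    }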
As machines get bigger, it gets harder to provision the required memory bandwidth and stay cache coherent. That's why consumer machines top out at two sockets, and most enterprise machines at four: the hardware needed to keep all memory accesses coherent gets more and more elaborate, and it slows memory accesses down more and more.
There still are real supercomputers that are single-instance machines with very large CPU counts. All these machines can be configured to be partitioned into more or fewer "nodes" or instances. The two examples I know of are HP's Superdome (which can have hundreds of Itanium CPUs, and I think up to 32 CPU chips can form one instance), and the IBM P7-IH (which is built from drawers containing 32 CPU chips each, where multiple drawers can be linked into one cache-coherent NUMA machine with a de facto unlimited CPU count). Since each CPU chip has dozens of cores, the parallelism of these machines is pretty astounding. I remember logging in to one of them (running Linux), and the file /proc/cpuinfo showed 1024 CPUs present.
The problem with these ccNUMA supercomputers is that memory accesses get slower the farther away the memory is: the fastest memory (not counting cache) is attached to the CPU chip itself; it's a little slower if attached to a bridge chip that's perhaps shared by multiple CPU chips, slower still if attached to a CPU chip in the same module or building block or blade, slower again if attached to a different blade, and even slower if accesses have to go over a proprietary network within the computer. For this reason, most of these supercomputers are actually partitioned into many relatively small nodes; it's quite rare to run a single instance over a large share of the hardware, since keeping all memory accesses cache-coherent across the whole machine slows every node down.
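This memory hierarchy is even exposed to software as a distance table. Here is a hedged sketch (again assuming Linux with libnuma, linked with -lnuma) that prints it; by convention 10 means "local" and larger values mean farther away, so on a big ccNUMA machine the off-diagonal entries grow noticeably:

    /* Sketch: print the NUMA distance matrix the firmware reports. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        int n = numa_num_configured_nodes();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%4d", numa_distance(i, j));   /* 10 = local, larger = farther */
            printf("\n");
        }
        return 0;
    }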
The alternative to running a single OS instance over many chips is to run a cluster with many "computers" or nodes, and then run software that uses networking protocols (such as MPI, OpenSHMEM or MapReduce) to communicate with processes on other computers within the cluster. As described above, that can be done with free software. Whether this works well or badly depends on balance. On one hand, if the program can be partitioned so that most memory accesses are local and there is little need for communication between nodes, this works well. On the other hand, if very fast networking hardware with very low latency (in particular InfiniBand) is available, one can afford more communication between nodes. So this balance between the communication needs of the program and the communication capabilities of the hardware ultimately determines performance. Note that this balance is the same whether one is running a cluster of commodity computers or a ccNUMA supercomputer: both have a program with communication needs and hardware with communication capabilities. The real difference is that the ccNUMA supercomputers have much faster internal networking, custom-built for the CPUs involved, and heinously expensive. In addition, the real supercomputers employ whatever hardware "excesses" can be used to make them faster: water cooling, fiber-optic internal networking, DC power distribution (where power supplies are physically separate from the compute modules), unusual form factors (the components are much larger than standard 19" rackmount), and massive IO bandwidth (for example, one of those machines has 16 PCIe slots per compute module).
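For contrast with the shared-memory sketch above, here is a minimal MPI example (illustrative only; it would typically be built with mpicc and started with mpirun): each process keeps its own private memory, and the only way results are combined is by explicit messages over the network, rather than by loads and stores into a shared address space.

    /* Sketch of the cluster model: one process per node, private memory,
     * explicit messages instead of shared loads/stores. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank computes on its own local data... */
        double local = (double)rank;

        /* ...and combining the results requires going over the network. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", size, total);

        MPI_Finalize();
        return 0;
    }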
An interesting alternative is to run ScaleMP, a virtualization technology (like Xen or VMware) that pretends to create a single multi-core computer using the CPUs and memory of all computers in the cluster. The problem with ScaleMP is that it can't beat the fundamental limitation of the balance between communication need and hardware: even with 100-gigabit InfiniBand, accessing (virtualized) memory at the other end of a network cable is much slower than accessing local memory. This is the same effect that causes the big supercomputers to usually be partitioned into smaller nodes.
Feasible: Yes, see above. Desirable: For most problems, massively parallel single-instance machines are not desirable, because they are too slow; cluster technologies are more sensible today, but require a large investment in software, and in rewriting the code that actually performs the calculations. In some cases, massively parallel machines are still used.