Characteristics of hardware for SOHO NAS boxes & mini/micro servers

Your valuable feedback is welcome, as well as links & hints to reasonable hardware (or parts).

While researching (refurbished/used) pre-built systems or (preferably used) parts to build my own homegrown SOHO microserver, I was shocked that most offerings do not fulfill even my most basic requirements and violate basic principles of engineering. Therefore, I'd like to share some thoughts. Let's begin with some

Basic physics & conclusions (fact -> conclusion, and "No, there are no alternative facts to this, because there is no alternative physics")
- Warm air flows up (if surrounded by colder air & not artificially blown or sucked otherwise). -> A fan working against this natural air flow needs more energy. Instead, fans should amplify the natural air flow.
- Colder air can dissipate more heat than warmer air. -> If the fan fails, parts likely to overheat quickly are:
  - the upper HDDs if placed horizontally (in a stack),
  - the upper components of a motherboard placed vertically.
- The lifetime of electronic devices relates to their operating temperature; it is reduced in a hot environment. Thermal problems are the main source of failure for HDDs. -> When the system is idle but the HDDs keep spinning and the fan goes off, the upper HDDs are warmer if placed horizontally (stacked). This holds true for non-rotating devices (SSDs) as well. Thus the upper ones are more likely to fail.
- Holes in the vertical sides of the case cannot easily be covered by accident (e.g. by a manual placed on top). Holes in the bottom or lower vertical sides allow fresh, cold air to flow in. -> Holes in the upper (preferably vertical) sides of the case allow hot air to flow out naturally, even when the fan fails or is off.
- The PSU (power supply unit) is a source of heat. -> Thus, the best placement is: horizontal MB below vertical HDDs, with the PSU beside, on top, or external. Experience shows that a vertical MB is fine in most conditions.
- Heat conductivity: aluminium (alloy) > other metals > plastic/GRP. Parts made of plastic are more likely to break. -> Preferred materials for the case are aluminium or metal.
- A case with a standard bezel for the external connections of the mainboard can easily be upgraded by installing a modern MB. -> In contrast, most existing NAS boxes cannot be upgraded easily, unless one is able & willing to cut the bezel off the rear side of the case.
- Parts (MB, PSU) of proprietary sizes cannot easily be replaced by commodity parts of standard sizes. -> Avoid devices with a motherboard of a proprietary form factor. Watch out for the connections for the PSU: the ATX type is the most compatible (caution: ATX 1.x vs. ATX 2.x).
- For disks >9 TB, the likelihood of an undetected bit-failure is astonishingly high. Naturally, this can also affect a stored checksum. -> For disks >9 TB, a 2-way mirror or any std. RAIDx with one parity disk is not safe anymore. Instead, use at least 3-way mirrors or RAID with more than one parity disk. The point here is that a 4-disk NAS box can then only be used for a 3-way mirror with one spare, not as RAIDn, n >= 3.
- Two (or more) network interfaces can be configured for automatic fail-over, and/or bundled with a standardized protocol (LACP). -> Usually one network iface is sufficient for most use cases, until the day the plastic clip that secures the cable in the plug breaks...
- Any reasonable means of remote management, e.g. a KVM switch, avoids the need to plug in a keyboard and monitor to access the console before the OS runs. A serial console is only ok if the boot loader is runnable. -> A dedicated network iface for LOM/OOB management allows placing the box in a remote office, cellar, cubbyhole, or wardrobe (ventilation!). IPMI is a standardized protocol, and even some mITX mainboards support it.
- The likelihood for natural cosmic radiation to cause a bit flip in a 4 GB array of volatile RAM is astonishingly high. -> On average, expect one bit flip in 4 GB RAM every day (of runtime). If such a bit error happens undetected in a vital piece of kernel memory, the machine will likely crash. Using non-ECC RAM for any server with more than 4 GB RAM eventually will cause trouble.
The above thoughts are all guided by an emphasis on reliability and environmental friendliness: if a fan fails, a system in a "naturally cold" case will have more time for a controlled shutdown. Re-using an existing case reduces waste and avoids the energy consumed to manufacture a new one. The miniITX form factor allows for one extra PCI card, miniDTX allows for two, but many manufacturers meet the need for two PCI cards by using a riser card on an mITX board, which is perfectly ok IMHO.
 
Most of what you're saying is right. But some of these observations are exaggerated.

Warm air flows up
Yes, but the power density (W/liter or W/square centimeter of surface area) of most modern electronics is so high that convection cooling alone won't be sufficient. Also, the temperature differences on electronics are not very large: convection works really well when you have an open flame (hundreds of degrees C difference); between room temperature (25) and a warm CPU (65) or a disk (45 degrees), the difference is small, and doesn't cause much convection. Once you introduce fans for moving air, the little bit of convection is not very important.

In this department, you overlooked an important observation:
Water has much higher heat capacity than air, and is similarly easy to move around.
Water cooling is an excellent idea. Shame it is only being used by crazy overclockers and people who think of their computers as status symbols and decorations, and large industrial computing installations.
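To put a number on that, here is a quick back-of-the-envelope comparison of my own (rounded textbook values for density and specific heat at room temperature, nothing measured):

```python
# Back-of-the-envelope comparison of volumetric heat capacity:
# how much heat 1 liter of coolant carries away per kelvin of temperature rise.
# Material constants are rounded textbook values at ~25 degC / 1 atm.

AIR_DENSITY_G_PER_L = 1.2            # g/L
AIR_CP_J_PER_G_K = 1.0               # J/(g*K), at constant pressure

WATER_DENSITY_G_PER_L = 1000.0       # g/L
WATER_CP_J_PER_G_K = 4.18            # J/(g*K)

def volumetric_heat_capacity(density_g_per_l: float, cp_j_per_g_k: float) -> float:
    """Heat carried per liter and per kelvin of temperature rise, in J/(L*K)."""
    return density_g_per_l * cp_j_per_g_k

air = volumetric_heat_capacity(AIR_DENSITY_G_PER_L, AIR_CP_J_PER_G_K)
water = volumetric_heat_capacity(WATER_DENSITY_G_PER_L, WATER_CP_J_PER_G_K)

print(f"air:   {air:8.1f} J/(L*K)")
print(f"water: {water:8.1f} J/(L*K)")
print(f"water carries ~{water / air:.0f}x more heat per liter")   # roughly 3500x
```

Roughly 3500x more heat per liter per kelvin is why a thin water loop can replace a lot of airflow.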

A fan working against this natural air flow needs more energy. Instead fans should amplify the natural air flow.
In practice, that makes little difference. The energy usage of interior fans is a small fraction of the energy usage of a computer. Other things (like air conditioning and large-scale heat management) are big factors. In the bad old days, data centers used to use about 2W for every 1W the computer dissipated, with the "waste" split between power supply inefficiency, local fans (in the case), building fans, and actual cooling. Modern data centers have this down to about 1.1W (or 10x less waste), so trying to make cooling more efficient is no longer a huge battle. But clearly, consumer computers are much less well optimized. This is one of the reasons one should outsource computing to data centers: they are much more efficient than home equipment (in particular home-built).

If the fan fails, parts likely to overheat quickly are: ...
True. But if a fan fails, the remaining redundant fans should keep the system running. Or the system should shut down instantaneously, long before overheating causes damage. This requires tight integration between fan control, the rest of the hardware, the OS, and larger control systems (like batch queueing systems that need to know how many computers are available for the workload).

The lifetime of electronic devices relates to their operating temperature.
Nearly completely true. Except don't overdo it: really cold temperatures are actually quite bad for some hardware. Disk drives (which have moving parts) will struggle to work at temperatures of 10 or 12 degrees, and will often go into verify-after-write mode. I think part of the reason is that the lubricants (spindle, head actuator) are too sticky. And other components like PC boards risk condensation if parts of the environment are very cold.

The PSU (power supply unit) is a source of heat.
Which is why modern computer architectures split the power supply into various levels. For example, while CPUs internally run on 1.2 or 1.3V, that power is made locally (right next to the CPU), stepped down typically from 12V. Large-scale computer gear tends to run with DC supplies, with common voltages ranging from 48V to 350V. What DC voltage to use is a complex tradeoff, between safety (350V DC is very deadly), efficiency (less copper, less ohmic loss), suitability for backup batteries, and power supply efficiencies.
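As a rough illustration of the "less copper, less ohmic loss" trade-off (the 500 W load and 10 mOhm cable resistance below are made-up example values, not from any particular design):

```python
# Why higher distribution voltages lose less power in the cabling:
# for a fixed power P delivered over a cable of resistance R,
# the current is I = P / V and the ohmic loss is P_loss = I^2 * R.
# The 500 W load and 10 milliohm cable are made-up illustration values.

P_LOAD_W = 500.0          # power delivered to the load (assumed)
R_CABLE_OHM = 0.010       # round-trip cable resistance (assumed)

for v_dist in (12.0, 48.0, 350.0):
    current_a = P_LOAD_W / v_dist
    loss_w = current_a ** 2 * R_CABLE_OHM
    print(f"{v_dist:5.0f} V: I = {current_a:6.1f} A, cable loss = {loss_w:7.3f} W")

# 12 V:  ~41.7 A -> ~17.4 W lost in the cable
# 48 V:  ~10.4 A ->  ~1.1 W
# 350 V:  ~1.4 A -> ~0.02 W
```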

If you water cool one thing in the computer, it is the CPU (or GPU today). The second thing is typically the power supply. Water cooling RAM or disks is only needed in extreme machines.

Parts (MB, PSU) of proprietary sizes cannot easily be replaced by commodity parts of standard sizes.
Avoid devices with a motherboard of a proprietary form factor.
For large-scale installations, commodity parts and standard sizes are irrelevant. The FAANG all have custom components, and unusual form factors, for higher efficiency. If you buy computers by the million, getting 1% more efficiency out by custom-building all boards is more important than being able to buy off-the-shelf parts.

For disks >9 TB, the likelihood of an undetected bit-failure is astonishingly high.
You are exaggerating ... at 9TB, it is "only" 7.2% (assuming an uncorrectable BER of 10^-15). Which is still very scary: If you have over a dozen disks that size, you will likely have an uncorrected error.
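For anyone who wants to reproduce that 7.2% figure, here is a small sketch; it assumes statistically independent bit errors and takes a vendor-quoted UBER of 10^-15 at face value:

```python
import math

# Chance of at least one uncorrectable bit error when reading a full 9 TB
# disk once, assuming independent bit errors and a datasheet UBER of 1e-15.

UBER = 1e-15                 # uncorrectable bit error rate, per bit read
disk_bits = 9e12 * 8         # 9 TB (decimal) = 7.2e13 bits

p_approx = UBER * disk_bits                              # linear estimate: 7.2%
p_exact = -math.expm1(disk_bits * math.log1p(-UBER))     # 1 - (1 - UBER)^bits, ~6.9%

print(f"linear approximation:     {p_approx:.1%}")
print(f"exact (independent bits): {p_exact:.1%}")
```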

Naturally, this can also affect a stored checksum.
And that's why you don't store the checksum right next to the data. Instead, you want to have the checksum on a separate device, ideally stored even more redundantly than the data itself.

For disks >9 TB, a 2-way mirror or any std. RAIDx with one parity disk is not safe anymore. Instead, use 3-way mirrors or RAID with more than one parity disk.
Very true. About 15 years ago, the CTO of NetApp referred to single-fault-tolerant RAID as "professional malpractice".

A serial console is only ok if the boot loader is runnable.
From a purist viewpoint, that's true. But the boot loader works most of the time, so a serial (or networked) console covers 99% of the problems.

A dedicated network iface for LOM/OOB management allows placing the box in a remote office, cellar, cubbyhole, or wardrobe ...
True. But having to string a completely separate physical layer (separate interfaces, patch cables, switches, and IP numbers) is too much effort. Again, you can get 99% of the benefit by configuring VLANs, and reserving one VLAN for OOB management.

The likelihood for natural cosmic radiation to cause a bit flip in a 4 GB array of volatile RAM is astonishingly high.
No, that number is nonsense. It was calculated by taking the physical size of ancient RAM cells and multiplying with the high densities of modern RAM. It's like saying "What would happen if you had a computer with 4 GB of RAM, built using 64 Kbit chips" ... which nobody would ever do, because you would need millions of chips. The error rate is to first order proportional to the physical volume of silicon that's an active capacitor cell, and that volume has not changed significantly in the last ~40 years for common computers.

There is a good paper (alas, already ~10 or 15 years old) by Bianca Schroeder, using Google data, which went into this in gory detail. The correct answer is: cosmic rays are irrelevant to memory errors. But ...

Using non-ECC RAM for any server with more than 4 GB RAM eventually will cause trouble.
... but this remains true (if you ignore the 4 GB): Memory errors exist, they are real. But they're not caused by cosmic rays, nor are they proportional to the amount of RAM. Most memory errors are caused by hardware problems, of which loose connections are the leading cause. And on servers, memory errors can have catastrophic consequences, which is why ECC is highly recommended.
 
There are not many professional-grade boards with energy-saving CPU(s): Atom etc.
E.g. 8+ sockets for ECC DRAM modules, 3-4 PCIe slots, 2+ NICs onboard

A PSU that actually conforms to the ATX standard regarding different loads on +3.3/+5 and +12V, without exceeding the allowable output voltage ranges (the manufacturers all claim to be ATX compliant, but in my laboratory research of a few dozen power supplies I found only one manufacturer whose PSUs actually were in-spec in all compliance tests)

Not-too-big case (max. side length ~40cm) that allows for tool-less handling of at least 4 drives.
(Ideally, the "front plate" would be on the back side, or at least the power LED and the reset and power buttons.)

Air inlets placed in a way that one can put air filter sheets over them, so that the insides stay clean.

Separate enclosure for backup tape device - not in server case, to avoid dirt.
 
I wrote "case" above, which is ambivalent; I will now use "housing", "casing" or "chassis" instead.
Yes, but the power density (W/liter or W/square centimeter of surface area) of most modern electronics is so high that convection cooling alone won't be sufficient. Also, the temperature differences on electronics are not very large: convection works really well when you have an open flame (hundreds of degrees C difference); between room temperature (25) and a warm CPU (65) or a disk (45 degrees), the difference is small, and doesn't cause much convection.
1st, AFAIK a heating engineer estimates that a difference of about 20 K is needed for (a reasonable amount of) heat transfer to occur. So when the CPU surface is at 65°C and the interior of the box at 45°C, heat will be dissipated and convection occurs. 2nd, I'm talking about small (not so powerful) servers, SOHO NAS boxes in particular, where the CPU is often (usually?) passively cooled. Even in a machine powerful enough to do more than just serve as a NAS box, the HPE MicroServer Gen10+ (interior), the CPU does not have a dedicated fan; instead, a two-part semi-passive cooler (placed in the air flow of the housing fan) dissipates the heat off the CPU. In small SOHO NAS boxes, the casing fan can go off when the machine is idle. Then the only cooling is by convection (loosely speaking), and my conclusion holds true (if the disks keep spinning, which they typically do). Also, I've occasionally read in hardware forums about the upper HDD being much hotter than the lower one.
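Just to sanity-check the "passive cooling works for small boxes" claim, a rough Newton's-law-of-cooling estimate; the heat-transfer coefficient and fin area are assumed ballpark figures for still air and a small finned heatsink, not measurements:

```python
# Rough estimate of how much heat a passive heatsink can shed by natural
# convection alone, using Newton's law of cooling: Q = h * A * dT.
# The heat-transfer coefficient range and fin area are assumed ballpark
# values for still air and a small finned heatsink, not measured data.

H_NATURAL_W_PER_M2_K = (5.0, 10.0)   # typical range for natural convection in air
FIN_AREA_M2 = 0.05                   # ~0.05 m^2 of effective fin surface (assumed)
DELTA_T_K = 20.0                     # heatsink 20 K above the surrounding air

for h in H_NATURAL_W_PER_M2_K:
    q_watts = h * FIN_AREA_M2 * DELTA_T_K
    print(f"h = {h:4.1f} W/(m^2*K): ~{q_watts:4.1f} W dissipated")   # ~5-10 W
```

So roughly 5-10 W can leave a modest heatsink at a 20 K difference by convection alone, which fits a low-TDP SoC at idle, but not a 65 W CPU under load.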
In this department, you overlooked an important observation:
Water has much higher heat capacity than air, and is similarly easy to move around.
Water cooling is an excellent idea. Shame it is only being used by crazy overclockers and people who think of their computers as status symbols and decorations, and large industrial computing installations.
Thanks for the reminder, I had totally forgotten about that. Water cooling is an excellent method to get a near-noiseless computer. I will calculate and research whether it's even possible to use natural convection (w/o a pump), but my first guess is that the height needed for that is too large. Besides that, I remember an article in a computer magazine about the ultimate noiseless, but powerful computer: the chassis was replaced by an aquarium filled with standard synthetic engine oil (oil does not carry electrical current). The heat was dissipated by natural convection, and it worked fine... :cool: that's what I call a clever solution! Of course the MB has to be placed vertically, with the external connectors above oil level, facing upwards.
But clearly, consumer computers are much less well optimized. This is one of the reasons one should outsource computing to data centers: they are much more efficient than home equipment (in particular home-built).
I'm talking about small SOHO servers, and I think there are good reasons & demand for such.
True. But if a fan fails, the remaining redundant fans should keep the system running.
Most SOHO NAS boxes have only one fan, even the relatively powerful machine mentioned above. Seems that fan failures do not happen that often, i.e. the longevity of fans exceeds the typical lifetime of a computer.
For large-scale installations, commodity parts and standard sizes are irrelevant. The FAANG all have custom components, and unusual form factors, for higher efficiency. If you buy computers by the million, getting 1% more efficiency out by custom-building all boards is more important than being able to buy off-the-shelf parts.
I'm talking about SOHO hardware, not FAANG datacenters... besides that, even these have SOHO hardware in their numerous small remote offices, right? The point of concern is: most of the SOHO NAS boxes I found are built to be used once and then thrown away. If they were designed to take standard commodity parts, at least an old chassis could be re-used and pimped out with a modern motherboard. If need be, the PSU can be replaced, too.
You are exaggerating ... at 9TB, it is "only" 7.2% (assuming an uncorrectable BER of 10^-15). Which is still very scary: If you have over a dozen disks that size, you will likely have an uncorrected error.
I wouldn't call that exaggerating; your numbers support my statement (BER := bit error rate). BTW, P(uncorrectable error on at least one disk) = 1 - (1 - P(err))^N > 50% ("likely") with N = 10 disks. The calculation to get the "safety" (1 - P(error on all disks)) is 1 - 7.2%^2 = 99.48% ("not safe enough"), 1 - 7.2%^3 = 99.96% ("somewhat safe"), and 1 - 7.2%^4 = 99.997% (considered "safe", nearly "5 nines"), where the exponent denotes the number of disks in a mirror; the same holds for a RAIDx, x>1, where it's the number of parity disks + 1. Conclusion: using bigger disks saves energy, but increases the likelihood of unrecoverable data errors.
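The same numbers, spelled out as a small script (this keeps the independence assumption from above and ignores correlated failures, so it is if anything optimistic):

```python
# Redoing the "safety" numbers from above. p is the chance that a single
# 9 TB disk returns at least one uncorrectable error during a full read
# (the ~7.2% from earlier in the thread). For a k-way mirror, a block is
# lost only if all k copies of it fail, so "safety" = 1 - p^k.
# Correlated failures (same batch, same shelf, same PSU) are ignored.

p = 0.072

# At least one bad disk among N independent disks:
N = 10
p_any = 1.0 - (1.0 - p) ** N
print(f"P(error on at least one of {N} disks) = {p_any:.1%}")   # ~52.6%

for k in (2, 3, 4):   # 2-, 3-, 4-way mirror (or 1, 2, 3 parity disks + data)
    safety = 1.0 - p ** k
    print(f"{k}-way redundancy: safety = {safety:.4%}")
# 2-way: 99.4816%, 3-way: 99.9627%, 4-way: 99.9973%
```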
And that's why you don't store the checksum right next to the data. Instead, you want to have the checksum on a separate device, ideally stored even more redundantly than the data itself.
Where does that come from? Who says that? Which (ideally open-source) LVM can do that? Is that available for FreeBSD? Can ZFS store its checksums on a dedicated, mirrored device? Does FreeBSD's geom(4) or gvinum(8) support such functionality? Is that planned for a future version? Or will large drives of the future have their own checksums stored in (possibly mirrored) NVRAM? The point is: ZFS's checksum can only detect, not repair, bad data. I.e. when the checksum itself is undetectably erroneous, correct data will be flagged as bad, but it is very close to impossible for undetected bad data to be flagged as good.
Very true. About 15 years ago, the CTO of NetApp referred to single-fault-tolerant RAID as "professional malpractice".
IIRC NetApp leverages FreeBSD, so I'm in good company.
From a purist viewpoint, that's true. But the boot loader works most of the time, so a serial (or networked) console covers 99% of the problems. [...] True. But having to string a completely separate physical layer (separate interfaces, patch cables, switches, and IP numbers) is too much effort. Again, you can get 99% of the benefit by configuring VLANs, and reserving one VLAN for OOB management.
OK, I'll keep that in mind. Maybe I can live with installing the box once with HIDs attached and then placing it in my wardrobe, in the hope that I'll never have to touch the BIOS again. Then that HPE MicroServer Gen10+ (or even the previous model, the Gen10) may make it onto my watchlist again; it lacks an M.2 connector and a 5th SATA connector for a separate boot device (SSD). The latter would really be nice, to keep the HDDs for data only.
No, that number is nonsense. [...] The correct answer is: cosmic rays are irrelevant to memory errors. But ...
Thanks for the explanation. Didn't know that; all I remembered was a paper I fetched from IBM's website about 12 years ago. Maybe they wanted to scare their customers into keeping their hands off commodity hardware.
There are not many professional-grade boards with energy-saving CPU(s): Atom etc.
E.g. 8+ sockets for ECC DRAM modules, 3-4 PCIe slots, 2+ NICs onboard
For the typical SOHO use case, noise, and thus energy consumption, is of high concern. IMHO two DIMM slots are enough, because every module constantly consumes power, and for a low-power CPU, 2-channel RAM access to 2 slots should offer more than adequate performance. On a board w/ 4 DIMM slots, I'm going to disable two by clamping a piece of cardboard into the slots: "Keep empty or prepare to use a larger PSU". Likewise for the number of PCI slots: every card uses energy, and space. AFAIK SuperMicro has proven to be highly compatible with FreeBSD, and one can comfortably select criteria to find a matching product on their website. Of course, many other manufacturers offer compatible and very good boards, too.
[...], but in my laboratory research of a few dozen power supplies I found only one manufacturer whose PSUs actually were in-spec in all compliance tests.
Which one is that? Maybe a good external DC PSU is more efficient?
Not-too-big case (max. side length ~40cm) that allows for tool-less handling of at least 4 drives.
IMHO, the term mini implies a cube of max. 30cm -> miniITX, miniDTX or FlexATX. YMMV.
Air inlets placed in a way that one can put air filter sheets over them, so that the insides stay clean.
Do you have a cat? You can fix the filter with sticky tape. Yes, simple rails to hold a filter sheet would be nice. Never seen that, though.

Now I'm looking for a motherboard, taking that HPE MicroServer Gen10+ mentioned above as a guideline. Some boards have an M.2 adapter to connect an extra-fast SSD (e.g. for a cache device), but I guess that doesn't make much sense in an otherwise rather low-performance system (CPU TDP max. 65W; the HPE box has 72W), all the more as it would increase TCO (also through power consumption). An NVMe adapter is comparable to that (even better), right? I'd like to keep the PSU at/below 180W max., aiming at a box of (aggregated) about 2-4x the CPU power of my laptop (2-core Broadwell-U @ 2.6-3.2GHz turbo, 14nm die, 15W TDP). Ugh, now I have to research all these new connection standards...

I'm comparing 3 mini-ITX boards with Xeon D-15xx & D-21xx CPUs (12 cores, same generation as my laptop (Broadwell), & 8 cores, next-generation Skylake, resp.; both 14nm), namely the SuperMicro X10SDV-12C+-TLN4F & X11SDV-8C-TLN2F (w/o CPU fan; the -8C+- variant w/ CPU fan). I'm asking myself if that's adequate for my goal, as the first has far more cores than I expected (4-8, with 6 implying a higher clock freq.). These are SoCs w/o on-chip GPU; does that make up for a factor of up to 1.5 in the number of cores (TDP: 4x15W=60W)? A laptop of 2015 is more or less a SoC, too, isn't it? Found the answer: the D-1567 (12 cores) runs @2.1GHz, 20% lower than my laptop, so it will have less single-thread performance (can't find anything about its turbo freq.). Hmm. Have to think about that, don't know if I want that. The next-gen D-2141I (8 cores) runs @2.2GHz (3.0GHz turbo). Comparing only the turbo freq., that looks ok, as it's not significantly slower than my laptop (94%), but many tasks will run for quite some time and the turbo doesn't apply then?

Also I'll have to research whether the integrated network is crap, and the differences between these Xeon D-15xx (2 RAM channels) and D-21xx (4 ch.), as well as between the Broadwell & Skylake architectures (presumably better power efficiency, +DDR4 DIMM, +basic AVX-512, -FIVR, ...). When the VRM is external to the CPU (Broadwell: integrated), does that mean its power loss is not included in the TDP? These VRMs dissipate much heat; external ones have fairly large heatsinks. If yes, then the D-21xx should have significantly better performance at the same TDP, and I have to estimate the VRM's power loss and add that to the machine's total!

I will have to think about the pros & cons of a SoC "server" vs. a "workstation/desktop"-class CPU, for the latter has an integrated GPU (don't need?) & better single-thread performance (want?). Skylake's GPU supports hardware-assisted decoding of additional media formats, and OpenCL 2.0. Do I have any advantage from that when I run an X11 app on that remote machine? Any disadvantage w/o GPU? I never had to massively re-code media data. If I want that in the future, I can buy a graphics card and plug it in. Currently my viewpoint is: buy now only what you want/need now, and do not buy possibilities that may never get used. Pros of the D-2141I vs. the D-1567: I get future-safe copper-based networking, and if RAM access is a bottleneck for my workflow, I can upgrade from 2 to 4-channel interleaved memory access. I'm sure I can find a performance comparison of D-15xx vs. D-21xx. Last but not least, I'll try to estimate a price/performance ratio. Reading both motherboards' manuals now...
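As a crude sanity check of the core-count vs. clock trade-off above, a naive cores x clock proxy (this deliberately ignores IPC differences between Broadwell and Skylake, sustained-turbo behaviour, and memory bandwidth; the D-1567 turbo clock is unknown to me, so its base clock is used):

```python
# Very rough throughput proxy = cores * clock (GHz), relative to the laptop.
# Ignores IPC, turbo-under-sustained-load and memory bandwidth differences,
# so treat it as a back-of-the-envelope upper bound, not a benchmark.

candidates = {
    "laptop (2C Broadwell-U)": (2, 2.6, 3.2),    # cores, base GHz, turbo GHz
    "Xeon D-1567 (12C)":       (12, 2.1, 2.1),   # turbo unknown to me, use base
    "Xeon D-2141I (8C)":       (8, 2.2, 3.0),
}

base_cores, base_clk, base_turbo = candidates["laptop (2C Broadwell-U)"]
for name, (cores, clk, turbo) in candidates.items():
    multi = (cores * clk) / (base_cores * base_clk)   # all cores busy, base clocks
    single = turbo / base_turbo                       # one thread at turbo clock
    print(f"{name:25s} ~{multi:4.1f}x multi-thread, ~{single:4.2f}x single-thread")
```

On this naive metric both boards land in or above the 2-4x target, and the D-2141I's ~0.94x single-thread figure matches the 94% mentioned above.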
 
Yes, I have cats. And this mess in the computers... argh... I have to thoroughly vacuum the computers' insides at least once or twice a year.

I use a cheap 2-core Atom as router, firewall, DNS ad-blocker/ad-logger and NAS. Something like a D2550 ITX board with 1x Ethernet and 1x WiFi onboard, and a 4-port network card in the only slot. The thingy has only 2x SATA, and this is annoying, as one needs to unplug the boot CD-ROM drive after installing FreeBSD and then plug in the second HDD for the ZFS mirror.

There are sweet ATX form-factor Atom server boards that are not as annoyingly limited as most ITX stuff. For this reason I don't like ITX; the extra 10cm of case side length is acceptable for the gain you get with bigger boards.

Regarding power supplies, what I particularly do not like is when other rails get affected by different loads on a rail. Good PSUs that regulate 3.3, 5 and 12 independently are very rare to find. For this reason I kept that small batch of Supermicro PSUs. These had a strong 12V rail, which also supplied two DC-DC converters to 3.3 and 5V. Technically opulent and expensive, as this is practically three instead of only one transformer, and so you don't find such quality in consumer grade stuff. The components, from dual sided FR4 board to quality Japanese capacitors, were all excellent.

This was the only PSU that didn't spike on the other voltages when big load changes occurred, and whose voltages stayed constant independently of the load ratios on the different voltages. Some other "good brand" PSUs took more than 1 sec to stabilize after the addition or removal of a 20A load. And within this time there are often extreme overvoltages or undervoltages on the other rails.
You don't want 2.8V or 4.1V on the 3.3V rail, or 9.8V or 15.2V on the 12V rail.
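For reference, the ATX12V design guide allows roughly +/-5% steady-state regulation on the positive rails (figure from memory, so double-check against the revision you target), which makes those excursions far out of spec:

```python
# Compare the observed excursions against an assumed +/-5% ATX tolerance
# on the positive rails (steady state; transient limits are tighter still).

TOLERANCE = 0.05   # +/-5%, assumed per the ATX12V design guide

observed = [            # (nominal rail, measured voltage) from the post above
    (3.3, 2.8), (3.3, 4.1),
    (12.0, 9.8), (12.0, 15.2),
]

for nominal, measured in observed:
    lo, hi = nominal * (1 - TOLERANCE), nominal * (1 + TOLERANCE)
    deviation = (measured - nominal) / nominal
    verdict = "within spec" if lo <= measured <= hi else "OUT OF SPEC"
    print(f"{nominal:4.1f}V rail at {measured:4.1f}V ({deviation:+.0%}): "
          f"allowed {lo:.2f}-{hi:.2f}V -> {verdict}")
```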

But, I have to say, it was just one model from Supermicro, and nobody knows how good the others are unless one examines them.

Edit: Another reason why I don't like ITX is the lack of room for drives. And the PSUs are usually crappy. There are good ATX cases of ~40x40x20cm that can take four 3.5" drives internally and have three 5.25" bays. That's what I prefer.
 
I'm talking about small (not so powerful) servers, SOHO NAS boxes in particular, where the CPU is often (usually?) passively cooled. ...
For low-power CPUs, this works great. As a matter of fact, the Raspberry Pi doesn't even need a heatsink at all for versions 0 through 2, the 3 works better with a heat sink (about 1.2 x 1.2cm and 5mm tall), and the RPi4 is the first one that really ought to have a fan. My home server (which is also a "NAS" with 4 disks) has a 1.8 GHz Atom and just a big heat sink on the CPU.

Water cooling is an excellent method to get a near-noiseless computer.
We built a 27-node cluster with the hottest CPUs available at the time, and with about 300 disk drives, and extreme networking (about 20 years ago), and it had zero fans and was water-cooled (but without any water circulating inside the compute cluster). The amazing thing: it was nearly completely silent in the lab! Look up pictures of "IBM Ice Cube" sometime.

Besides that, I remember an article in a computer magazine about the ultimate noiseless, but powerful computer: the chassis was replaced by an aquarium filled with standard synthetic engine oil (oil does not carry electrical current).
Immersion cooling. Great idea ... until you have to do maintenance. You can use oil, although I would not use engine oil (it has some strange things mixed in, like detergents and solvents), but mineral oil that you get in a pharmacy or drug store. I've actually used that for some electrical parts that needed to be used in the deep ocean (at high pressure); to prevent saltwater from getting into them, we put them in a water-tight but flexible enclosure filled with oil. Unfortunately, most electronics don't handle extreme pressures well, so we only put transformers and similar passive components in an oil bath. It's also lots of fun to go to a pharmacy and request 25 liters of mineral oil (it is usually used in small doses for constipation).

The problem with oil is that it is incredibly messy: you pull out your board, and now it drips oil everywhere. Doing something like plugging in a new connector will completely cover your hands and the whole work surface with a horrible mess. I know that some overclockers do it, but it's insane. The right type of fluid to use is specialized electronics immersion fluid, which is thin and water-like (not clingy) and evaporates reasonably quickly (so leaving your boards out in the air lets them dry). There was at least one Cray supercomputer that worked completely submerged in such a fluid.

Most SOHO NAS boxes have only one fan, even the relatively powerful machine mentioned above. Seems that fan failures do not happen that often, i.e. the longevity of fans exceeds the typical lifetime of a computer.
True, SOHO boxes seem to never have redundant fans. Enterprise and large-scale computers usually have them. Strange. Even my small home server has two fans, so if one fails, it would probably continue functioning at least so-so.

Which reminds me: I'm not monitoring the temperature nor the fans in my home server! Just added that to my to-do list.
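A minimal sketch of what such monitoring could look like on FreeBSD, assuming coretemp(4) (or amdtemp(4)) is loaded so the dev.cpu.N.temperature sysctl exists; fan-speed readout is chassis/IPMI specific and left out, and the threshold is an arbitrary example value:

```python
# Minimal temperature check for a FreeBSD home server: read the CPU
# temperature via sysctl and complain when it exceeds a threshold.
# Assumes coretemp(4)/amdtemp(4) is loaded; fan sensors are not covered.

import subprocess
import sys

THRESHOLD_C = 70.0   # alert above this (pick to suit your hardware)

def cpu_temperature_c(core: int = 0) -> float:
    """Return the temperature of dev.cpu.<core> in degrees Celsius."""
    out = subprocess.run(
        ["sysctl", "-n", f"dev.cpu.{core}.temperature"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()              # e.g. "46.0C"
    return float(out.rstrip("C°"))

def main() -> int:
    try:
        temp = cpu_temperature_c()
    except (subprocess.CalledProcessError, ValueError):
        print("could not read dev.cpu.0.temperature (is coretemp(4) loaded?)")
        return 1
    print(f"CPU core 0: {temp:.1f} C")
    if temp > THRESHOLD_C:
        print(f"WARNING: above {THRESHOLD_C} C -- check the fans!")
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run from cron (or a periodic(8) script) and mail yourself the output, and a dying fan gets noticed before the box cooks.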

(About storing checksums away from the data)
Which (ideally open-source) LVM can do that? Is that available for FreeBSD? Can ZFS store its checksums on a dedicated, mirrored device? Does FreeBSD's geom(4) or gvinum(8) support such functionality? Is that planned for a future version? Or will large drives of the future have their own checksums stored in (possibly mirrored) NVRAM? The point is: ZFS's checksum can only detect, not repair, bad data. I.e. when the checksum itself is undetectably erroneous, correct data will be flagged as bad, but it is very close to impossible for undetected bad data to be flagged as good.
ZFS stores the checksum not next to the block (nor in the block), but away from it: namely in the parent block. Since ZFS supports single-disk (non-redundant) pools, that's the best it can do. When running on a redundant pool (mirrored or RAID-Zx), the checksum will be protected by redundancy, just like data.

And checksums help with redundancy too: Imagine as a simple example a mirrored system. You read one copy, the disk drive says everything is OK, but the checksum is incorrect. At this point, you can read the other copy (and hope that it has a good checksum).
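A conceptual sketch (my own toy illustration, not ZFS code) of that read-verify-fallback logic, with the checksum kept away from the data as described:

```python
# Toy illustration of self-healing reads on a mirror: read one copy,
# verify it against a checksum stored away from the data (in ZFS terms,
# in the parent block), and fall back to the other copy on a mismatch.

import hashlib

def checksum(data: bytes) -> bytes:
    # ZFS uses fletcher4 or sha256; sha256 stands in here.
    return hashlib.sha256(data).digest()

def read_block(mirror_copies: list, expected_checksum: bytes) -> bytes:
    """Return the first copy whose checksum matches the stored one."""
    for i, copy in enumerate(mirror_copies):
        if checksum(copy) == expected_checksum:
            if i > 0:
                # A real filesystem would now rewrite the bad copies ("heal").
                print(f"copies 0..{i - 1} failed the checksum; using copy {i}")
            return copy
    raise IOError("all mirror copies failed the checksum -- unrecoverable")

# Example: disk 0 silently returns corrupted bytes, disk 1 is intact.
good = b"important data block"
stored_checksum = checksum(good)          # kept in the parent block, not with the data
copies = [b"important dXta block", good]
print(read_block(copies, stored_checksum))
```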

The only systems I know of that actually separate the checksums into different pools are commercial, high-end file systems and some RAID systems. For example, IBM's GPFS can store internal metadata in separate "pools" (I'm using a ZFS term here), and those can be configured to be on separate devices and/or have higher redundancy. For example, one might configure the data to be stored in 2-fault-tolerant parity-based RAID, but the metadata 4-way mirrored on SSD (exceedingly fast, and able to tolerate 3 failures). Obviously doing this requires many disk drives, but such systems are intended for installations with hundreds or thousands of disks.

My home server is probably about 8 years old (I forget whether I bought the hardware in 2010 or 2012). At the time, I looked for convenience, a small form factor, and low power consumption. The motherboard is a Mini-ITX Jetway NF99FL with a 4-core 32-bit Atom. It has more than enough CPU power for me. One particularly crazy reason I needed that specific motherboard was that I wanted a parallel port for a printer, and those were already hard to come by. The board has 6 SATA connectors (good!), but unfortunately no ECC. At that time, there was no practical way to get an ECC motherboard for the consumer market without going to very high-end CPUs and large form-factor boards.

I spent significant money on a high-quality power supply, which is efficient even under very low load and has a large fan. I added a case fan, which gives me some redundancy (the two fans are de facto in series: the PSU fan ejects warm air, the case fan blows cold air into the box). The case itself is a Lian-Li all-metal case, which has 4 removable slots for 3.5" disks, plus space for a few SSDs. The case is a little bigger than a shoe box. I've been quite happy with that system, except for the memory limitation (32-bit CPU means only 3GB usable RAM) and the lack of ECC.
 
We built a 27-node cluster with the hottest CPUs available at the time, and with about 300 disk drives, and extreme networking (about 20 years ago), and it had zero fans and was water-cooled (but without any water circulating inside the compute cluster). The amazing thing: it was nearly completely silent in the lab! Look up pictures of "IBM Ice Cube" sometime.
This makes me think of swimming pool reactors :)

True, SOHO boxes seem to never have redundant fans. Enterprise and large-scale computers usually have them. Strange. Even my small home server has two fans, so if one fails, it would probably continue functioning at least so-so.

Which reminds me: I'm not monitoring the temperature nor the fans in my home server! Just added that to my to-do list.
As long as you can feel sufficient air being blown/pulled with your hand, things are usually ok.
It's something I check regularly when cleaning; it takes only a few seconds.

Another sign of defective fans is the case getting warmer and warmer over time.
And when I open a computer in operation, I check the fans, too (touch them: do they have some power, or do they stop at the slightest resistance?).

But in the rare case of a sudden fan stop, dual fans are extremely helpful!
However, both must be of the same strength, which is often not the case in home-user-targeted cases.
Otherwise very bad effects can happen.

Personally I never had a heat kill (some close calls, though).
But I got plenty of those in for repair, usually because of a combination of dirt clogging and fan failure.
Often when the PSU dies, it kills a lot of other stuff too, making the cost of neglect even more painful.
 