Mixed HDD brands in RAID

I am planning to purchase HDDs for a RAID, using different brands but the same disk sizes and speeds.
I have read that it is a good idea to mix brands, because there is less chance of disks failing at the same time due to manufacturing defects, and it may even reduce noise and vibration due to their different tolerances. But I have also read that differences in firmware might affect performance or, even worse, could cause the disks to fail sooner.
I'm interested in hearing about users' experiences with mixing HDD brands in RAID configurations.
 
Sigh. This is an extremely complicated topic. In the following, I will just use Ananas and Banana as fictitious brand names for disk drive makers, since I have too many friends who work for Seagate, WD, Hitachi and so on, and I don't want to insult any of them.

First aspect of it: Reliability, dependability, durability, availability, all that stuff. You make the argument that having disks whose failures are not correlated with each other will improve things. Good argument. It also demonstrates that you have understood that correlated failures or multiple faults are the death of RAID. Fabulous. BUT: In a small RAID system (a handful of disks), the statistics just aren't good enough to make it relevant.

Think of correlated failures as a manufacturing defect that shows itself at a specific time.

Example 1: You have a 1000-disk RAID array, made up of 500 Ananas and 500 Bananas. The array is configured so it can handle 10 simultaneous faults. Due to a known correlation, you know that on Monday, the 54th of Octuary (a day that only exists in Covid leap years), every Banana disk has a 1% chance of failing (the normal failure probability for a good-quality disk is much lower, so this is highly significant). So you expect 5 Bananas to fail that Monday. It won't be a good day (you will make 5 trips into the cold and windy data center to replace disk drives, and your RAID will be doing massive rebuilds), but you and your data will survive.

Example 2: Now you have a 4-disk RAID array (two Ananas, two Bananas) that can tolerate one fault. On that Monday, there is a ~2% chance that one of the two Bananas will fail (which is survivable, but you will be wetting your pants; hope there is no single sector error on the other 3 disks), and a 0.01% chance that both fail (and you lose your file system, and probably your job). Honestly, this is really not much worse than any other day; 0.01% is not significant.

Example 3: We're back to the 1000-drive array, but you stupidly configured it so it can only handle 2 faults. That Monday, 5 Bananas fail at once, and you get fired. Honestly, you deserved that, because configuring a 1000-disk array to handle only 2 faults was dumb to begin with.
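If you want to check the arithmetic behind these examples, here is a quick back-of-the-envelope sketch (my own made-up helper, assuming each Banana fails independently with 1% probability on that Monday):

```python
# Back-of-the-envelope arithmetic for the Ananas/Banana examples above.
# Assumes every Banana disk fails independently with probability 1% that day.
from math import comb

def prob_at_least(k, n, p):
    """Probability that at least k out of n disks fail, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.01  # per-disk failure probability on the bad Monday

# Example 1: 500 Bananas in an array that tolerates 10 simultaneous faults.
print(500 * p)                    # expected failures that day: 5.0
print(prob_at_least(11, 500, p))  # chance of exceeding 10 faults: a bit over 1%

# Example 2: 2 Bananas in an array that tolerates 1 fault.
print(prob_at_least(1, 2, p))     # at least one fails: ~2%
print(prob_at_least(2, 2, p))     # both fail, data loss: 0.01%

# Example 3: 500 Bananas in an array that tolerates only 2 faults.
print(prob_at_least(3, 500, p))   # chance of data loss: ~87%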

Now let's do an extreme example. Instead of a 1% failure probability, let's really screw up the works. On Thursday you come to work, and find an e-mail saying there is a firmware update for your Ananas disks. Unfortunately, you don't check the e-mail address carefully; it is actually from the Elbonian spy agency, and that firmware update will brick all your Ananas disks (for reasons that can be found in a Dilbert cartoon, the Elbonian software engineers really hate your company). You apply the forged firmware, and lose half your disk drives. Very likely, your RAID is now dead - dee ee dee (that's a joke from some comedy movie) - since no practical high-capacity RAID system can tolerate the loss of half its drives. The fact that the Bananas are alive won't save you. Sucks to be you. Now, if you had bought more different models and manufacturers of disks (Cherry, Date, Elderberry, Fig and Grape), and had only lost 14% of the disk drives when the Elbonians hacked your Ananas, you might be alive ... but there are no longer 7 independent disk drive makers in the world (I think there are 3 left, and I'm not sure how independent Toshiba is).

So: In reality, the problem of correlated failures makes no real difference, since massive correlations are deadly anyway, and small correlations are irrelevant compared to normal failure rates. In reality, it is much more likely that an unexpected correlation gets you.

For example (real-world experience from a previous job): When shipping a disk system with ~3000 drives, we experienced a failure rate of several percent in the first week or so. Several percent of 3000 is a relatively big number (hundreds), so field service and spare-parts logistics were clearly overwhelmed, the system collapsed, and the customer was not perfectly happy. The root cause was a combination of a manufacturing defect, quality control failing twice (at two different companies), and, for budgetary reasons, the system being sold and shipped on December 31st even though the usual supply of (well quality-tested) disk drives was exhausted. All 3000 disks in the system came from one manufacturing run, one shipment, just a few pallet loads of drives, all of which had skipped quality control together (it didn't help that there was a company VP standing there screaming that this thing needs to ship TODAY or else we won't make our revenue numbers for this fiscal year). This was a disaster, and fixing it ended up costing thousands of man-hours and millions of dollars. That's the kind of correlated failure that really burns people. And this was from a respectable and conscientious vendor!

I have other horror stories: like the big shipment of disks that was stored in a tropical climate in a non-air-conditioned warehouse, in a city that's infamous for the sulfur smell in the air, for half a year during the monsoon season; those disks were never reliable afterwards. Or the disks that got deep-frozen when the field service technician in Canada got caught in a snow storm, slid off the highway, and had to be rescued (probably by the Royal Canadian Mounted Police on horses), while his van and the disks spent the night stuck in a snow bank (strangely, they did fine after thawing).

You want to work around this kind of effect? Buy a little bit from each possible vendor and each possible model, from different sources that were not on the same truck together or baked in the same warehouse. Even better: buy two disks every week for half a year, so you spread the risk over manufacturing weeks. In practice, this kind of countermeasure is so completely impractical at the individual level that you should just forget it. Buy good-quality drives that are intended for your usage, and get on with life.

Second aspect: Performance. I will deliberately ignore what you said about vibration and noise, since I know little about it. I don't think that mixing drives will cause any of them to fail faster. But it will have an effect on performance.

Most RAID implementations are mostly synchronous on an IO-by-IO basis. If your RAID is, for example, RAID-6 with an array size of 8 (meaning 8 data disks and two redundancy disks), then the RAID code will usually issue 8 or 10 IOs at the same time (8 for reads, 10 for writes), and wait until they're done. That is: wait until the slowest of the 8 or 10 is done. This is known as the convoy effect ... a convoy moves at the speed of the slowest ship in the convoy. So you should not build a RAID array made of 99 fast and expensive disks and 1 slow and cheap disk ... the $$$ you spent on the fast disks is wasted. Matter of fact, since workloads are variable, you should not even mix disks that have the same average performance. For example, if your Ananas are really good at sequential throughput, and your Bananas are good at random seeks, and both suck at the other workload (making both of them mediocre on average), then at any given workload, your performance will suck.

Nice theory. In practice, if you use disks of similar generations (same number of platters, same spindle RPM, similar sequential MB/s and similar seek time), then the difference between brands will likely be 10-20%, which is about the same as the difference between individual drives, and as the change in performance of individual drives over weeks or months. So just don't worry about it; your RAID system's performance is unpredictable by 10 or 20% anyway.

Side remark: There are RAID systems that will measure the performance of individual disks, and either steer more IOs to faster disks, or put less data on slower disks, or ask the user to replace disks that are so slow that they cause problems, or deliberately attempt extra IOs ahead of time if they can foresee that some disks will be slow. With these tricks, you can squeeze good performance out of arrays of quite dissimilar drives, even over the long term, as disks age and change. But those technologies are not available in free software or consumer RAID systems.
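If you want to see the convoy effect in numbers, here is a toy simulation (entirely made-up latencies, not measurements from any real array): it models a RAID-6 full-stripe write as 10 parallel IOs, one per disk, where the stripe completes only when the slowest disk completes.

```python
# Toy model of the convoy effect: a stripe IO finishes only when the
# slowest of its member disks finishes. All latency numbers are invented.
import random

random.seed(1)

def stripe_latency_ms(mean_latencies_ms):
    """One stripe IO: issue one IO per disk, wait for the slowest."""
    # Model each disk's service time as exponentially distributed around its mean.
    return max(random.expovariate(1.0 / m) for m in mean_latencies_ms)

def average_stripe_latency(mean_latencies_ms, trials=100_000):
    return sum(stripe_latency_ms(mean_latencies_ms) for _ in range(trials)) / trials

matched_array = [5.0] * 10             # ten matched disks, ~5 ms average each
mixed_array   = [5.0] * 9 + [20.0]     # same stripe, but one disk 4x slower

print(f"matched array: {average_stripe_latency(matched_array):.1f} ms per stripe")
print(f"mixed array:   {average_stripe_latency(mixed_array):.1f} ms per stripe")
# The single slow disk drags every stripe toward its own latency, so the money
# spent on the nine fast disks buys much less than you would hope.
```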

My advice: Do whatever is convenient. And keep good backups, and configure your RAID array for handling one extra fault (one more than you expect statistically, because Murphy). Buy good-quality disk drives that are appropriate for your workload, following the advice from the disk vendor (don't use consumer drives in a NAS, don't use NAS drives in a supercomputer, don't use supercomputer drives in a consumer desktop, and so on). And keep good backups. Did I mention that backups need to be part of your durability strategy?
 
You also ought to use drives of different ages, to prevent a bad batch of new drives from failing simultaneously and causing a catastrophic RAID failure one night.
 
With all of the above in mind, mixing brands can (and most likely will) lead to very strange (low) performance characteristics in your array. The majority of arrays use the same brand and model for this very reason, and possibly also to keep the HBA happy, unless you like possible intermittent errors. You also have to consider the number of drives and what type of RAID will be most suitable for your application; RAID-1 (mirror) might be a better idea than RAID-5/6 (RAID-Z/Z2), etc. What you should keep in mind regardless is that RAID != backup.
 
You also ought to use drives of different ages, to prevent a bad batch of new drives from failing simultaneously and causing a catastrophic RAID failure one night.
This. I had an array of 4 identical drives all bought at the same time. Two of them failed within weeks of each other after the warranty period had expired. Fortunately I was on the ball and had replaced the first one before the second one failed. The other two are still working fine, which is a head-scratcher.
 
I have used different vendors with no issue in software RAID; however, I have always tried to keep the HDD specs as close as possible. To keep costs down, I usually just put the OS on hardware RAID.
 
The problem I've had with hardware RAID is that the firmware is often buggy. I prefer software RAID unless I can afford an expensive RAID card. The preferred choice on FreeBSD is root on ZFS: not only does it give you software RAID, it also gives you snapshots and boot environments, as Emrion says in your backup thread. Another option is to boot from a geom mirror.
 
When I assembled my 24 bay ZFS rig I made sure every drive was identical, even down to the firmware version.
I don't know if that was necessary but I think it is a good idea to use identical drives.
I too really like gmirror. I use it on several machines including mirrored 960GB enterprise NVMe drives.
I find myself using 3 drives, stashing one as a backup, and rotating them to keep the wear even.
 
This. I had an array of 4 identical drives all bought at the same time. Two of them failed within weeks of each other after the warranty period had expired. Fortunately I was on the ball and had replaced the first one before the second one failed. The other two are still working fine, which is a head-scratcher.
A few years ago all of the identical RAID drives in one of my servers failed within hours of each other. (They came from a well known manufacturer who left it as an "exercise for the student" to diagnose that manufacturer's quality control problems.)

Fortunately, the server only held cloud backups, so it wasn't as catastrophic as it might have otherwise been. Nonetheless the rebuild process was painful. It takes a long time to rebuild > 1TB backups over the Inet.

Anyhow, the advice to mix up hard drives doesn't originate with me. It comes from admins much older and wiser than me.
 
I have used different vendors with no issue in software RAID; however, I have always tried to keep the HDD specs as close as possible. To keep costs down, I usually just put the OS on hardware RAID.
My business has deployed LSI HBAs for decades. We recently deployed our first gmirrored (sans HBA) server into a clinic. It's a light duty server. We'll see what happens.
 
... even down to the firmware version.
What most people don't appreciate (and I didn't either, until I saw how the sausage was made) is that disk drives are very complex beasts, with literally millions of lines of code in them. And that code does actually have bugs. The reason disk and HBA vendors distribute firmware is that they fix bugs. And they fix bugs because they want their customers to have a good experience, which also implies fewer customer support problems.

For this reason, I highly recommend that everyone upgrade the firmware of their disk drives whenever possible. It really helps fix bugs. Unfortunately, for most users it's quite difficult to do. Upgrade tools tend to be vendor-specific (Seagate doesn't use the same one as Hitachi or Toshiba). They are not supported on many OSes (sometimes not even Linux or Windows, only DOS, which means you need to create a special DOS boot disk). Writing your own (using SCSI write buffer commands) is possible, but it is not only very tedious, it is also risky ... the people who write and test that software tend to brick a few dozen drives before they get it right (I know, because I used to put some of those drives and HBAs into my car and bring them to places like LSI and Seagate to have them un-bricked for us, when we were using unreleased prototype hardware).
 
On my desktop workstation I have deliberately created a zpool mirror of totally different hard drives (even the sizes are different, but the ZFS partitions are the same size). Everything works fine. To reduce the actual load on the drives I have added relatively small (120 GB) SSDs for a ZIL mirror, L2ARC cache, swap and /tmp. I am satisfied with the result - the HDDs gargle very rarely and everything works mostly at SSD speed while providing HDD mirror integrity and size.
 
Matter of fact, since workloads are variable, you should not even mix disks that have the same average performance. For example, if your Ananas are really good at sequential throughput, and your Bananas are good at random seeks, and both suck at the other workload (making both of them mediocre on average), then at any given workload, your performance will suck.
I like this argument. Since the hard drives in a RAID work together, mixing disks will make them cancel out their respective performance advantages.
 
Another factor is the drives' block sizes. Drives tend to hide their true physical block size, so with different drives you could end up with different block sizes. That would affect your performance unless you segregate the different drives into separate zpools/vdevs.
Imagine mixing 512 and 4K block size drives.
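To illustrate why that hurts (a made-up example with a hypothetical helper, not taken from any real drive): a write that is perfectly aligned for 512-byte sectors can straddle physical sectors on a 4K drive and force a read-modify-write.

```python
# Rough illustration of read-modify-write (RMW) amplification when a write
# that is aligned/sized for 512-byte sectors lands on a 4K-sector drive.
# The helper and the numbers are illustrative, not a real benchmark.

def physical_ops_for_write(offset_bytes, length_bytes, physical_sector):
    """How many physical sectors the drive must touch, and whether RMW is needed."""
    first = offset_bytes // physical_sector
    last = (offset_bytes + length_bytes - 1) // physical_sector
    sectors_touched = last - first + 1
    misaligned = (offset_bytes % physical_sector != 0) or (length_bytes % physical_sector != 0)
    return sectors_touched, misaligned

for physical in (512, 4096):
    # A 4 KiB write that starts 512 bytes into the device (e.g. a badly aligned partition).
    sectors, rmw = physical_ops_for_write(offset_bytes=512, length_bytes=4096,
                                          physical_sector=physical)
    print(f"{physical}-byte sectors: touches {sectors} sector(s), "
          f"{'read-modify-write needed' if rmw else 'simple write'}")
# On the 512-byte drive this is 8 clean sector writes; on the 4K drive the same
# IO straddles 2 physical sectors and forces the drive to read and rewrite them.
```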
 