Disaster Recovery

You may have heard about the recent flooding in Germany. It gives us a few things to think about.

We all know that we should mirror disks, because one can fail at any time. We also know that we should do backups, to a device that is not permanently connected to that same machine, because software may fail, controllers may fail, power supply may fail, damaging all the mirrors at once.

But this now looks a bit different, and it no longer seems so unlikely to happen. Let's face it:
  • all local-site machines and data are destroyed, including those in nearby buildings.
  • smartphones, laptops, etc. are likely to be destroyed (they are not crafted for submarine operation).
  • USB sticks should survive, but may no longer be locatable.
  • SSH access keys are gone.
  • web passwords, stored in a software vault, in Firefox, on the smartphone, ..., are gone.
So then, what to do next?
I have a cloud machine, but the access key to that will be gone. I could access it from the panel, but the password to the panel will be gone. I can reset that password, but that needs the originally configured mail address, and that mail server will be gone, or likewise inaccessible.

Setting up a new machine from nothing is also not so very easy. I recently found that the smaller images (bootonly, memstick, disc1) do not even contain the compiler! They do not contain a complete OS and likely depend on the internet to complete an installation. Only the dvd image is complete enough to compile a new system from sources (I did check that, but didn't check whether it also contains the sources - I think it should), but that one is ~4 GB, probably too big to pull down onto some smartphone via the supermarket WLAN.
So, putting up a new system usually relies on having internet access, while internet access usually relies on having some system up and running. And BTW, the access key for the internet provider may have been configured on the system and long since forgotten - or it may be in some backup - but then, how do you unpack that backup without a running machine?
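
One modest preparation helps here: keep the full dvd image (and its checksum file) on some offline medium that lives offsite, and verify it while you still have a working system. A minimal sketch, assuming Python is at hand; the file names are only examples of what you might have downloaded:

```python
# Sketch: verify a previously downloaded FreeBSD install image against the
# checksum file published next to it, before parking both on offline media.
# File names below are examples -- use whatever release you actually fetched.
import hashlib

IMAGE = "FreeBSD-13.0-RELEASE-amd64-dvd1.iso"
CHECKSUMS = "CHECKSUM.SHA512-FreeBSD-13.0-RELEASE-amd64"

def sha512_of(path, bufsize=1 << 20):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

digest = sha512_of(IMAGE)
# The checksum file contains lines like: SHA512 (FreeBSD-...-dvd1.iso) = <hex>
with open(CHECKSUMS) as f:
    ok = any(IMAGE in line and digest in line for line in f)
print("image OK, safe to copy to the offsite USB disk" if ok else "CHECKSUM MISMATCH")
```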

Think about it...
 
Yeah, correct. But you can log in with just a password and easily replace the key if it gets lost.
I did that when I got my new PC. Instead of restoring the SSH key from my backups, I created a new one - it was faster.
 
We also know that we should do backups, to a device that is not permanently connected to that same machine, because software may fail, controllers may fail, power supply may fail, damaging all the mirrors at once.
Off-site backups. The building itself could burn down for example. Or, in this case, get completely flooded or wiped away.
 
I know people that take a daily bike ride just to transfer last night's backup tapes to a different location. That's about as low-tech as you can go. A high-tech way would be to have a dark fiber connected to a different location in another city and use fancy filesystem replication to keep an exact copy of your storage. But there are many gradations between those two; it all depends on your budget.
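
One of those middle gradations, sketched in Python (the paths are placeholders): a nightly job that writes a dated, compressed archive plus a checksum to whatever offsite medium you have - the disk somebody bikes away, or a remote mount.

```python
# Sketch: nightly dated archive plus checksum, written to an offsite medium.
# SOURCE and TARGET are placeholders -- adapt them to your own setup.
import datetime, hashlib, pathlib, tarfile

SOURCE = pathlib.Path("/home/important")     # the data you cannot afford to lose
TARGET = pathlib.Path("/mnt/offsite")        # portable disk, NFS mount, ...
stamp = datetime.date.today().isoformat()

archive = TARGET / f"backup-{stamp}.tar.gz"
with tarfile.open(str(archive), "w:gz") as tar:
    tar.add(str(SOURCE), arcname=SOURCE.name)

# Store a checksum next to it so the copy can be verified later on another machine.
digest = hashlib.sha256(archive.read_bytes()).hexdigest()   # fine for modest sizes
(TARGET / f"backup-{stamp}.sha256").write_text(f"{digest}  {archive.name}\n")
```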

I wouldn't rely on a cloud provider, not a single provider at least. That recent debacle in France proved that. (https://www.techtimes.com/articles/...ire-websites-companies-services-disrupted.htm)
 
I know people that take a daily bike ride just to transfer last night's backup tapes to a different location. That's about as low-tech as you can go. A high-tech way would be to have a dark fiber connected to a different location in another city and use fancy filesystem replication to keep an exact copy of your storage. But there are many gradations between those two; it all depends on your budget.
Recent smartphones are surprisingly resilient. A couple of years ago, I fell off a pier into the water. I was fully clothed, and my phone was in the pocket of my shorts. I was able to swim to a place to climb out. And to my surprise, my phone (Galaxy A20) was still fine. FWIW, so was I - this happened in Hawaii (right next to the Captain Cook monument in Kealakekua Bay), and I was totally cooked by the sun. Everyone in my family is still laughing about it, and teasing me.
 
Off-site backups. The building itself could burn down for example. Or, in this case, get completely flooded or wiped away.
Exactly my point. But also: think about what exactly you will do after disaster strikes (have a contingency plan), and verify that it will still work out in such a case.
In my case, I figured that I cannot access the cloud when the local site and all keys+passwords are gone, because my mail server runs at the local site. So I have now configured an employee sub-account with an old webmail address (which I rarely use anymore, but whose password I will remember). Such things need to be planned and done before the event.
I wouldn't rely on a cloud provider, not a single provider at least. That recent debacle in France proved that. (https://www.techtimes.com/articles/...ire-websites-companies-services-disrupted.htm)
Oh, that's cool. Didn't notice that, and glad I'm on scaleway. ;)

It's actually not so big of an issue if you do things correctly: you should cover the cloud with the local site, and the local site with the cloud; just as with two mirrored disks, it then becomes highly unlikely for a disaster to hit both at the same time.
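
To put rough numbers on that (the yearly failure probabilities below are invented purely for the arithmetic, and independence is assumed):

```python
# Illustrative arithmetic only: two independent sites, each with its own
# (made-up) yearly probability of total loss.
p_local = 0.02    # hypothetical: losing the local site in a given year
p_cloud = 0.01    # hypothetical: losing the cloud machine in a given year

p_both = p_local * p_cloud          # both lost in the same year, assuming independence
print(f"local alone: {p_local:.2%}, cloud alone: {p_cloud:.2%}")
print(f"both at once: {p_both:.4%}")  # orders of magnitude smaller
```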

Certainly, for any professional site it is mandatory to have proper DR schemes. What I am looking at are rather the private and semi-professional installations on which people tend to depend to an ever-greater extent. Here a disaster may not be quantifiable in figures (and sadly also not easily tax-deductible), but the personal loss and trouble can be rather ugly. There is probably no need for a daily backup, nor would it need to be fast or big - but doing nothing at all may end in an unpleasant way.

Running FreeBSD, we already have an advantage, because we know where our data is and have proper tools to manage it. We only need to use them. And it is not expensive at all - a suitable offsite machine for such a purpose costs less than 10 €/month, so that is well within scope for private use.
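
For example, the data side can be as simple as something like this (only a sketch, assuming ZFS and an offsite box reachable over ssh; dataset, target and host names are placeholders, and after the first full send you would switch to incremental streams with zfs send -i):

```python
# Sketch: snapshot a dataset and push it to an offsite machine over ssh.
# All names are placeholders; error handling is minimal on purpose.
import datetime, subprocess

dataset = "tank/home"                                  # local dataset to protect
snap = f"{dataset}@offsite-{datetime.date.today()}"    # e.g. tank/home@offsite-2021-07-20
remote = "backup@offsite.example.org"                  # the cheap offsite machine

subprocess.run(["zfs", "snapshot", snap], check=True)

# zfs send | ssh remote zfs receive -u (the received dataset stays unmounted)
send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
subprocess.run(["ssh", remote, "zfs", "receive", "-u", "backup/home"],
               stdin=send.stdout, check=True)
send.stdout.close()
send.wait()
```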
 
The off-site backup needs to be far enough away that the probability of correlated failure is low enough. Depending on the value of the data to you, you get to pick the probability. For me, I want nearly zero, because my data includes tens of thousands of documents that have been scanned.

The correlation question depends on location. For example, people who live in the valley of a large river (Rhine, Mississippi, ...) need to look at 100-year flood plains. So if you live in Koeln, having the second copy in Duesseldorf makes little sense, unless the second copy is on an upper floor of a flood-proof building (there are lots of skyscrapers in Duesseldorf). But having the second copy in Hannover or Berchtesgaden makes sense, since the flood disaster probabilities there are uncorrelated. For me (in California), flooding is a non-issue (we barely have rain), but fire is. Our house (and the primary copy) is in a highly fire-dangerous area, sad fact of life. Therefore the two backup copies are kept about half an hour by car away, in office buildings that are exceedingly fire-safe. To get both backup copies, we would need a fire that destroys fundamentally all of Silicon Valley, which is unfathomable.

Putting the backup copy on a good cloud provider usually makes sense, as long as you pick a sensible location within the cloud provider. For example, a customer in Europe could pick Amazon/Google/Microsoft storage in the US or in Singapore for their backup location. A customer in the US should pick a very different location within the US (the other coast). If you go with a cloud storage provider, most offer storage that is replicated in multiple regions, for a relatively small increase in cost.
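
If the backup target is object storage, choosing the distant region is a one-liner in most SDKs. A sketch using boto3 merely as an example (bucket, region and file names are placeholders; the archive is assumed to be encrypted locally before upload):

```python
# Sketch: upload a locally encrypted archive to object storage in a
# deliberately distant region. All names are placeholders.
import boto3

REGION = "ap-southeast-1"                # e.g. Singapore for a European customer
BUCKET = "example-offsite-backups"       # placeholder bucket, created beforehand
ARCHIVE = "home-2021-07-20.tar.gz.enc"   # encrypted before it ever leaves the house

s3 = boto3.client("s3", region_name=REGION)
s3.upload_file(ARCHIVE, BUCKET, ARCHIVE)
print(f"uploaded {ARCHIVE} to s3://{BUCKET} in {REGION}")
```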

And clearly your backup has to include whatever cryptographic secrets and passwords you need to access it. The way I do it: My backup is not cloud based, but a physical portable disk drive (actually two of them). The drives are encrypted. The encryption password is stored in an encrypted file, which is replicated at home, in each of the backups, and on two cloud providers (the file is tiny, a few dozen KB). Decrypting the password file does not require a password that needs to be remembered, but requires knowing the answer to a half dozen questions that any member of our immediate family would know, but few other people in the world would. The questions and decryption instructions are stored in cleartext (actually, in paper form!) with the backup disk.
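
Not the exact scheme described above, but a sketch of the general idea: derive the key from the concatenated, normalized answers with a KDF, and use it to encrypt the small file holding the real passphrase (needs the third-party cryptography package; the salt and answers here are made up):

```python
# Sketch of a "family questions" key: NOT the poster's actual scheme.
# The salt and the questions live in cleartext on the paper sheet with the backup.
import base64, hashlib
from cryptography.fernet import Fernet

def key_from_answers(answers, salt):
    # Normalize so "Fluffy ", "fluffy", "FLUFFY" all derive the same key.
    material = "|".join(a.strip().lower() for a in answers).encode()
    raw = hashlib.scrypt(material, salt=salt, n=2**14, r=8, p=1, dklen=32)
    return base64.urlsafe_b64encode(raw)

salt = b"printed-on-the-paper-sheet"
answers = ["fluffy", "lake garda", "1987"]        # made-up example answers

f = Fernet(key_from_answers(answers, salt))
token = f.encrypt(b"the real backup-disk passphrase goes here")   # replicate this tiny file widely
print(f.decrypt(token).decode())
```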
 
And be warned: you may have typed your password many thousands of times over many years ... it can still happen that you forget it, e.g. after an accident or another traumatic experience.
 
I knew a person who worked for a big insurance company.
They would do audits of large companies' disaster recovery plans.
The insurance company would mandate the physical distance between the backups and require two backup sites.
They could lose any two sites and still maintain continuity of operations.
That was 20 years ago. Before the cloud. I could see it all in the cloud now. Why own when you can rent?
 
The insurance company would mandate the physical distance between the backups and require two backup sites.
At my previous job, we had a customer who had a complete second (backup) data center, with a complete second copy of the data, kept up-to-date synchronously. The second data center was in the OTHER tower of the World Trade Center.

Even until recently, a lot of computing hardware was actually installed on Wall Street. I don't mean metaphorically in the financial industry, but physically: in buildings on the few blocks of the New York street that starts at Broadway. One of the funny things about that is physical space: because lots of the buildings on Wall Street have low ceilings (they are old) and cannot be modified (they are historic and protected), computer manufacturers had to design special racks that don't require overhead air handling and wiring and are shorter. Today, with the cloud, a lot of those concerns have become irrelevant.
 
Data centre separation is always a trade-off.

One must evaluate the risk appetite against the radius of events like a flood, fire, civil unrest, war, or even atomic blast.

Though the speed of light does come into play. Fibre has speed limitations over long distances. So too much separation has implications for the design (real-time replication may not be possible), not to mention the cost of running (diverse-path) fibre long distances.

Also, cloud might be OK if all you do is Windows and Linux on amd64. Not everyone has infrastructure as simple as that. Throw in the odd IBM mainframe, IBM P frame, or STK silo, and things get more complicated.

Also, I did see Amazon kick a customer off their cloud infrastructure this week. That should have induced panic updates to a lot of risk analyses...
 
No amount of disaster recovery infrastructure can overcome a lack of planning and of actually having something in place. And it's best to start simple, but reliable.
 
Data centre separation is always a trade-off.

One must evaluate the risk appetite against the radius of events like a flood, fire, civil unrest, war, or even atomic blast.
In finance, health and insurance, you also have to add legal risk. For example: if a bank is chartered in New York and has its headquarters and operations on Wall Street, what happens if it rents space in a warehouse in New Jersey for a backup data center? Is it suddenly subject to New Jersey banking regulations, because it is "operating in the state"? Those kinds of questions are why banks hire lawyers.

Though the speed of light does come into play. Fibre has speed limitations over long distances. So too much separation has implications for the design (real-time replication may not be possible), not to mention the cost of running (diverse-path) fibre long distances.
Absolutely. If you go cross-continent, the speed of light is the same order of magnitude as disk latency. If you are trying to keep up with an SSD, you can't go further than a few miles. That's why synchronous replication is only possible within a metropolitan area.
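
The back-of-the-envelope numbers behind that (all values rough; light in fiber travels about 200 km per millisecond):

```python
# Rough latency arithmetic: light in fiber covers roughly 200 km per millisecond.
KM_PER_MS = 200

def round_trip_ms(distance_km):
    return 2 * distance_km / KM_PER_MS

print(f"cross-continent, 4000 km: {round_trip_ms(4000):.1f} ms RTT")  # ~40 ms, spinning-disk territory
print(f"metro area, 50 km:        {round_trip_ms(50):.2f} ms RTT")    # ~0.5 ms
print(f"a few miles, 5 km:        {round_trip_ms(5):.3f} ms RTT")     # ~0.05 ms, closer to SSD latency
```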

Throw in the odd IBM mainframe, IBM P frame, or STK silo, and things get more complicated.
There are cloud providers where you can rent P series, and IBM certainly will rent a cloud mainframe. At some point, if you have a data center full of these things, owning them becomes cheaper than renting though.
 
The insurance company also had geographic restrictions which were not wholly in writing.
For example, they audited a company headquartered in Tampa, Florida. Its backup data centers were in Phoenix and Atlanta.
But because Tampa is in a prime hurricane zone, the auditors rejected the Atlanta data center because of the hurricane risk: Atlanta could be hit by the same storm. So even though it met the distance requirement, they still made them change their plans.
 
Even until recently, a lot of computing hardware was actually installed on Wall Street. I don't mean metaphorically in the financial industry, but physically: in buildings on the few blocks of the New York street that starts at Broadway. One of the funny things about that is physical space: because lots of the buildings on Wall Street have low ceilings (they are old) and cannot be modified (they are historic and protected), computer manufacturers had to design special racks that don't require overhead air handling and wiring and are shorter. Today, with the cloud, a lot of those concerns have become irrelevant.
And for many of those there's a good reason: physics. Much of this hardware is used for high-speed trading, so the longer your connection to Wall Street, the bigger your disadvantage.
 
And everything you have in your disaster plans should be tested before a real disaster happens...
Old joke: It's not a backup until you have actually read it.

True story: A person I know (who is not a computer person) works in an office. They had a big Windows server (this was ~20 years ago, when a single server was sufficient for a company with ~50 people). The IT person had set it up with RAID, so they would tolerate disk failures. Unfortunately, they used RAID-0, which is actually not redundant. They also made nightly backups to tape, and stored the tapes, neatly labelled, in a cardboard box. One day, a disk fails, and the server goes down. So first, RAID didn't help, duh. The IT guy comes running, and it turns out he had never tried to actually read or look at a backup tape. They were all blank! He had not configured the backup software to actually back anything up, but every night it created one tape containing nothing, which he labelled and stored carefully.

The next few weeks were very uncomfortable, but eventually (with the help of employees finding spare copies of files on their laptops or USB sticks, and a very expensive disk recovery company finding bytes on the damaged disk), they got nearly everything back.
 
Old joke: It's not a backup until you have actually read it.

True story: A person I know (who is not a computer person) works in an office. They had a big Windows server (this was ~20 years ago, when a single server was sufficient for a company with ~50 people). The IT person had set it up with RAID, so they would tolerate disk failures. Unfortunately, they used RAID-0, which is actually not redundant. They also made nightly backups to tape, and stored the tapes, neatly labelled, in a cardboard box. One day, a disk fails, and the server goes down. So first, RAID didn't help, duh. The IT guy comes running, and it turns out he had never tried to actually read or look at a backup tape. They were all blank! He had not configured the backup software to actually back anything up, but every night it created one tape containing nothing, which he labelled and stored carefully.

The next few weeks were very uncomfortable, but eventually (with the help of employees finding spare copies of files on their laptops or USB sticks, and a very expensive disk recovery company finding bytes on the damaged disk), they got nearly everything back.
Classic case of not thinking things all the way through. All too often, even smart ideas like backups are hastily thrown together just to be able to say to the boss, 'Look, we do have backups!'. And then everybody just forgets about it until disaster happens. But in thinking stuff through, and lining it up, you have to ask: at what point have you done enough?
 
Old joke: It's not a backup until you have actually read it.

True story: A person I know (who is not a computer person) works in an office. They had a big Windows server (this was ~20 years ago, when a single server was sufficient for a company with ~50 people). The IT person had set it up with RAID, so they would tolerate disk failures. Unfortunately, they used RAID-0, which is actually not redundant. They also made nightly backups to tape, and stored the tapes, neatly labelled, in a cardboard box. One day, a disk fails, and the server goes down. So first, RAID didn't help, duh. The IT guy comes running, and it turns out he had never tried to actually read or look at a backup tape. They were all blank! He had not configured the backup software to actually back anything up, but every night it created one tape containing nothing, which he labelled and stored carefully.

The next few weeks were very uncomfortable, but eventually (with the help of employees finding spare copies of files on their laptops or USB sticks, and a very expensive disk recovery company finding bytes on the damaged disk), they got nearly everything back.
The only way to thoroughly "read-test" a full set of backup media is to clone a complete new copy of your software system onto a completely separate and compatible hardware system. That's the "acid test" which should be rehearsed on a regular basis. This must include:
  1. OS installation media (for instance, a reliable FreeBSD-13.0-RELEASE USB installer),
  2. A complete set of 2nd-party software installation archives matching the ones currently in use on the original system (in my case, this includes Apache, PHP, PostgreSQL, etc., etc., and etc.),
  3. Complete instructions and/or shell scripts ready to perform, in a reasonable and predictable amount of time, all the post-installation software configuration which needs to be done on all the software components mentioned under item 2 above,
  4. Complete archives, plus installation instructions, and/or installation and configuration scripts, ready to perform, in a reasonable and predictable amount of time, all of (my/your/one's) own 3rd party software installation and configuration requirements,
  5. A complete set of database backups, for example, SQL dumps, modified text files, etc., etc., and etc., and, finally,
  6. A reliable and well-tested post-installation testing regimen, with instructions on how it's to be done in a reasonable and predictable amount of time.
This is what we should have before we ever attempt to upgrade, modify, maintain, or use any component, on any important or critical system.

If we have it, we don't need to fear any upgrade, regular daily usage, weekly processing, monthly processing, end-of-year processing, or general maintenance of the system. If we don't have it, we do.

The backup hardware doesn't need to be as powerful or expensive as the regular working hardware, but, as inexpensive as hardware is relative to everything else nowadays, there's no good reason not to have it - not if the system is worth anything at all to you, and/or to all the other people who depend upon that system.
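
As a small illustration of item 6 above: the rehearsal can end with a check that compares every restored file against a checksum manifest written at backup time (a sketch; the manifest format and paths are placeholders):

```python
# Sketch: verify a rehearsal restore against a manifest made at backup time.
# MANIFEST maps relative paths to SHA-256 digests; both names are placeholders.
import hashlib, json, pathlib

MANIFEST = "backup-manifest.json"                 # {"relative/path": "sha256 hex", ...}
RESTORE_ROOT = pathlib.Path("/mnt/restore-test")  # where the rehearsal restore landed

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open(MANIFEST) as f:
    manifest = json.load(f)

bad = [rel for rel, digest in manifest.items()
       if not (RESTORE_ROOT / rel).is_file() or sha256_of(RESTORE_ROOT / rel) != digest]

print("restore verified: all files match" if not bad
      else f"{len(bad)} files missing or corrupt, e.g. {bad[:5]}")
```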
 