This was a post to the SBS2K Yahoo list in response to a comment about the risks of encrypting all of our domain controllers (which we have been moving towards for a year or two now). It’s been tweaked for this blog post.
***
We’ve been moving to 100% encryption in all of our standalone and cluster settings.
Encrypting a setup does not change _anything_ as far as Disaster Recovery Plans go. Nothing. Period.
The “something can go wrong there” attitude should apply to everything: on-premises storage and services (we’ve been working with a firm that had gigabytes, if not terabytes, of data lost due to the previous MSP’s failures) as well as Cloud-resident data and services.
No stone should be left unturned when it comes to backing up data and Disaster Recovery Planning. None. Nada. Zippo. Zilch.
The new paradigm from Microsoft and others has migrated to “Hybrid” … for the moment. Do we have a backup of the cloud data and services? Is that backup air-gapped?
Google lost over 150K mailboxes a number of years back; we fielded one panicked call from someone who had lost everything, with no way to get it back. What happens then?
Recently, a UK VPS provider had a serious crash and, as it turns out, lost _a lot_ of data. Where are their clients now? Where is their clients’ business after such a catastrophic loss?
Some on-premises versus cloud-based backup experiences:
- Veeam/ShadowProtect On-Premises: Air-gapped (no user access, to avoid *Locker problems), encrypted, off-site rotated, and high-performance recovery = Great.
- Full recovery from the Cloud = Dismal.
- Partial recovery of large files/numerous files/folders from the Cloud = Dismal.
- Garbage In = Garbage Out = Cloud backup gets the botched bits in a *Locker event.
- Image-based backups are especially vulnerable if bit rot, unrecognized sector failures, and the like happen
- The backup needs to be protected from user access (blog post)
- Cloud provider’s DC goes down = What then?
- Cloud provider’s Services hit a wall and failover fails = What then? (This was part of Google’s earlier-mentioned problem, methinks.)
- ***Remember, we’re talking Data Centers on a grand scale where failover testing has been done?!?***
- At Scale:
- Cloud/Mail/Services providers rely on a myriad of systems to provide resilience
- Most Cloud providers rely on those systems to keep things going
- Backups?
- Static, air-gapped backups?
- “Off-Site” backups?
- These do not, IMO, exist at scale
- The BIG question: Does the Cloud service provider have a built-in backup facility?
- Back up the data to a local drive or NAS, either manually or on a schedule (see the sketch after this list)
- Offer a virtual machine backup off their cloud service
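For the “back up the data to a local drive or NAS on a schedule” point, here is a minimal sketch (Python, run from Task Scheduler or cron) of what a pull-and-verify step might look like. The folder paths, and the idea that the cloud service drops an export into a local folder, are assumptions for illustration only, not any particular provider’s API.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical locations: a folder where the cloud service drops its export,
# and a NAS share reachable from this machine. Swap in whatever your provider
# and network actually use.
CLOUD_EXPORT_DIR = Path(r"C:\CloudExports")
NAS_BACKUP_DIR = Path(r"\\NAS01\Backups\CloudExports")


def sha256_of(path: Path) -> str:
    """Hash a file so the copy on the NAS can be checked against the source."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify() -> None:
    """Copy today's export files into a dated folder on the NAS and verify each one."""
    target = NAS_BACKUP_DIR / datetime.now().strftime("%Y-%m-%d")
    target.mkdir(parents=True, exist_ok=True)
    for source in CLOUD_EXPORT_DIR.glob("*"):
        if not source.is_file():
            continue
        destination = target / source.name
        shutil.copy2(source, destination)
        if sha256_of(source) != sha256_of(destination):
            raise RuntimeError(f"Verification failed for {source.name}")
        print(f"Copied and verified: {source.name}")


if __name__ == "__main__":
    copy_and_verify()
```

The verification step matters because of the Garbage In = Garbage Out point above: a copy that was never checked is just hope. Once the data is on the NAS it can be rotated off-site and kept out of users’ reach like any other backup set.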
There is an assumption (and we all know what that means, right?) that seems to be prevalent among top-tier cloud providers: that their resiliency systems will be enough to protect them from the next big bang. But will they be? We already seem to have examples of the “not”.
In conclusion to this rather long-winded post I can say this: It is up to us, our clients’ trusted advisors, to make bl**dy well sure our clients’ data and services are properly protected and that a down-to-earth backup exists for their cloud services/data.
We really don’t enjoy being on the other end of a phone call that starts with “OMG, my data’s gone, the service is offline, and I can’t get anywhere without it!” :(
Oh, and BTW, our SBS 2003/2008/2011 Standard/Premium sites all had 100% Uptime across YEARS of service. :P
We did have one exception in there due to an inability to cool the server closet, as the A/C panel was full. Plus, the building’s HVAC had a bunch of open primary push ports (hot in winter, cold in summer) above the ceiling tiles, which is where the return air is supposed to happen. In the winter the server closet would hit +40°C for long periods of time as the heat would settle into that area. ShadowProtect played a huge role in keeping this firm going, plus technology changes over server refreshes helped (cooler-running processors and our move to SAS drives).
***
Some further thoughts and references in addition to the above forum post.
- May 13, 2016: The Register: Salesforce.com crash caused DATA LOSS
- April 18, 2016: BBC: Web host 123-reg deletes sites in clean-up error
- September 20, 2015: The Register: AWS outage knocks Amazon, Netflix, Tinder and IMDb in MEGA data collapse
- November 2014: Microsoft Azure had a series of outages, some lasting days. Fortunately, no data loss happened that we know of, though VMs were offline
- Bing Search: Azure Outage
- Note results for earlier problems in April 2014
- August 6, 2012: WIRED: How Apple and Amazon Security Flaws Led to My Epic Hacking
- Mat Honan’s horrific story of complete data loss
- Please read this and then secure a password vault and use it to manage _different_ passwords for _ALL_ sites and client admins (see the sketch after this list)!
- We have 2FA enabled on every service that has it available
- We use KeePass for our vault
- February 27, 2011: Engadget: Gmail accidentally resetting accounts, years of correspondence vanish into the cloud?
- As mentioned, we had frontline experience with this Google data loss situation. :(
- Note that the threads around this issue seemingly stopped being updated, with no further status reports from Google or others
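To make the “different passwords for all sites” point concrete, here is a minimal sketch, in Python, of per-entry password generation. KeePass has its own generator built in, so this is purely illustrative; the entry names are made up.

```python
import secrets
import string

# Illustrative entries; in practice each site or client admin account gets its
# own entry in the vault (KeePass in our case), each with its own password.
SITES = ["client1-firewall", "client1-mail-admin", "registrar", "bank-portal"]

ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*()-_=+"


def generate_password(length: int = 24) -> str:
    """Build a password from a cryptographically secure random source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Every entry gets its own independently generated password, so one
    # compromised site never exposes another.
    for site in SITES:
        print(f"{site}: {generate_password()}")
```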
The moral of this story is quite simple. Make sure _all_ data is backed up and air-gapped. Period.
Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
Our Cloud Service