MPECS Inc. Blog: July 2016

Tuesday, 26 July 2016

Some Disaster Recovery Planning On-Premises, Hybrid, and Cloud Thoughts

This was a post to the SBS2K Yahoo list in response to a comment about the risks of encrypting all of our domain controllers (which we have been moving towards for a year or two now). It’s been tweaked for this blog post.

***

We’ve been moving to 100% encryption in all of our standalone and cluster settings.

Encrypting a setup does not change _anything_ as far as Disaster Recovery Plans go. Nothing. Period.

The “something can go wrong there” attitude should apply to everything from on-premises storage (we’ve been working with a firm that had Gigabytes/Terabytes of data lost due to the previous MSP’s failures) and services to Cloud resident data and services.

No stone should be left unturned when it comes to backing up data and Disaster Recovery Planning. None. Nada. Zippo. Zilch.

The new paradigm from Microsoft and others has migrated to “Hybrid” … for the moment. Do we have a backup of the cloud data and services? Is that backup air-gapped?

Google lost over 150K mailboxes a number of years back. We worked with one panicked caller who lost everything and never got it back. What happens then?

Recently, a UK VPS provider had a serious crash and, as it turns out, lost _a lot_ of data. Where are their clients now? Where are their clients’ businesses after such a catastrophic loss?

Some on-premises versus cloud based backup experiences:

Veeam/ShadowProtect On-Premises: Air-gapped (no user access to avoid *Locker problems), encrypted, off-site rotated, and high performance recovery = Great.
Full recovery from the Cloud = Dismal.
Partial recovery of large files/numerous files/folders from the Cloud = Dismal.
Garbage In = Garbage Out = Cloud backup gets the botched bits in a *Locker event.
- Image based backups are especially vulnerable if bit rot, unrecognized sector failures, and such happen
- The backup needs to be protected from user access (blog post)
Cloud provider’s DC goes down = What then?
Cloud provider’s Services hit a wall and failover fails = What then (this was a part of Google’s earlier mentioned problem, methinks)?
- ***Remember, we’re talking Data Centers on a grand scale where failover testing has been done?!?***
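One way to catch the bit rot problem mentioned above before a restore is needed is to keep a hash manifest of the backup files and re-check it on a schedule. This is only a rough sketch: the paths and the two function names are made up for illustration, and any file the backup product legitimately rewrites (consolidated or merged images, for example) will show up as a mismatch and needs a fresh manifest.

```powershell
# Hypothetical locations: adjust to the real backup destination.
# The manifest is kept outside the backup tree so it is not hashed itself.
$BackupPath   = 'D:\Backups'
$ManifestPath = 'C:\BackupManifests\manifest.csv'

function New-BackupManifest {
    param([string]$Path, [string]$Manifest)
    # Record a SHA256 hash for every file under the backup folder.
    Get-ChildItem -Path $Path -File -Recurse |
        Get-FileHash -Algorithm SHA256 |
        Select-Object Path, Hash |
        Export-Csv -Path $Manifest -NoTypeInformation
}

function Test-BackupManifest {
    param([string]$Manifest)
    # Re-hash each recorded file and warn on anything missing or changed.
    foreach ($entry in Import-Csv -Path $Manifest) {
        if (-not (Test-Path -LiteralPath $entry.Path)) {
            Write-Warning "Missing: $($entry.Path)"
            continue
        }
        $current = (Get-FileHash -LiteralPath $entry.Path -Algorithm SHA256).Hash
        if ($current -ne $entry.Hash) {
            Write-Warning "Hash mismatch: $($entry.Path)"
        }
    }
}

# Example: build the manifest after a backup run, verify it later.
# New-BackupManifest -Path $BackupPath -Manifest $ManifestPath
# Test-BackupManifest -Manifest $ManifestPath
```

A matching hash only says the file has not changed since the manifest was built. It is not a substitute for a test restore.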
At Scale:

Cloud/Mail/Services providers rely on a myriad of systems to provide resilience
- Most Cloud providers rely on those systems to keep things going
Backups?

Static, air-gapped backups?
“Off-Site” backups?

These do not, IMO, exist at scale

The BIG question: Does the Cloud service provider have a built-in backup facility?

Back up the data to local drive or NAS either manually or via schedule
Offer a virtual machine backup off their cloud service

There is an assumption (and we all know what that means, right?) that seems to be prevalent among top-tier cloud providers: that their resiliency systems will be enough to protect them from that next big bang. But will they? We seem to already have examples of the “not”.

In conclusion to this rather long-winded post I can say this: It is up to us, our clients’ trusted advisors, to make bl**dy well sure our clients’ data and services are properly protected and that a down-to-earth backup exists of their cloud services/data.

We really don’t enjoy being on the other end of a phone call “OMG, my data’s gone, the service is offline, and I can’t get anywhere without it!” :(

Oh, and BTW, our SBS 2003/2008/2011 Standard/Premium sites all had 100% Uptime across YEARS of service. :P

We did have one exception in there due to an inability to cool the server closet, as the A/C panel was full. On top of that, the building’s HVAC had a bunch of open primary push ports (hot in winter, cold in summer) above the ceiling tiles, which is where the return air is supposed to happen. In the winter the server closet would hit +40C for long periods of time as the heat settled into that area. ShadowProtect played a huge role in keeping this firm going, and technology changes over server refreshes helped too (cooler-running processors and our move to SAS drives).

***

Some further thoughts and references in addition to the above forum post.

May 13, 2016: The Register: Salesforce.com crash caused DATA LOSS
April 18, 2016: BBC: Web host 123-reg deletes sites in clean-up error

The definition of irony:

September 20, 2015: The Register: AWS outage knocks Amazon, Netflix, Tinder and IMDb in MEGA data collapse
November 2014: Microsoft Azure had a series of outages some lasting days. Fortunately, no data loss happened that we know of though VMs were offline

Bing Search: Azure Outage
Note results for earlier problems in April 2014

August 6, 2012: WIRED: HOW APPLE AND AMAZON SECURITY FLAWS LED TO MY EPIC HACKING

Mat Honan’s horrific story of complete data loss
Please read this and then secure a password vault and use it to manage _different_ passwords for _ALL_ sites and client admins!
We have 2FA enabled on every service that has it available
We use KeePass for our vault

February 27, 2011: Engadget: Gmail Accidentally resetting accounts, years of correspondence vanish into the cloud?

As mentioned, we had frontline experience with this Google data loss situation. :(
Note that threads around this issue seemingly stopped being updated with no further status reports by Google or others

The moral of this story is quite simple. Make sure _all_ data is backed up and air-gapped. Period.

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
Our Cloud Service

Thursday, 14 July 2016

Hyper-V and Cluster Important: A Proper Time Setup

We’ve been deploying DCs into virtual settings since 2008 RTM. There was not a lot of information on virtualizing anything back then. :S

In a domain setting, having time properly synchronized is critical. In the physical world the OS has the onboard CMOS clock to keep itself in check along with the occasional poll of ntp.org servers (we use specific sets for Canada, US, EU/UK, and other areas).

In a virtual setting one needs to make sure that time sync between host and guest PDCe/DC time authority is disabled. The PDCe needs to get its time from an outside, and accurate, source. The caveat with a virtual DC is that it no longer has a physical connection with the local CMOS clock.
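On the host side, turning off time sync for the DC guest can be sketched roughly as follows with the Hyper-V PowerShell module. The VM name `DC01` is a placeholder, and the integration service name shown is the English one. It may be localized on non-English hosts.

```powershell
# Run on the Hyper-V host in an elevated PowerShell session.
# 'DC01' is a hypothetical VM name for the virtual PDCe/DC.
$VMName = 'DC01'

# Show the current state of the time sync integration service.
Get-VMIntegrationService -VMName $VMName -Name 'Time Synchronization'

# Disable host-to-guest time synchronization for this VM.
Disable-VMIntegrationService -VMName $VMName -Name 'Time Synchronization'

# Confirm the change took.
Get-VMIntegrationService -VMName $VMName -Name 'Time Synchronization' |
    Select-Object VMName, Name, Enabled
```

With this off, the guest depends entirely on its own Windows Time configuration, so the PDCe needs a working outside source before the change is made.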

What does this mean? We’ve seen high load standalone and clustered guests have their time skew before our eyes. It’s not as much of a problem in 2012 R2 as it was in 2008 RTM/R2 but time related problems still happen.

This is an older post but outlines our dilemma: MPECS Inc. Blog: Hyper-V: Preparing A High Load VM For Time Skew.

This is the method we use to set up our PDCe as authority and all other DCs as slaves: Hyper-V VM: Set Up PDCe NTP Time Server plus other DC’s time service.
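For readers without the linked post handy, a common way to do this with the built-in `w32tm` tool looks something like the following. This is a generic sketch, not necessarily the exact steps in the linked post, and the pool host names are placeholders for whichever regional servers fit the site.

```powershell
# --- On the PDCe (elevated session) ---
# Placeholder peers; 0x8 asks for client-mode association with each peer.
$Peers = '0.pool.ntp.org,0x8 1.pool.ntp.org,0x8'

# Point the PDCe at the external peers and mark it as a reliable source.
w32tm /config "/manualpeerlist:$Peers" /syncfromflags:manual /reliable:yes /update
Restart-Service w32time
w32tm /resync /rediscover

# Check which source the PDCe is actually using.
w32tm /query /source

# --- On every other DC (elevated session) ---
# Follow the domain hierarchy, which leads back to the PDCe.
w32tm /config /syncfromflags:domhier /reliable:no /update
w32tm /resync
```

Run `w32tm /query /status` on each DC afterwards to confirm the source and the last successful sync time.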

In a cluster setting we _always_ deploy a physical DC as PDCe: MPECS Inc. Blog: Cluster: Why We Always Deploy a Physical DC in a Cluster Setting. The extra cost is 1U and a very minimal server to keep the time and have a starting place if something does go awry.

In higher-load settings where time gets skewed, scripting the guest DC to sync with a time server more frequently means the time server will probably send a Kiss-o’-Death packet (blog post). When that happens the PDCe will move on through its list of time servers until there are no more. Then things start breaking and clusters in the Windows world start stalling or failing.
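Rather than forcing more frequent syncs against public servers, one option is to simply watch the offset against an internal source such as the physical DC. A minimal sketch, assuming a placeholder host name `PHYSICALDC`:

```powershell
# Run inside the guest DC. 'PHYSICALDC' is a placeholder for the
# internal time source (for example, the physical PDCe).
$TimeSource = 'PHYSICALDC'

# Current source, last successful sync, and offset details.
w32tm /query /status

# State of each configured peer.
w32tm /query /peers

# Take five offset samples against the internal source only.
w32tm /stripchart "/computer:$TimeSource" /samples:5 /dataonly
```

The stripchart samples go to the internal DC, so they do not add load on the public time servers.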

As an FYI: A number of years ago we had a client call us to ask why things were wonky with the time and some services seemed to be offline. To the VMs everything seemed to be in order, but their time was whacked, as was the cluster nodes’ time.

After digging in and bringing things back online by correcting the time on the physical DC, the cluster nodes, and the VMs, everything was okay.

When the owner asked why things went wonky the only explanation I had was that something in the time source system must have gone bad which subsequently threw everything out on the domain.

They indicated that a number of users had complained about their phone’s time being whacked that morning too. Putting two and two together there must have been a glitch in the time system providing time to our client site and the phone provider’s systems. At least, that’s the closest we could come to a reason for the time mysteriously going out on two disparate systems.


Tuesday, 5 July 2016

Outlook Crashes - Exchange 2013 or Exchange 2016 Backend

We’ve been deploying Office/Outlook 2016 and Exchange 2016 CU1 in our Small Business Solution (SBS) both on-premises and in our Cloud.

Since we’re deploying Exchange with a single common name certificate we are using the _autodiscover._tcp.domain.com method for AutoDiscover.
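A quick way to confirm the SRV record is published is to query it from a client. This sketch assumes a Windows 8/Server 2012 or later machine (for `Resolve-DnsName`) and uses `domain.com` as a placeholder.

```powershell
# 'domain.com' is a placeholder for the client's e-mail domain.
$Domain = 'domain.com'

# Look up the AutoDiscover SRV record and show where it points.
Resolve-DnsName -Name "_autodiscover._tcp.$Domain" -Type SRV |
    Select-Object Name, NameTarget, Port, Priority, Weight
```

The `NameTarget` should match the common name on the certificate and the `Port` would normally be 443.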

A lot of our searching turned up “problems” around AutoDiscover but pretty much all of them were red herrings.

It turns out that Microsoft has deprecated the RPC over HTTPS setup in Outlook 2016. What does this mean?

MAPI over HTTPS is the go-to for Outlook communication with Exchange going forward.

Well, guess what?

MAPI over HTTPS is _disabled_ out of the box!

In Exchange PowerShell check on the service’s status:

Get-OrganizationConfig | fl *mapi*

To enable:

Set-OrganizationConfig -MapiHttpEnabled $true

Then, we need to set the virtual directory configuration:

Get-MapiVirtualDirectory -Server EXCHANGESERVER | Set-MapiVirtualDirectory -InternalUrl https://mail.DOMAIN.net/MAPI -ExternalUrl https://mail.DOMAIN.net/MAPI

Verify the settings took:

Get-MapiVirtualDirectory -Server EXCHANGESERVER | FL InternalUrl,ExternalUrl

And finally, test the setup:

Test-OutlookConnectivity -RunFromServerId EXCHANGESERVER -ProbeIdentity OutlookMapiHttpSelfTestProbe

EXCHANGESERVER needs to be changed to the Exchange server name.
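To save retyping the server name, the steps above can be strung together with a couple of variables. This is just a convenience sketch of the same commands. Both values at the top are placeholders to change before running it in the Exchange Management Shell.

```powershell
# Placeholders: set these to the real Exchange server name and MAPI URL.
$Server  = 'EXCHANGESERVER'
$MapiUrl = 'https://mail.DOMAIN.net/MAPI'

# Enable MAPI over HTTP at the organization level if it is not already on.
if (-not (Get-OrganizationConfig).MapiHttpEnabled) {
    Set-OrganizationConfig -MapiHttpEnabled $true
}

# Set the internal and external URLs on the MAPI virtual directory.
Get-MapiVirtualDirectory -Server $Server |
    Set-MapiVirtualDirectory -InternalUrl $MapiUrl -ExternalUrl $MapiUrl

# Verify the settings took.
Get-MapiVirtualDirectory -Server $Server | Format-List InternalUrl, ExternalUrl

# Test the setup.
Test-OutlookConnectivity -RunFromServerId $Server -ProbeIdentity OutlookMapiHttpSelfTestProbe
```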

Hat Tip: Mark Gossa: Exchange 2013 and Exchange 2016 MAPI over HTTP
