Wednesday 16 April 2008

Images No Good ... Catastrophic SBS Failure ... Now What?!?

In what turned out to be a catastrophic SBS failure yesterday, we were tasked with installing some fresh hard drives and restoring the OS and data partitions from a ShadowProtect (SP) backup image set.

Well, things did not go even close to plan. We had hoped to be in and out in under 4 hours given optimal conditions.

No matter how many times we tried, which image version we used, or which combination of array sizes we set in the SRCS16's BIOS, we could not get a clean recovery. After one restore that did complete, we went into the Recovery Console to run CHKDSK against the troubled partitions. After that, the OS choked on a missing sys file. :(

On the occasions that we did manage to get the SBS booted up, we found a plethora of Event ID 55 (NTFS file system corruption) entries in the logs.

On many of the OS boot attempts we were greeted with:

Checking file system on H:

So, it began to look like the corruption ran pretty far back into our backup image sets.

We all know that hindsight is 20/20! ;)

So, in hindsight, the most expedient method of recovering this server would have been to SwingIt off the original hardware and then SwingIt back onto a fresh install of SBS on that same hardware. However, we had no way of knowing that we would end up on-site for 12 hours making recovery attempts and eventually rolling out the backup DC setup to provide authentication and shares, so it did not look like a viable option until well into the wee hours of the morning.

We now have the go-ahead to Swing onto a new server instead of back onto the existing one. Since they are up and running, we pulled the old SBS box that has a somewhat stable recovery on it ... though not stable enough for production ... to use in our Swing Migration.

For now, they are running via the backup DC and data mirror, with the backup DC also providing Internet access via RRAS and a second NIC. It is not ideal as there are a number of network-dependent applications that required some fiddling to get working, but at least they are not twiddling their thumbs and losing money hand over fist.

This is one scenario where having our client's email set up as follows pays off:
  • MX 100 ispmailserver1.myisp.com
  • MX 50 ispmailserver2.myisp.com
  • MX 25 mysbsmail.mysbsdomain.com
The ISP email is pulled down to the SBS box via the POP3 Connector, which is set to 1-hour intervals.

At least for now they still have access to the outside world via Webmail, and the server will get all of their incoming mail when it comes back online. Any critical emails can be BCC'd back to themselves for later download.
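For anyone wondering why that MX setup keeps mail flowing, here is a minimal sketch of how a sending mail server picks a destination from the three MX records listed above. It assumes standard MX behaviour, where the lowest preference number is tried first. The `pick_delivery_host` function and the `reachable` set are hypothetical names used purely for illustration; a real mail server does DNS lookups and actual connection attempts instead.

```python
from typing import Iterable, NamedTuple, Optional, Set


class MxRecord(NamedTuple):
    preference: int  # lower number = tried first
    host: str


# The three MX records from the list above.
MX_RECORDS = [
    MxRecord(100, "ispmailserver1.myisp.com"),
    MxRecord(50, "ispmailserver2.myisp.com"),
    MxRecord(25, "mysbsmail.mysbsdomain.com"),
]


def pick_delivery_host(records: Iterable[MxRecord], reachable: Set[str]) -> Optional[str]:
    """Return the first reachable host, trying the lowest preference value first.

    `reachable` is a stand-in (assumption) for "the sender could connect to it".
    """
    for record in sorted(records, key=lambda r: r.preference):
        if record.host in reachable:
            return record.host
    return None  # nothing reachable: the sender queues the message and retries


all_hosts = {r.host for r in MX_RECORDS}

# SBS box online: mail goes straight to it.
print(pick_delivery_host(MX_RECORDS, all_hosts))
# prints: mysbsmail.mysbsdomain.com

# SBS box down: mail lands at the ISP, where Webmail and the POP3 pull can reach it.
print(pick_delivery_host(MX_RECORDS, all_hosts - {"mysbsmail.mysbsdomain.com"}))
# prints: ispmailserver2.myisp.com
```

With the SBS box offline, senders fall through to the ISP servers, which is why nothing bounces while we rebuild.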

While we have tried to keep the impact on our client's business to a minimum, there were a number of hiccups before things started to settle down. So, to provide our client some restitution for the lost time, we will provide some of our billable time to them at no cost.

Since they are our longest running client at close to 10 years now, it only makes sense to have a little give-and-take in the business relationship.

There are a couple of important lessons here:
  • Test those SP backup images by restoring them
  • Test their durability by restoring them to different hardware
  • Having a second DC can provide an Active Directory source for a Swing Migration in the event of a total SBS failure
In our case, we were guilty of not having made the time to run restore tests on their more recent images. This is another reason to give our client a break on the otherwise very expensive I.T. week they are having.

Given the volume of work with this situation, and others, there may be a smattering of blog posts for a while ...

Thanks for reading and supporting us! :)

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.
