Tuesday, 15 April 2008

SBS, ShadowProtect, and an Event ID 55 NTFS Error ...

We went in to a client site yesterday evening to finish a warranty swap of some flaky memory sticks.

Up to the time that the server was shut down last night, there were no indications of any problems with the two RAID arrays, a 500GB RAID 1 and 1TB RAID 5, on the box. They were partitioned as follows:
  • RAID 1
    • 50GB OS partition
    • 430GB ServerData
    • 10GB SwapISACache
  • RAID 5
    • 1TB NetworkData
The unit is a 3-year-old SR2400 with dual 3.0GHz Xeons and 4GB of RAM. We changed out the hard drives to give the server more storage about two years ago, and about a month ago we did a warranty swap of the Intel SE7520JR2 motherboard (previous blog post) twice. The first swap was a struggle to get the replacement SE7520JR2 board to recognize the 4 sticks of Kingston RAM. The second swap was because the first warranty replacement board lost 2GB of RAM on a patch reboot. :(

Once we had the new Kingston RAM sticks installed, we fired up the box and went straight into the BIOS to confirm that all four 1GB sticks were there, which they were.

Subsequently, when booting into the OS, the initial Windows Server 2003 scroller kept going and going. While not necessarily a bad thing, after three years, we know how long this particular box takes to boot.

The heart started to sink at that point.

Then the kicker:


One of your disks needs to be checked for consistency

Um, this NetworkData partition is a 1TB RAID 5 array! The previous ServerData partition took only a few minutes. For this partition, it looks like we are going to be here for a while ...

I: NetworkData is 95 percent completed.

Well, okay ... maybe not ... but then ...

Inserting an index entry into index $0 of file 25.

We have all had that one-minute-seems-like-an-eternity experience when something stressful like this is going on. The time that the above message scrolled on the screen seemed like an eternity, even though it may have lasted only 3 or 4 minutes.

The only thing that kept us from hitting that reset button was the fact that the 4 drives in the RAID 5 array where this was occurring were pinned. Meaning, their lights were on constantly due to disk activity.

Then a little light in what was seemingly turning into a catastrophic failure:

Correcting error in index $I30 for file 9377.

This went on for over 20 minutes.

Then we faced something one hopes to never face that late at night expecting to pop in and pop out for a quick task ... the proverbial nail in the coffin - a catastrophic failure:

An unspecified error occurred.

It was at this point that it became pretty clear that we were in for the duration.

But then ...

Windows is starting up.

Again ... must resist pushing buttons (best Captain Kirk voice) ... keeping those fingers tied up and away from the power and/or reset on the front of the server. Just in case, we left it alone. And, thankfully, the above screen is what we were greeted with.

Soon we saw:

The Active Directory is rebuilding indices. Please wait

This stage took a couple of minutes. The Initializing Network Interfaces stage took another 10-15 minutes.

We were eventually greeted with a Services Failed to Start error and subsequently the logon screen.

*Phew*

It looks as though the OS partition made it through this relatively unscathed. The service chokes were for SQL, WSUS, WSS, and a LoB application, all of which had their databases stored on one of the soon-to-be-discovered absent partitions. Exchange had also choked.

One lesson in all of this: A server may stay up and running almost indefinitely when experiencing a sector breakdown on a disk member or members of the array. To some degree the RAID controller will compensate. However, in our experience, as soon as the server is downed, or rebooted, those sector gremlins can jump out and make their presence known as was the case here.

Another lesson from this: We keep the Exchange databases on the OS partition for this very reason. If the Exchange databases were on a different partition and/or array and it failed, we would lose Exchange and email communication. If we have an SBS OS that boots with a relatively happy Exchange, with the databases intact too, then at least our client will not lose their ability to communicate with the outside world while we work on the data recovery side of things.

Back to this SBS box: Once into the OS, we were able to eventually get into the Event Logs and sure enough, out of the four partitions, the three besides the OS partition were toast.

Amazing ... simply amazing.

From the server Event Log:
Event Type: Error
Event Source: Ntfs
Event Category: Disk
Event ID: 55
Date: 4/15/2008
Time: 7:01:32 AM
User: N/A
Computer: MY-SBS01
Description:
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume .

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0d 00 04 00 02 00 52 00 ......R.
0008: 02 00 00 00 37 00 04 c0 ....7..À
0010: 00 00 00 00 02 01 00 c0 .......À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 81 01 22 00 .".
The above Event Log messages were numerous.
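
For anyone curious about what the Data block in that event holds, here is a minimal Python sketch that turns the hex dump back into bytes and pulls out two status values. It assumes the bytes follow the commonly documented I/O error log packet layout, with the error code at offset 0x0C and the final status at offset 0x14. That layout is our assumption, not something the event itself states. The function names are ours, for illustration only.

```python
import re
import struct

# Data section copied from the Event ID 55 entry above.
EVENT_DATA = """\
0000: 0d 00 04 00 02 00 52 00 ......R.
0008: 02 00 00 00 37 00 04 c0 ....7..À
0010: 00 00 00 00 02 01 00 c0 .......À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 81 01 22 00 .".
"""

HEX_BYTE = re.compile(r"^[0-9a-fA-F]{2}$")


def parse_dump(text):
    """Convert 'offset: hex bytes ascii' lines into a bytes object."""
    raw = bytearray()
    for line in text.splitlines():
        if ":" not in line:
            continue
        _, rest = line.split(":", 1)
        # At most 8 hex bytes per line; stop at the first token that is
        # not a two-digit hex value (the start of the ASCII column).
        for token in rest.split()[:8]:
            if not HEX_BYTE.match(token):
                break
            raw.append(int(token, 16))
    return bytes(raw)


def decode(raw):
    """Read three little-endian 32-bit values starting at offset 0x0C.

    Assumed layout: error code, unique error value, final status.
    """
    error_code, _unique, final_status = struct.unpack_from("<III", raw, 0x0C)
    return {
        "event_id": error_code & 0xFFFF,  # low 16 bits of the error code
        "error_code": error_code,
        "final_status": final_status,
    }


if __name__ == "__main__":
    info = decode(parse_dump(EVENT_DATA))
    print("Event ID:     %d" % info["event_id"])
    print("Error code:   0x%08X" % info["error_code"])
    print("Final status: 0x%08X" % info["final_status"])
```

Under that assumed layout, the low 16 bits of the error code come out as 55, which matches the Event ID shown in the log. The final status decodes to 0xC0000102, which is the NTSTATUS value commonly documented as STATUS_FILE_CORRUPT_ERROR.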

Just in case, we initiated the ChkDsk utility from within the GUI. It too crashed on the two critically needed partitions.

We made sure that the relevant services that had folders on the ServerData partition were shut down, and we fired up ShadowProtect to bring that partition back. We were fortunate that this particular partition recovery cooperated, and we were able to fire up the relevant services and the LoB app that also had its database server logs on that partition.

The 1TB RAID 5 array did not cooperate at all, even after a 10-hour-plus ShadowProtect restore attempt, initiated from within the SBS OS, that we began very early this morning. The restore failed and caused the SBS server to spontaneously reboot around 10:30AM this morning. This also means no LoB application for now. We are fortunate that it was not critical to the daily functioning of our client's business.

So, where does that leave us?

In this client's case, we have a backup DC that also has a live data mirror on it! So, with SBS at least functional, we were able to email users a simple batch file to disconnect the original Company share and connect them to the backup server's Company share.
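
A batch file like that only needs two `net use` lines. Here is a minimal Python sketch that generates one. The drive letter and the backup server name (MY-BACKUP01) are hypothetical placeholders, not the client's real values, and the function name is ours.

```python
def build_remap_script(drive, old_server, new_server, share):
    """Return the text of a batch file that remaps one network drive.

    Uses only the standard `net use` switches /delete and /persistent.
    """
    lines = [
        "@echo off",
        "rem Disconnect the %s share on %s" % (share, old_server),
        "net use %s: /delete" % drive,
        "rem Map the same drive letter to the mirror on %s" % new_server,
        "net use %s: \\\\%s\\%s /persistent:yes" % (drive, new_server, share),
    ]
    # Batch files conventionally use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    script = build_remap_script("S", "MY-SBS01", "MY-BACKUP01", "Company")
    with open("remap_company_share.bat", "w", newline="") as handle:
        handle.write(script)
    print(script)
```

Reusing the same drive letter means users' existing shortcuts and recent-file lists keep working against the mirror.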

On the SBS box, we made sure to restart and verify the services running on the now restored ServerData partition, and have left the RAID 5 array partition alone for now.

The extra expense of having that backup DC/Data Mirror box sitting there has just paid for itself in spades. For this client, we are talking a hit against the firm on the order of $1K/hour for downtime. The share switch took a relatively small period of time. If SBS went down totally, the backup server is set up to bring DHCP, DNS, and a secondary ISP gateway online in very short order to at least keep the firm functional.

If things happened to end up with a nonfunctional SBS OS, we would have also had the option of bringing down one of our Quad Core Xeon 3000 series servers that sits on our bench just for this task: A ShadowProtect Hardware Independent Restore of a client's entire SBS setup. We would bring them back online fully functional on newer, albeit temporary, hardware setup until such time as a new permanent server could be installed.

Having the ShadowProtect backup setup in place gives us a good number of very flexible options to make sure that there is very little or no impact on our client's daily business operations in the event of a server failure.

Given the age of this particular 2U system, we are now talking to the partners about a replacement SBS 1U to be Swung in by the end of this week.

ShadowProtect

There is definitely one thing that has been made especially clear in the midst of all of this: The last time we experienced a catastrophic failure of this magnitude, we had BackupExec and two 72GB x6 HP Tape Libraries to fall back on. The recovery took a whole long weekend because of the sheer volume of data and struggles with BUE.

This time around, while the stress levels were and are still high, they were nowhere near the levels they were at when the last SBS catastrophic failure happened.

Even if the partners decide on a set of replacement hard drives as a temporary measure until their high season dies down near the end of the summer, we will have them back online with solid storage tonight. Can't say the same for tape and BUE ... especially with the volumes of data we are talking about.

ShadowProtect gives us the options ... with the hard drive replacement, and between StorageCraft's ShadowProtect and the disaster recovery training (one can connect the dots) one receives doing Swing Migrations, we will be able to do the following:
  • Restore a clean version of SBS from the previous evening's ShadowProtect image
  • Recover the Exchange databases from the ShadowProtect incremental we will do tonight on the SBS partition
  • Forklift those databases into the recovered SBS Exchange
  • Restore the ServerData and NetworkData partition data from last night's clean ShadowProtect image
  • Copy the backup server's live data mirror changes that were made by the client's users back to the SBS box.
As it turns out, we just received a call back from our partner contact at the firm. They prefer to go with the hard drive replacement until their business slows down this summer to keep things as close to status quo as possible for now.

We can do that! We have the technology and the skills! ;)

A big thanks to both StorageCraft for a great product and Jeff Middleton of SBSMigration for the awesome skill set we have gained via the SwingIt! Kit. Without this product and those skills, we would be in a very bad situation getting worse by the minute ... and very likely out of a really good client ... or ...

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.

1 comment:

ZC1 said...

I remember that client call too. (in fact, two calls)

I was called out to a new client with similar symptoms, but this problem was caused by an unexpected power failure that the UPS could not handle, since the UPS batteries were dead.
AND
they were using an overloaded desktop motherboard (as an SBS 2k3 R2 server) with onboard RAID1 and a failed tape restore.

The other call was again similar, except the IT guy could not figure out what was wrong. Besides the boot partition having only 3 CDs' worth of space left, it turned out to be a strategically placed bad sector on a Dell Server.

Little did I know, Dell installed a software RAID (CERC SATA 2s) solution, so when I fixed the problem and put the RAID disk back, the system REFUSED to recognize the array or EVEN the drives.
What????!!!!! (It's 10:30pm, I'm hungry and I want to go home).

After monkeying with it a bit, I recognized that the onboard Dell CERC raid controller was intermittently acting up, but it eventually booted and ran okay. I'm replacing the server this week (end).

So.....!!!

I feel for you Man!

I'm there for ya, if you need someone to talk with, give me a shout, you have my email.

(Raid 5 crash, ughh.. no fun (shakes head).)

ZC1
Beta tester of "0"s and "1"s