Up to the time that the server was shut down last night, there were no indications of any problems with the two RAID arrays, a 500GB RAID 1 and a 1TB RAID 5, on the box. They were partitioned as follows:
- RAID 1
  - 50GB OS partition
  - 430GB ServerData
  - 10GB SwapISACache
- RAID 5
  - 1TB NetworkData
Once we had the new Kingston RAM sticks installed, we fired up the box and went straight into the BIOS to confirm that all four 1GB sticks were there, which they were.
Subsequently, when booting into the OS, the initial Windows Server 2003 boot scroller kept going and going. While that is not necessarily a bad thing, after three years we know how long this particular box takes to boot.
The heart started to sink at that point.
Then the kicker:
One of your disks needs to be checked for consistency
Um, this NetworkData partition is a 1TB RAID 5 array! The previous ServerData partition check took only a few minutes. For this partition, it looks like we are going to be here for a while ...
I: NetworkData is 95 percent completed.
Well, okay ... maybe not ... but then ...
Inserting an index entry into index $0 of file 25.
We have all had that "one minute seems like an eternity" experience when something stressful like this is going on. The amount of time that the above message scrolled on the screen seemed like one, even though it may have lasted only 3 or 4 minutes.
The only thing that kept us from hitting that reset button was the fact that the 4 drives in the RAID 5 array where this was occurring were pinned, meaning their lights were on constantly due to disk activity.
Then a little light in what was seemingly turning into a catastrophic failure:
Correcting error in index $I30 for file 9377.
This went on for over 20 minutes.
Then we faced something one hopes never to face that late at night when expecting to pop in and out for a quick task ... the proverbial nail in the coffin - a catastrophic failure:
An unspecified error occurred.
It was at this point that it became pretty clear that we were in for the duration.
But then ...
Windows is starting up.
Again ... must resist pushing buttons (best Captain Kirk voice) ... keeping those fingers tied up and away from the power and/or reset on the front of the server. Just in case, we left it alone. And, thankfully, the above screen is what we were greeted with.
Soon we saw:
The Active Directory is rebuilding indices. Please wait
This stage took a couple of minutes. The Initializing Network Interfaces stage took another 10-15 minutes.
We were eventually greeted with a Services Failed to Start error and subsequently the logon screen.
It looks as though the OS partition made it through this relatively unscathed. The service chokes were for SQL, WSUS, WSS, and a LoB application that had their databases stored on one of the soon-to-be-discovered absent partitions. Exchange had also choked.
One lesson in all of this: A server may stay up and running almost indefinitely when experiencing a sector breakdown on a disk member or members of the array. To some degree the RAID controller will compensate. However, in our experience, as soon as the server is downed, or rebooted, those sector gremlins can jump out and make their presence known as was the case here.
Another lesson from this: We keep the Exchange databases on the OS partition for this very reason. If the Exchange databases were on a different partition and/or array and that array failed, we would lose Exchange and email communication. If we have an SBS OS that boots with a relatively happy Exchange ... the databases intact too ... then at least our client will not lose their ability to communicate with the outside world while we work on the data recovery side of things.
Back to this SBS box: Once into the OS, we were able to eventually get into the Event Logs and sure enough, out of the four partitions, the three besides the OS partition were toast.
Amazing ... simply amazing.
From the server Event Log:
Event Type: Error
Event Source: Ntfs
Event Category: Disk
Event ID: 55
Time: 7:01:32 AM
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume .
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
0000: 0d 00 04 00 02 00 52 00 ......R.
0008: 02 00 00 00 37 00 04 c0 ....7..À
0010: 00 00 00 00 02 01 00 c0 .......À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 81 01 22 00 .".
The above Event Log messages were numerous.
Just in case, we initiated the ChkDsk utility from within the GUI. It too crashed on the two critically needed partitions.
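For reference, the same consistency check can also be kicked off from a command prompt rather than the drive's Properties GUI. The drive letter below is illustrative only:

```batch
rem Run ChkDsk with fix-up enabled against the data volume
chkdsk I: /F
```

Adding /R would also scan for and attempt to recover bad sectors, though against a degraded array that can take many hours.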
We made sure that the relevant services that had folders on the ServerData partition were shut down, and we fired up ShadowProtect to bring that partition back. We were fortunate that this particular partition recovery cooperated, and we were able to fire up the relevant services and the LoB app that had its database server logs on that partition too.
The 1TB RAID 5 array did not cooperate at all, even after a 10-hour-plus, SBS OS-initiated ShadowProtect restore attempt that we began very early this morning. It failed and caused the SBS server to spontaneously reboot around 10:30 this morning. This also means no LoB application for now. We are fortunate that it is not critical to the daily functioning of our client's business.
So, where does that leave us?
In this client's case, we have a backup DC that also has a live data mirror on it! So, with SBS at least functional, we were able to email users a simple batch file to disconnect the original Company share and connect them to the backup server's Company share.
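The batch file itself was nothing fancy; a sketch along these lines, with placeholder server, share, and drive letter names rather than the client's actual ones:

```batch
@echo off
rem Drop the mapping to the Company share on the failed SBS box
net use S: /delete /y

rem Re-map the same drive letter to the backup server's live data mirror
rem BACKUP-DC and Company are placeholder names
net use S: \\BACKUP-DC\Company /persistent:yes
```

Because the drive letter stays the same, users' shortcuts and file paths keep working with no changes on their end.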
On the SBS box, we made sure to restart and verify the services running on the now-restored ServerData partition, and have left the RAID 5 array partition alone for now.
The extra expense of having that backup DC/Data Mirror box sitting there has just paid for itself in spades. For this client, we are talking a hit against the firm on the order of $1K/hour of downtime. The share switch took a relatively small amount of time. If SBS went down totally, the backup server is set up to bring DHCP, DNS, and a secondary ISP gateway online in very short order to at least keep the firm functional.
If things had ended up with a nonfunctional SBS OS, we would also have had the option of bringing down one of our Quad Core Xeon 3000 series servers that sits on our bench just for this task: a ShadowProtect Hardware Independent Restore of a client's entire SBS setup. We would bring them back online fully functional on newer, albeit temporary, hardware until a new permanent server could be installed.
Having the ShadowProtect backup setup in place gives us a good number of very flexible options to make sure that there is very little or no impact on our client's daily business operations in the event of a server failure.
Given the age of this particular 2U system, we are now talking to the partners about a replacement SBS 1U to be Swung in by the end of this week.
There is definitely one thing that has been made especially clear in the midst of all of this: The last time we experienced a catastrophic failure of this magnitude, we had BackupExec and two 72GB x6 HP Tape Libraries to fall back on. The recovery took a whole long weekend because of the sheer volume of data and struggles with BUE.
This time around, while the stress levels were and still are high, they were nowhere near the levels they were at when the last SBS catastrophic failure happened.
Even if the partners decide on a set of replacement hard drives as a temporary measure until their high season dies down near the end of the summer, we will have them back online with solid storage tonight. Can't say the same for tape and BUE ... especially with the volumes of data we are talking about.
ShadowProtect gives us the options. Between the replacement hard drives, StorageCraft's ShadowProtect, and the disaster recovery training one receives doing Swing Migrations (one can connect the dots), we will be able to do the following:
- Restore a clean version of SBS from the previous evening's ShadowProtect image
- Recover the Exchange databases from the ShadowProtect incremental we will do tonight on the SBS partition
- Forklift those databases into the recovered SBS Exchange
- Restore the ServerData and NetworkData partition data from last night's clean ShadowProtect image
- Copy the backup server's live data mirror changes that were made by the client's users back to the SBS box.
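That last copy-back step could be as simple as a one-way robocopy of newer files from the mirror, if robocopy is available on the box; a sketch with placeholder paths:

```batch
rem Copy back only the files users changed on the mirror since the failure
rem (/E = include subfolders, /XO = skip files that are older on the source)
robocopy \\BACKUP-DC\Company D:\Company /E /XO /R:2 /W:5
```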
We can do that! We have the technology and the skills! ;)
A big thanks to both StorageCraft for a great product and Jeff Middleton of SBSMigration for the awesome skill set we have gained via the SwingIt! Kit. Without this product and those skills, we would be in a very bad situation getting worse by the minute ... and very likely out of a really good client ... or ...
Microsoft Small Business Specialists
*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.