Saturday, 26 April 2008

SBS Disaster Recovery - Second DC SBS Restore Caveats

The SBS domain recovery we worked on for a client that began last week has presented us with a number of foreseen and unforeseen challenges.

Once we had them online with their backup systems, the goal was to Swing their SBS over to new hardware. In this particular situation, we had a secondary DC installed on the domain.

It was installed for this very reason: To provide Active Directory, DNS, and Internet access along with VPN access to company files if needed.

It was our goal to use the Swing method to introduce a new SBS box utilizing the old SBS box as the starting point for the Swing.

We had also grabbed the most recent ShadowProtect images of the backup DC for any unforeseen needs.

Once we had won the battle to settle the old SBS server into some form of stability, the first thing we did was attempt to join a system to the SBS domain. This is the message we received during the attempt:

Computer Name Changes
The following error occurred attempting to join the domain "mysbsdomain.lan":
The directory service was unable to allocate a relative identifier.
This did not necessarily yield any clues at first.

After a bunch of searching around, the closest thing we could find was at Experts-Exchange: The directory service was unable to allocate a relative identifier (keep in mind they are now subscription based). That in turn led to two MS KB articles: KB248410 Error Message: The Account-Identifier Allocator Failed to Initialize Properly, which is for Windows 2000 Server, and KB822053 Error message: "Windows cannot create the object because the Directory Service was unable to allocate a relative identifier", which is for Windows 2000 Server and Server 2003.

The KB articles gave us some repadmin commands to test things out, and those led to some clues as to the source of the problem.

At this point, the old SBS box was plugged into a stand-alone Gigabit switch. The NICs had the appropriate IP setups and teaming, and the Internet NIC was plugged into our Workbench network just in case we needed outside access.

We knew that the old SBS box could not communicate with the backup DC. This was a given since the SBS box was sitting in our shop and not at the client's site.

However, not having communication with the backup DC should not be a problem, right?

So, we figured that since the Backup DC was nowhere to be found, we would try something simple like adding a user on the SBS box itself via the wizard.

We ran the Add User Wizard.

This is the message we received:
You must be a member of the Small Business Server Administrators or Power Users group to create computer accounts. Contact your administrator.
Oh. So, we tried the Add Computer wizard and received the same message. But, we were logged in as the domain admin.

In the SBS Console however, we were able to open ADUC and make changes to object properties or GPMC and modify policies. So, this at least confirmed that we were into the server with a domain admin account and our domain admin privileges.

A more detailed search into the SBS event logs brought us to this log entry:
Event Type: Error
Event Source: SAM
Event Category: None
Event ID: 16651
Date: 4/19/2008
Time: 1:56:52 PM
User: N/A
Computer: MYFailed-SBS01
Description: The request for a new account-identifier pool failed. The operation will be retried until the request succeeds. The error is " The requested FSMO operation failed. The current FSMO holder could not be contacted."
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
The error does not make sense since the SBS box holds all FSMO roles. There were consistent NTDS KCC warnings in the logs too:
Event Type: Warning
Event Source: NTDS KCC
Event Category: Knowledge Consistency Checker
Event ID: 1308
Date: 4/19/2008
Time: 2:01:32 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: MySBSServer
Description: The Knowledge Consistency Checker (KCC) has detected that successive attempts to replicate with the following domain controller has consistently failed.
Attempts: 31
Domain controller: CN=NTDS Settings,CN=MyBackupDC,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=MySBSDomain,DC=LAN
Period of time (minutes): 6902
The Connection object for this domain controller will be ignored, and a new temporary connection will be established to ensure that replication continues. Once replication with this domain controller resumes, the temporary connection will be removed.
Additional Data
Error value: 8524 The DSA operation is unable to proceed because of a DNS lookup failure.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Again, these errors are to be expected as the backup DC was not online.

By this time we were firing up a Xeon 3070 based box to do a Hardware Independent Restore of our client's backup DC as it was looking like we were going to need it.

Finally, about an hour later, there was the final clue to the mess we found ourselves in:
Event Type: Warning
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2092
Date: 4/19/2008
Time: 2:56:32 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: MySBSServer
Description: This server is the owner of the following FSMO role, but does not consider it valid. For the partition which contains the FSMO, this server has not replicated successfully with any of its partners since this server has been restarted. Replication errors are preventing validation of this role.

Operations which require contacting a FSMO operation master will fail until this condition is corrected.

FSMO Role: DC=MySBSDomain,DC=LAN
User Action:

1. Initial synchronization is the first early replications done by a system as it is starting. A failure to initially synchronize may explain why a FSMO role cannot be validated. This process is explained in KB article 305476.

2. This server has one or more replication partners, and replication is failing for all of these partners. Use the command repadmin /showrepl to display the replication errors. Correct the error in question. For example there maybe problems with IP connectivity, DNS name resolution, or security authentication that are preventing successful replication.

3. In the rare event that all replication partners being down is an expected occurance, perhaps because of maintenance or a disaster recovery, you can force the role to be validated. This can be done by using NTDSUTIL.EXE to seize the role to the same server. This may be done using the steps provided in KB articles 255504 and 324801 on http://support.microsoft.com.

The following operations may be impacted:

Schema: You will no longer be able to modify the schema for this forest.
Domain Naming: You will no longer be able to add or remove domains from this forest.
PDC: You will no longer be able to perform primary domain controller operations, such as Group Policy updates and password resets for non-Active Directory accounts.
RID: You will not be able to allocation new security identifiers for new user accounts, computer accounts or security groups.
Infrastructure: Cross-domain name references, such as universal group memberships, will not be updated properly if their target object is moved or renamed.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
We were still nowhere near having the backup DC restored on our box here in the shop. So, we created a VPN connection to the production backup DC and forced a replication across the VPN.

At first we were expecting the replication to take a while, but it was relatively quick, and we now had a viable SBS DC to work with.

We have now learned valuable lesson number 1 for SBS recovery with a secondary DC: when needing to recover an SBS DC that has other DCs in the SBS Active Directory forest, we need one of those DCs for the initial replication. Given that our recovered SBS DC was in good shape at that point, replicating with the production backup DC was okay to do.

However, if things were a lot more messed up, then the only option would be to attach a recovered version of one of the other DCs to the recovered SBS box's isolated network and either force replication with it or use it as the source for the Swing Migration.
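For quick reference, the three log entries quoted above can be collapsed into a small lookup. This is only a sketch built from the events in this post: the names `EVENT_HINTS`, `triage`, and `needs_partner_replication` are hypothetical, and the hint text is our own paraphrase of the event descriptions, not Microsoft's wording.

```python
# Hypothetical lookup built only from the three events quoted in this post.
# Keys are (event source, event ID); values paraphrase the event descriptions.
EVENT_HINTS = {
    ("SAM", 16651): "Request for a new RID pool failed; the FSMO holder could not be used.",
    ("NTDS KCC", 1308): "Replication with a partner DC has failed repeatedly.",
    ("NTDS Replication", 2092): "This DC owns a FSMO role but will not treat it as valid "
                                "until it replicates with a partner.",
}


def triage(events):
    """Return the hints for known (source, event_id) pairs, without duplicates."""
    hints = []
    for source, event_id in events:
        hint = EVENT_HINTS.get((source, event_id))
        if hint is not None and hint not in hints:
            hints.append(hint)
    return hints


def needs_partner_replication(events):
    """True when event 2092 is present, the clue that pointed us at initial replication."""
    return ("NTDS Replication", 2092) in set(events)


seen = [("SAM", 16651), ("NTDS KCC", 1308), ("NTDS Replication", 2092)]
for line in triage(seen):
    print(line)
```

In our case it was the combination that mattered: 16651 and 1308 on their own looked like expected noise from the missing backup DC, while 2092 explained why the SBS box would not hand out RIDs.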

As soon as we saw a successful KCC message in the logs, we ran the Add User Wizard and sure enough we could create users and computers again.

Then came the sigh of relief. :)

Okay, so through the Swing steps we go to establish a new SBS instance on the new hardware we have in the shop.

We ran into a few initial hiccups in the Swing process, but they were related to some of the methodology itself.

Once we had the new SBS server finished, we delivered it to our client's site very early this last Tuesday morning. The intention was to bring it online while everyone was not in the office.

We shut down DHCP on the backup server, reconnected the internal network cable to the box's second NIC, teamed them back up, and reset the IP and DNS settings on the team.

The new SBS box and the backup DC were not very happy to see each other at first. Replication failed either way.

Since they would not replicate, we needed to work on the highest priority, which was to get the client machines moved over to the new SBS box. We created a startup script to do the following:
  • ipconfig /release (remove the IP settings given by the backup DC)
  • ipconfig /renew (reestablish IP settings from the new SBS box)
  • net use g: /delete (SBS Company folder)
  • net use h: /delete (Backup Company folder)
  • net use g: \\mysbsserver\company (data now online)
  • net use h: \\backupdc\companybu (now read only)
  • gpupdate /force (forces the client machines to pull GP from the SBS box)
The last step was critical for bringing things back together. We made sure all of the client machines that were online once we reached this point were sent into a reboot via a shutdown -r batch file on the server. We logged on as our test domain user account to verify share, Outlook, and Internet access. The ISA Firewall Client was connected and everything seemed to be working as it should.
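The startup script steps listed above can be sketched as a small generator that writes out the batch file text. This is an illustration only: `build_startup_script` is a hypothetical helper, and we are assuming the shares are reached by UNC path using the server and share names from this post.

```python
def build_startup_script(mappings):
    """Return batch-file text: refresh the DHCP lease, remap drives, refresh Group Policy.

    mappings: list of (drive_letter, unc_path) tuples.
    """
    lines = ["ipconfig /release", "ipconfig /renew"]
    # Drop every old mapping first so no drive letter still points at the backup DC.
    for drive, _ in mappings:
        lines.append(f"net use {drive}: /delete")
    # Then map each drive letter to its new location.
    for drive, unc in mappings:
        lines.append(f"net use {drive}: {unc}")
    # Last step: make the client pull Group Policy from the new SBS box.
    lines.append("gpupdate /force")
    return "\r\n".join(lines) + "\r\n"


script = build_startup_script([
    ("g", r"\\mysbsserver\company"),   # SBS Company folder
    ("h", r"\\backupdc\companybu"),    # backup copy, now read only
])
print(script)
```

Keeping gpupdate /force as the final line mirrors the order we used, since that was the step that pulled the clients back to the SBS box.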

We then made sure that the users that brought their laptops in that morning understood that they were to answer yes to the reboot question that came via the GPUpdate.

Their Office 2003, which is distributed via Group Policy Software Installation, ran again, causing a slowdown on the initial boot.

But, once they were connected, their data and shares were as they expected them, and their Outlook was connected to Exchange and happy. Email was moving as it should.

We ran into a problem getting the new SBS DC and backup DC to replicate, however.

The SBS box happily picked up the proper settings both in DNS and Active Directory to replicate with the backup DC. But, the backup DC would have nothing to do with the new SBS box.

In a way, this is expected behaviour given the new SBS box will have a totally different underlying identity to the original SBS box. We followed the entries in the following TechNet article: Troubleshooting GUID Discrepancies.

This is what we see from the NTDS properties of an SBS box and the corresponding AD DNS entry:

NTDS DNS Alias and its corresponding entry in _msdcs.mysbsdomain.lan

When we had a look in _msdcs.mysbsdomain.lan on the new SBS box, there were indeed multiple entries for the old SBS box, the new SBS box, and the existing backup DC.

DNS on the backup DC also had multiple entries, so we cleaned out the wrong ones and ran replication again. Still no go. The new SBS server happily tried to connect to the backup DC, but the Backup DC would not connect to the new SBS box. We were getting Access Denied messages in the logs whenever we forced replication or it ran on its own.

Another clue was in the fact that we could not access any shares or network resources on the new SBS box from the backup DC but we could the other way around.

Looking into the ServicePrincipalName cleanup suggested in the above article, we made the necessary changes.

ServicePrincipalName Cleanup: Remove the old entries, paste the new GUID in place, and Add

There are two entries in the SPN that needed to be changed.
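To make the GUID relationship concrete, here is a rough sketch of the names that have to agree for a given DC. It rests on assumptions: the two SPN shapes reflect our understanding of the GUID-based SPNs a DC registers and should be checked against the TechNet article, `expected_guid_names` is a hypothetical helper, and the GUID in the example is a made-up placeholder, not one from our client's domain.

```python
import uuid

# Interface GUID that prefixes the directory replication SPN.
# Assumption: taken from Microsoft's documented SPN format; verify before relying on it.
DRS_INTERFACE_GUID = "E3514235-4B06-11D1-AB04-00C04FC2DCD2"


def expected_guid_names(dsa_guid, domain_dns, forest_dns):
    """Build the DNS alias and GUID-based SPNs expected for one DC's NTDS Settings GUID."""
    guid = str(uuid.UUID(dsa_guid))  # validates the GUID and normalizes it to lowercase
    alias = f"{guid}._msdcs.{forest_dns}"
    return {
        "dns_alias": alias,  # CNAME expected in the _msdcs zone
        "spns": [
            f"{DRS_INTERFACE_GUID}/{guid}/{domain_dns}",
            f"ldap/{alias}",
        ],
    }


# Placeholder GUID for illustration only.
names = expected_guid_names(
    "01234567-89ab-cdef-0123-456789abcdef", "mysbsdomain.lan", "mysbsdomain.lan"
)
print(names["dns_alias"])
for spn in names["spns"]:
    print(spn)
```

Comparing these generated names against what DNS and the servicePrincipalName attribute actually hold is a quick way to spot entries that still carry the old server's GUID.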

After the cleanup and a reboot of the backup DC, they still would not replicate. That GUID alias in the NTDS properties under dssite.msc would not change to the new SBS server's GUID. It still had the old SBS server's GUID there.

Given the amount of time we were fighting to get them to replicate by this point, we decided to DCPromo the backup DC to demote it and DCPromo it back in again.

That failed!

From the command line we had to DCPromo /forceremoval on the backup DC to get it to demote. That worked.

But that still left us with the new SBS server and all of the backup DC references in Active Directory. However, we knew that would be the case as the Swing Migration steps prepared us for what was next: Utilizing NTDSUtil to perform a metadata cleanup of the backup DC settings, and a cleanup of any reference to the backup DC in ADSIEdit.msc. We also needed to clean up DNS of any reference to the old backup DC's GUID.
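For reference, the NTDSUtil metadata cleanup is an interactive sequence. The sketch below only assembles that sequence as text so it can be printed and followed at the console; it does not run anything. `metadata_cleanup_steps` is a hypothetical helper, and the index numbers are placeholders that have to be read off the output of the list commands on the real server.

```python
def metadata_cleanup_steps(connect_to, domain_index, site_index, server_index):
    """Return the ntdsutil commands, in order, to remove one dead DC's metadata.

    connect_to: a healthy DC to bind to (here, the new SBS box).
    *_index: numbers shown by the preceding 'list' command (placeholders here).
    """
    return [
        "ntdsutil",
        "metadata cleanup",
        "connections",
        f"connect to server {connect_to}",
        "quit",
        "select operation target",
        "list domains",
        f"select domain {domain_index}",
        "list sites",
        f"select site {site_index}",
        "list servers in site",
        f"select server {server_index}",  # must be the demoted backup DC
        "quit",
        "remove selected server",
        "quit",
        "quit",
    ]


steps = metadata_cleanup_steps("mysbsserver", 0, 0, 1)
for number, command in enumerate(steps, start=1):
    print(f"{number:2d}. {command}")
```

The ADSIEdit.msc and DNS cleanup still have to be done by hand afterwards; NTDSUtil only removes the server's replication metadata.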

We doubled back over our work to make sure there were absolutely no AD settings left for the backup DC. Once satisfied, we DCPromoed the backup DC back into the AD forest.

After a reboot, they were both happily replicating!

We have now learned valuable lesson number 2 for SBS recovery with a secondary DC: when we go to reintegrate a newly installed SBS server that was a part of a disaster recovery process, we may need to demote any and all secondary and tertiary DCs.

In this case, the secondary DC was in the same office. In a branch office scenario where there are a couple of other offices out there, this could present a real rat's nest to get things up to speed AD replication-wise.

Now, what we experienced could very well be an anomaly. In other cases, the edits done to the SPNs on the other DCs may in fact take, the issue stops there, and the DCs go on to replicate with no further issue.

For your reference: This was one of the more challenging disaster recoveries we have had to face yet.

We have been very fortunate that none of our clients have totally lost a location, but we came close once when a building's roof sent a deluge of rain water into one client's server closet at 03:30 in the morning. That was a scary call. It seemed that building maintenance had not gotten around to cleaning out the roof drains, and the drain just above the closet was the one to give way. :(

Since we have both the old SBS and backup DC up and running and replicating happily here in the shop, we will be running a couple of test Swing Migrations to see if that second DC causes problems in a non-disaster-recovery SBS domain migration too.

The step after that will be to see how the new Server 2008 Active Directory schema extensions for a Read Only DC at a branch site impact our SBS 2003 to 2003 and 2003 to 2008 migrations.

Thanks for reading! :)

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.

4 comments:

Andy said...

Wow - sounds like someone lost a lot of hair on that job. As far as experts exchange is concerned - just scroll down past the ads, testimonials and keep going - eventually you'll get to all of the content and answers on the page

Philip E. said...

Stuart .. will reply. (comment removed)

Anonymous,
Online storage backup is not an option for clients that move significant data and have small less than 1Mb upload capabilities.

We do StorageCraft with off-site services and rotations provided by us for our clients.

Comment removed due to it being comment spam in nature.

Andy,

Yes, it was one heck of a project.

I will look a little closer at EE when not logged in to see if that works. It seems to me as I recall that everything is greyed out when not authenticated and requires a username and password.

This is what I get if I am not signed into EE:

04.02.2006 at 05:25PM PDT, ID: 16356888

twanlass:All comments and solutions are available to Premium Service Members only. Sign-up to view the solution to this question.

Already a member? Login to view this solution.

So, me thinks that one needs to be a member.

Philip

Andy said...

Just checked - I've never logged in and I get all the details - you have to scroll down a long, long way though.
I do wish that google would let you remove a site from entries shown in the results to avoid this sort of thing though.
The storagecraft option sounds pretty good (although not in this case!) - I'm going to be taking a look at this option shortly myself

Lyle35204 said...

I also had to recover my SBS 2003 server from data corruption. The recent Storage Craft images were also corrupted so I restored the image, ran fix mbr, and chkdsk -r. The server booted, but I probably could have just run fix mbr on the drive. Lessons: Storage Craft backups are very fast, but recovery is slow, taking 10 hours or more to restore an image. We were down for several days trying to find an image that would restore.