Thursday, 26 June 2008

Seagate 750GB ES.2 Failure on SR1530AHLX swap experience

This is our second drive failure for this series within a short period of time.

The drive:
  • 750GB ST3750330NS Seagate ES.2
    • Firmware: SN04
    • Date Code: 08281
    • Site Code: KRATSG
    • Product of Thailand
In this case, there were two drives connected to the S3000AHLX on board RAID controller in a RAID 1 array.

The first indication of a problem was the lack of a server report in our reports folder for our client. The lack of response via RWW or VPN made it clear we were in for an on-site visit.
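As an aside, the "missing report" signal can be checked automatically. The following is only a minimal sketch: the folder path, the assumption of roughly one report per day, and the 26 hour threshold are all hypothetical and not taken from our actual setup.

```python
import time
from pathlib import Path


def newest_report_age_hours(folder, now=None):
    """Return the age in hours of the most recently modified file in
    `folder`, or None if the folder contains no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def report_is_missing(folder, max_age_hours=26.0, now=None):
    """True when no report exists, or the newest one is older than
    `max_age_hours` (hypothetical threshold: a daily report plus slack)."""
    age = newest_report_age_hours(folder, now)
    return age is None or age > max_age_hours
```

A scheduled task could call `report_is_missing()` against each client's reports folder and raise an alert when it returns `True`.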

Once on-site, we found the hard drive light on solid. This particular 1U box runs headless, and there was no response from the server when we connected a monitor, keyboard, and mouse.

We held the power button for about 5 seconds to power the server down, then pressed it again to power it back up, and were greeted with:
LSI RAID: Cannot detect array configuration
The only option we had was to enter the RAID BIOS. The server would not boot up.

In the RAID BIOS, we needed to tell the RAID controller what was going on. We entered the View/Add menu, chose the Port 0 drive, and were able to see that the drive on Port 1 had failed. We then needed to save the status update to the RAID BIOS.

Since we were there first thing in the morning, we booted the server up and verified that SBS 2003 R2 Premium was fully operational. Everything booted up fine. So, we left the server in that state for the business day.

When we came back at the end of the business day, we shut down the server and replaced the defective drive. This is a non-hot swap unit, so there was a little work involved pulling things apart and putting them back together.

And now comes the really big caveat: The RAID controller on this particular motherboard requires the rebuild to be accomplished while in the RAID controller BIOS. Thus, we needed to be back again first thing in the morning to reboot the server into the SBS OS. This is really surprising, since on all of the Intel Desktop Boards we have used, the on board RAID controllers would run the rebuild in the background.

The rebuild was successful, and we had a fully functioning SBS box when it booted up, but the need to initiate the rebuild manually in the RAID BIOS was an eye opener. FYI: The rebuild time for the 750GB RAID 1 mirror was about 4-5 hours, with the server offline the whole time.
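For a rough sense of what that 4-5 hour figure implies, here is a back-of-the-envelope calculation. It is only a sketch under two assumptions that we did not measure: that the controller copies the entire 750GB surface, and that the drive's capacity is in decimal gigabytes.

```python
def rebuild_throughput_mb_s(capacity_gb, hours):
    """Average throughput in MB/s needed to copy `capacity_gb`
    (decimal gigabytes, assumed) in `hours`."""
    total_bytes = capacity_gb * 1_000_000_000
    seconds = hours * 3600
    return total_bytes / seconds / 1_000_000


for hours in (4, 5):
    print(f"{hours} h -> {rebuild_throughput_mb_s(750, hours):.1f} MB/s")
# prints:
# 4 h -> 52.1 MB/s
# 5 h -> 41.7 MB/s
```

In other words, the offline rebuild averaged somewhere around 40 to 50 MB/s across the whole mirror.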

A little more research is in order for this situation. The new S3000SH and S3210SHLX boards show a download for the Intel RAID Web Console 2 available on the product Web page. We need to see if we can use the RAID Web Console to initiate a rebuild from within the OS. Having the ability to initiate the rebuild while the OS is online can be critical to a business with only one server.

Yes, performance will definitely be an issue as the on board RAID controller is busy trying to rebuild the array onto the new drive. But, there are ways to work with that situation versus not having the server online at all.

We were not able to install an add-in RAID controller on this particular 1U box since the only available slot was taken by a PCI-E Gigabit NIC.

For our smaller clients, there can be a struggle between cost on the one hand and performance, added redundancy, and hot swap features on the other. Having those features is just like having an insurance policy.

In this case, the insurance policy of an add-in RAID card and hot swap would have enabled us to change out the defective drive and rebuild the array without downing the server. That insurance policy would have allowed the server to keep on functioning too. It has been our experience that a drive failure on an on board RAID array tends to lock up the server or workstation.

And, one other thing: Our client had their ShadowProtect backups in place, with the last incremental taken just before the hard drive locked up. That helped us to be quite confident once the situation had been assessed. Worst case scenario, we would have been replacing the drive, recreating the array, and running the ShadowProtect recovery to recreate the SBS box's partitions. We had all of the bases covered.

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.

2 comments:

Anonymous said...

Hi Philip, what you may want to think of installing on these servers is the Intel version of HP's ILO or Dell's DRAC card, so that you can reboot the server without having to go on site, and potentially you would have been able to diagnose the RAID controller from your own office. The logic behind this is that by being able to do all that remotely you will save on your petrol and keep your costs down.
kind regards.
David

Philip Elder Cluster MVP said...

David,

My memory is a bit foggy on this one ... however, as I recall, in our dual NIC SBS Premium setups (~99.9% of our clients) we were at a disadvantage for utilizing the remote management module.

Now that SBS 2008 is single NIC however, you bring up an excellent point, and we will look to implementing the remote management module setup on all appropriate server configurations going forward.

Thank you for that!

Philip