Wednesday, 21 May 2008

Seagate 750GB ST3750330NS Failure on SRCSASRB

We just had our first Seagate Enterprise ES.2 series drive failure on a client production server.

The server setup:
  • Server: SR1560SFNA 1U Dual E5440 Xeon server
  • Drive: ST3750330NS 750GB Seagate ES.2 7200RPM Enterprise Storage.
  • Controller: SRCSASRB RAID Controller
  • Array configuration: RAID 1 (2x ST3750330NS)
  • Hot spare: 750GB ST3750330NS (global hot spare)
This server is running SBS 2003 R2 Premium, Open License version.

A screen shot of the Intel RAID Web Console 2:

[Screen shot: Seagate 750GB ES.2 Failed]

One of the last times we had a RAID array fail on an add-in controller was on an Intel SRCS16: a RAID 1 pair for the OS plus the remaining four drives in a RAID 5 array for capacity.

One drive in the OS RAID 1 pair failed. We had to down the server to replace it due to a quirk with the SRCS16 and the then-new SATA 300 drives.

The rebuild on the 250GB pair ran over 20 hours with the server offline. Performance degradation while the server was online was tangible. This client ran CAD drawings off the SBS box along with all of the other tasks required by SBS. We left the server on overnight in rebuild mode, and it finished just after their office opened the following morning.

In this case, the rebuild rate on the SRCSASRB is significantly better:

750GB RAID 1 Rebuild: ~3-4 Hours: Server Online

Keep in mind that the above time is an estimate and may not reflect how long the rebuild actually takes. Since this is our client's busy time, the rebuild may run a lot slower due to the OS demand for disk time.
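As a rough sanity check on that figure: a RAID 1 rebuild is essentially a full copy of the mirror, so the time works out to the drive capacity divided by whatever sustained rebuild rate the controller can hold. Here is a minimal sketch of that arithmetic; the throughput numbers are assumptions for illustration, not measurements from the SRCSASRB:

# Rough RAID 1 rebuild time estimate.
# The throughput figures below are illustrative assumptions, not
# measured numbers from the SRCSASRB or the ST3750330NS.

def rebuild_hours(capacity_gb, rebuild_mb_per_sec):
    """Hours to copy the full mirror at a sustained rebuild rate."""
    capacity_mb = capacity_gb * 1000          # drive makers count decimal GB
    return capacity_mb / rebuild_mb_per_sec / 3600.0

# Mostly idle server: the controller can dedicate the disks to the rebuild.
print("Idle: %.1f hours" % rebuild_hours(750, 60))    # ~3.5 hours at ~60 MB/s

# Busy server: host I/O competes for disk time and the rebuild rate drops.
print("Busy: %.1f hours" % rebuild_hours(750, 25))    # ~8.3 hours at ~25 MB/s

At an assumed 60 MB/s the math lands right in that 3-4 hour window; cut the rate in half or worse under load and the rebuild stretches accordingly.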

We will be popping in to hot swap the failed drive and then set up the replacement as the new global hot spare.

The above mentioned SRCS16 failure was a nail biter. We did not have any ShadowProtect backups at that time and would have had to rebuild the server using the built-in SBS backup. While that would have worked, it would have been time consuming, and our client would not have been pleased with their people missing a day of work.

The most critical time for a failed RAID 1 or RAID 5 array is the window between the drive failure and the completion of the rebuild onto a replacement drive.

In today's case we had an identical Seagate ES.2 drive set up as a hot spare.

Where there is no hot spare to begin the rebuild cycle as soon as an array member fails, there is the additional time for a technician to respond and replace the defective drive.

No hot swap? Then more time and lost production for the client while we down the server and replace that defective drive. We then have to ask our client: do you want to risk a total loss if another array member dies (this goes for both RAID 1 and RAID 5 arrays), or do we down the server, replace the defective unit, and put the server back online in rebuild mode (a lot slower), or leave it offline in rebuild mode (a lot faster)?
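To put some rough numbers on that trade-off, here is a small sketch comparing how long the array sits degraded (running on a single remaining member) with and without a hot spare. Every hour figure below is an assumed, illustrative value, not data from this client's server:

# Illustrative comparison of single-drive exposure time for a RAID 1 array
# with and without a hot spare. All hour figures are assumptions made up
# for this example.

rebuild_online_hours  = 4.0   # rebuild with the server online (slower)
rebuild_offline_hours = 2.0   # rebuild with the server offline (faster, but downtime)
tech_response_hours   = 6.0   # time for a technician to arrive and swap the drive

# Hot spare: the controller starts rebuilding the moment a member fails.
exposure_hot_spare = rebuild_online_hours

# No hot spare: the array sits degraded until the drive is swapped,
# then for however long the rebuild itself takes.
exposure_swap_then_online  = tech_response_hours + rebuild_online_hours
exposure_swap_then_offline = tech_response_hours + rebuild_offline_hours

print("Hot spare, online rebuild:  %.0f hours degraded" % exposure_hot_spare)
print("No spare, online rebuild:   %.0f hours degraded" % exposure_swap_then_online)
print("No spare, offline rebuild:  %.0f hours degraded plus server downtime" % exposure_swap_then_offline)

Whatever the actual numbers turn out to be, the hot spare removes the technician response time from the exposure window entirely, which is the whole point of configuring one.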

We must also keep in mind that stress on the hard disks increases markedly during the rebuild cycle. If the server is still online, the RAID controller is driving both the rebuild and the regular server I/O at the same time.

For clients with higher downtime costs, this is the primary reason we promote the hot swap option. There is a selling point for add-in RAID controllers as well: the motherboard based RAID controller in this situation (S5400SF) would more than likely have locked up the server when the drive failed.

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

*All Mac on SBS posts are posted on our in-house iMac via the Safari Web browser.
