Thursday 5 November 2009

RAID Controller Log: Unrecoverable Medium Error – Puncturing Bad Block?!?

Okay, so this is a new one, and it led to a near stoppage of the heart last evening:

[Screenshot: RAID controller log showing the errors listed below]

The errors in order:

Controller ID: 0 Unrecoverable medium error during rebuild: PD –|—:0 Location 0x26f1640

Controller ID: 0 Puncturing bad block: PD –|—:0 Location 0x26f1640

Controller ID: 0 Puncturing bad block: PD –|—:1 Location 0x26f1640
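For anyone sifting through a longer log, here is a minimal Python sketch that groups entries like the three above by location, so it is easy to see which drives were touched at the same spot. The regular expression is only fitted to the lines quoted in this post, the function names are made up for illustration, and treating Location as a 512-byte sector address is an assumption, not something the controller documentation here confirms.

```python
import re
from collections import defaultdict

# Pattern fitted to the log lines quoted in this post (an assumption,
# not an official log format specification).
LINE_RE = re.compile(
    r"Controller ID:\s*(?P<ctrl>\d+)\s+(?P<event>.+?):\s+PD\s+(?P<pd>\S+)"
    r"\s+Location\s+(?P<loc>0x[0-9a-fA-F]+)"
)

SAMPLE = """\
Controller ID: 0 Unrecoverable medium error during rebuild: PD –|—:0 Location 0x26f1640
Controller ID: 0 Puncturing bad block: PD –|—:0 Location 0x26f1640
Controller ID: 0 Puncturing bad block: PD –|—:1 Location 0x26f1640
"""

SECTOR_BYTES = 512  # assumption: Location is a sector address in 512-byte units


def parse_events(text):
    """Return one dict per log line that matches the pattern."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            events.append({
                "controller": int(m.group("ctrl")),
                "event": m.group("event"),
                # Slot number is whatever follows the last colon in the PD field.
                "slot": m.group("pd").rsplit(":", 1)[-1],
                "location": int(m.group("loc"), 16),
            })
    return events


def group_by_location(events):
    """Map each location to the (slot, event) pairs logged against it."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["location"]].append((e["slot"], e["event"]))
    return dict(grouped)


if __name__ == "__main__":
    for loc, hits in group_by_location(parse_events(SAMPLE)).items():
        print(f"{loc:#x} (byte offset {loc * SECTOR_BYTES} if 512-byte sectors)")
        for slot, event in hits:
            print(f"  slot {slot}: {event}")
```

Run against the three lines above, everything lands under a single location, with slot 0 showing both the medium error and the puncture and slot 1 showing the puncture only.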

PD 0 is the last original disk in this server that is giving us headaches. Both PD 1 and PD 2 (there are three hot swap bays in the SR1560SFH Intel Server System) were replaced with new drives.

The original PD 1 had failed during a server firmware update that included the BMC (previous blog post). The original PD 2, the global hot spare at the time, was rebuilt into the array with no errors . . . until a consistency check that ran later that afternoon produced some unrecoverable fatal errors.

Last night we dropped the original PD 1 out of the configuration, replaced it with a new drive, and had the new drive picked up as a hot spare. We then failed out the original PD 2 (the former hot spare that was by then a RAID 1 array member), assuming that it was the source of the errors we saw in the consistency check yesterday afternoon.

So, the above screenshot was taken while the new PD 1 was being rebuilt into the array with PD 0 as the source. Needless to say, the heart definitely skipped a few beats with visions of index $0 running through my head (previous blog post)!

The rebuild did eventually finish successfully though?!?

We will be going back this evening to fail out the bad PD 0 and replace it with a new drive which will then be designated the new hot spare.

Intel indicated to us that once PD 2, currently a hot spare, has finished rebuilding into the RAID 1 array, we need to run a consistency check. From there, hopefully ShadowProtect will finally give us a backup!

And one more thing, just what does “Puncturing bad block” really mean?

The suggestion in the NEC document linked above is to take the preventative measure of swapping out the indicated drive(s) promptly. :)

It looks as though the RAID controller found some bad sectors on PD 0, and that "puncturing" means marking those sectors as off limits on both array members.
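To make that reading concrete, here is a toy Python model of a RAID 1 rebuild that punctures instead of aborting. It only mirrors my understanding as described above; it is not how the Intel/LSI firmware is actually implemented, and every class and function name in it is invented for the illustration.

```python
class MediumError(Exception):
    """Raised when a block cannot be read."""


class Disk:
    def __init__(self, blocks, bad=()):
        self.blocks = list(blocks)
        self.bad = set(bad)       # blocks with real media defects
        self.punctured = set()    # blocks the controller has marked off limits

    def read(self, lba):
        if lba in self.bad or lba in self.punctured:
            raise MediumError(f"cannot read block {lba:#x}")
        return self.blocks[lba]


def rebuild_mirror(source, target):
    """Copy source to target; puncture any block the source cannot deliver.

    Returns the list of punctured block numbers. The rebuild itself runs
    to completion even when some blocks are lost.
    """
    punctured = []
    for lba in range(len(source.blocks)):
        try:
            target.blocks[lba] = source.read(lba)
        except MediumError:
            # No good copy exists anywhere, so flag the block on BOTH
            # members rather than mirror unknown data.
            source.punctured.add(lba)
            target.punctured.add(lba)
            punctured.append(lba)
    return punctured


if __name__ == "__main__":
    pd0 = Disk(["a", "b", "c", "d"], bad={2})   # aging source drive
    pd1 = Disk([None] * 4)                      # fresh replacement
    print("punctured blocks:", rebuild_mirror(pd0, pd1))
```

In this model the rebuild "finishes successfully" in the sense that the mirror is redundant again, but whatever lived in the punctured block is gone, and a later read of that block fails on either member. That would fit with a rebuild that completes while a backup that reads every block keeps tripping over the same spot.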

But part of this whole puzzle is the fact that the RAID controller (Intel SRCSASRB with firmware 470) shows a media error level of 0 for both array members and a predictive failure count of 0 for both members!

Hopefully tomorrow we can rest easy with a backup in hand!

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists
Co-Author: SBS 2008 Blueprint Book

*Our original iMac was stolen (previous blog post). We now have a new MacBook Pro courtesy of Vlad Mazek, owner of OWN.


6 comments:

stryqx said...

The sort of nonsense presented in the second-last paragraph is the primary reason I moved away from Intel RAID controllers and back to Adaptec controllers.

Adaptec couldn't get it right either way back when, which is why they bought out DPT, who were able to do it right.

It's still discomforting to know that even today RAID controllers aren't doing what they're supposed to do in the case of drive failure :-(

stryqx said...

Clarification: I'm not suggesting what you wrote was nonsense; the nonsense is the RAID stats reporting everything is OK!

TRRCED said...

I am having a similar problem on my server. I use the Intel RAID web console also. I am running a RAID 10 with four drives in two spans making one virtual drive. A while ago drive 2 failed so I replaced it and booted into Windows. The rebuild completed successfully and the RAID is optimal, but there were several media errors during the rebuild. Here is an example of the errors I get.

Controller ID: 0 Unrecoverable medium error during recovery: PD 0:2 Location 0x16c4a1bf

Controller ID: 0 Unrecoverable medium error during recovery: PD 0:3 Location 0x16c4a1bf

A consistency check has not been able to resolve all these errors. Because of this, I haven't been able to do a full system backup for a while now and am hesitant to try a restore from a backup. Do you know of any ways to resolve these errors?

Anonymous said...

Just today I had an "Unrecoverable medium error during recovery: PD 0:5 Location ######", and though it was a single drive (SSD) in RAID 0, I was pretty amazed in the morning. I'm still amazed that it still allowed operation on it when, if you ask me, the drive should be kicked out immediately and marked as unusable until I review the situation. Basically my servers (Hyper-V) started to act so weird that they didn't even do a normal shutdown when told to. Drive out. Restore data (I do have backups, but I suffer from not testing them) to other drives for the time being until we decide how to go on.

Anonymous said...

Adaptec controllers (at least as of a few years ago) had a nasty habit of kicking the wrong drive out of the array (dropping the good one instead of the bad one). Unfortunately we have moved away from Adaptec because when they got purchased by PMC they stopped fixing the existing controller issues and only issued updates for their new stuff. I'm not talking about discontinued cards, I'm talking about controllers that were still available for purchase from Adaptec, as long as you don't mind using buggy firmware that's 3 years old and will never see another update. Pretty sad. We used Adaptec exclusively for years.

Philip Elder Cluster MVP said...

When we were testing our standalone node setup with dual SAS HBAs for Hyper-V clusters we trialed a pair of LSI cards and Adaptec cards.

I was expecting the Adaptec cards to kill the LSI cards in every aspect.

To my surprise the Adaptec HBAs were a real PITA and kept ghosting SAS connections when changes were made on the storage unit.

The LSI cards were flawless.

Lately, the Intel/LSI RAID cards have been pretty sound in their performance. One catch though, we no longer use SATA drives anywhere in servers.

Philip