Thursday 8 March 2007

Every IT Project Manager's Nightmare: The Upgrade that Breaks Things

It is currently high season for accountants here in Canada. Personal taxes are being filed left, right and centre. Or are they?

From the Canada Revenue Agency's Web site:
The Canada Revenue Agency (CRA) is experiencing electronic system difficulties that prevent the public from accessing some electronic services for personal returns such as NETFILE, TELEFILE and EFILE. We have temporarily shut down public access to electronic services to ensure the integrity of taxpayer information.

The CRA has a team working to restore its systems to normal operations but it will be a matter of days before the system problems are completely resolved. The security and integrity of taxpayer data has not been compromised. This problem is not the result of illegal activity, computer hackers or a virus.

We have now traced the source of the problem to software maintenance conducted on March 4, 2007. We are currently working to bring all systems back online gradually.
You can find the full details here.

I wonder if any heads are going to roll? ;0) ... Probably not; it's government, after all.

We service a number of SBS networks installed at accountants' offices. We received calls from many of them asking why their e-file wasn't working. After spending some time on the issue, it was pretty clear that the problem was not the SBS infrastructure.

We got a call from one of our accountant clients, and he explained to me what was happening on the CRA side. According to him, based on his conversations with the CRA, the agency has had to hire a team out of New York to come in and help fix whatever is broken.

It is pretty obvious to me, as one who manages small business networks, that we don't mess around with our clients' infrastructure during their high season. We make sure that their networks are patched and running stable before their high season, not during it!

If there are necessary patches and updates during the high season, we test them on virtual lab infrastructure that mirrors the client's current setup. There is absolutely no reason why that could not have happened in CRA's case. The technology exists to PlateSpin their entire production network into a parallel virtual environment that could be broken, restarted at square one, and broken yet again, until they got it right.

It is pretty much guaranteed that if we broke a client's infrastructure for an extended period of time, as in the CRA's case, we would no longer have them as a client. The client could also potentially lose their business!

In my opinion, our roles are that critical: we support our clients' entire livelihoods, and those clients cannot get by without their e-mail and network infrastructure for any real length of time. There is no reason why we would not put the time in to make sure that what we do with their production environments does NOT impact that livelihood.

For us, it is the cost of doing business. For our clients, it is added value.

Philip Elder
MPECS Inc.
Microsoft Small Business Specialists

2 comments:

Anonymous said...

The "software patch" in question was likely the Microsoft Daylight Saving Time patch. That's not to say it's MS's fault; however, it goes to show how a minor DST patch can create mass havoc. As tax-paying citizens and residents, I suggest we demand that the CRA explain exactly what went wrong, which subcontractor had overall responsibility for the NETFILE system, and what risk mitigation practices have been put in place to prevent this from happening again next year.

Philip Elder Cluster MVP said...

Somehow, I believe that the NETFILE system wouldn't be running on a Microsoft platform.

It is more likely to be a very customized, possibly in-house, product that runs on top of UNIX.

DST-related? Probably. Either way, my original contention still stands: they had two (2) years to test DST-related patches VIRTUALLY before ever touching a production system.

And yes, as a Canadian taxpayer, I believe we have the right to know exactly what happened to the systems, and whether any data has been compromised in the process.

Thanks for the comment!

Philip E.