During most of my time in government I have had some part in preparing for disasters and planning how we would react to them. Planning usually culminated in a table-top exercise of some form, with scenario inputs provided by the local Emergency Services Coordinator. Normally, IT-type scenarios were limited to the loss of a radio tower or some other partial outage, and the main focus of the table-top was community service restoral and response. Catastrophic IT failure was generally an exercise of our own making, and most of the early scenarios revolved around the idea that the data center had somehow become unusable or magically disappeared. We then had to decide how we would react to such a scenario, what our restoral process would be, and, more importantly, where it would take place.
“I never had to face a disaster and rely on tapes for a 100 percent restoral”
In the early days of IT we never had the luxury of even discussing disaster, as our main focus was just keeping everything running and systems talking to each other. The expense alone of a second data center somewhere else would have been astronomical. Over time, as systems became more stable and prices came down, we began discussing Disaster Recovery and the art of failing over. The restoration-of-operations process got all of the departments involved in identifying their programs and collectively establishing the priority in which they should be restored. As you can imagine, these were always lively conversations. In our organization we established four tiers of applications, prioritized for recovery order. We lovingly referred to Tier One applications as "oh hell apps" that no one could live without. Not surprisingly, payroll always comes out on top as the thing everyone wants restored first!
So RFPs were issued, consultants were hired, data was collected, decisions were made, and then technology changed and resilience was born.
No longer did we have to worry about hard configurations and how we could pull off moving data to another location, bringing up a battery of new servers, and then configuring a route back to our users with minimal downtime. On a side note, I must add how I always marveled at the faith everyone seemed to put in those tape back-up systems. Yes, I know we restored lost or accidentally deleted files from time to time, but when the rubber meets the road, I am eternally grateful I never had to face a disaster and rely on tapes for a 100 percent restoral. The other major issue with the disaster recovery scenarios, one that was rarely, if ever, addressed, was how on earth we would bring the data back once whatever issue had rendered the data center unusable was resolved.
Virtualization, snapshots, and storage area networks solved a lot of problems from a disaster standpoint. Getting vendors to let you run their products on virtualized servers is a different topic, and while it is much more acceptable today, it was an obstacle nonetheless. Now you can prioritize your applications by risk aversion, which equates to how often you'll snapshot them and how long you'll keep the snapshots, with your main expense being the terabytes of storage space you'll need to house them all.
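To make that trade-off concrete, here is a minimal sketch of how application tiers might map to snapshot frequency and retention, and what that implies for storage. The tier names echo the ones described above, but the intervals, retention windows, and per-snapshot sizes are illustrative assumptions, not our actual policy.

```python
# Hypothetical sketch: map recovery tiers to a snapshot cadence and
# retention window, then estimate the steady-state storage bill.
# All numbers below are made-up examples, not real county figures.
from dataclasses import dataclass


@dataclass
class TierPolicy:
    name: str
    snapshot_interval_hours: int  # how often a snapshot is taken
    retention_days: int           # how long each snapshot is kept
    avg_snapshot_gb: float        # assumed average snapshot size


def snapshots_retained(policy: TierPolicy) -> int:
    """Number of snapshots on hand at steady state."""
    per_day = 24 // policy.snapshot_interval_hours
    return per_day * policy.retention_days


def storage_needed_gb(policies: list[TierPolicy]) -> float:
    """Rough estimate of total snapshot storage across all tiers."""
    return sum(snapshots_retained(p) * p.avg_snapshot_gb for p in policies)


tiers = [
    TierPolicy("Tier 1 (oh hell apps)", 1, 30, 5.0),   # hourly, kept a month
    TierPolicy("Tier 2", 4, 14, 2.0),                  # every 4 hours, 2 weeks
    TierPolicy("Tier 3", 12, 7, 1.0),                  # twice daily, 1 week
    TierPolicy("Tier 4", 24, 7, 0.5),                  # daily, 1 week
]

for t in tiers:
    print(f"{t.name}: {snapshots_retained(t)} snapshots retained")
print(f"Estimated snapshot storage: {storage_needed_gb(tiers):.1f} GB")
```

The point of a sketch like this is that risk aversion becomes a budget line: halving the Tier One interval roughly doubles that tier's storage cost, which is exactly the conversation the departments end up having.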
Now, if only you could get the “disasters” to follow your plan!
Enter the Derecho of 2012: 60 to 85 MPH sustained winds with some 70 MPH isolated gusts sprinkled in, knocking out power from Iowa to the East Coast. At the peak of the disaster, approximately 50 percent of the County was without power. The restoral process lasted from seven to ten days, with outside temperatures hovering around 100 degrees. Almost all County facilities were without power and had to rely on generator back-up if they were so equipped. Most of the fire houses and the Public Safety Center had generator back-up.
The County Public Safety Center became the hub of operations for the County to provide information and services to the many displaced citizens. The Center had opened in 2007 and was built as a "survivable facility": it contained bunk rooms, galleys, and shower facilities, and allowed a limited number of employees to stay overnight. The facility also houses the County Emergency Operations Center, 911 Center, IT Data Center, and the Public Safety Radio system. Disaster planning during the construction phase paid off, as geographically separate entry points and redundant carriers over both copper and fiber were all planned features. Even the 911 wireless and wire-line trunks were dispersed between two separate telephone company central offices and two separate tandem switches. The generator back-up system is tested every two weeks and is fully load tested once a quarter. The testing proved out, as the Center ran for seven days on generator power before commercial power was restored.
With Voice over IP technology, conference rooms were quickly converted to departmental offices. Citizens could still view our website or contact our offices for assistance. The local cable provider, local TV and radio stations, and our website kept those in need apprised of where they could go for ice, water, or shelter while the restoral process proceeded. They also had an avenue for current updates on when they could expect power to be restored. One key piece of pre-disaster planning was ensuring that the local utility companies contacted the Emergency Operations Center twice a day with updates on their service availability. Because they had been included in table-top exercises, they were familiar with the names and faces of the key players.
In the aftermath, our assessment was: we were lucky! I say that because the enterprise only appeared to the public to be resilient. The County Administration Center and the Public Service Facility were both unusable, one with no power and the other with only two of three phases of power. A failed generator at the Public Safety Center would have been a game changer, as there was no "disaster site" to either "fail over" to or be "resilient" at.
Three years later, our plans still have not been completed. For several years the Great Recession put deep cuts into all of our operational budgets, and there just wasn't any money for anything extra. For the last several years we have been streaming back-up data to a local university in Blacksburg, VA, and we are progressing toward a suitable disaster site for continuity of operations. The current planned site is a new library facility on the opposite side of the county. It is large enough to support our data infrastructure needs and could also serve as an alternate location for County departments that need to relocate in the event of facility outages.
We anticipate a real-world test of this location's capabilities in conjunction with the replacement of our core switches at the primary data center. If there is interest, I'll chime in again then with an after-action report.