Using "The Box" for Disaster Recovery Planning
“This isn’t a contingency we’ve remotely looked at.” It’s my favorite line in Apollo 13, the movie that recounts the famous space mission where an on-board explosion turned plans for landing on the Moon into a scramble for survival.
The explosion on the spacecraft was clearly the big disaster. However, the quote above was delivered by one of the engineers when he was trying to explain to his chief why they could not share filters between different parts of the ship: some were round and others square. The need to share filters was not a contingency they had ever considered. A team of engineers was assembled and tasked with fitting a square filter into a round hole using only the spare parts available on the spacecraft. In the movie those parts were dramatically dumped onto a table from a couple of boxes, and the engineers started rummaging and solving the problem. Seeing those parts fall out of the boxes always comes back to me when I think about disaster recovery.
The engineers were trying to recover from a disaster by developing a plan on the fly using only what they had in the "box." I am certain that NASA had a well-documented disaster recovery plan that had been considered and reviewed, but it lacked specific procedures for resolving a micro-disaster with the filters. The most effective part of NASA's response was to empower this engineering team to use spare parts to solve the problem.
I think this scene provides great insight into disasters and disaster recovery. I also think it supports the idea that disasters rarely go according to plan. So, how can you prepare for a disaster that you know won't play by the rules? Simply put, you have to be ready for the worst and then hope for something, anything, better. As Stephen King said, “there's no harm in hoping for the best as long as you're prepared for the worst.”
I spent the first decade of the new millennium working as the CIO of a regional bank. During my time there I witnessed the regulatory scrutiny around disaster recovery evolve quite rapidly. Initially the focus was only on the data: just get it back and figure out the rest later. Next came a focus on recovery time objectives and recovery point objectives, which required the business leaders to weigh in and actively participate in disaster recovery planning. But for them to do so, disasters had to be reframed in terms of how they could impact business operations, and that brought a focus on Risk Assessments and Business Impact Analyses. As a result, Disaster Recovery Planning turned into Business Continuity Planning.
A by-product of this new business and technical collaboration was the large amount of documentation that had to be created and then maintained so that an organization could stay in regulatory compliance. One positive in all of this was that business and technology had found common ground on which to collaborate. The negative was that the documentation requirements grew and demanded a great deal of effort to maintain. Audits began to focus more on helping the business align its recovery objectives with alternative business processes and less on the specific mechanisms to restore the primary process impacted by the disaster. Disaster recovery manuals that used to be full of technical procedures morphed into business continuity plans that talked heavily about business priorities and recovery objectives. With the focus elsewhere, the specific technical material grew stale.
At the same time this shift to business-focused recovery planning was happening, recovery technology capabilities exploded. As virtualization matured, new capabilities arrived at a pace that outstripped a typical organization's ability to incorporate them. Adapting to the changes required so much time and effort that little was left for folding those changes into the disaster recovery plan, which had started to focus more on business recovery anyway. This all resulted in organizations having large disaster recovery manuals that were almost impossible to maintain and were outdated almost as soon as they were completed. Engineers started to focus their plans on how they would recover from a complete and catastrophic disaster, the worst-case scenario, and they had little time to develop procedures for smaller disasters.
Given that disasters rarely follow the plan and are usually not the worst case, it makes sense to develop a disaster recovery practice that can be adapted to deliver the fastest possible recovery for the specific disaster. This is where the "box" comes in handy. There is no better tool you could have when recovering from a disaster than a well-stocked box of recovery resources.
Clearly I am using the box as a metaphor for a resource repository, so create such a space and fill it with the things you wish you had at your fingertips during a disaster. Vendor contact numbers, license keys, administrative passwords, critical configurations, source code, backups, logs, disaster recovery contracts, cloud service agreements, and business documents are all excellent things to keep in your box. As part of your standard operating procedures you should use the box often. In fact, the box should contain the production versions of the resources listed here. This means that simply by conducting normal business operations you are also keeping your disaster recovery resources current. If a disaster hits, the first step is to assess the disaster and the second is to use the resources in your box to create the recovery plan for the specific incident.
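One way to picture keeping the box current is a small inventory that records when each resource was last touched during normal operations and flags anything that has gone unused. The sketch below is only an illustration: the `BoxItem` fields, the file paths, the dates, and the 90-day threshold are all hypothetical choices, not part of any particular tool or regulation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BoxItem:
    name: str        # what the resource is, e.g. "license keys"
    location: str    # where the production copy lives (hypothetical paths)
    last_used: date  # last time normal operations touched it


def stale_items(items, today, max_age_days=90):
    """Return the names of items not used within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    return [item.name for item in items if item.last_used < cutoff]


# Hypothetical inventory; a real box would hold many more entries.
box = [
    BoxItem("vendor contact numbers", "box/contacts.txt", date(2024, 5, 1)),
    BoxItem("license keys", "box/licenses/", date(2024, 1, 10)),
    BoxItem("critical configurations", "box/configs/", date(2024, 5, 20)),
]

# Items that normal operations have not touched recently deserve a review.
print(stale_items(box, today=date(2024, 6, 1)))  # → ['license keys']
```

An item that shows up as stale is a hint that it is not really part of day-to-day operations, which is exactly the kind of resource that tends to be out of date when a disaster arrives.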
It may sound like I am suggesting that all you need is a "box" and that there is no need for an actual recovery plan, but that is not the case. There is great value in having a recovery or business continuity plan, but the role of such a document should be refined. The most important things such a plan can contain are the documented priorities of the business, rules for declaring a disaster, the chain of command during a disaster, intra-team communication protocols, and expectations for the disaster team. This document should be reviewed and updated frequently by leadership across the organization. Given its reduced scope relative to traditional, technically focused disaster recovery plans, it is not a difficult document to maintain. Each of these items is easily understood by both business and technology and helps facilitate meaningful conversation about priorities and recovery.
Moving to a governance-focused continuity plan coupled with a "box" is a great way to make disaster preparation a part of your organization's normal operation and culture. It also gives you a way to respond to disasters that don't follow the plan. As Mike Tyson famously stated, "Everyone has a plan until they get punched in the mouth." Having a box should help if you ever do get punched in the mouth.