Developing Seamless Business Continuity and Disaster Recovery Plans
By Dr. Jim Kennedy.
The development of recovery times for both the business organization’s business continuity plan and the IT department’s disaster recovery plan need to be developed through the collaboration of both parties for either plan to provide the proper protection. However in my thirty-five years in the business continuity and resiliency field I have found in many situations they are not.
The reasons for this can be timing or a lack of knowledge of the overall business continuity and/or disaster recovery planning process coupled with a lack of understanding of each other’s real recovery timing needs.
The purpose of this article is to provide a framework in which the recovery time objectives (RTOs) for the business continuity and the disaster recovery plan can be developed together.
Reason for inconsistencies and failures
Generally the drivers for business continuity and disaster recovery planning are considered to be one and the same, but this is not always the case. Many times the very design process for IT infrastructure requires that the IT organization develop disaster recovery planning thoughts and plans early in the application and/or systems development process. So, early in the project’s timescale of the development of a new application or system, IT must have some understanding of what kind of recovery timing and recovery point timing will be needed to support the technology to be deployed. IT will try to obtain the RTO and RPO (recovery point objective) numbers, but the business is most often focused on insuring that the deployment of the new business process or function is rolled out on time and within budget. The business organization is not thinking about business continuity planning at this time. So, IT will take it on itself to develop a best guess of the required recovery times either based on conversations with the business organization or on its own, if the latter cannot or will not commit to a number.
In other cases that I have seen, there is a clear lack of knowledge about business continuity and disaster recovery planning. Each organization knows that they need either a business continuity or a disaster recovery plan but they are not trained in the overall steps in developing such plans. As such the business organization does not understand the risks, tradeoffs, and costs involved in developing a proper business continuity plan. The business organization also often does not understand that it needs to properly analyze the operation to better understand the recovery requirements during the process/systems/application development phase of the systems/process development life cycle or, as ITIL defines it, the application life cycle (ALC). The business organization needs to quantify the impacts of loss of that process or system; and may not be sure of the right questions to ask - not only in terms of loss of productivity, but in terms of costs to process manually in case of a system loss or failure. Can the organization develop and use manual processes at all if the system or IT infrastructure fails? Does the organization have the human resources to perform the necessary manual processes or will they need to bring in contingent workers and for how long and for what cost? Every business organization needs to clearly understand and to articulate their operation’s maximum tolerable period of disruption (MTPD).
MTPD is the maximum time an activity or resource can be unavailable before irreparable harm is caused to the organization. This applies to both customer-facing and internal activities. Note that the recovery time objective specifies the time by which an organization intends to recover an activity or resource: the maximum tolerable period of disruption is the upper bound on this time.
The business needs to utilize the MTPD to develop its processes and contingency processes, and the IT organization need to understand the MTPD to properly develop its technology and RTO which, in turn, will enable the business to achieve its RTO objectives.
At the same time, IT needs to utilize the recovery time numbers developed by the business organization as a basis for its system and infrastructure RTO values.
Standards and planning process
There are so many business continuity and disaster recovery standards to choose from, as well as other related standards of practice, that this might be the reason for all of the confusion. The fact that none of these standards really talk of integrating the business recovery and the IT technology recovery plans together in to the overall process or application development life cycle complicates the matter even further.
There is also the issue that business continuity and/or disaster recovery planning classes are usually only electives in business administration or computer technology/information systems curriculum. So we are not exactly preparing our next batch of business or technology leaders to properly understand the methods, or importance, of contingency planning.
All that being said, most of the standards that exist do have a pretty consistent set of predefined steps to be reasonably successful. So if we take all of the contingency planning steps and align them with the ITIL ALC phases the planning cycle will integrate system development with continuity planning together at the best possible time in the development process.
I will outline the steps below in developing business continuity and disaster recovery plans with their corresponding points within the ITIL application development life cycle:
|STEPS IN BUSINESS CONTINUITY AND DISASTER RECOVERY PLANNING||ITIL APPLICATION LIFE CYCLE PHASES|
|1) Understand the Organization
a. Risk Assessment
b. Business Impact Assessment
i. Determine MTPD for operation
ii. Develop RTO for Critical Systems
iii. Develop RPO for Critical Systems
|Requirements – requirements gathered based on business needs of the organization|
|2) Evaluate and Determine Strategy
a. BC strategy to meet RTO/RPO
b. DR strategy to meet RTO/RPO
|Design – requirements translated into specifications|
|3) Develop Plans
a. BCP – Business Organization
b. DRP –IT Organization
|Build – Application and the operational model are made ready for deployment|
|4) Exercise Plan||Operate -- IT operates the application as part of the business service|
|5) Audit and Maintain Plan||Optimize|
Using the standards and good practices
During the requirements gathering phase of the ITIL ALC the business owner should have also conducted the risk assessment and business impact analysis or BIA. The results of these two activities allow the business owner to clearly see the impact on the business of a failure or discontinuation of operations in either, or both, of the business or IT operations. They can then translate that knowledge from the risk assessment and business impact analysis into quantifiable RTO and RPO numbers to be used in the next phase of business continuity and disaster recovery planning (Evaluate and Determine Strategy) and the Design phase of the ITIL ALC.
The RTO and RPO numbers are used to develop alternative strategies that meet the recovery time and point needs. A cost for each alternative design is developed. The cost is the total of the IT cost to design, implement, build and operate; and the business cost for any workarounds or special handling during the outage period; plus costs to load any transactions processed during that outage period into the system (processing resynchronization) after they are brought back on-line and are processing again as before the incident.
The alternative strategies are then looked at using a cost and benefit (time, reduced workaround complexity, and etc.) analysis of each alternative. The best option will accomplish return to operation in a reasonable time with an acceptable cost to the business and IT. However, the alternative selected will require input from both IT and the business to properly address the risk of outage. The business will need to insure that it can perform the workarounds and still meet all of the business, regulatory and audit needs of the operation for the time period that the alternative defines the IT organization to need for restoring the IT systems needed to restart the application and its associated services.
For the plans to be effective and ‘fit for purpose’ it is very important that the business and IT are on the ‘same sheet of music’ as to recovery times and points. It is no good if the business has planned its resources and workarounds expecting a system recovery time of 24 hours only to find that the system will be down for 48 hours. On the other side of the coin it is not fiscally responsible to pay the cost to expedite the recovery time of an IT system to less than four hours if the business can tolerate an outage period of 24 hours or more at much less cost for the final IT solution.
Once it has been concluded that both plans are consistent with each other, the actual plans can be developed. While the business prepares for implementation of the new application and/or service, IT will make ready the systems and infrastructure needed to also meet the business schedule for implementation.
Exercising the plans
There is one caveat, however. Even if both sides have planned together and developed their plans based on a single and consistent recovery time, the two planning activities still need to verify (via exercising the plans together) that the IT recovery timing (the disaster recovery plan which includes hardware restoration, software restoration, synchronization of databases, and etc.) actually comes in on time to meet the business’ needs as provided for in the business continuity plan.
Only in testing and timing the two recovery processes to ensure that they are coincident can an organization truly be confident that the overall plans will be successful.
Dr. Jim Kennedy, MRP, MBCI, CBRM, CHS-IV, CRISC has a PhD in Technology and Operations Management and is the chief consulting officer for Recovery-Solutions. Dr. Kennedy has over 30 years' experience in the information security, business continuity and disaster recovery fields and has been published nationally and internationally on those topics. He is the co-author of three books, ‘Security in a Web 2.0 World – a standards based approach,’ ‘Blackbook of Corporate Security’ and ‘Disaster Recovery Planning: An Introduction’ and is author of the e-book, ‘Business Continuity & Disaster Recovery – Conquering the Catastrophic’. Dr. Kennedy can be reached at [email protected]