Implementing a disaster recovery (DR) solution depends on three factors: time, resources, and money.
Most organizations don’t even think about DR while their IT infrastructure and applications are running without issues. They think about DR only after something breaks and causes a major negative impact on the business.
If you are a sysadmin, or anyone responsible for keeping IT running, you should be working on disaster recovery constantly. Whether or not your company has allocated time and budget for it, you can still work on some aspects of DR.
The following is a list of items you might want to consider while planning for DR. It is not comprehensive by any means, but it should give you enough ideas to get started.
- Resilient primary datacenter. Before you plan for a secondary remote datacenter, make sure all the components in your primary datacenter are highly redundant. Your focus should be to design the primary datacenter to be so resilient that you can rebound quickly from most disasters (except natural disasters) without ever having to use the secondary remote datacenter. For example: run a physical standby of your production database in the same datacenter, configure dual NIC and HBA cards on all production servers, put multiple web servers behind a load balancer, and connect each server’s redundant power supplies to two separate power circuits.
- Remote secondary datacenter. If your primary datacenter is resilient, the remote datacenter exists primarily for natural disasters such as earthquakes, fires, and floods. While this might seem obvious, it is worth stating: I’ve seen a few companies place both the primary and secondary datacenters in the same city, which defeats the purpose of DR. If your primary datacenter is in California, set up the secondary datacenter somewhere on the East Coast.
- Replicate production components in the DR datacenter. You don’t need to replicate all of the hardware and applications from the primary to the secondary datacenter. A sysadmin, or any technical person, can quickly identify the critical hardware and software that need to be replicated at the DR site, but you might need help from other departments to identify the applications that are critical for the business. Map the critical business functions to IT infrastructure components, and make sure all of those components, along with the applications and data, are replicated to the DR site.
- Storage plan. If some kind of SAN (or NAS) storage supports the critical applications in the primary datacenter, you need similar SAN (or NAS) storage at the DR site as well. For performance reasons, the production servers at the DR site should have the same spec as those at the primary site. For storage, however, if you have high-end SAN storage from a big vendor at the primary site, consider implementing similar high-end SAN storage from a smaller vendor, which might cost a lot less for the same configuration and similar performance.
- Replicate data to the DR site on an ongoing basis. Syncing data between the primary and disaster datacenters is a critical aspect of a successful DR implementation. Once you’ve listed all the applications that need to be replicated to the DR site, figure out how to sync the data between the two sites for each of them. For example, you can replicate an Oracle database at the block level using replication technologies provided by the storage vendors, or use Oracle Data Guard to replicate the data at the database level. Both have their own pros and cons; analyze them carefully and choose the one that fits your budget and the scope of your DR.
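Whichever replication technology you pick, an incremental sync job ultimately needs to know what changed since the last successful run. As a minimal sketch (in Python, with illustrative function and argument names; a real implementation would hand the result to rsync, storage replication, or Data Guard rather than copy files itself):

```python
import os

def files_to_sync(src_dir, last_sync_epoch):
    """Return paths under src_dir modified after the last successful sync.

    A sketch only: last_sync_epoch would come from the previous job's
    recorded completion time, not be hard-coded.
    """
    changed = []
    for root, _dirs, names in os.walk(src_dir):
        for name in names:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_sync_epoch:
                changed.append(path)
    return sorted(changed)
```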
- Replicate initial data using a manual method. When you set up the DR site for the first time, you have to decide how you’ll do the initial data sync. For example, if you are replicating a huge data warehouse database, you can’t copy the initial backup over the network to the remote site, as it might hog the bandwidth. Instead, take a tape backup at the primary datacenter and use it at the secondary datacenter to set up the initial database. Once the initial setup is done, implement some form of automatic incremental sync between the sites.
- RTO stands for Recovery Time Objective. Working with the management team, identify the acceptable RTO for the business. For example, your organization may decide that the acceptable RTO is 8 hours, i.e., after a disaster, all critical applications should be fully operational at the DR site within a maximum of 8 hours. RTO has a direct impact on how much money is spent implementing the DR solution. An RTO of 1 hour might need a far more sophisticated, and far more expensive, DR solution than a 24-hour RTO.
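Checking an RTO after a failover drill is simple arithmetic. A sketch (function names are illustrative, not from any standard tool):

```python
from datetime import datetime, timedelta

def meets_rto(disaster_at, services_restored_at, rto_hours=8):
    """True if all critical services came back within the agreed RTO."""
    return services_restored_at - disaster_at <= timedelta(hours=rto_hours)

# A disaster at 3 a.m. with services restored by 9:30 a.m. meets an 8-hour RTO.
meets_rto(datetime(2024, 5, 1, 3, 0), datetime(2024, 5, 1, 9, 30))  # True
```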
- RPO stands for Recovery Point Objective. Just like RTO, work with management to decide an acceptable RPO for the business. For example, your organization may decide that the acceptable RPO is 2 hours, i.e., after a disaster, when you fail over to the DR site, 2 hours of data loss is acceptable to the business. If the disaster happens at 3 p.m., the restored system at the DR site will contain production data only as of 1 p.m., so you’ve lost the data from 1 p.m. to 3 p.m. In simple terms, if your RPO is 2 hours, your business should be willing to lose 2 hours’ worth of data during a disaster.
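The 3 p.m. example works out as follows (a sketch with illustrative names; in practice the timestamp of the last good DR copy would come from your replication logs):

```python
from datetime import datetime

def data_loss_hours(disaster_at, last_good_copy_at):
    """Hours of data lost if we fail over to the DR copy taken at last_good_copy_at."""
    return (disaster_at - last_good_copy_at).total_seconds() / 3600

def meets_rpo(disaster_at, last_good_copy_at, rpo_hours=2):
    return data_loss_hours(disaster_at, last_good_copy_at) <= rpo_hours

# The 3 p.m. disaster from the text, with the DR copy current as of 1 p.m.
data_loss_hours(datetime(2024, 5, 1, 15, 0), datetime(2024, 5, 1, 13, 0))  # 2.0 hours
```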
- Automatic or manual failover? You have to decide whether you want to fail over automatically or manually after a disaster. In most cases, manual intervention is acceptable, as you don’t want to fail over to the DR site based on a false signal. Keep in mind that once you fail over, there is a lot of work involved in coming back to the primary datacenter.
- Network failover. I have seen DR plans that give a lot of focus to data replication but much less to the networking aspects of DR. Working with your networking team, identify all the network infrastructure that needs to be replicated. For example, plan DNS failover to make sure your production URL points to the DR site after the failover. If you have established VPN connections for your customers, identify how to fail over those VPN connections. And when you create or modify firewall rules (or anything security related) at the primary datacenter, you need a way to replicate those security policies to the DR site on an ongoing basis.
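At its core, the DNS part of a failover is repointing the production hostname at the DR site’s address. The toy sketch below models a zone as a plain dictionary; in practice you would make the equivalent change through your DNS provider’s API, ideally with a low TTL on the record so the change propagates quickly:

```python
def failover_dns(zone, hostname, dr_ip):
    """Point the production hostname at the DR site; return the old address.

    zone is a stand-in for your real DNS zone (hostname -> IP address).
    """
    previous = zone.get(hostname)
    zone[hostname] = dr_ip
    return previous

# Example addresses from the documentation ranges, not real sites.
zone = {"www.example.com": "203.0.113.10"}      # primary site
failover_dns(zone, "www.example.com", "198.51.100.20")  # DR site
```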
- Remote hands setup. You need an appropriate plan for accessing the remote datacenter to debug any issues. You can set up a KVM at the DR site to access the console of the hardware there from your office. Alternatively, arrange some form of remote hands service, where someone can physically go to the DR site and carry out your instructions.
- Annual DR testing. Several organizations spend a lot of time and money setting up a DR site, only to discover that it doesn’t work as expected in a real DR situation. Once a year, validate your current DR configuration to make sure the DR site works properly and meets the original objectives. If everything is configured properly, you should be able to manually fail over your critical applications from the primary site to the DR site and let them run there for a few days. This also shows you how the DR site handles real production load.
- DR site as a QA platform. Instead of using the DR site only for disaster situations, you can use it as a QA platform for performance and load testing of your applications. This can be helpful, as you don’t need to invest in additional testing infrastructure in the primary datacenter. With this approach, you’ll still be syncing data from the primary site to the DR site on an ongoing basis. However, whenever you do load testing, you need an additional mechanism: take a checkpoint of the current state of the DR site, perform your QA testing, then immediately roll back to the checkpoint and continue syncing the primary site’s data from there.
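The checkpoint–test–rollback cycle can be sketched as follows. This toy version snapshots a directory with a plain copy; a real DR site would use storage-array snapshots or database flashback features instead:

```python
import os
import shutil
import tempfile

def with_checkpoint(data_dir, destructive_test):
    """Snapshot data_dir, run a destructive QA test, then restore the snapshot."""
    workspace = tempfile.mkdtemp(prefix="dr-checkpoint-")
    snapshot = os.path.join(workspace, "snap")
    shutil.copytree(data_dir, snapshot)        # take the checkpoint
    try:
        destructive_test(data_dir)             # QA/load test may trash the data
    finally:
        shutil.rmtree(data_dir)                # roll back to the checkpoint
        shutil.move(snapshot, data_dir)
        shutil.rmtree(workspace, ignore_errors=True)
```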
- DR response plan. Once you’ve implemented the DR site, you need a clear plan for how you and your team will respond when there is a real disaster. Collaborate with the various departments in your organization, identify the key people who will be part of the DR response team, and define each person’s specific role. The DR response plan is a simple set of step-by-step instructions covering what needs to be done during a disaster, who will perform each task, and in what sequence the tasks will be performed.
- Don’t have a DR site? A lot of organizations don’t have a DR site. If you work for one of them and are responsible for critical applications and IT infrastructure, it is your responsibility to come up with a DR plan, educate top management on the importance of spending time and money on DR, and get their approval. Come up with three different DR plans: one that costs $$$, one that costs $$, and one that costs $. As explained earlier, a DR plan can vary depending on various factors, and cost is one of them. If management still doesn’t approve after you’ve presented your detailed DR plans, at least you’ll know you gave it your best shot.
- When to declare a disaster? You need to identify this clearly ahead of time, with written criteria for when you’ll switch to the DR site. What conditions trigger a DR failover? At what point do you declare a disaster and start failing over to the DR site? The answers to these questions should be clearly defined, reviewed by every department in your organization, and finally approved by top management. For some organizations, production being down because someone deleted something by mistake might not trigger a DR; they are better off restoring the data from backup at the primary site itself. For others, the business cannot wait for a restore from backup, and they need to switch to the DR site.
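Even though the decision is ultimately made by people, writing the criteria in an executable form removes ambiguity during an incident. A sketch (all thresholds and parameter names are illustrative, to be replaced by your organization’s approved criteria):

```python
def should_declare_disaster(outage_minutes, primary_site_reachable,
                            restore_eta_minutes,
                            max_tolerable_outage_minutes=240):
    """Apply the written DR trigger: fail over only when the primary site
    cannot be recovered within the business's tolerance."""
    if not primary_site_reachable:
        return True  # e.g. the whole site is gone: clear DR case
    # Accidental deletion with a quick restore stays on the primary site.
    return outage_minutes + restore_eta_minutes > max_tolerable_outage_minutes
```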
- Backup, backup, and backup. Backups are a very important factor in a DR plan. As mentioned earlier, your goal should be to never use the DR site unless a real natural disaster happens, so a strong backup strategy at your primary site is critical. Back up all your critical applications, and store database backups in multiple locations. Backups are pretty much useless if you don’t restore them on a test server on an ongoing basis to validate that they actually work.
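Validating a restore doesn’t have to be elaborate; even comparing checksums of the original and the restored copy catches silent corruption. A minimal sketch:

```python
import hashlib

def sha256_of(path):
    """Checksum a file in chunks so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_is_valid(original_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```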
- Applying patches. When you apply OS patches, upgrade firmware, or perform any kind of configuration management on the hardware at the primary site, you need a strategy for doing the same at the DR site on an ongoing basis. You don’t want to be in a situation where the OS configuration at your primary site differs from the DR site.
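Drift between the sites is easy to detect if you regularly compare inventories (package versions, firmware levels, config checksums) collected from both. A sketch, assuming hypothetical inventory dictionaries mapping package name to version:

```python
def config_drift(primary_packages, dr_packages):
    """Report packages whose versions differ between primary and DR hosts.

    Returns {package: (primary_version, dr_version)}; None marks a package
    missing on that side.
    """
    drift = {}
    for pkg, ver in primary_packages.items():
        if dr_packages.get(pkg) != ver:
            drift[pkg] = (ver, dr_packages.get(pkg))
    for pkg, ver in dr_packages.items():
        if pkg not in primary_packages:
            drift[pkg] = (None, ver)
    return drift
```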
- Successful DR depends on many factors: top management’s blessing, adequate budget allocation, involvement from all business divisions, a strong DR plan, strong technical resources, and a fully tested DR implementation. Most importantly, a well-defined DR scope that is aligned with the business objectives is critical for a successful DR.
- DR documentation. Proper DR planning requires a lot of processes to be established, and all of them should be documented properly. For example: a document that explains the escalation process when a disaster strikes; technical documentation that explains how to fail over to the DR site; a communication document that lists all the team members involved in DR, what they are responsible for, and how they can be contacted during a disaster; a document for the customer support team covering what to communicate to customers and how to reach them; and a DR testing document that lists everything the QA team needs to test after the DR site goes live.
We’ve only scratched the surface of disaster recovery; there is a lot more to it than the items above. If you are a sysadmin, or someone responsible for your IT applications and infrastructure, and you don’t have a DR plan, consider this a reminder to get started on your DR strategy.