Considerations for establishing and testing a redundant site
Resilience against failure and recovery from failure are two ongoing goals for the IT manager to plan for. Finding the right mix of both is important for the business. Spend too much and IT becomes a burden for the business; spend too little and the business may experience downtime due to system failures or delays in recovering data. In this article we discuss recovery from failure and the use of a redundant site. I hope you find our experience of value.
Recent successful testing of a redundant IT site for a client got me thinking about what makes a good redundant (or secondary) site solution and how to operate it successfully. For our recent test, the primary IT site, which replicates constantly to the secondary site, was shut down and the secondary site was brought up as the live site. The test included running the secondary site (in the role of the primary site) over several weeks of day-to-day operations, including end-of-month financial processing. Operations were then failed back to the primary site. More than 50 live production servers were involved in the test.
This comprehensive test of the customer's DR site produced the following items to consider when creating a redundant site and testing a failover plan:
The DR document is a living document. Testing every month is expensive. Test only once every 3 years and the learnings from the test are forgotten as people move on and circumstances change. We have found 12 months to be the ideal frequency. Each annual test has produced a series of learnings that keep pace with changes and growth in the environment. Issues and solutions are documented as part of the ongoing plan.
DR infrastructure is for DR
It’s tempting to use spare DR resources such as servers and storage for purposes other than DR. Production, Development or Financial teams would love to get their hands on more resources. The risk here is that a disaster will occur without warning and require the use of these computing resources. The DR resources have to be a dedicated no-go space, reserved for DR purposes only.
The primary and secondary sites are ideally mirror images of each other. Tools from server virtualisation vendor VMware allow a virtual server environment to replicate data between sites. Consider how you would replicate items that cannot simply be copied across, such as USB-based software licenses.
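One way to keep an eye on how closely the two sites mirror each other is to compare an inventory of each site and report the differences. The following is a minimal sketch only: the inventory structure, the server names and the function name `find_site_drift` are hypothetical and chosen for illustration, and in practice the inventories would come from whatever tooling your environment already uses.

```python
def find_site_drift(primary: dict, secondary: dict) -> dict:
    """Compare two site inventories keyed by server name.

    Each inventory maps a server name to a dict of attributes
    (hypothetical fields such as "cpus" or "memory_gb").
    """
    primary_names = set(primary)
    secondary_names = set(secondary)
    shared = primary_names & secondary_names
    return {
        # Servers that exist at the primary site but have no counterpart.
        "missing_at_secondary": sorted(primary_names - secondary_names),
        # Servers that exist only at the secondary site.
        "extra_at_secondary": sorted(secondary_names - primary_names),
        # Servers present at both sites whose attributes differ.
        "mismatched": sorted(n for n in shared if primary[n] != secondary[n]),
    }


# Hypothetical example inventories.
primary_site = {
    "finance-db": {"cpus": 8, "memory_gb": 64},
    "file-server": {"cpus": 4, "memory_gb": 16},
    "license-server": {"cpus": 2, "memory_gb": 8},
}
secondary_site = {
    "finance-db": {"cpus": 8, "memory_gb": 64},
    "file-server": {"cpus": 4, "memory_gb": 8},
}

drift = find_site_drift(primary_site, secondary_site)
print(drift)
```

In this made-up example the report would flag `license-server` as missing at the secondary site and `file-server` as mismatched. A comparison like this only covers what is recorded in the inventory; physical items such as USB license keys still need their own line in the DR plan.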
A redundant network link that keeps the DR site reachable is ideal. Single points of failure are tough to eliminate. Things to consider in the network design are the duplication of firewalls, authentication servers, DMZ servers and cloud-based services.
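A simple checklist can help spot components that have not been duplicated. The sketch below is illustrative only: the role names, the site labels and the function `roles_not_duplicated` are assumptions made for this example, not part of any particular product.

```python
def roles_not_duplicated(components, roles, sites=("primary", "secondary")):
    """Return the roles that are not present at every site.

    components: iterable of (role, site) pairs describing what is deployed.
    roles: the roles that are expected to exist at each site.
    The result maps each incomplete role to the sites where it is missing.
    """
    present = {role: set() for role in roles}
    for role, site in components:
        if role in present:
            present[role].add(site)

    gaps = {}
    for role, found in present.items():
        missing = set(sites) - found
        if missing:
            gaps[role] = sorted(missing)
    return gaps


# Hypothetical deployment: the authentication server exists only at the primary site.
deployed = [
    ("firewall", "primary"),
    ("firewall", "secondary"),
    ("authentication_server", "primary"),
    ("dmz_server", "primary"),
    ("dmz_server", "secondary"),
]
expected_roles = ("firewall", "authentication_server", "dmz_server")

gaps = roles_not_duplicated(deployed, expected_roles)
print(gaps)  # {'authentication_server': ['secondary']}
```

A check like this is only as good as the list of roles fed into it, so it is worth revisiting that list after each failover test.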
Document the Process
Each DR failover and failback is an opportunity to learn. Systems grow, and versions of network and virtualisation software keep marching forward. Personnel change over time. Comprehensive documentation captures each obstacle and its solution so that they can be used to improve the quality and speed of the next recovery, real or test.
Vendor Support, Good Communications
IT systems are made up of hardware and software from multiple vendors. No single vendor, systems integrator, data centre operator or telco has all the answers. Ensure support for the DR failover test via a rigorous change control process, good communication and clear lines of leadership. A command bridge is a helpful tool for solving issues quickly across multiple stakeholders and providers.
We have explored some aspects of the benefits and operation of a redundant IT site. Benefits can include protection of data and, in turn, of revenue, safety and reputation. The right combination of business process and technology can create a substantial business benefit and maximise your IT investment.