Designing your Infrastructure for Failover

Don’t bet on a single horse

Design your IT infrastructure in such a way that it can survive major outages (the easy way)

If you run a webshop, work in ad serving, or operate in another industry that requires a continuous online presence, you know that long downtimes put your business at risk. This means you must design a platform that lets you respond to failures caused by external factors you cannot control, from the loss of a (virtual) server due to underlying hardware failure to something as significant as a total data center outage.

You should therefore have a good (and tested!) Disaster Recovery strategy as part of your Business Continuity planning. Your strategy should address the possible failures, the designed mitigations, and the remediation. But it should not be too complex, because complexity increases the risk of unforeseen failures when you fail over from your standard infrastructure to your Disaster Recovery (DR) infrastructure.

For example, (virtual) servers and databases can be replicated to separate server infrastructure, even to a secondary data center site that is entirely independent of the primary site. In most cases, the server or data replication should be near real-time, so that no valuable changes are lost in case of an incident. Various tools make this possible, from local (virtual) server replication software to SAN storage replication.
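As a rough illustration of what "near real-time" means in practice, the sketch below checks how far a replica is behind and whether that gap stays within an acceptable data-loss window. The function names, and the idea of expressing the window as a recovery point objective (`rpo`), are assumptions made for this example; how you obtain the timestamp of the last replicated change depends on your replication tool.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_applied_at: datetime, now: datetime) -> timedelta:
    """Return how far the replica is behind the primary (never negative)."""
    return max(now - last_applied_at, timedelta(0))


def within_rpo(last_applied_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failover right now would lose at most `rpo` worth of changes."""
    return replication_lag(last_applied_at, now) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: the replica last applied a change 30 seconds ago.
    now = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
    last_applied = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(within_rpo(last_applied, now, rpo=timedelta(seconds=60)))
```

A check like this could run periodically and raise an alert when the lag exceeds the window you are willing to lose.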

To keep replicated databases consistent, don’t rely on server or storage replication. Instead, use the replication tools provided by the database itself to keep the databases consistent and in sync.

How do you fail over to an independent secondary site for DR?

We have seen in the past with other hosting providers that, even when a platform is designed to be fully redundant across different data centers/availability zones, the network can remain a single point of failure. If both data centers share one network and that network faces a major issue, your environment will likely be impacted in both data centers. That’s why the secondary site you choose must also be network-independent of your primary data center.

Use a network-independent secondary data center of your current hosting provider. This may also include failing over your public IP addresses between your primary and secondary sites. Or find a different hosting provider for the secondary DR site if the secondary data center of your current hosting provider doesn’t have that independence.

There is, however, a challenge to being network-independent across different providers: the public IPs you received from your hosting provider usually cannot be used in your DR environment at another provider.
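To see which hostnames are affected by this, you can list the A-records that point into the primary provider's address space. The sketch below is a minimal example using Python's standard `ipaddress` module; the function name, the hostnames, and the address ranges (taken from the reserved documentation ranges) are all hypothetical.

```python
import ipaddress


def records_tied_to_provider(a_records, provider_networks):
    """Return the hostnames whose A-record lies inside one of the provider's ranges."""
    networks = [ipaddress.ip_network(n) for n in provider_networks]
    tied = []
    for hostname, address in a_records.items():
        ip = ipaddress.ip_address(address)
        # An address "in" a network means it belongs to that provider's range.
        if any(ip in net for net in networks):
            tied.append(hostname)
    return sorted(tied)


if __name__ == "__main__":
    # Hypothetical zone contents and primary-provider range.
    a_records = {
        "shop.example.com": "192.0.2.10",
        "static.example.com": "203.0.113.7",
    }
    print(records_tied_to_provider(a_records, ["192.0.2.0/24"]))
```

Every hostname this returns needs a different address once you run at the other provider.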

To solve this challenge the easy way, change the relevant DNS entries (A-records) to include the DR IP addresses. While you switch over to the DR environment, the changed DNS records propagate across the Internet, and access to the platform is restored.
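The DNS switch can be sketched as follows. This is a minimal illustration, not a provider API: the function names, hostnames, and addresses are hypothetical, and the sketch assumes that the primary address is replaced by the DR address rather than listed alongside it. The second function gives a rough upper bound on how long clients may still be sent to the old address, assuming resolvers honor the record's TTL (time to live); resolvers that cache for longer will take longer.

```python
def failover_records(a_records, dr_addresses):
    """Return a copy of the A-records with each hostname that has a DR address switched to it."""
    return {host: dr_addresses.get(host, addr) for host, addr in a_records.items()}


def worst_case_switch_seconds(ttl_seconds, detection_seconds, update_seconds):
    """Rough upper bound on how long clients may still resolve the old address."""
    # Time to notice the outage + time to publish the change + cached record lifetime.
    return detection_seconds + update_seconds + ttl_seconds


if __name__ == "__main__":
    # Hypothetical records: only the shop has a counterpart in the DR environment.
    primary = {"shop.example.com": "192.0.2.10", "static.example.com": "203.0.113.7"}
    dr = {"shop.example.com": "198.51.100.10"}
    print(failover_records(primary, dr))
    print(worst_case_switch_seconds(ttl_seconds=300, detection_seconds=60, update_seconds=30))
```

A short TTL on the records you plan to change keeps this window small, at the cost of more frequent DNS lookups during normal operation.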

Designing and implementing a Disaster Recovery platform doesn’t need to be complex, nor does it require extensive network knowledge. You can be independent of a single (cloud) provider, and if you frequently test your DR, you can rest well at night.