What Happened
On July 18th, a subset of Azure customers experienced issues with services in the Central US region. The incident impacted Azure Virtual Machine availability, which in turn affected downstream services including, but not limited to, Azure App Service, Azure AD, and Azure Cosmos DB. The issue originated from an incomplete update to the ‘Allow List’ for Storage Scale Units, which failed to include necessary network address information for numerous VM hosts. Azure has released a preliminary post-incident review (PIR), with the final PIR, including the root cause, expected within 14 days of the original outage. You can read the official details here.
Planning for Failure
People, processes, and technologies fail. In the age of so many connected devices, systems, and services, the critical task at hand is elevating your infrastructure tolerance to eliminate single points of failure while also providing a means of recovery in regional outage scenarios. Historically, this meant having two of everything in the Data Center world. If one network device fails, you have another network device that is highly available and configured with the appropriate First Hop Redundancy Protocol (FHRP) or routing protocols like Border Gateway Protocol (BGP). This example eliminates single points of failure by deploying redundant sets of hardware and leveraging redundant peerings.
Furthermore, when considering Disaster Recovery (DR), multiple data centers would exist in either an active/standby or active/active scenario to prevent or limit impact and restore services and business operations quickly.
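The benefit of "two of everything" can be put in rough numbers. The sketch below assumes that component failures are independent, which real deployments only approximate, and the 99% figure is purely illustrative rather than a vendor SLA. The function names are hypothetical.

```python
def parallel_availability(availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumes independent failures; correlated failures make the real
    number lower than this estimate.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    single = 0.99  # illustrative figure, not a measured value
    pair = parallel_availability(single, 2)
    print(f"single device:  {single:.4%}")
    print(f"redundant pair: {pair:.4%}")
```

The serial function is a reminder that a single non-redundant dependency anywhere in the path caps the availability of the whole chain, no matter how much redundancy surrounds it.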
Switching Gears to Cloud
In the world of the cloud, we can use availability zones to reach high availability. These availability zones represent physically and logically separated data centers. To achieve DR capabilities in the cloud, you would leverage multiple cloud regions, each of which contains multiple availability zones.
To reduce or eliminate the impact of the recent Azure Central US outage, not only would the components of your application need to be available across multiple regions, but so would all the supporting infrastructure. It isn’t enough to deploy and forget. Network infrastructure especially must be designed and configured so that, when a regional failure happens, consumers of services aren’t even aware that an outage is happening.
How Alkira Delivers Enterprise-Grade Resiliency
Getting High Availability and Disaster Recovery right can be daunting. Many times, organizations are playing from behind because of the significant engineering and time commitment required from the beginning. Alkira’s Network Platform provides Infrastructure On-Demand with high availability built-in. Alkira’s Cloud Exchange Points (CXPs) span multiple availability zones by default. What does this mean for our customers?
High Availability Made Simple
When a customer connects an Azure VNet to Alkira, it is multihomed to an Alkira CXP in multiple availability zones by default, eliminating single points of failure for networking within a single region. Since Alkira’s architecture runs active/active, there is no perceived impact on the customer even when an availability zone fails. Many applications in the enterprise today also have dependencies that are still hosted in on-premises facilities. Any network connected to Alkira is highly available by default, maximizing uptime for applications and services as a whole, regardless of which on-premises location or cloud is in scope.
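To make the active/active idea concrete, here is a minimal sketch of spreading flows across every healthy zone and letting them move automatically when a zone fails. It illustrates the general pattern only and is not Alkira's implementation; the zone names, the flow identifier format, and the function `pick_zone` are all hypothetical.

```python
import hashlib


def pick_zone(flow_id: str, zones: dict[str, bool]) -> str:
    """Choose a healthy zone for a flow by hashing over the healthy set.

    `zones` maps a zone name to True (healthy) or False (failed).
    All healthy zones carry traffic at once, which is what makes the
    design active/active rather than active/standby.
    """
    healthy = sorted(name for name, up in zones.items() if up)
    if not healthy:
        raise RuntimeError("no healthy availability zone")
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy)
    return healthy[index]


if __name__ == "__main__":
    zones = {"az-1": True, "az-2": True}  # hypothetical zone names
    flow = "10.0.0.5:443->10.1.0.9:51514"
    print("normal operation:", pick_zone(flow, zones))

    zones["az-1"] = False  # simulate a zone failure
    print("after az-1 fails:", pick_zone(flow, zones))
```

A simple modulo hash can remap some flows that were on a surviving zone when the healthy set changes; a consistent hashing scheme would reduce that churn, at the cost of a little more code.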
Network DR – On-Demand
In today’s complicated compliance and regulatory landscape for business applications, organizations need a reliable method of cross-region redundancy. For business-critical applications that are available across multiple regions, Alkira provides the means to build your network to withstand cloud region failures. When customers onboard a new network on Alkira’s platform, they have the option to set this up with a few mouse clicks. If all paths to an Alkira CXP become unavailable, the network in scope would have connectivity into an alternate Alkira CXP in a different region. This backup connection remains disabled until a CXP regional failure is triggered.
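The failover behavior described above can be sketched as a small decision function: the backup region is used only when every path to the primary is down. This is an illustrative model under assumed names (`Cxp`, `active_cxp`, and the region labels are hypothetical), not a description of how the Alkira platform is implemented.

```python
from dataclasses import dataclass


@dataclass
class Cxp:
    """A simplified exchange point with a set of monitored paths."""
    name: str
    region: str
    paths_up: list[bool]

    def reachable(self) -> bool:
        # The CXP counts as reachable while at least one path is up.
        return any(self.paths_up)


def active_cxp(primary: Cxp, backup: Cxp) -> Cxp:
    """Return the CXP that should carry traffic right now.

    The backup stays idle until all paths to the primary are unavailable.
    """
    if primary.reachable():
        return primary
    if backup.reachable():
        return backup
    raise RuntimeError("neither primary nor backup CXP is reachable")


if __name__ == "__main__":
    primary = Cxp("cxp-a", "region-1", [True, True])   # hypothetical names
    backup = Cxp("cxp-b", "region-2", [True, True])

    print(active_cxp(primary, backup).name)  # one path down is not a failover
    primary.paths_up = [False, False]        # simulate a regional failure
    print(active_cxp(primary, backup).name)
```

Note that losing a single path to the primary does not trigger failover in this model; only the loss of all paths does, which matches the condition stated in the text.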
Optional Inter-Cloud Redundancy
Alkira offers its customers the flexibility to choose which Cloud Service Provider (CSP) to deploy CXPs on. In some instances, it may be desirable to deploy critical infrastructure and services in multiple CSPs to increase diversity in the event a single provider experiences a prolonged regional outage. In this scenario, on-premises and cloud network traffic can still route through the backup Alkira CXP, which is hosted on the secondary CSP’s infrastructure.
Value Driven Optimization
The Hybrid + Multi-Cloud world has thrust Network Engineers into a web of new vendor capabilities, limitations, designs, and processes to understand. A fundamental design principle for the next generation of networking, which must also serve new technologies like Artificial Intelligence, is that what matters is not how much infrastructure you deploy, but how you deploy it. You could deploy a poorly designed cloud network infrastructure with tons of the latest hardware and software, and your performance, high availability, and operations would still suffer.
At Alkira, we believe in following the best practices and design principles provided by our partner infrastructure providers and cloud service providers while adhering to first-principle network engineering practices. Our customers see the outcomes of this effort every day in a network built and optimized to provide the best performance and reliability at the right cost.