Saturday, 24 July 2021

Advancing global network reliability through intelligent software - Part 2

Read Advancing global network reliability through intelligent software part 1.

In part one of this networking post, we presented the key design principles of our global network and explored how we emulate changes, run zero-touch operations and change automation, and plan capacity. In part two, we start with traffic management. For a network at our scale, traditional hardware-managed traffic routing would not be efficient. Instead, we have developed several software-based solutions to intelligently manage traffic engineering in our global network.

SDN-based Internet Traffic Engineering (ITE)

The edge is the most dynamic part of the global network, because the edge is how users connect to Microsoft’s services. We have strategically deployed edge PoPs close to users to reduce latency for customers and extend the reach of Microsoft cloud services.


For example, if a user in Sydney, Australia accesses Azure resources hosted in Chicago, USA, their traffic enters the Microsoft network at an edge PoP in Sydney and then travels on our network to the service hosted in Chicago. The return traffic from Azure in Chicago flows back to Sydney on the Microsoft network. By accepting and delivering traffic at the point closest to the user, we can better control performance.


Each edge PoP is connected to tens or hundreds of peering networks. Routes between our network and providers’ networks are exchanged using the Border Gateway Protocol (BGP). BGP best-path selection has no inherent concept of congestion or performance, nor is BGP capacity aware. So, we developed an SDN-based Internet Traffic Engineering (ITE) system that steers traffic at the edge. The entry and exit points are dynamically altered based on the traffic load at the edge, internet partners’ capacity constraints, reductions or augments in capacity, demand spikes (sometimes caused by distributed denial of service attacks), and the latency performance of our internet partners. The ITE controller constantly monitors these signals and alters the routes we advertise to our internet partners and/or the routes advertised inside the Microsoft network, to select the best peer edge.
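
To make the idea concrete, here is a minimal sketch of how a controller could combine these signals to choose an exit. It is an illustration only: the PeerEdge fields, scoring weights, and thresholds are invented for the example and do not describe the ITE implementation.

```python
# Minimal, hypothetical sketch of an ITE-style scoring loop; names and
# weights are illustrative, not Microsoft's implementation.
from dataclasses import dataclass

@dataclass
class PeerEdge:
    name: str             # e.g. "sydney-pop-peer1"
    utilization: float    # current utilization toward this internet partner, 0.0-1.0
    capacity_gbps: float  # capacity toward this internet partner
    latency_ms: float     # measured latency through this exit

def score(edge: PeerEdge, demand_gbps: float) -> float:
    """Lower is better: penalize exits that would congest or add latency."""
    projected = (edge.utilization * edge.capacity_gbps + demand_gbps) / edge.capacity_gbps
    congestion_penalty = 1000.0 if projected > 0.9 else projected * 100.0
    return congestion_penalty + edge.latency_ms

def pick_best_exit(candidates: list[PeerEdge], demand_gbps: float) -> PeerEdge:
    # A real controller would enforce the choice by altering BGP advertisements
    # (withdrawals, prepends, selective announcements); here we only pick the edge.
    return min(candidates, key=lambda e: score(e, demand_gbps))

edges = [
    PeerEdge("syd-peer-a", utilization=0.85, capacity_gbps=100, latency_ms=2.0),
    PeerEdge("syd-peer-b", utilization=0.40, capacity_gbps=100, latency_ms=3.5),
]
print(pick_best_exit(edges, demand_gbps=10).name)  # -> syd-peer-b
```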

Optimizing last mile resilience with Azure Peering Service


In addition to optimizing routes within our global network, Azure Peering Service extends optimized connectivity to the last mile in the networks of Internet Service Providers (ISPs). Azure Peering Service is a collaboration platform with providers that enables reliable, high-performing connectivity from users to the Microsoft network. The partnership ensures local and geo redundancy, and proximity to end users. Each peering location is provisioned with redundant and diverse peering links. Providers also interconnect at multiple Microsoft PoP locations, so that if one of the edge nodes has degraded performance, traffic routes to and from Microsoft via alternative sites. Internet performance telemetry from Map of Internet (MOI) drives traffic steering for optimized last-mile performance.

Route Anomaly Detection and Remediation (RADAR)


The internet runs on BGP. A network, or autonomous system, is bound to trust, accept, and propagate the routes advertised by its peers without questioning their provenance. That is the strength of BGP: it allows the internet to update quickly and heal failures. But it is also its weakness: the path to prefixes owned by a network can be changed, by accident or malicious intent, to redirect, intercept, or blackhole traffic. Such incidents happen to every major provider, and some make front-page news. We developed a global Route Anomaly Detection and Remediation (RADAR) system to protect our global network.

RADAR detects and mitigates hijacks of Microsoft routes on the internet. A BGP route leak is the propagation of routing announcements beyond their intended scope; RADAR detects route leaks both in Azure and on the internet. It can distinguish stable from unstable versions of a route and validate new announcements. Using RADAR and the ITE controller, we built real-time protection for Microsoft prefixes. The Peering Service platform extends this route monitoring and protection against hijacks, leaks, and any other BGP misconfiguration (intended or not) into the last mile, up to the customer location.
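
The following toy example shows the general shape of origin-hijack detection: compare each observed announcement against a baseline of protected prefixes and their legitimate origin autonomous system. The prefix, ASNs, and rules are placeholders; RADAR’s actual signals, baselines, and validation logic are far richer and are not described here.

```python
# Toy origin-hijack / more-specific check; baseline and ASNs are illustrative.
import ipaddress

# Hypothetical baseline: protected prefixes and their legitimate origin ASN.
EXPECTED_ORIGIN = {
    "203.0.113.0/24": 8075,  # documentation prefix used as a stand-in
}

def is_more_specific(prefix: str, covering: str) -> bool:
    p, c = ipaddress.ip_network(prefix), ipaddress.ip_network(covering)
    return p != c and p.subnet_of(c)

def classify_announcement(prefix: str, as_path: list[int]) -> str:
    """Classify one observed BGP announcement against the baseline."""
    origin = as_path[-1]  # rightmost AS in the path originated the route
    if prefix in EXPECTED_ORIGIN:
        return "ok" if origin == EXPECTED_ORIGIN[prefix] else "possible origin hijack"
    # A more-specific of a protected prefix from an unexpected origin is suspicious,
    # because more-specific routes win best-path selection and attract traffic.
    for known, asn in EXPECTED_ORIGIN.items():
        if is_more_specific(prefix, known) and origin != asn:
            return "possible hijack via more-specific announcement"
    return "prefix not covered by this baseline"

print(classify_announcement("203.0.113.0/24", [3356, 1299, 65001]))  # possible origin hijack
print(classify_announcement("203.0.113.128/25", [3356, 65001]))      # more-specific case
```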

Software-driven Wide Area Network (SWAN)


The backbone of the global network is analogous to a highway system connecting major cities. The SWAN controller is effectively the navigation system that assigns the routes for each vehicle, such that every vehicle reaches its destination as soon as possible and without causing congestion on the highways. The system consists of topology discovery, demand prediction, path computation, optimization, and route programming.
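
As a rough illustration of the path computation and optimization steps (not the SWAN algorithm itself, which solves a global optimization), the sketch below greedily routes each demand on the shortest path that still has spare capacity. The topology, link costs, and demands are invented.

```python
# Greedy stand-in for traffic placement: shortest feasible path per demand.
import heapq

def shortest_path(graph, capacity, src, dst, need_gbps):
    """Dijkstra restricted to links with enough remaining capacity for the demand."""
    dist, prev, done = {src: 0}, {}, set()
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, cost in graph.get(u, []):
            if capacity.get((u, v), 0) < need_gbps:
                continue  # this link cannot carry the demand
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst != src and dst not in prev:
        return None
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

def place_demands(graph, capacity, demands):
    """Place largest demands first, reserving capacity along each chosen path."""
    placement = {}
    for (src, dst), gbps in sorted(demands.items(), key=lambda kv: -kv[1]):
        path = shortest_path(graph, capacity, src, dst, gbps)
        if path:
            for u, v in zip(path, path[1:]):
                capacity[(u, v)] -= gbps  # reserve capacity on every hop
        placement[(src, dst)] = path      # None means a global optimizer must reroute
    return placement

# Invented topology: costs in milliseconds, capacities in Gbps.
graph = {"SYD": [("SEA", 100), ("LAX", 130)], "SEA": [("CHI", 40)], "LAX": [("CHI", 45)], "CHI": []}
capacity = {("SYD", "SEA"): 100, ("SEA", "CHI"): 50, ("SYD", "LAX"): 100, ("LAX", "CHI"): 100}
print(place_demands(graph, capacity, {("SYD", "CHI"): 80}))
# {('SYD', 'CHI'): ['SYD', 'LAX', 'CHI']} -- SEA-CHI lacks capacity for 80 Gbps
```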


Over the last 12 months, the speed at which the controller programs the network improved by an order of magnitude, and its route-finding capability improved twofold. Link failures are like lane closures, so the controller must recompute routes to reduce congestion. The controller uses shared risk link groups (SRLGs) to compute backup routes for use when the primary routes fail. The backup routes activate immediately upon failure, and the controller then gets to work reoptimizing traffic placement. Links that go up and down in rapid succession are held back from service until they stabilize.
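
For illustration, here is a minimal sketch of SRLG-aware backup selection: a candidate backup path is rejected if any of its links shares a risk group with the primary path. The topology, conduit names, and the breadth-first search are invented for the example and are not the controller’s actual algorithm.

```python
# Toy SRLG-disjoint backup selection; topology and risk groups are made up.
from collections import deque

SRLG = {  # link -> shared-risk group IDs (e.g. fiber conduits)
    ("A", "B"): {"conduit-1"}, ("B", "D"): {"conduit-2"},
    ("A", "C"): {"conduit-1"}, ("C", "D"): {"conduit-3"},
    ("A", "E"): {"conduit-4"}, ("E", "D"): {"conduit-5"},
}
GRAPH = {"A": ["B", "C", "E"], "B": ["D"], "C": ["D"], "E": ["D"]}

def srlgs_on(path):
    """All risk groups touched by a path."""
    return set().union(*(SRLG[(u, v)] for u, v in zip(path, path[1:])))

def backup_path(primary, src, dst):
    banned = srlgs_on(primary)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in GRAPH.get(node, []):
            if nxt in seen or SRLG[(node, nxt)] & banned:
                continue  # reuses a risk group of the primary path: skip
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

primary = ["A", "B", "D"]
print(backup_path(primary, "A", "D"))  # ['A', 'E', 'D'] -- A-C shares conduit-1 with A-B
```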

One measure of reliability is the percentage of successfully transmitted bytes relative to requested bytes, measured over an hour and averaged for the day. Ours is 99.999 percent or better for customer workloads. All communication between Microsoft services travels over our dedicated global network. The ThousandEyes Cloud Performance Benchmark reports that over 99 percent of Azure inter-region latencies are faster than the performance baseline, and over 60 percent of region pairs are at least 10 percent faster. This is a result of the capacity augments and software systems described in this post.
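
As a worked example of that metric (with invented sample numbers), the per-hour ratios of delivered bytes to requested bytes are averaged across the day:

```python
# Reliability metric sketch: hourly delivered/requested ratio, averaged for the day.
hourly = [  # (delivered_bytes, requested_bytes) per hour; 24 entries in practice
    (999_992, 1_000_000),
    (1_000_000, 1_000_000),
    (999_995, 1_000_000),
]
daily_reliability = sum(d / r for d, r in hourly) / len(hourly) * 100
print(f"{daily_reliability:.4f}%")  # 99.9996% for these sample hours
```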

Bandwidth Broker—software-driven Network Quality of Service (QoS)


If the global network is a highway system, Bandwidth Broker is the system that controls the metering lights at the highway onramps. For every customer vehicle, there is more than one Microsoft vehicle traversing the highway. Some of the Microsoft vehicles are discretionary and can be deferred to avoid congestion for customer vehicles. Customer vehicles always have a free pass to enter the highways. The metering lights are green in normal operation, but when there is a failure or a demand spike, Bandwidth Broker turns on the metering lights in a controlled manner. Microsoft internal workloads are divided into traffic tiers, each with a different priority. Higher-priority workloads are admitted in preference to lower-priority workloads.

Brokering occurs at the sending host. Hosts periodically request bandwidth on behalf of applications running on them. The requests are aggregated by the controller, bandwidth is reserved, and grants are disseminated to each host. Bandwidth Broker and SWAN coordinate to adjust traffic volume to match routes, and traffic routes to match volume.
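
A simplified sketch of that admission step follows: per-host requests are grouped by traffic tier and granted against the available budget, higher tiers first, with an overflowing tier scaled down proportionally. The tier layout, the proportional-scaling policy, and the numbers are assumptions for illustration; customer traffic is not brokered at all and always passes.

```python
# Hypothetical admission step for internal traffic tiers; policy and numbers
# are illustrative, not the Bandwidth Broker implementation.
from collections import defaultdict

def grant_bandwidth(requests, budget_gbps):
    """requests: (host, tier, gbps) tuples; a lower tier number means higher priority."""
    by_tier = defaultdict(list)
    for host, tier, gbps in requests:
        by_tier[tier].append((host, gbps))

    grants, remaining = {}, budget_gbps
    for tier in sorted(by_tier):                 # admit higher-priority tiers first
        asked = sum(g for _, g in by_tier[tier])
        share = 1.0 if asked <= remaining else remaining / asked
        for host, gbps in by_tier[tier]:
            grants[host] = gbps * share          # scale down if the tier overflows
        remaining -= min(asked, remaining)
    return grants

requests = [
    ("host-a", 1, 40),  # tier 1: latency-sensitive internal service
    ("host-b", 2, 30),  # tier 2: replication
    ("host-c", 3, 50),  # tier 3: discretionary/background
]
print(grant_bandwidth(requests, budget_gbps=80))
# {'host-a': 40.0, 'host-b': 30.0, 'host-c': 10.0} -- the discretionary tier is throttled
```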

It is possible to experience multiple fiber cuts or failures that suddenly reduce network capacity, and geo-replication operations that increase resilience can cause a huge surge in network traffic. Bandwidth Broker generally allows us to preserve the customer experience during these conditions by shedding discretionary internal workloads when congestion is imminent.

Continuous monitoring


A robust monitoring solution is the foundation for achieving higher network reliability. It lowers both the time to detect and the time to repair. The monitoring pipelines constantly analyze several telemetry streams, including traffic statistics, health signals, logs, and device configurations. When anomalies are detected, the pipelines automatically collect more data, and they diagnose and remediate common failures. These automated interventions are also guarded by safety-check systems.

Major investments in monitoring have been:

➤ Polling and ingestion of metrics data at sub-minute speeds. A few samples are needed to filter transients and a few more to generate a strong signal, which leads to faster detection times (a minimal sketch of this transient filtering follows this list).
➤ An enhanced diagnostics system that, when triggered by packet loss or latency alerts, instructs agents at different vantage points to collect additional information to help triangulate and pinpoint the issue to a specific link or device.
➤ Enhanced diagnostics that trigger auto-mitigation and remediation actions for the most common incidents, with the help of Clockwerk and Real Time Operation Checker (ROC). This translates to faster time to repair and has the ripple effect of keeping engineers focused on more complex incidents.
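
The sketch below shows one way such transient filtering could work at sub-minute polling: an alert fires only after several recent samples exceed a loss threshold. The threshold, window size, and class shape are illustrative, not the production detector.

```python
# Illustrative transient filter: alert only on sustained packet loss.
from collections import deque

class LossDetector:
    def __init__(self, threshold_pct=0.1, confirm_samples=3, window=6):
        self.threshold = threshold_pct        # loss level considered "bad"
        self.confirm = confirm_samples        # samples needed to confirm
        self.samples = deque(maxlen=window)   # rolling window of recent samples

    def observe(self, loss_pct: float) -> bool:
        """Return True once enough recent samples confirm sustained loss."""
        self.samples.append(loss_pct)
        bad = sum(1 for s in self.samples if s > self.threshold)
        return bad >= self.confirm

detector = LossDetector()
alerts = [detector.observe(loss) for loss in [0.0, 0.5, 0.0, 0.4, 0.6, 0.7]]
print(alerts)  # [False, False, False, False, True, True] -- a single spike is ignored
```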

Other pipelines continuously monitor network graphs for node isolation and periodically assess risks with “what-if” intent using ROC, as described above. We have multiple canary agents deployed throughout the network checking reachability, latency, and packet loss across our regions. This includes agents within Azure as well as outside of our network, to enable outside-in monitoring. We also periodically analyze Map of Internet (MOI) telemetry to measure end-to-end performance from customers to Azure. Finally, we have robust monitoring in place to protect the network from security attacks such as BGP route hijacks and distributed denial of service (DDoS).
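
To illustrate the kind of measurement a canary agent performs, here is a minimal probe that attempts a few TCP connections to an endpoint and reports reachability, latency, and loss. The endpoint, port, and attempt count are placeholders; the real agents use richer probing than this.

```python
# Minimal reachability/latency/loss probe; endpoint and counts are placeholders.
import socket
import statistics
import time

def probe(host: str, port: int = 443, attempts: int = 5, timeout: float = 2.0):
    latencies, failures = [], 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies.append((time.monotonic() - start) * 1000)  # milliseconds
        except OSError:
            failures += 1
    return {
        "reachable": bool(latencies),
        "loss_pct": 100 * failures / attempts,
        "p50_ms": statistics.median(latencies) if latencies else None,
    }

print(probe("example.com"))  # e.g. {'reachable': True, 'loss_pct': 0.0, 'p50_ms': 12.3}
```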

Source: Microsoft.com
