Project Narya is a holistic, end-to-end prediction and mitigation service—named after the "ring of fire" from Lord of the Rings, known to resist the weariness of time. Narya is designed not only to predict and mitigate Azure host failures but also to measure the impact of its mitigation actions and to use an automatic feedback loop to intelligently adjust its mitigation strategy. It leverages our Resource Central platform, a general machine learning and prediction-serving system that we have deployed to all Azure compute clusters worldwide. Narya has been running in production for over a year and, on average, has reduced virtual machine interruptions by 26 percent—helping to run your Azure workloads more smoothly.
How did we approach this before Narya?
In the past, we used machine learning to inform our failure predictions, then selected the mitigation action statically based on the predicted failure. For example, if a piece of hardware was determined to be "at-risk", we would notify customers running workloads on it, through in-virtual machine notifications, that we had detected degraded hardware. We would also always perform this set of steps:
1. Block new allocations on the node.
2. Migrate off as many of the virtual machines as possible on the fly (using live migration).
3. Wait several days for short-lived virtual machines to be stopped organically or re-deployed by customers.
4. Migrate off the remaining virtual machines by disconnecting the virtual machines and moving them to healthy nodes.
5. Bring the node out of production and run internal diagnostics to determine repair action.
Although this approach worked well, we saw several opportunities to improve in certain scenarios. For instance, some failures (such as damaged disks) may be too severe for us to wait days for virtual machines to be stopped or re-deployed. At other times, an "at-risk" prediction might reflect a minor issue or even be a false positive. In these cases, forced migration would cause unnecessary customer impact, and it would be better to continue monitoring further signals and re-evaluate the node after a given period. Ultimately, we concluded that to design the best system for our customers, we needed not only to be more flexible in how we responded to our predictions, but also to measure the exact customer impact of our actions in each scenario.
How do we approach this now, with Narya?
This is where Narya comes in. Rather than having a single pre-determined mitigation action for an "at-risk" prediction, Narya considers many possible mitigation actions. For a given set of predictions, Narya uses either an online A/B testing framework or a reinforcement learning framework to determine the best possible response.
Phase 1: Failure prediction
Narya starts by using fleet telemetry to predict potential host failures due to hardware faults. We can produce accurate predictions by combining predictive rules based on domain-expert knowledge with a machine learning-based method.
An example of a domain-expert predictive rule: if a CPU Internal Error (IERR) occurs twice within n days (for example, n = 30), the node will likely fail again soon. Narya currently uses several dozen domain-expert predictive rules derived from data-driven methods.
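To make the IERR rule concrete, here is a minimal Python sketch. It is an illustration only, not Narya's actual rule engine: the function name `ierr_rule_fires`, its parameters, and the choice to evaluate the rule over a trailing window ending at `now` are all assumptions made for this example.

```python
from datetime import datetime, timedelta

def ierr_rule_fires(ierr_times, now, window_days=30, threshold=2):
    """Hypothetical rule check: True if at least `threshold` IERR events
    fall inside the trailing window of `window_days` ending at `now`."""
    cutoff = now - timedelta(days=window_days)
    # Keep only the events that fall inside the trailing window.
    recent = [t for t in ierr_times if cutoff <= t <= now]
    return len(recent) >= threshold

# Example: two IERR events in January, evaluated on January 31.
now = datetime(2024, 1, 31)
events = [datetime(2024, 1, 5), datetime(2024, 1, 20)]
at_risk = ierr_rule_fires(events, now)  # both events are within 30 days
```

A rule like this is cheap to evaluate and easy to audit, which is one plausible reason to keep expert rules alongside a learned model.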
Narya also incorporates a machine learning model, which is helpful because it analyzes more signals and patterns over a larger time frame than the predictive rules—allowing us to predict failures earlier. This builds on our prior failure prediction work but, rather than focusing on failures of individual components, this model now reviews overall host health with respect to real customer impact. Since 2018, we have also expanded the kinds of incoming signals and have improved signal quality. As a result, we have reduced the number of false positives and negatives, ultimately improving the effectiveness of this failure prediction step.
Phase 2: Deciding and applying mitigation actions
Rather than having one fixed mitigation strategy, we created a selection of mitigation actions for Narya to consider. Each mitigation action can be considered as a composite of many smaller steps, including:
◉ Marking the node as unallocatable.
◉ Live migrating the virtual machines to other nodes.
◉ Soft rebooting the kernel while preserving memory, which minimizes interruptions to customer workloads (they experience only a short pause).
◉ Deprioritizing allocations on the node.
◉ And more.
For example, one mitigation action might be to mark the node unallocatable, attempt a memory-preserving kernel soft reboot, and mark the node allocatable again if the reboot succeeds. If it does not succeed, we live migrate the virtual machines and send the node to diagnostics, where we run tests to determine whether the hardware is degraded. If it is, then we send the node to repair and replace the hardware. Overall, this gives us far more flexibility to handle different scenarios with different mitigations, improving overall Azure host resilience.
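The composite action described above can be sketched as a small function that chains the steps and falls back when the soft reboot fails. This is a hedged illustration: the `Node` class, the function name, and the three callables (`soft_reboot`, `live_migrate`, `hardware_degraded`) are hypothetical stand-ins, not real Azure APIs. The source does not say what happens when diagnostics find no degradation, so the sketch simply reports that outcome and leaves the node unallocatable.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    allocatable: bool = True
    log: List[str] = field(default_factory=list)  # ordered record of steps taken

def soft_reboot_then_migrate(
    node: Node,
    soft_reboot: Callable[[Node], bool],        # hypothetical: True on success
    live_migrate: Callable[[Node], None],       # hypothetical: moves VMs away
    hardware_degraded: Callable[[Node], bool],  # hypothetical: diagnostics result
) -> str:
    node.allocatable = False
    node.log.append("mark_unallocatable")
    if soft_reboot(node):
        # Reboot succeeded: return the node to service.
        node.allocatable = True
        node.log.append("mark_allocatable")
        return "returned_to_service"
    # Reboot failed: move workloads off, then diagnose the hardware.
    node.log.append("soft_reboot_failed")
    live_migrate(node)
    node.log.append("live_migrate")
    node.log.append("diagnostics")
    if hardware_degraded(node):
        node.log.append("repair")
        return "sent_to_repair"
    return "diagnostics_passed"  # assumption: follow-up is not specified in the text
```

Passing the steps in as callables keeps the composite action separate from how each step is carried out, so the same small steps can be recombined into other actions.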
To respond to "at-risk" predictions in a much more flexible manner, Narya uses an online A/B testing framework and a reinforcement learning (RL) framework to continuously optimize the mitigation action for minimal virtual machine interruptions.
A/B testing framework
When Narya conducts A/B testing, it selects different mitigation actions, compares them to a control group with no action taken, and gathers all the data to determine which mitigation actions are best for which scenarios. From then onwards, for this given set of failure predictions, it continuously selects the best actions—helping to reduce virtual machine reboots, ensure more available capacity, and maintain the best performance.
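As a rough illustration of the A/B idea, the Python sketch below randomly assigns at-risk nodes to a mitigation arm (including a no-action control) and then picks the arm with the fewest observed virtual machine interruptions on average. The arm names, the interruption metric, and the plain comparison of means are assumptions made for this example. A production framework would also need to check statistical significance before committing to an action, which is omitted here.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical arms; "control_no_action" is the control group.
ARMS = ["control_no_action", "live_migrate", "soft_reboot"]

def assign_arm(rng: random.Random) -> str:
    """Randomly assign an at-risk node to one experiment arm."""
    return rng.choice(ARMS)

def best_arm(observations):
    """observations: iterable of (arm, vm_interruptions) pairs.
    Returns the arm with the lowest mean number of interruptions."""
    by_arm = defaultdict(list)
    for arm, interruptions in observations:
        by_arm[arm].append(interruptions)
    return min(by_arm, key=lambda a: mean(by_arm[a]))

# Made-up observations, for illustration only.
observed = [
    ("control_no_action", 5), ("control_no_action", 7),
    ("live_migrate", 2), ("live_migrate", 4),
    ("soft_reboot", 1), ("soft_reboot", 2),
]
winner = best_arm(observed)
```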
Reinforcement learning (RL) framework
When Narya uses reinforcement learning, it learns how to maximize the overall customer experience by exploring different actions over time, weighing the most recent actions the most heavily. Reinforcement learning is different from A/B testing in that it automatically learns to avoid less optimal actions by continuously balancing between using the most optimal actions and exploring new ones.
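One common way to get this behavior is an epsilon-greedy bandit with a constant step size, which gives an exponentially recency-weighted value estimate for each action. The Python sketch below shows that general technique and is not a description of Narya's actual algorithm. The class name, the `epsilon` and `alpha` values, and the reward definition (for example, the negative number of virtual machine interruptions) are assumptions.

```python
import random

class RecencyWeightedBandit:
    """Epsilon-greedy action selection with a constant step size,
    so newer rewards count more than older ones."""

    def __init__(self, actions, epsilon=0.1, alpha=0.3, seed=None):
        self.q = {a: 0.0 for a in actions}  # estimated value per action
        self.epsilon = epsilon              # probability of exploring
        self.alpha = alpha                  # step size; higher favors recent rewards
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Move the estimate a fixed fraction toward the new reward.
        self.q[action] += self.alpha * (reward - self.q[action])
```

In use, a caller would call `select()` to choose a mitigation action for a node, observe the resulting customer impact, and pass it back through `update()`. Because the step size is constant, an action that recently started performing badly loses its lead quickly, even if it did well in the past.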
Phase 3: Observe customer impact and retrain models
Finally, after mitigation actions are taken, new data can be gathered. We now have a measure of the most up-to-date customer impact data, which we use to continually improve our models at every step of the Narya framework. Narya makes sure to do this automatically—the data not only helps us to update the domain-expert rules and the machine learning models in the failure prediction step, but also informs better mitigation action policy in the decision step.