Tuesday 30 June 2020

Advancing Azure service quality with artificial intelligence: AIOps

As Mark mentioned when he launched this Advancing Reliability blog series, building and operating a global cloud infrastructure at the scale of Azure is a complex task with hundreds of ever-evolving service components, spanning more than 160 datacenters across more than 60 regions. To rise to this challenge, we have created an AIOps team to collaborate broadly across Azure engineering teams and partnered with Microsoft Research to develop AI solutions to make cloud service management more efficient and more reliable than ever before. We are going to share our vision on the importance of infusing AI into our cloud platform and DevOps process. Gartner referred to something similar as AIOps (pronounced “AI Ops”) and this has become the common term that we use internally, albeit with a larger scope. Today’s post is just the start, as we intend to provide regular updates to share our adoption stories of using AI technologies to support how we build and operate Azure at scale.

Why AIOps?


There are two unique characteristics of cloud services:

◉ The ever-increasing scale and complexity of the cloud platform and systems
◉ The ever-changing needs of customers, partners, and their workloads

To build and operate reliable cloud services during this constant state of flux, and to do so as efficiently and effectively as possible, our cloud engineers (including thousands of Azure developers, operations engineers, customer support engineers, and program managers) heavily rely on data to make decisions and take actions. Furthermore, many of these decisions and actions need to be executed automatically as an integral part of our cloud services or our DevOps processes. Streamlining the path from data to decisions to actions involves identifying patterns in the data, reasoning, and making predictions based on historical data, then recommending or even taking actions based on the insights derived from all that underlying data.


Figure 1. Infusing AI into cloud platform and DevOps.

The AIOps vision


AIOps has started to transform the cloud business by improving service quality and customer experience at scale while boosting engineers’ productivity with intelligent tools, driving continuous cost optimization, and ultimately improving the reliability, performance, and efficiency of the platform itself. When we invest in advancing AIOps and related technologies, we see this ultimately provides value in several ways:

◉ Higher service quality and efficiency: Cloud services will have built-in capabilities of self-monitoring, self-adapting, and self-healing, all with minimal human intervention. Platform-level automation powered by such intelligence will improve service quality (including reliability, availability, and performance) and service efficiency to deliver the best possible customer experience.

◉ Higher DevOps productivity: With the automation power of AI and ML, engineers are freed from the toil of investigating repeated issues and manually operating and supporting their services, and can instead focus on solving new problems, building new functionality, and working on what more directly impacts the customer and partner experience. In practice, AIOps empowers developers and engineers with insights rather than raw data, thereby improving engineering productivity.

◉ Higher customer satisfaction: AIOps solutions play a critical role in enabling customers to use, maintain, and troubleshoot their workloads on top of our cloud services as easily as possible. We endeavor to use AIOps to understand customer needs better, in some cases to identify potential pain points and proactively reach out as needed. Data-driven insights into customer workload behavior could flag when Microsoft or the customer needs to take action to prevent issues or apply workarounds. Ultimately, the goal is to improve satisfaction by quickly identifying, mitigating, and fixing issues.

My colleagues Marcus Fontoura, Murali Chintalapati, and Yingnong Dang shared Microsoft’s vision, investments, and sample achievements in this space during the keynote AI for Cloud–Toward Intelligent Cloud Platforms and AIOps at the AAAI-20 Workshop on Cloud Intelligence in conjunction with the 34th AAAI Conference on Artificial Intelligence. The vision was created by a Microsoft AIOps committee across cloud service product groups including Azure, Microsoft 365, Bing, and LinkedIn, as well as Microsoft Research (MSR). In the keynote, we shared a few key areas in which AIOps can be transformative for building and operating cloud systems, as shown in the chart below.


Figure 2. AI for Cloud: AIOps and AI-Serving Platform.

AIOps


Moving beyond our vision, we wanted to start by briefly summarizing our general methodology for building AIOps solutions. A solution in this space always starts with data—measurements of systems, customers, and processes—as the key to any AIOps solution is distilling insights about system behavior, customer behaviors, and DevOps artifacts and processes. The insights could include identifying a problem that is happening now (detect), why it’s happening (diagnose), what will happen in the future (predict), and how to improve (optimize, adjust, and mitigate). Such insights should always be associated with business metrics—customer satisfaction, system quality, and DevOps productivity—and drive actions in line with prioritization determined by the business impact. The actions will also be fed back into the system and process. This feedback could be fully automated (infused into the system) or involve humans in the loop (infused into the DevOps process). This overall methodology guided us to build AIOps solutions in three pillars.


Figure 3. AIOps methodologies: Data, insights, and actions.

AI for systems


Today, we're introducing several AIOps solutions that are already in use and supporting Azure behind the scenes. The goal is to automate system management and reduce human intervention, which in turn helps to reduce operational costs, improve system efficiency, and increase customer satisfaction. These solutions have already contributed significantly to improving Azure platform availability, especially for Azure IaaS virtual machines (VMs). AIOps solutions have contributed in several ways, including protecting customers’ workloads from host failures through hardware failure prediction and proactive actions such as live migration and Project Tardigrade, and pre-provisioning VMs to shorten VM creation time.

Of course, engineering improvements and ongoing system innovation also play important roles in the continuous improvement of platform reliability.

◉ Hardware Failure Prediction protects cloud customers from interruptions caused by hardware failures. We shared our story of Improving Azure Virtual Machine resiliency with predictive ML and live migration back in 2018. Microsoft Research and Azure have built a disk failure prediction solution for Azure Compute, triggering the live migration of customer VMs from predicted-to-fail nodes to healthy nodes. We also expanded the prediction to other types of hardware issues, including memory and networking router failures. This enables us to perform predictive maintenance for better availability.

◉ Pre-Provisioning Service in Azure brings VM deployment reliability and latency benefits by creating pre-provisioned VMs. Pre-provisioned VMs are created and partially configured ahead of customer requests. As we described in our IJCAI 2020 publication and in the AAAI-20 keynote mentioned above, the Pre-Provisioning Service leverages a prediction engine to predict VM configurations and the number of VMs per configuration to pre-create. This prediction engine applies dynamic models that are trained on historical and current deployment behaviors to predict future deployments. The Pre-Provisioning Service uses these predictions to create and manage a pool of VMs per VM configuration, resizing each pool by destroying or adding VMs as prescribed by the latest predictions. Once a VM matching the customer’s request is identified, the VM is assigned from the pre-created pool to the customer’s subscription.
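To make the pool-management idea concrete, here is a minimal, purely illustrative Python sketch of prediction-driven pool resizing. The data structures, the placeholder create/destroy functions, and the configuration keys are all hypothetical; this is not the actual Pre-Provisioning Service implementation.

```python
# Purely illustrative sketch of prediction-driven VM pool resizing.
# All names here are hypothetical, not the Azure Pre-Provisioning Service.
from dataclasses import dataclass, field


def create_partially_configured_vm(config: str) -> str:
    return f"vm-{config}-{id(object())}"   # placeholder for a real VM create


def destroy_vm(vm: str) -> None:
    pass                                    # placeholder for a real VM delete


@dataclass
class VmPool:
    config: str                             # e.g. "Standard_D4s_v3/eastus"
    vms: list = field(default_factory=list)

    def resize(self, target: int) -> None:
        """Add or destroy pre-provisioned VMs until the pool matches target."""
        while len(self.vms) < target:
            self.vms.append(create_partially_configured_vm(self.config))
        while len(self.vms) > target:
            destroy_vm(self.vms.pop())


def reconcile(pools: dict[str, VmPool], predictions: dict[str, int]) -> None:
    """Apply the latest per-configuration demand predictions to every pool."""
    for config, predicted_count in predictions.items():
        pools.setdefault(config, VmPool(config)).resize(predicted_count)


pools: dict[str, VmPool] = {}
reconcile(pools, {"Standard_D4s_v3/eastus": 3, "Standard_E8s_v3/westeurope": 1})
print({cfg: len(p.vms) for cfg, p in pools.items()})
```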

AI for DevOps


AI can boost engineering productivity and help in shipping high-quality services with speed. Below are a few examples of AI for DevOps solutions.

◉ Incident management is an important aspect of cloud service management—identifying and mitigating rare but inevitable platform outages. A typical incident management procedure consists of multiple stages including detection, engagement, and mitigation stages. Time spent in each stage is used as a Key Performance Indicator (KPI) to measure and drive rapid issue resolution. KPIs include time to detect (TTD), time to engage (TTE), and time to mitigate (TTM).


Figure 4. Incident management procedures.

As shared in AIOps Innovations in Incident Management for Cloud Services at the AAAI-20 conference, we have developed AI-based solutions that enable engineers not only to detect issues early but also to identify the right team(s) to engage and therefore mitigate as quickly as possible. Tight integration into the platform enables end-to-end touchless mitigation for some scenarios, which considerably reduces customer impact and therefore improves the overall customer experience.

◉ Anomaly Detection provides an end-to-end monitoring and anomaly detection solution for Azure IaaS. The detection solution targets a broad spectrum of anomaly patterns that includes not only generic patterns defined by thresholds, but also patterns which are typically more difficult to detect such as leaking patterns (for example, memory leaks) and emerging patterns (not a spike, but increasing with fluctuations over a longer term). Insights generated by the anomaly detection solutions are injected into the existing Azure DevOps platform and processes, for example, alerting through the telemetry platform, incident management platform, and, in some cases, triggering automated communications to impacted customers. This helps us detect issues as early as possible.

One example that has already made its way into a customer-facing feature is Dynamic Threshold, an ML-based anomaly detection model available in Azure Monitor through the Azure portal or the ARM API. Dynamic Threshold allows users to tune their detection sensitivity, including specifying how many violation points will trigger a monitoring alert. (An illustrative sketch of spike versus leaking-pattern detection follows this list.)

◉ Safe Deployment serves as an intelligent global “watchdog” for the safe rollout of Azure infrastructure components. We built a system, code-named Gandalf, that analyzes temporal and spatial correlations to capture latent issues that surface hours or even days after a rollout. This helps to identify a suspicious rollout among the sea of ongoing rollouts that is common in Azure, and prevents the issue from propagating and impacting additional customers.
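To illustrate the difference between a generic threshold spike and a harder-to-detect “leaking” pattern, here is a minimal Python sketch using rolling statistics. It is not the Azure anomaly detection pipeline or the Dynamic Threshold model; the window size and slope limit are arbitrary assumptions for the example.

```python
# Illustrative anomaly checks over a metric time series; not the actual
# Azure anomaly detection pipeline or the Dynamic Threshold model.
import numpy as np


def spike_anomalies(series: np.ndarray, threshold: float) -> np.ndarray:
    """Generic pattern: points that simply exceed a fixed threshold."""
    return np.where(series > threshold)[0]


def looks_like_leak(series: np.ndarray, window: int = 60,
                    slope_limit: float = 0.01) -> bool:
    """Leaking pattern: a sustained upward trend (e.g. memory that never
    comes back down), detected via the slope of a least-squares fit over
    a smoothed window rather than any single spike."""
    smoothed = np.convolve(series, np.ones(window) / window, mode="valid")
    slope = np.polyfit(np.arange(len(smoothed)), smoothed, 1)[0]
    return slope > slope_limit


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = np.linspace(100, 180, 720) + rng.normal(0, 5, 720)  # slow leak + noise
    print(looks_like_leak(memory))          # True: emerging/leaking pattern
    print(spike_anomalies(memory, 500))     # empty: no threshold spike
```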

AI for customers


To improve the Azure customer experience, we have been developing AI solutions to power the full lifecycle of customer management. For example, a decision support system has been developed to guide customers towards the best selection of support resources by leveraging the customer’s service selection and verbatim summary of the problem experienced. This helps shorten the time it takes to get customers and partners the right guidance and support that they need.

AI-serving platform


To achieve greater efficiencies in managing a global-scale cloud, we have been investing in building systems that support using AI to optimize cloud resource usage and therefore the customer experience. One example is Resource Central (RC), an AI-serving platform for Azure that we described in Communications of the ACM. It collects telemetry from Azure containers and servers, learns from their prior behaviors, and, when requested, produces predictions of their future behaviors. We are already using RC to predict many characteristics of Azure Compute workloads accurately, including resource procurement and allocation, all of which helps to improve system performance and efficiency.

Looking towards the future


We have shared our vision of AI infusion into the Azure platform and our DevOps processes and highlighted several solutions that are already in use to improve service quality across a range of areas. Look to us to share more details of our internal AI and ML solutions for even more intelligent cloud management in the future. We’re confident that these are the right investments to improve our effectiveness and efficiency as a cloud provider, including improving the reliability and performance of the Azure platform itself.

Saturday 27 June 2020

Azure Support API: Create and manage Azure support tickets programmatically

Large enterprise customers running business-critical workloads on Azure manage thousands of subscriptions and use automation for deployment and management of their Azure resources. Expert support for these customers is critical in achieving success and operational health of their business. Today, customers can keep running their Azure solutions smoothly with self-help resources, such as diagnosing and solving problems in the Azure portal, and by creating support tickets to work directly with technical support engineers.


We have heard feedback from our customers and partners that automating support procedures is key to helping them move faster in the cloud and focus on their core business. Integrating internal monitoring applications and websites with Azure support tickets has been one of their top asks. Customers expect to create, view, and manage support tickets without having to sign in to the Azure portal. This gives them the flexibility to associate the issues they are tracking with the support tickets they raise with Microsoft. The ability to programmatically raise and manage support tickets when an issue occurs is a critical step for them in Azure usability.

We’re happy to share that the Azure Support API is now generally available. With this API, customers can integrate the creation and management of support tickets directly into their IT service management (ITSM) system, and automate common procedures.

Using the Azure Support API, you can:

◉ Create a support ticket for technical, billing, subscription management, and subscription and service limits (quota) issues.

◉ Get a list of support tickets with detailed information, and filter by status or created date.

◉ Update severity, status, and contact information.

◉ Manage all communications for a support ticket.

Benefits of Azure Support API


Reduce the time between finding an issue and getting support from Microsoft

A typical troubleshooting process when the customer encounters an Azure issue looks something like this:


At step five, if the issue is unresolved and identified to be on the Azure side, customers navigate to the Azure portal to contact support. With programmatic case management access, customers can automate their support process with their internal tooling to create and manage their support tickets, thus reducing the time between finding an issue and contacting support.

Customers now have one end-to-end process that flows smoothly from internal to external systems, without the person filing the issue having to deal with the complexity and challenges of working across separate case management systems.

Create support tickets via ARM templates

Deploying an ARM template that creates resources can sometimes result in a ResourceQuotaExceeded deployment error, indicating that you have exceeded your Azure subscription and service limits (quotas). This happens because quotas are applied at the resource group, subscription, account, and other scopes. For example, your subscription may be configured to limit the number of cores for a region. If you attempt to deploy a virtual machine with more cores than the permitted amount, you receive an error stating the quota has been exceeded. The way to resolve it is to request a quota increase by filing a support ticket. With the Support API in place, you can avoid signing in to the Azure portal to create a ticket and instead request quota increases directly via ARM templates, as sketched below.
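Whether the ticket is embedded in an ARM template or created through a direct call to the Support API, the request carries the same ticket properties. Below is a hedged Python sketch against the REST endpoint; the api-version, payload field names, and placeholder IDs are indicative assumptions, so check the Microsoft.Support API reference for the exact schema before relying on them.

```python
# Hedged sketch: creating a support ticket through the Azure Support
# REST API with plain HTTP. The api-version and payload field names
# below are indicative only; check the Microsoft.Support API reference
# for the exact schema. Requires `pip install requests`.
import uuid
import requests

SUBSCRIPTION_ID = "<subscription-id>"
TOKEN = "<bearer token from Azure AD>"   # e.g. obtained via azure-identity

ticket_name = f"quota-increase-{uuid.uuid4().hex[:8]}"
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/providers/Microsoft.Support/supportTickets/{ticket_name}"
    "?api-version=2020-04-01"
)

body = {
    "properties": {
        "title": "Increase regional vCPU quota",
        "description": "Request to raise the Dsv3 core quota in West Europe.",
        "severity": "minimal",
        # serviceId / problemClassificationId values come from the
        # Support API's services and problemClassifications listings.
        "serviceId": "<service-id-for-quota>",
        "problemClassificationId": "<problem-classification-id>",
        "contactDetails": {
            "firstName": "Jane",
            "lastName": "Doe",
            "primaryEmailAddress": "jane@example.com",
            "preferredContactMethod": "email",
            "preferredTimeZone": "UTC",
            "preferredSupportLanguage": "en-US",
            "country": "USA",
        },
    }
}

resp = requests.put(
    url, json=body, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30
)
resp.raise_for_status()
print(resp.status_code, resp.json())
```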

Thursday 25 June 2020

Rules Engine for Azure Front Door and Azure CDN is now generally available


Today we are announcing the general availability of the Rules Engine feature on both Azure Front Door and Azure Content Delivery Network (CDN). Rules Engine places the specific routing needs of your customers at the forefront of Azure’s global application delivery services, giving you more control in how you define and enforce what content gets served from where. Both services offer customers the ability to deliver content fast and securely using Azure’s best-in-class network. We have learned a lot from our customers during the preview and look forward to sharing the latest updates going into general availability.

How Rules Engine works


We recently talked about how we are building and evolving the architecture and design of Azure Front Door Rules Engine. The Rules Engine implementation for Content Delivery Network follows a similar design. However, rather than creating groups of rules in Rules Engine Configurations, all rules are created and applied to each Content Delivery Network endpoint. Content Delivery Network Rules Engine also adds the concept of a global rule, which acts as a default rule for each endpoint and always triggers its action.

General availability capabilities


Azure Front Door

The most important feedback we heard during the Azure Front Door Rules Engine preview was the need for higher rule limits. Effective today, you will be able to create up to 25 rules per configuration, for a total of 10 configurations, giving you the ability to create a total of 250 rules across your Azure Front Door. There remains no additional charge for Azure Front Door Rules Engine.

Azure Content Delivery Network 

Similarly, Azure Content Delivery Network limits have been updated. Through preview, users had access to five total rules including the global rule for each CDN endpoint. We are announcing that as part of general availability, the first five rules will continue to be free of charge, and users can now purchase additional rules to customize CDN behavior further. We’re also increasing the number of match conditions and actions within each rule to ten match conditions and five actions.

Rules Engine scenarios


Rules Engine streamlines security and content delivery logic at the edge, a benefit to both current and new customers of either service. Different combinations of match conditions and actions give you fine-grained control over which users get what content, making the scenarios you can accomplish with Rules Engine virtually endless.

For instance, it’s an ideal solution to address legacy application migrations, where you don’t want to worry about users accessing old applications or struggling to find content in your new apps. Similarly, geo-match and device identification capabilities ensure that your users always see the content best suited to their location and device. Implementing security headers and cookies with Rules Engine can also ensure that no matter how your users come to interact with the site, they are doing so over a secure connection, preventing browser-based vulnerabilities from impacting your site.

Here are some additional scenarios that Rules Engine empowers (a conceptual sketch of how match conditions and actions compose into a rule follows this list):

◉ Enforce HTTPS to ensure all your end users interact with your content over a secure connection.

◉ Implement security headers such as HTTP Strict-Transport-Security (HSTS), X-XSS-Protection, Content-Security-Policy, and X-Frame-Options to prevent browser-based vulnerabilities, as well as Access-Control-Allow-Origin headers for Cross-Origin Resource Sharing (CORS) scenarios. Security-based attributes can also be defined with cookies.

◉ Route requests to mobile or desktop versions of your application based on the patterns in the contents of request headers, cookies, or query strings.

◉ Use redirect capabilities to return 301, 302, 307, and 308 redirects to the client to redirect to new hostnames, paths, or protocols.

◉ Dynamically modify the caching configuration of your route based on the incoming requests.

◉ Rewrite the request URL path and forward the request to the appropriate backend in your configured backend pool.

◉ Optimize media delivery to tune the caching configuration based on file type or content path (Azure Content Delivery Network only).
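To make the rule model concrete, here is a purely conceptual Python sketch of how match conditions and actions might compose into a rule and be evaluated against an incoming request. It illustrates the idea only; it is not the Azure Front Door or CDN Rules Engine configuration schema, and all names are invented.

```python
# Conceptual model only: how a rule composed of match conditions and
# actions might be evaluated. Not the Azure Front Door or CDN Rules
# Engine configuration schema.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    match_conditions: list[Callable[[dict], bool]]   # all must be true
    actions: list[Callable[[dict], None]]            # applied in order


def evaluate(rules: list[Rule], request: dict) -> None:
    for rule in rules:
        if all(cond(request) for cond in rule.match_conditions):
            for action in rule.actions:
                action(request)


# Example: route mobile user agents to a mobile backend and force HTTPS.
mobile_rule = Rule(
    name="mobile-routing",
    match_conditions=[lambda r: "Mobile" in r["headers"].get("User-Agent", "")],
    actions=[lambda r: r.update(backend="mobile-pool")],
)
https_rule = Rule(
    name="enforce-https",
    match_conditions=[lambda r: r["scheme"] == "http"],
    actions=[lambda r: r.update(redirect=f"https://{r['host']}{r['path']}")],
)

request = {"scheme": "http", "host": "contoso.com", "path": "/",
           "headers": {"User-Agent": "Mobile Safari"}}
evaluate([mobile_rule, https_rule], request)
print(request["backend"], request["redirect"])
```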

Tuesday 23 June 2020

Rapid recovery planning for IT service providers

Azure Lighthouse is launching the “Azure Lighthouse Vision Series,” a new initiative to help partners with the business challenges of today and provide them the resources and knowledge needed to create a thriving Azure practice.

We are starting the series with a webinar aimed at helping our IT service partners prepare for and manage a new global economic climate. This webinar will be hosted by industry experts from Service Leadership Inc., advisors to service provider owners and executives worldwide. It will cover offerings and execution strategies for solutions and services to optimize profit, growth, and stock value. Service Leadership publishes the Service Leadership Index® of solution provider performance, the industry's broadest and deepest operational and financial benchmark service.

The impact of a recession on service providers


As we continue through uncharted economic territory, service providers must prepare for possible recovery scenarios. Service Leadership has developed an exclusive (and no-cost) guide for service provider owners and executives called the Rapid Recovery™ Planning Guide, based on historical financial benchmarks of solution providers in recessions and likely recovery scenarios.

The guide unlocks the best practices used by those service providers who did best in past recessions, as evidenced by their financial performance from the 2008 recession to the present day. As noted in the guide, through their Service Leadership Index® Annual Solution Provider Industry Profitability Report™, Service Leadership determined that:

◉ In the 2001 and 2008 recessions, value-added reseller (VAR) and reseller revenue declined an average 45 percent within two quarters.

◉ In the 2008 recession, mid-size and enterprise managed services providers (MSPs) experienced a 30 percent drop in revenue within the first three quarters.

◉ Private cloud providers saw the smallest average dip, only 10 percent, in past recessions.

◉ Project services firms experienced the most significant decline, having dropped into negative adjusted EBITDA back in 2008.

The upcoming webinar will explore methods used by the top performing service providers to plan and execute successfully in the current economy.

Tackling the challenges of today and tomorrow


Service providers have an essential role to play in our economic recovery. As we shift to a remote working culture, companies across the globe are ramping up efforts to reduce cost, ensure continuity in all lines of business, and manage new security challenges with a borderless office.

The chart below shows how three Service Provider Predominant Business Models™ have performed since the end of the last recession.


During the webinar, Service Leadership will provide estimated financial projections using multiple economic scenarios through 2028. These predictions, coupled with service provider best practices for managing an economic downturn, will be at the heart of our presentation.

Navigating success with Azure


Our Principal PM Manager for Azure Lighthouse, Archana Balakrishnan, will join Service Leadership to illustrate how Microsoft Azure management tools can give service providers what they need to scale, automate, and optimize managed services on Azure.

Sunday 21 June 2020

Accelerating Cybersecurity Maturity Model Certification (CMMC) compliance on Azure


As we deliver on our ongoing commitment to serving as the most secure and compliant cloud, we’re constantly adapting to the evolving landscape of cybersecurity to help our customers achieve compliance more rapidly. Our aim is to continue to provide our customers and partners with world-class cybersecurity technology, controls, and best practices, making compliance faster and easier with native capabilities in Azure and Azure Government, as well as Microsoft 365 and Dynamics 365.

In architecting solutions with customers, a foundational component of increasing importance is building more secure and reliable supply chains. For many customers, this is an area where new tools, automation, and process maturity can improve an organization’s security posture while reducing manual compliance work.

In preparing for the new Cybersecurity Maturity Model Certification (CMMC) from the Department of Defense (DoD), many of our customers and partners have asked for more information on how to prepare for audits slated to start as early as the summer of 2020.

Designed to improve the security posture of the Defense Industrial Base (DIB), CMMC requires an evaluation of the contractor’s technical security controls, process maturity, documentation, policies, and the processes that are in place and continuously monitored. Importantly, CMMC also requires validation by an independent, certified third-party assessment organization (C3PAO) audit, in contrast to the historical precedent of self-attestation.

Expanding compliance coverage to meet CMMC requirements


Common questions we’ve heard from customers include: “when will Azure achieve CMMC accreditation?” and “what Microsoft cloud environments will be certified?”

While the details are still being finalized by the DoD and CMMC Accreditation Body (CMMC AB), we expect some degree of reciprocity with FedRAMP, NIST 800-53, and NIST CSF, as many of the CMMC security controls map directly to controls under these existing cybersecurity frameworks. Ultimately, Microsoft is confident in its cybersecurity posture and is closely following guidance from DoD and the CMMC AB to demonstrate compliance to the C3PAOs. We will move quickly to be evaluated once C3PAOs are accredited and approved to begin conducting assessments. 

Microsoft’s goal is to continue to strengthen cybersecurity across the DIB through world-class cybersecurity technology, controls, and best practices, and to put its cloud customers in a position to inherit Microsoft’s security controls and eventual CMMC certifications. Our intent is to achieve certification for Microsoft cloud services utilized by DIB customers.

Note: While commercial environments are intended to be certified as they are for FedRAMP High, CMMC by itself should not be the deciding factor on choosing which environment is most appropriate. Most DIB companies are best aligned with Azure Government and Microsoft 365 GCC High for data handling of Controlled Unclassified Information (CUI).

New CMMC acceleration program for a faster path to certification


The Microsoft CMMC acceleration program is an end-to-end program designed to help customers and partners that serve as suppliers to the DoD improve their cybersecurity maturity, develop the cyber critical thinking skills essential to CMMC, and benefit from the compliance capabilities native to Azure and Azure Government.

The program will help you close compliance gaps and mitigate risks, evolve your cybersecurity toward a more agile and resilient defense posture, and help facilitate CMMC certification. Within this program, you’ll have access to a portfolio of learning resources, architectural references, and automated implementation tools custom-tailored to the certification journey.

Source: microsoft.com

Saturday 20 June 2020

Achieve higher performance and cost savings on Azure with virtual machine bursting

Selecting the right combination of virtual machines (VMs) and disks is extremely important, as the wrong mix can impact your application’s performance. One way to choose which VMs and disks to use is based on your disk performance pattern, but it’s not always easy. For example, a common scenario is unexpected or cyclical disk traffic where the peak disk performance is temporary and significantly higher than the baseline performance pattern. We frequently get asked by our customers, “should I provision my VM for baseline or peak performance?” Over-provisioning can lead to higher costs, while under-provisioning can result in poor application performance and customer dissatisfaction. Azure Disk Storage now makes it easier for you to decide, and we’re pleased to announce VM bursting support on Azure virtual machines.

Get short-term, higher performance with no additional steps or costs


VM bursting, which is enabled by default, offers you the ability to achieve higher throughput for a short duration on your virtual machine instance with no additional steps or cost. Currently available on all Lsv2-series VMs in all supported regions, VM bursting is great for a wide range of scenarios like handling unforeseen spiky disk traffic smoothly, or processing batched jobs with speed. With VM bursting, you can see up to 8X improvement in throughput when bursting. Additionally, you can combine both VM and disk bursting (generally available in April) to get higher performance on your VM or disks without overprovisioning. If you have workloads running on-premises with unpredictable or cyclical disk traffic, you can migrate to Azure and take advantage of our VM bursting support to improve your application performance.

Bursting flow


VM bursting is regulated by a credit-based system. Your VM starts with a full allotment of credits, and these credits allow you to burst for 30 minutes at the maximum burst rate. Bursting credits accumulate when your VM instance is running under its baseline performance limits and are consumed when it is running over those limits.
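Here is a simplified, illustrative credit-bucket model of that flow in Python. The accounting and the example numbers are an approximation for explanation only, not the exact Azure implementation.

```python
# Illustrative credit-bucket model of VM bursting; a simplification for
# explanation, not the exact Azure accounting.
class BurstCreditBucket:
    def __init__(self, baseline_mbps: float, burst_mbps: float,
                 burst_minutes: float = 30):
        self.baseline = baseline_mbps
        self.burst = burst_mbps
        # A full bucket allows bursting at the max rate for ~30 minutes.
        self.max_credits = (burst_mbps - baseline_mbps) * burst_minutes * 60  # MB
        self.credits = self.max_credits

    def tick(self, requested_mbps: float, seconds: float = 1.0) -> float:
        """Return the throughput actually delivered for this interval."""
        if requested_mbps <= self.baseline:
            # Running under the baseline: credits accumulate (up to the cap).
            self.credits = min(self.max_credits,
                               self.credits + (self.baseline - requested_mbps) * seconds)
            return requested_mbps
        # Running over the baseline: burst while credits last.
        allowed = min(requested_mbps, self.burst) if self.credits > 0 else self.baseline
        self.credits = max(0.0, self.credits - (allowed - self.baseline) * seconds)
        return allowed


# Example with Standard_L8s_v2-like numbers: 160 MB/s baseline, 1280 MB/s burst.
bucket = BurstCreditBucket(baseline_mbps=160, burst_mbps=1280)
print(bucket.tick(1280))   # 1280.0 while credits remain
print(bucket.tick(100))    # 100.0, and credits refill a little
```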


Benefits of virtual machine bursting


◉ Cost savings: If your daily peak performance time is less than the burst duration, you can use bursting VMs or disks as a cost-effective solution. You can build your VM and disk combination so the bursting limits match the required peak performance and the baseline limits match the average performance.

◉ Preparedness for traffic spikes: Web servers and their applications can experience traffic surges at any time. If your web server is backed by VMs or disks using bursting, the servers are better equipped to handle traffic spikes.

◉ Handling batch jobs: Some application workloads are cyclical in nature, requiring baseline performance most of the time and higher performance only for short periods. An example of this would be an accounting program that processes transactions daily with a small amount of disk traffic, but at the end of the month runs reconciliation reports that need a much higher amount of disk traffic.

Get started with VM bursting


Create new virtual machines with burst-supported VM sizes using the Azure portal, PowerShell, or the command-line interface (CLI) now. Bursting comes enabled by default on VMs that support it, so you don't need to do anything but deploy the instance to get the benefits. Any of your existing VMs that support bursting will have the capability enabled automatically. You can find the specifications of burst-eligible virtual machines in the table below. The bursting feature is available in all regions where Lsv2-series VMs are available.

Size              Uncached data disk throughput (MB/s)   Max burst uncached data disk throughput (MB/s)
Standard_L8s_v2   160                                     1280
Standard_L16s_v2  320                                     1280
Standard_L32s_v2  640                                     1280
Standard_L48s_v2  960                                     2000
Standard_L64s_v2  1280                                    2000
Standard_L80s_v2  1400                                    2000

Thursday 18 June 2020

Advancing Microsoft Teams on Azure—operating at pandemic scale

Scale, resiliency, and performance do not happen overnight—it takes sustained and deliberate investment, day over day, and a performance-first mindset to build products that delight our users. Since its launch, Teams has experienced strong growth: from launch in 2017 to 13 million daily users in July 2019, to 20 million in November 2019. In April, we shared that Teams has more than 75 million daily active users, 200 million daily meeting participants, and 4.1 billion daily meeting minutes. We thought we were accustomed to the ongoing work necessary to scale service at such a pace given the rapid growth Teams had experienced to date. COVID-19 challenged this assumption; would this experience give us the ability to keep the service running amidst a previously unthinkable growth period?

A solid foundation


Teams is built on a microservices architecture, with a few hundred microservices working cohesively to deliver our product’s many features including messaging, meetings, files, calendar, and apps. Using microservices helps each of our component teams to work and release their changes independently.

Azure is the cloud platform that underpins all of Microsoft’s cloud services, including Microsoft Teams. Our workloads run in Azure virtual machines (VMs), with our older services being deployed through Azure Cloud Services and our newer ones on Azure Service Fabric. Our primary storage stack is Azure Cosmos DB, with some services using Azure Blob Storage. We count on Azure Cache for Redis for increased throughput and resiliency. We leverage Traffic Manager and Azure Front Door to route traffic where we want it to be. We use Queue Storage and Event Hubs to communicate, and we depend on Azure Active Directory to manage our tenants and users.


While this post is mostly focused on our cloud backend, it’s worth highlighting that the Teams client applications also use modern design patterns and frameworks, providing a rich user experience, and support for offline or intermittently connected experiences. The core ability to update our clients quickly and in tandem with the service is a key enabler for rapid iteration. If you’d like to go deeper into our architecture, check out this session from Microsoft Ignite 2019.

Agile development


Our CI/CD pipelines are built on top of Azure Pipelines. We use a ring-based deployment strategy with gates based on a combination of automated end-to-end tests and telemetry signals. Our telemetry signals integrate with incident management pipelines to provide alerting over both service- and client-defined metrics. We rely heavily on Azure Data Explorer for analytics.

In addition, we use an experimentation pipeline with scorecards that evaluate the behavior of features against key product metrics like crash rate, memory consumption, application responsiveness, performance, and user engagement. This helps us figure out whether new features are working the way we want them to.

All our services and clients use a centralized configuration management service. This service provides configuration state to flip product features on and off, adjust cache time-to-live values, control network request frequencies, and set network endpoints to contact for APIs. This provides a flexible framework to “launch darkly,” and to conduct A/B testing such that we can accurately measure the impact of our changes to ensure they are safe and efficient for all users.
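To show how config-driven gating of this kind works in practice, here is a minimal hypothetical Python sketch: a configuration client that gates a feature and exposes tunable values such as cache TTLs and prefetch intervals. The ConfigService class and all settings below are invented for illustration; they are not Teams' actual configuration management service.

```python
# Hypothetical sketch of config-driven feature gating and tuning; the
# ConfigService below is invented for illustration, not Teams' actual
# configuration management service.
import time


class ConfigService:
    """Stand-in for a centralized config store polled by services and clients."""
    def __init__(self, settings: dict, refresh_seconds: int = 60):
        self._settings = dict(settings)
        self._refresh_seconds = refresh_seconds
        self._last_fetch = 0.0

    def get(self, key: str, default=None):
        if time.time() - self._last_fetch > self._refresh_seconds:
            self._last_fetch = time.time()   # in reality: re-fetch from the service
        return self._settings.get(key, default)


config = ConfigService({
    "typing_indicator_enabled": False,       # feature flipped off centrally
    "calendar_prefetch_interval_s": 3600,    # tunable request frequency
    "cache_ttl_s": 86400,                    # tunable cache time-to-live
})


def send_typing_event(channel_id: str) -> None:
    print(f"typing event for {channel_id}")  # hypothetical downstream call


def maybe_send_typing_indicator(channel_id: str) -> None:
    # Feature-gated: becomes a no-op when the flag is off in config.
    if not config.get("typing_indicator_enabled", True):
        return
    send_typing_event(channel_id)


maybe_send_typing_indicator("19:general")    # prints nothing: flag is off
```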

Key resiliency strategies


We employ several resiliency strategies across our fleet of services:

◉ Active-active fault tolerant systems: An active-active fault tolerant system is defined as two (or more) operationally independent, heterogeneous paths, with each path not only serving live traffic at steady state but also having the capability to serve 100 percent of expected traffic, while leveraging client and protocol path-selection for seamless failover. We adopt this strategy for cases where there is a very large failure domain or customer impact, with reasonable cost to justify building and maintaining heterogeneous systems. For example, we use the Office 365 DNS system for all externally visible client domains. In addition, static CDN-class data is hosted on both Azure Front Door and Akamai.

◉ Resiliency-optimized caches: We leverage caches between our components extensively, for both performance and resiliency. Caches help reduce average latency and provide a source of data in case a downstream service is unavailable. Keeping data in caches for a long time introduces data freshness issues, yet it is also the best defense against downstream failures. We therefore track Time to Refresh (TTR) for our cache data as well as Time to Live (TTL). By setting a long TTL and a shorter TTR value, we can fine-tune how fresh to keep our data versus how long we want data to stick around whenever a downstream dependency fails (see the sketch after this list).

◉ Circuit Breaker: This is a common design pattern that prevents a service from doing an operation that is likely to fail. It provides a chance for the downstream service to recover without being overwhelmed by retry requests. It also improves the response of a service when its dependencies are having trouble, helping the system be more tolerant of error conditions.

◉ Bulkhead isolation: We partition some of our critical services into completely isolated deployments. If something goes wrong in one deployment, bulkhead isolation is designed to help the other deployments to continue operating. This mitigation preserves functionality for as many customers as possible.

◉ API level rate limiting: We ensure our critical services can throttle requests at the API level. These rate limits are managed through the centralized configuration management system explained above. This capability enabled us to rate limit non-critical APIs during the COVID-19 surge.

◉ Efficient Retry patterns: We ensure and validate all API clients implement efficient retry logic, which prevents traffic storms when network failures occur.

◉ Timeouts: Consistent use of timeout semantics prevents work from getting stalled when a downstream dependency is experiencing some trouble.

◉ Graceful handling of network failures: We have made long-term investments to improve our client experience when offline or with poor connections. Major improvements in this area launched to production just as the COVID-19 surge began, enabling our client to provide a consistent experience regardless of network quality.

If you have seen the Azure Cloud Design Patterns, many of these concepts may be familiar to you.  We also use the Polly library extensively in our microservices, which provides implementations for some of these patterns.
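As one example of these patterns, here is a minimal Python sketch of the TTR/TTL cache idea described above: entries are refreshed once they pass TTR, but stale data is still served up to TTL if the downstream dependency fails. The fetch callable and the timings are illustrative assumptions, not Teams' actual cache implementation.

```python
# Minimal sketch of a resiliency-optimized cache with separate
# Time-to-Refresh (TTR) and Time-to-Live (TTL), as described above.
# The fetch callable and timings are illustrative assumptions.
import time
from typing import Any, Callable


class TtrTtlCache:
    def __init__(self, fetch: Callable[[str], Any], ttr_s: float, ttl_s: float):
        assert ttr_s <= ttl_s, "refresh earlier than expiry"
        self._fetch, self._ttr, self._ttl = fetch, ttr_s, ttl_s
        self._entries: dict[str, tuple[Any, float]] = {}   # key -> (value, stored_at)

    def get(self, key: str) -> Any:
        value, stored_at = self._entries.get(key, (None, 0.0))
        age = time.time() - stored_at
        if key in self._entries and age < self._ttr:
            return value                       # fresh enough: no downstream call
        try:
            value = self._fetch(key)           # past TTR: try to refresh
            self._entries[key] = (value, time.time())
            return value
        except Exception:
            if key in self._entries and age < self._ttl:
                return value                   # downstream failed: serve stale data
            raise                              # no usable data at all


# Usage: refresh roughly every minute, but tolerate an hour of downstream outage.
cache = TtrTtlCache(fetch=lambda k: f"profile:{k}", ttr_s=60, ttl_s=3600)
print(cache.get("user-123"))
```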

Our architecture had been working out well for us: Teams usage was growing month over month, and the platform scaled easily to meet the demand. However, scalability is not a “set and forget” consideration; it needs continuous attention to address emergent behaviors that manifest in any complex system.

When COVID-19 stay-at-home orders started to kick in around the world, we needed to leverage the architectural flexibility built into our system, and turn all the knobs we could, to effectively respond to the rapidly increasing demand.

Capacity forecasting


Like any product, we build and constantly iterate models to anticipate where growth will occur, both in terms of raw users and usage patterns. The models are based on historical data, cyclic patterns, new incoming large customers, and a variety of other signals.

As the surge began, it became clear that our previous forecasting models were quickly becoming obsolete, so we needed to build new ones that take the tremendous growth in global demand into account. We were seeing new usage patterns from existing users, new usage from existing but dormant users, and many new users onboarding to the product, all at the same time. Moreover, we had to make accelerated resourcing decisions to deal with potential compute and networking bottlenecks. We use multiple predictive modeling techniques (ARIMA, Additive, Multiplicative, Logarithmic). To that we added basic per-country caps to avoid over-forecasting. We tuned the models by trying to understand inflection and growth patterns by usage per industry and geographic area. We incorporated external data sources, including Johns Hopkins’ research for COVID-19 impact dates by country, to augment the peak load forecasting for bottleneck regions.

Throughout the process, we erred on the side of caution and favored over-provisioning—but as the usage patterns stabilized, we also scaled back as necessary.
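For readers unfamiliar with these model families, here is a hedged Python sketch of one of them: a simple ARIMA forecast over a synthetic daily-usage series using statsmodels, with a per-country cap applied to guard against over-forecasting. The series, model order, and cap are illustrative; this is not the actual Teams capacity-forecasting model.

```python
# Hedged sketch: a simple ARIMA forecast over a synthetic usage series
# using statsmodels. Illustrative only, not the Teams capacity model.
# Requires `pip install statsmodels`.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
days = np.arange(120)
# Synthetic daily active usage: growth trend + weekly cycle + noise.
usage = (1_000_000 + 15_000 * days
         + 80_000 * np.sin(2 * np.pi * days / 7)
         + rng.normal(0, 20_000, days.size))

model = ARIMA(usage, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=14)            # two weeks ahead

# A basic per-country cap (as described above) guards against over-forecasting.
country_cap = 1.15 * usage.max()
capped_forecast = np.minimum(forecast, country_cap)
print(capped_forecast.round(0))
```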

Scaling our compute resources


In general, we design Teams to withstand natural disasters. Using multiple Azure regions helps us to mitigate risk, not just from a datacenter issue, but also from interruptions to a major geographic area. However, this means we provision additional resources to be ready to take on an impacted region’s load during such an eventuality. To scale out, we quickly expanded deployment of every critical microservice to additional regions in every major Azure geography. By increasing the total number of regions per geography, we decreased the total amount of spare capacity each region needed to hold to absorb emergency load, thereby reducing our total capacity needs. Dealing with load at this new scale gave us several insights into ways we could improve our efficiency:

◉ We found that by redeploying some of our microservices to favor a larger number of smaller compute clusters, we were able to avoid some per-cluster scaling considerations, speed up our deployments, and achieve more fine-grained load balancing.

◉ Previously, we depended on specific virtual machine (VM) types for our different microservices. By being more flexible in terms of VM type or CPU, and focusing on overall compute power or memory, we were able to make more efficient use of Azure resources in each region.

◉ We found opportunities for optimization in our service code itself. For example, some simple improvements led to a substantial reduction in the amount of CPU time we spend generating avatars (those little bubbles with initials in them, used when no user pictures are available).

Networking and routing optimization


Most of Teams’ capacity consumption occurs within daytime hours for any given Azure geography, leading to idle resources at night. We implemented routing strategies to leverage this idle capacity (while always respecting compliance and data residency requirements):

◉ Non-interactive background work is dynamically migrated to the currently idle capacity. This is done by programming API-specific routes in Azure Front Door to ensure traffic lands in the right place.

◉ Calling and meeting traffic was routed across multiple regions to handle the surge. We used Azure Traffic Manager to distribute load effectively, leveraging observed usage patterns. We also worked to create runbooks which did time-of-day load balancing to prevent wide area network (WAN) throttling.

Some of Teams’ client traffic terminates in Azure Front Door. However, as we deployed more clusters in more regions, we found new clusters were not getting enough traffic. This was an artifact of the distribution of our users relative to the location of Azure Front Door nodes. To address this uneven distribution of traffic, we used Azure Front Door’s ability to route traffic at a country level. In the example below, you can see that we get improved traffic distribution after routing additional France traffic to the UK West region for one of our services.


Figure 1: Improved traffic distribution after routing traffic between regions.

Cache and storage improvements


We use a lot of distributed caches. A lot of big, distributed caches. As our traffic increased, so did the load on our caches to a point where the individual caches would not scale. We deployed a few simple changes with significant impact on our cache use:

◉ We started to store cache state in a binary format rather than raw JSON. We used the protocol buffer format for this.
◉ We started to compress data before sending it to the cache. We used LZ4 compression due to its excellent speed versus compression ratio.

We were able to achieve a 65 percent reduction in payload size, 40 percent reduction in deserialization time, and 20 percent reduction in serialization time. A win all around.
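Here is a minimal sketch of the compression half of that change: LZ4-compressing serialized cache payloads before writing them to a Redis-like store. The protocol-buffers step the post describes is omitted because it requires generated message classes, so JSON stands in as the serialized payload; the sample data and any sizes printed are illustrative, not the measured Teams results.

```python
# Minimal sketch of compressing cache payloads with LZ4 before storing
# them (e.g. in Azure Cache for Redis). JSON stands in for the
# protocol-buffers serialization used in practice. Requires `pip
# install lz4`. Sample data and ratios are illustrative only.
import json
import lz4.frame


def to_cache_bytes(obj) -> bytes:
    serialized = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return lz4.frame.compress(serialized)


def from_cache_bytes(blob: bytes):
    return json.loads(lz4.frame.decompress(blob).decode("utf-8"))


payload = {"members": [f"user-{i}@contoso.com" for i in range(500)],
           "roles": ["owner" if i % 50 == 0 else "member" for i in range(500)]}

blob = to_cache_bytes(payload)
raw_size = len(json.dumps(payload).encode("utf-8"))
print(f"raw: {raw_size} bytes, compressed: {len(blob)} bytes")
assert from_cache_bytes(blob) == payload
```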

Investigation revealed that several of our caches had overly aggressive TTL settings, resulting in unnecessary eager data eviction. Increasing those TTLs helped both reduce average latency and load on downstream systems.

Purposeful degradation (feature brownouts)


As we didn’t really know how far we’d need to push things, we decided it was prudent to put in place mechanisms that let us quickly react to unexpected demand spikes in order to buy us time to bring additional Teams capacity online.

Not all features have equal importance to our customers. For example, sending and receiving messages is more important than the ability to see that someone else is currently typing a message. Because of this, we turned off the typing indicator for a duration of two weeks while we worked on scaling up our services. This reduced peak traffic by 30 percent to some parts of our infrastructure.

We normally use aggressive prefetching at many layers of our architecture so that needed data is close at hand, which reduces average end-to-end latency. Prefetching, however, can get expensive: it results in some amount of wasted work when fetching data that will never be used, and it requires storage resources to hold the prefetched data. In some scenarios we chose to disable prefetching, freeing up capacity on some of our services at the cost of higher latency. In other cases, we increased the duration of prefetch sync intervals. One such example was suppressing calendar prefetch on mobile, which reduced request volume by 80 percent:


Figure 2: Disable prefetch of calendar event details in mobile.

Incident management


While we have a mature incident management process that we use to track and maintain the health of our system, this experience was different. Not only were we dealing with a huge surge in traffic, our engineers and colleagues were themselves going through personal and emotional challenges while adapting to working at home.

To ensure that we not only supported our customers but also our engineers, we put a few changes in place:

◉ Switched our incident management rotations from a weekly cadence to a daily cadence.
◉ Every on-call engineer had at least 12 hours off between shifts.
◉ We brought in more incident managers from across the company.
◉ We deferred all non-critical changes across our services.

These changes helped ensure that all of our incident managers and on-call engineers had enough time to focus on their needs at home while meeting the demands of our customers.

The future of Teams


It is fascinating to look back and wonder what this situation would have been like if it happened even a few years ago. It would have been impossible to scale like we did without cloud computing. What we can do today by simply changing configuration files could previously have required purchasing new equipment or even new buildings. As the current scaling situation stabilizes, we have been returning our attention to the future. We think there are many opportunities for us to improve our infrastructure:

◉ We plan to transition from VM-based deployments to container-based deployments using Azure Kubernetes Service, which we expect will reduce our operating costs, improve our agility, and align us with the industry.

◉ We expect to minimize the use of REST and favor more efficient binary protocols such as gRPC. We will be replacing several instances of polling throughout the system with more efficient event-based models.

◉ We are systematically embracing chaos engineering practices to ensure all those mechanisms we put in place to make our system reliable are always fully functional and ready to spring into action.

By keeping our architecture aligned with industry approaches and by leveraging best practices from the Azure team, when we needed to call for assistance, experts could quickly help us solve problems ranging from data analysis, monitoring, performance optimization and incident management. We are grateful for the openness of our colleagues across Microsoft and the broader software development community. While the architectures and technologies are important, it is the team of people you have that keeps your systems healthy.

Source: microsoft.com

Tuesday 16 June 2020

Be prepared for what’s next: Accelerate cloud migration


We are in the midst of unprecedented times, with far-reaching implications of the global health crisis for healthcare, public policy, and the economy. Organizations are fundamentally changing how they run their businesses, ensure the safety of their workforce, and keep their IT operations running. Most IT leaders that we have had the opportunity to speak with over the past couple of months are thinking hard about how to adapt to rapidly changing conditions. They are also trying to retain momentum on their well-intentioned strategic initiatives.

Across our customers in many industries and geographies, we continue to see the cloud deliver tangible benefits. Azure enables our customers to act faster, continue to innovate, and pivot their IT operations to what matters most. We understand the challenges our customers are facing. We also recognize that our customers are counting on us more than ever.

Common, consistent goals for businesses


Even though our customers’ challenges are often unique to the industries they serve, we hear many common, consistent goals.

◉ Cloud-based productivity and remote collaboration are enabling workers, IT professionals, and developers to work from anywhere. As our customers enable an increase in remote work, there’s increased importance on scaling networking capacity while securely connecting employees to the resources they need.

◉ Azure is critical to our customers’ ability to rapidly scale their compute and storage infrastructure to meet their business needs. This is made possible because of how customers have transformed their IT operations with Azure. Driving operational efficiency with Azure can also enable businesses to scale on-demand and meet business needs.

◉ IT budgets will be constrained over the coming year—optimization of existing cloud investments and improving cash flow via migration to Azure are top of mind. Our customers are exploring ways to run their businesses with reduced IT budgets. Owning and managing on-premises datacenters is expensive and makes customers vulnerable to business continuity risk. An Azure migration approach resonates with these customers, transitioning spend to operating expenses (OpEx), improving cash flow, and reducing business risk.

◉ The downtime is also becoming an opportunity to accelerate projects. CIOs are looking at this as an opportunity to deliver planned projects and find ways to innovate with Azure. They are counting on this innovation to help their business experience a steep recovery as we exit the current scenario.

In many of my discussions with customers, we still hear uncertainty about how to navigate the cloud migration journey. There is an urgency to act, but often a hesitation to start. There is, no doubt, a learning curve, but Microsoft has traversed it with many customers over the past few years. Businesses need best practices and prescriptive guidance on where to begin, how to best steer, and how to avoid the pitfalls. This blog is aimed to help you make progress on this pressing need. We’ll dive deeper into the steps of the cloud migration journey in upcoming posts.

To get you started on your accelerated journey to Azure, here are our top three recommendations. While these aren’t meant to be one-size-fits-all, these are based on learnings from hundreds of scale migration engagements that our team has helped our customers with.

1. Prioritize assessments


Perform a comprehensive discovery of your datacenters using our free tools such as Azure Migrate or Movere. Creating an inventory of your on-premises infrastructure, databases, and applications is the first step in generating right-sized and optimized cost projections for running your applications in Azure. Between your existing configuration management database (CMDB), Active Directory, management tools, and our discovery tools, you have everything you need to make crucial migration decisions.

The priority should be to cover the entire fleet and then arrive at key decisions about candidate apps that you can migrate first and the appropriate migration approach for them. As you run your assessments, identify applications that could be quick wins—hardware refresh, software end-of-support, OS end-of-support, and capacity-constrained resources are all great places to prioritize for the first project. A bias toward action, and urgency around triggers that need immediate attention, can help ensure that you drive operational efficiencies and flip your capital expenses to operational ones.

Many Azure customers are doing this effectively. One example is GlaxoSmithKline (GSK). In partnership with Azure engineering and Microsoft FastTrack for Azure, and by leveraging the Azure Migration Program (AMP), GSK was able to quickly discover their VMware virtual machines and physical servers with Azure Migrate. By leveraging features such as application inventory and application dependency mapping, GSK was able to build a prioritized list of applications that they could migrate. They then used the discovery and assessment data and incorporated it with their CMDB to build PowerBI dashboards to track the progress of their strategic migration initiatives.

“Microsoft engineering and FastTrack’s ability to quickly aggregate and visualize our application hosting estate is the cornerstone to our migration planning activities. GSK is comprised of many different business units, and we are able to tailor migration priorities for each of these business units. In addition, we also now have clear visibility for each server, what they are dependent on, and can now also determine the appropriate server size in Azure to create our migration bundles and landing zones. With this excellent foundation of data, we are able to quickly move into the migration phase of our cloud journey with a high degree of confidence in our approach.”—Jim Funk, Director, Hosting Services, GlaxoSmithKline

2. Anticipate and mitigate complexities


You will run into complexities as you drive your migration strategy—some of these will be related to the foundational architecture of your cloud deployments, but a lot of it will be about how your organization is aligned for change. It is important that you prepare people, business processes, and IT environments for the change, based on a prioritized and agreed cloud adoption plan. Every migration we’ve been involved in has had its own unique requirements. We find that customers who are moving quickly are those who have established clarity in ownership and requirements across stakeholders from security, networking, IT, and application teams.

“The migration to the cloud was more about the mindset in the organization and that transformation we needed to do in IT to become the driver of change in the company instead of maintaining the old. A big part of the migration was to reinvent the digital for the company." —Mark Dajani, CIO, Carlsberg Group

On the technical front, anticipate complexities and plan for your platform foundation for identity, security, operations, compliance, and governance. With established baselines across these shared-architectural pillars, deploy purpose-built landing zones that leverage these centralized controls. Simply put, landing zones and the platform foundation capture everything that must be in place and ready to enable cloud adoption across the IT portfolio.

In addition to designing your baseline environment, you will also want to consider how you will manage your applications as they migrate to Azure. Azure offers comprehensive management solutions for backup, disaster recovery, security, monitoring, governance, and cost management, which can help you achieve IT effectiveness as you migrate. Most customers run in a hybrid reality even when they intend to evacuate on-premises datacenters. Azure Arc is a terrific option for customers who want to simplify complex and distributed environments across on-premises and Azure by extending Azure management to any infrastructure.

3. Execute iteratively


Customers who have the most success in executing on their migration strategy are customers who follow an iterative, workload-based, wave-oriented approach to migration. These customers are using our free first-party migration tools to achieve the scale that works best for their business—from a few hundred to thousands of servers and databases. With Azure Migrate you have coverage for Windows Server and Linux, SQL Server and other databases, .NET and PHP-based web applications, and virtual desktop infrastructure (VDI). These capabilities give you options for migration to infrastructure as a service (IaaS) and platform as a service (PaaS) offerings like Azure App Service and Azure SQL.

The key to success and executing effectively is targeting specific workloads and then executing in phases. In addition, leveraging capabilities like dependency mapping and test migration ensures that your migration cutovers are predictable and have high success rates. We strongly recommend using a lift-optimize-shift approach and then innovating in the cloud, especially during these times.

One such customer who has leveraged the Azure Migrate toolset as part of their cloud transformation is Malaysian telecommunications operator, Celcom. Celcom leveraged Azure Migrate’s discovery and assessment features to securely catalog their applications, virtual machines (VMs), and other IT assets, and to determine the best way to host them in the cloud. With their foundational architecture and management strategy in place, Celcom executed in waves, transitioning their complex multi-vendor on-premises environment with multiple applications over to Azure.

Source: azure.microsoft.com

Saturday 13 June 2020

Streamlining your image building process with Azure Image Builder

Customizing virtual machine (VM) images to meet security and compliance requirements and achieve faster deployment is a strong need for many enterprises, but most don't want to spend the time and energy needed to determine the right tooling, build the right pipeline, and maintain it continuously.

We built the Azure Image Builder service to make building customized images in Azure easy.

The Azure Image Builder service unifies and simplifies your image building process across Azure and Azure Stack with an automated image building pipeline. Whether you want to build Windows or Linux virtual machine images, you can apply existing image security configurations to build compliant images for your organization and patch existing custom images using Linux commands or Windows Update. Azure Image Builder supports images from multiple Linux distributions, Azure Marketplace, and Windows Virtual Desktop environments, and you can build images for specialized VM sizes, including GPU VMs.

After you build the image, you can manage it with Shared Image Gallery and integrate Azure Image Builder into your CI/CD pipeline. Whether you use Azure DevOps or another DevOps solution, this gives you easy image patching, versioning, and regional replication capabilities.

Finally, the Azure Image Builder service offers strong governance and compliance: role-based access control is integrated so you can determine who has access to which images, and you can connect your existing VNET to access routable resources, servers, and services, including configuration servers (DSC, Chef, Puppet, and more). Deploying Azure Image Builder does not require a public IP address, which improves safety and gives you full control of the asset you’re building.

We’ve designed this service to take on the heavy lifting of building your next customized image: meeting corporate and regulatory compliance rules and preconfiguring VMs with applications for faster deployment, without the hassle these tasks used to require. You don't need to spend time learning how to build and maintain image pipelines or adopt new tools. Simply describe your image configuration in a template, using your new or existing commands, scripts, and build artifacts, and Azure Image Builder will create the image for you.
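
To make that concrete, here is a minimal sketch of what such a template might look like, written as a Python dictionary that mirrors the Microsoft.VirtualMachineImages/imageTemplates resource shape. The subscription, resource group, identity, gallery names, and inline commands are hypothetical placeholders, and the exact schema should be confirmed against the current Azure Image Builder documentation before use.

import json

# A hedged sketch of an Azure Image Builder template body. All names, IDs,
# and regions below are hypothetical placeholders.
image_template = {
    "type": "Microsoft.VirtualMachineImages/imageTemplates",
    "apiVersion": "2020-02-14",
    "location": "westus2",
    "identity": {
        "type": "UserAssigned",
        "userAssignedIdentities": {
            "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
            "Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>": {}
        },
    },
    "properties": {
        # Start from a marketplace image...
        "source": {
            "type": "PlatformImage",
            "publisher": "Canonical",
            "offer": "UbuntuServer",
            "sku": "18.04-LTS",
            "version": "latest",
        },
        # ...apply your existing commands and scripts as customization steps...
        "customize": [
            {
                "type": "Shell",
                "name": "installBaseline",
                "inline": ["sudo apt-get update", "sudo apt-get install -y nginx"],
            }
        ],
        # ...and publish the result to a Shared Image Gallery for versioning
        # and regional replication.
        "distribute": [
            {
                "type": "SharedImage",
                "galleryImageId": "/subscriptions/<sub-id>/resourceGroups/<rg>"
                "/providers/Microsoft.Compute/galleries/<gallery>/images/<image-def>",
                "runOutputName": "ubuntuBaseline",
                "replicationRegions": ["westus2", "eastus"],
            }
        ],
    },
}

# Write the template out so it can be submitted, for example, as part of an
# ARM template deployment; a build run is then triggered against the deployed
# image template resource.
with open("imageTemplate.json", "w") as f:
    json.dump(image_template, f, indent=2)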

Azure Image Builder is expected to be generally available in Q3 2020.

Thursday 11 June 2020

Azure Files enhances data protection capabilities

Protecting your production data is critical for any business. That’s why Azure Files takes a multi-layered approach to ensuring your data is highly available, backed up, and recoverable. Whether it’s a ransomware attack, a datacenter outage, or a file share that was accidentally deleted, we want to make sure you can get everything back up and running again quickly. To give you peace of mind about your data in Azure Files, we are enhancing features including our new soft delete capability, share snapshots, redundancy options, and access control for data and administrative functions.

Soft delete: a recycle bin for your Azure file shares


Soft delete protects your Azure file shares from accidental deletion. To this end, we are announcing the preview of soft delete for Azure file shares. Think of soft delete as a recycle bin for your file shares. When a file share is deleted, it transitions to a soft-deleted state in the form of a soft-deleted snapshot. You can configure how long soft-deleted data remains recoverable before it is permanently erased.

Soft-deleted shares can be listed, but to mount them or view their contents, you must undelete them. Upon undelete, the share will be recovered to its previous state, including all metadata as well as snapshots (Previous Versions).
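
As an illustration, here is a minimal sketch that lists soft-deleted shares and undeletes them, assuming the azure-storage-file-share Python SDK (a version with share soft delete support) and a hypothetical connection string:

from azure.storage.fileshare import ShareServiceClient

# Hypothetical connection string for the storage account that holds the shares.
service = ShareServiceClient.from_connection_string("<storage-connection-string>")

# Soft-deleted shares can be listed alongside live shares.
for share in service.list_shares(include_deleted=True):
    if share.deleted:
        print(f"Soft-deleted share: {share.name} (version {share.version})")
        # Undelete recovers the share, including its metadata and snapshots.
        service.undelete_share(deleted_share_name=share.name,
                               deleted_share_version=share.version)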

We recommend turning on soft delete for most shares. If you have a workflow where share deletion is common and expected, you may decide to have a very short retention period or not have soft delete enabled at all. Soft delete is one part of a data protection strategy and can help prevent inadvertent data loss.

Soft delete is currently off by default for both new and existing storage accounts, but it will be enabled by default for new storage accounts in the portal later this year. In the API, it will be on by default beginning January 1, 2021. You can toggle the feature on and off at any time during the life of a storage account. The setting will apply to all file shares within the storage account. If you are using Azure Backup, soft delete will be automatically enabled for all protected instances. Soft delete does not protect against individual file deletions—for those, you should restore from your snapshot backups.

Snapshot backups you can restore from


Snapshots are read-only, point-in-time copies of your Azure file share. They’re incremental, which makes them very efficient: a snapshot contains only the data that has changed since the previous snapshot. You can have up to 200 snapshots per file share and retain them for up to 10 years. You can take these snapshots manually in the Azure portal, via PowerShell, or via the command-line interface (CLI), or you can use Azure Backup, whose snapshot management service for Azure Files recently became generally available. Snapshots are stored within your file share, meaning that if you delete your file share, your snapshots will also be deleted. To protect your snapshot backups from accidental deletion, ensure soft delete is enabled for your share.
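
For example, a manual snapshot can be taken with a few lines of the azure-storage-file-share Python SDK; this is a minimal sketch with a hypothetical connection string and share name:

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string("<storage-connection-string>",
                                           share_name="myshare")

# Take a point-in-time snapshot of the share. The returned properties include
# the snapshot timestamp, which is later used to address this snapshot.
snapshot_props = share.create_snapshot()
print("Snapshot created:", snapshot_props["snapshot"])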

Azure Backup handles the scheduling and retention of snapshots: you define the backup policy you want when setting up your Recovery Services vault, and Backup does the rest. Its new grandfather-father-son (GFS) capabilities mean that you can take daily, weekly, monthly, and yearly snapshots, each with its own distinct retention period. Azure Backup also orchestrates the enablement of soft delete and takes a delete lock on a storage account as soon as any file share within it is configured for backup. Lastly, Azure Backup provides key monitoring and alerting capabilities that give you a consolidated view of your backup estate.

You can perform both item-level and share-level restores in the Azure portal using Azure Backup. All you need to do is choose the restore point (a particular snapshot), the particular file or directory if relevant, and then the location (original or alternate) you wish to restore to. The backup service handles copying the snapshot data and shows your restore progress in the portal.

If you aren’t using Azure Backup, you can perform manual restores from snapshots. If you are using Windows and have mounted your Azure file share, you can use File Explorer to view and restore from snapshots via the “Previous Versions” feature (meaning that users can perform item-level restores on their own). When used on a single file, it shows any versions of that file that differ across previous snapshots. When used on an entire share, it shows all snapshots, which you can then browse and copy from.

You can also restore by copying data from your snapshots using your copy tool of choice. We recommend AzCopy (requires the latest version, v10.4) or Robocopy (requires port 445 to be open). Alternatively, you can mount your snapshot and copy the data back into your primary share.
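
If you prefer to script the restore, the following minimal sketch uses the azure-storage-file-share Python SDK (rather than AzCopy or Robocopy) to copy a single file from a snapshot back into the live share; the connection string, snapshot timestamp, and file path are hypothetical:

from azure.storage.fileshare import ShareClient

conn = "<storage-connection-string>"
snapshot_time = "2020-06-01T00:00:00.0000000Z"  # from an earlier snapshot

# Open the share as it looked at the snapshot, plus the live share to restore into.
snapshot_share = ShareClient.from_connection_string(conn, share_name="myshare",
                                                    snapshot=snapshot_time)
live_share = ShareClient.from_connection_string(conn, share_name="myshare")

# Download the file's contents from the snapshot and upload them to the live share.
data = snapshot_share.get_file_client("reports/q2.xlsx").download_file().readall()
live_share.get_file_client("reports/q2.xlsx").upload_file(data)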

If you are using Azure File Sync, you can also use server-side Volume Shadow Copy Service (VSS) snapshots with Previous Versions to allow users to perform self-service restores. Note that these are different from snapshots of your Azure file share and can be used alongside, but not as a replacement for, cloud-side backups.

Data replication and redundancy options


Azure Files offers different redundancy options to protect your data from planned and unplanned events, ranging from transient hardware failures and network and power outages to massive natural disasters. All Azure file shares can use locally redundant storage (LRS) or zone-redundant storage (ZRS). Geo-redundant storage (GRS) and geo-zone-redundant storage (GZRS) are available for standard file shares under 5 TiB, and we are actively working on geo-redundant storage for standard file shares of up to 100 TiB.

You can achieve geographic redundancy for your premium file shares in a couple of ways. You can set up Azure File Sync to sync between your Azure file share (your cloud endpoint) and a file share on a virtual machine (VM) in another Azure region (your server endpoint). You must disable cloud tiering to ensure all data is present locally (note that your data on the server endpoint may be up to 24 hours out of date, as any changes made directly to the Azure file share are only picked up when the daily change detection process runs). Alternatively, you can write your own script to copy data to a storage account in a secondary region using tools such as AzCopy (use version 10.4 or later to preserve access control lists (ACLs) and timestamps).
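
As a sketch of the scripted approach, and as an alternative to AzCopy, the azure-storage-file-share Python SDK can request a server-side copy of each file into a storage account in the secondary region. The account names, share name, file path, and SAS token below are hypothetical, and the source URL must be readable by the copy service (for example, via a SAS token):

from azure.storage.fileshare import ShareFileClient

# URL of a file on the primary-region share, made readable with a SAS token.
source_url = ("https://primaryaccount.file.core.windows.net/myshare/reports/q2.xlsx"
              "?<sas-token>")

# A file client pointing at the same path in the secondary-region storage account.
dest_file = ShareFileClient.from_connection_string(
    "<secondary-account-connection-string>",
    share_name="myshare",
    file_path="reports/q2.xlsx",
)

# Kick off a server-side copy from the primary share to the secondary share.
dest_file.start_copy_from_url(source_url)

A real script would enumerate the directories and files on the source share and repeat the copy for each one.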

Access control options to secure your data


Another part of data protection is securing your data. You have a few different options for this. Azure Files has long supported access control via the storage account key, which is Windows Challenge/Response (NTLM)-based and can be rotated on a regular basis. Any user with storage account key access has superuser permissions. Azure Files also now supports identity-based authentication and access control over Server Message Block (SMB) using on-premises Active Directory (preview) or Azure Active Directory Domain Services (Azure AD DS). Identity-based authentication is Kerberos-based and allows you to enforce granular access control to your Azure file shares.

Once either on-premises AD or Azure AD DS is configured, you can grant share-level access via built-in role-based access control (RBAC) roles or custom roles for Azure AD identities, and you can also configure directory- and file-level permissions using standard Windows file permissions (also known as NTFS ACLs).

Multiple data protection strategies for Azure Files


Azure Files gives you many tools to protect your data. Soft delete for Azure file shares protects against accidental deletion, while share snapshots are point-in-time copies of your Azure file share that you can take manually or automatically via Azure Backup and then restore from. To ensure high availability, you have a variety of replication and redundancy options to choose from. In addition, you can ensure appropriate access to your Azure file share with identity-based access control.

Source: microsoft.com