Tuesday 23 April 2024

Azure high-performance computing leads to developing amazing products at Microsoft Surface

The Microsoft Surface organization exists to create iconic end-to-end experiences across hardware, software, and services that people love to use every day. We believe that products are a reflection of the people who build them, and that the right tools and infrastructure can complement the talent and passion of designers and engineers to deliver innovative products. Product level simulation models are routinely used in day-to-day decision making on design, reliability, and product features. The organization is also on a multi-year journey to deliver differentiated products in a highly efficient manner. Microsoft Azure HPC plays a vital role in enabling this vision. Below is an account of how we were able to do more with less by leveraging the power of simulation and Azure HPC.


Surface devices development on Microsoft Azure 


I’m a Principal Engineer at Microsoft and a structural analyst. I’ve been a heavy user of Azure HPC and an early adopter of Azure A8 and A9 virtual machines. In 2015, with the help of our Surface IT team, we deployed Abaqus, a finite element analysis (FEA) software package, in Azure HPC and solved many implementation issues along the way. By 2016, product-level structural simulations for Surface Pro 4 and the original Surface laptop had fully migrated from on-premises servers to Azure HPC. Large models with millions of degrees of freedom became routine and were easily solved on Azure HPC. This early use of simulations enabled problem solving for design engineers tasked with robustness and reliability metrics. Usage grew along with the product line. Together with my colleagues Pritul Shah, Senior Director of a cross-product engineering team, and Jarkko Sihvonen, Senior Engineer on the IT Infrastructure and Services team, I worked to scale up the structural simulation footprint in our organization. The vision of building a global simulation team required access to computing servers in Western North America and Southeast Asia, which the Surface IT and Azure HPC teams deployed easily.


Product development: Surface laptop  


The availability of Azure HPC for structural simulations using Abaqus helped make simulation a primary development tool for product design. Design concepts created in digital computer-aided design (CAD) systems are translated into detailed FEA models. These are true digital prototypes that include all major subsystems in the device. The analyst can use FEA models to impose different test and reliability conditions in a virtual environment and determine feasibility. In a few days, hundreds of simulations are executed to evaluate various design ideas and solutions to make the device robust. Subsequently, the selected design becomes a prototype and is then subjected to rigorous testing under real-world use conditions. Multiple feedback loops are built into our engineering process to compare actual tests with FEA results for model validation.


In the first graphic depicted above, a digital prototype (FEA model) of a laptop device is set up to drop on its corner onto the floor. This models the real-world physical testing that is conducted in our Reliability Engineering labs. The impact velocity for a given height is the initial condition for the dynamic simulation. The dynamic drop simulation is executed on hundreds of cores of an Azure HPC cluster using the Abaqus solver. We used the Abaqus/Explicit solver, which is known for robust and accurate solutions of high-speed, nonlinear, dynamic events such as consumer electronics drop testing and automotive crashworthiness. These solvers are optimized especially for Azure HPC clusters and can scale to thousands of cores for fast throughput. The simulation jobs complete in a matter of a few hours on these optimized Azure HPC servers instead of the days they used to take. Analysts review the results and check stress levels against material limits. Design teams and analysts then review the reports and make design updates. This cycle continues in very quick loops because the Azure HPC servers enable fast turnaround for reviews.
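The relation between drop height and impact velocity mentioned above is the standard free-fall formula, v = sqrt(2gh). The following is a minimal sketch of that arithmetic only, assuming a drop from rest with no air resistance. The drop heights are hypothetical and are not values from the Surface reliability tests, and the function name is invented for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def impact_velocity(drop_height_m: float) -> float:
    """Free-fall impact speed in m/s for a drop from rest, ignoring air resistance."""
    if drop_height_m < 0:
        raise ValueError("drop height must be non-negative")
    return math.sqrt(2.0 * G * drop_height_m)


# Hypothetical drop heights, chosen only to illustrate the formula.
for height_m in (0.5, 1.0, 1.5):
    print(f"{height_m:.1f} m -> {impact_velocity(height_m):.2f} m/s")
```

In a drop simulation, a value computed this way would typically be applied as the initial velocity of the device just before contact, so the solver does not have to simulate the fall itself.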


The second graphic depicts an example of a hinge in the device that was optimized for strength. From the simulation, the team was able to visualize the impact-induced motion and stress levels of the hinge’s internal parts. This enabled us to isolate the main issue and redesign the hinge assembly for lower stress levels. Significant time was saved in the design process because only one iteration was needed for success. Tooling, physical prototyping, and testing costs were also saved.

Presently, the entire Microsoft Surface product line utilizes this approach of validating design with digital prototypes (FEA models) run on Azure HPC clusters. Thousands of simulation jobs are executed routinely in a matter of weeks to enable cutting-edge designs that have very high reliability and customer satisfaction. 

Source: microsoft.com

Saturday 20 April 2024

Azure IoT’s industrial transformation strategy on display at Hannover Messe 2024


Running and transforming a successful enterprise is like being the coach of a championship-winning sports team. To win the trophy, you need a strategy, game plans, and the ability to bring all the players together. In the early days of training, coaches relied on basic drills, manual strategies, and simple equipment. But as technology advanced, so did the art of coaching. Today, coaches use data-driven training programs, performance tracking technology, and sophisticated game strategies to achieve unimaginable performance and secure victories.

We see a similar change happening in industrial production management and performance, and we are excited to showcase how we are innovating with our products and services to help you succeed in the modern era. Microsoft recently launched two accelerators for industrial transformation:

◉ Azure’s adaptive cloud approach—a new strategy
◉ Azure IoT Operations (preview)—a new product

Our adaptive cloud approach connects teams, systems, and sites through consistent management tools, development patterns, and insight generation. Putting the adaptive cloud approach into practice, IoT Operations leverages open standards and works with Microsoft Fabric to create a common data foundation for IT and operational technology (OT) collaboration.

We will be demonstrating these accelerators in the Microsoft booth at Hannover Messe 2024, presenting the new approach on the Microsoft stage, and will be ready to share exciting partnership announcements that enable interoperability in the industry.  

Experience the future of automation with IoT Operations 


Using our adaptive cloud approach, we’ve built a robotic assembly line demonstration that puts together car battery parts for attendees of the event. This production line is partner-enabled and features a standard OT environment, including solutions from Rockwell Automation and PTC. IoT Operations was used to build a monitoring solution for the robots because it embraces industry standards, like Open Platform Communications Unified Architecture (OPC UA), and integrates with existing infrastructure to connect data from an array of OT devices and systems, and flow it to the right places and people. IoT Operations processes data at the edge for local use by multiple applications and sends insights to the cloud for use by multiple applications there too, reducing data fragmentation.  

For those attending Hannover Messe 2024, head to the center of the Microsoft booth and look for the station “Achieve industrial transformation across the value chain.”  

Consult with Azure experts on IT and OT collaboration tools 


Find out how Microsoft Azure’s open and standardized strategy, an adaptive cloud approach, can help you reach the next stage of industrial transformation. Our experts will help your team collect data from assets and systems on the shop floor, compute at the edge, integrate that data into multiple solutions, and create production analytics on a global scale. Whether you’re just starting to connect and digitize your operations, or you’re ready to analyze and reason with your data, make predictions, and apply AI, we’re here to assist.  

For those attending Hannover Messe 2024, these experts are located at the demonstration called “Scale solutions and interoperate with IoT, edge, and cloud innovation.” 

Check out Jumpstart to get your collaboration environment up and running. In May 2024, Jumpstart will have a comprehensive scenario designed for manufacturing.

Attend a presentation on modernizing the shop floor  


We will share the results of a survey on the latest trends, technologies, and priorities for manufacturing companies wanting to efficiently manage their data to prepare for AI and accelerate industrial transformation. 73% of manufacturers agreed that a scalable technology stack is an important paradigm for the future of factories. To make that a reality, manufacturers are making changes to modernize, such as adopting containerization, shifting to central management of devices, and emphasizing IT and OT collaboration tools. These modernization trends can maximize the ROI of existing infrastructure and solutions, enhance security, and apply AI at the edge. 

This presentation, “How manufacturers prepare shopfloors for a future with AI,” will take place in the Microsoft theater at our booth in Hall 17 at Hannover Messe 2024 on Monday, April 22, 2024, at 2:00 PM CEST.

Learn about actions and initiatives driving interoperability  


Microsoft is strengthening and supporting the industrial ecosystem to enable at-scale transformation and interoperate solutions. Our adaptive cloud approach both incorporates existing investments in partner technology and builds a foundation for consistent deployment patterns and repeatability for scale.  

Our ecosystem of partners

Microsoft is building an ecosystem of connectivity partners to modernize industrial systems and devices. These partners provide data translation and normalization services across heterogeneous environments for a seamless and secure data flow on the shop floor, and from the shop floor to the cloud. We leverage open standards and provide consistent control and management capabilities for OT and IT assets. To date, we have established integrations with Advantech, Softing, and PTC. 

Siemens and Microsoft have announced the convergence of the Digital Twin Definition Language (DTDL) with the W3C Web of Things standard. This convergence will help consolidate digital twin definitions for assets in the industry and enable new technology innovation like automatic asset onboarding with the help of generative AI technologies.

Microsoft embraces open standards and interoperability. Our adaptive cloud approach is based on those principles. We are thrilled to join project Margo, a new ecosystem-led initiative, that will help industrial customers achieve their digital transformation goals with greater speed and efficiency. Margo will define how edge applications, edge devices, and edge orchestration software interoperate with each other with increased flexibility.

Discover solutions with Microsoft


Visit our booth and speak with our experts to reach new heights of industrial transformation and prepare the shop floor for AI. Together, we will maximize your existing investments and drive scale in the industry. We look forward to working with you.


Source: microsoft.com

Friday 19 April 2024

Microsoft Entra resilience update: Workload identity authentication

Microsoft Entra is not only the identity system for users; it’s also the identity and access management (IAM) system for Azure-based services, all internal infrastructure services at Microsoft, and our customers’ workload identities. This is why our 99.99% service-level promise extends to workload identity authentication, and why we continue to improve our service’s resilience through a multilayered approach that includes the backup authentication system.

In 2021, we introduced the backup authentication system, as an industry-first innovation that automatically and transparently handles authentications for supported workloads when the primary Microsoft Entra ID service is degraded or unavailable. Through 2022 and 2023, we continued to expand the coverage of the backup service across clouds and application types. 

Today, we’ll build on our resilience blogpost series by going further in sharing how workload identities gain resilience from the regionally isolated authentication endpoints as well as from the backup authentication system. We’ll explore two complementary methods that best fit our regional-global infrastructure. One example of workload identity authentication is when an Azure virtual machine (VM) authenticates its identity to Azure Storage. Another example is when one of our customers’ workloads authenticates to application programming interfaces (APIs).

Regionally isolated authentication endpoints 

 
Regionally isolated authentication endpoints provide region-isolated authentication services to an Azure region. All frequently used identities will authenticate successfully without dependencies on other Azure regions. Essentially, they are the primary endpoints for Azure infrastructure services as well as the primary endpoints for managed identities for Azure resources. Managed identities help prevent out-of-region failures by consolidating service dependencies, and they improve resilience by handling certificate expiry, rotation, and trust.

This layer of protection and isolation does not need any configuration changes from Azure customers. Key Azure infrastructure services have already adopted it, and it’s integrated with the managed identities service to protect the customer workloads that depend on it. 

How regionally isolated authentication endpoints work 


Each Azure region is assigned a unique endpoint for workload identity authentication. The region is served by a regionally collocated, special instance of Microsoft Entra ID. The regional instance relies on caching metadata (for example, directory data that is needed to issue tokens locally) to respond efficiently and resiliently to the workload identity’s authentication requests. This lightweight design reduces dependencies on other services and improves resilience by allowing the entire authentication to be completed within a single region. Data in the local cache is proactively refreshed. 

The regional service depends on Microsoft Entra ID's global service to update and refill caches when it lacks the data it needs (a cache miss) or when it detects a change in the security posture for a supported service. If the regional service experiences an outage, requests are served seamlessly by Microsoft Entra ID’s global service, making the regional service interruption invisible to the customers.
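The interaction described above (serve from the regional cache, refill from the global service on a cache miss, and fall back to the global service when the regional endpoint is down) can be sketched as follows. This is an illustrative toy model, not Microsoft Entra ID’s implementation: every class, method, and token format here is hypothetical.

```python
from typing import Dict


class GlobalService:
    """Stand-in for the global service that holds the full directory."""

    def __init__(self, directory: Dict[str, dict]):
        self.directory = directory

    def lookup(self, identity_id: str) -> dict:
        return self.directory[identity_id]  # KeyError for unknown identities

    def issue_token(self, identity_id: str) -> str:
        self.lookup(identity_id)
        return f"token:{identity_id}:global"


class RegionalService:
    """Stand-in for a regional endpoint that issues tokens from cached metadata."""

    def __init__(self, global_service: GlobalService):
        self.global_service = global_service
        self.cache: Dict[str, dict] = {}
        self.available = True
        self.cache_misses = 0

    def issue_token(self, identity_id: str) -> str:
        if not self.available:
            raise ConnectionError("regional endpoint unavailable")
        if identity_id not in self.cache:
            # Cache miss: refill from the global service, then serve locally.
            self.cache_misses += 1
            self.cache[identity_id] = self.global_service.lookup(identity_id)
        return f"token:{identity_id}:regional"


def authenticate(identity_id: str, regional: RegionalService,
                 global_service: GlobalService) -> str:
    """Prefer the regional endpoint; fall back to global during a regional outage."""
    try:
        return regional.issue_token(identity_id)
    except ConnectionError:
        return global_service.issue_token(identity_id)


global_service = GlobalService({"vm-1": {"tenant": "example"}})
regional = RegionalService(global_service)
print(authenticate("vm-1", regional, global_service))  # served regionally
regional.available = False
print(authenticate("vm-1", regional, global_service))  # served globally
```

The sketch leaves out proactive cache refresh and security-posture change detection, both of which the text describes as part of the real service.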

Performant, resilient, and widely available 


The service has proven itself since 2020 and now serves six billion requests per day across the globe. The regional endpoints, working with global services, exceed the 99.99% SLA. The resilience of Azure infrastructure is further protected by workload-side caches kept by Azure client SDKs. Together, the regional and global services have made most service degradations undetectable by dependent infrastructure services. Post-incident recovery is handled automatically. Regional isolation is supported in the public cloud and in all Sovereign Clouds.
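To put these figures in perspective, the short sketch below converts an availability percentage into a downtime budget and a daily request count into an average rate. The 30-day measurement window is an assumption made for illustration, and the actual SLA terms define how availability is measured. The function names are hypothetical.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Downtime budget in minutes over `days` for a given availability fraction."""
    return (1.0 - availability) * days * 24 * 60


def average_requests_per_second(requests_per_day: float) -> float:
    """Average request rate implied by a daily total (86,400 seconds per day)."""
    return requests_per_day / 86_400


print(f"{allowed_downtime_minutes(0.9999):.2f} minutes per 30 days")
print(f"{average_requests_per_second(6e9):,.0f} requests per second on average")
```

Under these assumptions, 99.99% availability corresponds to a little over four minutes of downtime per 30 days, and six billion requests per day averages to roughly 69,000 requests per second.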

Infrastructure authentication requests are processed by the same Azure datacenter that hosts the workloads along with their co-located dependencies. This means that endpoints that are isolated to a region also benefit from performance advantages. 

 

Backup authentication system to cover workload identities for infrastructure authentication 

 
For workload identity authentication that does not depend on managed identities, we’ll rely on the backup authentication system to add fault-tolerant resilience. In our blogpost from November 2021, we explained the approach for user authentication which has been generally available for some time. The system operates in the Microsoft cloud but on separate and decorrelated systems and network paths from the primary Microsoft Entra ID system. This means that it can continue to operate in case of service, network, or capacity issues across many Microsoft Entra ID and dependent Azure services. We are now applying that successful approach to workload identities. 

Backup coverage of workload identities is currently rolling out systematically across Microsoft, starting with Microsoft 365’s largest internal infrastructure services in the first half of 2024. Microsoft Entra ID customer workload identities’ coverage will follow in the second half of 2025. 


Protecting your own workloads 

 
The benefits of both regionally isolated endpoints and the backup authentication system are natively built into our platform. To further optimize the benefits of current and future investments in resilience and security, we encourage developers to use the Microsoft Authentication Library (MSAL) and leverage managed identities whenever possible. 

What’s next? 
 
We want to assure our customers that our 99.99% uptime guarantee remains in place, along with our ongoing efforts to expand our backup coverage system and increase our automatic backup coverage to include all infrastructure authentication—even for third-party developers—in the next year. We’ll make sure to keep you updated on our progress, including planned improvements to our system capacity, performance, and coverage across all clouds. 

Source: microsoft.com

Saturday 13 April 2024

Advancing memory leak detection with AIOps—introducing RESIN


In the ever-evolving landscape of cloud computing, memory leaks represent a persistent challenge, affecting performance, stability, and ultimately the user experience. Memory leak detection is therefore important to cloud service quality. A memory leak happens when memory is allocated but, unintentionally, is not released in a timely manner. Leaks can degrade the performance of the affected component and may crash the operating system (OS). Even worse, they often affect other processes running on the same machine, causing them to slow down or even be killed.

Given the impact of memory leaks, there are many studies and solutions for detecting them. Traditional detection solutions fall into two categories: static and dynamic. Static leak detection techniques analyze software source code to deduce potential leaks, whereas dynamic techniques detect leaks by instrumenting a program and tracking object references at runtime.

However, these conventional techniques are not adequate for leak detection in a cloud environment. The static approaches have limited accuracy and scalability, especially for leaks that result from cross-component contract violations, which need rich domain knowledge to capture statically. In general, the dynamic approaches are more suitable for a cloud environment. However, they are intrusive and require extensive instrumentation. Furthermore, they introduce high runtime overhead, which is costly for cloud services.

Introducing RESIN


Today, we are introducing RESIN, an end-to-end memory leak detection service designed to holistically address memory leaks in large cloud infrastructure. RESIN has been used in Microsoft Azure production and demonstrated effective leak detection with high accuracy and low overhead.

RESIN system workflow


A large cloud infrastructure can consist of hundreds of software components owned by different teams. Prior to RESIN, memory leak detection in Microsoft Azure was left to individual teams. As shown in Figure 1, RESIN takes a centralized approach that conducts leak detection in multiple stages for low overhead, high accuracy, and scalability. This approach does not require access to components’ source code, extensive instrumentation, or recompilation.


Figure 1: RESIN workflow

RESIN conducts low-overhead monitoring, using monitoring agents to collect memory telemetry data at the host level. A remote service aggregates and analyzes data from different hosts using a bucketization-pivot scheme. When a leak is detected in a bucket, RESIN triggers an analysis of the process instances in that bucket. For highly suspicious leaks, RESIN performs live heap snapshotting and compares the result to regular heap snapshots in a reference database. After generating multiple heap snapshots, RESIN runs a diagnosis algorithm to localize the root cause of the leak and generates a diagnosis report, which it attaches to the alert ticket to assist developers with further analysis. Finally, RESIN automatically mitigates the leaking process.

Detection algorithms


There are unique challenges in memory leak detection in cloud infrastructure:

  • Noisy memory usage, caused by changing workloads and interference in the environment, makes detection with a static threshold-based approach unreliable.
  • Memory leaks in production systems are usually fail-slow faults that can last days, weeks, or even months, and it can be difficult to capture gradual change over long periods in a timely manner.
  • At the scale of the Azure global cloud, it’s not practical to collect fine-grained data over long periods of time.

To address these challenges, RESIN uses a two-level scheme to detect memory leak symptoms: a global, bucket-based pivot analysis to identify suspicious components, and a local, per-process leak detection to identify leaking processes.

With the bucket-based pivot analysis at the component level, we categorize raw memory usage into a number of buckets and transform the usage data into a summary of the number of hosts in each bucket. In addition, a severity score for each bucket is calculated based on the deviations and the host count in the bucket. Anomaly detection is performed on the time-series data of each bucket of each component. The bucketization approach not only represents the workload trend robustly, with tolerance for noise, but also reduces the computational load of the anomaly detection.
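The bucketization idea can be sketched in a few lines. This is an illustrative toy, not RESIN’s algorithm: the bucket edges, the function names, and the simple z-score test standing in for the anomaly detector and severity score are all assumptions.

```python
from bisect import bisect_right
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Sequence

# Hypothetical bucket edges in MB; the source does not state RESIN's boundaries.
BUCKET_EDGES_MB = (256, 512, 1024, 2048)


def bucket_index(usage_mb: float, edges: Sequence[float] = BUCKET_EDGES_MB) -> int:
    """Map one host's memory usage to a bucket index (0 = lowest usage)."""
    return bisect_right(edges, usage_mb)


def hosts_per_bucket(usage_by_host: Dict[str, float],
                     edges: Sequence[float] = BUCKET_EDGES_MB) -> List[int]:
    """Summarize raw per-host usage as a host count per bucket."""
    counts = Counter(bucket_index(u, edges) for u in usage_by_host.values())
    return [counts[i] for i in range(len(edges) + 1)]


def growing_buckets(history: List[List[int]], latest: List[int],
                    z_threshold: float = 3.0) -> List[int]:
    """Flag buckets whose latest host count is far above their own history.

    A z-score with a floor on the spread is a simple stand-in for a real
    time-series anomaly detector.
    """
    flagged = []
    for i, count in enumerate(latest):
        past = [snapshot[i] for snapshot in history]
        spread = max(pstdev(past), 1.0)
        if (count - mean(past)) / spread > z_threshold:
            flagged.append(i)
    return flagged


history = [[10, 5, 1, 0, 0]] * 4   # host counts per bucket in earlier intervals
latest = [6, 5, 1, 0, 4]           # four hosts have moved into the top bucket
print(growing_buckets(history, latest))
```

Because the detector sees only a handful of counts per component rather than every host’s raw series, a single noisy host barely moves the summary, while a steady drift of hosts into high-usage buckets stands out.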

However, detection at component level only is not sufficient for developers to investigate the leak efficiently because, normally, many processes run on a component. When a leaking bucket is identified at the component level, RESIN runs a second-level detection scheme at the process granularity to narrow down the scope of investigation. It outputs the suspected leaking process, its start and end time, and the severity score.
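A second-level check at process granularity might look like the sketch below, which fits a straight line to each process’s memory samples and reports the processes that are growing steadily. The least-squares slope, the threshold, and the use of the growth rate as the severity score are assumptions made for illustration. The source does not describe RESIN’s per-process detector in this detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class LeakSuspect:
    process: str
    start_hour: float
    end_hour: float
    severity: float  # in this sketch: fitted growth rate in MB per hour


def growth_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (hour, memory_mb) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    m_mean = sum(m for _, m in samples) / n
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span at least two distinct times")
    return num / den


def find_leak_suspects(series_by_process: Dict[str, Sequence[Tuple[float, float]]],
                       min_rate_mb_per_hour: float = 1.0) -> List[LeakSuspect]:
    """Return processes whose memory grows faster than the threshold, worst first."""
    suspects = []
    for name, samples in series_by_process.items():
        rate = growth_rate(samples)
        if rate >= min_rate_mb_per_hour:
            suspects.append(LeakSuspect(name, samples[0][0], samples[-1][0], rate))
    return sorted(suspects, key=lambda s: s.severity, reverse=True)


series = {
    "steady": [(0, 100), (1, 101), (2, 100), (3, 101)],
    "leaky": [(hour, 100 + 5 * hour) for hour in range(4)],
}
for suspect in find_leak_suspects(series):
    print(suspect)
```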

Diagnosis of detected leaks


Once a memory leak is detected, RESIN takes a snapshot of the live heap, which contains all memory allocations referenced by the running application, and analyzes the snapshots to pinpoint the root cause of the detected leak. This makes memory leak alerts actionable.

RESIN also leverages the Windows heap manager’s snapshot capability to perform live profiling. However, heap collection is expensive and can be intrusive to the host’s performance. To minimize the overhead of heap collection, several considerations determine how snapshots are taken.

  • The heap manager stores only limited information in each snapshot, such as the stack trace and size of each active allocation.
  • RESIN prioritizes candidate hosts for snapshotting based on leak severity, noise level, and customer impact. By default, the top three hosts in the suspected list are selected to ensure successful collection.
  • RESIN utilizes a long-term, trigger-based strategy to ensure the snapshots capture the complete leak. To facilitate the decision regarding when to stop the trace collection, RESIN analyzes memory growth patterns (such as steady, spike, or stair) and takes a pattern-based approach to decide the trace completion triggers.
  • RESIN uses a periodic fingerprinting process to build reference snapshots, which are compared with the snapshot of the suspected leaking process to support diagnosis.
  • RESIN analyzes the collected snapshots to output stack traces of the root cause.
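The comparison of a suspect snapshot against a reference snapshot can be illustrated with a small sketch. It assumes a snapshot is simply a list of (stack trace, allocation size) pairs, which matches the limited information the text says each snapshot holds. The function names and the ranking by byte growth are hypothetical, not RESIN’s diagnosis algorithm.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Snapshot = List[Tuple[str, int]]  # (stack trace, allocation size in bytes)


def bytes_by_stack(snapshot: Snapshot) -> Dict[str, int]:
    """Total live bytes attributed to each allocating stack trace."""
    totals: Dict[str, int] = defaultdict(int)
    for stack, size in snapshot:
        totals[stack] += size
    return dict(totals)


def rank_suspect_stacks(reference: Snapshot, suspect: Snapshot,
                        top_n: int = 3) -> List[Tuple[str, int]]:
    """Stack traces whose live bytes grew most relative to the reference."""
    ref_totals = bytes_by_stack(reference)
    growth = {
        stack: total - ref_totals.get(stack, 0)
        for stack, total in bytes_by_stack(suspect).items()
    }
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return [(stack, delta) for stack, delta in ranked if delta > 0][:top_n]


# Made-up stack traces and sizes, purely for illustration.
reference = [("net!Send", 4096), ("cache!Insert", 8192)]
suspect = [("net!Send", 4096), ("cache!Insert", 8192),
           ("cache!Insert", 65536), ("log!Append", 1024)]
print(rank_suspect_stacks(reference, suspect))
```

Grouping by stack trace is what makes the output actionable: a developer receives the code paths responsible for the growth rather than a list of individual allocations.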

Mitigation of detected leaks


When a memory leak is detected, RESIN attempts to mitigate it automatically to avoid further customer impact. Several types of mitigation actions are available, depending on the nature of the leak. RESIN uses a rule-based decision tree to choose the action that minimizes impact.

If the memory leak is localized to a single process or Windows service, RESIN attempts the lightest mitigation by simply restarting the process or the service. OS reboot can resolve software memory leaks but takes a much longer time and can cause virtual machine downtime and as such, is normally reserved as the last resort. For a non-empty host, RESIN utilizes solutions such as Project Tardigrade, which skips hardware initialization and only performs a kernel soft reboot, after live virtual machine migration, to minimize user impact. A full OS reboot is performed only when the soft reboot is ineffective.

RESIN stops applying mitigation actions to a target once the detection engine no longer considers the target leaking.
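The escalation order described in this section can be written down as a small rule-based function. This is a sketch of the rules as stated in the text, not RESIN’s actual decision tree. The enum, the parameter names, and the handling of an empty host (a soft reboot without migration) are assumptions.

```python
from enum import Enum
from typing import List


class Action(Enum):
    RESTART_PROCESS_OR_SERVICE = "restart process or service"
    LIVE_MIGRATE_VMS = "live-migrate virtual machines"
    KERNEL_SOFT_REBOOT = "kernel soft reboot"
    FULL_OS_REBOOT = "full OS reboot"


def choose_mitigation(still_leaking: bool,
                      localized_to_process: bool,
                      host_has_vms: bool,
                      soft_reboot_failed: bool) -> List[Action]:
    """Pick the lightest mitigation steps that fit the situation."""
    if not still_leaking:
        return []  # detection no longer flags the target: stop mitigating
    if localized_to_process:
        return [Action.RESTART_PROCESS_OR_SERVICE]
    if soft_reboot_failed:
        return [Action.FULL_OS_REBOOT]  # last resort
    steps = []
    if host_has_vms:
        steps.append(Action.LIVE_MIGRATE_VMS)  # move VMs off before rebooting
    steps.append(Action.KERNEL_SOFT_REBOOT)
    return steps


print(choose_mitigation(True, False, True, False))
```

Ordering the rules from least to most disruptive means the cheapest action that can plausibly fix the leak is always tried first.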

Result and impact of memory leak detection


RESIN has been running in production in Azure since late 2018 and to date, it has been used to monitor millions of host nodes and hundreds of host processes daily. Overall, we achieved 85% precision and 91% recall with RESIN memory leak detection, despite the rapidly growing scale of the cloud infrastructure monitored.

The end-to-end benefits brought by RESIN are clearly demonstrated by two key metrics:

1. Virtual machine unexpected reboots: the average number of reboots per one hundred thousand hosts per day due to low memory.
2. Virtual machine allocation error: the ratio of erroneous virtual machine allocation requests due to low memory.

Between September 2020 and December 2023, the virtual machine reboots were reduced by nearly 100 times, and allocation error rates were reduced by over 30 times. Furthermore, since 2020, no severe outages have been caused by Azure host memory leaks.

Source: microsoft.com

Tuesday 9 April 2024

Azure Maia for the era of AI: From silicon to software to systems


As the pace of AI and the transformation it enables across industries continues to accelerate, Microsoft is committed to building and enhancing our global cloud infrastructure to meet the needs of customers and developers with faster, more performant, and more efficient compute and AI solutions. Azure AI infrastructure comprises technology from industry leaders as well as Microsoft’s own innovations, including Azure Maia 100, Microsoft’s first in-house AI accelerator, announced in November 2023. In this blog, we will dive deeper into the technology and journey of developing Azure Maia 100: hardware and software co-designed from the ground up, built to run cloud-based AI workloads, and optimized for Azure AI infrastructure.

Azure Maia 100, pushing the boundaries of semiconductor innovation


Maia 100 was designed to run cloud-based AI workloads, and the design of the chip was informed by Microsoft’s experience in running complex and large-scale AI workloads such as Microsoft Copilot. Maia 100 is one of the largest processors made on a 5nm node using advanced packaging technology from TSMC.

Through collaboration with Azure customers and leaders in the semiconductor ecosystem, such as foundry and EDA partners, we will continue to apply real-world workload requirements to our silicon design, optimizing the entire stack from silicon to service, and delivering the best technology to our customers to empower them to achieve more.


End-to-end systems optimization, designed for scalability and sustainability 


When developing the architecture for the Azure Maia AI accelerator series, Microsoft reimagined the end-to-end stack so that our systems could handle frontier models more efficiently and in less time. AI workloads demand infrastructure that is dramatically different from other cloud compute workloads, requiring increased power, cooling, and networking capability. Maia 100’s custom rack-level power distribution and management integrates with Azure infrastructure to achieve dynamic power optimization. Maia 100 servers are designed with a fully-custom, Ethernet-based network protocol with aggregate bandwidth of 4.8 terabits per accelerator to enable better scaling and end-to-end workload performance.  

When we developed Maia 100, we also built a dedicated “sidekick” to match the thermal profile of the chip and added rack-level, closed-loop liquid cooling to Maia 100 accelerators and their host CPUs to achieve higher efficiency. This architecture allows us to bring Maia 100 systems into our existing datacenter infrastructure, and to fit more servers into these facilities, all within our existing footprint. The Maia 100 sidekicks are also built and manufactured to meet our zero waste commitment.


Co-optimizing hardware and software from the ground up with the open-source ecosystem


From the start, transparency and collaborative advancement have been core tenets in our design philosophy as we build and develop Microsoft’s cloud infrastructure for compute and AI. Collaboration enables faster iterative development across the industry—and on the Maia 100 platform, we’ve cultivated an open community mindset from algorithmic data types to software to hardware.  

To make it easy to develop AI models on Azure AI infrastructure, Microsoft is creating the software for Maia 100 that integrates with popular open-source frameworks like PyTorch and ONNX Runtime. The software stack provides rich and comprehensive libraries, compilers, and tools to equip data scientists and developers to successfully run their models on Maia 100. 


To optimize workload performance, AI hardware typically requires development of custom kernels that are silicon-specific. We envision seamless interoperability among AI accelerators in Azure, so we have integrated Triton from OpenAI. Triton is an open-source programming language that simplifies kernel authoring by abstracting the underlying hardware. This will empower developers with complete portability and flexibility without sacrificing efficiency and the ability to target AI workloads. 


Maia 100 is also the first implementation of the Microscaling (MX) data format, an industry-standardized data format that leads to faster model training and inferencing times. Microsoft has partnered with AMD, ARM, Intel, Meta, NVIDIA, and Qualcomm to release the v1.0 MX specification through the Open Compute Project community so that the entire AI ecosystem can benefit from these algorithmic improvements.

Azure Maia 100 is a unique innovation combining state-of-the-art silicon packaging techniques, ultra-high-bandwidth networking design, modern cooling and power management, and algorithmic co-design of hardware with software. We look forward to continuing to advance our goal of making AI real by introducing more silicon, systems, and software innovations into our datacenters globally.

Source: microsoft.com

Saturday 6 April 2024

Get ready for AI at the Migrate to Innovate digital event

Organizations of all sizes are recognizing that using AI fuels the kind of innovation that’s needed to maintain a competitive edge. What’s often less clear is how to prepare your organization to be able to take full advantage of AI. For organizations running business-critical workloads on Windows Servers and SQL Server, how do you get from running in a traditional, on-premises environment to operating in an environment that supports AI and other modern technologies?

Get ready for AI


To get a solid understanding of how to navigate this type of move, join us at the free digital event Migrate to Innovate: Be AI Ready, Be Secure on Tuesday, April 16, 2024, from 9:00 AM to 11:00 AM Pacific Time. Register now to attend information-packed sessions that will help you understand the challenges organizations face in preparing for AI and cloud-native technologies, and what you need to have in place to solve those challenges.

Address your most pressing business challenges


One of the biggest obstacles in the path to modernization is balancing the need to embrace the latest advancements with the need to meet current business challenges. Whether it’s managing rising costs, safeguarding against security threats, maintaining compliance, or controlling IT sprawl as business expands, there are a lot of different priorities competing for your focus, time, and resources.

At the Migrate to Innovate digital event, you’ll learn how Azure provides an optimized platform to fully embrace AI while addressing your most pressing business priorities by maximizing ROI, performance, and resilience. Sessions will focus on how to optimize migration of your Windows Server and SQL Server workloads to Azure to position your organization for innovation, efficiency, growth, and long-term success.

Be the first to see product updates, deep dives, and demos


Attend the Migrate to Innovate digital event to get a first look at what’s included in the upcoming Windows Server 2025 release. View product demos of the newest AI innovations, including Microsoft Copilot. Join product experts for deep-dive product sessions covering Windows Server, SQL Server, AI, security, and a range of modernization-related capabilities. Learn about the latest updates on intelligent Azure databases to power your data and AI workloads, and discover strategies for gaining cloud agility, including running VMware workloads across cloud, hybrid, and on-premises environments.

Session highlights include:

  • Keynote address—Understand current business challenges and learn how migrating to Azure provides the agility that’s needed to address them. Hear about the latest product announcements and advancements that will help you get ready to take advantage of AI.
  • Migrate to Azure to be AI Ready—Learn the steps that organizations need to take to be ready for AI. Watch demos showing how customers are using AI solutions—including Microsoft Copilot in Azure—to solve complex problems, and how migration accelerates innovation with AI.
  • Customer interview—Hear customers discuss why they chose to make the move to Azure, and how migration has provided them with the business outcomes they need for success, including AI-readiness, security, cost savings, and performance.
  • Migrate your Windows Server to Azure—Learn how Azure is optimized to help you migrate and modernize your Windows Server workloads. Discover on-premises, hybrid, and cloud scenarios for VMware. Watch a demo on Windows Server 2025 and its support for AI capabilities as well as hybrid and edge scenarios.
  • Migrate your Data to Azure—In the era of AI, learn how to power mission-critical applications with Azure databases. See how to simplify migration with the help of Azure Arc migration assessments.
  • New risks, New rules: Secure Code-to-Cloud Migration—Find out how Azure helps you secure your entire migration journey using a cloud-native application protection platform (CNAPP), Microsoft Defender for Cloud, and Microsoft Copilot for Security.
  • Get cloud agility anywhere: Strategies for VMware workloads—Understand the issues that on-premises VMware users face, and learn how taking an adaptive cloud approach with Azure helps address these challenges.  
  • Optimize your migration with key guidance, offerings, and tools—Learn about the three most important optimization activities, and discover resources, guidance, and tools that will help you plan and implement your migration solution.

Discover the business outcomes of migrating to Azure


Register for the event to understand how to get your organization on the path to modernization and hear about the business outcomes that customers are seeing when they migrate Windows Server and SQL Server workloads to Azure, including:

  • AI readiness: Organizations get results with an AI-ready foundation on Azure. In a study of customers using Azure AI services, a composite organization based on the experiences of six interviewees achieved a three-year ROI of 284%. Work output and operational efficiency increased, employee collaboration and safety improved, and organizations reported faster and more data-driven decision-making.
  • Code-to-cloud security: With Azure, you get a complete code-to-cloud security platform. From foundational security to cloud-native workload protection, replacing multiple third-party security tools with comprehensive, multilayered security reduces risk and costs.
  • Maximizing ROI and performance: Workloads run faster and at a lower cost on Azure than with other cloud providers. AWS is up to 5 times more expensive than Azure for Windows Server and SQL Server—and SQL Server runs up to 5.5 times faster on Azure than on AWS.
  • Cloud agility anywhere: Azure meets organizations where they are in their migration journey through an adaptive cloud approach. Azure provides the tools and support to help you secure and govern your entire digital estate across hybrid, multicloud, and edge environments on your own terms.

Source: microsoft.com

Thursday 4 April 2024

Azure Maia for the era of AI: From silicon to software to systems

As the pace of AI and the transformation it enables across industries continues to accelerate, Microsoft is committed to building and enhancing our global cloud infrastructure to meet the needs of customers and developers with faster, more performant, and more efficient compute and AI solutions. Azure AI infrastructure comprises technology from industry leaders as well as Microsoft’s own innovations, including Azure Maia 100, Microsoft’s first in-house AI accelerator, announced in November 2023. In this blog post, we will dive deeper into the technology and journey of developing Azure Maia 100, the co-design of hardware and software from the ground up, built to run cloud-based AI workloads and optimized for Azure AI infrastructure.

Azure Maia 100, pushing the boundaries of semiconductor innovation


Maia 100 was designed to run cloud-based AI workloads, and the design of the chip was informed by Microsoft’s experience in running complex and large-scale AI workloads such as Microsoft Copilot. Maia 100 is one of the largest processors made on a 5nm node using advanced packaging technology from TSMC.

Through collaboration with Azure customers and leaders in the semiconductor ecosystem, such as foundry and EDA partners, we will continue to apply real-world workload requirements to our silicon design, optimizing the entire stack from silicon to service, and delivering the best technology to our customers to empower them to achieve more.

End-to-end systems optimization, designed for scalability and sustainability 


When developing the architecture for the Azure Maia AI accelerator series, Microsoft reimagined the end-to-end stack so that our systems could handle frontier models more efficiently and in less time. AI workloads demand infrastructure that is dramatically different from other cloud compute workloads, requiring increased power, cooling, and networking capability. Maia 100’s custom rack-level power distribution and management integrates with Azure infrastructure to achieve dynamic power optimization. Maia 100 servers are designed with a fully custom, Ethernet-based network protocol with aggregate bandwidth of 4.8 terabits per second per accelerator to enable better scaling and end-to-end workload performance.
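To put the networking figure in more familiar units, here is a small back-of-the-envelope sketch in Python. It assumes the 4.8 terabit figure is a per-second rate and uses decimal units (1 terabit = 10^12 bits). The function names and the 10 GB payload are hypothetical, chosen only for illustration, and the result is an ideal upper bound that ignores protocol overhead and latency.

```python
BITS_PER_BYTE = 8


def terabits_to_gigabytes_per_second(tbps: float) -> float:
    """Convert a rate in terabits per second to gigabytes per second (decimal units)."""
    bits_per_second = tbps * 1e12
    bytes_per_second = bits_per_second / BITS_PER_BYTE
    return bytes_per_second / 1e9


def ideal_transfer_seconds(payload_gb: float, tbps: float) -> float:
    """Ideal time to move payload_gb gigabytes at tbps, ignoring overhead and latency."""
    return payload_gb / terabits_to_gigabytes_per_second(tbps)


if __name__ == "__main__":
    rate_gb_s = terabits_to_gigabytes_per_second(4.8)
    print(f"4.8 Tb/s is about {rate_gb_s:.0f} GB/s")
    # Hypothetical 10 GB payload, used only to give the rate a sense of scale.
    print(f"10 GB would take about {ideal_transfer_seconds(10, 4.8) * 1000:.1f} ms")
```

Real collective operations will not reach this bound, so treat the output as a ceiling rather than a measurement.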

When we developed Maia 100, we also built a dedicated “sidekick” to match the thermal profile of the chip and added rack-level, closed-loop liquid cooling to Maia 100 accelerators and their host CPUs to achieve higher efficiency. This architecture allows us to bring Maia 100 systems into our existing datacenter infrastructure, and to fit more servers into these facilities, all within our existing footprint. The Maia 100 sidekicks are also built and manufactured to meet our zero waste commitment. 

Co-optimizing hardware and software from the ground up with the open-source ecosystem 


From the start, transparency and collaborative advancement have been core tenets in our design philosophy as we build and develop Microsoft’s cloud infrastructure for compute and AI. Collaboration enables faster iterative development across the industry—and on the Maia 100 platform, we’ve cultivated an open community mindset from algorithmic data types to software to hardware.  

To make it easy to develop AI models on Azure AI infrastructure, Microsoft is creating the software for Maia 100 that integrates with popular open-source frameworks like PyTorch and ONNX Runtime. The software stack provides rich and comprehensive libraries, compilers, and tools to equip data scientists and developers to successfully run their models on Maia 100. 

To optimize workload performance, AI hardware typically requires custom kernels that are silicon-specific. We envision seamless interoperability among AI accelerators in Azure, so we have integrated Triton from OpenAI. Triton is an open-source programming language that simplifies kernel authoring by abstracting the underlying hardware. This will give developers complete portability and flexibility without sacrificing efficiency or the ability to target AI workloads.
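To give a feel for what "abstracting the underlying hardware" means, the plain-Python sketch below emulates the block-level style that Triton kernels are written in: the author describes what one program instance does with one block of elements, and a launcher decides how many instances run. This is an illustration only. It does not use the Triton API and says nothing about how kernels are compiled for Maia 100; `add_kernel`, `launch`, and `block_size` are hypothetical names.

```python
def add_kernel(x, y, out, n, pid, block_size):
    """One program instance: add the block of elements owned by index pid."""
    start = pid * block_size
    for offset in range(start, start + block_size):
        # Mask out-of-range offsets so the last, partial block stays in bounds.
        if offset < n:
            out[offset] = x[offset] + y[offset]


def launch(x, y, block_size=4):
    """Run one program instance per block, sequentially, as a stand-in for a launch grid."""
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(grid):
        add_kernel(x, y, out, n, pid, block_size)
    return out


if __name__ == "__main__":
    print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

The kernel body never mentions threads, memory hierarchy, or vector width. In a real block-level language those choices are left to the compiler, which is what makes the same kernel source portable across accelerators.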

Maia 100 is also the first implementation of the Microscaling (MX) data format, an industry-standardized data format that leads to faster model training and inferencing times. Microsoft has partnered with AMD, ARM, Intel, Meta, NVIDIA, and Qualcomm to release the v1.0 MX specification through the Open Compute Project community so that the entire AI ecosystem can benefit from these algorithmic improvements. 
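The core idea behind microscaling is that a small block of values shares one scale factor while each element is stored in a narrow format. The Python sketch below illustrates that idea under simplifying assumptions: a block size of 32, a shared power-of-two scale, and signed 8-bit integer elements. It is not the exact bit-level encoding defined in the MX specification, and all function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # assumed number of elements that share one scale


def quantize_block(values, elem_max=127):
    """Return (shared_exponent, integer_elements) for one block of floats."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two exponent such that max_abs / 2**exp fits in elem_max.
    exp = math.ceil(math.log2(max_abs / elem_max))
    scale = 2.0 ** exp
    elems = [max(-elem_max, min(elem_max, round(v / scale))) for v in values]
    return exp, elems


def dequantize_block(exp, elems):
    """Rebuild approximate floats from a shared exponent and integer elements."""
    scale = 2.0 ** exp
    return [e * scale for e in elems]


def quantize(values):
    """Split values into blocks and quantize each block with its own scale."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one list of floats."""
    out = []
    for exp, elems in blocks:
        out.extend(dequantize_block(exp, elems))
    return out
```

Because each block picks its own scale, a block of small values keeps fine resolution even when another block in the same tensor holds much larger values. The per-element rounding error is bounded by half of that block's scale.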

Azure Maia 100 is a unique innovation combining state-of-the-art silicon packaging techniques, ultra-high-bandwidth networking design, modern cooling and power management, and algorithmic co-design of hardware with software. We look forward to continuing to advance our goal of making AI real by introducing more silicon, systems, and software innovations into our datacenters globally.

Source: microsoft.com