Saturday 20 April 2024

Azure IoT’s industrial transformation strategy on display at Hannover Messe 2024


Running and transforming a successful enterprise is like being the coach of a championship-winning sports team. To win the trophy, you need a strategy, game plans, and the ability to bring all the players together. In the early days of training, coaches relied on basic drills, manual strategies, and simple equipment. But as technology advanced, so did the art of coaching. Today, coaches use data-driven training programs, performance tracking technology, and sophisticated game strategies to achieve unimaginable performance and secure victories.

We see a similar change happening in industrial production management and performance and we are excited to showcase how we are innovating with our products and services to help you succeed in the modern era. Microsoft recently launched two accelerators for industrial transformation:

◉ Azure’s adaptive cloud approach—a new strategy
◉ Azure IoT Operations (preview)—a new product

Our adaptive cloud approach connects teams, systems, and sites through consistent management tools, development patterns, and insight generation. Putting the adaptive cloud approach into practice, IoT Operations leverages open standards and works with Microsoft Fabric to create a common data foundation for IT and operational technology (OT) collaboration.

We will be demonstrating these accelerators in the Microsoft booth at Hannover Messe 2024, presenting the new approach on the Microsoft stage, and will be ready to share exciting partnership announcements that enable interoperability in the industry.  

Experience the future of automation with IoT Operations 


Using our adaptive cloud approach, we’ve built a robotic assembly line demonstration that puts together car battery parts for attendees of the event. This production line is partner-enabled and features a standard OT environment, including solutions from Rockwell Automation and PTC. IoT Operations was used to build a monitoring solution for the robots because it embraces industry standards, like Open Platform Communications Unified Architecture (OPC UA), and integrates with existing infrastructure to connect data from an array of OT devices and systems and route it to the right places and people. IoT Operations processes data at the edge for local use by multiple applications and sends insights to the cloud, where they can likewise be consumed by multiple applications, reducing data fragmentation.
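To make the data path concrete, below is a minimal, illustrative sketch of the pattern described above: an edge process polls a few OPC UA tags, aggregates them locally, and forwards only summarized telemetry northbound. This is not Azure IoT Operations code; it assumes the open-source asyncua Python package, and the endpoint address and node IDs are hypothetical placeholders.

```python
# Illustrative sketch only: poll OPC UA tags at the edge, aggregate locally,
# and forward just the summary. Assumes the open-source `asyncua` package;
# the endpoint and node IDs below are hypothetical placeholders.
import asyncio
import statistics
from asyncua import Client

OPC_ENDPOINT = "opc.tcp://robot-plc.local:4840"                      # placeholder
NODE_IDS = ["ns=2;s=Robot1.JointTemp", "ns=2;s=Robot1.CycleTime"]    # placeholders

async def sample_window(client: Client, seconds: int = 10) -> dict:
    """Read each tag once per second and return simple edge-side aggregates."""
    readings = {node_id: [] for node_id in NODE_IDS}
    for _ in range(seconds):
        for node_id in NODE_IDS:
            value = await client.get_node(node_id).read_value()
            readings[node_id].append(float(value))
        await asyncio.sleep(1)
    return {
        node_id: {"mean": statistics.mean(values), "max": max(values)}
        for node_id, values in readings.items()
    }

async def main() -> None:
    async with Client(url=OPC_ENDPOINT) as client:
        summary = await sample_window(client)
        # A real deployment would publish this summary to a local broker or a
        # cloud endpoint; printing stands in for that step here.
        print(summary)

if __name__ == "__main__":
    asyncio.run(main())
```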

For those attending Hannover Messe 2024, head to the center of the Microsoft booth and look for the station “Achieve industrial transformation across the value chain.”  

Consult with Azure experts on IT and OT collaboration tools 


Find out how Microsoft Azure’s open and standardized strategy, an adaptive cloud approach, can help you reach the next stage of industrial transformation. Our experts will help your team collect data from assets and systems on the shop floor, compute at the edge, integrate that data into multiple solutions, and create production analytics on a global scale. Whether you’re just starting to connect and digitize your operations, or you’re ready to analyze and reason with your data, make predictions, and apply AI, we’re here to assist.  

For those attending Hannover Messe 2024, these experts are located at the demonstration called “Scale solutions and interoperate with IoT, edge, and cloud innovation.” 

Check out Jumpstart to get your collaboration environment up and running. In May 2024, Jumpstart will have a comprehensive scenario designed for manufacturing.

Attend a presentation on modernizing the shop floor  


We will share the results of a survey on the latest trends, technologies, and priorities for manufacturing companies that want to efficiently manage their data to prepare for AI and accelerate industrial transformation. In the survey, 73% of manufacturers agreed that a scalable technology stack is an important paradigm for the future of factories. To make that a reality, manufacturers are making changes to modernize, such as adopting containerization, shifting to central management of devices, and emphasizing IT and OT collaboration tools. These modernization moves can maximize the ROI of existing infrastructure and solutions, enhance security, and bring AI to the edge.

This presentation, “How manufacturers prepare shopfloors for a future with AI,” will take place at Hannover Messe 2024 in the Microsoft theater at our booth in Hall 17 on Monday, April 22, 2024, at 2:00 PM CEST.

Learn about actions and initiatives driving interoperability  


Microsoft is strengthening and supporting the industrial ecosystem to enable at-scale transformation and solution interoperability. Our adaptive cloud approach both incorporates existing investments in partner technology and builds a foundation for consistent deployment patterns and repeatability at scale.

Our ecosystem of partners

Microsoft is building an ecosystem of connectivity partners to modernize industrial systems and devices. These partners provide data translation and normalization services across heterogeneous environments for a seamless and secure data flow on the shop floor, and from the shop floor to the cloud. We leverage open standards and provide consistent control and management capabilities for OT and IT assets. To date, we have established integrations with Advantech, Softing, and PTC. 

Siemens and Microsoft have announced the convergence of the Digital Twin Definition Language (DTDL) with the W3C Web of Things standard. This convergence will help consolidate digital twin definitions for assets in the industry and enable new technology innovation like automatic asset onboarding with the help of generative AI technologies.

Microsoft embraces open standards and interoperability. Our adaptive cloud approach is based on those principles. We are thrilled to join project Margo, a new ecosystem-led initiative that will help industrial customers achieve their digital transformation goals with greater speed and efficiency. Margo will define how edge applications, edge devices, and edge orchestration software interoperate with one another with increased flexibility.

Discover solutions with Microsoft


Visit our booth and speak with our experts to reach new heights of industrial transformation and prepare the shop floor for AI. Together, we will maximize your existing investments and drive scale in the industry. We look forward to working with you.


Source: microsoft.com

Friday 19 April 2024

Microsoft Entra resilience update: Workload identity authentication

Microsoft Entra is not only the identity system for users; it’s also the identity and access management (IAM) system for Azure-based services, all internal infrastructure services at Microsoft, and our customers’ workload identities. This is why our 99.99% service-level promise extends to workload identity authentication, and why we continue to improve our service’s resilience through a multilayered approach that includes the backup authentication system.

In 2021, we introduced the backup authentication system as an industry-first innovation that automatically and transparently handles authentications for supported workloads when the primary Microsoft Entra ID service is degraded or unavailable. Through 2022 and 2023, we continued to expand the coverage of the backup service across clouds and application types.

Today, we’ll build on our resilience blog post series by going further in sharing how workload identities gain resilience from regionally isolated authentication endpoints as well as from the backup authentication system. We’ll explore two complementary methods that best fit our regional-global infrastructure. One example of workload identity authentication is an Azure virtual machine (VM) authenticating its identity to Azure Storage. Another is one of our customers’ workloads authenticating to application programming interfaces (APIs).

Regionally isolated authentication endpoints 

 
Regionally isolated authentication endpoints provide region-isolated authentication services to an Azure region. All frequently used identities will authenticate successfully without dependencies on other Azure regions. Essentially, they are the primary endpoints for Azure infrastructure services as well as for managed identities in Azure (Managed identities for Azure resources - Microsoft Entra ID | Microsoft Learn). Managed identities help prevent out-of-region failures by consolidating service dependencies, and they improve resilience by handling certificate expiry, rotation, and trust.

This layer of protection and isolation does not need any configuration changes from Azure customers. Key Azure infrastructure services have already adopted it, and it’s integrated with the managed identities service to protect the customer workloads that depend on it. 

How regionally isolated authentication endpoints work 


Each Azure region is assigned a unique endpoint for workload identity authentication. The region is served by a regionally collocated, special instance of Microsoft Entra ID. The regional instance relies on caching metadata (for example, directory data that is needed to issue tokens locally) to respond efficiently and resiliently to the workload identity’s authentication requests. This lightweight design reduces dependencies on other services and improves resilience by allowing the entire authentication to be completed within a single region. Data in the local cache is proactively refreshed. 

The regional service depends on Microsoft Entra ID's global service to update and refill caches when it lacks the data it needs (a cache miss) or when it detects a change in the security posture for a supported service. If the regional service experiences an outage, requests are served seamlessly by Microsoft Entra ID’s global service, making the regional service interruption invisible to the customers.
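The cache-miss fallback described above follows a familiar pattern: serve from a regional cache when possible, consult the global service on a miss, and refill the cache along the way. The sketch below is purely conceptual, with illustrative names; it is not how Microsoft Entra ID is implemented internally.

```python
# Conceptual sketch of a regional issuer backed by a global directory; the
# classes and names are illustrative, not Microsoft Entra ID internals.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class GlobalDirectory:
    """Stands in for the global service that owns authoritative metadata."""
    records: Dict[str, dict] = field(default_factory=dict)

    def lookup(self, identity_id: str) -> Optional[dict]:
        return self.records.get(identity_id)

@dataclass
class RegionalIssuer:
    global_directory: GlobalDirectory
    cache: Dict[str, dict] = field(default_factory=dict)

    def issue_token(self, identity_id: str) -> str:
        metadata = self.cache.get(identity_id)
        if metadata is None:                        # cache miss
            metadata = self.global_directory.lookup(identity_id)
            if metadata is None:
                raise PermissionError(f"unknown identity: {identity_id}")
            self.cache[identity_id] = metadata      # refill the regional cache
        # A real issuer would mint and sign a token; a placeholder suffices here.
        return f"token:{identity_id}@{metadata['tenant']}"

directory = GlobalDirectory(records={"vm-workload-1": {"tenant": "contoso"}})
issuer = RegionalIssuer(global_directory=directory)
print(issuer.issue_token("vm-workload-1"))   # first call falls back to global
print(issuer.issue_token("vm-workload-1"))   # second call is served in-region
```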

Performant, resilient, and widely available 


The service has proven itself since 2020 and now serves six billion requests per day across the globe. The regional endpoints, working with global services, exceed a 99.99% SLA. The resilience of Azure infrastructure is further protected by workload-side caches kept by Azure client SDKs. Together, the regional and global services have made most service degradations undetectable by dependent infrastructure services, and post-incident recovery is handled automatically. Regional isolation is supported in the public cloud and in all sovereign clouds.

Infrastructure authentication requests are processed by the same Azure datacenter that hosts the workloads along with their co-located dependencies. This means that endpoints that are isolated to a region also benefit from performance advantages. 

 

Backup authentication system to cover workload identities for infrastructure authentication 

 
For workload identity authentication that does not depend on managed identities, we’ll rely on the backup authentication system to add fault-tolerant resilience. In our blog post from November 2021, we explained the approach for user authentication, which has been generally available for some time. The system operates in the Microsoft cloud but on separate and decorrelated systems and network paths from the primary Microsoft Entra ID system. This means it can continue to operate in the case of service, network, or capacity issues across many Microsoft Entra ID and dependent Azure services. We are now applying that successful approach to workload identities.

Backup coverage of workload identities is currently rolling out systematically across Microsoft, starting with Microsoft 365’s largest internal infrastructure services in the first half of 2024. Microsoft Entra ID customer workload identities’ coverage will follow in the second half of 2025. 


Protecting your own workloads 

 
The benefits of both regionally isolated endpoints and the backup authentication system are natively built into our platform. To further optimize the benefits of current and future investments in resilience and security, we encourage developers to use the Microsoft Authentication Library (MSAL) and leverage managed identities whenever possible. 
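For developers who want to follow that guidance today, the short sketch below shows the common pattern with the azure-identity and azure-storage-blob packages: the workload picks up its managed identity through DefaultAzureCredential (which builds on MSAL) and never handles a secret. The storage account URL and container name are placeholders, and the identity must already have been granted access to the account.

```python
# Minimal sketch: authenticate with a managed identity via azure-identity and
# list blobs. Assumes the `azure-identity` and `azure-storage-blob` packages;
# the account URL and container name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://examplestorageacct.blob.core.windows.net"  # placeholder
CONTAINER_NAME = "telemetry"                                       # placeholder

# On an Azure-hosted workload, DefaultAzureCredential resolves to the managed
# identity, so no secrets or certificates are stored with the application.
credential = DefaultAzureCredential()
service = BlobServiceClient(account_url=ACCOUNT_URL, credential=credential)

for blob in service.get_container_client(CONTAINER_NAME).list_blobs():
    print(blob.name)
```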

What’s next? 
 
We want to assure our customers that our 99.99% uptime guarantee remains in place, along with our ongoing efforts to expand the backup authentication system and increase automatic backup coverage to include all infrastructure authentication—even for third-party developers—in the next year. We’ll keep you updated on our progress, including planned improvements to our system capacity, performance, and coverage across all clouds.

Source: microsoft.com

Saturday 13 April 2024

Advancing memory leak detection with AIOps—introducing RESIN


In the ever-evolving landscape of cloud computing, memory leaks represent a persistent challenge—affecting performance, stability, and ultimately, the user experience. Memory leak detection is therefore important to cloud service quality. A memory leak happens when memory is allocated but, unintentionally, not released in a timely manner. Leaks can degrade the performance of the affected component and even crash the operating system (OS). Worse still, they often affect other processes running on the same machine, slowing them down or causing them to be killed.

Given the impact of memory leak issues, there are many studies and solutions for memory leak detection. Traditional detection solutions fall into two categories: static and dynamic detection. Static leak detection techniques analyze software source code and deduce potential leaks, whereas dynamic methods detect leaks by instrumenting a program and tracking object references at runtime.

However, these conventional techniques for detecting memory leaks are not adequate for leak detection in a cloud environment. The static approaches have limited accuracy and scalability, especially for leaks that result from cross-component contract violations, which require rich domain knowledge to capture statically. In general, the dynamic approaches are better suited to a cloud environment, but they are intrusive and require extensive instrumentation. Furthermore, they introduce high runtime overhead, which is costly for cloud services.

Introducing RESIN


Today, we are introducing RESIN, an end-to-end memory leak detection service designed to holistically address memory leaks in large cloud infrastructure. RESIN has been used in Microsoft Azure production and demonstrated effective leak detection with high accuracy and low overhead.

RESIN system workflow


A large cloud infrastructure can consist of hundreds of software components owned by different teams. Prior to RESIN, memory leak detection was an individual team’s effort in Microsoft Azure. As shown in Figure 1, RESIN uses a centralized approach that conducts leak detection in multiple stages to achieve low overhead, high accuracy, and scalability. This approach does not require access to components’ source code, extensive instrumentation, or recompilation.


Figure 1: RESIN workflow

RESIN conducts low-overhead monitoring using agents that collect memory telemetry data at the host level. A remote service aggregates and analyzes the data from different hosts using a bucketization-pivot scheme. When leaking is detected in a bucket, RESIN triggers an analysis of the process instances in that bucket. For highly suspicious leaks, RESIN performs live heap snapshotting and compares the snapshots to regular heap snapshots in a reference database. After generating multiple heap snapshots, RESIN runs a diagnosis algorithm to localize the root cause of the leak and attaches a diagnosis report to the alert ticket to assist developers with further analysis. Finally, RESIN automatically mitigates the leaking process.

Detection algorithms


There are unique challenges in memory leak detection in cloud infrastructure:

  • Noisy memory usage caused by changing workloads and interference in the environment results in high noise when detection uses a static, threshold-based approach.
  • Memory leaks in production systems are usually fail-slow faults that can last days, weeks, or even months, and it is difficult to capture such gradual change in a timely manner.
  • At the scale of the Azure global cloud, it is not practical to collect fine-grained data over long periods of time.

To address these challenges, RESIN uses a two-level scheme to detect memory leak symptoms: a global bucket-based pivot analysis to identify suspicious components, and a local per-process leak detection to identify the leaking processes.

With the bucket-based pivot analysis at the component level, we categorize raw memory usage into a number of buckets and transform the usage data into a summary of how many hosts fall into each bucket. In addition, a severity score for each bucket is calculated based on the deviations and the host count in the bucket. Anomaly detection is then performed on the time-series data of each bucket of each component. The bucketization approach not only represents the workload trend robustly and with noise tolerance, but also reduces the computational load of anomaly detection.
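As a rough intuition for the bucketization-pivot idea, the sketch below groups per-host memory usage into coarse buckets and scores each bucket by how far its host count drifts from a baseline. The bucket edges and the severity formula are invented for illustration and are not RESIN’s actual algorithm.

```python
# Conceptual sketch of bucketization and severity scoring; the bucket edges
# and scoring formula are illustrative only, not RESIN's actual algorithm.
from collections import Counter
from typing import Dict

# Memory usage (fraction of host memory) grouped into coarse ranges so that
# noise on individual hosts is smoothed out.
BUCKET_EDGES = [0.0, 0.25, 0.5, 0.75, 0.9, 1.01]

def bucketize(usage_by_host: Dict[str, float]) -> Counter:
    """Count how many hosts fall into each usage bucket."""
    counts: Counter = Counter()
    for usage in usage_by_host.values():
        for i in range(len(BUCKET_EDGES) - 1):
            if BUCKET_EDGES[i] <= usage < BUCKET_EDGES[i + 1]:
                counts[i] += 1
                break
    return counts

def severity(bucket_index: int, host_count: int, baseline: int) -> float:
    """Score a bucket by its deviation from a baseline host count, weighted so
    that higher-usage buckets matter more."""
    deviation = max(host_count - baseline, 0)
    return deviation * (bucket_index + 1)

# Example: three hosts drift into the 75-90% bucket against a baseline of one.
usage = {"host-a": 0.82, "host-b": 0.78, "host-c": 0.88, "host-d": 0.41}
counts = bucketize(usage)
print({i: severity(i, n, baseline=1) for i, n in counts.items()})
```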

However, detection at the component level alone is not sufficient for developers to investigate a leak efficiently because many processes normally run on a component. When a leaking bucket is identified at the component level, RESIN runs a second-level detection scheme at process granularity to narrow down the scope of investigation. It outputs the suspected leaking process, its start and end time, and a severity score.

Diagnosis of detected leaks


Once a memory leak is detected, RESIN takes a snapshot of the live heap, which contains all memory allocations referenced by the running application, and analyzes the snapshots to pinpoint the root cause of the detected leak. This makes memory leak alerts actionable.

RESIN also leverages the Windows heap manager’s snapshot capability to perform live profiling. However, heap collection is expensive and can be intrusive to the host’s performance. To minimize the overhead of heap collection, several factors are considered in deciding how snapshots are taken:

  • The heap manager stores only limited information in each snapshot, such as the stack trace and size of each active allocation.
  • RESIN prioritizes candidate hosts for snapshotting based on leak severity, noise level, and customer impact. By default, the top three hosts in the suspected list are selected to ensure successful collection.
  • RESIN utilizes a long-term, trigger-based strategy to ensure the snapshots capture the complete leak. To facilitate the decision regarding when to stop the trace collection, RESIN analyzes memory growth patterns (such as steady, spike, or stair) and takes a pattern-based approach to decide the trace completion triggers.
  • RESIN uses a periodic fingerprinting process to build reference snapshots, which are compared with the snapshots of suspected leaking processes to support diagnosis.
  • RESIN analyzes the collected snapshots to output the stack traces of the leak’s root cause, as sketched below.
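The comparison in the last two steps amounts to diffing allocation sites between a reference snapshot and a suspect snapshot. The simplified sketch below assumes each snapshot has already been reduced to a mapping from call-stack fingerprint to total live bytes; it illustrates the idea and is not RESIN’s diagnosis algorithm.

```python
# Simplified illustration: diff a suspect heap snapshot against a reference
# snapshot (each reduced to {stack_fingerprint: live_bytes}) and rank the
# allocation sites that grew the most. Not RESIN's actual diagnosis algorithm.
from typing import Dict, List, Tuple

def rank_growing_sites(reference: Dict[str, int],
                       suspect: Dict[str, int],
                       top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the allocation sites with the largest growth in live bytes."""
    growth = {
        stack: suspect[stack] - reference.get(stack, 0)
        for stack in suspect
        if suspect[stack] > reference.get(stack, 0)
    }
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

reference_snapshot = {"alloc@parser": 10_000, "alloc@cache": 50_000}
suspect_snapshot = {"alloc@parser": 11_000, "alloc@cache": 480_000,
                    "alloc@logger": 2_000}

# The cache allocation site dominates the growth and would head the report.
print(rank_growing_sites(reference_snapshot, suspect_snapshot))
```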

Mitigation of detected leaks


When a memory leak is detected, RESIN attempts to automatically mitigate the issue to avoid further customer impact. Depending on the nature of the leak, different mitigation actions are taken, and RESIN uses a rule-based decision tree to choose the action that minimizes the impact.

If the memory leak is localized to a single process or Windows service, RESIN attempts the lightest mitigation by simply restarting the process or the service. An OS reboot can resolve software memory leaks but takes much longer and can cause virtual machine downtime, so it is normally reserved as a last resort. For a non-empty host, RESIN uses solutions such as Project Tardigrade, which skips hardware initialization and performs only a kernel soft reboot after live virtual machine migration, to minimize user impact. A full OS reboot is performed only when the soft reboot is ineffective.
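The escalation order described above can be pictured as a short decision chain, from the lightest action to the most disruptive. The sketch below is a conceptual rendering of that order; the conditions and action names are invented for illustration and do not reproduce RESIN’s actual rule set.

```python
# Conceptual rendering of the mitigation escalation; conditions and action
# names are illustrative, not RESIN's actual rules.
from dataclasses import dataclass

@dataclass
class LeakContext:
    scoped_to_single_process: bool   # is the leak localized to one process or service?
    host_has_customer_vms: bool      # is this a non-empty host?
    soft_reboot_failed: bool         # has a kernel soft reboot already failed?

def choose_mitigation(ctx: LeakContext) -> str:
    if ctx.scoped_to_single_process:
        return "restart process or service"                 # lightest action first
    if ctx.host_has_customer_vms and not ctx.soft_reboot_failed:
        return "live-migrate VMs, then kernel soft reboot"  # Tardigrade-style
    return "full OS reboot"                                 # last resort

print(choose_mitigation(LeakContext(True, True, False)))
print(choose_mitigation(LeakContext(False, True, False)))
print(choose_mitigation(LeakContext(False, True, True)))
```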

RESIN stops applying mitigation actions to a target once the detection engine no longer considers the target leaking.

Result and impact of memory leak detection


RESIN has been running in production in Azure since late 2018 and to date, it has been used to monitor millions of host nodes and hundreds of host processes daily. Overall, we achieved 85% precision and 91% recall with RESIN memory leak detection, despite the rapidly growing scale of the cloud infrastructure monitored.

The end-to-end benefits brought by RESIN are clearly demonstrated by two key metrics:

1. Virtual machine unexpected reboots: the average number of reboots per one hundred thousand hosts per day due to low memory.
2. Virtual machine allocation error: the ratio of erroneous virtual machine allocation requests due to low memory.

Between September 2020 and December 2023, the virtual machine reboots were reduced by nearly 100 times, and allocation error rates were reduced by over 30 times. Furthermore, since 2020, no severe outages have been caused by Azure host memory leaks.

Source: microsoft.com

Tuesday 9 April 2024

Azure Maia for the era of AI: From silicon to software to systems


As the pace of AI and the transformation it enables across industries continues to accelerate, Microsoft is committed to building and enhancing our global cloud infrastructure to meet the needs of customers and developers with faster, more performant, and more efficient compute and AI solutions. Azure AI infrastructure comprises technology from industry leaders as well as Microsoft’s own innovations, including Azure Maia 100, Microsoft’s first in-house AI accelerator, announced in November. In this blog, we will dive deeper into the technology and the journey of developing Azure Maia 100, the co-design of hardware and software from the ground up, built to run cloud-based AI workloads and optimized for Azure AI infrastructure.

Azure Maia 100, pushing the boundaries of semiconductor innovation


Maia 100 was designed to run cloud-based AI workloads, and the design of the chip was informed by Microsoft’s experience in running complex and large-scale AI workloads such as Microsoft Copilot. Maia 100 is one of the largest processors made on the 5nm node, using advanced packaging technology from TSMC.

Through collaboration with Azure customers and leaders in the semiconductor ecosystem, such as foundry and EDA partners, we will continue to apply real-world workload requirements to our silicon design, optimizing the entire stack from silicon to service, and delivering the best technology to our customers to empower them to achieve more.


End-to-end systems optimization, designed for scalability and sustainability 


When developing the architecture for the Azure Maia AI accelerator series, Microsoft reimagined the end-to-end stack so that our systems could handle frontier models more efficiently and in less time. AI workloads demand infrastructure that is dramatically different from other cloud compute workloads, requiring increased power, cooling, and networking capability. Maia 100’s custom rack-level power distribution and management integrates with Azure infrastructure to achieve dynamic power optimization. Maia 100 servers are designed with a fully-custom, Ethernet-based network protocol with aggregate bandwidth of 4.8 terabits per accelerator to enable better scaling and end-to-end workload performance.  

When we developed Maia 100, we also built a dedicated “sidekick” to match the thermal profile of the chip and added rack-level, closed-loop liquid cooling to Maia 100 accelerators and their host CPUs to achieve higher efficiency. This architecture allows us to bring Maia 100 systems into our existing datacenter infrastructure, and to fit more servers into these facilities, all within our existing footprint. The Maia 100 sidekicks are also built and manufactured to meet our zero waste commitment.


Co-optimizing hardware and software from the ground up with the open-source ecosystem


From the start, transparency and collaborative advancement have been core tenets in our design philosophy as we build and develop Microsoft’s cloud infrastructure for compute and AI. Collaboration enables faster iterative development across the industry—and on the Maia 100 platform, we’ve cultivated an open community mindset from algorithmic data types to software to hardware.  

To make it easy to develop AI models on Azure AI infrastructure, Microsoft is creating the software for Maia 100 that integrates with popular open-source frameworks like PyTorch and ONNX Runtime. The software stack provides rich and comprehensive libraries, compilers, and tools to equip data scientists and developers to successfully run their models on Maia 100. 


To optimize workload performance, AI hardware typically requires development of custom kernels that are silicon-specific. We envision seamless interoperability among AI accelerators in Azure, so we have integrated Triton from OpenAI. Triton is an open-source programming language that simplifies kernel authoring by abstracting the underlying hardware. This will empower developers with complete portability and flexibility without sacrificing efficiency and the ability to target AI workloads. 
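For readers unfamiliar with Triton, the short kernel below shows the style of hardware-agnostic authoring the language enables: a standard element-wise addition written against Triton’s block abstractions rather than any accelerator’s intrinsics. It is generic, publicly documented Triton, not Maia-specific code, and it assumes a GPU-enabled PyTorch and Triton installation.

```python
# A standard Triton vector-add kernel: generic Triton, not Maia-specific code.
# Assumes the `triton` and `torch` packages with a supported GPU backend.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)            # one program instance per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    print(torch.allclose(add(a, b), a + b))
```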


Maia 100 is also the first implementation of the Microscaling (MX) data format, an industry-standardized data format that leads to faster model training and inferencing times. Microsoft has partnered with AMD, ARM, Intel, Meta, NVIDIA, and Qualcomm to release the v1.0 MX specification through the Open Compute Project community so that the entire AI ecosystem can benefit from these algorithmic improvements.
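At its core, microscaling pairs a block of low-precision elements with one shared scale. The sketch below is a deliberately simplified, NumPy-only illustration of that idea, using a shared power-of-two scale per block of 32 values and an integer grid as a stand-in for the low-precision element types; it does not implement the full MX v1.0 specification.

```python
# Simplified illustration of block scaling: each block of 32 values shares one
# power-of-two scale. This is a teaching sketch, not the full MX v1.0 format.
import numpy as np

BLOCK = 32

def quantize_block_scaled(x: np.ndarray, element_bits: int = 8):
    """Return per-block power-of-two scales and low-precision elements."""
    blocks = x.reshape(-1, BLOCK)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    max_abs[max_abs == 0] = 1.0
    scales = 2.0 ** np.ceil(np.log2(max_abs))          # shared scale per block
    levels = 2 ** (element_bits - 1) - 1
    elements = np.round(blocks / scales * levels).astype(np.int32)
    return scales, elements

def dequantize(scales: np.ndarray, elements: np.ndarray,
               element_bits: int = 8) -> np.ndarray:
    levels = 2 ** (element_bits - 1) - 1
    return (elements / levels) * scales

data = np.random.randn(128).astype(np.float32)
scales, elements = quantize_block_scaled(data)
error = np.abs(dequantize(scales, elements).ravel() - data).max()
print(f"max reconstruction error: {error:.4f}")
```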

Azure Maia 100 is a unique innovation combining state-of-the-art silicon packaging techniques, ultra-high-bandwidth networking design, modern cooling and power management, and algorithmic co-design of hardware with software. We look forward to continuing to advance our goal of making AI real by introducing more silicon, systems, and software innovations into our datacenters globally.

Source: microsoft.com

Saturday 6 April 2024

Get ready for AI at the Migrate to Innovate digital event


Organizations of all sizes are recognizing that using AI fuels the kind of innovation that’s needed to maintain a competitive edge. What’s often less clear is how to prepare your organization to be able to take full advantage of AI. For organizations running business-critical workloads on Windows Server and SQL Server, how do you get from running in a traditional, on-premises environment to operating in an environment that supports AI and other modern technologies?

Get ready for AI


To get a solid understanding of how to navigate this type of move, join us at the free digital event Migrate to Innovate: Be AI Ready, Be Secure on Tuesday, April 16, 2024, from 9:00 AM to 11:00 AM Pacific Time. Register now to attend information-packed sessions that will help you understand the challenges organizations face in preparing for AI and cloud-native technologies, and what you need to have in place to solve those challenges.

Address your most pressing business challenges


One of the biggest obstacles in the path to modernization is balancing the need to embrace the latest advancements with the need to meet current business challenges. Whether it’s managing rising costs, safeguarding against security threats, maintaining compliance, or controlling IT sprawl as business expands, there are a lot of different priorities competing for your focus, time, and resources.

At the Migrate to Innovate digital event, you’ll learn how Azure provides an optimized platform to fully embrace AI while addressing your most pressing business priorities by maximizing ROI, performance, and resilience. Sessions will focus on how to optimize migration of your Windows Server and SQL Server workloads to Azure to position your organization for innovation, efficiency, growth, and long-term success.

Be the first to see product updates, deep dives, and demos


Attend the Migrate to Innovate digital event to be among the first to see what’s included in the upcoming Windows Server 2025 release. View product demos of the newest AI innovations, including Microsoft Copilot. Join product experts for deep-dive product sessions covering Windows Server, SQL Server, AI, security, and a range of modernization-related capabilities. Learn about the latest updates on intelligent Azure databases to power your data and AI workloads, and discover strategies for gaining cloud agility, including running VMware workloads across cloud, hybrid, and on-premises environments.

Session highlights include:

  • Keynote address—Understand current business challenges and learn how migrating to Azure provides the agility that’s needed to address them. Hear about the latest product announcements and advancements that will help you get ready to take advantage of AI.
  • Migrate to Azure to be AI Ready—Learn the steps that organizations need to take to be ready for AI. Watch demos showing how customers are using AI solutions—including Microsoft Copilot in Azure—to solve complex problems, and how migration accelerates innovation with AI.
  • Customer interview—Hear customers discuss why they chose to make the move to Azure, and how migration has provided them with the business outcomes they need for success, including AI-readiness, security, cost savings, and performance.
  • Migrate your Windows Server to Azure—Learn how Azure is optimized to help you migrate and modernize your Windows Server workloads. Discover on-premises, hybrid, and cloud scenarios for VMware. Watch a demo on Windows Server 2025 and its support for AI capabilities as well as hybrid and edge scenarios.
  • Migrate your Data to Azure—In the era of AI, learn how to power mission-critical applications with Azure databases. See how to simplify migration with the help of Azure Arc migration assessments.
  • New risks, New rules: Secure Code-to-Cloud Migration—Find out how Azure helps you secure your entire migration journey using a cloud-native application platform (CNAPP), Microsoft Defender for Cloud, and Microsoft Copilot for Security.
  • Get cloud agility anywhere: Strategies for VMware workloads—Understand the issues that on-premises VMware users face, and learn how taking an adaptive cloud approach with Azure helps address these challenges.  
  • Optimize your migration with key guidance, offerings, and tools—Learn about the three most important optimization activities, and discover resources, guidance, and tools that will help you plan and implement your migration solution.

Discover the business outcomes of migrating to Azure


Register for the event to understand how to get your organization on the path to modernization and hear about the business outcomes that customers are seeing when they migrate Windows Server and SQL Server workloads to Azure, including:

  • AI readiness: Organizations get results with an AI-ready foundation on Azure. In a study of customers using Azure AI services, a composite organization based on the experiences of six interviewees achieved a three-year ROI of 284%. Work output and operational efficiency increased, employee collaboration and safety improved, and organizations reported faster and more data-driven decision-making.
  • Code-to-cloud security: With Azure, you get a complete code-to-cloud security platform. From foundational security to cloud-native workload protection, replacing multiple third-party security tools with comprehensive, multilayered security reduces risk and costs.
  • Maximizing ROI and performance: Workloads run faster and at a lower cost on Azure than with other cloud providers. AWS is up to 5 times more expensive than Azure for Windows Server and SQL Server—and SQL Server runs up to 5.5 times faster on Azure than on AWS.
  • Cloud agility anywhere: Azure meets organizations where they are in their migration journey through an adaptive cloud approach. Azure provides the tools and support to help you secure and govern your entire digital estate across hybrid, multicloud, and edge environments on your own terms.

Source: microsoft.com

Tuesday 2 April 2024

The Microsoft Intelligent Data Platform—Unleash your data and accelerate your transformation


Canopius, a global specialty and P&C (re)insurance business, knew its success hinged on its ability to access all its data but that its existing infrastructure was hurting, not helping, its ability to do so. It wanted to capitalize on its data for increased productivity and it wanted a cloud platform capable of handling advanced technologies with ease. In simple terms, Canopius wanted a data platform that would truly enable its success, today and well into the future.

Like so many other companies today, it faced a decision: update its hardware or invest in a modern platform.

Ultimately, Canopius bet on itself and its future. Drawn to the transformative potential of generative AI, it migrated its data to Microsoft Azure and adopted a combination of our newest, most powerful services. The result? An ability to pivot quickly to changing market conditions. An IT team that, armed with the right tools and technologies, could focus solely on innovation. And, with full access to its unique data, a true competitive advantage. Enter the Microsoft Intelligent Data Platform.

Get to know the Microsoft Intelligent Data Platform


The Intelligent Data Platform is an idea that was born out of countless conversations with customers like Canopius who want, once and for all, to get their data house in order to gain a competitive edge. They know their data holds the key to their success. They want to implement AI tools. But after 15 or more years of data and cloud investments, they find their ambition is limited by the very solutions they’ve built over time. Patch after patch, their data estate has become fragmented, hard to manage, and incapable of handling AI-era workloads.

Even when customers know their infrastructure needs upgrading, many find it difficult to make the switch—and for good reasons. First, fixing a broken or outdated data platform and futureproofing a data platform are two distinct and sizable problems to solve. Second, I’ll be the first to admit our industry doesn’t always make it easy to choose. Azure alone offers more than 300 products, with more on the near-term horizon, and we don’t often present complete solutions because the reality is one size will never fit all. Third, customers often have favorite products and need reassurances those products will interoperate with other products. It’s natural for solutions to become more specialized and thereby more fragmented over time. And finally, migrating data accumulated over years, sometimes decades, is an exercise that can feel risky even to the most experienced IT professional. These are all valid concerns and roadblocks to progress.

With the emergence of generative AI, the original idea of the Intelligent Data Platform has morphed into what I now regard as our strongest, simplest recommendation: a cloud solution capable of solving the data fragmentation problems of the past and present, and ready to empower uncapped innovation for even the most ambitious organizations. Generative AI changes a data strategy: data solutions created today must be designed with the foresight to easily handle spikes in data, new workloads, and new technologies for many years to come.

The recommendation is this: migrate to some of the most advanced, AI-ready databases on the market today, take advantage of integrated advanced analytics and the latest AI tools, and invest in a comprehensive, end-to-end security solution.


The newest technologies and products on the market today have been built with the future in mind—future data requirements and geo-economic environments. To focus on innovation, companies need access to the richness of their data, with roadblocks such as interoperability and security concerns removed. That’s table stakes. They also need faster, better access to their data and the tools to empower innovation and transformation.

This four-workload construct works no matter which cloud vendor you choose, as long as the technologies within are built for future data demands, interoperability, and security. For customers, existing and new, interested in migrating to Azure or adopting a complete intelligent data platform, we recommend the following: 

1. Databases recommendation: The foundation of a powerful solution begins with the database. Our latest hyperscale databases are an entirely new category, designed and tuned for generative AI to provide limitless leeway to reinvent business processes and create apps not possible before. They’re also built to process data at any scale with near-zero downtime, making real-time analytics and insights possible.
   Products: Azure Cosmos DB, Azure SQL DB Hyperscale
2. Analytics recommendation: When paired with a hyperscale database, our newest unified analytics platform delivers tailored business intelligence in real-time. This all-new method of democratizing intelligence emboldens leadership teams to quickly make decisions that move the needle.
   Product: Microsoft Fabric
3. AI recommendation: Our newest AI technologies and tools give creators and thinkers the resources they need to ideate, conceptualize, and build an organization’s next big breakthrough, and its leaders the ability to strategize, reinvent, and streamline in new ways.
   Products: Azure AI Studio, Azure Machine Learning, Azure AI Search, Azure AI Services
4. Security recommendation: We’ve built security controls into every Azure product within our portfolio, and our industry-leading security solution helps protect data from every angle, for better visibility, safety, and compliance. We strongly recommend the added protection that comes with an end-to-end solution.
   Products: Microsoft Purview, Microsoft Defender for Cloud, Microsoft Defender XDR, Microsoft Sentinel, Microsoft Entra

When combined, these products transform into a futureproofed solution that makes powerful decision making, organizational performance, and limitless innovation possible.

Data is the currency of AI


Data is the key to every organization’s success, and that has taken on a whole new level of importance in the era of AI where data acts as the currency of AI innovation. Large language models can understand the world’s knowledge at a foundational level, but they can’t understand your business.

A few months ago, I connected with a customer—in fact, the Chief Executive Officer of a large company—who asked if he should pause the work his IT department was doing to integrate and modernize their data estate since, now, it’s all about AI. I appreciated his question, but my answer was: no, it’s exactly the opposite. So often, a new technology’s arrival is meant to replace something older. That isn’t the case here. A fully integrated, modern data estate is the bedrock of an innovation engine. A modern estate makes it possible to access all your data. And it’s your data, when applied to AI ideas and inventions, that will create magical results and transformational experiences.

For Epiroc, the Swedish manufacturer that supplies mining and construction industries with advanced equipment, tools, and services, steel is essential to the success of their business. The company wanted to consistently deliver a product of high quality and improve efficiencies but, in a world where there are more than 3,500 grades of steel, it needed to create company-wide best practices for data sharing and a repeatable process to get the output it wanted. Microsoft partnered with Epiroc to create an enterprise-scale machine learning AI factory as part of a broader, modern Intelligent Data Platform. Since it had already taken the step to modernize its data estate on Azure, establishing an enterprise-scale AI factory was quick and easy. And the result? A consistent, quality product, increased organizational efficiencies, and reduced waste.

Unmatched choice with our partner ecosystem


I recently spoke with another customer who successfully migrated to Azure and seamlessly integrated Databricks into their new data platform. And a different partner, already on Azure but wanting to invest in AI, integrated Azure AI products and Adobe with already impressive results. 

Every organization deserves choice. We don’t talk about it as often as we should, but those using and managing a data platform need to trust that it will serve their needs today and in the future. We recognize this and want to empower every organization’s success, no matter their legacy preferences, and our partner ecosystem offers incredible capabilities we are eager for our customers to take advantage of. For this reason, we’ve invested in this growing ecosystem of trusted partners whose products fully integrate with ours, and we’ll continue doing so. As a former partner myself, I firmly believe flexibility to choose the best solution for the job should always be a required feature of a powerful platform.

All clouds welcome


Data circumstances or preferences shouldn’t be a barrier, either. There are incredible benefits to adopting an all-cloud environment as part of a modern data strategy. But we recognize customers may want or need to choose an environment that includes multiple clouds or a mix of cloud and on-premises. The Intelligent Data Platform and all the products within it are successful in any adaptive cloud scenario.

Unleash your data for bold transformation


A key part of the Intelligent Data Platform, Azure AI, makes building data-informed ideas and experiences easy. Last year, we introduced Azure AI Studio, a one-stop-shop for generative AI creation offering model selection, data, security features, and DevOps in one interface. We directly connected Fabric to Azure AI Studio which, for customers, means they can start building AI apps without requiring manual data duplication or additional integration. We also built and tuned Azure Cosmos DB and Azure SQL DB Hyperscale for generative AI innovation. With hyperscale performance, built-in AI including vector search and copilots, and multi-layered AI-powered security embedded into the database, these databases unify, power, and protect the platform.
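As a plain intuition for the vector search capability mentioned above, the sketch below shows the basic mechanic in NumPy: stored items and a query are represented as embedding vectors and ranked by cosine similarity. It is a conceptual illustration only, not the Azure Cosmos DB or Azure AI Search API, and the toy vectors stand in for embeddings produced by a real model.

```python
# Conceptual illustration of vector search: rank stored embeddings by cosine
# similarity to a query embedding. Not an Azure API; the toy vectors stand in
# for embeddings produced by a real model.
import numpy as np

documents = {
    "claims-faq": np.array([0.9, 0.1, 0.3]),
    "pricing-guide": np.array([0.2, 0.8, 0.5]),
    "onboarding-doc": np.array([0.4, 0.4, 0.9]),
}
query = np.array([0.85, 0.15, 0.35])  # embedding of the user's question

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(documents.items(), key=lambda kv: cosine(query, kv[1]),
                reverse=True)
for name, _ in ranked:
    print(name)   # most relevant document first
```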

For over a year, generative AI has captured imaginations and attention. In this time, we’ve learned that AI’s potential comes to life at the intersection of an organization’s data and ideas. The world’s been generating and storing incredible amounts of data for so long, waiting for the day technology would catch up and simplify the process of using the data in bold, meaningful ways.

That day has arrived. The Intelligent Data Platform is the ideal data and AI foundation for every organization’s success, today and tomorrow.

Source: microsoft.com