Saturday 13 April 2024

Advancing memory leak detection with AIOps—introducing RESIN

In the ever-evolving landscape of cloud computing, memory leaks represent a persistent challenge, affecting performance, stability, and ultimately, the user experience. Memory leak detection is therefore important to cloud service quality. Memory leaks happen when memory is allocated but, unintentionally, not released in a timely manner. A leak can degrade the component's performance and even crash the operating system (OS). Worse, it often affects other processes running on the same machine, causing them to slow down or be killed.

Given the impact of memory leak issues, there are many studies and solutions for memory leak detection. Traditional detection solutions fall into two categories: static and dynamic detection. Static leak detection techniques analyze software source code and deduce potential leaks, whereas dynamic methods detect leaks by instrumenting a program and tracking object references at runtime.

However, these conventional techniques are not adequate for leak detection in a cloud environment. Static approaches have limited accuracy and scalability, especially for leaks that result from cross-component contract violations, which require rich domain knowledge to capture statically. Dynamic approaches are generally better suited to a cloud environment, but they are intrusive and require extensive instrumentation. Furthermore, they introduce high runtime overhead, which is costly for cloud services.

Introducing RESIN


Today, we are introducing RESIN, an end-to-end memory leak detection service designed to holistically address memory leaks in large cloud infrastructure. RESIN has been used in Microsoft Azure production and demonstrated effective leak detection with high accuracy and low overhead.

RESIN system workflow


A large cloud infrastructure can consist of hundreds of software components owned by different teams. Prior to RESIN, memory leak detection was an individual team's effort in Microsoft Azure. As shown in Figure 1, RESIN uses a centralized approach that conducts leak detection in multiple stages for low overhead, high accuracy, and scalability. This approach requires neither access to components' source code nor extensive instrumentation or recompilation.

Figure 1: RESIN workflow

RESIN conducts low-overhead monitoring using agents that collect memory telemetry data at the host level. A remote service aggregates and analyzes the data from different hosts using a bucketization-pivot scheme. When leaking is detected in a bucket, RESIN triggers an analysis of the process instances in that bucket. For highly suspicious leaks, RESIN performs live heap snapshotting and compares the result against regular heap snapshots in a reference database. After generating multiple heap snapshots, RESIN runs a diagnosis algorithm to localize the root cause of the leak and generates a diagnosis report, attached to the alert ticket, to assist developers with further analysis. Finally, RESIN automatically mitigates the leaking process.

Detection algorithms


There are unique challenges in memory leak detection in cloud infrastructure:

  • Noisy memory usage caused by changing workloads and interference in the environment leads to high noise when using a static, threshold-based detection approach.
  • Memory leaks in production systems are usually fail-slow faults that can last days, weeks, or even months, and capturing such gradual change in a timely manner is difficult.
  • At the scale of the Azure global cloud, it is not practical to collect fine-grained data over long periods of time.

To address these challenges, RESIN uses a two-level scheme to detect memory leak symptoms: a global, bucket-based pivot analysis to identify suspicious components, and a local, per-process leak detection to identify leaking processes.

With the bucket-based pivot analysis at the component level, we categorize raw memory usage into a number of buckets and transform the usage data into a summary of the number of hosts in each bucket. A severity score for each bucket is then calculated from the deviations and the host count in that bucket, and anomaly detection is performed on the time-series data of each bucket of each component. The bucketization approach not only represents the workload trend robustly, with tolerance for noise, but also reduces the computational load of the anomaly detection.
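
To make the bucketization-pivot idea concrete, here is a minimal sketch in Python. The bucket boundaries, the severity formula, and the anomaly test are illustrative assumptions for this sketch, not RESIN's published implementation:

```python
import numpy as np

# Illustrative bucket edges (fraction of host memory in use); RESIN's
# real bucket boundaries are not public, so these are assumptions.
BUCKET_EDGES = [0.0, 0.5, 0.7, 0.8, 0.9, 1.01]

def bucketize(usage_by_host):
    """Collapse per-host memory usage into per-bucket host counts."""
    counts, _ = np.histogram(list(usage_by_host.values()), bins=BUCKET_EDGES)
    return counts  # one integer per bucket

def severity(bucket_counts, baseline_counts):
    """Toy severity score: positive deviation from a baseline,
    weighted by how many hosts sit in the bucket."""
    deviation = bucket_counts - baseline_counts
    return np.maximum(deviation, 0) * bucket_counts

def is_leaking(bucket_series, z_threshold=3.0):
    """Naive anomaly test on one bucket's host-count time series:
    flag if the latest value is a z-score outlier versus history."""
    history, latest = bucket_series[:-1], bucket_series[-1]
    mu, sigma = np.mean(history), np.std(history) + 1e-9
    return (latest - mu) / sigma > z_threshold
```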

However, detection at the component level alone is not sufficient for developers to investigate a leak efficiently, because many processes normally run on a component. When a leaking bucket is identified at the component level, RESIN runs a second-level detection scheme at process granularity to narrow the scope of investigation. It outputs the suspected leaking process, its start and end time, and a severity score.
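
One simple way to picture the process-level pass is a trend fit over each process's memory series. The following sketch is hypothetical (RESIN's actual detector is more elaborate), but it shows the shape of the output described above:

```python
import numpy as np

def detect_process_leak(timestamps, rss_bytes, min_slope=1e5):
    """Hypothetical process-level check: fit a line to a process's
    memory usage and report a leak if it grows steadily. The slope
    threshold and noise guard are placeholders for illustration."""
    t = np.asarray(timestamps, dtype=float)
    y = np.asarray(rss_bytes, dtype=float)
    slope, intercept = np.polyfit(t, y, deg=1)  # bytes per second
    residuals = y - (slope * t + intercept)
    noisy = np.std(residuals) > 0.5 * np.std(y)  # crude noise guard
    if slope > min_slope and not noisy:
        return {"leaking": True, "start": t[0], "end": t[-1],
                "severity": slope * (t[-1] - t[0])}  # approx. bytes leaked
    return {"leaking": False}
```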

Diagnosis of detected leaks


Once a memory leak is detected, RESIN takes a snapshot of the live heap, which contains all memory allocations referenced by the running application, and analyzes the snapshots to pinpoint the root cause of the detected leak. This makes memory leak alerts actionable.

RESIN leverages the Windows heap manager's snapshot capability to perform live profiling. However, heap collection is expensive and can be intrusive to the host's performance. To minimize the overhead of heap collection, several factors shape how snapshots are taken:

  • The heap manager stores only limited information in each snapshot, such as the stack trace and size of each active allocation.
  • RESIN prioritizes candidate hosts for snapshotting based on leak severity, noise level, and customer impact. By default, the top three hosts in the suspect list are selected to ensure successful collection.
  • RESIN uses a long-term, trigger-based strategy to ensure the snapshots capture the complete leak. To decide when to stop trace collection, RESIN analyzes memory growth patterns (such as steady, spike, or stair) and takes a pattern-based approach to the trace completion triggers (illustrated in the sketch after this list).
  • RESIN uses a periodic fingerprinting process to build reference snapshots, which are compared with the snapshot of a suspected leaking process to support diagnosis.
  • RESIN analyzes the collected snapshots to output the stack traces of the root-cause allocations.
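
As referenced in the list above, a pattern-based completion trigger could be sketched as follows; the three growth patterns come from the text, while the snapshot counts and growth threshold are invented for illustration:

```python
def completion_trigger(pattern, snapshots_taken, growth_since_first_mb):
    """Hypothetical trace-completion rule keyed on the observed memory
    growth pattern (steady, spike, or stair). Thresholds are placeholders."""
    if pattern == "spike":
        # A spike completes quickly: a snapshot before and after the jump.
        return snapshots_taken >= 2
    if pattern == "stair":
        # Stair growth needs snapshots spanning several plateau transitions.
        return snapshots_taken >= 4
    # Steady growth: wait until enough memory has accumulated between
    # snapshots that leaked allocations dominate the snapshot diff.
    return snapshots_taken >= 2 and growth_since_first_mb > 512
```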

Mitigation of detected leaks


When a memory leak is detected, RESIN attempts to automatically mitigate the issue to avoid further customer impact. Depending on the nature of the leak, different mitigation actions apply, and RESIN uses a rule-based decision tree to choose the action that minimizes impact.

If the memory leak is localized to a single process or Windows service, RESIN attempts the lightest mitigation: simply restarting the process or the service. An OS reboot can also resolve software memory leaks, but it takes much longer and can cause virtual machine downtime, so it is normally reserved as the last resort. For a non-empty host, RESIN uses solutions such as Project Tardigrade, which performs a kernel soft reboot that skips hardware initialization, after live virtual machine migration, to minimize user impact. A full OS reboot is performed only when the soft reboot is ineffective.
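
A minimal sketch of such a rule-based escalation, with field names assumed purely for illustration, could look like this:

```python
def choose_mitigation(leak):
    """Toy escalation mirroring the order described above: lightest
    action first, full OS reboot as the last resort."""
    # Leak confined to one process or Windows service: restart it.
    if leak["scope"] in ("process", "service"):
        return "restart_process_or_service"
    # Host-wide leak on a non-empty host: live-migrate VMs away, then
    # try a kernel soft reboot that skips hardware initialization.
    if not leak["soft_reboot_attempted"]:
        return "live_migrate_vms_then_kernel_soft_reboot"
    # Last resort, taken only when the soft reboot was ineffective.
    return "full_os_reboot"
```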

RESIN stops applying mitigation actions to a target once the detection engine no longer considers the target leaking.

Result and impact of memory leak detection


RESIN has been running in production in Azure since late 2018, and to date it has been used to monitor millions of host nodes and hundreds of host processes daily. Overall, RESIN memory leak detection achieves 85% precision and 91% recall, despite the rapidly growing scale of the monitored cloud infrastructure.

The end-to-end benefits brought by RESIN are clearly demonstrated by two key metrics:

1. Virtual machine unexpected reboots: the average number of reboots per one hundred thousand hosts per day due to low memory.
2. Virtual machine allocation error: the ratio of erroneous virtual machine allocation requests due to low memory.

Between September 2020 and December 2023, the virtual machine reboots were reduced by nearly 100 times, and allocation error rates were reduced by over 30 times. Furthermore, since 2020, no severe outages have been caused by Azure host memory leaks.

Source: microsoft.com

Tuesday 9 April 2024

Azure Maia for the era of AI: From silicon to software to systems

As the pace of AI and the transformation it enables across industries continues to accelerate, Microsoft is committed to building and enhancing our global cloud infrastructure to meet the needs of customers and developers with faster, more performant, and more efficient compute and AI solutions. Azure AI infrastructure comprises technology from industry leaders as well as Microsoft’s own innovations, including Azure Maia 100, Microsoft’s first in-house AI accelerator, announced in November. In this blog, we dive deeper into the technology and the journey of developing Azure Maia 100: hardware and software co-designed from the ground up to run cloud-based AI workloads and optimized for Azure AI infrastructure.

Azure Maia 100, pushing the boundaries of semiconductor innovation


Maia 100 was designed to run cloud-based AI workloads, and the design of the chip was informed by Microsoft’s experience running complex, large-scale AI workloads such as Microsoft Copilot. Maia 100 is one of the largest processors made on the 5-nanometer node, using advanced packaging technology from TSMC.

Through collaboration with Azure customers and leaders in the semiconductor ecosystem, such as foundry and EDA partners, we will continue to apply real-world workload requirements to our silicon design, optimizing the entire stack from silicon to service, and delivering the best technology to our customers to empower them to achieve more.

End-to-end systems optimization, designed for scalability and sustainability 


When developing the architecture for the Azure Maia AI accelerator series, Microsoft reimagined the end-to-end stack so that our systems could handle frontier models more efficiently and in less time. AI workloads demand infrastructure that is dramatically different from other cloud compute workloads, requiring increased power, cooling, and networking capability. Maia 100’s custom rack-level power distribution and management integrates with Azure infrastructure to achieve dynamic power optimization. Maia 100 servers are designed with a fully custom, Ethernet-based network protocol with aggregate bandwidth of 4.8 terabits per accelerator to enable better scaling and end-to-end workload performance.

When we developed Maia 100, we also built a dedicated “sidekick” to match the thermal profile of the chip and added rack-level, closed-loop liquid cooling to Maia 100 accelerators and their host CPUs to achieve higher efficiency. This architecture allows us to bring Maia 100 systems into our existing datacenter infrastructure, and to fit more servers into these facilities, all within our existing footprint. The Maia 100 sidekicks are also built and manufactured to meet our zero waste commitment.

Co-optimizing hardware and software from the ground up with the open-source ecosystem


From the start, transparency and collaborative advancement have been core tenets in our design philosophy as we build and develop Microsoft’s cloud infrastructure for compute and AI. Collaboration enables faster iterative development across the industry—and on the Maia 100 platform, we’ve cultivated an open community mindset from algorithmic data types to software to hardware.  

To make it easy to develop AI models on Azure AI infrastructure, Microsoft is creating the software for Maia 100 that integrates with popular open-source frameworks like PyTorch and ONNX Runtime. The software stack provides rich and comprehensive libraries, compilers, and tools to equip data scientists and developers to successfully run their models on Maia 100. 

To optimize workload performance, AI hardware typically requires the development of custom kernels that are silicon-specific. We envision seamless interoperability among AI accelerators in Azure, so we have integrated Triton from OpenAI. Triton is an open-source programming language that simplifies kernel authoring by abstracting the underlying hardware. This will empower developers with portability and flexibility without sacrificing efficiency or the ability to target AI workloads.
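
For a flavor of what hardware-agnostic kernel authoring looks like, below is the canonical vector-addition kernel from the Triton tutorials; note that nothing in it is specific to any particular silicon:

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

A kernel like this is launched over a grid, for example add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024), and Triton compiles it for whatever accelerator backend is underneath.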

Maia 100 is also the first implementation of the Microscaling (MX) data format, an industry-standardized data format that leads to faster model training and inferencing times. Microsoft has partnered with AMD, ARM, Intel, Meta, NVIDIA, and Qualcomm to release the v1.0 MX specification through the Open Compute Project community so that the entire AI ecosystem can benefit from these algorithmic improvements.
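
The core MX idea is a block of elements sharing one narrow, power-of-two scale. As a rough illustration only (not the specification itself), the toy sketch below quantizes 32-element blocks using an int8 element type; the actual v1.0 spec defines FP8, FP6, FP4, and INT8 element formats with an E8M0 shared scale:

```python
import numpy as np

BLOCK = 32  # MX v1.0 groups elements into blocks of 32 sharing one scale

def mx_quantize_block(x):
    """Simplified illustration of the MX idea: one shared power-of-two
    scale per block plus narrow per-element values (int8 here for brevity)."""
    assert x.size == BLOCK
    amax = np.max(np.abs(x)) + 1e-30
    # E8M0-style shared scale: a pure power of two chosen so the largest
    # element fits within the int8 range.
    scale = 2.0 ** np.ceil(np.log2(amax / 127.0))
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return scale, q

def mx_dequantize_block(scale, q):
    return scale * q.astype(np.float32)
```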

Azure Maia 100 is a unique innovation combining state-of-the-art silicon packaging techniques, ultra-high-bandwidth networking design, modern cooling and power management, and algorithmic co-design of hardware with software. We look forward to continuing to advance our goal of making AI real by introducing more silicon, systems, and software innovations into our datacenters globally.

Source: microsoft.com

Saturday 6 April 2024

Get ready for AI at the Migrate to Innovate digital event

Organizations of all sizes are recognizing that using AI fuels the kind of innovation that’s needed to maintain a competitive edge. What’s often less clear is how to prepare your organization to be able to take full advantage of AI. For organizations running business-critical workloads on Windows Servers and SQL Server, how do you get from running in a traditional, on-premises environment to operating in an environment that supports AI and other modern technologies?

Get ready for AI


To get a solid understanding of how to navigate this type of move, join us at the free digital event Migrate to Innovate: Be AI Ready, Be Secure on Tuesday, April 16, 2024, from 9:00 AM to 11:00 AM Pacific Time. Register now to attend information-packed sessions that will help you understand the challenges organizations face in preparing for AI and cloud-native technologies, and what you need to have in place to solve those challenges.

Address your most pressing business challenges


One of the biggest obstacles in the path to modernization is balancing the need to embrace the latest advancements with the need to meet current business challenges. Whether it’s managing rising costs, safeguarding against security threats, maintaining compliance, or controlling IT sprawl as business expands, there are a lot of different priorities competing for your focus, time, and resources.

At the Migrate to Innovate digital event, you’ll learn how Azure provides an optimized platform to fully embrace AI while addressing your most pressing business priorities by maximizing ROI, performance, and resilience. Sessions will focus on how to optimize migration of your Windows Server and SQL Server workloads to Azure to position your organization for innovation, efficiency, growth, and long-term success.

Be the first to see product updates, deep dives, and demos


Attend the Migrate to Innovate digital event to get a first look at what’s included in the upcoming Windows Server 2025 release. View product demos of the newest AI innovations, including Microsoft Copilot. Join product experts for deep-dive sessions covering Windows Server, SQL Server, AI, security, and a range of modernization-related capabilities. Learn about the latest updates to intelligent Azure databases that power your data and AI workloads, and discover strategies for gaining cloud agility, including running VMware workloads across cloud, hybrid, and on-premises environments.

Session highlights include:

  • Keynote address—Understand current business challenges and learn how migrating to Azure provides the agility that’s needed to address them. Hear about the latest product announcements and advancements that will help you get ready to take advantage of AI.
  • Migrate to Azure to be AI Ready—Learn the steps that organizations need to take to be ready for AI. Watch demos showing how customers are using AI solutions—including Microsoft Copilot in Azure—to solve complex problems, and how migration accelerates innovation with AI.
  • Customer interview—Hear customers discuss why they chose to make the move to Azure, and how migration has provided them with the business outcomes they need for success, including AI-readiness, security, cost savings, and performance.
  • Migrate your Windows Server to Azure—Learn how Azure is optimized to help you migrate and modernize your Windows Server workloads. Discover on-premises, hybrid, and cloud scenarios for VMware. Watch a demo on Windows Server 2025 and its support for AI capabilities as well as hybrid and edge scenarios.
  • Migrate your Data to Azure—In the era of AI, learn how to power mission-critical applications with Azure databases. See how to simplify migration with the help of Azure Arc migration assessments.
  • New risks, New rules: Secure Code-to-Cloud Migration—Find out how Azure helps you secure your entire migration journey using a cloud-native application platform (CNAPP), Microsoft Defender for Cloud, and Microsoft Copilot for Security.
  • Get cloud agility anywhere: Strategies for VMware workloads—Understand the issues that on-premises VMware users face, and learn how taking an adaptive cloud approach with Azure helps address these challenges.  
  • Optimize your migration with key guidance, offerings, and tools—Learn about the three most important optimization activities, and discover resources, guidance, and tools that will help you plan and implement your migration solution.

Discover the business outcomes of migrating to Azure


Register for the event to understand how to get your organization on the path to modernization, and hear about the business outcomes customers are seeing when they migrate Windows Server and SQL Server workloads to Azure, including:

  • AI readiness: Organizations get results with an AI-ready foundation on Azure. In a study of customers using Azure AI services, a composite organization based on the experiences of six interviewees achieved a three-year ROI of 284%. Work output and operational efficiency increased, employee collaboration and safety improved, and organizations reported faster and more data-driven decision-making.
  • Code-to-cloud security: With Azure, you get a complete code-to-cloud security platform. From foundational security to cloud-native workload protection, replacing multiple third-party security tools with comprehensive, multilayered security reduces risk and costs.
  • Maximizing ROI and performance: Workloads run faster and at a lower cost on Azure than with other cloud providers. AWS is up to 5 times more expensive than Azure for Windows Server and SQL Server—and SQL Server runs up to 5.5 times faster on Azure than on AWS.
  • Cloud agility anywhere: Azure meets organizations where they are in their migration journey through an adaptive cloud approach. Azure provides the tools and support to help you secure and govern your entire digital estate across hybrid, multicloud, and edge environments on your own terms.

Source: microsoft.com

Tuesday 2 April 2024

The Microsoft Intelligent Data Platform—Unleash your data and accelerate your transformation

Canopius, a global specialty and P&C (re)insurance business, knew its success hinged on its ability to access all its data but that its existing infrastructure was hurting, not helping, its ability to do so. It wanted to capitalize on its data for increased productivity and it wanted a cloud platform capable of handling advanced technologies with ease. In simple terms, Canopius wanted a data platform that would truly enable its success, today and well into the future.

Like so many other companies today, it faced a decision: update its hardware or invest in a modern platform.

Ultimately, Canopius bet on itself and its future. Drawn to the transformative potential of generative AI, it migrated its data to Microsoft Azure and adopted a combination of our newest, most powerful services. The result? An ability to pivot quickly to changing market conditions. An IT team that, armed with the right tools and technologies, could focus solely on innovation. And, with full access to its unique data, a true competitive advantage. Enter the Microsoft Intelligent Data Platform.

Get to know the Microsoft Intelligent Data Platform


The Intelligent Data Platform is an idea born out of countless conversations with customers like Canopius who want, once and for all, to get their data house in order and gain a competitive edge. They know their data holds the key to their success. They want to implement AI tools. But after 15 or more years of data and cloud investments, they find their ambition limited by the very solutions they’ve built over time. Patch after patch, their data estate has become fragmented, hard to manage, and incapable of handling AI-era workloads.

Even when customers know their infrastructure needs upgrading, many find it difficult to make the switch—and for good reasons. First, fixing a broken or outdated data platform and futureproofing a data platform are two distinct and sizable problems to solve. Second, I’ll be the first to admit our industry doesn’t always make it easy to choose. Azure alone offers more than 300 products, with more on the near-term horizon, and we don’t often present complete solutions because the reality is one size will never fit all. Third, customers often have favorite products and need reassurances those products will interoperate with other products. It’s natural for solutions to become more specialized and thereby more fragmented over time. And finally, migrating data accumulated over years, sometimes decades, is an exercise that can feel risky even to the most experienced IT professional. These are all valid concerns and roadblocks to progress.

With the emergence of generative AI, the original idea of the Intelligent Data Platform has evolved into what I now regard as our strongest, simplest recommendation for a cloud solution: one capable of solving the data fragmentation problems of the past and present, and ready to empower uncapped innovation for even the most ambitious organizations. Generative AI changes a data strategy: data solutions created today must be designed with the foresight to easily manage spikes in data, new workloads, and new technologies for many years to come.

The recommendation is this: migrate to some of the most advanced, AI-ready databases on the market today, take advantage of integrated advanced analytics and the latest AI tools, and invest in a comprehensive, end-to-end security solution.

The newest technologies and products on the market today have been built with the future in mind: future data requirements and geo-economic environments. To focus on innovation, companies need access to the richness of their data, with roadblocks such as interoperability and security concerns removed. That’s table stakes. They also need faster, better access to their data, and tools that empower innovation and transformation.

This four-workload construct works no matter which cloud vendor you choose, as long as the technologies within are built for future data demands, interoperability, and security. For customers, existing and new, interested in migrating to Azure or adopting a complete intelligent data platform, we recommend the following: 

1. Databases: The foundation of a powerful solution begins with the database. Our latest hyperscale databases are an entirely new category, designed and tuned for generative AI to provide limitless leeway to reinvent business processes and create apps not possible before. They’re also built to process data at any scale with near-zero downtime, making real-time analytics and insights possible. Recommended products: Azure Cosmos DB, Azure SQL DB Hyperscale.
2. Analytics: When paired with a hyperscale database, our newest unified analytics platform delivers tailored business intelligence in real time. This all-new method of democratizing intelligence emboldens leadership teams to quickly make decisions that move the needle. Recommended product: Microsoft Fabric.
3. AI: Our newest AI technologies and tools give creators and thinkers the resources they need to ideate, conceptualize, and build an organization’s next big breakthrough, and its leaders the ability to strategize, reinvent, and streamline in new ways. Recommended products: Azure AI Studio, Azure Machine Learning, Azure AI Search, Azure AI Services.
4. Security: We’ve built security controls into every Azure product within our portfolio, and our industry-leading security solution helps protect data from every angle, for better visibility, safety, and compliance. We strongly recommend the added protection that comes with an end-to-end solution. Recommended products: Microsoft Purview, Microsoft Defender for Cloud, Microsoft Defender XDR, Microsoft Sentinel, Microsoft Entra.

When combined, these products transform into a futureproofed solution that makes powerful decision making, organizational performance, and limitless innovation possible.

Data is the currency of AI


Data is the key to every organization’s success, and that has taken on a whole new level of importance in the era of AI where data acts as the currency of AI innovation. Large language models can understand the world’s knowledge at a foundational level, but they can’t understand your business.

A few months ago, I connected with a customer—in fact, the Chief Executive Officer of a large company—who asked if he should pause the work his IT department was doing to integrate and modernize their data estate since, now, it’s all about AI. I appreciated his question, but my answer was: no, it’s exactly the opposite. So often, a new technology’s arrival is meant to replace something older. That isn’t the case here. A fully integrated, modern data estate is the bedrock of an innovation engine. A modern estate makes it possible to access all your data. And it’s your data, when applied to AI ideas and inventions, that will create magical results and transformational experiences.

For Epiroc, the Swedish manufacturer that supplies mining and construction industries with advanced equipment, tools, and services, steel is essential to the success of their business. The company wanted to consistently deliver a product of high quality and improve efficiencies but, in a world where there are more than 3,500 grades of steel, it needed to create company-wide best practices for data sharing and a repeatable process to get the output it wanted. Microsoft partnered with Epiroc to create an enterprise-scale machine learning AI factory as part of a broader, modern Intelligent Data Platform. Since it had already taken the step to modernize its data estate on Azure, establishing an enterprise-scale AI factory was quick and easy. And the result? A consistent, quality product, increased organizational efficiencies, and reduced waste.

Unmatched choice with our partner ecosystem


I recently spoke with another customer who successfully migrated to Azure and seamlessly integrated Databricks into their new data platform. Another partner, already on Azure but wanting to invest in AI, integrated Azure AI products and Adobe, with impressive early results.

Every organization deserves choice. We don’t talk about it as often as we should, but those using and managing a data platform need to trust that it will serve their needs today and in the future. We recognize this and want to empower every organization’s success, no matter their legacy preferences, and our partner ecosystem offers incredible capabilities we are eager for our customers to take advantage of. For this reason, we’ve invested in this growing ecosystem of trusted partners whose products fully integrate with ours, and we’ll continue doing so. As a former partner myself, I firmly believe flexibility to choose the best solution for the job should always be a required feature of a powerful platform.

All clouds welcome


Data circumstances or preferences shouldn’t be a barrier, either. There are incredible benefits to adopting an all-cloud environment as part of a modern data strategy. But we recognize customers may want or need to choose an environment that includes multiple clouds or a mix of cloud and on-premises. The Intelligent Data Platform and all the products within it are successful in any adaptive cloud scenario.

Unleash your data for bold transformation


A key part of the Intelligent Data Platform, Azure AI, makes building data-informed ideas and experiences easy. Last year, we introduced Azure AI Studio, a one-stop-shop for generative AI creation offering model selection, data, security features, and DevOps in one interface. We directly connected Fabric to Azure AI Studio which, for customers, means they can start building AI apps without requiring manual data duplication or additional integration. We also built and tuned Azure Cosmos DB and Azure SQL DB Hyperscale for generative AI innovation. With hyperscale performance, built-in AI including vector search and copilots, and multi-layered AI-powered security embedded into the database, these databases unify, power, and protect the platform.

For over a year, generative AI has captured imaginations and attention. In this time, we’ve learned that AI’s potential comes to life at the intersection of an organization’s data and ideas. The world’s been generating and storing incredible amounts of data for so long, waiting for the day technology would catch up and simplify the process of using the data in bold, meaningful ways.

That day has arrived. The Intelligent Data Platform is the ideal data and AI foundation for every organization’s success, today and tomorrow.

Source: microsoft.com

Saturday 30 March 2024

Introducing modern data governance for the era of AI

The era of generative AI has arrived, offering new possibilities for every person, business, and industry. At the same time, the speed, scale, and sophistication of cyberattacks, increasing regulations, an ever-expanding data estate, and business demand for data insights are all converging. This convergence pressures business leaders to adopt a modern data governance and security strategy so they can confidently ensure AI readiness.

A modern data governance and security solution unifies data protection and governance capabilities, simplifies actions through business-friendly profiles and terminology with AI-powered business efficiency, and enables federated governance across a disparate multi-cloud data estate.

Microsoft Purview is a comprehensive set of solutions that can help your organization govern, protect, and manage data, wherever it lives. Microsoft Purview provides integrated coverage and helps address the fragmentation of data across organizations, the lack of visibility that hampers data protection and governance, and the blurring of traditional IT management roles. 

Today, we are excited to announce a reimagined data governance experience within Microsoft Purview, available in preview April 8, 2024. This new software-as-a-service (SaaS) experience offers sophisticated yet simple business-friendly interaction, integration across data sources, AI-enabled business efficiency, and actions and insights to help you put the ‘practice’ into your data governance practice.

Modern data governance with Microsoft Purview 


I led Microsoft through our own modern data governance journey over the past several years, and this experience exposed the realities, challenges, and key ingredients of modern data governance.

Our new Microsoft Purview data governance solution is grounded in years of applied learning and proven practices from navigating this data transformation journey along with the transformation journeys of our enterprise customers. To that end, our vision for a modern data governance solution is based on the following design principles: 

Anchored on durable business concepts 

The practice of data governance should enable an organization to accelerate the creation of responsible value from their data. By anchoring data governance investments to measurable business objectives and key results (OKRs), organizations can align their data governance practice to business priorities and demonstrate business value outcomes.

A unified, integrated, and extensible experience 

A modern data governance solution should offer a single-pane-of-glass experience that integrates across multi-cloud data estate sources for data curation, management, health controls, discovery, and understanding, backed with compliant, self-serve data access. The unified experience reduces the need for laborious and costly custom-built or multiple-point solutions. This enables a focus on accelerating data governance practices, activating federated data governance across business units, and ensuring leaders have real-time insights into governance health. 

Scale success with AI-enabled experiences 

An ever-growing and changing data estate demands simplicity in how it is governed, both to ensure business adoption and to make implementation efficient. Natural language interactions and machine learning (ML)-based recommendations across governance capabilities are critical to this simplification and to accelerating data governance adoption.

A culture of data governance and protection 

Data governance solutions must be built for the practice of federated data governance, unique to each organization. Just as adopting cloud solutions requires one to become a cloud company, adopting data governance requires one to become a data governance company. Modern data governance success requires C-Suite alignment and support, and must be simple, efficient, customizable, and flexible to activate your unique practice. 

Introducing data governance for the business, by the business 


We are thrilled to introduce the new Microsoft Purview data governance experience. Our new data governance capabilities will help any organization of any size to accelerate business value creation in the era of AI.

A business-friendly approach to govern multi-cloud data estates 

Designed with the business in mind, the new governance experience supports different functions across the business with clear role definitions for governance administrators, business domain creators, data health owners, and data health readers.

Within Data Management, customers can easily define and assign business-friendly terminology (such as Finance and Claims). Business-friendly language follows the data governance experience through Data Products (a collection of data assets used for a business function), Business Domains (ownership of Data Products), Data Quality (assessment of quality), Data Access, Actions, and Data Estate Health (reports and insights). 

This new data governance experience allows you to scan and search data across your data estate assets.

Built-in data quality capabilities and rules which follow the data 

The new data quality model enables your organization to set rules top down across business domains, data products, and the data assets themselves. Policies can be set on a term or rule and flow through automatically, saving data stewards hours to days of manual work, depending on the scale of your estate. Once rules and policies are applied, the data quality model generates data quality scores at the asset, data product, or business domain level, giving you snapshot insights into your data quality relative to your business rules.

Within the data quality model, there are two metadata analysis capabilities: 1) profiling, which gives quick insights from a sample set, and 2) data quality scans, which perform in-depth scans of full data sets. These capabilities use your defined rules or built-in templates to reason over your metadata and give you data quality insights and recommendations.

Apply industry standard controls in data estate health management

In partnership with the EDM Council, the new data health controls include a set of 14 standards for cloud data management. The standards govern how data is to be managed, while the controls measure how data assets are used and accessed; examples include metadata completeness, cataloging, classification, access entitlement, and data quality. A data office can configure the rules that determine each score and define what constitutes a red, yellow, or green indicator, ensuring your rules and indicators reflect the unique standards of your organization.
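
As a hypothetical illustration of how a configured control might map its score to the red/yellow/green indicator described above (the thresholds here are placeholders a data office would set for itself):

```python
def control_indicator(score, green_at=0.9, yellow_at=0.7):
    """Map a control score in [0, 1] (e.g., the fraction of assets
    passing a metadata-completeness rule) to a health indicator.
    Thresholds are illustrative, not product defaults."""
    if score >= green_at:
        return "green"
    if score >= yellow_at:
        return "yellow"
    return "red"

# Example: 82% of assets in a business domain have complete metadata.
print(control_indicator(0.82))  # -> "yellow" under these thresholds
```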

Summarized insights help activate and sustain your practice 

Data governance is a practice which is nurtured over time. Aggregated insights help you put the “practice” into your data governance practice by showcasing the overall health of your governed data estate. Built-in reports surface deep insight across a variety of dimensions: assets, catalog adoption, classifications, data governance, data stewardship, glossary, and sensitivity labels.

The Data Governance report, for example, can be filtered by business domain, data product, and status for deeper insights.

Stay on top of data governance health with aggregated actions

The new Actions center aggregates and summarizes governance-related actions by role, data product, or business domain. Actions stem from usage or implementation falling out of alignment with defined controls. This interactive summary makes it easy for teams to manage and track actions: simply click on an action to make the required change. Cleaning up outstanding actions improves the overall posture of your data governance practice, which is key to making governance a team sport.

Announcing technology partnerships for even greater customer value 


We are excited to announce a solution initiative with Ernst & Young LLP (EY US), which will bring its extensive experience in data solutions within financial services to a collaboration with Microsoft on data governance reports and playbooks purpose-built for US-oriented financial services customers. These reports and playbooks aim to accelerate customers’ time to value when activating a governance practice that adheres to the unique regulatory needs of the financial sector. The assets will be made available in Azure Marketplace over the course of the preview, and the learnings will help inform the future product roadmap.

Additionally, a modern data governance solution integrates and extends across your technology estate. With this new data governance experience, we are also excited to announce technology partnerships that will help seamlessly extend the value of Microsoft Purview to customers through pre-built integration. Integrations will light up over the course of preview and be available in Azure Marketplace.

Master Data Management

◉ CluedIn brings native Master Data Management and Data Quality functionality to Microsoft Fabric, Microsoft Purview, and the Azure stack.
◉ Profisee Master Data Management is a complementary and necessary piece of your data governance strategy.
◉ Semarchy combines master data management, data intelligence, and data integration into a singular application in any environment.

Data Lineage

◉ Solidatus empowers data-rich enterprises to visualize, understand, and govern data like never before.

Source: microsoft.com

Thursday 28 March 2024

Microsoft Azure delivers game-changing performance for generative AI Inference

Microsoft Azure has delivered industry-leading results for AI inference workloads among cloud service providers in the most recent MLPerf Inference results published publicly by MLCommons. The Azure results were achieved using the new NC H100 v5 series virtual machines (VMs) powered by NVIDIA H100 NVL Tensor Core GPUs and reinforced the commitment from Azure to designing AI infrastructure that is optimized for training and inferencing in the cloud.

The evolution of generative AI models


Models for generative AI are rapidly expanding in size and complexity, reflecting a prevailing trend in the industry toward ever-larger architectures. Industry-standard benchmarks and cloud-native workloads consistently push the boundaries, with models now reaching billions and even trillions of parameters. A prime example of this trend is the recent addition of Llama 2, which boasts a staggering 70 billion parameters, making it MLPerf’s most significant test of generative AI to date (figure 1). This monumental leap in model size is evident when comparing it to an earlier industry standard such as the large language model GPT-J, which has roughly 10x fewer parameters. Such growth underscores the evolving demands and ambitions within the AI industry, as customers strive to tackle increasingly complex tasks and generate more sophisticated outputs.

Tailored specifically to the dense, generative inferencing needs of models like Llama 2, the Azure NC H100 v5 VMs mark a significant leap forward in performance for generative AI applications. Their purpose-driven design ensures optimized performance, making them an ideal choice for organizations seeking to harness the power of AI with reliability and efficiency. With the NC H100 v5-series, customers can expect enhanced capabilities for their AI infrastructure, empowering them to tackle complex tasks with ease and efficiency.

Figure 1: Evolution of the size of the models in the MLPerf Inference benchmarking suite. 

However, the transition to larger model sizes necessitates a shift toward a different class of hardware that is capable of accommodating the large models on fewer GPUs. This paradigm shift presents a unique opportunity for high-end systems, highlighting the capabilities of advanced solutions like the NC H100 v5 series. As the industry continues to embrace the era of mega-models, the NC H100 v5 series stands ready to meet the challenges of tomorrow’s AI workloads, offering unparalleled performance and scalability in the face of ever-expanding model sizes.

Enhanced performance with purpose-built AI infrastructure


The NC H100 v5-series shines with purpose-built infrastructure, featuring a superior hardware configuration that yields remarkable performance gains compared to its predecessors. Each GPU within this series is equipped with 94GB of HBM3 memory. This substantial increase translates into a 17.5% boost in memory size and a 64% boost in memory bandwidth over the previous generation. Powered by NVIDIA H100 NVL PCIe GPUs and 4th-generation AMD EPYC™ Genoa processors, these virtual machines feature up to 2 GPUs, alongside up to 96 non-multithreaded AMD EPYC Genoa processor cores and 640 GiB of system memory.
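
The quoted memory-size gain follows directly from the capacities involved; as a quick check against the prior generation's 80GB GPUs cited in the text:

```python
# Quick check of the quoted memory gain, using the figures in the text:
# H100 NVL carries 94 GB of HBM3 versus 80 GB on the prior generation.
prev_gb, new_gb = 80, 94
print(f"memory size: +{(new_gb / prev_gb - 1):.1%}")  # -> +17.5%
```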

In today’s announcement from MLCommons, the NC H100 v5 series premiered its performance results in the MLPerf Inference v4.0 benchmark suite. Noteworthy among these achievements is a 46% performance gain over competing products equipped with GPUs of 80GB of memory (figure 2), driven by the 17.5% increase in memory size (94GB) of the NC H100 v5-series, which lets large models fit efficiently onto fewer GPUs. For smaller models like GPT-J, with 6 billion parameters, there is a notable 1.6x speedup from the previous generation (NC A100 v4) to the new NC H100 v5. This enhancement is particularly advantageous for customers with dense inferencing jobs, as it enables them to run multiple tasks in parallel with greater speed and efficiency while using fewer resources.

Figure 2: Azure results on the model Llama2 (70 billion parameters) from MLPerf Inference v4.0 in March 2024 (4.0-0004) and (4.0-0068). 

Performance delivering a competitive edge


The increase in performance matters not just relative to previous generations of comparable infrastructure. In the MLPerf benchmark results, Azure’s NC H100 v5 series virtual machines stand out compared to the other cloud submissions. Notably, when compared to cloud offerings with smaller memory capacities per accelerator, such as those with 16GB of memory per accelerator, the NC H100 v5 series VMs exhibit a substantial performance boost. With nearly six times the memory per accelerator, Azure’s purpose-built AI infrastructure series demonstrates a performance speedup of 8.6x to 11.6x (figure 3). This represents a performance increase of 50% to 100% for every byte of GPU memory, showcasing the capacity of the NC H100 v5 series. These results underscore the series’ ability to lead performance standards in cloud computing, offering organizations a robust solution for their evolving computational requirements.
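
A quick arithmetic check of the per-byte claim, using only the figures cited above:

```python
# The NC H100 v5 has 94 GB per accelerator versus 16 GB for the
# comparison offering (about 5.9x the memory), while delivering an
# 8.6x to 11.6x speedup on GPT-J.
mem_ratio = 94 / 16
for speedup in (8.6, 11.6):
    print(f"{speedup}x speedup -> +{speedup / mem_ratio - 1:.0%} per byte")
# -> +46% and +97%, i.e., roughly 50% to 100% more performance per byte
```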

Figure 3: Performance results on the model GPT-J (6 billion parameters) from MLPerf Inference v4.0 in March 2024 on Azure NC H100 v5 (4.0-0004) and an offering with 16GB of memory per accelerator (4.0-0045) – with one accelerator each.

In conclusion, the launch of the NC H100 v5 series marks a significant milestone in Azure’s pursuit of innovation in cloud computing. With its outstanding performance, advanced hardware capabilities, and seamless integration with Azure’s ecosystem, the NC H100 v5 series is reshaping the landscape of AI infrastructure, enabling organizations to fully leverage the potential of generative AI inference workloads. The latest MLPerf Inference v4.0 results underscore the series’ capacity to excel in the most demanding AI workloads, setting a new standard for performance in the industry. Furthermore, Microsoft’s commitment, announced during the NVIDIA GPU Technology Conference (GTC), to continue innovating by bringing even more powerful GPUs to the cloud, such as the NVIDIA Grace Blackwell GB200 Tensor Core GPUs, further strengthens the prospects for advancing AI capabilities and driving transformative change in cloud computing.

Source: microsoft.com