
Tuesday, 3 October 2023

Manage your big data needs with HDInsight on AKS

As companies today look to do more with data, take full advantage of the cloud, and vault into the age of AI, they’re looking for services that process data at scale, reliably, and efficiently. Today, we’re excited to announce the upcoming public preview of HDInsight on Azure Kubernetes Service (AKS), our cloud-native, open-source big data service, completely rearchitected on Azure Kubernetes Service infrastructure with two new workloads and numerous improvements across the stack.

HDInsight on AKS amplifying performance

HDInsight on AKS includes Apache Spark, Apache Flink, and Trino workloads on an Azure Kubernetes Service infrastructure, and features deep integration with popular Azure analytics services like Power BI, Azure Data Factory, and Azure Monitor, while leveraging Azure managed services for Prometheus and Grafana for monitoring. HDInsight on AKS is an end-to-end, open-source analytics solution that is easy to deploy and cost-effective to operate. 


HDInsight on AKS helps customers leverage open-source software for their analytics needs by: 

  • Providing a curated set of open-source analytics workloads like Apache Spark, Apache Flink, and Trino. These workloads are the best-in-class open-source software for data engineering, machine learning, streaming, and querying.
  • Delivering managed infrastructure, security, and monitoring so that teams can spend their time building innovative applications without needing to worry about the other components of their stack. Teams can be confident that HDInsight helps keep their data safe. 
  • Offering flexibility that teams need to extend capabilities by tapping into today’s rich, open-source ecosystem for reusable libraries, and customizing applications through script actions.

Customers who are deeply invested in open-source analytics can use HDInsight on AKS to reduce costs by setting up fully functional, end-to-end analytics systems in minutes, leveraging ready-made integrations, built-in security, and reliable infrastructure. Our investments in performance improvements and features like autoscale enable customers to run their analytics workloads at optimal cost. HDInsight on AKS comes with a simple, consistent pricing structure: a flat rate per vCore per hour regardless of resource size or region, plus the cost of the resources provisioned.

Developers love HDInsight for the flexibility it offers to extend the base capabilities of open-source workloads through script actions and library management. HDInsight on AKS has an intuitive portal experience for managing libraries and monitoring resources. Developers also have the flexibility to use a Software Development Kit (SDK), Azure Resource Manager (ARM) templates, or the portal experience, based on their preference.

Open, managed, and flexible


HDInsight on AKS covers the full gamut of enterprise analytics needs spanning streaming, query processing, batch, and machine learning jobs with unified visualization. 

Curated open-source workloads

HDInsight on AKS includes workloads chosen based on their usage in typical analytics scenarios, community adoption, stability, security, and ecosystem support. This ensures that customers don’t need to grapple with the complexity of choice on account of myriad offerings with overlapping capabilities and inconsistent interoperability.  

Each of the workloads on HDInsight on AKS is the best-in-class for the analytics scenarios it supports: 

  • Apache Flink is the open-source distributed stream processing framework that powers stateful stream processing and enables real-time analytics scenarios. 
  • Trino is the federated query engine that is highly performant and scalable, addressing ad-hoc querying across a variety of data sources, both structured and unstructured.  
  • Apache Spark is the trusted choice of millions of developers for their data engineering and machine learning needs. 

HDInsight on AKS offers these popular workloads with a common authentication model, shared meta store support, and prebuilt integrations which make it easy to deploy analytics applications.

Managed service reduces complexity

HDInsight on AKS is a managed service on the Azure Kubernetes Service infrastructure. With a managed service, customers aren’t burdened with the management of infrastructure and other software components, including operating systems, AKS infrastructure, and open-source software. This ensures that enterprises can benefit from ongoing security, functional, and performance enhancements without investing precious development hours.

Containerization enables seamless deployment, scaling, and management of key architectural components. The inherent resiliency of AKS allows pods to be automatically rescheduled on newly commissioned nodes in case of failures. This means jobs can run with minimal disruptions to Service Level Agreements (SLAs). 

Customers combining multiple workloads in their data lakehouse need to deal with a variety of user experiences, resulting in a steep learning curve. HDInsight on AKS provides a unified experience for managing their lakehouse. Provisioning, managing, and monitoring all workloads can be done in a single pane of glass. Additionally, with managed services for Prometheus and Grafana, administrators can monitor cluster health, resource utilization, and performance metrics.  

Through the autoscale capabilities included in HDInsight on AKS, resources—and thereby cost—can be optimized based on usage needs. For jobs with predictable load patterns, teams can schedule the autoscaling of resources based on a predefined timetable. Graceful decommission enables the definition of wait periods for jobs to be completed before ramping down resources, elegantly balancing costs with experience. Load-based autoscaling can ramp resources up and down based on usage patterns measured by compute and memory usage. 

HDInsight on AKS marks a shift away from traditional security mechanisms like Kerberos. It embraces OAuth 2.0 as the security framework, providing a modern and robust approach to safeguarding data and resources. In HDInsight on AKS, authorization and access controls are based on managed identities. Customers can also bring their own virtual networks and associate them during cluster setup, increasing security and enabling compliance with their enterprise policies. The clusters are isolated with namespaces to protect data and resources within the tenant. HDInsight on AKS also allows management of cluster access using Azure Resource Manager (ARM) roles.

Customers who’ve participated in the private preview love HDInsight on AKS. 

Here’s what one user had to say about his experience. 

“With HDInsight on AKS, we’ve seamlessly transitioned from the constraints of our in-house solution to a robust managed platform. This pivotal shift means our engineers are now free to channel their expertise towards core business innovation, rather than being entangled in platform management. The harmonious integration of HDInsight with other Azure products has elevated our efficiency. Enhanced security bolsters our data’s integrity and trustworthiness, while scalability ensures we can grow without hitches. In essence, HDInsight on AKS fortifies our data strategy, enabling more streamlined and effective business operations.”

Matheus Antunes, Data Architect, XP Inc

Source: microsoft.com

Sunday, 26 April 2020

Optimize cost and performance with Query Acceleration for Azure Data Lake Storage

The explosion of data-driven decision making is motivating businesses to adopt a data strategy that provides better customer experiences, improves operational efficiencies, and enables real-time decisions based on data. As businesses become data driven, we see more customers build data lakes on Azure. We also hear that better cost efficiency and better performance are two of the most important requirements for a data lake architecture on Azure. Normally, these two qualities are traded off against each other: if you want more performance, you will need to pay more; if you want to save money, expect your performance curve to go down.

We’re announcing the preview of Query Acceleration for Azure Data Lake Storage—a new capability of Azure Data Lake Storage, which improves both performance and cost. The feature is now available for customers to start realizing these benefits and improving their data lake deployment on Azure.

How Query Acceleration for Azure Data Lake improves performance and cost


Big data analytics frameworks, such as Spark, Hive, and large-scale data processing applications, work by reading all of the data using a horizontally-scalable distributed computing platform with techniques such as MapReduce. However, a given query or transformation generally does not require all of the data to achieve its goal. Therefore, applications typically incur the costs of reading, transferring over the network, parsing into memory and finally filtering out the majority of the data that is not required. Given the scale of such data lake deployments, these costs become a major factor that impacts the design and how ambitious you can be. Improving cost and performance at the same time enhances how much valuable insight you can extract from your data.

Query Acceleration for Azure Data Lake Storage allows applications and frameworks to push down predicates and column projections so that they are applied at the time data is first read, meaning that all downstream data handling is spared the cost of filtering and processing data that is not required.

The following diagram illustrates how a typical application uses Query Acceleration to process data:


1. The client application requests file data by specifying predicates and column projections.

2. Query Acceleration parses the specified query and distributes work to parse and filter data.

3. Processors read the data from disk, parse it using the appropriate format, and then filter it by applying the specified predicates and column projections.

4. Query Acceleration combines the response shards to stream back to the client application.

5. The client application receives and parses the streamed response. The application doesn't need to filter any additional data and can apply the desired calculation or transformation directly.
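To make this flow concrete, here is a minimal, hedged sketch of how a client application might push a predicate and column projection down to storage using the azure-storage-blob Python SDK's query API. The account, container, blob, and column names are placeholders, and parameter names may differ slightly across SDK versions.

from azure.storage.blob import BlobServiceClient, DelimitedTextDialect

# Placeholder connection and object names (assumptions, not from the post).
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="telemetry", blob="readings.csv")

# Describe the CSV layout of the stored data and of the returned rows.
input_format = DelimitedTextDialect(delimiter=",", quotechar='"', has_header=True)
output_format = DelimitedTextDialect(delimiter=",", quotechar='"')

# The predicate and column projection are applied server-side, so only the
# matching rows and columns travel back over the network.
reader = blob.query_blob(
    "SELECT DeviceId, Temperature FROM BlobStorage WHERE Temperature > 75",
    blob_format=input_format,
    output_format=output_format,
)

print(reader.readall().decode("utf-8"))

The application then works directly with the filtered result, which is the savings described in step 5 above.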

Azure offers powerful analytic services


Query Acceleration for Azure Data Lake Storage is yet another example of how we’re committed to making Azure the best place for organizations to unlock transformational insights from all data. Customers can benefit from tight integration with other Azure services for building powerful, cloud-scale, end-to-end analytics solutions. These solutions support modern data warehousing, advanced analytics, and real-time analytics more easily and more economically.

We’re also committed to remaining an open platform where the best-in-breed open source solutions benefit equally from the innovations occurring at all points within the platform. With Azure Data Lake Storage underpinning an entire ecosystem of powerful analytics services, customers can extract transformational insights from all data assets.

Saturday, 27 July 2019

Silo busting 2.0—Multi-protocol access for Azure Data Lake Storage

Cloud data lakes solve a foundational problem for big data analytics—providing secure, scalable storage for data that traditionally lives in separate data silos. Data lakes were designed from the start to break down data barriers and jump start big data analytics efforts. However, a final “silo busting” frontier remained, enabling multiple data access methods for all data—structured, semi-structured, and unstructured—that lives in the data lake.

Providing multiple data access points to shared data sets allows tools and data applications to interact with the data in their most natural way. Additionally, this allows your data lake to benefit from the tools and frameworks built for a wide variety of ecosystems. For example, you may ingest your data via an object storage API, process the data using the Hadoop Distributed File System (HDFS) API, and then load the transformed data into a data warehouse using an object storage API.
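As a hedged illustration of that pattern, the sketch below writes an object through the Blob storage API and reads the same object back through the Data Lake Storage (ADLS Gen2) API on one hierarchical-namespace-enabled account, using the azure-storage-blob and azure-storage-file-datalake Python SDKs. The account, container, and file names are placeholders, and the container is assumed to already exist.

from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

conn = "<connection-string-for-hns-enabled-account>"  # placeholder

# Ingest via the object storage (Blob) API.
blob_service = BlobServiceClient.from_connection_string(conn)
blob_service.get_container_client("raw").upload_blob(
    name="events/2019-07-27.json", data=b'{"event": "signup"}', overwrite=True
)

# Read the same data via the Data Lake Storage API (directory/file semantics).
dl_service = DataLakeServiceClient.from_connection_string(conn)
file_client = dl_service.get_file_system_client("raw").get_file_client(
    "events/2019-07-27.json"
)
print(file_client.download_file().readall())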

Single storage solution for every scenario


We are very excited to announce the preview of multi-protocol access for Azure Data Lake Storage! Azure Data Lake Storage is a unique cloud storage solution for analytics that offers multi-protocol access to the same data. Multi-protocol access to the same data, via Azure Blob storage API and Azure Data Lake Storage API, allows you to leverage existing object storage capabilities on Data Lake Storage accounts, which are hierarchical namespace-enabled storage accounts built on top of Blob storage. This gives you the flexibility to put all your different types of data in your cloud data lake knowing that you can make the best use of your data as your use case evolves.

Single storage solution

Expanded feature set, ecosystem, and applications


Existing blob features such as access tiers and lifecycle management policies are now unlocked for your Data Lake Storage accounts. This is paradigm-shifting because your blob data can now be used for analytics. Additionally, services such as Azure Stream Analytics, IoT Hub, Azure Event Hubs capture, Azure Data Box, Azure Search, and many others integrate seamlessly with Data Lake Storage. Important scenarios like on-premises migration to the cloud can now easily move PB-sized datasets to Data Lake Storage using Data Box.

Multi-protocol access for Data Lake Storage also enables the partner ecosystem to use their existing Blob storage connector with Data Lake Storage.  Here is what our ecosystem partners are saying:

“Multi-protocol access for Azure Data Lake Storage is a game changer for our customers. Informatica is committed to Azure Data Lake Storage native support, and Multi-protocol access will help customers accelerate their analytics and data lake modernization initiatives with a minimum of disruption.”

You will not need to update existing applications to gain access to your data stored in Data Lake Storage. Furthermore, you can leverage the power of both your analytics and object storage applications to use your data most effectively.

Multi-protocol access enables features and ecosystem

Multiple API endpoints—Same data, shared features


This capability is unprecedented for cloud analytics services because it supports not only multiple protocols but also multiple storage paradigms. We now bring this powerful capability to your storage in the cloud. Existing tools and applications that use the Blob storage API gain these benefits without any modification. Directory and file-level access control lists (ACLs) are consistently enforced regardless of whether an Azure Data Lake Storage API or a Blob storage API is used to access the data.
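As a hedged sketch of that consistent enforcement, the snippet below sets a directory ACL through the Data Lake Storage API using the azure-storage-file-datalake Python SDK; the same ACL then applies no matter which protocol later reads the data. The account, file system, path, and Azure AD object ID are placeholders.

from azure.storage.filedatalake import DataLakeServiceClient

dl_service = DataLakeServiceClient.from_connection_string("<connection-string>")
fs = dl_service.get_file_system_client("curated")  # placeholder file system

directory = fs.get_directory_client("sales/2019")
# Grant read/execute to a specific Azure AD object ID in addition to the
# owning user and group, and deny everyone else.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:00000000-0000-0000-0000-000000000000:r-x"
)

# Inspect the ACL that will be enforced for both Blob and Data Lake API access.
print(directory.get_access_control())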

Multi-protocol access on Azure Data Lake Storage

Features and expanded ecosystem now available on Data Lake Storage


Multi-protocol access for Data Lake Storage brings together the best features of Data Lake Storage and Blob storage into one holistic package. It enables many Blob storage features and ecosystem support for your data lake storage.

Features
  • Access tiers: Cool and Archive tiers are now available for Data Lake Storage.
  • Lifecycle management policies: You can now set policies to tier or delete data in Data Lake Storage.
  • Diagnostics logs: Logs for the Blob storage API and the Azure Data Lake Storage API are now available in v1.0 and v2.0 formats.
  • SDKs: Existing Blob storage SDKs can now be used with Data Lake Storage.
  • PowerShell: PowerShell for data plane operations is now available for Data Lake Storage.
  • CLI: Azure CLI for data plane operations is now available for Data Lake Storage.
  • Notifications via Azure Event Grid: You can now receive Blob storage notifications through Event Grid.

Ecosystem partners
  • Azure Stream Analytics: Azure Stream Analytics now writes to, as well as reads from, Data Lake Storage.
  • Azure Event Hubs capture: The capture feature within Azure Event Hubs now lets you pick Data Lake Storage as one of its destinations.
  • IoT Hub: IoT Hub message routing now allows routing to Azure Data Lake Storage Gen2.
  • Azure Search: You can now index and apply machine learning models to your Data Lake Storage content using Azure Search.
  • Azure Data Box: You can now ingest huge amounts of data from on-premises to Data Lake Storage using Data Box.

Saturday, 9 June 2018

Azure Data Lake Tools for VSCode supports Azure blob storage integration

We are pleased to announce the integration of the VSCode explorer with Azure Blob storage. If you are a data scientist and want to explore the data in your Azure Blob storage, or a developer who wants to access and manage your Azure Blob storage files, please try the Data Lake Explorer blob storage integration. The Data Lake Explorer lets you easily navigate your blob storage and access and manage your blob containers, folders, and files.

Summary of new features


◈ Blob container - Refresh, Delete Blob Container and Upload Blob


◈ Folder in blob - Refresh and Upload Blob 


◈ File in blob - Preview/Edit, Download, Delete, Create EXTRACT Script (only available for CSV, TSV and TXT files), as well as Copy Relative Path, and Copy Full Path


How to install or update


Install Visual Studio Code and download Mono 4.2.x (for Linux and Mac). Then get the latest Azure Data Lake Tools by going to the VSCode Extension repository or the VSCode Marketplace and searching for Azure Data Lake Tools.


Thursday, 24 May 2018

Control Azure Data Lake costs using Log Analytics to create service alerts

Azure Data Lake customers use the Data Lake Store and Data Lake Analytics to store and run complex analytics on massive amounts of data. However, it is challenging to manage costs, keep up to date with activity in the accounts, and know proactively when usage is nearing certain thresholds. Using Log Analytics with Azure Data Lake, we can address these challenges and know when costs are increasing or when certain activities take place.


In this post, you will learn how to use Log Analytics with your Data Lake accounts to create alerts that can notify you of Data Lake activity events and when certain usage thresholds are reached. It is easy to get started!

Step 1: Connect Azure Data Lake and Log Analytics


Data Lake accounts can be configured to generate diagnostics logs, some of which are automatically generated (e.g. regular Data Lake operations such as reporting current usage, or whenever a job completes). Others are generated based on requests (e.g. when a new file is created, opened, or when a job is submitted). Both Data Lake Analytics and Data Lake Store can be configured to send these diagnostics logs to a Log Analytics account where we can query the logs and create alerts based on the query results.
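The diagnostic setting itself can be created in the portal, but for illustration, here is a hedged sketch of routing Data Lake Store diagnostics logs to a Log Analytics workspace with the azure-mgmt-monitor Python SDK. The resource IDs, log category names, and the exact parameter shape are assumptions and may differ from your environment or SDK version.

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Placeholder resource ID of the Data Lake Store account (assumption).
adls_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.DataLakeStore/accounts/<adls-account>"
)

# Send the "Audit" and "Requests" log categories to a Log Analytics workspace.
client.diagnostic_settings.create_or_update(
    resource_uri=adls_resource_id,
    name="send-to-log-analytics",
    parameters={
        "workspace_id": "/subscriptions/<subscription-id>/resourceGroups/<rg>"
                        "/providers/Microsoft.OperationalInsights/workspaces/<workspace>",
        "logs": [
            {"category": "Audit", "enabled": True},
            {"category": "Requests", "enabled": True},
        ],
    },
)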

Step 2: Create a query that can identify a specific event or aggregated threshold


Specific key questions about the state or usage of your Azure Data Lake account can be generally answered with a query that parses usage or metric logs. To query the logs in Log Analytics, in the account home (OMS Workspace), click on Log Search.


In the Log Search blade, you can start typing queries using the Log Analytics Query Language.


There are two main types of queries that can be used in Log Analytics to configure alerts:

◈ Queries that return individual events; these show a single entry per row (e.g. every time a file is opened).
◈ Queries that aggregate values or metrics over a specific window of time to act as a threshold, either by counting single events (e.g. 10 files opened in the past five minutes) or by aggregating the values of a metric (e.g. total AUs assigned to jobs).

Here are some sample queries; the first two return events, while the third aggregates values:

◈ This query returns a new entry every time a new Data Lake Store folder is created in the specified Azure Data Lake Store (ADLS) account:

AzureDiagnostics
| where Category == "Requests"
| where ResourceProvider == "MICROSOFT.DATALAKESTORE"
| where Resource == "[Your ADLS Account Name]"
| where OperationName == "mkdirs"

◈ This query returns a new entry every time a job fails in any of the Data Lake Analytics accounts configured to the Log Analytics workspace:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATALAKEANALYTICS"
| where OperationName == "JobEnded"
| where ResultType == "CompletedFailure"

◈ This query returns a list of jobs submitted by users in a 24-hour interval, including the user account and the count of jobs submitted by that user in the interval:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATALAKEANALYTICS"
| where OperationName == "SubmitJob"
| summarize AggregatedValue = count() by bin(TimeGenerated, 24h), identity_s


Queries like these will be used in the next step when configuring alerts.
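If you want to prototype one of these queries outside the portal before wiring it into an alert, here is a hedged sketch using the azure-monitor-query Python SDK (a newer SDK than this post). The workspace ID is a placeholder.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# The failed-jobs query from above, run against the last 24 hours.
query = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATALAKEANALYTICS"
| where OperationName == "JobEnded"
| where ResultType == "CompletedFailure"
"""

response = client.query_workspace("<workspace-id>", query, timespan=timedelta(hours=24))
for table in response.tables:
    for row in table.rows:
        print(row)  # each row is one failed job event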


Step 3: Create an alert to be notified when the event is detected or when the threshold is reached

Using a query such as those shown in the previous step, Log Analytics can be used to create an alert that will notify users via e-mail, text message, or webhook when the event is captured or a metric threshold is reached.

Please note that alerts may be slightly delayed; you can read more about the delays and Log Analytics SLAs in Understanding alerts in Log Analytics.


Friday, 5 January 2018

Azure Data Lake Tools integrates with VSCode Data Lake Explorer and Azure Account

If you are a data scientist who wants to explore your data and understand what is being saved and how the folders are organized, or a developer looking for easier navigation inside ADLS, please try the Data Lake Explorer in the VSCode ADL Tools. The VSCode Data Lake Explorer enhances your Azure sign-in experience, empowers you to manage your ADLA metadata in a tree-like hierarchy, and enables easier file exploration for ADLS resources under your Azure subscriptions. You can also preview, delete, download, and upload files through the contextual menu. With the integration of the VSCode explorer, you can choose your preferred way to manage your U-SQL databases and your ADLS storage accounts in addition to the existing ADLA and ADLS commands.

If you have difficulty signing in to Azure and are looking for a simpler sign-in process, the Azure Data Lake Tools integration with the VSCode Azure Account extension enables automatic sign-in and greatly improves the Azure integration experience. If you are an Azure multi-tenant user, the integration with the Azure Account extension unblocks you and empowers you to navigate your Azure subscription resources across tenants.

If your source code is in GitHub, a new command, ADL: Set Git Ignore, has been added to automatically exclude system-generated files and folders from your GitHub source repository.

Key Customer Benefits


◉ Support for Azure auto sign-in and an improved sign-in experience via integration with the Azure Account extension.
◉ Multi-tenant support that allows you to manage your Azure subscription resources across tenants.
◉ Browse ADLA metadata and view metadata schemas while authoring U-SQL.
◉ Create and delete your U-SQL database objects at any time in a tree-like explorer.
◉ Navigate across ADLS storage accounts for file exploration, file preview, file download, file/folder delete, and file/folder upload in a tree-like explorer.
◉ Exclude system-generated files and folders from your GitHub repository through a command.

Summary of new features


◉ Azure Data Lake Analytics integration with Data Lake Explorer


◉ Azure Data Lake Storage integration with Data Lake Explorer 


◉ Set Git Ignore file


How to install or update


Install Visual Studio Code and download Mono 4.2.x (for Linux and Mac). Then get the latest Azure Data Lake Tools by going to the VSCode Extension repository or the VSCode Marketplace and searching for Azure Data Lake Tools.


Wednesday, 6 December 2017

ADL Tools for Visual Studio Code (VSCode) supports Python & R Programming

We are thrilled to introduce support for Azure Data Lake (ADL) Python and R extensions within Visual Studio Code (VSCode). This means you can easily add Python or R scripts as custom code extensions in U-SQL scripts, and submit such scripts directly to ADL with one click. For data scientists who value the productivity of Python and R, ADL Tools for VSCode offers a fast and powerful code editing solution. VSCode makes it simple to get started and provides easy integration with U-SQL for data extract, data processing, and data output.

With ADL Tools for VSCode, you can choose your preferred language and use already familiar techniques to build your custom code. For example, developers using Python can now use REFERENCE ASSEMBLY to bring in the needed Python libraries and leverage built-in reducers to run Python code on each job execution vertex. You can also embed your Python code, which accepts a pandas DataFrame as input and returns a pandas DataFrame as output, into your U-SQL script. For data scientists using R, you can perform massively parallel execution of R code for data science scenarios such as merging various data files, parallel feature engineering, partitioned data model building, and so on. To facilitate code clarity and reuse, the tools also allow you to write code-behind files in different languages for a U-SQL script.
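As a hedged sketch of what such a Python code-behind file might look like (for example, a generated xxx.usql.py), the U-SQL Python extension conventionally invokes a function named usqlml_main that receives and returns a pandas DataFrame on each vertex. The column names below are placeholders, not taken from the post.

import pandas as pd


def usqlml_main(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only completed sessions and add a derived column; this runs in
    # parallel on each partition of rows the U-SQL reducer hands to Python.
    out = df[df["status"] == "completed"].copy()
    out["duration_minutes"] = out["duration_seconds"] / 60.0
    return out[["session_id", "user_id", "duration_minutes"]]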

Key customer benefits


◉ Local editor authoring and execution experience for Python Code-Behind to support distributed analytics.
◉ Local editor authoring and execution experience for R Code-Behind to support distributed analytics.
◉ Flexible mechanism to allow you to write single or multiple Python, R, and C# Code-Behind as part of a single U-SQL file.
◉ Dynamic Code-Behind to embed Python and R script into your U-SQL script.
◉ Integration with Azure Data Lake for Python and R with easy U-SQL job submissions.

How to develop U-SQL with Python and R


◉ Right-click the U-SQL script file, select ADL: Generate Python Code Behind File, and a xxx.usql.py file is generated in your working folder. Then write your Python code.


◉ Right-click the U-SQL script file, select ADL: Generate R Code Behind File, and a xxx.usql.r file is generated in your working folder. Then write your R code. 


How to install or update


First, install Visual Studio Code and download Mono 4.2.x (for Linux and Mac). Then get the latest Azure Data Lake Tools by going to the VSCode Extension repository or the VSCode Marketplace and searching for “Azure Data Lake Tools”.


Second, please complete the one-time setup to register the Python and R extension assemblies for your ADL account.