
Tuesday, 11 June 2024

Azure Databricks: Differentiated synergy

Companies have long collected data from many sources, which led to data lakes for storing data at scale. Data lakes, however, lacked critical capabilities such as data quality enforcement. The Lakehouse architecture emerged to address the limitations of both data warehouses and data lakes: it is a robust framework for enterprise data infrastructure, with Delta Lake, which has gained broad popularity, as its storage layer. Databricks pioneered the Data Lakehouse, an integral component of its Data Intelligence Platform, and it is available as a fully managed first-party Data and AI solution on Microsoft Azure as Azure Databricks, making Azure the optimal cloud for running Databricks workloads. This blog post discusses the key advantages of Azure Databricks in detail:

1. Seamless integration with Azure.
2. Regional availability and performance.
3. Security and compliance.
4. Unique partnership: Microsoft and Databricks.

1. Seamless integration with Azure 


Azure Databricks is a first-party service on Microsoft Azure, offering native integration with vital Azure Services and workloads that add value, allowing for rapid onboarding onto a Databricks workspace with just a few clicks.

Native integration—as a first party service 


◉ Microsoft Entra ID (formerly Azure Active Directory): Azure Databricks integrates with Microsoft Entra ID, enabling managed access control and authentication out of the box. Engineering teams at Microsoft and Databricks built this integration natively into Azure Databricks, so customers don’t have to build it themselves.

◉ Azure Data Lake Storage (ADLS Gen2): Databricks can directly read and write data in ADLS Gen2, which has been collaboratively optimized for the fastest possible data access, enabling efficient data processing and analytics (see the sketch after this list). The integration of Azure Databricks with Azure Storage platforms such as Data Lake and Blob Storage provides a more streamlined experience for data workloads.

◉ Azure Monitor and Log Analytics: Azure Databricks clusters and jobs can be monitored with Azure Monitor, and insights can be gained through Log Analytics.

◉ Databricks extension for VS Code: The Databricks extension for Visual Studio Code is specifically designed to work with Azure Databricks, providing a direct connection between the local development environment and an Azure Databricks workspace.
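
As an illustration of the ADLS Gen2 integration above, here is a minimal sketch of reading a container directly from a notebook. The storage account, container, path, and column names are hypothetical, and access is assumed to be already configured (for example through a service principal or a Unity Catalog external location):

// Hypothetical account "mydatalake" and container "raw"; authentication is
// assumed to be configured at the workspace or cluster level.
val sales = spark.read
  .option("header", "true")
  .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/")

// The resulting DataFrame can be processed like any other Spark source.
sales.groupBy("region").count().show()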

Integrated services that deliver value 


◉ Power BI: Power BI is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities. Using Azure Databricks as a data source with Power BI brings the advantages of Azure Databricks performance and technology beyond data scientists and data engineers to all business users. Power BI Desktop can connect to Azure Databricks clusters and Databricks SQL warehouses. Power BI’s strong enterprise semantic modeling and calculation capabilities allow you to define calculations, hierarchies, and other business logic that’s meaningful to customers, and to orchestrate the data flows into the model with the Azure Databricks Lakehouse. You can publish Power BI reports to the Power BI service and enable users to access the underlying Azure Databricks data using single sign-on (SSO), passing along the same Microsoft Entra ID credentials they use to access the report. With a Premium Power BI license, you can use Direct Publish from Azure Databricks to create Power BI datasets from tables and schemas in Unity Catalog, directly from the Azure Databricks UI. Direct Lake mode is a unique feature, currently available with Power BI Premium and Microsoft Fabric capacity (F SKU), that works with Azure Databricks: it enables the analysis of very large data volumes by loading parquet-formatted files directly from a data lake, which is particularly useful for very large models that need low latency and for models with frequent updates at the data source.

◉ Azure Data Factory (ADF): ADF can natively ingest data into the Azure cloud from over 100 different data sources. It also provides graphical data orchestration and monitoring capabilities that are easy to build, configure, deploy, and monitor in production. ADF integrates natively with Azure Databricks via the Azure Databricks linked service and can execute notebook, Java archive (JAR), and Python code activities, enabling organizations to build scalable data orchestration pipelines that ingest data from various data sources and curate that data in the Lakehouse.

◉ Azure OpenAI: Azure Databricks includes built-in tools to support ML workflows, including AI Functions, built-in Databricks SQL functions that let you access Large Language Models (LLMs) directly from SQL (see the sketch after this list). With this launch, customers can quickly experiment with LLMs on their company’s data from within a familiar SQL interface. Once the right LLM prompt has been developed, it can quickly be turned into a production pipeline using existing Databricks tools such as Delta Live Tables or scheduled Jobs.

◉ Microsoft Purview: Microsoft Purview, Azure’s data governance solution, integrates with Azure Databricks Unity Catalog’s catalog, lineage, and policy Application Programming Interfaces (APIs). This allows discovery and request-for-access within Microsoft Purview while keeping Unity Catalog as the operational catalog on Azure Databricks. Microsoft Purview supports metadata sync with Azure Databricks Unity Catalog, covering metastore catalogs, schemas, and tables and views down to the column level. In addition, this integration enables discovering Lakehouse data and bringing its metadata into the Data Map, where you can scan the entire Unity Catalog metastore or only selected catalogs. Combining data governance policies in Microsoft Purview and Databricks Unity Catalog yields a single-pane experience for Data and Analytics Governance in Microsoft Purview.
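
To make the AI Functions item above concrete, here is a minimal sketch of calling a built-in sentiment-analysis AI function from a notebook. The catalog, schema, table, and column names are hypothetical, and the function assumes a Databricks SQL environment that supports AI Functions:

// Hypothetical table of customer reviews; ai_analyze_sentiment is one of the
// built-in AI Functions exposed in Databricks SQL.
val scored = spark.sql("""
  SELECT review_text,
         ai_analyze_sentiment(review_text) AS sentiment
  FROM main.support.customer_reviews
""")
scored.show(5, truncate = false)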

Best of both worlds with Azure Databricks and Microsoft Fabric 


Microsoft Fabric is a unified analytics platform that includes all the data and analytics tools organizations need. It brings together experiences such as Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Intelligence, and Power BI on a shared SaaS foundation, all seamlessly integrated into a single service. Microsoft Fabric comes with OneLake, an open, governed, unified SaaS data lake that serves as a single place to store organizational data. Microsoft Fabric simplifies data access by creating shortcuts in OneLake to files, folders, and tables stored in its native open Delta-Parquet format. These shortcuts let all Microsoft Fabric engines operate on the data without moving or copying it, and without disrupting existing use by the host engines.

For instance, creating a shortcut to Delta Lake tables generated by Azure Databricks lets customers effortlessly serve Lakehouse data to Power BI via Direct Lake mode. Power BI Premium, as a core component of Microsoft Fabric, offers Direct Lake mode to serve data directly from OneLake without querying an Azure Databricks Lakehouse or warehouse endpoint. This eliminates the need to duplicate or import data into a Power BI model and enables blazing-fast performance directly over data in OneLake, as an alternative to serving Power BI via ADLS Gen2. With access to both Azure Databricks and Microsoft Fabric, each built on the Lakehouse architecture, Microsoft Azure customers can work with either or both of these powerful, open, governed Data and AI solutions to get the most from their data, unlike on other public clouds. Together, Azure Databricks and Microsoft Fabric can simplify organizations’ overall data journey through deeper integration across the development pipeline.

2. Regional availability and performance 


Azure provides robust scalability and performance capabilities for Azure Databricks: 

  • Azure Compute optimization for Azure Databricks: Azure offers a variety of compute options, including GPU-enabled instances that accelerate machine learning and deep learning workloads, collaboratively optimized with Databricks engineering. Azure Databricks spins up more than 10 million virtual machines (VMs) a day globally. 
  • Availability: Azure Databricks is currently supported in 43 Azure regions worldwide, and that number keeps growing. 

3. Security and compliance 


All of Azure’s enterprise-grade security and compliance measures apply to Azure Databricks, helping it meet customer requirements: 

  • Azure Security Center: Azure Security Center monitors and protects the Azure Databricks environment against threats. It automatically collects, analyzes, and integrates log data from a variety of Azure resources and shows a prioritized list of security alerts, together with the information needed to quickly investigate a problem and recommendations for remediating an attack. Azure Databricks also provides encryption features for additional control of data.
  • Azure Compliance Certifications: Azure holds industry-leading compliance certifications, ensuring Azure Databricks workloads meet regulatory standards. Azure Databricks is certified under PCI-DSS (Classic) and HIPAA (Databricks SQL Serverless, Model Serving).
  • Azure Confidential Computing (ACC): Available only on Azure, Azure confidential computing with Azure Databricks enables end-to-end data encryption. Azure offers hardware-based Trusted Execution Environments (TEEs), which provide a higher level of security by encrypting data in use, in addition to AMD-based Azure Confidential Virtual Machines (VMs), which provide full VM encryption while minimizing performance impact.
  • Encryption: Azure Databricks supports customer-managed keys from Azure Key Vault and Azure Key Vault Managed HSM (Hardware Security Modules) natively. This feature provides an additional layer of security and control over encrypted data.

4. Unique partnership: Microsoft and Databricks


One of the standout attributes of Azure Databricks is the unique partnership between Databricks and Microsoft. Here’s why it’s special: 

  • Joint engineering: Databricks and Microsoft collaborate on product development, ensuring tight integration and optimized performance. This includes dedicated Microsoft engineering resources for developing Azure Databricks resource providers, workspace, and Azure infrastructure integrations, customer support escalation management, and growing engineering investments in Azure Databricks. 
  • Service operation and support: As a first-party offering, Azure Databricks is available exclusively in the Azure portal, simplifying deployment and management for customers. Azure Databricks is managed by Microsoft and covered by Microsoft support contracts, subject to the same SLAs and security policies as other Azure services, ensuring quick resolution of support tickets in collaboration with Databricks support teams as needed. 
  • Unified billing: Azure provides a unified billing experience, allowing customers to manage Azure Databricks costs transparently alongside other Azure services. 
  • Go-To-Market and marketing: Co-marketing, GTM collaboration, and co-sell activities between the two organizations, including events, funding programs, marketing campaigns, joint customer testimonials, account planning, and much more, provide elevated customer care and support throughout the data journey. 
  • Commercial: Large strategic enterprises generally prefer dealing directly with Microsoft for sales offers, technical support, and partner enablement for Azure Databricks. In addition to Databricks sales teams, Microsoft has a global footprint of dedicated sales, business development, and planning coverage for Azure Databricks, meeting the unique needs of all customers.

Let Azure Databricks help boost your productivity


Choosing the right data analytics platform is crucial. Azure Databricks, a powerful data analytics and AI platform, offers a well-integrated, managed, and secure environment for data professionals, resulting in increased productivity, cost savings, and ROI. With Azure’s global presence, integration of workloads, security, compliance, and a unique partnership with Microsoft, Azure Databricks is a compelling choice for organizations seeking efficiency, innovation, and intelligence from their data estate.

Source: microsoft.com

Saturday, 25 December 2021

What is Databricks Data Science & Engineering?

Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based on Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.


For a big data pipeline, data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using Spark.
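
As a rough sketch of that workflow, the snippet below reads raw events from Blob Storage and a reference table from Azure SQL Data Warehouse, then joins them in Spark. All account, database, table, and column names are hypothetical, and storage credentials are assumed to be configured on the cluster:

// Raw JSON events landed in Blob Storage (hypothetical account and container).
val events = spark.read
  .json("wasbs://landing@mystorageacct.blob.core.windows.net/events/")

// A curated dimension table read through the Azure SQL Data Warehouse connector.
val customers = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")
  .option("tempDir", "wasbs://temp@mystorageacct.blob.core.windows.net/tmp")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.Customers")
  .load()

// Join the two sources and surface an aggregate insight.
events.join(customers, Seq("customerId"))
  .groupBy("segment")
  .count()
  .show()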


Apache Spark analytics platform


Databricks Data Science & Engineering comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Databricks Data Science & Engineering includes the following components:


◉ Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in R/Python (see the example after this list).

◉ Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.

◉ MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

◉ GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.

◉ Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
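
A minimal sketch of the Spark SQL and DataFrames point above, showing the same query written against the DataFrame API and as SQL (the data is made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()
import spark.implicits._

// A DataFrame is a distributed collection organized into named columns.
val people = Seq(("Alice", 34), ("Bob", 41)).toDF("name", "age")

// DataFrame API ...
people.filter($"age" > 35).show()

// ... and the equivalent Spark SQL over the same data.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()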

Apache Spark in Azure Databricks


Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

◉ Fully managed Spark clusters

◉ An interactive workspace for exploration and visualization

◉ A platform for powering your favorite Spark applications

Fully managed Apache Spark clusters in the cloud


Azure Databricks has a secure and reliable production environment in the cloud, managed and supported by Spark experts. You can:

◉ Create clusters in seconds.

◉ Dynamically autoscale clusters up and down and share them across teams.

◉ Use clusters programmatically by invoking REST APIs (see the sketch after this list).

◉ Use secure data integration capabilities built on top of Spark that enable you to unify your data without centralization.

◉ Get instant access to the latest Apache Spark features with each release.
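
As a sketch of the REST API item above, the following standalone Scala program lists the clusters in a workspace using the Clusters API. The workspace URL and personal access token are assumed to be supplied through environment variables:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ListClusters {
  def main(args: Array[String]): Unit = {
    // e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net and a PAT token.
    val host  = sys.env("DATABRICKS_HOST")
    val token = sys.env("DATABRICKS_TOKEN")

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$host/api/2.0/clusters/list"))
      .header("Authorization", s"Bearer $token")
      .GET()
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    println(response.body()) // JSON listing of the workspace's clusters
  }
}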

Databricks Runtime


Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.

Azure Databricks completely abstracts out the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure.

For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant through various optimizations at the I/O and processing layers (Databricks I/O).

Workspace for collaboration


Through a collaborative and integrated environment, Databricks Data Science & Engineering streamlines the process of exploring data, prototyping, and running data-driven applications in Spark.

◉ Determine how to use data with easy data exploration.

◉ Document your progress in notebooks in R, Python, Scala, or SQL.

◉ Visualize data in a few clicks, and use familiar tools like Matplotlib, ggplot, or d3.

◉ Use interactive dashboards to create dynamic reports.

◉ Use Spark and interact with the data simultaneously.

Enterprise security


Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.

◉ Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure Databricks.

◉ Azure Databricks role-based access enables fine-grained user permissions for notebooks, clusters, jobs, and data.

◉ Enterprise-grade SLAs.

Integration with Azure services


Databricks Data Science & Engineering integrates deeply with Azure databases and stores: Synapse Analytics, Cosmos DB, Data Lake Store, and Blob storage.

Integration with Power BI


Through rich integration with Power BI, Databricks Data Science & Engineering allows you to discover and share your impactful insights quickly and easily. You can use other BI tools as well, such as Tableau Software.

Source: microsoft.com

Friday, 8 March 2019

Azure Databricks – New capabilities at lower cost

Azure Databricks provides a fast, easy, and collaborative Apache Spark™-based analytics platform to accelerate and simplify the process of building big data and AI solutions backed by industry leading SLAs.

With Azure Databricks, customers can set up an optimized Apache Spark environment in minutes. Data scientists and data engineers can collaborate using an interactive workspace with languages and tools of their choice. Native integration with Azure Active Directory (Azure AD) and other Azure services enables customers to build end-to-end modern data warehouse, machine learning and real-time analytics solutions.

We have seen tremendous adoption of Azure Databricks and today we are excited to announce new capabilities that we are bringing to market.

General availability of Data Engineering Light


Customers can now get started with Azure Databricks using a new low-priced workload called Data Engineering Light, which enables them to run batch applications on managed Apache Spark. It is meant for simple, non-critical workloads that don’t need the performance, autoscaling, and other benefits provided by the Data Engineering and Data Analytics workloads.

Additionally, we have reduced the price of the Data Engineering workload across both the Standard and Premium SKUs. Both SKUs are now available at up to 25 percent lower cost.

Preview of managed MLflow


MLflow is an open source framework for managing the machine learning lifecycle. With managed MLflow, customers can access it natively from their Azure Databricks environment and leverage Azure Active Directory for authentication. With managed MLflow on Azure Databricks customers can:

◈ Track experiments by automatically recording parameters, results, code, and data to an out-of-the-box hosted MLflow tracking server. Runs can now be organized into experiments from within Azure Databricks, and results can be queried from within Azure Databricks notebooks to identify the best performing models (a rough Scala sketch follows this list). 

◈ Package machine learning code and dependencies locally in a reproducible project format and execute remotely on a Databricks cluster.

◈ Quickly deploy models to production.
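
MLflow tracking is most commonly driven from Python, but as a rough sketch in Scala, the MLflow Java client can log runs to the hosted tracking server as well. The experiment path is hypothetical, and the mlflow-client artifact is assumed to be on the classpath:

import org.mlflow.tracking.MlflowClient

// Inside Azure Databricks the client can pick up the workspace tracking
// server; outside it, set MLFLOW_TRACKING_URI.
val client = new MlflowClient()

val experimentId = client.createExperiment("/Users/me@example.com/demo")
val run = client.createRun(experimentId)

client.logParam(run.getRunId, "alpha", "0.5")
client.logMetric(run.getRunId, "rmse", 0.72)
client.setTerminated(run.getRunId)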

Machine learning on Azure with Azure Machine Learning and Azure Databricks


Since the general availability of Azure Machine Learning service (AML) in December 2018, and its integration with Azure Databricks, we have received overwhelmingly positive feedback from customers who are using this combination to accelerate machine learning on big data. Azure Machine Learning complements the Azure Databricks experience by:

◈ Unlocking advanced automated machine learning capability which enables data scientists of all skill levels to identify suitable algorithms and hyperparameters faster.

◈ Enabling DevOps for machine learning for simplified management, monitoring, and updating of machine learning models.

◈ Deploying models to the cloud and the edge.

◈ Providing a central registry for experiments, machine learning pipelines, and models that are being created across the organization.

The combination of Azure Databricks and Azure Machine Learning makes Azure the best cloud for machine learning. Customers benefit from an optimized, autoscaling Apache Spark-based environment, an interactive collaborative workspace, automated machine learning, and end-to-end machine learning lifecycle management.


Tuesday, 26 June 2018

Structured streaming with Azure Databricks into Power BI & Cosmos DB

In this blog we’ll discuss the concept of Structured Streaming and how a data ingestion path can be built using Azure Databricks to enable the streaming of data in near-real-time. We’ll touch on some of the analysis capabilities that can be called directly from within Databricks utilising the Text Analytics API, and also discuss how Databricks can be connected directly into Power BI for further analysis and reporting. As a final step we cover how streamed data can be sent from Databricks to Cosmos DB as the persistent storage.

Structured streaming is a stream processing engine which allows computation to be expressed over streaming data (e.g. a Twitter feed) in much the same way a batch computation is executed on a static dataset. Computation is performed incrementally via the Spark SQL engine, which updates the result as a continuous process as the streaming data flows in.

[Architecture diagram: Twitter data flows through Event Hubs into Azure Databricks, which calls the Text Analytics API and sends results to Power BI and Cosmos DB]

The above architecture illustrates a possible flow in which Databricks is used directly as an ingestion path to stream data from Twitter (via Event Hubs, acting as a buffer), call the Text Analytics API in Cognitive Services to apply intelligence to the data, and then send the data directly to Power BI and Cosmos DB.

The concept of structured streaming


All data arriving from the data stream is treated as an unbounded input table. Each new piece of data in the stream appends a new row to the unbounded input table. The entirety of the input isn’t stored, but the end result is equivalent to retaining the entire input and executing a batch job.


The input table allows us to define a query on itself, just as if it were a static table, which will compute a final result table written to an output sink. This batch-like query is automatically converted by Spark into a streaming execution plan via a process called incremental execution.

Incremental execution is where Spark natively calculates the state required to update the result every time a record arrives. We are able to utilize built in triggers to specify when to update the results. For each trigger that fires, Spark looks for new data within the input table and updates the result on an incremental basis.

Queries on the input table generate the result table. For every trigger interval (e.g. every three seconds) new rows are appended to the input table, which, through the process of incremental execution, update the result table. Each time the result table is updated, the changed results are written as an output.


The output defines what gets written to external storage, whether directly into the Databricks file system or, in our example, Cosmos DB.

To implement this within Azure Databricks, the incoming stream function is called to initiate the StreamingDataFrame based on a given input (in this example, Twitter data). The stream is then processed and written in parquet format to internal Databricks file storage, as shown in the code snippet below:

// `incomingStream` is the streaming DataFrame read from Event Hubs earlier in
// the pipeline; `toSentiment` is a UDF that calls the Text Analytics API.
val streamingDataFrame = incomingStream
  .selectExpr("cast (body as string) AS Content")
  .withColumn("body", toSentiment($"Content"))

import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// Write the processed stream as parquet, firing on a fixed trigger interval
// (e.g. every three seconds, as described above).
val result = streamingDataFrame
  .writeStream
  .format("parquet")
  .option("path", "/mnt/Data")
  .option("checkpointLocation", "/mnt/sample/check")
  .trigger(ProcessingTime("3 seconds"))
  .start()


Mounting file systems within Databricks (CosmosDB)


Several different file systems can be mounted directly within Databricks such as Blob Storage, Data Lake Store and even SQL Data Warehouse. In this blog we’ll explore the connectivity capabilities between Databricks and Cosmos DB.

Fast connectivity between Apache Spark and Azure Cosmos DB accelerates the ability to solve fast-moving data science problems, where data can be quickly persisted and retrieved using Azure Cosmos DB. With the Spark to Cosmos DB connector, it’s possible to solve IoT scenarios, update columns when performing analytics, push down predicate filtering, and perform advanced analytics on fast-changing data in a geo-replicated managed document store with guaranteed SLAs for consistency, availability, low latency, and throughput.


◈ From within Databricks, a connection is made from the Spark master node to the Cosmos DB gateway node to get the partition information from Cosmos DB.
◈ The partition information is translated back to the Spark master node and distributed amongst the worker nodes.
◈ This allows the Spark worker nodes to interact directly with the Cosmos DB partitions when a query comes in, extracting the data that is needed and bringing it back to the Spark partitions within the Spark worker nodes.

Communication between Spark and Cosmos DB is significantly faster because the data movement is between the Spark worker nodes and the Cosmos DB data nodes.

Using the Azure Cosmos DB Spark connector (currently in preview) it is possible to connect directly into a Cosmos DB storage account from within Databricks, enabling Cosmos DB to act as an input source or output sink for Spark jobs as shown in the code snippet below:

import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config

// Connector settings are key -> value pairs; the values are placeholders.
val writeConfig = Config(Map(
  "Endpoint"         -> "https://<account>.documents.azure.com:443/",
  "Masterkey"        -> "<master-key>",
  "Database"         -> "<database>",
  "PreferredRegions" -> "West US;East US;",
  "Collection"       -> "<collection>",
  "WritingBatchSize" -> "100"))

import org.apache.spark.sql.SaveMode
sentimentdata.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
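
The reverse direction works the same way. A minimal sketch of reading the collection back into a DataFrame with the same connector (the implicit cosmosDB reader comes from the connector's schema package; the placeholder values mirror those above):

import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.config.Config

val readConfig = Config(Map(
  "Endpoint"   -> "https://<account>.documents.azure.com:443/",
  "Masterkey"  -> "<master-key>",
  "Database"   -> "<database>",
  "Collection" -> "<collection>"))

// Cosmos DB now acts as an input source for the Spark job.
val sentimentBack = spark.read.cosmosDB(readConfig)
sentimentBack.show()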

Connecting Databricks to Power BI


Microsoft Power BI is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities, enabling end users to create reports and dashboards by themselves without having to depend on information technology staff or database administrators.

Azure Databricks can be used as a direct data source with Power BI, which enables the performance and technology advantages of Azure Databricks to be brought beyond data scientists and data engineers to all business users.

Power BI Desktop can be connected directly to an Azure Databricks cluster using the built-in Spark connector (currently in preview). The connector enables the use of DirectQuery to offload processing to Databricks, which is great when you have a large amount of data that you don’t want to load into Power BI, or when you want to perform near real-time analysis as discussed throughout this blog post.


This connector utilises a JDBC/ODBC connection via DirectQuery, enabling a live connection into the mounted file store for the streaming data entering via Databricks. From Databricks we can set a schedule (e.g. every 5 seconds) to write the streamed data into the file store, and from Power BI pull this down regularly to obtain a near-real-time stream of data.

From within Power BI, various analytics and visualisations can be applied to the streamed dataset bringing it to life!


Want to have a go at building this architecture out? For more examples of Databricks, see the official Azure documentation.

Friday, 13 April 2018

Three common analytics use cases with Microsoft Azure Databricks

Data science and machine learning can be applied to solve many common business scenarios, yet there are many barriers preventing organizations from adopting them. Collaboration between data scientists, data engineers, and business analysts, and curating structured and unstructured data from disparate sources, are two examples of such barriers - and we haven’t even gotten to the complexity involved when trying to do these things with large volumes of data.

Recommendation engines, churn analysis, and intrusion detection are common scenarios that many organizations are solving across multiple industries. They require machine learning, streaming analytics, and utilize massive amounts of data processing that can be difficult to scale without the right tools. Companies like Lennox International, E.ON, and renewables.AI are just a few examples of organizations that have deployed Apache Spark™ to solve these challenges using Microsoft Azure Databricks.

Your company can enable data science with high-performance analytics too. Designed in collaboration with the original creators of Apache Spark, Azure Databricks is a fast, easy, and collaborative Apache Spark™-based analytics platform optimized for Azure. Azure Databricks is integrated with Azure through one-click setup and provides streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure Blob Storage, Azure Data Factory, Azure Data Lake Store, Azure SQL Data Warehouse, and Azure Cosmos DB allows organizations to use Azure Databricks to clean, join, and aggregate data no matter where it sits.

Learn how your organization can improve and scale your analytics solutions with Azure Databricks, a high-performance processing engine optimized for Azure. Now is the perfect time to get started. Not sure how? Sign up for our webinar on April 12, 2018 and we’ll walk you through the benefits of Spark on Azure, and how to get started with Azure Databricks.

Get started with Azure Databricks today!

Recommendation engine



As mobile apps and other advances in technology continue to change the way users choose and utilize information, recommendation engines are becoming an integral part of applications and software products.

Churn analysis



Churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers. Predicting and preventing customer churn is vital to a range of businesses.

Intrusion detection



Intrusion detection monitors network or system activities for malicious activity or policy violations and produces electronic reports for a management station.

Tuesday, 3 April 2018

Ingest, prepare, and transform using Azure Databricks and Data Factory

Today’s business managers depend heavily on reliable data integration systems that run complex ETL/ELT workflows (extract, transform, and load / extract, load, and transform). These workflows allow businesses to ingest data in various forms and shapes from different on-prem/cloud data sources, transform/shape the data, and gain actionable insights to make important business decisions.

With the general availability of Azure Databricks comes support for doing ETL/ELT with Azure Data Factory. This integration allows you to operationalize ETL/ELT workflows (including analytics workloads in Azure Databricks) using data factory pipelines that do the following:

1. Ingest data at scale using 70+ on-prem/cloud data sources

2. Prepare and transform (clean, sort, merge, join, etc.) the ingested data in Azure Databricks as a Notebook activity step in data factory pipelines (see the notebook sketch after this list)

3. Monitor and manage your E2E workflow.
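
As a sketch of step 2, the notebook that the Data Factory Notebook activity runs can receive pipeline parameters as widgets. The parameter and table names below are hypothetical:

// ADF base parameters arrive in the notebook as widgets.
val inputPath   = dbutils.widgets.get("inputPath")   // folder ADF copied data into
val outputTable = dbutils.widgets.get("outputTable") // curated destination table

// Prepare and transform: clean, deduplicate, and persist the ingested data.
val raw     = spark.read.option("header", "true").csv(inputPath)
val cleaned = raw.na.drop().dropDuplicates()

cleaned.write.mode("overwrite").saveAsTable(outputTable)

// Return a value that the data factory pipeline can inspect.
dbutils.notebook.exit("rows=" + cleaned.count())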


Take a look at a sample data factory pipeline that ingests data from Amazon S3 to Azure Blob storage, processes the ingested data using a notebook running in Azure Databricks, and moves the processed data into Azure SQL Data Warehouse.


You can parameterize the entire workflow (folder name, file name, etc.) using rich expression support and operationalize by defining a trigger in data factory.

Get started today!


We are excited for you to try Azure Databricks and Azure Data Factory integration and let us know your feedback.

Get started by clicking the Author & Monitor tile in your provisioned v2 data factory blade.


Click on the Transform data with Azure Databricks tutorial and learn step by step how to operationalize your ETL/ELT workloads including analytics workloads in Azure Databricks using Azure Data Factory.


Sunday, 19 November 2017

A technical overview of Azure Databricks

We introduced Azure Databricks, an exciting new service in preview that brings together the best of the Apache Spark analytics platform and Azure cloud. As a close partnership between Databricks and Microsoft, Azure Databricks brings unique benefits not present in other cloud platforms. This blog post introduces the technology and new capabilities available for data scientists, data engineers, and business decision-makers using the power of Databricks on Azure.

Apache Spark + Databricks + enterprise cloud = Azure Databricks


Once you manage data at scale in the cloud, you open up massive possibilities for predictive analytics, AI, and real-time applications. Over the past five years, the platform of choice for building these applications has been Apache Spark. With a massive community at thousands of enterprises worldwide, Spark makes it possible to run powerful analytics algorithms at scale and in real time to drive business insights. However, managing and deploying Spark at scale has remained challenging, especially for enterprise use cases with large numbers of users and strong security requirements.

Enter Databricks. Founded in 2013 by the team that started the Spark project, Databricks provides an end-to-end, managed Apache Spark platform optimized for the cloud. Featuring one-click deployment, autoscaling, and an optimized Databricks Runtime that can improve the performance of Spark jobs in the cloud by 10-100x, Databricks makes it simple and cost-efficient to run large-scale Spark workloads. Moreover, Databricks includes an interactive notebook environment, monitoring tools, and security controls that make it easy to leverage Spark in enterprises with thousands of users.

In Azure Databricks, we have gone one step beyond the base Databricks platform by integrating closely with Azure services through collaboration between Databricks and Microsoft. Azure Databricks features optimized connectors to Azure storage platforms (e.g. Data Lake and Blob Storage) for the fastest possible data access, and one-click management directly from the Azure console. This is the first time that an Apache Spark platform provider has partnered closely with a cloud provider to optimize data analytics workloads from the ground up.

Benefits for data engineers and data scientists


Why is Azure Databricks so useful for data scientists and engineers? Let’s look at some ways:

Optimized environment

Azure Databricks is optimized from the ground up for performance and cost-efficiency in the cloud. The Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs by as much as 10-100x when running on Azure, including:

1. High-speed connectors to Azure storage services, such as Azure Blob Store and Azure Data Lake, developed together with the Microsoft teams behind these services.

2. Auto-scaling and auto-termination for Spark clusters to automatically minimize costs.

3. Performance optimizations including caching, indexing, and advanced query optimization, which can improve performance by as much as 10-100x over traditional Apache Spark deployments in cloud or on-premise environments.

Seamless collaboration

Remember the jump in productivity when documents became truly multi-editable? Why can’t we have that for data engineering and data science? Azure Databricks brings exactly that. Notebooks on Databricks are live and shared, with real-time collaboration, so that everyone in your organization can work with your data. Dashboards enable business users to call an existing job with new parameters. Databricks also integrates closely with Power BI for interactive visualization. All this is possible because Azure Databricks is backed by Azure Database and other technologies that enable highly concurrent access, fast performance, and geo-replication.

Easy to use

Azure Databricks comes packaged with interactive notebooks that let you connect to common data sources, run machine learning algorithms, and learn the basics of Apache Spark to get started quickly. It also features an integrated debugging environment to let you analyze the progress of your Spark jobs from within interactive notebooks, and powerful tools to analyze past jobs. Finally, other common analytics libraries, such as the Python and R data science stacks, are preinstalled so that you can use them with Spark to derive insights. We really believe that big data can become 10x easier to use, and we are continuing the philosophy started in Apache Spark to provide a unified, end-to-end platform.

Architecture of Azure Databricks

So how is Azure Databricks put together? At a high level, the service launches and manages worker nodes in each Azure customer's subscription, letting customers leverage existing management tools within their account.

Specifically, when a customer launches a cluster via Databricks, a "Databricks appliance" is deployed as an Azure resource in the customer's subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer's subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.


For users, this design means two things. First, they can easily connect Azure Databricks to any storage resource in their account, e.g., an existing Blob Store subscription or Data Lake. Second, Databricks is managed centrally from the Azure control center, requiring no additional setup.

Total Azure integration


We are integrating Azure Databricks closely with all features of the Azure platform in order to provide the best of the platform to users. Here are some pieces we’ve done so far:

◉ Diversity of VM types: Customers can use all existing VMs including F-series for machine learning scenarios, M-series for massive memory scenarios, D-series for general purpose, etc.

◉ Security and Privacy: In Azure, ownership and control of data is with the customer. We have built Azure Databricks to adhere to these standards. We aim for Azure Databricks to provide all the compliance certifications that the rest of Azure adheres to.

◉ Flexibility in network topology: Customers have a diversity of network infrastructure needs. Azure Databricks supports deployments in customer VNETs, which can control which sources and sinks can be accessed and how they are accessed.

◉ Azure Storage and Azure Data Lake integration: These storage services are exposed to Databricks users via DBFS to provide caching and optimized analysis over existing data (see the mount sketch after this list).

◉ Azure Power BI: Users can connect Power BI directly to their Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools.

◉ Azure Active Directory provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so naturally AAD can be used to control access to sources, results, and jobs.

◉ Azure SQL Data Warehouse, Azure SQL DB, and Azure CosmosDB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.
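
As a sketch of the storage integration item above, a container can be mounted into DBFS from a notebook. The account, container, and secret scope names are hypothetical:

// Mount a Blob Storage container at /mnt/data; the account key is pulled
// from a (hypothetical) Databricks secret scope rather than hard-coded.
dbutils.fs.mount(
  source = "wasbs://data@mystorageacct.blob.core.windows.net",
  mountPoint = "/mnt/data",
  extraConfigs = Map(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "storage", key = "account-key")))

// The mounted path now behaves like any other DBFS directory.
display(dbutils.fs.ls("/mnt/data"))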

In addition to all the integration you can see, we have worked hard to integrate in ways that you can’t see – but can see the benefits of.

◉ Internally, we use Azure Container Services to run the Azure Databricks control plane and data planes via containers.

◉ Accelerated Networking provides the fastest virtualized network infrastructure in the cloud. Azure Databricks utilizes this to further improve Spark performance.

◉ The latest generation of Azure hardware (Dv3 VMs) comes with NVMe SSDs capable of blazing 100µs I/O latency. These make Databricks I/O performance even better.

We are just scratching the surface though! As the service becomes generally available and moves beyond that, we expect to add continued integrations with other upcoming Azure services.