
Tuesday, 31 December 2019

New in Stream Analytics: Machine Learning, online scaling, custom code, and more

Azure Stream Analytics is a fully managed Platform as a Service (PaaS) that supports thousands of mission-critical customer applications powered by real-time insights. Out-of-the-box integration with numerous other Azure services enables developers and data engineers to build high-performance, hot-path data pipelines within minutes. The key tenets of Stream Analytics are ease of use, developer productivity, and enterprise readiness. Today, we're announcing several new features that further strengthen these tenets. Let's take a closer look at them:

Preview Features


Rollout of these preview features begins November 4th, 2019. Worldwide availability to follow in the weeks after.


Online scaling

In the past, changing the Streaming Units (SUs) allocated to a Stream Analytics job required users to stop and restart the job. This added overhead and latency, even though it was done without any data loss.

With online scaling capability, users will no longer be required to stop their job if they need to change the SU allocation. Users can increase or decrease the SU capacity of a running job without having to stop it. This builds on the customer promise of long-running mission-critical pipelines that Stream Analytics offers today.


Change SUs on a Stream Analytics job while it is running.

C# custom de-serializers

Azure Stream Analytics has always supported input events in JSON, CSV, or AVRO data formats out of the box. However, many IoT devices are programmed to generate data in other formats that encode structured data more efficiently while remaining extensible.

Developers can now leverage the power of Azure Stream Analytics to process data in Protobuf, XML, or any custom format by implementing custom de-serializers in C#, which are then used to de-serialize events received by Azure Stream Analytics.

Extensibility with C# custom code

Azure Stream Analytics traditionally offered SQL language for performing transformations and computations over streams of events. Though there are many powerful built-in functions in the currently supported SQL language, there are instances where a SQL-like language doesn't provide enough flexibility or tooling to tackle complex scenarios.

Developers creating Stream Analytics modules in the cloud or on IoT Edge can now write or reuse custom C# functions and invoke them right in the query through User Defined Functions. This enables scenarios such as complex math calculations, importing custom ML models using ML.NET, and programming custom data imputation logic. A full-fidelity authoring experience for these functions is available in Visual Studio.

Managed Identity authentication with Power BI

Dynamic dashboarding experience with Power BI is one of the key scenarios that Stream Analytics helps operationalize for thousands of customers worldwide.

Azure Stream Analytics now offers full support for Managed Identity based authentication with Power BI for the dynamic dashboarding experience. This helps customers align better with their organizational security goals, deploy their hot-path pipelines using Visual Studio CI/CD tooling, and enables long-running jobs, as users will no longer be required to change passwords every 90 days.

While this new feature will be available immediately, customers will continue to have the option of using the Azure Active Directory user-based authentication model.

Stream Analytics on Azure Stack

Azure Stream Analytics is supported on Azure Stack via the IoT Edge runtime. This enables scenarios where customers are prevented by compliance or other constraints from moving data to the cloud, but still wish to leverage Azure technologies to deliver a hybrid data analytics solution at the edge.

Rolling out as a preview option beginning January 2020, this will offer customers the ability to analyze data ingressed from Event Hubs or IoT Hub on Azure Stack, and egress the results to blob storage or a SQL database on the same Azure Stack.

Debug query steps in Visual Studio

We've heard a lot of user feedback about the challenge of debugging the intermediate row set defined in a WITH statement in Azure Stream Analytics query. Users can now easily preview the intermediate row set on a data diagram when doing local testing in Azure Stream Analytics tools for Visual Studio. This feature can greatly help users to breakdown their query and see the result step-by-step when fixing the code.

Local testing with live data in Visual Studio Code

When developing an Azure Stream Analytics job, developers have expressed a need to connect to live input to visualize the results. This is now available in Azure Stream Analytics tools for Visual Studio Code, a lightweight, free, and cross-platform editor. Developers can test their query against live data on their local machine before submitting the job to Azure. Each testing iteration takes only two to three seconds on average, resulting in a very efficient development process.


Live Data Testing feature in Visual Studio Code

Private preview for Azure Machine Learning


Real-time scoring with custom Machine Learning models

Azure Stream Analytics now supports high-performance, real-time scoring by leveraging custom pre-trained Machine Learning models managed by the Azure Machine Learning service, and hosted in Azure Kubernetes Service (AKS) or Azure Container Instances (ACI), using a workflow that requires users to write absolutely no code.

Users can build custom models with popular Python libraries such as Scikit-learn, PyTorch, TensorFlow, and more, and train them anywhere, including Azure Databricks, Azure Machine Learning Compute, and HDInsight. Once the models are deployed in Azure Kubernetes Service or Azure Container Instances clusters, users can use Azure Stream Analytics to surface all endpoints within the job itself. Users simply navigate to the functions blade within an Azure Stream Analytics job, pick the Azure Machine Learning function option, and tie it to one of the deployments in the Azure Machine Learning workspace.

Advanced configurations, such as the number of parallel requests sent to the Azure Machine Learning endpoint, will be offered to maximize performance.

Saturday, 13 October 2018

Data models within Azure Analysis Services and Power BI

In a world where self-service and speed of delivery are key priorities in any reporting BI solution, the importance of creating a comprehensive and performant data semantic model layer is sometimes overlooked.

I have seen quite a few cases where the relational data store, such as Azure SQL Database or Azure SQL Data Warehouse, is well structured and the reporting tier, whether SQL Server Reporting Services or Power BI, is well presented, but the performance is still not as expected.

Before we drill down to the data semantic model, I always advise that understanding your data and how you want to present and report on it is key. Creating a report that takes the end consumer through a data journey is the difference between a good and a bad report. Report design should take into account who is consuming the report and what they want to achieve from it. For example, if you have a small number of consumers who need to view a lower level of hierarchical data with additional measures or KPIs, then it may not be suitable to visualize this on the first page, as the majority of consumers may want a more aggregated view of the data. Showing the detail first could lead to the first page of the report taking longer to return the data, giving the majority of consumers the perception of a slow-running report. To achieve a better experience, we can take the consumers through a data journey and ensure that the detail level does not impact the higher-level data points.

Setting the correct service level agreements and performance expectations is key. An SLA of less than two seconds may be achievable if the data is two million rows, but would it still be met at two billion rows? There are many factors in understanding what is achievable, and data set size is only one of them. Network, compute, architecture patterns, data hierarchies, measures, KPIs, consumer device, consumer location, and real-time versus batch reporting are other factors that can affect the perception of performance for the end consumer.

However, creating and/or optimizing your data semantic model layer will have a drastic impact on the overall performance.

The basics


The question I sometimes get is: with more computing power and the use of Azure, why can I not just report directly from my data lake or operational SQL Server? The answer is that reporting on data is very different from writing and reading data in an online transaction processing (OLTP) system.

Dimensional modeling, developed by Kimball, has been a proven data warehouse methodology, widely used for more than 20 years. The idea behind dimensional modeling is to enable interactive reporting where consumers can retrieve calculations, aggregate data, and show business KPIs.

Because dimensional models are built as a star or snowflake schema, data can be retrieved in a more performant way: the schema is designed for retrieval and reads rather than the mixed reads and writes commonly associated with an OLTP database design.

In a star or snowflake schema, you have a fact table and many dimension tables. The fact table contains the foreign keys relating it to the dimension tables, along with metrics and KPIs. Each dimension table contains the attributes associated with that particular dimension. An example of this could be a date dimension table that contains month, day, and year as attributes.

Creating a dimensional model in a data warehouse such as SQL Server or Azure SQL Data Warehouse will assist in what you are trying to achieve from a reporting solution. Again, traditional warehouses have been deployed widely over the last 20 years. Even with this approach, creating a data semantic model on top of the warehouse can improve performance, improve concurrency, and even add an additional layer of security between the end consumers and the source warehouse data. SQL Server Analysis Services, and now Azure Analysis Services, have been designed for this purpose. We can essentially serve the required data from the warehouse into a model that can then be consumed as below.

[Diagram: data served from the warehouse into an Analysis Services model and consumed by reporting tools such as SQL Server Reporting Services, Excel, and Power BI]

Traditionally, the architecture was exactly that: ingest data from the warehouse into a cube sitting on an instance of Analysis Services, process the cube, and then serve it to SQL Server Reporting Services, Excel, and more. The landscape of data ingestion has changed over the last five to ten years with the adoption of big data and end consumers wanting to consume a wider range of data sources, whether that is SQL, Spark, CSV, JSON, or others.

Modern BI reporting now needs to ingest, mash up, clean, and then present this data. Therefore, the data semantic layer needs to be agile while still delivering on performance. However, the principles of designing a model that aligns with dimensional modeling are still key.

In a world of data services in Azure, Analysis Services and Power BI are good candidates for building data semantic models on top of a data warehouse dimensional model. The fundamental principles of these services have formed the foundations of Microsoft BI solutions historically, even though they have evolved and now use a modern in-memory architecture and allow agility for self-service.

Power BI and Analysis Services Tabular Model


SQL Server Analysis Services Tabular models, Azure Analysis Services, and Power BI share the same underlying fundamentals and principles. They are all built on the tabular model, which was first released with SQL Server in 2012.

SQL Server Analysis Services Multidimensional is a different architecture and is chosen in the server configuration section at install time.

The Analysis Services Tabular model (also used in Power BI) is built on a columnar in-memory architecture, which forms the VertiPaq engine.


At processing time, the rows of data are converted into columns, encoded, and compressed, allowing more data to be stored. Because the data is stored in memory, analytical reporting delivers high performance compared with retrieving data from disk-based systems. The purpose of in-memory tabular models is to minimize read times for reporting purposes. Again, understanding consumer reporting behavior is key: tabular models are designed to retrieve a small number of columns. The trade-off is that when you retrieve a high number of columns, the engine needs to sort the data back into rows and decompress it, which impacts compute.

Best practices for data modeling


The best practices below are some of the key observations I have seen over the last several years, particularly when creating data semantic models in SQL Server Analysis Services, Azure Analysis Services, or Power BI.

◈ Create a dimensional model, star and/or snowflake, even if you are ingesting data from different sources.

◈ Ensure that you create integer surrogate keys on dimension tables. Natural keys are not best practice and can cause issues if you need to change them at a later date. Natural keys are generally strings, so they are larger in size and can perform poorly when joining to other tables. The key point with regard to tabular model performance is that natural keys are not optimal for compression. When natural keys are used, the process is:

    ◈ The natural keys are encoded using hash/dictionary encoding.

    ◈ The foreign keys on the fact table relating to the dimension table are also hash/dictionary encoded.

    ◈ The relationships are then built.

◈ This has an impact on performance and reduces the proportion of memory available for data, since some of it is needed for the dictionary encoding.

◈ Only bring into the model the integer surrogate keys (which use value encoding) and exclude any natural keys from the dimension tables.

◈ On the fact table, only bring into the model the foreign keys, that is, the integer surrogate keys referencing the dimension tables.

◈ Only bring columns into your model that are required for analysis; this may mean excluding columns that are not needed, or filtering so that only the data being analyzed is brought in.

◈ Reduce cardinality so that the uniqueness of values is reduced, allowing for much greater compression.

◈ Add a date dimension into your model.

◈ Ideally, we should run calculations at the compute layer if possible.

The best practices noted above have all been used, in part or collectively, to improve performance for the consumer experience. Once the data semantic models have been created in line with best practices, performance expectations can be gauged and aligned with SLAs. The key focus of the best practices above is to ensure that we make full use of the VertiPaq in-memory architecture. A large part of this is ensuring that data can be compressed as much as possible, so that we can store more data within the model and also report on the data efficiently.

Tuesday, 26 June 2018

Structured streaming with Azure Databricks into Power BI & Cosmos DB

In this blog we’ll discuss the concept of Structured Streaming and how a data ingestion path can be built using Azure Databricks to enable the streaming of data in near-real-time. We’ll touch on some of the analysis capabilities which can be called from directly within Databricks utilising the Text Analytics API and also discuss how Databricks can be connected directly into Power BI for further analysis and reporting. As a final step we cover how streamed data can be sent from Databricks to Cosmos DB as the persistent storage.

Structured Streaming is a stream processing engine that lets computation be expressed over streaming data (e.g. a Twitter feed) in much the same way that batch computation is executed on a static dataset. Computation is performed incrementally by the Spark SQL engine, which continuously updates the result as the streaming data flows in.

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

The above architecture illustrates a possible flow in which Databricks is used directly as an ingestion path to stream data from Twitter (via Event Hubs acting as a buffer), call the Text Analytics API in Cognitive Services to apply intelligence to the data, and then finally send the data directly to Power BI and Cosmos DB.

The concept of structured streaming


All data arriving from the data stream is treated as an unbounded input table. For each new piece of data in the stream, a new row is appended to the unbounded input table. The entirety of the input isn’t stored, but the end result is equivalent to retaining the entire input and executing a batch job.

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

We can define a query on the input table, just as if it were a static table, which computes a final result table written to an output sink. This batch-like query is automatically converted by Spark into a streaming execution plan via a process called incremental execution.

Incremental execution is where Spark natively calculates the state required to update the result every time a record arrives. We can use built-in triggers to specify when to update the results. Each time a trigger fires, Spark looks for new data within the input table and updates the result incrementally.

Queries on the input table generate the result table. At every trigger interval (e.g. every three seconds), new rows are appended to the input table and, through incremental execution, update the result table. Each time the result table is updated, the changed results are written as output.
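As a minimal, self-contained sketch of this trigger-driven incremental model (using Spark's built-in rate source as a stand-in for a real stream; the sink and query names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// The built-in "rate" source emits rows continuously and stands in for a real stream.
val inputTable = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// The query is written as if inputTable were a static table: count rows per 10-second window.
val counts = inputTable.groupBy(window($"timestamp", "10 seconds")).count()

// Every three seconds Spark checks for new rows and incrementally updates the result table,
// which the memory sink exposes for inspection during development.
val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("windowed_counts")
  .trigger(Trigger.ProcessingTime("3 seconds"))
  .start()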

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

The output defines what gets written to external storage, whether that is directly into the Databricks file system or, in our example, Cosmos DB.

To implement this within Azure Databricks, the incoming stream is used to initiate a streaming DataFrame based on a given input (in this example, Twitter data). The stream is then processed and written in Parquet format to internal Databricks file storage, as shown in the code snippet below:

// Cast the Event Hubs message body to a string and score it with the Text Analytics UDF.
import spark.implicits._   // for the $"column" syntax
val streamingDataFrame = incomingStream.selectExpr("cast (body as string) AS Content")
  .withColumn("body", toSentiment($"Content"))

// Write the scored stream in Parquet format to Databricks file storage.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
val result = streamingDataFrame
  .writeStream.format("parquet")
  .option("path", "/mnt/Data")
  .option("checkpointLocation", "/mnt/sample/check")
  .start()
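The snippet above assumes that incomingStream and the toSentiment user-defined function have already been created earlier in the notebook. As a rough sketch of what those definitions might look like (connection strings and names are placeholders, the azure-eventhubs-spark connector is assumed, and the real post wraps a Text Analytics API call inside the UDF):

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.functions.udf

// Read the Twitter feed buffered in Event Hubs as a streaming DataFrame.
val connectionString = ConnectionStringBuilder("<event-hub-connection-string>")
  .setEventHubName("<event-hub-name>")
  .build
val ehConf = EventHubsConf(connectionString)

val incomingStream = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// Placeholder UDF: in the original pipeline this calls the Text Analytics API
// in Cognitive Services and returns the sentiment score for the tweet text.
val toSentiment = udf { text: String =>
  "0.5"  // stand-in value
}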

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

Mounting file systems within Databricks (Cosmos DB)


Several different file systems can be mounted directly within Databricks, such as Blob Storage, Data Lake Store, and even SQL Data Warehouse. In this blog we’ll explore the connectivity capabilities between Databricks and Cosmos DB.
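Before turning to Cosmos DB, here is a minimal sketch of mounting a Blob Storage container into the Databricks file system (account, container, and secret scope names are placeholders; other stores follow the same pattern with different configuration keys):

// Mount a Blob Storage container at a DBFS path such as /mnt/Data.
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net/",
  mountPoint = "/mnt/Data",
  extraConfigs = Map(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "<secret-scope>", key = "<storage-account-key-name>")
  )
)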

Fast connectivity between Apache Spark and Azure Cosmos DB accelerates the ability to solve fast-moving data science problems, where data can be quickly persisted and retrieved using Azure Cosmos DB. With the Spark to Cosmos DB connector, it’s possible to solve IoT scenarios, update columns when performing analytics, perform push-down predicate filtering, and run advanced analytics against fast-changing data in a geo-replicated managed document store with guaranteed SLAs for consistency, availability, low latency, and throughput.

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

◈ From within Databricks, a connection is made from the Spark master node to the Cosmos DB gateway node to get the partition information from Cosmos DB.
◈ The partition information is passed back to the Spark master node and distributed amongst the worker nodes.
◈ This allows the Spark worker nodes to interact directly with the Cosmos DB partitions when a query comes in. The worker nodes extract the data that is needed and bring it back to the Spark partitions within the Spark worker nodes.

Communication between Spark and Cosmos DB is significantly faster because data moves directly between the Spark worker nodes and the Cosmos DB data nodes.

Using the Azure Cosmos DB Spark connector (currently in preview), it is possible to connect directly to a Cosmos DB account from within Databricks, enabling Cosmos DB to act as an input source or output sink for Spark jobs, as shown in the code snippet below:

import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config

// Placeholder values: substitute your own Cosmos DB account details.
val writeConfig = Config(Map(
  "Endpoint" -> "https://<cosmosdb-account>.documents.azure.com:443/",
  "MasterKey" -> "<master-key>", "Database" -> "<database>",
  "PreferredRegions" -> "<region>", "Collection" -> "<collection>",
  "WritingBatchSize" -> "100"))

import org.apache.spark.sql.SaveMode
// sentimentdata is the DataFrame of scored results produced earlier in the pipeline.
sentimentdata.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
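The same connector can also make Cosmos DB an input source. A rough sketch, reusing the imports above (account details remain placeholders):

// Read the collection back into a DataFrame for further analysis.
val readConfig = Config(Map(
  "Endpoint" -> "https://<cosmosdb-account>.documents.azure.com:443/",
  "MasterKey" -> "<master-key>",
  "Database" -> "<database>",
  "Collection" -> "<collection>"))

val cosmosData = spark.read.cosmosDB(readConfig)
cosmosData.createOrReplaceTempView("sentiment")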

Connecting Databricks to Power BI


Microsoft Power BI is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities, enabling end users to create reports and dashboards by themselves without having to depend on information technology staff or database administrators.

Azure Databricks can be used as a direct data source with Power BI, which enables the performance and technology advantages of Azure Databricks to be brought beyond data scientists and data engineers to all business users.

Power BI Desktop can be connected directly to an Azure Databricks cluster using the built-in Spark connector (currently in preview). The connector enables the use of DirectQuery to offload processing to Databricks, which is great when you have a large amount of data that you don’t want to load into Power BI, or when you want to perform near real-time analysis as discussed throughout this blog post.

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

This connector utilises a JDBC/ODBC connection via DirectQuery, enabling a live connection into the mounted file store holding the streaming data that enters via Databricks. From Databricks we can set a schedule (e.g. every 5 seconds) to write the streamed data into the file store, and Power BI can pull this down regularly to obtain a near-real-time stream of data.
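One way to express that write schedule on the Databricks side, reusing streamingDataFrame and the paths from the earlier snippet (a sketch that would replace the earlier unscheduled writeStream, not the post's exact code):

import org.apache.spark.sql.streaming.Trigger

// Emit a new micro-batch of Parquet files every 5 seconds so the DirectQuery
// connection from Power BI always sees recent data in the mounted store.
val scheduledWrite = streamingDataFrame.writeStream
  .format("parquet")
  .option("path", "/mnt/Data")
  .option("checkpointLocation", "/mnt/sample/check")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()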

From within Power BI, various analytics and visualisations can be applied to the streamed dataset, bringing it to life!

Azure Databricks, Power BI & Cosmos DB, Azure Study Materials, Azure Guides, Azure Learning

Want to have a go at building this architecture out? For more examples of Databricks see the official Azure documentation: