Thursday, 1 February 2018


Lambda Architecture using Azure CosmosDB: Faster performance, Low TCO, Low DevOps


Azure Cosmos DB provides a scalable database solution that can handle both batch and real-time ingestion and querying and enables developers to implement lambda architectures with low TCO. Lambda architectures enable efficient data processing of massive data sets. Lambda architectures use batch-processing, stream-processing, and a serving layer to minimize the latency involved in querying big data.

To implement a lambda architecture, you can use a combination of the following technologies to accelerate real-time big data analytics:

◈ Azure Cosmos DB, the industry's first globally distributed, multi-model database service.

◈ Apache Spark for Azure HDInsight, a processing framework that runs large-scale data analytics applications.

◈ Azure Cosmos DB change feed, which streams new data to the batch layer for HDInsight to process

◈ The Spark to Azure Cosmos DB Connector

We wrote a detailed article that describes the fundamentals of a lambda architecture based on the original multi-layer design and the benefits of a "rearchitected" lambda architecture that simplifies operations.

What is a lambda architecture?

The basic principles of a lambda architecture are depicted in the figure above:

1. All data is pushed into both the batch layer and speed layer.

2. The batch layer has a master dataset (immutable, append-only set of raw data) and pre-computes the batch views.

3. The serving layer has batch views for fast queries.

4. The speed layer compensates for processing time (to the serving layer) and deals with recent data only.

5. All queries can be answered by merging results from batch views and real-time views or pinging them individually.
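As a rough illustration of principle 5, the following sketch merges a precomputed batch view with a real-time view at query time. It is plain Python and not tied to any Azure API; the function name `merge_views` and the hashtag counts are hypothetical and exist only for this example.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine precomputed batch counts with counts from recent events."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds counts key by key
    return dict(merged)


# Hypothetical data: the batch view covers everything up to the last batch run,
# the real-time view covers only events that arrived after that run.
batch_view = {"#azure": 120, "#spark": 45}
realtime_view = {"#azure": 3, "#cosmosdb": 2}

merged = merge_views(batch_view, realtime_view)
print(merged)
```

A query that only needs historical totals can read the batch view alone, and one that only needs the latest activity can read the real-time view alone.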

Speed layer

For the speed layer, you can use Azure Cosmos DB change feed support to keep the state for the batch layer while exposing the Azure Cosmos DB change log through the Change Feed API for your speed layer.

What’s important in these layers:

1. All data is pushed only into Azure Cosmos DB, so you can avoid multi-casting issues.

2. The batch layer has a master dataset (immutable, append-only set of raw data) and pre-computes the batch views.

3. The serving layer is discussed in the next section.

4. The speed layer utilizes HDInsight (Apache Spark) to read the Azure Cosmos DB change feed. This enables you to persist your data as well as to query and process it concurrently.

5. All queries can be answered by merging results from batch views and real-time views or pinging them individually.
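To make the idea of a single write path with two readers concrete, here is a toy model in plain Python. It is not the Azure Cosmos DB SDK or the Change Feed API: the class `ChangeFeedStore`, its methods, and the integer continuation token are assumptions made for this sketch, and it models only new inserts, not updates to existing documents.

```python
class ChangeFeedStore:
    """Toy stand-in for a collection that exposes a change feed."""

    def __init__(self):
        self._log = []  # append-only list of documents, in write order

    def write(self, doc):
        # Single write path: every document lands here exactly once.
        self._log.append(dict(doc))

    def read_all(self):
        """Batch layer: read the full master dataset."""
        return list(self._log)

    def read_changes(self, continuation=0):
        """Speed layer: return documents written after `continuation`,
        plus a new continuation token to resume from next time."""
        changes = self._log[continuation:]
        return changes, len(self._log)


store = ChangeFeedStore()
store.write({"id": "1", "tag": "#azure"})
store.write({"id": "2", "tag": "#spark"})

first_batch, token = store.read_changes()        # both documents so far
store.write({"id": "3", "tag": "#azure"})
second_batch, token = store.read_changes(token)  # only the new document
```

Because the producer writes once and both layers read from the same store, there is no need to send each event to two separate systems.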

Batch and serving layers

Since the new data is loaded into Azure Cosmos DB (where the change feed is being used for the speed layer), this is where the master dataset (an immutable, append-only set of raw data) resides. From this point onwards, you can use HDInsight (Apache Spark) to perform the pre-compute functions from the batch layer to the serving layer, as shown in the following figure:

What’s important in these layers:

1. All data is pushed only into Azure Cosmos DB (to avoid multi-cast issues).

2. The batch layer has a master dataset (immutable, append-only set of raw data) stored in Azure Cosmos DB. Using HDInsight (Apache Spark), you can pre-compute your aggregations to be stored in your computed batch views.

3. The serving layer is an Azure Cosmos DB database with collections for the master dataset and computed batch view.

4. The speed layer is discussed later in this article.

5. All queries can be answered by merging results from the batch views and real-time views or pinging them individually.
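The pre-compute step in point 2 can be sketched in plain Python as follows. In the architecture described here the aggregation would run in Apache Spark on HDInsight; this version only shows the shape of the computation. The function name `compute_batch_view`, the `tag` field, and the sample documents are hypothetical.

```python
from collections import defaultdict


def compute_batch_view(master_dataset):
    """Aggregate raw documents into one summary document per tag."""
    counts = defaultdict(int)
    for doc in master_dataset:
        counts[doc["tag"]] += 1
    # One document per key, in the shape you might store in a
    # batch-view collection in the serving layer.
    return [{"id": tag, "count": n} for tag, n in sorted(counts.items())]


# Hypothetical master dataset: raw, append-only documents.
master_dataset = [
    {"id": "1", "tag": "#azure"},
    {"id": "2", "tag": "#spark"},
    {"id": "3", "tag": "#azure"},
]

batch_view_docs = compute_batch_view(master_dataset)
print(batch_view_docs)
```

The master dataset is never modified; each batch run recomputes the view from the raw documents and replaces the previous result.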

For code examples, see:

◈ Lambda Architecture Rearchitected - Batch Layer HTML | ipynb

◈ Lambda Architecture Rearchitected - Batch to Serving Layer HTML | ipynb

Speed layer

As previously noted, using the Azure Cosmos DB Change Feed Library allows you to simplify the operations between the batch and speed layers. In this architecture, use Apache Spark (via HDInsight) to perform the structured streaming queries against the data. You may also want to temporarily persist the results of your structured streaming queries so other systems can access this data.

To do this, create a separate Azure Cosmos DB collection to save the results of your structured streaming queries. This lets other systems, not just Apache Spark, access this information. In addition, with the Azure Cosmos DB Time-to-Live (TTL) feature, you can configure your documents to be automatically deleted after a set duration.
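The effect of TTL on such a results collection can be approximated with a small model in plain Python. This is a simplified sketch, not the service's implementation: it assumes each document carries a last-modified timestamp in `_ts` (epoch seconds) and an optional per-document `ttl` in seconds, that a collection-level default applies when `ttl` is absent, and that a value of -1 means the document never expires. The helper `is_expired` and the sample documents are hypothetical.

```python
def is_expired(doc, now, default_ttl):
    """Return True if `doc` would be past its time-to-live at time `now`."""
    if default_ttl is None:  # TTL is not enabled on the collection
        return False
    ttl = doc.get("ttl", default_ttl)
    if ttl == -1:            # explicitly marked as never expiring
        return False
    return now >= doc["_ts"] + ttl


# Hypothetical streaming results, all written at t = 1000 seconds.
results = [
    {"id": "window-1", "_ts": 1000},              # uses the collection default
    {"id": "window-2", "_ts": 1000, "ttl": 300},  # per-document override
    {"id": "pinned", "_ts": 1000, "ttl": -1},     # kept indefinitely
]

# With a 60-second default, check which documents remain at t = 1100.
live = [d["id"] for d in results if not is_expired(d, now=1100, default_ttl=60)]
print(live)
```

A short TTL suits the real-time view because the batch layer eventually recomputes the same results from the master dataset.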