Why use Azure Data Lake?

November 28, 2017

Quite simply, because you should.

Building analytics solutions is a complex undertaking, and it is a certainty that you will either make mistakes or see requirements change. So how does Azure Data Lake help us?

First, let me say that the only way this will work is if you can automate the process of ingesting data. When building a data warehouse, you will need to process many different data sources across your organization. Regardless of whether you are using an ensemble model like Data Vault or building a dimensional data warehouse, you will not be able to analyze and model all the requirements at the start of the project. For this reason, we recommend using Azure Data Lake and/or Azure Blob Storage as your persistent staging layer.

The what and why of using a persistent staging layer is best explained in the following blogs by Hans Michiels and Roelant Vos. For me, the most important reason for using a persistent staging layer is that you don't know what you don't know at the start of a project. We have worked with numerous customers, and the flexibility to refactor a data warehouse model as requirements change is invaluable. Coupling this with “cheap”, flexible storage makes it a no-brainer.

Microsoft recently published their Technical reference implementation for enterprise BI and reporting.

We will only focus on the STAGE, STORE & PROCESS sections of the above technical reference. Note that Azure Data Lake and Azure Blob Storage are referenced as a staging layer, not as a Swiss army knife that will solve all your Business Intelligence needs, so please keep them in context.

Before you dive into the lake

There are a couple of things you will need to know and plan for. Getting the data into the lake is relatively easy to automate; the trick is to make sure you can get the right data out, in the correct format, as efficiently as possible. Think about how easy it is to throw all your washing into a drawer, then try to find the other blue sock when you are in a hurry.

Understanding where Azure Data Lake and Azure Blob Storage fit into your analytics platform is important, and it is easy to get overwhelmed by all the possibilities and simply start dumping data into them. In the context of delivering analytics to your end users, the STAGE layer only has value once it is processed and transformed into a structure that can be understood and consumed by user-friendly BI tools like Power BI.

Staging Considerations

You can store any data in its native format without any prior transformation. However, and this is where automation comes into play, you may want to prepare the data for consumption. Let’s say you are loading data from a database and some of the columns have long comment fields with line breaks. These line breaks could cause errors when loading the data into an Azure SQL DW using PolyBase. It is a good idea to replace the line breaks with other control characters on extraction and reverse the substitution when loading the data.

Example of extracting a text column:

REPLACE(REPLACE([Comments], CHAR(10), CHAR(26)), CHAR(13), CHAR(29)) AS [Comments]

Example of importing the same text column:

REPLACE(REPLACE([Comments], CHAR(26), CHAR(10)), CHAR(29), CHAR(13)) AS [Comments]
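
To put both expressions in context, here is a minimal sketch; the table names ([dbo].[Orders], [ext].[Orders], and [stage].[Orders]) are hypothetical, and the external table is assumed to already exist over the staged file via PolyBase:

-- Extraction query against the source system: swap out the line breaks before writing to the lake
SELECT
    [OrderId],
    REPLACE(REPLACE([Comments], CHAR(10), CHAR(26)), CHAR(13), CHAR(29)) AS [Comments]
FROM [dbo].[Orders];

-- Load from the external table into the warehouse: restore the original line breaks
INSERT INTO [stage].[Orders] ([OrderId], [Comments])
SELECT
    [OrderId],
    REPLACE(REPLACE([Comments], CHAR(26), CHAR(10)), CHAR(29), CHAR(13)) AS [Comments]
FROM [ext].[Orders];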

Another good practice is to conform your numeric and date values to a standard format.

Microsoft SQL Server Example:

CONVERT(VARCHAR(27), DateColumn, 121)

Oracle Example:

CAST(TO_CHAR(DateColumn, 'YYYY-MM-DD HH24:MI:SS.FF6') AS VARCHAR2(27))

MySQL Example:

LEFT(DATE_FORMAT(DATE_ADD(DateColumn, INTERVAL 0 DAY), '%Y-%m-%d %T.%f'), 27)
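
On the load side, a value conformed to this format can be converted back to a native type. A minimal sketch, assuming the staged value arrives as a VARCHAR(27) column on a hypothetical external table [ext].[SourceTable]:

-- Restore the conformed string to a native datetime2 when loading into the warehouse
SELECT CAST([DateColumn] AS DATETIME2(6)) AS [DateColumn]
FROM [ext].[SourceTable];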


Handling Changes

Processing incremental loads is well documented, so I will not discuss it here; you will, however, need a strategy to handle changes to the structure of your sources. We live in a world where systems are evolving and table structures change constantly. It is therefore essential to load not just your data but also the associated metadata. This way, when you need to reload or replay your historical data into a data warehouse, you can use the metadata associated with the file at that point in time. We can go even further by adding to this control metadata the code and scripts required to create your external tables or other queries.

Regardless of the type of source you are working with, it is likely that at the time of extracting the data you will also have the information needed to extract the associated metadata.
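
For a SQL Server source, for example, the table's structure can be captured at extraction time from INFORMATION_SCHEMA and stored next to the staged file. A minimal sketch; the control table [ctl].[StagedFileColumns] and the file path are hypothetical:

-- Record the source table's structure alongside the file just written to the lake
INSERT INTO [ctl].[StagedFileColumns]
    ([FilePath], [ColumnName], [DataType], [MaxLength], [OrdinalPosition], [CapturedAt])
SELECT
    '/raw/sales/Orders/2017/11/28/Orders.csv',  -- hypothetical path of the staged file
    [COLUMN_NAME],
    [DATA_TYPE],
    [CHARACTER_MAXIMUM_LENGTH],
    [ORDINAL_POSITION],
    SYSUTCDATETIME()
FROM INFORMATION_SCHEMA.COLUMNS
WHERE [TABLE_SCHEMA] = 'dbo'
  AND [TABLE_NAME] = 'Orders';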

Want to know more?

If you would like to see how this works in practice, I recommend joining us for our webinars, where we go into this in more detail and show you how to automate your solution. We will demonstrate using both SSIS and Azure Data Factory to automate all of the code through metadata injection.

Using Azure Data Lake for Persistent Staging

A Persistent Staging Area (PSA) is a record, or archive, of all data deltas that can be used to reload a Data Warehouse. Traditional approaches to creating a PSA require a big database that stores every unique record loaded. In this webinar, we will show you how BimlFlex with Azure Data Lake allows for data reloads and management with minimal configuration.

Azure Data Factory v2 Automation with BimlStudio

The upcoming version of BimlStudio will include full support for Azure Data Factory v2, including new first-class Biml elements to create Gateways, Linked Services, Datasets, Pipelines, and Triggers. In this webinar, we will give you a comprehensive tour of the new functionality in the context of an end-to-end solution. In just a couple of hundred lines of Biml code, we will show how to read a source schema from an on-premises system, generate an Azure Data Factory that loads the source data into an Azure Data Lake, and create a PowerShell script to automate the build and deployment process. All of the code will be made available for use in your projects.