Learn How to Build a Modern Data Warehouse on Microsoft Azure Up to 10X Faster Than Standard Methods
For most, if not all, organizations, data is their most important business asset. It not only allows them to make data-driven decisions but also enables them to make their business processes more efficient and their employees more productive.
Yet, despite this, many organizations fail to realize any value from their data, either because they have no way of easily accessing their data, or because they don't have the necessary tools and systems in place to properly analyze it. In other words, they lack the necessary data management capabilities. That's where the modern data warehouse comes in.
It allows organizations to turn the massive amounts of data they gather from operational systems, such as marketing platforms, CRMs, and sales platforms, into actionable information that's easy to understand. It can store historical data and ingest current data from a variety of systems to form a single source of truth that allows organizations to make better decisions.
The problem is, however, that it can be challenging for organizations to implement a modern data warehouse or modern data estate that suits their specific needs and requirements. Compounding this challenge is that there's so much information available out there that it's difficult to know where to start or even find the right information.
Fortunately, we're here to help. In this article, we'll look at modern data warehouses in more detail and see what their characteristics are, why you should consider modernizing your data warehouse, and, more importantly, how you should go about implementing a modern data warehouse.
Before looking at modern data warehouse architecture in more detail, we'll start at the basics and look at what a data warehouse is. According to Wikipedia, a data warehouse has several fundamental characteristics. These are:
The concept sounds somewhat complicated, so it's necessary to unpack the characteristics above into a simple definition.
We can define a data warehouse as a central data repository that gathers, stores, and consolidates data from a variety of sources within an organization to support data analytics and business intelligence. If we simplify this even further, we can say that a data warehouse gathers data, processes it, and organizes it to facilitate efficient analysis.
Now that we have a simple definition for a data warehouse, we must distinguish it from other common data solutions to clear up any confusion. So, compared to data warehouses:
Even though data warehouses have been around for quite some time, they have evolved considerably in recent times as a result of new technology innovations and the advent and popularity of big data. So, now that we've looked at what a data warehouse is, let's take this a step further and consider what a modern data warehouse is.
Although there's no strict definition of what a modern data warehouse is, there are some overriding themes we see in modern data warehouses. These are:
You typically need to modernize your data warehouse if you struggle with the following:
By modernizing your data warehouse, you’ll eliminate these challenges, because it allows you to turn vast amounts of data from your operational systems into a format that is easy to understand. And, because data warehouses are optimized for read access, you’ll be able to generate reports faster. This, in turn, allows you to make better decisions and make your business processes more efficient and productive.
In addition, you’ll enjoy many other benefits when you modernize your data warehouse. These include:
In simple terms, modern data warehousing is a straightforward and cost-effective way for you and your organization to take advantage of the latest innovations in technology and gain insights from your data without the expense of traditional data warehouses.
Now that we’ve recapped what a modern data warehouse is and why you should consider modernizing your data warehouse, we'll look at how to build a modern data warehouse with Azure.
Building and implementing a modern data warehouse can take up to three years and, considering all the aspects you should focus on, it's simply not possible to implement the entire solution at once. It, therefore, makes sense to structure the implementation of the data warehouse into smaller parts. In other words, you’ll implement your data warehouse in phases, each with its own goals and requirements.
With that in mind, the goals of your first phase could typically look like this:
Although these goals can differ slightly depending on your specific needs and requirements, they are typically the standard goals for many data warehouse implementations. As you can imagine, though, they require a lot of work and come with a set of challenges you'll need to overcome to make your implementation a success.
Let's now look at the process of implementing an enterprise data warehouse.
To look at the implementation process, we’ll deal with each component of a typical modern data warehouse architecture in Azure. In this example, the Azure architecture will consist of:
As mentioned above, one of the first goals of implementing a data warehouse is building a central, unified data model that uses data from a single operational area. Also, you'll need to integrate data from your largest line of business system into the data warehouse.
To do this, you'll need to combine all your structured, unstructured, and semi-structured data. Typically, unstructured data will consist of logs, files, and various types of media. On the other hand, structured data will be the data that you get from your business applications like your CRM, marketing platform, or sales platform. As stated earlier, in this example, we'll only use one source of data.
Once you know what data you'll ingest into your data warehouse and how you'll ingest it, the next step is to extract all the data from the respective sources into files. Here, you’ll encounter one of the main challenges with a modern data warehouse: how do you store data effectively?
To answer this question, you will typically need to consider three vital things:
Let’s look at these questions in more detail.
It’s vitally important to plan how your unstructured, semi-structured, and structured raw data from your data sources will be stored. Typically, when you implement a modern data warehouse on Microsoft Azure, you can store your files in a data lake or Blob storage.
Blob storage is Microsoft’s object storage solution for the cloud. It’s specifically designed and optimized for storing vast amounts of unstructured data. As such, it’s fully capable of:
In contrast, Azure Data Lake Storage Gen2, which is built on Azure Blob Storage, offers a set of capabilities specifically aimed at big data analytics. It effectively combines the capabilities of Azure Data Lake Storage Gen1 with those of Azure Blob Storage. As such, Data Lake Storage Gen2 provides:
Although the choice of the right storage solution depends on your specific needs and requirements, modern data warehouses are designed and implemented with big data analytics in mind. As such, when implementing a modern data warehouse, Azure Data Lake Storage Gen2 may be the most appropriate choice.
When you do choose to implement it, you’ll typically enjoy the following benefits:
Once you’ve decided which storage solution you’ll use, the next vital thing you’ll need to decide on and plan is how the data within the data lake will be structured. In other words, you’ll need to plan what folders you’ll use to store the data in, how these folders will be partitioned, and how you’ll name the folders and individual files.
It’s crucial that you plan these aspects carefully as they will, ultimately, determine how easily you’ll be able to navigate the data stored in the data lake.
The next step will be to plan how you’ll divide the files and how much data each file should contain. Here, you’ll typically need to consider the amount of data you already have, and how fast the volume of your data increases. Using this information, you’ll then be able to decide how you’ll split data into files.
For example, with dictionary (lookup) data, you’ll usually use one file for all the data in a dictionary table, irrespective of how much data the table stores. In contrast, with transaction data, you can store data per day, per month, per year, or for longer or shorter periods, depending on your specific needs and requirements.
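To make this concrete, here’s a minimal sketch in Python of one possible date-partitioned naming convention. The zone, system, and entity names are purely illustrative assumptions, not a prescribed layout.

```python
from datetime import date

# Hypothetical convention: zone / source system / entity / date partition.
# All of these names are illustrative; use whatever fits your organization.
def build_lake_path(zone: str, system: str, entity: str, load_date: date) -> str:
    """Build a partitioned folder path such as raw/crm/customers/year=2024/month=03/day=15/."""
    return (
        f"{zone}/{system}/{entity}/"
        f"year={load_date.year}/month={load_date.month:02d}/day={load_date.day:02d}/"
    )

# Transaction data gets one folder per day; dictionary (lookup) data could
# simply live under a single, unpartitioned folder instead.
print(build_lake_path("raw", "crm", "customers", date(2024, 3, 15)))
# -> raw/crm/customers/year=2024/month=03/day=15/
```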
The next decision you’ll be faced with is which file format you’ll extract the data to. Although this may sound like a simple choice, it’s crucial to get it right as the file format has a significant impact on the final data size.
Typically, you’ll have a choice between the following file formats:
You’ll need to carefully consider the file format your data is in and what the effect will be if you store it in one of the above formats. For example, moving your data out of text files created from a SQL database could increase the data size significantly with some formats, while others could reduce it.
Making the right choice can not only significantly reduce the amount of storage you need but also substantially reduce the time it takes to transfer your data to the cloud.
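As a rough illustration of that size impact, here’s a small sketch (assuming pandas and pyarrow are installed, and using hypothetical file names) that converts a CSV extract to compressed Parquet and compares the two file sizes.

```python
import os

import pandas as pd

# Hypothetical file names for a text extract taken from the source SQL database.
csv_path = "orders_extract.csv"
parquet_path = "orders_extract.parquet"

# Read the text extract and rewrite it as columnar, compressed Parquet.
df = pd.read_csv(csv_path)
df.to_parquet(parquet_path, compression="snappy")

csv_size = os.path.getsize(csv_path)
parquet_size = os.path.getsize(parquet_path)
print(f"CSV: {csv_size / 1e6:.1f} MB, Parquet: {parquet_size / 1e6:.1f} MB")
```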
Once your planning is done, you can proceed to extract the data and transfer it to your data lake. Here, you have a lot of options, like Azure CLI and PowerShell. One of the best options, though, is AzCopy. It is specifically designed for the high-performance copying of data into Azure Blob Storage and, as such, is a fast and efficient way to transfer your data from your on-premises storage to Azure.
There are, however, some things you need to keep in mind when copying your data to Azure Data Lake Storage Gen2. For one, you shouldn’t run the utility on the same machine that runs your production workloads, because the resources it needs could interfere with those workloads.
You should also aim to create a storage account in a region near your on-premises storage to ensure that the transfer happens faster. Finally, AzCopy creates a temporary journal file when transferring the data, which enables it to restart the transfer if it’s interrupted. You should, therefore, make sure that you have enough storage space available to store the journal files.
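If you script the transfer, it could look something like the sketch below, which simply shells out to AzCopy from Python. The storage account, filesystem, and SAS token are placeholders, and AzCopy is assumed to be installed and available on the PATH.

```python
import subprocess

# Placeholders only: substitute your own storage account, container (filesystem),
# and SAS token. AzCopy must be installed and available on the PATH.
source = r"C:\exports\sales"  # local folder containing the extracted files
destination = "https://<storage-account>.dfs.core.windows.net/raw/sales?<SAS-token>"

# --recursive copies the whole folder tree; AzCopy's journal file is what allows
# an interrupted transfer to be resumed.
subprocess.run(["azcopy", "copy", source, destination, "--recursive"], check=True)
```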
Remember, the ultimate goal of having a modern data warehouse built on Azure is to serve the data to Power BI dashboards across an entire enterprise. To achieve that, you’ll need to load the files in your data lake into the data warehouse.
Here, you’ll use PolyBase to load the files into the data warehouse. It uses Azure Synapse’s Massively Parallel Processing (MPP) architecture, which makes it the fastest way to load data into Azure Synapse.
Loading the data into Azure Synapse is a two-step process. During the first step, you’ll create a set of external tables for the data. These external tables are merely table definitions that point to data stored outside Azure Synapse, in our case, in the data lake. It’s important to note that this step does not move any data into the warehouse.
During the next step, you’ll create staging tables and load the data into these staging tables. During this step, the data is copied into the warehouse. Once the data is copied into Azure Synapse, you’ll transform the data and move it into production tables that are suitable for semantic modeling.
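As a rough sketch of this two-step pattern, the snippet below issues the T-SQL from Python with pyodbc. The server, schema, and table names are hypothetical, and the external data source (LakeSource) and file format (ParquetFormat) are assumed to have been created against the data lake beforehand.

```python
import pyodbc

# Placeholder connection details for a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-workspace>.sql.azuresynapse.net;DATABASE=<your-pool>;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Step 1: the external table is only a definition that points at files in the
# data lake; no data is moved into the warehouse yet.
cursor.execute("""
CREATE EXTERNAL TABLE ext.Orders (
    OrderId    INT,
    CustomerId INT,
    OrderDate  DATE,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION = '/raw/sales/orders/',
    DATA_SOURCE = LakeSource,
    FILE_FORMAT = ParquetFormat
);
""")

# Step 2: CREATE TABLE AS SELECT copies the data into a staging table inside
# Azure Synapse, using PolyBase's parallel load.
cursor.execute("""
CREATE TABLE stg.Orders
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM ext.Orders;
""")
```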
Next, you’ll load the data into a tabular model in Azure Analysis Services. During this step, you’ll typically create a semantic model using SQL Server Data Tools (SSDT). Here, you also have the option of creating a semantic model by importing it from a Power BI Desktop file.
It’s important to keep in mind here that you’ll need to add the relationships to the semantic model so that you can join data across tables. This is simply because Azure Synapse doesn’t support foreign keys. When you’re done with this step, you’ll be able to visualize your data in Power BI.
Power BI has two options to connect to Azure Analysis Services so that you can visualize your data. The first is to import your data into Power BI. The other option is to use Live Connection where Power BI pulls data directly from Azure Analysis Services.
Although the choice, ultimately, depends on your specific needs and requirements, it’s recommended to use Live Connection because you don’t need to copy data into Power BI.
When visualizing your data, there are also some things you need to consider. For one, Azure Analysis Services is specifically designed to handle the query requirements of a Power BI dashboard. As a result, it’s a recommended practice to query Azure Analysis Services directly from Power BI.
Considering the above, the other thing you need to keep in mind is that you should avoid running queries directly against the data warehouse. This could impact performance as refreshing the dashboard will count against the number of concurrent queries.
We mentioned earlier that a modern data warehouse should be implemented in phases and our example above perfectly illustrates what the first phase of the implementation could look like. So, what does the implementation look like in later stages when we want to incorporate more features into the data warehouse?
In this example, we’ll build on the previous example and add some important features that are vital for modern data warehouse implementations. These features include:
In this example, the Azure architecture consists of:
To ingest your data into the data warehouse, you'll use Data Factory pipelines. These pipelines are logical groupings of activities that work together to perform a specific task. For instance, a pipeline could contain a set of activities that ingest and clean data from a variety of systems and then start a data flow to analyze this data.
Another example would be when you use a copy activity to copy external data from, for instance, a SQL database to your Azure Blob Storage. This is similar to our example where we’ll use a pipeline to load and transform data into Azure Synapse.
One of the main benefits of using these pipelines is that they allow you to manage the activities together instead of each one individually. You, therefore, deploy and schedule the pipeline instead of deploying each activity independently.
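To make the idea of a pipeline as a deployable group of activities more concrete, here’s a minimal sketch using the azure-mgmt-datafactory Python SDK, loosely following the pattern of Microsoft’s Python quickstart for Data Factory. The subscription, resource group, factory, and dataset names are placeholders, the referenced datasets and linked services are assumed to exist already, and exact model names and constructor signatures can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, CopyActivity, DatasetReference, PipelineResource, SqlSource
)

# Placeholders; the corresponding datasets and linked services are assumed to
# have been created in the factory already.
subscription_id = "<subscription-id>"
rg_name = "<resource-group>"
df_name = "<data-factory-name>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A copy activity that reads from a SQL Server dataset and writes the extract
# to a blob / data lake dataset.
copy_activity = CopyActivity(
    name="CopySalesToLake",
    inputs=[DatasetReference(reference_name="OnPremSalesDataset")],
    outputs=[DatasetReference(reference_name="LakeSalesDataset")],
    source=SqlSource(),
    sink=BlobSink(),
)

# The pipeline groups the activities so they are deployed, scheduled, and
# monitored together rather than one by one.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "IngestSalesPipeline", pipeline)
```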
In contrast to our first example, this architecture will also implement the incremental loading of data. When you use an automated ELT process, it’s far more efficient to load only new data, or in other words, only data that has changed, into the data warehouse compared to loading all the data.
Temporal tables, also known as system-versioned tables, provide information about the data stored in the table at any point in time. They do this by automatically recording the history of all changes in a separate history table. From an ELT perspective, you can then query the historical data to determine whether an incremental load should be performed.
Ultimately, Data Factory will perform an incremental load if there are any changes. Keep in mind that, after a new batch of data is loaded into the warehouse, you'll need to refresh the Analysis Services tabular model. It's also important to keep in mind that data cleansing should be part of your ELT process to ensure good quality data.
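A simplified sketch of this idea is shown below: it combines the temporal table’s history with a small watermark table, using pyodbc. The watermark table, and all table and column names, are hypothetical additions for illustration; in the architecture described here, Data Factory would orchestrate this logic rather than a standalone script.

```python
import pyodbc

# Placeholder connection strings; etl.Watermark and the system-versioned source
# table dbo.Orders (with its ValidFrom period column) are hypothetical.
src = pyodbc.connect("<source-connection-string>", autocommit=True)
dwh = pyodbc.connect("<warehouse-connection-string>", autocommit=True)

# 1. Read the high-water mark recorded after the previous load.
last_load = dwh.cursor().execute(
    "SELECT LastLoadTime FROM etl.Watermark WHERE TableName = 'Orders'"
).fetchval()

# 2. Ask the temporal (system-versioned) table which row versions appeared
#    since then, i.e. which rows are new or changed.
changed_rows = src.cursor().execute(
    "SELECT OrderId, CustomerId, Amount, ValidFrom "
    "FROM dbo.Orders FOR SYSTEM_TIME ALL WHERE ValidFrom > ?",
    last_load,
).fetchall()

# 3. Load only the changed rows into staging, then advance the watermark.
if changed_rows:
    cursor = dwh.cursor()
    cursor.fast_executemany = True
    cursor.executemany(
        "INSERT INTO stg.Orders (OrderId, CustomerId, Amount, ValidFrom) "
        "VALUES (?, ?, ?, ?)",
        [tuple(row) for row in changed_rows],
    )
    cursor.execute(
        "UPDATE etl.Watermark SET LastLoadTime = ? WHERE TableName = 'Orders'",
        max(row.ValidFrom for row in changed_rows),
    )
```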
In contrast to our first example, we’ll now incorporate multiple sources of data. Here, your data factory will orchestrate and coordinate the extraction of data from the external sources into your Blob storage or Azure Data Lake Storage Gen2 by using a Copy Activity to move the data from both on-premises and cloud data sources. Like before, you're able to copy data to your data lake in one of the file formats mentioned earlier.
From here it can copy data directly to Azure Synapse by using the blob storage connector. It’s important to keep in mind, however, that the blob storage connector only supports certain authentication types like account key authentication, shared access signature authentication, and system-assigned managed identity authentication, amongst others.
As such, it requires a connection string or shared access signature and cannot, therefore, be used to copy a blob with public read access. For a blob with public read access, you would need to use PolyBase to create an external table over your blob storage and copy the external table to Azure Synapse.
When working with binary data, Data Factory’s Copy Activity will also be able to copy from your data source to your data lake and on to Azure Synapse. It’s important to note, however, that when using the copy activity, you can only copy from binary data to binary data.
You should also keep in mind that, when using PolyBase as the workaround described above, it only supports maximum column sizes of 8,000 bytes. In this case, you’ll have to break the data up into smaller pieces during copying and then reassemble the pieces after copying is done.
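As a simple illustration of that split-and-reassemble approach (using a hypothetical oversized value), the sketch below chops a value into pieces that each fit within the 8,000-byte limit and rebuilds it afterwards.

```python
# PolyBase's per-column limit that each piece must respect.
CHUNK_SIZE = 8000

def split_value(value: bytes, chunk_size: int = CHUNK_SIZE):
    """Split one oversized value into ordered (sequence, chunk) pairs that each
    fit in an 8,000-byte column."""
    return [
        (seq, value[offset:offset + chunk_size])
        for seq, offset in enumerate(range(0, len(value), chunk_size))
    ]

def reassemble(chunks):
    """Rebuild the original value from (sequence, chunk) pairs once copying is done."""
    return b"".join(chunk for _, chunk in sorted(chunks))

document = b"x" * 25_000        # e.g. a large text or binary column value
pieces = split_value(document)  # four pieces of at most 8,000 bytes each
assert reassemble(pieces) == document
```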
It’s no secret that, in today’s competitive market, agility is absolutely crucial to success. Not only does it allow organizations to adapt to changing market conditions and take advantage of business opportunities when they arise, but it also makes them more efficient.
So, when you want to give your organization the agility and flexibility required by today’s competitive market, it’s vital that you implement or modernize your data warehouse. It allows you to combine the data from all your operational areas into one source of truth and gives you increased scalability, flexibility, and automation capabilities.
Hopefully, this post helped illustrate how you can implement a modern data warehouse. To learn more about modern data warehouses or about how our automated, low-code, drag-and-drop Data Estate Builder can help you implement and operate a modern data warehouse without writing any code, visit our website for more details.
About TimeXtender
TimeXtender is an automated, low-code, drag-and-drop Data Estate Builder that empowers you to build a Modern Data Estate up to 10x faster than standard methods, prepare data for analysis, and make quality business decisions with data, mind, and heart. We do this for one simple reason: because time matters.
As a Microsoft Gold Certified Partner, we serve our 3,000+ customers, from mid-sized companies to Fortune 500, through our global network of partners.
Visit the TimeXtender customer page to learn how we are helping clients build reliable data estates faster than standard methods.