Everything You Need to Know About Building a Modern Data Warehouse
Written by: Juan Keena, Chief Marketing Officer - September 30, 2021
Learn How to Build a Modern Data Warehouse on Microsoft Azure Up to 10X Faster Than Standard Methods
For most, if not all organizations, data is their most important business asset. It not only allows them to make data-driven decisions but also gives them the ability to make their business processes more efficient and their employees more productive.
Yet, despite this, many organizations fail to realize any value from their data, either because they have no way of easily accessing their data, or because they don't have the necessary tools and systems in place to properly analyze it. In other words, they lack the necessary data management capabilities. That's where the modern data warehouse comes in.
It allows organizations to turn the massive amounts of data they gather from operational systems, such as marketing platforms, CRMs, and sales platforms, into actionable information that's easy to understand. As such, it can store historical data and ingest current data from a variety of systems to form a single source of truth that allows organizations to make better decisions.
The problem is, however, that it can be challenging for organizations to implement a modern data warehouse or modern data estate that suits their specific needs and requirements. Compounding this challenge is that there's so much information available out there that it's difficult to know where to start or even find the right information.
Fortunately, we're here to help, and with this article, we’ll look at modern data warehouses in more detail and see what their characteristics are, why you should consider modernizing your data warehouse, and more importantly, how you should go about implementing a modern data warehouse.
What Is a Data Warehouse?
Before looking at modern data warehouse architecture in more detail, we'll start at the basics and look at what a data warehouse is. According to Wikipedia, a data warehouse has several fundamental characteristics. These are:
- A data warehouse is used for reporting and data analysis and is a core component of business intelligence.
- Data warehouses are central repositories of integrated data gathered from one or more disparate sources or operational areas.
- Data warehouses store both current and historical data in one single place where it’s used to create analytical reports for employees across an entire enterprise.
- The data stored in a data warehouse is gathered from various operational systems like marketing or sales.
- The data in the data warehouse may pass through an operational data store and may require data cleansing to ensure data quality and integrity before it is used for reporting and analytics.
The concept sounds somewhat complicated, so it's worth unpacking the characteristics above into a simple definition.
We can define a data warehouse as a central data repository that gathers, stores, and consolidates data from a variety of sources within an organization to support data analytics and business intelligence. If we simplify this even further, we can say that a data warehouse gathers data, processes it, and organizes it to facilitate efficient analysis.
Now that we have a simple definition for a data warehouse, we must distinguish it from other common data solutions to clear up any confusion. So, compared to data warehouses:
- Databases collect data to enable reporting, but are not designed to centralize and store large amounts of data. Databases are, however, a core component of a data warehouse.
- Data lakes also store vast amounts of data, but they keep structured, semi-structured, and unstructured data in its native format and process it on demand. As such, data lakes serve a different purpose but complement data warehouses.
- Data marts are a simple form of data warehouse and, as such, only gather data from a few sources and are typically used in only one part of an organization.
What Is a Modern Data Warehouse?
Even though data warehouses have been around for quite some time, they have evolved considerably in recent years as a result of new technology innovations and the advent and popularity of big data. So, now that we've looked at what a data warehouse is, let's take this a step further and consider what a modern data warehouse is.
Although there's no strict definition of what a modern data warehouse is, there are some overriding themes we see in modern data warehouses. These are:
- Automation. Modern data warehouses implement data extraction and transformation tools, as well as a variety of automation tools that simplify the process of gathering, storing, and analyzing data. In turn, these tools help organizations ensure data quality and integrity while eliminating errors. Ultimately, this makes them more productive and efficient.
- Flexible. Modern data warehouses are typically flexible enough to allow organizations to make adjustments as their specific needs and requirements change. With traditional data warehouses, this kind of adjustment was far more difficult.
- Modular. Modern data warehouses typically have a modular architecture that allows organizations to implement and use specific components depending on their specific needs and requirements. This also allows them to extend their data warehouse when necessary.
- Robust and resilient. Modern data warehouses often have highly redundant, highly available components as part of the architecture, which helps ensure that services remain available. As a result, they can support an organization's business operations and minimize failures that could impact performance.
- Scalable. Because of their modular architecture and their ability to be scaled up or down depending on an organization's needs and requirements, modern data warehouses are far more scalable than their traditional counterparts where organizations would be limited by fixed installations of hardware.
- Data. Modern data warehouses allow organizations to gather and store different types of data including streaming data, structured data, semi-structured data, and unstructured data.
- Components. Modern data warehouses typically feature a wealth of out-of-the-box components, extensions, and built-in packages. This allows organizations to extend the functionality of the platform where necessary and minimizes custom coding in the implementation and maintenance of the data warehouse.
- Cloud-first. Cloud computing technology has evolved significantly in recent years, and it has become a lot cheaper. As a result, it's far more affordable than on-premises data infrastructure, where organizations need to install and maintain hardware and update software. In simple terms, it reduces the maintenance overhead of traditional data warehouse models while also lowering costs.
Do You Need a Modern Data Warehouse?
You typically need to modernize your data warehouse if any of the following sound familiar:
- You have different types of users you want to provide data for (power users working with Raw Data and AI and ML in the ODX, business users working with cleaned, enriched and consolidated data in the Data Warehouse, or casual users working with routine reports in the Semantic Layer).
- You want to spend more time getting value from your data than organizing, cleaning, and orchestrating it.
- You spend too much time waiting for data to be delivered to you.
- You spend too much time documenting it.
- You have problems with version control.
- You want to be independent from the developer.
- You want to be independent from the frontend.
- You gather data from different sources which takes a lot of time and effort and could lead to missed business opportunities.
- You don't have a single view of your business processes. As a result, you're less efficient and less productive.
- You struggle to identify your most valuable customers and, as a result, you're unable to improve your processes to retain your customers.
- You don't have enough information and data on your customers, which negatively impacts your marketing campaigns and makes them less effective and less efficient.
- The quality of your data is such that you don't have stable reporting. This then negatively impacts your business processes.
- You struggle to import and integrate data from your line of business systems because of scalability and availability issues.
- You need to consume data faster than your current setup allows. In other words, the window for gathering data, analyzing it, and reporting on it is too short.
By modernizing your data warehouse, you’ll address these challenges, because it allows you to turn vast amounts of data from your operational systems into a format that is easy to understand. And, because data warehouses are optimized for read access, you’ll be able to generate reports faster. This, in turn, allows you to make better decisions and makes your business processes more efficient and productive.
In addition, you’ll enjoy many other benefits when you modernize your data warehouse. These include:
- If you are going for a “cloud-first” approach, you don't need to purchase, install, or maintain physical hardware. As such, you have no upfront capital expenditure, only a relatively low operational expenditure.
- Modern data warehouses are far less complex to set up compared to their traditional counterparts. This also makes them easier to prototype and quicker to deliver business value for your organization.
- Modern data warehouses are a lot easier, quicker, and less expensive to scale as your needs and requirements change.
- Modern data warehouses can perform complex analytical queries a lot faster than traditional data warehouses because of the way they store data. This, ultimately, means that you'll be able to gain insights into your data faster and make better decisions quicker.
In simple terms, modern data warehousing is a straightforward and cost-effective way for you and your organization to take advantage of the latest innovations in technology and gain insights from your data without the expense of traditional data warehouses.
How To Implement a Modern Data Warehouse With Azure
Now that we’ve recapped what a modern data warehouse is and why you should consider modernizing your data warehouse, we'll look at how to build a modern data warehouse with Azure.
Goals of Implementing a Data Warehouse
Building and implementing a modern data warehouse can take up to three years and, considering all the aspects you should focus on, it's simply not possible to implement the entire solution at once. It, therefore, makes sense to structure the implementation of the data warehouse into smaller parts. In other words, you’ll implement your data warehouse in phases, each with its own goals and requirements.
With that in mind, the goals of your first phase could typically look like this:
- Build a central, unified data model that leverages data from a single business or operational area, like sales, marketing, or human resources, but that allows for expansion into other operational areas in the future.
- Integrate data from your largest line of business system into the data warehouse.
- Make data available by a specified local time irrespective of when the data is gathered or the type of data.
- New data from your line of business system should be loaded regularly into the data warehouse.
- Data in the data warehouse should be prepared for use by all employees across the organization, with security settings that ensure data consumers have access to relevant data and are shielded from data that's not relevant.
- The analytical model that will provide you with valuable insights into your data will contain vast amounts of historical business data.
- Develop an overview dashboard for the specific business area, with all security measures applied, that responds within a few seconds.
Although these goals can differ slightly depending on your specific needs and requirements, they are typically the standard goals for many data warehouse implementations. As you can imagine, though, they require a lot of work and come with a set of challenges you'll need to overcome to make your implementation a success.
With that in mind, let's look at the process of implementing an enterprise data warehouse.
Typical Modern Data Warehouse Architecture
To look at the implementation process, we’ll deal with each component of a typical modern data warehouse architecture in Azure. In this example, the Azure architecture will consist of:
- SQL Server as a data source
- Blob storage in the form of Azure Data Lake Storage Gen2 for storing data before loading it into the data warehouse.
- Azure Synapse (dedicated SQL pool) to perform analytics on vast amounts of data.
- Azure Analysis Services to provide data modeling capabilities.
- Azure Active Directory to authenticate users who connect to Azure Analysis Services through Power BI.
Getting the Data
As mentioned above, one of the first goals of implementing a data warehouse is building a central, unified data model that uses data from a single operational area. Also, you'll need to integrate data from your largest line of business system into the data warehouse.
To do this, you'll need to combine all your structured, unstructured, and semi-structured data. Typically, unstructured data will comprise logs, files, and various types of media. On the other hand, structured data will be the data that you get from your business applications like your CRM, marketing platform, or sales platform. As stated earlier, in this example, we'll only use one source of data.
Storing the Data
Once you know what data you will ingest into your data warehouse and how, the next step is to extract all the data from the respective sources into files. Here, you’ll encounter one of the main challenges with a modern data warehouse: how do you store data effectively?
To answer this question, you will typically need to consider 3 vital things:
- Where you will store the files and how you will structure and organize them.
- How you will divide the files and how much data each file should contain.
- What file format you will extract the data to.
Let’s look at these questions in more detail.
Where will you store the files and how will you structure and organize them?
It’s vitally important to plan how your unstructured, semi-structured, and structured raw data from your data sources will be stored. Typically, when you implement a modern data warehouse on Microsoft Azure, you can store your files in a data lake or Blob storage.
Blob storage is Microsoft’s object storage solution for the cloud. It’s specifically designed and optimized for storing vast amounts of unstructured data. As such, it’s fully capable of:
- Serving images, files, or documents directly to a web browser.
- Storing files for distributed access across an entire enterprise.
- Streaming video and audio.
- Writing data to log files.
- Storing data for backup and restore operations, archiving, or disaster recovery.
- Storing data for analysis by an Azure-hosted or on-premises data analysis solution.
In contrast, Azure Data Lake Storage Gen2, which is built on Azure Blob Storage, has a set of capabilities specifically aimed at big data analytics. It effectively combines the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. As such, Data Lake Storage Gen2 provides:
- A hierarchical file system
- File system semantics
- File-level security
- Scalability
- Low cost, tiered storage
- High availability
- Strong consistency
- Disaster recovery capabilities
Although the choice of the right storage solution depends on your specific needs and requirements, modern data warehouses are designed and implemented with big data analytics in mind. As such, when implementing a modern data warehouse, Azure Data Lake Storage Gen 2 may be the most appropriate choice.
When you do choose to implement it, you’ll typically enjoy the following benefits:
- Centralized access to a replica of the data in the various data sources
- The performance of your data warehouse will be optimized because you don’t need to copy or transform data as a prerequisite for analytics. Compared to Blob storage’s flat namespace, the hierarchical namespace also speeds up directory management operations, which improves overall performance.
- Data management becomes easier because you can organize your data in directories and subdirectories.
- Because you’re able to define POSIX permissions on directories or individual files, security is enforceable (illustrated in the sketch after this list).
- Because it’s built on Azure Blob storage which is low-cost by design, it’s very cost-effective and its additional features further lower the cost of ownership.
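To make the last two benefits more concrete, here's a minimal sketch that uses the azure-storage-file-datalake Python SDK to create a directory in a Data Lake Storage Gen2 file system and apply a POSIX ACL to it. The account URL, file system, and directory names are placeholders you'd replace with your own.

```python
# A minimal sketch (not a production script): organizing data into directories
# and applying POSIX ACLs in Azure Data Lake Storage Gen2.
# Assumes the azure-storage-file-datalake and azure-identity packages are installed;
# the account, container, and directory names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
file_system = service.get_file_system_client("raw")  # assumed container name

# Organize data in directories and subdirectories (hierarchical namespace).
directory = file_system.get_directory_client("sales/orders")
directory.create_directory()

# Enforce security with POSIX-style permissions on the directory.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```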
How will you divide the files and how much data will each file contain?
Once you’ve decided which storage solution you’ll use, the next vital thing you’ll need to decide on and plan is how the data within the data lake will be structured. In other words, you’ll need to plan what folders you’ll use to store the data in, how these folders will be partitioned, and how you’ll name the folders and individual files.
It’s crucial that you plan these aspects carefully as they will, ultimately, determine how easily you’ll be able to navigate through the data stored in the data lake.
The next step will be to plan how you’ll divide the files and how much data each file should contain. Here, you’ll typically need to consider the amount of data you already have, and how fast the volume of your data increases. Using this information, you’ll then be able to decide how you’ll split data into files.
For example, with dictionary data, you’ll usually use one file for all the data in a dictionary table, irrespective of how much data the table stores. In contrast, with transaction data, you’ll have a choice of storing data for one day, one month, one year, or longer or shorter depending on your specific needs and requirements.
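To illustrate what such a convention might look like, here's a small sketch that builds a single path for a dictionary table and date-partitioned paths for a transaction table, then uploads a daily extract into its partition folder. The folder layout, names, and the local file are assumptions for illustration only, not a prescribed standard.

```python
# A minimal sketch of one possible folder/file naming convention:
# dictionary tables get a single file, transaction tables are partitioned by date.
# Paths, container names, and the local export file are placeholders.
from datetime import date
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder


def dictionary_path(table: str) -> str:
    # One file per dictionary table, regardless of how much data it holds.
    return f"sales/dictionaries/{table}/{table}.parquet"


def transaction_path(table: str, day: date) -> str:
    # One file per day for transaction tables, partitioned by year/month/day.
    return (f"sales/transactions/{table}/year={day:%Y}/month={day:%m}/day={day:%d}/"
            f"{table}_{day:%Y%m%d}.parquet")


service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
file_system = service.get_file_system_client("raw")

# Upload a daily extract of a hypothetical Orders table into its partition folder.
target = transaction_path("orders", date(2021, 9, 30))
file_client = file_system.get_file_client(target)
with open("orders_20210930.parquet", "rb") as data:  # assumed local export
    file_client.upload_data(data, overwrite=True)
```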
What file format will you use to extract the data to?
The next decision you’ll be faced with is what format you’ll use to extract the data to. Although this may sound like a simple choice, it’s crucial to get it right as the file format has a significant impact on the final data size.
Typically, you’ll have a choice between the following file formats:
- Avro format
- Binary format
- Delimited text format
- Excel format
- JSON format
- ORC format
- Parquet format
- XML format
You’ll need to carefully consider the file format your data is in and what the effect will be if you store it in one of the above formats. For example, moving your data from text files created from a SQL database could increase the data size significantly with some formats, while others could reduce it.
Making the right choice can not only significantly reduce the amount of storage you need but can also substantially reduce the time it takes to transfer your data to the cloud.
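One rough way to test this before committing is to export a sample of your data in a few candidate formats and compare the resulting file sizes. The sketch below does exactly that with pandas for delimited text, JSON, and Parquet; the sample data is invented purely for illustration, and the actual differences will depend on your own data.

```python
# A minimal sketch: export the same sample data in several formats and compare sizes.
# Requires pandas and pyarrow; the sample DataFrame is purely illustrative.
import os
import pandas as pd

sample = pd.DataFrame({
    "order_id": range(1, 100_001),
    "customer": ["customer_%d" % (i % 500) for i in range(100_000)],
    "amount": [round(i * 0.37, 2) for i in range(100_000)],
})

sample.to_csv("sample.csv", index=False)          # delimited text format
sample.to_json("sample.json", orient="records")   # JSON format
sample.to_parquet("sample.parquet", index=False)  # Parquet (columnar, compressed)

for path in ("sample.csv", "sample.json", "sample.parquet"):
    print(path, os.path.getsize(path), "bytes")
```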
Once your planning is done, you can proceed to extract the data and transfer it to your data lake. Here, you have several options, such as the Azure CLI and PowerShell. One of the best options, though, is AzCopy, a command-line utility specifically designed for the high-performance copying of data into Azure Blob storage. As such, it's a fast and efficient way to transfer your data from your on-premises storage to Azure.
There are, however, some things you need to keep in mind when copying your data to your Azure Data Lake Storage Gen 2. For one, you shouldn’t run the utility on the same machine that runs your production workloads because the resources needed could interfere with your production workload.
You should also aim to create a storage account in a region near where your on-premises storage is located to ensure that the transfer happens faster. Finally, AzCopy creates a temporary journal file when transferring the data which enables it to restart the transfer if it’s interrupted. You should, therefore, make sure that you have enough storage space available to store the journal files.
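For reference, a transfer with AzCopy typically boils down to a single recursive copy command. Here's a minimal sketch that wraps that command in Python; the local folder, storage account, container, and SAS token are all placeholders.

```python
# A minimal sketch: invoke AzCopy from Python to copy a local folder recursively
# into a Data Lake Storage Gen2 container. All paths, the account name, and the
# SAS token are placeholders; AzCopy must already be installed and on the PATH.
import subprocess

source = r"C:\exports\sales"  # placeholder local folder
destination = "https://<storage-account>.dfs.core.windows.net/raw/sales?<SAS-token>"  # placeholder

subprocess.run(["azcopy", "copy", source, destination, "--recursive"], check=True)
```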
Using the Data
Remember, the ultimate goal of having a modern data warehouse built on Azure is to serve the data to Power BI dashboards across an entire enterprise. To achieve that, you’ll need to load the files in your data lake into the data warehouse.
Here, you’ll use PolyBase to load the files into the data warehouse. It uses Azure Synapse’s Massively Parallel Processing (MPP) architecture, which makes it the fastest way to load data into Azure Synapse.
Loading the data into Azure Synapse is a two-step process. During the first step, you’ll create a set of external tables for the data. These external tables are merely table definitions that point to data stored outside Azure Synapse, in our case, in the data lake. It’s important to note that this step does not move any data into the warehouse.
During the next step, you’ll create staging tables and load the data into these staging tables. During this step, the data is copied into the warehouse. Once the data is copied into Azure Synapse, you’ll transform the data and move it into production tables that are suitable for semantic modeling.
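To give you a feel for what these two steps look like in practice, here's a hedged sketch of the kind of T-SQL you might run against a dedicated SQL pool, executed from Python with pyodbc. The external data source, file format, schemas, tables, and columns are all placeholders, and the exact statements will depend on your files and your authentication setup.

```python
# A hedged sketch (not a drop-in script): step 1 defines external tables over files
# in the data lake, step 2 loads them into a staging table with CTAS.
# Connection string, schemas, table, and column names are placeholders; the ext and
# stg schemas are assumed to exist already.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<synapse-workspace>.sql.azuresynapse.net;Database=<sql-pool>;"
    "Authentication=ActiveDirectoryInteractive;"  # assumed authentication method
)
conn.autocommit = True
cursor = conn.cursor()

# Step 1: external table definitions only point at the files; no data is moved yet.
cursor.execute("""
CREATE EXTERNAL DATA SOURCE AzureDataLake
WITH (TYPE = HADOOP, LOCATION = 'abfss://raw@<storage-account>.dfs.core.windows.net');
""")  # depending on authentication, a database-scoped credential may also be required
cursor.execute("CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (FORMAT_TYPE = PARQUET);")
cursor.execute("""
CREATE EXTERNAL TABLE ext.Orders (OrderID int, CustomerID int, Amount decimal(18,2), OrderDate date)
WITH (LOCATION = '/sales/transactions/orders/', DATA_SOURCE = AzureDataLake, FILE_FORMAT = ParquetFormat);
""")

# Step 2: CTAS copies the data into a staging table inside the dedicated SQL pool.
cursor.execute("""
CREATE TABLE stg.Orders
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Orders;
""")
```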
Next, you’ll load the data into a tabular model in Azure Analysis Services. During this step, you’ll typically create a semantic model using SQL Server Data Tools (SSDT). Here, you also have the option of creating a semantic model by importing it from a Power BI Desktop file.
It’s important to keep in mind here that you’ll need to add the relationships to the semantic model so that you can join data across tables. This is simply because Azure Synapse doesn’t support foreign keys. When you’re done with this step, you’ll be able to visualize your data in Power BI.
Power BI has two options to connect to Azure Analysis Services so that you can visualize your data. The first is to import your data into Power BI. The other option is to use Live Connection where Power BI pulls data directly from Azure Analysis Services.
Although the choice, ultimately, depends on your specific needs and requirements, it’s recommended to use Live Connection because you don’t need to copy data into Power BI.
When visualizing your data, there are also some things you need to consider. For one, Azure Analysis Services is specifically designed to handle the query requirements of a Power BI dashboard. As a result, it’s a recommended practice to query Azure Analysis Services directly from Power BI.
Considering the above, the other thing you need to keep in mind is that you should avoid running queries directly against the data warehouse. Doing so could impact performance, as refreshing the dashboard counts against the number of concurrent queries.
Expanding the Capabilities and Features
We mentioned earlier that a modern data warehouse should be implemented in phases and our example above perfectly illustrates what the first phase of the implementation could look like. So, what does the implementation look like in later stages when we want to incorporate more features into the data warehouse?
In this example, we’ll build on the previous example and add some important features that are vital for modern data warehouse implementations. These features include:
- Automating the pipeline using Data Factory.
- Incremental loading of data.
- Integrating data from multiple sources.
- Loading and using binary data like geospatial data, images, and other media.
In this example, the Azure architecture consists of:
- On-premises SQL Server and external data as data sources.
- Blob storage for storing data before loading it into Azure Synapse.
- Azure Data Factory to orchestrate and automate the movement and transformation of data and coordinate the various stages of the extract, load, transform (ELT) process.
- Azure Analysis Services that provides data modeling capabilities.
- Power BI for data analysis.
- Azure Active Directory to authenticate users who use Power BI to connect to Azure Analysis Services.
Data Pipeline and Incremental Loading
To ingest your data into the data warehouse you'll use Data Factory pipelines. These pipelines are logical groupings of activities that work together to perform a specific task. For instance, a pipeline could contain a set of activities that ingest and clean data from a variety of systems and then starts a data flow to analyze this data.
Another example would be when you use a copy activity to copy external data from, for instance, a SQL database to your Azure Blob Storage. This is similar to our example where we’ll use a pipeline to load and transform data into Azure Synapse.
One of the main benefits of using these pipelines is that it allows you to manage the activities together instead of each one individually. You, therefore, deploy and schedule the pipeline instead of deploying each activity independently.
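To make the idea of a pipeline as a logical grouping of activities a little more tangible, here's a simplified outline of how a copy pipeline is described in Data Factory's JSON format, written as a Python dictionary. The dataset and pipeline names are invented and the structure is trimmed down; the authoritative definition is whatever the Data Factory authoring UI or your ARM template produces.

```python
# A simplified, illustrative outline of a Data Factory pipeline definition
# (normally authored in the ADF UI or as JSON/ARM). All names are placeholders.
pipeline = {
    "name": "LoadSalesToDataLake",
    "properties": {
        "activities": [
            {
                "name": "CopyOrdersToDataLake",
                "type": "Copy",
                "inputs": [{"referenceName": "OnPremSqlOrders", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "DataLakeOrdersParquet", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},  # read from the on-premises SQL database
                    "sink": {"type": "ParquetSink"},        # write Parquet files to the data lake
                },
            }
        ]
    },
}
```

Because the copy activity lives inside the pipeline, you deploy, schedule, and monitor the pipeline as a single unit rather than managing each activity on its own.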
In contrast to our first example, this architecture will also implement the incremental loading of data. When you use an automated ELT process, it’s far more efficient to load only new data, or in other words, only data that has changed, into the data warehouse compared to loading all the data.
To identify which data has changed, you can use temporal tables in SQL Server. Also known as system-versioned tables, these tables provide information about the data stored in the table at any point in time by automatically recording the history of all changes in a separate history table. From an ELT perspective, you can then query the historical data to determine whether an incremental load should be performed.
Ultimately, Data Factory will perform an incremental load if there are any changes. Keep in mind that, after a new batch of data is loaded into the warehouse, you'll need to refresh the Analysis Services tabular model. It's also important to keep in mind that data cleansing should be part of your ELT process to ensure good quality data.
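As an illustration of how temporal tables support this, here's a hedged sketch that creates a system-versioned source table and then checks, from the ELT side, whether any rows have changed since the last load. The connection string, table, columns, and watermark value are placeholders.

```python
# A hedged sketch: a system-versioned (temporal) source table, plus a check for
# rows that changed since the last load. Names and the watermark are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=<source-sql-server>;"
    "Database=<source-db>;Trusted_Connection=yes;"  # assumed on-premises source
)
conn.autocommit = True
cursor = conn.cursor()

# One-time setup on the source: SQL Server records every change in a history table.
cursor.execute("""
CREATE TABLE dbo.Orders (
    OrderID   int           NOT NULL PRIMARY KEY CLUSTERED,
    Amount    decimal(18,2) NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory));
""")

# At load time: has anything been inserted or updated since the previous load?
last_load = "2021-09-29T00:00:00"  # placeholder watermark from the previous run
cursor.execute("SELECT COUNT(*) FROM dbo.Orders WHERE ValidFrom > ?", last_load)
changed_rows = cursor.fetchone()[0]
if changed_rows > 0:
    print(f"{changed_rows} rows changed; trigger an incremental load.")
```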
Multiple Data Sources
In contrast to our first example, we’ll now incorporate multiple sources of data. Here, your data factory will orchestrate and coordinate the extraction of data from the external sources into your Blob storage or Azure Data Lake Storage Gen2 by using a Copy Activity to move the data from both on-premises and cloud data sources. Like before, you're able to copy data to your data lake in one of the file formats mentioned earlier.
From here, Data Factory can copy data directly into Azure Synapse by using the Blob storage connector. It’s important to keep in mind, however, that the Blob storage connector only supports certain authentication types, like account key authentication, shared access signature authentication, and system-assigned managed identity authentication, amongst others.
As such, it requires a connection string or shared access signature and can, therefore, not be used to copy a blob with public read access. For a blob with public read access, you would need to use PolyBase to create an external table over your blob storage and copy the external table to Azure Synapse.
Binary Data
When working with binary data, Data Factory’s Copy Activity will also be able to copy from your data source to your data lake and on to Azure Synapse. It’s important to note, however, that when using the copy activity, you can only copy from binary data to binary data.
You should also keep in mind that PolyBase, when used as the workaround described above, only supports maximum column sizes of 8,000 bytes. In this case, you’ll have to break the data up into smaller pieces during copying and then reassemble the pieces after copying is done.
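To picture that workaround, here's a schematic sketch that splits a binary payload into pieces of at most 8,000 bytes and reassembles them afterwards. This only illustrates the idea; it's not Data Factory's or PolyBase's actual implementation.

```python
# A schematic illustration only: split binary data into <= 8,000-byte pieces
# before copying, then reassemble the pieces afterwards.
CHUNK_SIZE = 8000  # the maximum column size in bytes mentioned above


def split_blob(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble_blob(chunks: list[bytes]) -> bytes:
    return b"".join(chunks)


with open("site_plan.png", "rb") as f:  # assumed example of binary media
    original = f.read()

pieces = split_blob(original)       # store each piece as its own value during the copy
restored = reassemble_blob(pieces)  # after copying, stitch the pieces back together
assert restored == original
```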
The Bottom Line
It’s no secret that, in today’s competitive market, agility is absolutely crucial to success. Not only does it allow organizations to adapt to changing market conditions and take advantage of business opportunities when they arise, but it also makes them more efficient.
So, when you want to give your organization the agility and flexibility required by today’s competitive market, it’s vital that you implement or modernize your data warehouse. It allows you to combine the data from all your operational areas into one source of truth and gives you increased scalability, flexibility, and automation capabilities.
Hopefully, this post helped illustrate how you can implement a modern data warehouse. To learn more about modern data warehouses or about how our automated, low-code, drag-and-drop Data Estate Builder can help you implement and operate a modern data warehouse without writing any code, visit our website for more details.
About TimeXtender
TimeXtender is an automated, low-code, drag-and-drop Data Estate Builder that empowers you to build a Modern Data Estate up to 10x faster than standard methods, prepare data for analysis, and make quality business decisions with data, mind, and heart. We do this for one simple reason: because time matters.
As a Microsoft Gold Certified Partner, we serve our 3,000+ customers, from mid-sized companies to Fortune 500, through our global network of partners.
Visit the TimeXtender customer page to learn how we are helping clients build reliable data estates faster than standard methods.