Data Warehouse, Data Lake, Data Hub or a Data Platform?

Written by Guest Post from E-mergo | September 16, 2020

The growth in data volumes, data sources and analysis options means there are now many ways to store and process data. All kinds of concepts are on the table, such as Data Lake, Data Hub, Data Warehouse and Data Platform. A Gartner study showed that demand for Data Hubs increased by 20% between 2018 and 2019. Interestingly, Gartner noted that more than 25% of customers thought that a Data Hub was a Data Lake solution [1]. Gartner's research illustrates how much confusion there is about what the different concepts entail. We see the same ambiguity in practice, so how do the concepts differ from each other? This blog clarifies the meaning of these terms.

Data Warehouse

A Data Warehouse (DWH) is an integrated store of information intended to support business decisions and analyses. With a data warehouse you lay the foundation for Business Intelligence (BI) and Analytics. The data in a data warehouse often comes from different sources inside or outside the organization. Because a data warehouse brings together huge amounts of data from different data sources (HRM, DMS, ERP), valuable insights can be gained. A data warehouse saves time for companies that collect data on a large scale and ensures that business information is defined uniformly. If you want an integrated view of your business, a data warehouse is important.
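As a rough illustration of what "integrated" and "uniform definitions" mean here, the sketch below combines two invented source extracts into one warehouse-style table. The field names, the sample rows and the function build_employee_revenue are all hypothetical; a real warehouse would do this with ETL tooling and a database rather than Python lists.

```python
# Hypothetical extracts from two source systems. Note that each system
# uses its own name for the same employee key.
hrm_rows = [
    {"emp_no": "E001", "full_name": "Ann Smith", "dept": "Sales"},
]
erp_rows = [
    {"employee": "E001", "revenue_eur": 1200.0},
    {"employee": "E001", "revenue_eur": 800.0},
]


def build_employee_revenue(hrm, erp):
    """Integrate both sources under one shared set of column definitions."""
    # Sum ERP revenue per employee key.
    totals = {}
    for row in erp:
        key = row["employee"]
        totals[key] = totals.get(key, 0.0) + row["revenue_eur"]

    # Emit one row per employee, using the warehouse's own column names.
    return [
        {
            "employee_id": person["emp_no"],
            "name": person["full_name"],
            "department": person["dept"],
            "revenue_eur": totals.get(person["emp_no"], 0.0),
        }
        for person in hrm
    ]


for row in build_employee_revenue(hrm_rows, erp_rows):
    print(row)
```

The point of the sketch is that the mapping from source-specific names (emp_no, employee) to one agreed name (employee_id) is done once, at load time, so every report built on top uses the same definition.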

Data Lake

A data lake is a storage location in which large amounts of raw data are stored in their original structure. In terms of content, a data lake can contain structured and unstructured data. The structure of individual files and the way they should be accessed are not known until the data is used. It is important that a data lake is not seen as a replacement storage system, but as a place where analysis and research can be done with unprecedented freedom, thanks to the relatively low cost of storage and the ease of scaling up. Data lakes are generally a good basis for reporting, visualizations, advanced analytics and machine learning.

(Un)Structured data?

Structured: data from databases, CSV, JSON, etc.
Unstructured: email, PDF, documents, video, audio, binary files

Data Hub

A data hub does not store data itself, but takes care of the flow of data between source systems, target systems and users. With a data hub you specify exactly what needs to be done with the data. For example, you can link certain information from sensors to an automated order system. The power of the source system is used as much as possible to ensure optimal performance. Often, a data hub takes the form of a hub-and-spoke architecture in which systems distribute data through the data hub, rather than through point-to-point integration in which every system is connected to every other system with which data needs to be shared.
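To make the difference between the two integration styles concrete, the small sketch below counts the connections each one needs. It assumes the worst case for point-to-point integration, where every system shares data with every other system; the function names are illustrative only.

```python
def point_to_point_links(n_systems: int) -> int:
    """Connections needed when every system is linked directly to every other."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_links(n_systems: int) -> int:
    """Connections needed when every system is linked only to the hub."""
    return n_systems


# Compare both styles for a growing number of systems.
for n in (3, 5, 10, 20):
    print(n, point_to_point_links(n), hub_and_spoke_links(n))
```

With 10 systems this gives 45 direct connections against 10 connections to the hub. In practice not every pair of systems exchanges data, so the point-to-point figure is an upper bound, but the gap still widens quickly as systems are added.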

In addition, a data hub gives organizations the insight they need to interpret data properly. If you understand what you're looking at, it becomes easier to ensure the accuracy of data or adjust it where necessary. You can literally see how datasets are constructed, down to the column and row level. Moreover, it helps you comply with laws and regulations, because you know exactly who has access to what data and where data is stored. Unlike in a data warehouse, the data in a data hub is not necessarily integrated and can contain different levels of detail side by side. In contrast to a data lake, a data hub can offer data in different formats. Where data warehouses and data lakes are endpoints for data, a data hub is a node through which data flows.

Data Platform

A data platform, also known as a data management platform, is an integrated solution that combines the functionality of a data lake, a data warehouse and a data hub with elements of a Business Intelligence (BI) platform. Without a data platform, a separate tool or set of tools is usually used for each aspect. This creates a complex landscape in which many tools need to be managed to make data flow from source to end user. A data platform centralizes these solutions in one tool and thus delivers a product that is far more manageable.

Differences

Schema on read/write?

Schema-on-read: data is stored unchanged; a structure is applied only when the data is read
Schema-on-write: data is transformed and stored in a predefined structure
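
The two approaches can be sketched in a few lines of Python. This is a minimal illustration only: the sensor records, the field names and the helper functions read_temperatures and to_warehouse_row are invented for the example, and plain lists stand in for the data lake and the data warehouse.

```python
import json

# Hypothetical raw events whose shapes differ slightly from each other.
raw_events = [
    '{"sensor": "s1", "temp": "21.5", "ts": "2020-09-16T10:00:00"}',
    '{"sensor": "s2", "temperature": 19, "note": "recalibrated"}',
]

# Schema-on-read: store the lines exactly as they arrived.
lake = list(raw_events)


def read_temperatures(lines):
    """Apply a structure only at the moment the data is read."""
    for line in lines:
        record = json.loads(line)
        value = record.get("temp", record.get("temperature"))
        yield record["sensor"], float(value)


def to_warehouse_row(line):
    """Schema-on-write: transform to a fixed structure before storing."""
    record = json.loads(line)
    value = record.get("temp", record.get("temperature"))
    if value is None:
        raise ValueError("record has no temperature field")
    # Only the predefined columns are kept; everything else is dropped.
    return {"sensor_id": str(record["sensor"]), "temperature_c": float(value)}


warehouse = [to_warehouse_row(line) for line in raw_events]

print(list(read_temperatures(lake)))
print(warehouse)
```

In the schema-on-read case the extra fields (ts, note) remain available for later analysis, at the cost of every reader having to interpret the raw data. In the schema-on-write case readers get a clean, predictable table, but anything outside the predefined structure is lost at load time.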

Conclusion

The huge increase in data sources and volume, together with the different data needs of different users, poses significant problems for BI/IT departments and others who work with data for analytics, artificial intelligence (AI) and BI. Organizations use a variety of tools to process and manage data. There is another way, which is why E-mergo has chosen to partner with TimeXtender. The TimeXtender platform provides a cohesive data structure for on-premises technology and the cloud. This allows you to connect to different data sources and to catalog, model, move and document data for analytics and AI purposes.

TimeXtender wants to change the traditional way of BI development by automating repetitive work. Building a traditional data platform involves a lot of repetitive and time-consuming work. With TimeXtender, you can make the transition to an integrated data platform that delivers data insights 5 to 10 times faster thanks to automation. This allows you to save up to 80% on management and develop 70% faster.

[1] https://www.gartner.com/en/documents/3980938

This blog post first appeared on the E-mergo website.

Want to know more about TimeXtender from E-mergo? Watch the platform in action during one of E-mergo's live demos via Microsoft Teams or check out one of their other resources.
