The term "Data Lake", "Data Warehouse" and "Data Mart" are often times used interchangbly. But what are exactly the differences between these things? This post attempts to help explain the similarity, the difference and when to use each.
A high-level comparison of these three constructs is as below:
A data lake is the place where you dump all forms of data generated in various parts of your business: structured data feeds, chat logs, emails, images (of invoices, receipts, checks etc.), and videos. The data collection routines does not filter any information out; data related to canceled, returned, and invalidated transactions will also be captured, for instance.
Why data lake?
A data lake is relevant in two contexts:
1- Your organization is so big and your product does so many functions that there are many possible ways to analyze data to improve the business. Thus, you need a cheap way to store different types of data in large quantities.
Eg. Twitter in the B2C space (They have text (Tweets), Images, Videos, Links, Direct Messages, Live Streams, etc.), and Square (B2B) (Transactions, Returns, Refunds, Customer Signatures, Logon IDs etc.).
2- You don't have a plan for what to do with the data, but you have a strong intent to use it at some point. Thus, you collect data first and analyze later.
Also, the volume is so high that traditional DBs might take hours if not days to run a single query. So, having it in a Massively Parallel Processor (MPP) infrastructure helps you analyze the data comparatively quickly.
A data warehouse usually only stores data that's already modeled/structured.
A Data Warehouse is multi-purpose and meant for all different use-cases. It doesn’t take into account the nuances of requirements from a specific business unit or function. As an example, let’s take a Finance Department at a company. They care about a few metrics, such as Profits, Costs, and Revenues to advise management on decisions, and not about others that Marketing & Sales would care about. Even if there are overlaps, the definitions could be different.
While a data-warehouse is a multi-purpose storage for different use cases, a data-mart is a subsection of the data-warehouse, designed and built specifically for a particular department/business function.
Some benefits of using a data-mart:
Isolated Security: Since the data-mart only contains data specific to that department, you are assured that no unintended data access (finance data, revenue data) are physically possible.
Isolated Performance: Similarly, since each data-mart is only used for particular department, the performance load is well managed and communicated within the department, thus not affecting other analytical workloads.
Subscribe to Holistics Blog
Get the latest posts delivered right to your inbox
From SQL Queries To Beautiful Charts
Connect to your database and build beautiful charts with Holistics BILearn More
"Holistics is the solution to the increasingly many and complex data requests from the operational teams"
Tang Yee Jie
Senior Data Analyst, Grab