5 Components of Data Warehouse Architecture
What are the 5 components of a data warehouse? The data warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load). Schema is made up of metadata (data about the data) so users could easily find what they were looking for. The data warehouse could also connect to many different data sources, so it became an easier way to manage all of a company’s data for analysis.
As the data warehouse grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.
Here’s a quick high level architecture of the data warehouse:
- SQL Query Processing
- Data layer
As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works, but there are many different resources where you can learn about ETL.
Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include authors, dates, or locations of an article, create date of a file, the size of a file, etc. Think of it like the titles of a column in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.
SQL Query Processing
SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.
The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.
This is related to the data layer in that you need to be able to provide fine grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.
The Challenges with a Data Warehouse
Now that I’ve laid out the five key components of a data warehouse, let’s discuss some of the challenges of the data warehouse. As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work.
We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:
The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads.
If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse.
A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understand of what it is and how it’s used.
A relational database that is designed for query and analysis rather than for transaction processing. Learn more here.