Data Warehouse: A Comprehensive Guide
A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. It is typically composed of operational data that has been aggregated and organized in such a way that facilitates the requirements of the data teams. Data consumers need/want to be able to do their work at a very high speed to make decisions. By design, there is usually some level of latency involved in data appearing in a warehouse, keep that in mind when designing your systems and what the requirements are for your users. In this article, we’re going to review the data warehouse types, the different types of architecture, and the different warehouse model types.
Data Warehouse Architecture Types
The various data warehouse architecture types break down into three categories:
Single-tier architecture – The objective of this architecture is to dramatically reduce data duplication and produce a dense set of data. While this design keeps the volume of data as low as possible, it is not appropriate for complex data requirements that include numerous data sources.
Two-tier architecture – This architecture design splits the physical data from the warehouse itself, making use of a system and a database server. This design is typically used for a data mart in a small organization, and while efficient at data storage, it is not a scalable design and can only support a relatively small number of users.
Three-tier architecture – The three-tier architecture is the most common type of data warehouse as it provides a well-organized flow of your raw information to provide insights. It is comprised of the following components:
- Bottom tier – comprises the database of the warehouse servers. It creates an abstraction layer on the various information sources to be used in the warehouse.
- Middle tier – includes an OLAP server to provide an abstracted view of the database for the users. Being pre-built into the architecture, this tier can be used as an OLAP-centric warehouse.
- Top tier – comprises the client-level tools and APIs that are used for data analysis and reporting.
Data Warehouse Model Types
The data warehouse model types break down into four categories:
- Enterprise Data Warehouse
An EDW is a centralized warehouse that collects all the information on subjects across the entire organization. These tend to be a collection of databases as opposed to one monolith, that provides a unified approach to querying data by subject.
- Data Mart
Consisting of a subset of a warehouse that is useful for a specific group of users. Consider a marketing data mart that is populated with data from ads, analytics, social media engagement, email campaign data, etc. This enables the (marketing) department to rapidly analyze their data without the need to scan through volumes of unrelated data. A data mart can be further broken into “independent”, where the data stands alone, or “dependent” where the data is coming from the warehouse.
- Operational Data Store
The ODS might seem slightly counterintuitive to start with as it is used for operational reporting, and typically we don’t want to do reporting and analytic workloads on operational data. It is a synergistic component for the previously mentioned EDW and used for reporting on operational types of data. Low-velocity data that is managed in real-time, such as customer records or employee records, are typical of this kind of store.
- Virtual Warehouse
The Virtual Warehouse is maybe a questionable inclusion, but nonetheless important. This is implemented as a set of views over your operational database. They tend to be limited in what they can make available due to the relationships in the data, and the fact that you don’t want to destroy your operational database performance by having large numbers of analytic activities taking place on it at the same time.
A warehouse provides an environment that fosters the ability to do drill-down analysis on your data looking for insights. As a data analyst is looking for trends or actionable insights, the ability to navigate through various data dimensions easily is paramount. The warehouse approach allows you to store and analyze vast amounts of information, which also comes at a cost for storage and compute. You can mitigate some of these costs by optimizing your warehouse for data retrieval. Picking a DW design and sticking with it, and ensuring that your data has been cleansed and standardized prior to loading.
Another option to the warehouse is the growing data lake approach, where information can be read in place from an object store such as (AWS) S3. Some advantages are reduced costs and latency as the load to the DW is no longer necessary. The Community Edition of the Presto managed service from Ahana is a great way to try out the data lake to test your requirements.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.
Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article