Data Warehouse Concepts for Beginners
A data warehouse is a relational database designed for query and analysis rather than for transaction processing. It typically contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from diverse sources. Building one requires an Extract, Transform, and Load (ETL) process that pulls data from those diverse sources and creates another copy inside the data warehouse to support SQL queries and analysis.
The following data design techniques are commonly used to facilitate data retrieval for analytical processing:
Star Schema: The simplest and most foundational schema in data warehouse modeling. It contains one or more fact tables that reference any number of dimension tables. Its graphical representation looks like a star, hence the name. Fact tables are usually very large compared to dimension tables, and dimension tables can contain redundant data because they are not required to be normalized.
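To make the shape concrete, here is a minimal sketch of a star schema using SQLite and hypothetical sales tables (the table and column names are illustrative, not from the original article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables: small and denormalized, one row per member.
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT,
        country     TEXT  -- redundant with city, acceptable in a star schema
    )""")
conn.execute("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    )""")

# Fact table: very large in practice, narrow rows keyed by the dimensions.
conn.execute("""
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        sale_date   TEXT,
        quantity    INTEGER,
        amount      REAL
    )""")
```

Every dimension key points directly at the fact table, giving the one-level "star" shape.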
Snowflake Schema: An extension of the star schema in which a centralized fact table references a number of dimension tables, but those dimension tables are further normalized into multiple related tables. The entity-relationship diagram of this schema resembles a snowflake, hence the name.
Data Warehouse Example
Consider a fact table that stores sales quantities for each product and customer at a given time. Sales quantity is the measure here, and the primary keys of the customer, product, and time dimension tables flow into the fact table as foreign keys. Additionally, products can be grouped into product families stored in a separate table, whose primary key also appears in the product table as a foreign key. Such a construct is called a snowflake schema, because the product dimension is further snowflaked into a product family table.
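The example above can be sketched in SQLite as follows (table names and sample rows are hypothetical). Note the extra join that snowflaking introduces when aggregating by product family:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product_family (
        family_id   INTEGER PRIMARY KEY,
        family_name TEXT
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        family_id  INTEGER REFERENCES dim_product_family(family_id)  -- snowflaked
    );
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER,
        sale_date   TEXT,
        quantity    INTEGER
    );
    INSERT INTO dim_product_family VALUES (1, 'Beverages');
    INSERT INTO dim_product VALUES (10, 'Espresso Beans', 1);
    INSERT INTO fact_sales VALUES (10, 100, '2023-01-15', 3),
                                  (10, 101, '2023-01-16', 2);
""")

# Aggregating sales by product family needs one more join
# than it would in a star schema.
rows = conn.execute("""
    SELECT f.family_name, SUM(s.quantity)
    FROM fact_sales s
    JOIN dim_product p        ON p.product_id = s.product_id
    JOIN dim_product_family f ON f.family_id  = p.family_id
    GROUP BY f.family_name
""").fetchall()
print(rows)  # [('Beverages', 5)]
```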
Figure 1 shows a typical snowflake schema design.
ETL or ELT—Extracting, Transforming, and Loading Data
Besides data modeling and schema design, building a data warehouse involves the critical task of ETL: compiling data into the warehouse from other sources.
In data extraction, we move data out of the source systems, which could be relational databases, NoSQL databases, or streaming data sources. The challenge during this step is to identify the right data and to manage access control.
In data pipelines or batch workloads, we frequently move large amounts of data from different source systems to the data warehouse. Here the challenges are to plan a realistic SLA and to have reliable, fast network and infrastructure.
In data transformation, we format data so that it can be represented consistently in the data warehouse. The original data might reside in different databases using different data types or in different table formats, or in different file formats in different file systems.
In data loading, we load the data into the fact tables correctly, with error-handling procedures in place.
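The three ETL steps above can be sketched in a few lines of Python. This is a deliberately tiny illustration, not a production pipeline: the raw CSV, the date format, and the table name are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (Extract):
# dates and quantities arrive as text, one row is malformed.
raw = io.StringIO(
    "product_id,sale_date,quantity\n"
    "10,15/01/2023,3\n"
    "10,16/01/2023,bad\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, quantity INTEGER)"
)

loaded, errors = [], []
for row in csv.DictReader(raw):
    try:
        # Transform: normalize the date to ISO format and cast types.
        day, month, year = row["sale_date"].split("/")
        loaded.append((int(row["product_id"]),
                       f"{year}-{month}-{day}",
                       int(row["quantity"])))
    except ValueError:
        # Error handling: quarantine bad rows instead of failing the load.
        errors.append(row)

# Load: insert only the clean rows into the fact table.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", loaded)
print(len(loaded), len(errors))  # 1 1
```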
Data Warehouse To Data Lake To Data Lakehouse
A data lake is a centralized file system or object store designed to store, process, and secure large amounts of structured, semi-structured, or unstructured data. It can store data in its native format and process any variety of it. Examples of data lake storage include HDFS, AWS S3, ADLS, and GCS.
Data lakes use the ELT (Extract, Load, Transform) process, while data warehouses use the ETL (Extract, Transform, Load) process. With a SQL engine like Presto, you can run interactive queries, reports, and dashboards directly on a data lake, without creating yet another data warehouse or copy of your data and adding operational overhead.
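The ELT ordering can be illustrated compactly. A real lakehouse would land files in object storage and transform them with an engine like Presto; here SQLite stands in for that engine, and all names are hypothetical, but the pattern is the same: load the raw data first, then transform it in place with SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ELT: land the raw data first, exactly as extracted (Load)...
conn.execute("CREATE TABLE raw_sales (product_id TEXT, quantity TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("10", "3"), ("11", "2")])

# ...then transform in place with SQL, inside the same engine (Transform).
conn.execute("""
    CREATE TABLE sales AS
    SELECT CAST(product_id AS INTEGER) AS product_id,
           CAST(quantity   AS INTEGER) AS quantity
    FROM raw_sales
""")
total = conn.execute("SELECT SUM(quantity) FROM sales").fetchone()[0]
print(total)  # 5
```

Because the raw data is preserved, the transformation can be rerun or revised later without re-extracting from the source systems.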
A data lake is just one element of an Open Data Lakehouse, which combines the benefits of a data warehouse and a data lake. An Open Data Lakehouse, however, is much more than that: it is the entire stack. In addition to hosting a data lake (e.g., AWS S3) and a SQL engine (Presto), it also provides governance (e.g., AWS Lake Formation) and ACID transactions. Transaction support is achieved using technologies and projects such as Apache Hudi, while Presto is the SQL engine that sits on top of the cloud data lake you are querying. In addition, there is Ahana Cloud, a managed service for Presto designed to simplify the process of configuring and operating Presto.
As cloud data warehouses become more cost-prohibitive and limited by vendor lock-in, and as data mesh and data federation approaches fall short on performance, more and more companies are migrating their workloads to an Open Data Lakehouse. If all your data is going to end up in cloud-native storage like Amazon S3, ADLS Gen2, or GCS, then the most optimized and efficient data strategy is to leverage an Open Data Lakehouse stack, which provides much more flexibility and remedies the challenges noted above. Building an Open Data Lakehouse is a difficult task. As an introduction to the process, check out the on-demand presentation, How to build an Open Data Lakehouse stack. In it you'll see how to build your stack in more detail, incorporating technologies like Ahana, Presto, Apache Hudi, and AWS Lake Formation.