Redshift Data Warehouse Architecture Explained
Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).
You might already be familiar with Redshift basics – but in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.
Redshift Architecture and Main Components
As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse either through native AWS services such as Amazon AppFlow or through a variety of third-party tools such as Fivetran and Matillion. Many of these tools also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.
Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node and compute nodes, which are further divided into node slices; each cluster can also host multiple databases. This design allows Redshift to dynamically allocate resources in order to answer queries efficiently.
Breaking Down the Redshift Cluster Components
- The leader node is Redshift’s ‘brain’ and manages communication with external client programs as well as internal communication between compute nodes. When a query is submitted, the leader node parses it, compiles the code, and creates an execution plan.
- Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
- The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
- Intermediate results are sent back to the leader node, which performs the final aggregation and sends the results to client applications via ODBC or JDBC. These are frequently reporting and visualization tools such as Tableau or Amazon QuickSight, or internal software applications that read data from Redshift.
- Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
- Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.
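The scatter-gather flow described above can be illustrated with a loose Python sketch. This is an analogy only, not Redshift's actual implementation: the "leader" partitions the data, the "slices" compute partial aggregates in parallel, and the leader combines them into the final answer (real node slices are separate processes on separate machines, not threads).

```python
from concurrent.futures import ThreadPoolExecutor

def slice_scan(chunk):
    # Each 'node slice' computes a partial aggregate over its share of the rows.
    return sum(chunk)

def leader_query(data, n_slices=4):
    # The 'leader node' partitions the work, scatters it, and gathers partials.
    chunks = [data[i::n_slices] for i in range(n_slices)]
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        partials = list(pool.map(slice_scan, chunks))
    # Final aggregation happens on the leader before results go to the client.
    return sum(partials)

print(leader_query(list(range(1_000))))  # same answer as sum(range(1_000)): 499500
```

Distributing both the scan and the partial aggregation is what lets the cluster scale with the number of slices rather than bottlenecking on a single node.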
This AWS presentation offers more details about Redshift’s internal architecture, and a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum.
Additional Performance Features
In addition to these core components, Redshift has multiple built-in features meant to improve performance:
- Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and indexing.
- Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations.
- Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time, ensuring final aggregations are returned faster.
- Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that could include multi-table joins and subqueries.
- Result caching: The results of certain types of queries can be stored in-memory on the leader node, which can also reduce query execution time.
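The compression benefit of columnar storage can be shown with a small, hedged sketch (the table and values below are hypothetical, and Redshift's actual encodings are more sophisticated than plain run-length encoding): storing values column by column groups repeated values together, so they compress far better than interleaved rows.

```python
from itertools import groupby

# Hypothetical rows: (user_id, country, status).
rows = [(1, "US", "active"), (2, "US", "active"),
        (3, "US", "inactive"), (4, "DE", "active")]

def to_columns(rows):
    # Column-oriented layout: one sequence per attribute instead of one per record.
    return [list(col) for col in zip(*rows)]

def rle(column):
    # Run-length encoding: adjacent duplicates collapse to one (value, count) pair.
    return [(value, len(list(run))) for value, run in groupby(column)]

columns = to_columns(rows)
# The 'country' column shrinks from 4 stored values to 2 runs:
print(rle(columns[1]))  # [('US', 3), ('DE', 1)]
```

Because an analytics query typically touches a few columns of many rows, this layout also means Redshift can skip reading the columns a query never references.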
Redshift vs Traditional Data Warehouses
While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:
- Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
- Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
- Serverless capabilities: Amazon Redshift Serverless automatically provisions compute resources in response to SQL queries, further abstracting infrastructure management by removing the need to size your cluster in advance.
Redshift Costs and Performance
Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:
- On-demand instances are charged by the hour, with no long-term commitment or upfront fees.
- Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time.
- Serverless instances are charged based on usage, so customers only pay for the capacity they consume.
The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets you might find yourself paying for very large clusters.
Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query.
Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.
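As a rough analogy only (Redshift's compile and result caches live on the leader node and are invalidated when the underlying data changes), Python's `functools.lru_cache` shows why a repeated, identical query returns faster than its first run:

```python
from functools import lru_cache
import time

@lru_cache(maxsize=128)
def run_query(sql):
    # Stand-in for a query whose first run pays a one-time compile + execute cost.
    time.sleep(0.05)  # simulated overhead, paid only on a cache miss
    return f"results for: {sql!r}"

run_query("SELECT count(*) FROM sales")  # first run: pays the slow path
run_query("SELECT count(*) FROM sales")  # identical query: served from cache
print(run_query.cache_info().hits)  # 1
```

This is why predictable, repeated workloads such as a daily BI dashboard amortize the first-run cost, while ad-hoc exploratory queries pay it on nearly every execution.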
Reducing Redshift Costs with a Lakehouse Architecture
We’ve worked with many companies that started out using Redshift when they didn’t have much data but found it difficult and costly to scale as their needs evolved.
Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.
At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse to leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.
In this architecture, companies can continue to use Redshift for workloads that require consistent performance, such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data – without having to constantly resize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.
To learn more about the considerations to weigh when evaluating data warehouses and data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.