What is AWS Redshift Spectrum?
Because Redshift Spectrum shares a name with AWS Redshift, there is some confusion about what it actually is. To answer that, it helps to first know what AWS Redshift is: an Amazon data warehouse product based on PostgreSQL version 8.0.2.
Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. From within the Redshift SQL query editor, Spectrum lets you run queries against data in S3 and combine the results with data already stored in Redshift.
Benefits of Redshift Spectrum
Compared to Athena, a similar object-store SQL engine available from Amazon, Redshift Spectrum can offer higher and more consistent performance. Athena uses pooled resources, while Spectrum's capacity is based on your Redshift cluster size and is, therefore, a known quantity.
Spectrum allows you to access your data lake files from within your Redshift data warehouse without having to go through an ingestion process. This makes data management easier, while also reducing data latency since you aren’t waiting for ETL jobs to be written and processed.
With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift, so there are no new tools to learn and you can leverage your existing skillsets. Under the hood, Spectrum breaks user queries into filtered subsets that run concurrently. These subsets can be distributed across thousands of nodes to improve performance, and the approach can scale to query exabytes of data. The results are then sent back to your Redshift cluster for final processing.
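The split-filter-combine pattern described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Spectrum's actual implementation: the `FILES` data, `scan_file`, and `total_quantity` are hypothetical names, and a thread pool stands in for the fleet of worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for S3 objects: each "file" is a list of (item, quantity) rows.
FILES = [
    [("shirt", 2), ("hat", 1), ("shirt", 5)],
    [("hat", 4), ("shirt", 1)],
    [("socks", 3), ("shirt", 2)],
]

def scan_file(rows, wanted_item):
    """Worker step: filter one file and return a partial sum for the wanted item."""
    return sum(qty for item, qty in rows if item == wanted_item)

def total_quantity(files, wanted_item, max_workers=4):
    """Scan the files concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda rows: scan_file(rows, wanted_item), files))
    # Final aggregation, loosely analogous to the step that runs on the cluster.
    return sum(partials)

print(total_quantity(FILES, "shirt"))  # prints 10
```

Each worker returns only a small partial result, so the final step handles far less data than the raw files contain.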
Performance & Price
Redshift Spectrum is only as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, performance will be extremely slow, because the whole file has to be read. Connecting to a well-partitioned collection of column-based Parquet files, on the other hand, will be much faster. Because the object stores have no indexes, you have to rely on the efficient organization of the files to get higher performance.
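To see why file organization matters, here is a minimal Python sketch of partition pruning. It assumes Hive-style partition folders (`order_date=YYYY-MM-DD`); the object keys, sizes, and helper names are hypothetical and only illustrate the idea that a filter on the partition column lets the engine skip most files.

```python
# Hypothetical object keys with Hive-style partition folders; values are sizes in MB.
OBJECTS = {
    "orders/order_date=2023-01-01/part-0.parquet": 120,
    "orders/order_date=2023-01-02/part-0.parquet": 95,
    "orders/order_date=2023-01-03/part-0.parquet": 110,
}

def partition_value(key, column):
    """Return the value of a partition column encoded in the key, or None."""
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None

def prune(objects, column, wanted):
    """Keep only the objects whose partition value matches the filter."""
    return {k: size for k, size in objects.items()
            if partition_value(k, column) == wanted}

selected = prune(OBJECTS, "order_date", "2023-01-02")
scanned_mb = sum(selected.values())  # only the matching partition is read
total_mb = sum(OBJECTS.values())     # what an unpartitioned scan would read
print(f"scanned {scanned_mb} MB of {total_mb} MB")
```

With a single unpartitioned CSV file there is nothing to skip, so every query pays for a full scan.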
As to price, Spectrum follows the per-terabyte scan model that Amazon uses for a number of its products. You are billed at $5.00 per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged about $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.
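The billing arithmetic can be checked with a short Python sketch. It assumes the $5.00-per-terabyte rate implied by the examples above and binary units (1 TB = 1,048,576 MB); `spectrum_query_cost` is a hypothetical helper, not an AWS API, and actual rates may vary by region.

```python
PRICE_PER_TB_USD = 5.00      # assumed rate, taken from the 1 TB = $5.00 example
BYTES_PER_MB = 1024 ** 2     # assumption: binary units
MB_PER_TB = 1024 ** 2
MIN_MB_PER_QUERY = 10        # 10 MB minimum per query

def spectrum_query_cost(bytes_scanned: int) -> float:
    """Estimate the scan charge in USD for a single query."""
    # Round up to the next whole megabyte using integer ceiling division.
    scanned_mb = -(-bytes_scanned // BYTES_PER_MB)
    billed_mb = max(scanned_mb, MIN_MB_PER_QUERY)
    return billed_mb * PRICE_PER_TB_USD / MB_PER_TB

print(round(spectrum_query_cost(10 * 1024 ** 3), 2))  # 10 GB scan -> 0.05
print(spectrum_query_cost(1024 ** 4))                 # 1 TB scan  -> 5.0
```

Note that a query scanning a single byte is billed the same as one scanning 10 MB, so many tiny queries can cost more than their data volume suggests.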
Redshift and Redshift Spectrum Use Case
An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your order history is contained in your Redshift data warehouse, but real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make a purchasing decision for particular items because they have a long lead time. Redshift knows what you have sold historically, but the S3 data is only loaded into Redshift monthly. With Spectrum, a single query can join what is in Redshift with the Parquet files on S3 to get an up-to-the-minute view of order volume, so a more informed decision can be made.
Amazon Redshift Spectrum adds a layer of functionality to Redshift that allows you to interact with object stores in AWS S3 without building a whole other tech stack. It makes sense for companies that are using Redshift and need to stay there but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and moving entirely to the data lake. Redshift Spectrum does not make sense for you if all your files are already in the data lake. Because Spectrum bills per terabyte scanned, it becomes very expensive as the data grows, and it offers no visibility into the queries. This is where a managed service like Ahana for Presto fits in.