Redshift vs Redshift Spectrum: A Complete Comparison
Amazon Redshift is a cloud-based data warehouse service offered by Amazon Web Services. It is a columnar database optimized for the kind of analytical queries typically run against enterprise star schemas and snowflake schemas.
Redshift Spectrum is an extension of Amazon Redshift. As a feature of Redshift, it allows the user to query data stored in Amazon S3. With Redshift Spectrum, you can continue to store and grow your data in S3 and use Redshift as one of the compute options to process it (other options include EMR, Athena, or Presto).
There are many differences between Amazon Redshift and Redshift Spectrum. Here are some of them:
An Amazon Redshift cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. The client application interacts directly only with the leader node; the compute nodes are transparent to external applications.
Redshift Spectrum queries, by contrast, are still submitted to the leader node of your Amazon Redshift cluster. The Amazon Redshift compute nodes generate multiple requests, depending on the number of objects that need to be processed, and submit them concurrently to Redshift Spectrum. The Redshift Spectrum worker nodes scan, filter, and aggregate your data from Amazon S3 and send the results back to your Amazon Redshift cluster. The final join and merge operations are then performed locally in your cluster, and the results are returned to your client.
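The fan-out and merge flow described above can be pictured with a small toy model. This is only an illustrative sketch, not how AWS implements Spectrum: the `S3_OBJECTS` dictionary, the `spectrum_worker` function, and the thread pool are all hypothetical stand-ins for S3 objects, Spectrum worker nodes, and the concurrent requests issued by the compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for S3: object key -> rows of (region, amount).
S3_OBJECTS = {
    "sales/part-0000": [("eu", 10), ("us", 5)],
    "sales/part-0001": [("eu", 7), ("us", 1), ("us", 4)],
    "sales/part-0002": [("ap", 3)],
}


def spectrum_worker(key, min_amount):
    """Scan one object, filter its rows, and return a partial aggregate."""
    partial = {}
    for region, amount in S3_OBJECTS[key]:
        if amount >= min_amount:  # the filter is applied close to the data
            partial[region] = partial.get(region, 0) + amount
    return partial


def run_query(min_amount):
    """Issue one request per object concurrently, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda key: spectrum_worker(key, min_amount), S3_OBJECTS)
        )
    merged = {}
    for partial in partials:  # the final merge happens "locally in the cluster"
        for region, total in partial.items():
            merged[region] = merged.get(region, 0) + total
    return merged


result = run_query(min_amount=4)
print(result)
```

The point of the sketch is that only small partial aggregates travel back to the cluster, while the scanning and filtering work happens on the worker side.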
Redshift Spectrum is a service that uses dedicated servers to handle the S3 portion of your queries. The AWS Glue Data Catalog is used to maintain the definitions of the external tables. Redshift loosely connects to S3 data by the following route:
External database, schema, and table definitions in Redshift use an IAM role to interact with the AWS Glue Data Catalog and with Spectrum, which handles the S3 portion of the queries.
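As a rough illustration of that route, the sketch below renders the two DDL statements typically used to register S3 data with Redshift: `CREATE EXTERNAL SCHEMA` (which names the catalog database and the IAM role) and `CREATE EXTERNAL TABLE` (which points at an S3 location). The schema, database, role ARN, bucket, and column names are all made-up placeholders, and the helper functions are hypothetical; they only build strings and assume trusted inputs.

```python
def external_schema_ddl(schema, catalog_db, iam_role_arn):
    """Render a CREATE EXTERNAL SCHEMA statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_ddl(schema, table, columns, s3_location):
    """Render a CREATE EXTERNAL TABLE statement for Parquet files in S3."""
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n{cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


# Placeholder values for illustration only.
schema_sql = external_schema_ddl(
    "spectrum", "spectrumdb", "arn:aws:iam::123456789012:role/MySpectrumRole"
)
table_sql = external_table_ddl(
    "spectrum",
    "sales",
    [("salesid", "integer"), ("eventid", "integer"), ("pricepaid", "decimal(8,2)")],
    "s3://example-bucket/sales/",
)
print(schema_sql)
print(table_sql)
```

The generated statements would be run against the cluster with any SQL client; the IAM role must be attached to the cluster and allowed to read both the catalog and the S3 location.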
Amazon Redshift is a fully managed data warehouse that efficiently stores historical data from many different sources. It is designed to ease the process of data warehousing and analytics.
Redshift Spectrum is used to perform analytics directly on data in Amazon S3 through an Amazon Redshift cluster. This lets users separate storage from compute and scale each independently.
You can use Redshift Spectrum, which is an add-on to Amazon Redshift, to query data in S3 files alongside existing data in the Redshift data warehouse. In addition to querying the data in S3, you can join it to tables residing in Redshift.
Because Amazon Redshift controls how data is stored, compressed, and queried, it has many more options for optimizing a query. Redshift Spectrum, on the other hand, only controls how the data is queried, because the storage layout is determined by the files in Amazon S3. The performance of Redshift Spectrum depends on your Redshift cluster resources and on how well the S3 storage is optimized.
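The shape of such a join can be tried out locally. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so that the query runs anywhere: an attached database named `spectrum` plays the role of the external schema over S3, and the `event` table plays the role of a table stored in Redshift. The table and column names are assumptions chosen for illustration; on a real cluster only the `SELECT` would be needed.

```python
import sqlite3

# sqlite3 is only a local stand-in. On Redshift, "spectrum" would be an
# external schema over S3 and "event" a regular table in the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS spectrum")
conn.execute("CREATE TABLE spectrum.sales (eventid INTEGER, pricepaid REAL)")
conn.execute("CREATE TABLE event (eventid INTEGER, eventname TEXT)")
conn.executemany(
    "INSERT INTO spectrum.sales VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)
conn.executemany(
    "INSERT INTO event VALUES (?, ?)",
    [(1, "Concert"), (2, "Play")],
)

# The external table is addressed as schema.table and joined like any other.
JOIN_SQL = """
SELECT e.eventname, SUM(s.pricepaid) AS total_paid
FROM spectrum.sales AS s
JOIN event AS e ON s.eventid = e.eventid
GROUP BY e.eventname
ORDER BY total_paid DESC
"""

rows = conn.execute(JOIN_SQL).fetchall()
print(rows)
```

The query itself does not distinguish between the two sources; the only visible difference is the schema prefix on the external table.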
That said, Spectrum offers the convenience of not having to import your data into Redshift. Basically you’re trading performance for the simplicity of Spectrum. Lots of companies use Spectrum as a way to query infrequently accessed data and then move the data of interest into Redshift for more regular access.
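One way to move the data of interest into Redshift is a `CREATE TABLE ... AS SELECT` over the external table. The helper below is a hypothetical sketch that only renders such a statement; the table names and the date predicate are placeholders, and the inputs are assumed to be trusted.

```python
def promote_to_local_sql(local_table, external_table, predicate):
    """Render a CTAS that copies matching external rows into a local table."""
    return (
        f"CREATE TABLE {local_table} AS\n"
        "SELECT *\n"
        f"FROM {external_table}\n"
        f"WHERE {predicate};"
    )


# Placeholder names: copy recent rows from the external table into Redshift.
sql = promote_to_local_sql(
    "recent_sales", "spectrum.sales", "saledate >= '2023-01-01'"
)
print(sql)
```

After the copy, frequent queries can target the local table, while the full history stays in S3 and remains reachable through Spectrum.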
This article provides a quick recap of the major differences between Amazon Redshift and Redshift Spectrum. It takes into consideration today’s data platform needs.
Put simply, Amazon Redshift can be classified as a tool in the “Big Data as a Service” category, whereas Amazon Redshift Spectrum is grouped under “Big Data Tools”.
Amazon Redshift is a fully managed cloud data warehouse service with the ability to scale on demand. However, its pricing is not simple, since it tries to accommodate different use cases and customers.
At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.