How to Use AWS Redshift Spectrum in AWS Lake Formation
As we’ve covered previously in What is Redshift Used For?, AWS Redshift is a cloud data warehouse used for online analytical processing (OLAP) and business intelligence (BI). Because Redshift’s architecture couples storage with compute and its costs are relatively high at larger data volumes, businesses often seek to limit the workloads running on Redshift and use other analytics services, including open-source Presto, as part of a data lakehouse architecture.
Lake Formation makes it easier to set up the data lake, and to incorporate Redshift as part of the compute layer alongside other analytics tools and services. Developers can optimize their costs by using AWS Redshift for frequently accessed data and moving less frequently accessed data to the Amazon S3 data lake, where it can be queried using serverless query engines such as Athena, Ahana, and Redshift Spectrum.
Two main reasons you would want to use Redshift with Lake Formation:
- Granting and revoking permissions: Within Lake Formation, there is an independent permissions model in addition to the general IAM permissions set on an AWS account. This enables granular control over who can read data from a lake. You can grant and revoke permissions to the Data Catalog objects, such as databases, tables, columns, and underlying Amazon S3 storage. With Redshift following the Lake Formation permissions model out-of-the-box, you can ensure that the users querying data in Redshift are only accessing data they are meant to access.
- Creating external tables and running queries: Amazon Redshift Spectrum can be used as a serverless query option to join data stored in Redshift with data residing on S3. Lake Formation allows you to create virtual tables that correspond to S3 file locations and register them in the Data Catalog. A Redshift Spectrum query would then be able to consume this data without additional configuration.
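As a minimal sketch of the first point, a column-level grant can be expressed as a request to Lake Formation's GrantPermissions API, here through boto3's `grant_permissions` call. The role ARN, database, table, and column names are hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS library.

```python
from typing import Any, Dict, List


def build_column_grant(
    principal_arn: str,
    database: str,
    table: str,
    columns: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for Lake Formation's GrantPermissions call,
    limiting SELECT to the listed columns of one Data Catalog table."""
    if not columns:
        raise ValueError("columns must not be empty")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_column_select(request: Dict[str, Any]) -> None:
    """Send the grant. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical principal and catalog names, for illustration only.
request = build_column_grant(
    principal_arn="arn:aws:iam::123456789012:role/analyst-role",
    database="sales_lake",
    table="orders",
    columns=["order_id", "order_date", "amount"],
)
```

Passing the same request to `revoke_permissions` would withdraw the grant. Keeping the request builder separate from the API call makes the permission set easy to review before anything is sent to AWS.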
How to Integrate AWS Redshift in Lake Formation
Lake Formation relies on the AWS Glue Crawler to store table locations in the Glue Data Catalog, which can then be used to control access to S3 data for other analytics services, including Redshift. This AWS blog post suggests a reference architecture for connecting the various services involved:
- Data stored in an Amazon S3 lake is crawled using AWS Glue Crawler.
- Glue Crawler then stores the metadata it discovers (table schemas and S3 locations) as tables and databases in the AWS Glue Data Catalog.
- The S3 bucket is registered as the data lake location with Lake Formation. Lake Formation is natively integrated with the Glue Data Catalog.
- Lake Formation grants permissions at the database, table, and column level to the defined AWS Identity and Access Management (IAM) roles.
- Developers create external schemas within Amazon Redshift to manage access for other business teams.
- Developers grant user groups access to their respective external schemas and associate the appropriate IAM roles to be assumed.
- Users can now assume their respective IAM roles and query data in their external schemas using the SQL query editor inside Amazon Redshift.
- After the data is registered in the Data Catalog, each time users try to run queries, Lake Formation verifies access to the table for that specific principal. Lake Formation vends temporary credentials to Redshift Spectrum, and the query runs.
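The Redshift side of the steps above (creating an external schema, granting a user group access, and querying) can be sketched as follows. This is an illustration under stated assumptions: the schema, Data Catalog database, IAM role ARN, group, table, and cluster names are all hypothetical, and the values are interpolated into SQL without escaping, so the helper should only be used with trusted input.

```python
from typing import List


def external_schema_statements(
    schema: str,
    catalog_database: str,
    iam_role_arn: str,
    user_group: str,
) -> List[str]:
    """Return Redshift SQL that exposes a Glue Data Catalog database as an
    external schema and lets one user group use it."""
    return [
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role_arn}'",
        f"GRANT USAGE ON SCHEMA {schema} TO GROUP {user_group}",
    ]


# A Spectrum query joining S3-backed data with a local Redshift table.
# Table and column names are hypothetical.
JOIN_QUERY = (
    "SELECT c.customer_name, SUM(o.amount) AS total_amount "
    "FROM spectrum_sales.orders AS o "
    "JOIN public.customers AS c ON c.customer_id = o.customer_id "
    "GROUP BY c.customer_name"
)


def run_on_cluster(sql: str, cluster_id: str, database: str, db_user: str) -> str:
    """Submit one statement through the Redshift Data API and return its id.
    Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the SQL builder works without boto3

    response = boto3.client("redshift-data").execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]


statements = external_schema_statements(
    schema="spectrum_sales",
    catalog_database="sales_lake",
    iam_role_arn="arn:aws:iam::123456789012:role/spectrum-analyst-role",
    user_group="analysts",
)
```

When a user in the group runs a query like `JOIN_QUERY`, Lake Formation checks that principal's permissions on the catalog table before Redshift Spectrum reads anything from S3, as described in the last step.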
Using Lake Formation as Part of an Open Data Lakehouse
One of the advantages of a data lake is its open nature, which allows businesses to use a variety of best-in-breed analytics tools for different workloads. This replaces database-centric architectures, which require storing data in proprietary formats and getting locked in to a particular vendor.
Implementing Lake Formation makes it easier to move more data into your lake, where you can store it in open-source file formats such as Apache Parquet and ORC. You can then use a variety of tools that interface with the Glue Data Catalog and read data directly from S3. This provides a high level of flexibility, prevents vendor lock-in, and strongly decouples storage from compute, which can reduce your overall infrastructure costs. (You can read more about this topic in our new white paper: The SQL Data Lakehouse and Foundations for the New Data Stack.)
If you’re looking for a truly open and flexible option for serverless querying, you should check out Ahana Cloud. Ahana Cloud and AWS Lake Formation make it easy to build and query secure S3 data lakes. Using the native integration, data platform teams can seamlessly connect Presto with AWS Glue, AWS Lake Formation, and AWS S3 while providing granular data security. Enabling the integration in Ahana Cloud takes a single click when creating a new Presto cluster.
Learn more about Ahana Cloud’s integration with AWS Lake Formation.