Presto has evolved into a unified engine for SQL queries on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.
Presto’s Redshift connector allows conducting SQL queries on the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters.
How to Run SQL Queries in Redshift with Presto
Step 1: Setup a Presto cluster with Kubernetes
Step 2: Setup a Amazon Redshift cluster
Create an Amazon Redshift cluster from AWS Console and make sure it’s up and running with dataset and tables as described here.
Below screen shows Amazon Redshift cluster – “
Further, JDBC URL from Cluster is required to setup a redshift connector with Presto.
You can skip this section if you want to use your existing Redshift cluster, just make sure your redshift cluster is accessible from Presto, because AWS services are secure by default. So even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the target Redshift cluster can prevent inbound connections to the database cluster. In simple words, Security Group settings of Redshift database play a role of a firewall and prevent inbound database connections over port 5439.Find the assigned Security Group and check its Inbound rules.
If your Presto Compute Plane VPC and data sources are in a different VPC then you need to configure a VPC peering connection.
Step 3: Configure Presto Catalog for Amazon Redshift Connector
At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.
Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:
connection-password=secret connector.name=redshift connection-url=jdbc:postgresql://example.net:5439/database connection-user=root
This is how my catalog properties look like –
my_redshift.properties: | connector.name=redshift connection-user=awsuser connection-password=admin1234 connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev
Step 4: Check for available datasets, schemas and tables, etc and run SQL queries with Presto Client to access Redshift database
After successfully database connection with Amazon Redshift, You can connect to Presto CLI and run following queries and make sure that the Redshift catalog gets picked up and perform show schemas and show tables to understand available data.
$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password
IN the below example you can see a new catalog for Redshift Database got initiated called “
presto> show catalogs; Catalog ------------- ahana_hive jmx my_redshift system tpcds tpch (6 rows) Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes Splits: 36 total, 36 done (100.00%) 0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Further you can check all available schemas for your Amazon Redshift from Presto to work with.
presto> show schemas from my_redshift; Schema -------------------- catalog_history information_schema pg_catalog pg_internal public (5 rows) Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes Splits: 36 total, 36 done (100.00%) 0:01 [5 rows, 85B] [4 rows/s, 72B/s]
Here, I have used sample data that comes with Redshift Cluster setup. I have chosen the schema “public” which is a part of “dev” Redshift Database.
presto> show tables from my_redshift.public; Table ---------- category date event listing sales users venue (7 rows) Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes Splits: 36 total, 36 done (100.00%) 0:03 [7 rows, 151B] [2 rows/s, 56B/s]
Further, you can explore tables as “sales” in the below example.
presto> select * from my_redshift.public.sales LIMIT 2; salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission | saletime ---------+--------+----------+---------+---------+--------+---------+-----------+------------+------------------------- 33095 | 36572 | 30047 | 660 | 2903 | 1827 | 2 | 234.00 | 35.10 | 2008-01-01 01:41:06.000 88268 | 100813 | 45818 | 698 | 8649 | 1827 | 4 | 836.00 | 125.40 | 2007-12-31 23:26:20.000 (2 rows) Query 20210810_185527_00212_krtkp, FINISHED, 1 node Splits: 18 total, 18 done (100.00%) 0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]
Following are some more complex queries you can run against sample data:
presto:public> -- Find top 10 buyers by quantity ->SELECT firstname, lastname, total_quantity -> FROM (SELECT buyerid, sum(qtysold) total_quantity -> FROM sales -> GROUP BY buyerid -> ORDER BY total_quantity desc limit 10) Q, users -> WHERE Q.buyerid = userid -> ORDER BY Q.total_quantity desc; firstname | lastname | total_quantity -----------+----------+---------------- Jerry | Nichols | 67 Armando | Lopez | 64 Kameko | Bowman | 64 Kellie | Savage | 63 Belle | Foreman | 60 Penelope | Merritt | 60 Kadeem | Blair | 60 Rhona | Sweet | 60 Deborah | Barber | 60 Herrod | Sparks | 60 (10 rows) Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes Splits: 214 total, 214 done (100.00%) 0:10 [222K rows, 0B] [22.4K rows/s, 0B/s] presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales. -> SELECT eventname, total_price -> FROM (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile -> FROM (SELECT eventid, sum(pricepaid) total_price -> FROM sales -> GROUP BY eventid)) Q, event E -> WHERE Q.eventid = E.eventid -> AND percentile = 1 -> ORDER BY total_price desc; eventname | total_price ----------------------+------------- Adriana Lecouvreur | 51846.00 Janet Jackson | 51049.00 Phantom of the Opera | 50301.00 The Little Mermaid | 49956.00 Citizen Cope | 49823.00 Sevendust | 48020.00 Electra | 47883.00 Mary Poppins | 46780.00 Live | 46661.00 (9 rows) Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes Splits: 230 total, 230 done (100.00%) 0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]
Step 5: Run SQL queries to join data between different systems like Redshift and Hive
Another great use case of Presto is Data Federation. In this example I will join Apache Hive table with Amazon Redshift table and run JOIN query to access both tables from Presto.
Here, I have two catalogs “
ahana_hive” for Hive Database and “
my_redshift” for Amazon Redshift and each database has
ahana_hive.default.customer table respectively within their schema.
Following very simple SQL queries to join these tables, the same way you join two tables from the same database.
presto> show catalogs; presto> select * from ahana_hive.default.customer; presto> select * from my_redshift.public.users; presto> Select * from ahana_hive.default.customer x join my_redshift.public.users y on x.nationkey = y.userid;
Understanding Redshift’s Limitations
Running SQL queries on Redshift has its advantages, but there are some shortcomings associated with Amazon Redshift. If you are looking for more information about Amazon Redshift, check out the pros and cons and some of the limitations of Redshift in more detail.
Start Running SQL Queries on your Data Lakehouse
We made it simple to run SQL queries on Presto in the cloud.
Get started with Ahana Cloud and start running SQL in a few mins.
Amazon Redshift is a cloud data warehouse, permitting the execution of SQL queries, offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.
At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. Users can easily run SQL queries on Redshift, but there are some limitations.