Get started with PrestoDB on AWS with an easy to use Sandbox – 100% open-source and free. This tutorial will walk you through trying out Presto using the sandbox AMI on the AWS Marketplace.
The PrestoDB AMI in the AWS Marketplace configures the instance to be both the Presto co-ordinator and a Presto worker to get started easily. Let’s get started.
This AMI configures the instance to be both the Presto co-ordinator and a Presto worker. It also includes the Hive Metastore backed by PostgresSQL bundled in.
In this tutorial, you will create a table using data in an AWS S3 bucket and query it. Let’s get started!
Step 1 – Subscribe to the PrestoDB Sandbox Marketplace AMI
- Go to the AWS Marketplace listing https://aws.amazon.com/marketplace/pp/B08C21CGF6
- Click on “Accept Terms”
- After effective date and expiration date get updated, click on “Continue to Configuration”
Step 2 – Launch the AMI from Marketplace
- Select the delivery method as Amazon Machine Image (AMI)
- Choose Action to launch the AMI through EC2 and click on launch
- On “Step 2” of EC2 instance launch, choose r5.xlarge, then click “Next: Configure Instance Details” button. (Note: If you’re using a free tier AWS account, you can use a t2.xlarge instance.)
- On “Step 3” of EC2 instance launch, keep default settings and click on “Create a new IAM role”, then click “Review and Launch” button
Step 3 – Create a new IAM Role to access S3
- A new tab will be opened to create a new IAM Role. Click on “Create role”.
- In step 1 of role creation, click on EC2 as shown below, then click “Permissions”
- In step 2 of role creation, filter for S3 policies by typing in “s3” in the search box.
- Then select “AmazonS3ReadOnlyAccess” – We will create a role with this policy and attach the role to the cluster. This will allow us to let the Presto server read data from a public S3 bucket. Click “Next: Tags” to move to the next step.
- In step 3 of role creation, enter tags. This is optional and used for your own reference. You can add a key and value as shown below.
- In step 4 of role creation, give the role a name like “ahana-presto-s3-access”.
- Then click on “Create role”
Step 4 – Complete EC2 Instance Launch
- Now go back to the EC2 instance creation tab and click on the refresh button in the IAM role section
- You will see your newly created role in the drop down. Select that new role
- Next click “Review and Launch”
- Review the instance and click on “Launch”
- You will be asked to select a key pair or create a new one.
- You can create a new one, give it a name like “ahana-presto-sandbox” and download it
- Then click “Launch Instances”
- Your instance is now launched with a new role attached.
- Click on the instance ID shown below to go to your instance
Step 5 – Bring up the Presto Console
- Find the public DNS (public IP) of the instance you just launched
- Bring up the PrestoDB Console
Step 6 – Create a table in the Hive catalog using data stored on S3 as parquet
- SSH into the instance using terminal to run queries
- You may require to change the permissions on the new pem file that you created and downloaded
chmod 400 ahana-presto-sandbox.pem
ssh -i <KEYPAIR FILENAME>.pem ec2-user@<PUBLIC_DNS>
Example: ssh -i ahana-presto-sandbox.pem email@example.com
- Bring up the presto-cli to query the included catalogs and datasets. You will see the Hive catalog is already configured and datasets like TPC-DS and TPC-H are already included
$ presto-cli presto> show catalogs;
- Let’s use the HIVE catalog
presto> use hive.default;
- Next let’s create a table with external data stored in a public bucket “ahana-sandbox”
presto:default> CREATE TABLE airports ( iata varchar, airport varchar, city varchar, state varchar, country varchar, lat real, long real ) WITH ( format = 'PARQUET', external_location = 's3a://ahana-sandbox/airports/' );
- Presto will create a table in the Hive Metastore included and point to the S3 bucket that includes a parquet file for airport data. There is no need to ingest data as Presto understand parquet format as well as a range of other formats.
- Now let’s see created and query the table created.
presto:default> show tables; presto:default> select * from airports limit 10;
Step 7 – Query bundled in datasets like TPC-H
- Bring up the presto-cli to query the included datasets. Lets try to see the schemas in the tpch catalog
presto> show schemas from tpch;
- Bring up the presto-cli to query the included datasets
presto> use tpch.sf100;
presto> show tables;
presto:sf1> select * from customer limit 10;