Presto 105: Running Presto with AWS Glue as catalog on your Laptop

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto 104: Running Presto with Hive Metastore

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.

We did mention in the tutorial Presto 104 why we are using a catalog. Just to recap, Presto is a disaggregated database engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, execution engine – but it does not have other components of the database – this includes the system catalog. In the data lake world, the system catalog where the database schema resides is a catalog. There are two popular catalogs that have emerged – Hive Metastore and AWS Glue catalog.

AWS Glue is an event-driven, serverless computing platform provided by AWS. AWS Glue provides data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog does the mapping between the database tables and columns and the objects or files that reside in the data lake. This could be files or immutable objects in AWS S3.

In this tutorial, we will focus on using Presto with the AWS Glue on your laptop.   

This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.

Implementation steps

Step 1: 

Create a docker network namespace, so that containers could communicate with each other using the network namespace.

C:\Users\rupendran>docker network create presto_networkd0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2: 

Ahanaio has developed a sandbox for prestodb which can be downloaded from docker hub, use the command below to download prestodb sandbox, which comes with all packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete                                                          

e8f8aa933633: Pull complete                                                          
b7cf38297b9f: Pull complete                                                          
a4205d42b3be: Pull complete                                                          
81b659bbad2f: Pull complete                                                          
ef606708339: Pull complete                                                          
979857535547: Pull complete                                                          
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start the instance of the the prestodb sandbox and name it as coordinator

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator
ahanaio/prestodb-sandboxd
b74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to be running on this container without the worker node. So let’s edit  the config.properties file and set the node-scheduler.include-cooridinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running coordinator. Since we updated the config file to run this instance only as a Presto coordinator and stopped the worker service.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3
ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers and set coordinator to false, http-server.http.port to 8081/8082/8083 respectively for each worker and finally discovery.uri should point to coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will Install aws-cli and configure AWS glue on the coordinator and worker containers.

# yum install -y aws-cli

Step 9: 

Create glue user and attach to policy AmazonS3FullAccess and AWSGlueConsoleFull Access

aws iam create-user --user-name glueuser
{
    "User": {
        "Path": "/",
        "UserName": "glueuser",
        "UserId": "AXXXXXXXXXXXXXXXX",
        "Arn": "arn:aws:iam::XXXXXXXXXX:user/glueuser",
        "CreateDate": "2021-10-07T01:07:28+00:00"
    }
}

aws iam list-policies | grep AmazonS3FullAccess
            "PolicyName": "AmazonS3FullAccess",
            "Arn": "arn:aws:iam::aws:policy/AmazonS3FullAccess",

aws iam list-policies | grep AWSGlueConsoleFullAccess
            "PolicyName": "AWSGlueConsoleFullAccess",
            "Arn": "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

Step 10:

Create access key

% aws iam create-access-key --user-name glueuser
{
   "AccessKey": {
       "UserName": "glueuser",
        "AccessKeyId": "XXXXXXXXXXXXXXXXXX", 
       "Status": "Active",
        "SecretAccessKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "CreateDate": "2021-10-13T01:50:45+00:00"
    }
}

Step 11:

Run aws configure and enter the access and secret key configured.

aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXXAWS
Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]

Step 12:

Create /opt/presto-server/etc/catalog/glue.properties file add the AWS Glue properties to presto, this file needs to be added on both coordinator and worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.

connector.name=hive-hadoop2
hive.metastore=glue
hive.non-managed-table-writes-enabled=true
hive.metastore.glue.region=us-east-2
hive.metastore.glue.aws-access-key=<your AWS key>
hive.metastore.glue.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use glue as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog glue

Step 15:

Create a schema using S3 location.

presto:default> create schema glue.demo with (location= 's3://Your_Bucket_Name/demo');
CREATE SCHEMA
presto:default> use demo;

Step 16:

Create table under glue.demo schema

presto:demo> create table glue.demo.part with (format='parquet') AS select * from tpch.tiny.part;
CREATE TABLE: 2000 rows
    
Query 20211013_034514_00009_6hkhg, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:06 [2K rows, 0B] [343 rows/s, 0B/s]

Step 17:

Run select statement on the newly created table.

presto:demo> select * from glue.demo.part limit 10; 
partkey |                   name                   |      mfgr      |  brand
---------+------------------------------------------+----------------+---------
       1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13
       2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13
       3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42
       4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34
       5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32
       6 | bisque cornflower lawn forest magenta    | Manufacturer#2 | Brand#24
       7 | moccasin green thistle khaki floral      | Manufacturer#1 | Brand#11
       8 | misty lace thistle snow royal            | Manufacturer#4 | Brand#44
       9 | thistle dim navajo dark gainsboro        | Manufacturer#4 | Brand#43
      10 | linen pink saddle puff powder            | Manufacturer#5 | Brand#54

Summary

In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.