This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:
Presto 101: Installing & Configuring Presto locally
Presto 102: Running a three node PrestoDB cluster on a laptop
Presto 103: Running a Prestodb cluster on GCP
Presto 104: Running Presto with Hive Metastore
Presto is an open source, distributed SQL query engine that runs on a cluster of nodes. In this tutorial, we will show you how to run Presto with AWS Glue as a catalog on a laptop.
Presto 104 explained why we need a catalog. To recap, Presto is a disaggregated database engine: it provides the top part of the database stack (the SQL parser, compiler, optimizer, scheduler, and execution engine) but not the other components of a database, including the system catalog. In the data lake world, the role of the system catalog, where the database schema resides, is filled by an external catalog service. Two popular catalogs have emerged: Hive Metastore and the AWS Glue catalog.
What is AWS Glue?
AWS Glue is a serverless data integration service provided by AWS that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog maps database tables and columns to the objects or files that reside in the data lake, such as files or immutable objects in AWS S3.
In this tutorial, we will focus on using Presto with AWS Glue on your laptop.
This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud, which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.
Create a Docker network so that the containers can communicate with each other over it.
Ahana has developed a PrestoDB sandbox image, ahanaio/prestodb-sandbox, which is available on Docker Hub and comes with all the packages needed to run PrestoDB. Download it with docker pull.
Start an instance of the PrestoDB sandbox and name it coordinator.
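The three steps above can be sketched as the commands below. The network name presto_network and the published port 8080 are assumptions made for illustration, not values taken from this tutorial, and the small run helper is hypothetical: it only prints each command unless DRY_RUN=0 is set, so the sketch is safe to try first.

```shell
set -eu

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NETWORK=presto_network   # assumed network name, pick any name you like

# 1. User-defined network so containers can reach each other by name.
run docker network create "$NETWORK"

# 2. Download the sandbox image from Docker Hub.
run docker pull ahanaio/prestodb-sandbox

# 3. Start a container named "coordinator" (8080 is assumed to be the Presto HTTP port).
run docker run -d -p 8080:8080 --network "$NETWORK" --name coordinator ahanaio/prestodb-sandbox
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 to create the network and the container.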
We want this container to run only the coordinator, without a worker node. So let's edit the config.properties file and set node-scheduler.include-coordinator to false.
Restart the Docker container running the coordinator so that the updated config file takes effect. The instance now runs only as a Presto coordinator and no longer schedules work on itself.
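As a rough illustration of the edit, the sketch below applies it to a local sample copy of config.properties. The sample file contents and the presto-glue-demo scratch directory are assumptions; the real file shipped in the sandbox image may contain different or additional properties.

```shell
set -eu

mkdir -p presto-glue-demo/coordinator
conf=presto-glue-demo/coordinator/config.properties

# Sample coordinator config (assumed contents, for illustration only).
cat > "$conf" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

# Flip the flag so the coordinator stops scheduling work on itself.
# A temp file is used instead of "sed -i" because -i differs between GNU and BSD sed.
sed 's/^node-scheduler.include-coordinator=.*/node-scheduler.include-coordinator=false/' "$conf" > "$conf.tmp"
mv "$conf.tmp" "$conf"

# Show the edited line.
grep '^node-scheduler.include-coordinator=' "$conf"

# Inside the container, the same substitution would be applied to
# /opt/presto-server/etc/config.properties, followed on the host by:
#   docker restart coordinator
```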
Create three more containers from ahanaio/prestodb-sandbox to act as workers.
Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively, and point discovery.uri at the coordinator.
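One way to prepare the three worker configurations is to generate them on the laptop and copy them into the containers. The sketch below is written under several assumptions: the container names worker1 to worker3, the network name presto_network, the scratch directory presto-glue-demo, and a coordinator that listens on port 8080 are all illustrative. The Docker commands are only printed, not executed.

```shell
set -eu

out=presto-glue-demo/workers
mkdir -p "$out"

i=1
for port in 8081 8082 8083; do
  # Minimal worker config: not a coordinator, own HTTP port,
  # discovery pointing at the container named "coordinator".
  cat > "$out/worker$i.config.properties" <<EOF
coordinator=false
http-server.http.port=$port
discovery.uri=http://coordinator:8080
EOF

  # Printed plan only (assumed names); run these by hand once reviewed.
  echo "docker run -d -p $port:$port --network presto_network --name worker$i ahanaio/prestodb-sandbox"
  echo "docker cp $out/worker$i.config.properties worker$i:/opt/presto-server/etc/config.properties"
  echo "docker restart worker$i"

  i=$((i + 1))
done
```

Because all containers share one user-defined Docker network, the hostname coordinator in discovery.uri resolves to the coordinator container by name.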
Now we will install aws-cli and configure AWS Glue on the coordinator and worker containers.
Create an IAM user named glue and attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies to it.
Create an access key for that user.
Run aws configure and enter the access key and secret key you just created.
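The IAM steps can be done in the AWS console. As an alternative, the sketch below shows the equivalent AWS CLI calls; this is a swapped-in approach, not necessarily how the steps were originally performed, and it assumes the CLI on your laptop is already configured with credentials that are allowed to manage IAM. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
set -eu

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

IAM_USER=glue   # user name used in this tutorial

# Create the user.
run aws iam create-user --user-name "$IAM_USER"

# Attach the two AWS managed policies named above.
for policy in AmazonS3FullAccess AWSGlueConsoleFullAccess; do
  run aws iam attach-user-policy --user-name "$IAM_USER" \
    --policy-arn "arn:aws:iam::aws:policy/$policy"
done

# Create the access key; when executed for real, the output contains the
# secret key, which is shown only once.
run aws iam create-access-key --user-name "$IAM_USER"
```

After the key exists, aws configure prompts for the access key ID, the secret access key, a default region, and an output format.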
Create the file /opt/presto-server/etc/catalog/glue.properties to add the AWS Glue properties to Presto. This file needs to be present on the coordinator and on every worker container. Set hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key to the AWS access key and secret key generated in the previous step.
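A minimal sketch of such a catalog file is generated below. The region us-east-1, the placeholder key values, and the presto-glue-demo scratch directory are assumptions; substitute your own region and credentials. The keys are read from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables when those are set.

```shell
set -eu

mkdir -p presto-glue-demo/catalog
glue=presto-glue-demo/catalog/glue.properties

# Hive connector backed by the Glue metastore.
# Region and key placeholders are assumptions; replace them with real values.
cat > "$glue" <<EOF
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=${AWS_REGION:-us-east-1}
hive.metastore.glue.aws-access-key=${AWS_ACCESS_KEY_ID:-YOUR_ACCESS_KEY}
hive.metastore.glue.aws-secret-key=${AWS_SECRET_ACCESS_KEY:-YOUR_SECRET_KEY}
EOF

# The file holds credentials, so keep it readable by the owner only.
chmod 600 "$glue"

# Copy it into each container afterwards, for example:
#   docker cp presto-glue-demo/catalog/glue.properties coordinator:/opt/presto-server/etc/catalog/glue.properties
```

The file name determines the catalog name, so glue.properties is what makes the catalog available as glue in queries.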
Restart the coordinator and all worker containers.
Run presto-cli and use glue as the catalog.
Create a schema backed by an S3 location.
Create a table under the glue.demo schema.
Run a select statement on the newly created table.
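The last three steps can be sketched as a small SQL script run through presto-cli. The bucket name YOUR_BUCKET, the table name part, and the use of the tpch.tiny.part sample table as a data source are illustrative assumptions, as is the presence of presto-cli on the PATH inside the coordinator container. The run helper is hypothetical and only prints the Docker commands unless DRY_RUN=0 is set.

```shell
set -eu

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mkdir -p presto-glue-demo

# Schema on S3, a Parquet table filled from the TPC-H sample data, and a test query.
cat > presto-glue-demo/demo.sql <<'EOF'
CREATE SCHEMA glue.demo WITH (location = 's3://YOUR_BUCKET/demo');
CREATE TABLE glue.demo.part WITH (format = 'PARQUET') AS SELECT * FROM tpch.tiny.part;
SELECT * FROM glue.demo.part LIMIT 10;
EOF

# Copy the script into the coordinator and run it against the glue catalog.
run docker cp presto-glue-demo/demo.sql coordinator:/tmp/demo.sql
run docker exec coordinator presto-cli --server localhost:8080 --catalog glue --file /tmp/demo.sql
```

Once the script has run for real, the new database and table should also be visible in the AWS Glue console, and the Parquet files under the S3 location.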
In this tutorial, we walked through the steps to use Presto with AWS Glue as a catalog on a laptop. If you're looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that offers both Hive Metastore and AWS Glue as catalog choices for PrestoDB.