Hands-on Presto Tutorial: How to run Presto on Kubernetes

What is Presto?

Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, object stores including s search and indexing systems such as elastic and druid. 

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, DataProc) 
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone <repository url>

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started – What do I need?

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it’s ‘presto-cluster’), you should go in and add a node group which has sufficient capacity for the Presto Docker containers to run on. I plan on using R5.2xlarge nodes. I setup a node group of 4 nodes (You can tweak your Presto Docker container settings accordingly and use smaller nodes if required).

Graphical user interface, text, application, email

Description automatically generated

Figure 1: Creating a new nodegroup
Graphical user interface, application

Description automatically generated

Figure 2: Setting the instance type and node count

Once your node group shows active you are ready to move onto the next step

Graphical user interface, text, application, email

Description automatically generated

Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let’s first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Docker file, the scripts and environment variable supported the repository can be found here.

The details of the specific configuration parameters being used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing the values via –env option for docker or by using name-value pairs in the Kubernetes yaml file as we will see later.

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (You can use the same or different user): 

 arn:aws:iam::<your account id>:user/<your user>

  1. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>

Graphical user interface, text, application, email

Description automatically generated

Figure 4: Assume role privileges

Graphical user interface, text, application, email

Description automatically generated

Figure 5: Trust relationships

Graphical user interface, text, application

Description automatically generated
  1. Access to the latest docker image for this tutorial asifkazi/presto-on-docker:latest

Warning: The permissions provided above are petty lax, giving the user a lot of privileges not just on assume role but also what operations the user can perform on S3 and Glue. DO NOT use these permissions as-is for production use. It’s highly recommended to tie down the privileges using the principle of least privilege (only provide the minimal access required)

Run the following commands:

  1. Create a network for the nodes

docker create network presto

  1. Start a mysql docker instance

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser  -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  1. Start the presto single node cluster on docker

docker run -d --name presto \

 --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>"  \

--env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \

--env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam:: <Your Account>::role//<Your Role>" \

-p 8080:8080 \

--network=presto \

asifkazi/presto-on-docker:latest

  1. Make sure the containers came up correctly:

docker ps 

  1. Interactively log into the docker container:

docker exec -it presto bash

  1. From within the docker container we will verify that everything is working correctly:
  1. Run the following command:

presto

  1. From within the presto cli run the following:

show schemas from mysql

The command should show the mysql databases

  1. From within the presto cli run the following:

show schemas from hive

The command should show the databases from glue. If you are using glue for the first time you might only see the information_schema and default database.

Text

Description automatically generated

We have validated that the docker container itself is working fine, as a single node cluster (worker and coordinator on the same node). We will not move to getting this environment now working in Kubernetes. But first, let’s clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto;docker rm mysql presto;

Getting Presto running on K8s

To get presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes specific properties, we will provide all the docker env properties via name value pairs.

  1. Create a namespace for the presto cluster

kubectl create namespace presto

Text

Description automatically generated
  1. Override the env settings in the presto.yaml file for both the coordinator and worker sections
A screenshot of a computer

Description automatically generated with medium confidence
  1. Apply the yaml file to the Kubernetes cluster

kubectl apply -f presto.yaml –namespace presto

A screenshot of a computer

Description automatically generated with low confidence
  1. Let’s also start a mysql instance. We will first start by creating a persistent volume and claim. 

kubectl apply -f ./mysql-pv.yaml --namespace presto

A screenshot of a computer

Description automatically generated with low confidence
  1. Create the actual instance

kubectl apply -f ./mysql-deployment.yaml --namespace presto

A screenshot of a computer

Description automatically generated with medium confidence
  1. Check the status of the cluster make sure there are no errored or failing pods

kubectl get pods -n presto

A picture containing text, outdoor, sign

Description automatically generated
  1. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You are going to need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the —  and bash is required

Text

Description automatically generated
  1. Querying seems to be working but is the Kubernetes deployment a multi-node cluster? Let’s check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

Text

Description automatically generated
  1. Let’s see what happens if we destroy one of the pods (simulate failure)

kubectl delete pod presto-worker-k9xw8 -n presto

A screenshot of a computer

Description automatically generated with low confidence
  1. What does the current deployment look like?
A picture containing text, outdoor, sign

Description automatically generated

What? The pod was replaced by a new one presto-worker-tnbsb!

  1. Now we’ll modify the number of replicas for the workers in the presto.yaml
  1. Set replicas to 4
Graphical user interface, application

Description automatically generated

Apply the changes to the cluster

kubectl apply -f presto.yaml –namespace presto

Check the number of running pods for the workers

Text

Description automatically generated

kubectl get pods -n presto

Wow, we have a fully functional presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency. 

Summary

In this tutorial we setup a single node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of the Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Ahana provides a fully managed Presto service (running on Kubernetes). We setup the cluster with intelligent defaults depending upon your workload and provide an interactive UI and API to manage your presto needs. Learn more at https://ahana.io/ahana-cloud/