Hands-on Presto Tutorial: How to run Presto on Kubernetes

What is Presto?

Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores, as well as search and indexing systems such as Elasticsearch and Druid.

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.
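To make this concrete, here is a hypothetical illustration of a federated query (the table and column names are made up; it assumes catalogs named `mysql` and `hive` are configured, as they are later in this tutorial). The same ANSI SQL joins a relational source and an object store:

```sql
-- Illustrative only: join MySQL data with S3 data cataloged in Glue,
-- using nothing but standard SQL and catalog.schema.table names.
SELECT o.order_id, c.customer_name
FROM mysql.demodb.orders o
JOIN hive.sales.customers c
  ON o.customer_id = c.customer_id;
```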

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, DataProc) 
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone https://github.com/asifkazi/presto-on-docker.git
git clone https://github.com/asifkazi/presto-on-kubernetes.git

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.
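As a minimal sketch of what "declarative" means in practice (names and values below are illustrative, not from this tutorial's files), a Kubernetes Deployment states the desired replica count and resources, and Kubernetes continuously converges the cluster to that state:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # illustrative name
spec:
  replicas: 3                # desired state: Kubernetes keeps 3 pods running
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: example/app:latest
        resources:
          requests:
            memory: "4Gi"    # availability/resource needs, declared not scripted
            cpu: "2"
```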

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started – What do I need?

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it’s ‘presto-cluster’), you should go in and add a node group which has sufficient capacity for the Presto Docker containers to run on. I plan on using r5.2xlarge nodes. I set up a node group of 4 nodes (you can tweak your Presto Docker container settings accordingly and use smaller nodes if required).


Figure 1: Creating a new nodegroup

Figure 2: Setting the instance type and node count

Once your node group shows active, you are ready to move on to the next step.


Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let’s first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the environment variables supported, the repository can be found here.

The details of the specific configuration parameters being used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing the values via the --env option for docker, or by using name-value pairs in the Kubernetes yaml file as we will see later.

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (You can use the same or different user): 

 arn:aws:iam::<your account id>:user/<your user>

  2. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>


Figure 4: Assume role privileges


Figure 5: Trust relationships

  3. Access to the latest docker image for this tutorial: asifkazi/presto-on-docker:latest

Warning: The permissions provided above are pretty lax, granting the user broad privileges not just on assume-role but also on the operations the user can perform on S3 and Glue. DO NOT use these permissions as-is for production use. It’s highly recommended to tighten the privileges using the principle of least privilege (only provide the minimal access required).

Run the following commands:

  1. Create a network for the nodes

docker network create presto

  2. Start a mysql docker instance

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  3. Start the presto single node cluster on docker

docker run -d --name presto \

 --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>"  \

--env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \

--env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \

-p 8080:8080 \

--network=presto \

asifkazi/presto-on-docker:latest

  4. Make sure the containers came up correctly:

docker ps 

  5. Interactively log into the docker container:

docker exec -it presto bash

  6. From within the docker container, verify that everything is working correctly by running the following command:

presto

  7. From within the presto cli run the following:

show schemas from mysql;

The command should show the mysql databases.

  8. From within the presto cli run the following:

show schemas from hive;

The command should show the databases from Glue. If you are using Glue for the first time, you might only see the information_schema and default databases.
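If you want to go one step further than listing schemas, a quick end-to-end smoke test might look like the following (a sketch only: the schema, table, and bucket names are hypothetical, and you need write access to an S3 bucket of your own):

```sql
-- Create a Glue-backed schema and table via the hive catalog, write a row,
-- and read it back. Replace <your-bucket> with a bucket you control.
CREATE SCHEMA hive.demo WITH (location = 's3://<your-bucket>/demo/');
CREATE TABLE hive.demo.greetings (id bigint, msg varchar);
INSERT INTO hive.demo.greetings VALUES (1, 'hello from presto');
SELECT * FROM hive.demo.greetings;
```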


We have validated that the docker container itself is working fine as a single node cluster (worker and coordinator on the same node). We will now move on to getting this environment working in Kubernetes. But first, let’s clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto; docker rm mysql presto

Getting Presto running on K8s

To get presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to the Kubernetes-specific properties, we will provide all the Docker env properties via name-value pairs.

  1. Create a namespace for the presto cluster

kubectl create namespace presto

  2. Override the env settings in the presto.yaml file for both the coordinator and worker sections
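As a rough sketch of what these overrides look like (the exact surrounding structure depends on the presto.yaml in the repository, and the values are placeholders), the env settings take the usual Kubernetes name-value form:

```yaml
# Illustrative fragment: set these under the container spec in both the
# coordinator and worker sections of presto.yaml.
env:
  - name: PRESTO_CATALOG_HIVE_S3_IAM_ROLE
    value: "arn:aws:iam::<Your Account>:role/<Your Role>"
  - name: PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY
    value: "<Your Access Key>"
  - name: PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY
    value: "<Your Secret Access Key>"
```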
  3. Apply the yaml file to the Kubernetes cluster

kubectl apply -f presto.yaml --namespace presto

  4. Let’s also start a mysql instance. We will first start by creating a persistent volume and claim.

kubectl apply -f ./mysql-pv.yaml --namespace presto

  5. Create the actual instance

kubectl apply -f ./mysql-deployment.yaml --namespace presto

  6. Check the status of the cluster and make sure there are no errored or failing pods

kubectl get pods -n presto

  7. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You will need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the -- and bash is required

  8. Querying seems to be working, but is the Kubernetes deployment a multi-node cluster? Let’s check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

  9. Let’s see what happens if we destroy one of the pods (simulate failure)

kubectl delete pod presto-worker-k9xw8 -n presto

  10. What does the current deployment look like?

What? The pod was replaced by a new one, presto-worker-tnbsb!

  11. Now we’ll modify the number of replicas for the workers in the presto.yaml, setting replicas to 4
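A minimal sketch of the change (the field names follow the standard Kubernetes Deployment spec; the surrounding structure depends on the repository’s presto.yaml):

```yaml
# presto.yaml, worker section - illustrative fragment
spec:
  replicas: 4   # desired number of worker pods
```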

Apply the changes to the cluster

kubectl apply -f presto.yaml --namespace presto

Check the number of running pods for the workers:

kubectl get pods -n presto

Wow, we have a fully functional presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency. 

Summary

In this tutorial we set up a single node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of the Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Ahana provides a fully managed Presto service (running on Kubernetes). We set up the cluster with intelligent defaults depending upon your workload and provide an interactive UI and API to manage your presto needs. Learn more at https://ahana.io/ahana-cloud/

Presto substring operations: How do I get the X characters from a string of a known length?

Presto provides an overloaded substring function to extract characters from a string; your usage of the function will vary depending on which part of the string you need. We will use the string “Presto String Operations” to demonstrate its use.

Extract last 7 characters:

presto> SELECT substring('Presto String Operations',-7) as result;

 result

---------

 rations

(1 row)

Query 20210706_225327_00014_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract last 10 characters:

presto> SELECT substring('Presto String Operations',-10) as result;

   result

------------

 Operations

(1 row)

Query 20210706_225431_00015_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the middle portion of the string:

presto> SELECT substring('Presto String Operations',8,6) as result;

 result

--------

 String

(1 row)

Query 20210706_225649_00020_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the beginning portion of the string:

presto> SELECT substring('Presto String Operations',1,6) as result;

 result

--------

 Presto

(1 row)

Query 20210706_225949_00021_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]
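The negative-start form counts from the end of the string, so you don’t actually need to know the string’s length. If you would rather compute the start position explicitly, an equivalent formulation (a sketch using only the substring and length functions shown above) is:

```sql
-- Extract the last 10 characters by computing the start position:
-- length is 24, so the last 10 characters start at position 24 - 9 = 15.
SELECT substring('Presto String Operations',
                 length('Presto String Operations') - 9) AS result;
-- Equivalent to substring('Presto String Operations', -10): 'Operations'
```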

Comparing AWS Athena and PrestoDB Blog Series: Athena Alternatives

This is the 4th blog in our comparing AWS Athena to PrestoDB series. If you missed the others, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits
Part 3: AWS Athena vs. PrestoDB Blog Series: Athena Partition Limits

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Athena is great for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine 2) also provides federated query capabilities, which allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Why would I not want to use Athena?

There are various reasons users look for alternative options to Athena, in spite of its advantages: 

  1. Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per-region. If too many users leverage the service at the same time in a region, users across the board start seeing query queuing and latencies. Query concurrency can be challenging due to limits imposed on accounts to prevent users from overwhelming the regional service.
  2. Cost per query: Athena charges based on terabytes of data scanned ($5 per TB). If your datasets are not very large, and you don’t have a lot of users querying the data often, Athena is the perfect solution for your needs. If, however, you run hundreds or thousands of queries scanning terabytes or petabytes of data, Athena may not be the most cost-effective choice.
  3. Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for the queries. You have no visibility into the underlying infrastructure or even into the details as to why the query failed or how it’s performing. This visibility is important from a query tuning and consistency standpoint and even to reduce the amount of data scanned in a query.
  4. Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
  5. Feature delays: Presto is evolving at an expedited rate, with new performance features, SQL functions, and optimizations being contributed periodically by the community as well as companies such as Facebook, Alibaba, Uber, and others. Amazon caught up with version 0.217 only in Nov 2020. With the current version of PrestoDB being 0.248, if you need the performance, features, and efficiencies that newer versions provide, you are going to have to wait for some time.

What are the typical alternatives to Athena?

Depending upon a user’s business need and the level of control desired, users leverage one or more of the following options:

DIY open-source PrestoDB

Instead of using Athena, users deploy open-source PrestoDB in their environment (either On-Premises or in the Cloud). This mode of deployment gives the user the most amount of flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage and monitor the deployment. 

Large scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters or companies like FAANG (Facebook, Amazon, Apple, Netflix, Google) and tech-savvy startups such as Uber, Pinterest, just to name a few.

The cost of managing an additional PrestoDB cluster may be incremental for a customer already managing large distributed systems, however, for customers starting from scratch, this can be an exponential increase in cost.

Managed Hadoop and Presto

Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.

AWS provides EMR (Elastic Map Reduce), Google provides Data Proc and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their version of Hadoop.

This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would traditionally do in a DIY model, but instead are only responsible for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scale-out becomes easier at the push of a button, as costs can be further optimized by autoscaling using either on-demand or spot instances.

You still need to have the expertise to get the most of your deployment by tweaking configurations, instance sizes, and properties.

Managed Presto Service

If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.

Ahana Cloud provides a fully managed Presto cloud service, with a wide range of native Presto connectors support, IO caching, optimized configurations for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana’s service is cloud-native and runs on Amazon’s Elastic Kubernetes Service (EKS) to provide resiliency, performance, scalability and also helps reduce your operational costs. 

A managed Presto Service such as Ahana gives you the visibility you need in terms of query performance, instance utilization, security, auditing, query plans as well as gives you the ability to manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimum defaults and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters. You can also scale up and down depending upon your workload needs.

Ahana is a premier member of the Linux Foundation’s Presto Foundation and contributes many features back to the open-source Presto community, unlike Athena, Presto EMR, Data Proc, and HDInsight. 

Summary

You have a wide variety of options regarding your use of PrestoDB. 

If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you. 

On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go. 

If simplicity and accelerated go-to-market are what you seek without needing to manage a complex infrastructure, then Ahana’s Presto managed service is the way to go. Sign up for our free trial today.

We also have a case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment. You can download it here.

Athena Limitations & AWS Athena Limits

Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. It has a serverless architecture and Athena users pay per query (it’s priced at $5 per terabyte scanned). Some of the common Amazon Athena limits are technical limitations that include query limits, concurrent queries limits, and partition limits. These limits can slow queries down and increase operational costs. Plus, AWS Athena is built on an old version of PrestoDB and only supports a subset of PrestoDB features.

An overview on AWS Athena limits

AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Some limits are hard, while others are soft quotas that you can request AWS to increase. One big limitation is on concurrency: Athena users can only submit one query at a time and can only run up to five queries simultaneously for each account by default.

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits: 

1.    Athena DDL max query limit: 20 active DDL queries.

2.    Athena DDL query timeout limit: The Athena DDL query timeout is 600 minutes.

3.    Athena DML query limit: Athena only allows you to have 25 DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions by default.

4.    Athena DML query timeout limit: The Athena DML query timeout limit is 30 minutes. 

5.    Athena query string length limit: The Athena query string hard limit is 262,144 bytes. 

AWS Athena partition limits

  1. Athena’s users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 per table and Glue’s limit is 1,000,000 partitions per table.
  2. A Create Table As (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation you must manually chop up your data by running a series of INSERT INTOs that insert up to 100 partitions each.
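As a sketch of that workaround (the table, column, and date values below are hypothetical), you first create the table with CTAS covering at most 100 partitions, then add further batches with INSERT INTO ... SELECT, each filtered to at most 100 partitions:

```sql
-- Step 1: CTAS creates the partitioned table with the first batch
-- (the date range below spans exactly 100 days, i.e. 100 partitions).
CREATE TABLE sales_partitioned
WITH (partitioned_by = ARRAY['sale_date'])
AS SELECT item, amount, sale_date
   FROM sales_raw
   WHERE sale_date BETWEEN DATE '2021-01-01' AND DATE '2021-04-10';

-- Step 2: each subsequent INSERT INTO adds up to 100 more partitions.
INSERT INTO sales_partitioned
SELECT item, amount, sale_date
FROM sales_raw
WHERE sale_date BETWEEN DATE '2021-04-11' AND DATE '2021-07-19';
```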

Athena database limits

AWS Athena also has the following S3 bucket limitations: 

1.    Amazon S3 bucket limit is 100 buckets per account by default – you can request to increase it up to 1,000 S3 buckets per account.           

2.    Athena restricts each account to 100 databases, and databases cannot include over 100 tables.

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An AWS Athena alternative is deploying your own PrestoDB cluster. AWS Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won’t have AWS Athena limitations such as the concurrent query limits, database limits, table limits, partition limits, etc. Plus you’ll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation’s Presto Foundation. It has a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service without the limits of AWS Athena.

You can use Ahana to query and analyze AWS data lakes stored in Amazon S3, and many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes (EKS), helping you to reduce operational costs with its automated cluster management, speed and ease of use. Ahana is a SaaS offering via a beautiful and easy to use console UI. Anyone at any knowledge level can use it with ease, there is zero configuration effort and no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Up next: AWS Athena Query Limits