Presto 104: Running Presto with Hive Metastore on your Laptop

Introduction

This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.

Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, execution engine – but it does not have other components of the database – this includes the system catalog. In the data lake world, the system catalog where the database scheme resides lives in what is called a Catalog. There are two popular catalogs that have emerged. From the Hadoop world – the Hive Metastore continues to be widely used. Note this is different from the Hive Query Engine. This is the system catalog – where information about the table schemas and their locations lives. In AWS, the Glue catalog is also very popular. 

In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.   

The Hive Metastore is the mapping between the database tables and columns and the objects or files that reside in the data lake. This could be a file system when using HDFS or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workload using Ahana cloud which provides Presto as a managed service with Hive Metastore will be a good choice if you are looking at an easy and performant solution for SQL on AWS S3.

Implementation steps

Step 1

Create a docker network namespace, so that containers could communicate with each other using the network namespace.

C:\Users\rupendran>docker network create presto_networkd0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahanaio has developed a sandbox for prestodb which can be downloaded from docker hub, use the command below to download prestodb sandbox, which comes with all packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete                                                               e8f8aa933633: Pull complete                                                               b7cf38297b9f: Pull complete                                                               a4205d42b3be: Pull complete                                                               81b659bbad2f: Pull complete                                                               3ef606708339: Pull complete                                                               979857535547: Pull complete                                                              
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latestdocker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start the instance of the the prestodb sandbox and name it as coordinator

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator
ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to be running on this container without the worker node. So let’s edit  the config.properties file and set the node-scheduler.include-cooridinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running coordinator. Since we updated the config file to run this instance only as a Presto coordinator and stopped the worker service.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3  ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers and set coordinator to false, http-server.http.port to 8081/8082/8083 respectively for each worker and finally discovery.uri should point to coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will Install and configure hive on the coordinator container.

Install wget procps and tar 

# yum install -y wget procps tar less

Step 9:

Download and install hive and hadoop packages, set HOME and PATH for JAVA,HIVE and HADOOP 

#HIVE_BIN=https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
#HADOOP_BIN=https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz


#wget --quiet ${HIVE_BIN}
#wget --quiet ${HADOOP_BIN}


#tar -xf apache-hive-3.1.2-bin.tar.gz -C /opt
#tar -xf hadoop-3.3.1.tar.gz -C /opt
#mv /opt/apache-hive-3.1.2-bin /opt/hive
#mv /opt/hadoop-3.3.1 /opt/hadoop


#export JAVA_HOME=/usr
#export HIVE_HOME=/opt/hive
#export HADOOP_HOME=/opt/hadoop
#export PATH=$PATH:${HADOOP_HOME}:${HADOOP_HOME}/bin:$HIVE_HOME:/bin:.
#cd /opt/hive

Step 10:

Download additional jars needed to run with S3

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.10.6/aws-java-sdk-core-1.10.6.jar

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar

#wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.jar

#cp aws-java-sdk-core-1.10.6.jar /opt/hadoop/share/hadoop/tools/lib/
#cp aws-java-sdk-s3-1.10.6.jar  /opt/hadoop/share/hadoop/tools/lib/
#cp hadoop-aws-2.8.4.jar  /opt/hadoop/share/hadoop/tools/lib/

echo "export
HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.ja

r:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3
1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar" >>/opt/hive/conf/hive-env.sh

Step 11:

Configure and start hive

cp /opt/hive/conf/hive-default.xml.template /opt/hive/conf/hive-site.xml
mkdir -p /opt/hive/hcatalog/var/log
bin/schematool -dbType derby -initSchema
bin/hcatalog/sbin/hcat_server.sh start

Step 12:

Create /opt/presto-server/etc/catalog/hive.properties file add the hive endpoint to presto, this file needs to be added on both coordinator and worker containers.

If you choose to validate using AWS S3 bucket provide security credentials for the same.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Key>
hive.s3.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use hive as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog hive

Step 15:

Create schema using local or S3 location.

presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;

If you have access to S3 bucket then use the following create command using s3 as destination 

presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;

Step 16:

Hive has option to create two types of table, they are

  • Managed tables 
  • External tables

Managed tables are tightly coupled with data on the destination which means if you delete a table then associated data will also be deleted.

External tables are loosely coupled with data, which means it maintains a pointer to the data.so deletion of the table will not delete data on the external location.

The transactional semantics(ACID) is only supported on managed tables.

We will create managed table under hive.tpch schema

Create table under hive.tpch schema

presto:tpch> create table hive.tpch.lineitem with (format='PARQUET') AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows
Query 20210921_051649_00015_uvkq7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1:48 [6M rows, 0B] [55.4K rows/s, 0B/s]

Step 17:

Do a desc table to see the table.

presto> desc hive.tpch.lineitem     
-> ;    
Column     |    Type     | Extra | Comment
---------------+-------------+-------+--------- 
orderkey      | bigint      |       | 
partkey       | bigint      |       | 
suppkey       | bigint      |       | 
linenumber    | integer     |       | 
quantity      | double      |       | 
extendedprice | double      |       | 
discount      | double      |       | 
tax           | double      |       | 
returnflag    | varchar(1)  |       | 
linestatus    | varchar(1)  |       | 
shipdate      | date        |       | 
commitdate    | date        |       | 
receiptdate   | date        |       | 
shipinstruct  | varchar(25) |       | 
shipmode      | varchar(10) |       | 
comment       | varchar(44) |       |
(16 rows)
Query 20210922_224518_00002_mfm8x, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:08 [16 rows, 1.04KB] [1 rows/s, 129B/s]

Summary

In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.