Presto Geospatial Functions Tutorial

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set-up some test data in two tables.  The first table is trips_table which will store 3 rows each representing simple taxi trips. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (

trip_id int, 

orig_long double, 

orig_lat double, 

dest_long double, 

dest_lat double, 

trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast(‘2021-03-02 09:00:00 UTC’ as timestamp));

insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast(‘2021-03-02 09:30:00 UTC’ as timestamp));

insert into trips_table values (3, 48.858965, 2.293497,48.859952, 2.340328, cast(‘2021-03-02 09:45:00 UTC’ as timestamp));

insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast(‘2021-03-02 10:45:00 UTC’ as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table with each row storing the shape representing an area e.g. central London, and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area: 

create table city_table (

geo_shape varchar, 

name varchar);

insert into city_table values (‘POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))’, ‘London, central’);

insert into city_table values(‘POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))’, ‘Los Angeles, downtown’);

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first pair coordinates for the shape are the same as the last pair – so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coords using the POLYGON(()) function.  The double brackets are required. 
Simple polygon to represent downtown LA 
(credit: “Geospatial Data Analysis using SQL”) 

Now lets run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_CONTAINS() – if that point exists in any of our shapes.  This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips FROM trips_table as t JOIN city_table as c ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat)) GROUP BY 1;

         

CityTrips
Los Angeles, downtown1
London, central
(2 rows)
2

We see both London trips made it into the result set, despite one of the trips ending at the airport which is a way outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates. 

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris. 

In this example you’ve seen some of the benefits of using Ahana Cloud for Presto 

Presto’s Geospatial functions are listed in the Presto documentation.  

Check out our other Learning Center content  at https://ahana.io/learn/ where you’ll find useful tutorials and examples, from using Python with Presto, to Machine Learning and more.  We also cover comparisons between Presto and other technologies. 

Also check https://ahana.io/answers/ for answers to frequently asked questions on all things Presto.

How to Write a Python Script to Connect to a Presto Cluster

If you need a quick tutorial on how to write a python script to connect to a Presto cluster and run a query, you came to the right place.

Prerequisites

In this tutorial we’ll be using a Python3 environment, and the SQLAlchemy.  SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. 

  1. You will need a working python installation – check that you have with:
$ python3 --version
Python 3.9.1

With Python 3.4 or later you will also have pip3 installer, which we’ll use to install two modules.

2. Install the sqlalchemy module using pip3:

$ pip3 install SQLAlchemy
Collecting SQLAlchemy
  Downloading SQLAlchemy-1.3.23-cp39-cp39-macosx_10_14_x86_64.whl (1.2 MB)
Installing collected packages: SQLAlchemy
Successfully installed SQLAlchemy-1.3.23

3. Install the pyhive interface for Presto. PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive. We want the Presto interface:

$ pip3 install 'pyhive[presto]'
Collecting pyhive[presto]
 ...
Successfully built pyhive future
Installing collected packages: six, urllib3, python-dateutil, idna, future, chardet, certifi, requests, pyhive
Successfully installed certifi-2020.12.5 chardet-4.0.0 future-0.18.2 idna-2.10 pyhive-0.6.3 python-dateutil-2.8.1 requests-2.25.1 six-1.15.0 urllib3-1.26.3

Test Script

Now you have python, SQLAchemy and PyHive ready to roll, create this sample script and save it as presto-test.py.  In it we’ll run a simple SELECT query that just returns the Presto cluster’s system timestamp.  Just replace the  user ID, password, cluster name, and metastore (e.g. the Ahana integrated ahana_hive in this case) with your own. Here we’re using an HTTPS connection so the default port number is 443:

# presto-test.py
# simple python script to run a query on a presto cluster and display the result set 
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine("presto://user:password@myPrestoClusterName:443/ahana_hive", connect_args={'protocol': 'https'})
 
with engine.connect() as con:
    rs = con.execute('SELECT now()')
    for row in rs:
        print(row)

Invocation

The script runs the SELECT statement and displays the result set which is a single row:

$ python3 presto-test.py
('2021-02-10 13:16:47.247 UTC',)

There are other methods one can use, like the Presto python client which can be found at https://github.com/prestodb/presto-python-client 

We hope you enjoyed this Presto & Python tutorial.  All the above information works perfectly with Ahana Cloud for Presto too, the fully integrated, cloud-native, managed service for AWS.

Setting-up Presto Resource Groups

Before you start allowing users on your Presto cluster it’s best practice to configure resource groups.  A resource group is a query admission control and workload management mechanism that manages resource allocation. Once configured, Presto’s resource group manager places limits on CPU and/or memory resource usage, can enforce queueing policies on queries, or divide resources up among groups and sub-groups. Each query belongs to a single resource group, and consumes resources from that group (and its “ancestor” groups).  Presto tracks memory and cpu usage of queries and these statistics can be recorded in the jmx catalog. 

Set-up is fairly straight-forward but there are some subtleties to be aware of like for example once a query starts execution, the resource group manager has no control over the query.

There follows a worked example which sets up four resource groups with controls based on memory usage: 

  1. Group1 one can use up to 40% of the (available) cluster memory 
  2. Group2 can use up to 30% of memory
  3. An administration group which has high priority for anyone logging in as “admin”
  4. An adhoc group for all other queries. 

Step 1: You’ll need to add a  etc/resource-groups.properties file with the following contents to enable resource groups:

resource-groups.configuration-manager=file
resource-groups.config-file=etc/resource-groups.json

Step 2: Create your etc/resource-groups.json file with your group1 (40% mem) and group2 (30% mem). Some recommendations:

  • Generally, allow a max global mem usage of 80% for Presto user activity. The remaining 20% is the recommended overhead for Presto to use as workspace.
  • Configure your maxQueued parameter (required) to the maximum number of queued queries you want the cluster to tolerate. Once this limit is reached new queries are rejected with an error message.
  • Configure hardConcurrencyLimit – this is the maximum query concurrency across the cluster. 
  • Configure your softMemoryLimit (your maximum amount of distributed memory each group may use before new queries are queued) for each group. You can specify it as a percentage (i.e. 40%) of the cluster’s memory. 
  • We’ll leave the cpuLimit parameters alone, which are optional anyway. These can be looked at later depending on your workload and performance requirements. 
  • Consider setting  “jmxExport”: true to store the statistics allowing you to monitor the resource behaviour. 
  • I recommend a group for admin use.  I’ve called this simply “admin” in the example json file below.  
  • I recommend an adhoc group for all other queries – a “catch-all”. I added this to the example json file below.

Step 3: You also need to set-up Selectors which let the cluster know who or what you are so you can be assigned a resource group. This can be based on the id of the user running the query, or a “source” string literal that you provide on the CLI with the –source  option. There’s a way to pass this via JDBC calls too.  I have added a “group1” and a “group2” selector to the example configuration below, plus “admin”, and “adhoc” for everything else.  

Step 4: Test it! Examine Presto’s log when you start the cluster to make sure your json config is valid. If all is well run a query with the Presto CLI and specify the resource group like this:

$ presto --server localhost:8090 --source group1
presto> show catalogs;

Check that the group is being picked-up correctly in the Presto UI – here you can see the right group is displayed (“group1”) for my test query:

There is a great example and more detailed description of all the params in the docs at https://prestodb.io/docs/current/admin/resource-groups.html

Once you get the basic resource groups set up you can tune it. You can consider using the optional schedulingPolicy which controls how queued queries are selected to run next. Also if your two resource groups have differing importance you can set their schedulingWeight (default is 1) to control how their queued queries are selected for execution – higher weight = higher priority e.g. users’ adhoc/interactive queries might be set to 10, but batch /  etl job type queries may be left at 1.  You can also have Presto auto-recognise DDL and treat such queries differently with their own group.

Here’s a sample etc/resource-groups.json file with the four groups defined:

{
  "rootGroups": [
    {
      "name": "global"
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "jmxExport": true,
      "subGroups": [
        {
          "name": "group1",
          "softMemoryLimit": "40%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "group2",
          "softMemoryLimit": "30%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
        "name": "adhoc",
        "softMemoryLimit": "10%",
        "hardConcurrencyLimit": 50,
        "maxQueued": 1,
        "schedulingWeight": 10,
        "subGroups": [
          {
            "name": "other",
            "softMemoryLimit": "10%",
            "hardConcurrencyLimit": 2,
            "maxQueued": 1,
            "schedulingWeight": 10,
            "schedulingPolicy": "weighted_fair",
            "subGroups": [
              {
                "name": "${USER}",
                "softMemoryLimit": "10%",
                "hardConcurrencyLimit": 1,
                "maxQueued": 100
              }
            ]
          }
        ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
      {
      "user": "admin",
      "group": "admin"
    },
  {
      "source": ".*group1.*",
      "group": "global.group1"
    },
    {
      "source": ".*group2.*",
      "group": "global.group2"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ]
}

I hope that helps you get you up and running with Presto resource groups.

How To Connect To Presto

Connections to and from Presto fall into three main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. There are many connectors and they are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive which gives query access to data in HDFS and S3 stored in a variety of formats.  Other connectors provide access to relational databases like RedShift, Postgres and MySQL , and to NoSQL sources like Elastic search and Mongo.

2) Applications that work with Presto out of the box – these include Apache Superset, Airpal, and the Presto CLI.

3) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example., you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax: http://<prestoServerHost>:<port>/v1/query. The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u: http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for options exist:

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above How To helps you Connect To Presto.

http://myHost:8080/v1/query/20200926_204458_00000_68x9u
http://<prestoServerHost>:<port>/v1/query
http://myHost:8080/v1/query/20200926_204458_00000_68x9u

How To Access Presto

There are several ways to access and use Presto. Here are 3 of the most common ways:

  1. Command Line Interface

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

Download presto-cli-0.nnn-executable.jar from Presto’s Downloads page at https://prestodb.io/download.html, rename it to presto, make it executable with chmod +x, then run it like this:

./presto --server localhost:8080 --catalog hive --schema default --user xxx --password

On a Mac you can also use Brew to install the Presto CLI.  Simply type brew install presto on the command line.

  1. Apache Superset

Superset enables users to consume data in many different ways: Writing SQL queries, creating tables, creating a visualizations and sophisticated dashboards.  Superset can be downloaded from https://superset.incubator.apache.org

  1.  JDBC/ODBC Drivers

A huge number of tools and apps can be connected to Presto via JDBC and ODBC.  JDBC driver details and download is here: https://prestodb.io/docs/current/installation/jdbc.html

For ODBC several free and paid-for options exist:

In addition some applications have drivers for Presto built-in – Tableau is one such  example.

More tools, APIs and information on how to access presto can be found on the Presto Resources page at https://prestodb.io/resources.html.

Machine Learning With Presto

Presto is an invaluable tool for accessing and combining data, and performing feature engineering and feature extraction, in a scalable way ready for machine learning and other analytics.  Doing these tasks using familiar SQL is the fastest, most efficient way to prepare data for ML. Presto’s connectors, SQL engine and query federation capability give data scientists ready access to the data they need. 

Presto also has functions specific to Machine Learning.  These functions provide machine learning functionality as an aggregation function, enabling you to train Support Vector Machine (SVM) based classifiers and regressors for supervised learning problems. Note SVM is the only supported model at this time. 

In this short article we’ll take a look at these Presto ML Functions: 

  • learn_classifier()  – Receives training data and generates the model.
  • classify() – Receives the model & test data and returns the prediction.
  • features() –  Constructs a feature from numerical values. A feature is a representation of a data set composed of a sequence of pairs of labels.

There follows a worked Classification example showing how these functions are used.

This is how the features function is used to construct a numbered sequence of elements composed of numerical values: 

presto:default> SELECT features(2.0, 4.0, 6.5) AS test_features;
     test_features     
-----------------------
 {0=2.0, 1=4.0, 2=6.5} 
(1 row)

The returned features (correct name for which is an “N-dimensional vector”) are represented as a map-type value, whose key is an index of each feature.
Let’s create some test data and demonstrate all the functions at work.  The advantage of using Presto is that the data could be stored practically anywhere, and for more complex problems there may be several datasets stored across different data stores which Presto can effortlessly combine for us.  In this (well known) flower dataset, species is a label and sepal_length, sepal_width etc are features. This data describes three species of Iris:

use ahana_hive.default;
 
CREATE TABLE iris (
  id int
, sepal_length double
, sepal_width double
, petal_length double
, petal_width double
, species varchar
);
 
INSERT INTO iris VALUES 
(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
(2, 4.9, 3, 1.4, 0.2, 'Iris-setosa'),
(3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
(4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'),
(5, 5, 3.6, 1.4, 0.2, 'Iris-setosa'),
(6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'),
(7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'),
(8, 5, 3.4, 1.5, 0.2, 'Iris-setosa'),
(9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa'),
(10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(11, 5.4, 3.7, 1.5, 0.2, 'Iris-setosa'),
(12, 4.8, 3.4, 1.6, 0.2, 'Iris-setosa'),
(13, 4.8, 3, 1.4, 0.1, 'Iris-setosa'),
(14, 4.3, 3, 1.1, 0.1, 'Iris-setosa'),
(15, 5.8, 4, 1.2, 0.2, 'Iris-setosa'),
(16, 5.7, 4.4, 1.5, 0.4, 'Iris-setosa'),
(17, 5.4, 3.9, 1.3, 0.4, 'Iris-setosa'),
(18, 5.1, 3.5, 1.4, 0.3, 'Iris-setosa'),
(19, 5.7, 3.8, 1.7, 0.3, 'Iris-setosa'),
(20, 5.1, 3.8, 1.5, 0.3, 'Iris-setosa'),
(21, 5.4, 3.4, 1.7, 0.2, 'Iris-setosa'),
(22, 5.1, 3.7, 1.5, 0.4, 'Iris-setosa'),
(23, 4.6, 3.6, 1, 0.2, 'Iris-setosa'),
(24, 5.1, 3.3, 1.7, 0.5, 'Iris-setosa'),
(25, 4.8, 3.4, 1.9, 0.2, 'Iris-setosa'),
(26, 5, 3, 1.6, 0.2, 'Iris-setosa'),
(27, 5, 3.4, 1.6, 0.4, 'Iris-setosa'),
(28, 5.2, 3.5, 1.5, 0.2, 'Iris-setosa'),
(29, 5.2, 3.4, 1.4, 0.2, 'Iris-setosa'),
(30, 4.7, 3.2, 1.6, 0.2, 'Iris-setosa'),
(31, 4.8, 3.1, 1.6, 0.2, 'Iris-setosa'),
(32, 5.4, 3.4, 1.5, 0.4, 'Iris-setosa'),
(33, 5.2, 4.1, 1.5, 0.1, 'Iris-setosa'),
(34, 5.5, 4.2, 1.4, 0.2, 'Iris-setosa'),
(35, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(36, 5, 3.2, 1.2, 0.2, 'Iris-setosa'),
(37, 5.5, 3.5, 1.3, 0.2, 'Iris-setosa'),
(38, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(39, 4.4, 3, 1.3, 0.2, 'Iris-setosa'),
(40, 5.1, 3.4, 1.5, 0.2, 'Iris-setosa'),
(41, 5, 3.5, 1.3, 0.3, 'Iris-setosa'),
(42, 4.5, 2.3, 1.3, 0.3, 'Iris-setosa'),
(43, 4.4, 3.2, 1.3, 0.2, 'Iris-setosa'),
(44, 5, 3.5, 1.6, 0.6, 'Iris-setosa'),
(45, 5.1, 3.8, 1.9, 0.4, 'Iris-setosa'),
(46, 4.8, 3, 1.4, 0.3, 'Iris-setosa'),
(47, 5.1, 3.8, 1.6, 0.2, 'Iris-setosa'),
(48, 4.6, 3.2, 1.4, 0.2, 'Iris-setosa'),
(49, 5.3, 3.7, 1.5, 0.2, 'Iris-setosa'),
(50, 5, 3.3, 1.4, 0.2, 'Iris-setosa'),
(51, 7, 3.2, 4.7, 1.4, 'Iris-versicolor'),
(52, 6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
(53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'),
(54, 5.5, 2.3, 4, 1.3, 'Iris-versicolor'),
(55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'),
(56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'),
(57, 6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'),
(58, 4.9, 2.4, 3.3, 1, 'Iris-versicolor'),
(59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'),
(60, 5.2, 2.7, 3.9, 1.4, 'Iris-versicolor'),
(61, 5, 2, 3.5, 1, 'Iris-versicolor'),
(62, 5.9, 3, 4.2, 1.5, 'Iris-versicolor'),
(63, 6, 2.2, 4, 1, 'Iris-versicolor'),
(64, 6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'),
(65, 5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'),
(66, 6.7, 3.1, 4.4, 1.4, 'Iris-versicolor'),
(67, 5.6, 3, 4.5, 1.5, 'Iris-versicolor'),
(68, 5.8, 2.7, 4.1, 1, 'Iris-versicolor'),
(69, 6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'),
(70, 5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'),
(71, 5.9, 3.2, 4.8, 1.8, 'Iris-versicolor'),
(72, 6.1, 2.8, 4, 1.3, 'Iris-versicolor'),
(73, 6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'),
(74, 6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'),
(75, 6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(76, 6.6, 3, 4.4, 1.4, 'Iris-versicolor'),
(77, 6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'),
(78, 6.7, 3, 5, 1.7, 'Iris-versicolor'),
(79, 6, 2.9, 4.5, 1.5, 'Iris-versicolor'),
(80, 5.7, 2.6, 3.5, 1, 'Iris-versicolor'),
(81, 5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'),
(82, 5.5, 2.4, 3.7, 1, 'Iris-versicolor'),
(83, 5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'),
(84, 6, 2.7, 5.1, 1.6, 'Iris-versicolor'),
(85, 5.4, 3, 4.5, 1.5, 'Iris-versicolor'),
(86, 6, 3.4, 4.5, 1.6, 'Iris-versicolor'),
(87, 6.7, 3.1, 4.7, 1.5, 'Iris-versicolor'),
(88, 6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'),
(89, 5.6, 3, 4.1, 1.3, 'Iris-versicolor'),
(90, 5.5, 2.5, 4, 1.3, 'Iris-versicolor'),
(91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'),
(92, 6.1, 3, 4.6, 1.4, 'Iris-versicolor'),
(93, 5.8, 2.6, 4, 1.2, 'Iris-versicolor'),
(94, 5, 2.3, 3.3, 1, 'Iris-versicolor'),
(95, 5.6, 2.7, 4.2, 1.3, 'Iris-versicolor'),
(96, 5.7, 3, 4.2, 1.2, 'Iris-versicolor'),
(97, 5.7, 2.9, 4.2, 1.3, 'Iris-versicolor'),
(98, 6.2, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(99, 5.1, 2.5, 3, 1.1, 'Iris-versicolor'),
(100, 5.7, 2.8, 4.1, 1.3, 'Iris-versicolor'),
(101, 6.3, 3.3, 6, 2.5, 'Iris-virginica'),
(102, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(103, 7.1, 3, 5.9, 2.1, 'Iris-virginica'),
(104, 6.3, 2.9, 5.6, 1.8, 'Iris-virginica'),
(105, 6.5, 3, 5.8, 2.2, 'Iris-virginica'),
(106, 7.6, 3, 6.6, 2.1, 'Iris-virginica'),
(107, 4.9, 2.5, 4.5, 1.7, 'Iris-virginica'),
(108, 7.3, 2.9, 6.3, 1.8, 'Iris-virginica'),
(109, 6.7, 2.5, 5.8, 1.8, 'Iris-virginica'),
(110, 7.2, 3.6, 6.1, 2.5, 'Iris-virginica'),
(111, 6.5, 3.2, 5.1, 2, 'Iris-virginica'),
(112, 6.4, 2.7, 5.3, 1.9, 'Iris-virginica'),
(113, 6.8, 3, 5.5, 2.1, 'Iris-virginica'),
(114, 5.7, 2.5, 5, 2, 'Iris-virginica'),
(115, 5.8, 2.8, 5.1, 2.4, 'Iris-virginica'),
(116, 6.4, 3.2, 5.3, 2.3, 'Iris-virginica'),
(117, 6.5, 3, 5.5, 1.8, 'Iris-virginica'),
(118, 7.7, 3.8, 6.7, 2.2, 'Iris-virginica'),
(119, 7.7, 2.6, 6.9, 2.3, 'Iris-virginica'),
(120, 6, 2.2, 5, 1.5, 'Iris-virginica'),
(121, 6.9, 3.2, 5.7, 2.3, 'Iris-virginica'),
(122, 5.6, 2.8, 4.9, 2, 'Iris-virginica'),
(123, 7.7, 2.8, 6.7, 2, 'Iris-virginica'),
(124, 6.3, 2.7, 4.9, 1.8, 'Iris-virginica'),
(125, 6.7, 3.3, 5.7, 2.1, 'Iris-virginica'),
(126, 7.2, 3.2, 6, 1.8, 'Iris-virginica'),
(127, 6.2, 2.8, 4.8, 1.8, 'Iris-virginica'),
(128, 6.1, 3, 4.9, 1.8, 'Iris-virginica'),
(129, 6.4, 2.8, 5.6, 2.1, 'Iris-virginica'),
(130, 7.2, 3, 5.8, 1.6, 'Iris-virginica'),
(131, 7.4, 2.8, 6.1, 1.9, 'Iris-virginica'),
(132, 7.9, 3.8, 6.4, 2, 'Iris-virginica'),
(133, 6.4, 2.8, 5.6, 2.2, 'Iris-virginica'),
(134, 6.3, 2.8, 5.1, 1.5, 'Iris-virginica'),
(135, 6.1, 2.6, 5.6, 1.4, 'Iris-virginica'),
(136, 7.7, 3, 6.1, 2.3, 'Iris-virginica'),
(137, 6.3, 3.4, 5.6, 2.4, 'Iris-virginica'),
(138, 6.4, 3.1, 5.5, 1.8, 'Iris-virginica'),
(139, 6, 3, 4.8, 1.8, 'Iris-virginica'),
(140, 6.9, 3.1, 5.4, 2.1, 'Iris-virginica'),
(141, 6.7, 3.1, 5.6, 2.4, 'Iris-virginica'),
(142, 6.9, 3.1, 5.1, 2.3, 'Iris-virginica'),
(143, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(144, 6.8, 3.2, 5.9, 2.3, 'Iris-virginica'),
(145, 6.7, 3.3, 5.7, 2.5, 'Iris-virginica'),
(146, 6.7, 3, 5.2, 2.3, 'Iris-virginica'),
(147, 6.3, 2.5, 5, 1.9, 'Iris-virginica'),
(148, 6.5, 3, 5.2, 2, 'Iris-virginica'),
(149, 6.2, 3.4, 5.4, 2.3, 'Iris-virginica'),
(150, 5.9, 3, 5.1, 1.8, 'Iris-virginica');

Next we use this dataset and the classify(), features() and learn_classifier() functions to predict the species of flower based on some arbitrary features (measurements) that we provide.  The exact features used in the following example are not present in the iris dataset so it’s not a simple lookup – the model predicts the label (the flower species) for us:

select classify(features(5.8, 3, 5.1, 1.8), model) AS prediction 
FROM (
 SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width)) AS model FROM iris
 ) t;
 
   prediction   
----------------
 Iris-virginica 
(1 row)

The model predicts our measurements are an “Iris-virginica”. How does it work? The classify() function returns the predicted label by using the trained model. The trained model is constructed by the learn_classifier() function.  Using this approach means we cannot save the model so we pass it using the nested query (SELECT learn_classifier…) each time.

We can optionally view the trained model, but it does not make much sense to a human. Run the function to train the classification model again but omit the classify() function:

SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width) ) AS model
FROM iris;
 
                      model                      
-------------------------------------------------
 3c 43 6c 61 73 73 69 66 69 65 72 28 76 61 72 63 
 68 61 72 29 3e                                  
(1 row)

The trained model is returned in a serialized format. 

We hope the above Presto Machine Learning article and Classification example proves useful.  Stay tuned for a follow-up article introducing Regression – another type of supervised learning problem that Presto can help with.

How to Make The Presto Query Engine Run Fastest

How do you make the Presto query engine run as fast as possible?  The following list of things to consider, check and tune will help you get the most out of your Presto clusters:

  • How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger server/cloud instance, as more CPU and memory could help.

  • Have you configured Presto and memory usage correctly?  It is often necessary to change the default memory configuration based on your cluster’s capacity.  The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb:  In each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.

  • What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 type:
  • Optimize your metadata / data catalog:  Using Presto’s Hive connector for your metastore, like many users do, will mean practically every query will access the Hive metastore for table and partition details etc.  During peak time that generates a high load on the metastore which can slow down query performance. To alleviate this consider:
  1. Setup multiple catalogs. Configure PrestoDB to use multiple thrift metastore endpoints – Presto’s Hive connector supports configuring multiple hive metastore endpoints which are tried in round-robin by the coordinator. https://prestodb.io/docs/current/connector/hive.html 
  2. Enable Hive metastore and carefully tweak cache eviction configurations and TTLs suitable for your data refresh policies

  • Do you have a separate coordinator node? With PrestoDB you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes but it’s obviously not optimal in terms of performance.  It is nearly always recommended to have the coordinator running on a separate node to the workers for anything other than sandbox use.  Tip:  Check your nodes’ Presto etc/config.properties files to determine which one is the coordinator (look for coordinator=true)

  • Is memory exhausted? Presto uses an in-memory, pipelining processing architecture and its operation is dependent on the available JVM which in turn is dependent on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running in.   
    • The workers can be memory hungry when processing very large queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to using a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload, the resources available, etc.  It’s not uncommon to see the coordinator alone consuming several tens’ of GBs of memory. 

The good news is there is memory information available in at least two places:

1. Presto’s built-in JMX catalog can help your monitor memory usage with various counters.  Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking.

2. There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.

  • When was the last time you restarted your Presto cluster? Sometimes, restarting any kind of  software can solve all sorts of issues, including memory leaks and garbage collection.
  • Is your Presto cluster configured for autoscaling based on CPU usage?  If so check the configuration is what you expect it to be.

  • Does IO and CPU utilisation look balanced?  Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate the number of Presto worker threads can be made higher, or the number of splits in a batch is not high enough. You can check these stats in the Presto Console’s UI “Cluster Overview”.
  • Have you checked your data volumes recently? An obvious one to check but data volumes can grow unexpectedly and sometimes growth peaks occur without you noticing. The queries may simply be taking longer because there is x% more data than last month.

Other configuration settings for Task concurrency, initial splits per node, join strategy, driver tasks… PrestoDB has around 82 system configurations and 50+ hive configuration settings which users can tweak, many at the query level. These are however for advanced users, which falls outside the scope of this article. More information can be found in the PrestoDB documentation.

We hope the above information makes your Presto clusters run as fast as they can. Click here for more about Ahana

What Is A Presto Connection & How Does It Work?

A Presto connection, connections to and from Presto, fall into two main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. The connectors are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive which gives query access to HDFS and S3 data stored in a variety of formats.

2) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example., you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following Presto JDBC connector URL connection string formats are supported:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for options exist:

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above information helps you with creating Presto Connections.

How to Presto Data Share

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables data sharing in two ways. 

1) With its broad list of data connectors, Presto provides a simple way to exploit data virtualization with federated SQL queries accessing data across multiple data sources.  Data in RDBMSs, NoSQL databases, legacy data warehouses, files in object storage can all be accessed,  and combined in a single SQL query if required.  Presto allows querying data where it lives, like in Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization by a variety of users and applications.  A complete list of Presto’s connectors can be found at  https://prestodb.io/docs/current/connector.html 

2) sharing access to data by connecting your Presto cluster to other systems, tools and applications is achieved using the Presto API or using SQL via client libraries, ODBC or JDBC interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example., you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query 

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for options exist:

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries 

When it comes to Presto data usage and share, we hope the above information is useful.

Presto Performance: Speed, Optimization & Tuning

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Performance is particularly important for ad-hoc analytical queries so Presto has a number of design features to maximize Presto speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design.  In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are MANY different types of storage that can be queried by Presto, some faster than others. So if you can choose the fastest data source this will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both. Or switching from accessing a data source that is busy dealing with lots of users and therefore high levels of contention, to an alternative – perhaps a read replica of a database. Or creating indexes, or accessing a pre-aggregated version of the data. Or perhaps moving portions of frequently used data from object storage to a faster storage layer like a RDBMS in order to meet a strict query SLA.  Other suggestions include switching to one of Presto’s supported file formats that features performance optimizations like ORC or Parquet, and consider enabling compression. 

Presto Performance Tips

The following recommendations can help you achieve maximum performance from your Presto clusters:  

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before your start Presto.  A useful rule of thumb is:  In each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible.  Use the sql ANALYZE TABLE <tablename>  command to do this. Repeat the ANALYZE TABLE commands for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data arrived  / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore.  For example to sync the metastore with the partitioning in the table default.customer  use CALL system.sync_partition_metadata(‘default’, ‘customer’, ‘full’); Do this right after you create the table, and repeat this when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type=’AUTOMATIC’;  
    • SET session join_reordering_strategy=’AUTOMATIC’; 

(both SETs are needed – and are persisted until session has ended/log out).  

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 


Finally, to assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service. 

We hope these Presto performance recommendations help you get the most out of Presto. If you’re interested in getting more information about Ahana, the easiest managed service for Presto on AWS, check out the case study comparing AWS Athena to Ahana.

Presto REST API

Presto’s HTTP API (Presto Rest API) is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select from now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto Parquet

Parquet is a columnar storage format for Hadoop, supported by Presto.  Storing data as columns as opposed to rows enables Presto to fulfill queries more efficiently. By discarding unwanted data in rows, and preventing unnecessary scans, columnar storage saves disk space and improves query performance for larger data sets.

In Parquet, data is first horizontally partitioned into groups of rows, then within each group, data is vertically partitioned into columns. Data for a particular column is stored together via compression and encoding to save space and improve performance. Each Parquet Presto file has a footer that stores codecs, encoding information, as well as column-level statistics, e.g., the minimum and maximum number of column values. 

There have been several improvements made to Presto’s Parquet reader by the community, most notably by Uber, to enhance performance with features such as pushdown.  

For developers, a useful tool-set for working with Presto Parquet files is parquet-tools which allows you to see the schema contained in the file, view data, examine data types etc.  Parquet-tools can be installed by pip https://pypi.org/project/parquet-tools/ or installed on a Mac with brew install parquet-tools or cloned and built from https://github.com/apache/parquet-mr

Presto’s Parquet support is a popular choice for data storage.

Presto Speed

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design. In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are MANY different types of storage that can be queried by Presto, some faster than others. So if you can choose the fastest data source this will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both. Or switching from accessing a data source that is busy dealing with lots of users and therefore high levels of contention, to an alternative – perhaps a read replica of a database. Or creating indexes, or accessing a pre-aggregated version of the data. Or perhaps moving portions of frequently used data from object storage to a faster storage layer like a RDBMS in order to meet a strict query SLA.  Other suggestions include switching to one of Presto’s supported file formats that features performance optimizations like ORC or Parquet, and consider enabling compression. 


To assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more Presto speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service

Presto Tools

Several tools  and libraries are available, some are bundled with core Presto and others are downloadable extras. See the documentation Resources page at  https://prestodb.io/resources.html for details. 

Core Presto Tools:

  • Presto Console – Included with Presto, this browser-based UI allows administrators to monitor many aspects of their Presto deployment in real time, such as query workload, memory usage, execution plans etc. 

Community Contributed Tools 

Several community tools are listed at https://prestodb.io/resources.html including:

Other tools & Libraries 

Presto Documentation

Presto’s documentation is maintained by the Presto Foundation and the current version is available here: https://prestodb.io/docs/current/ 

Releases can be found at: https://github.com/prestodb and the Release Notes are at: https://prestodb.io/docs/current/release.html 

Currently the documentation site has no search function. Use Google to search the docs instead – for example if you want to find the Presto documentation page for window functions just google site:prestodb.io window functions and click the first match.

Some of the most popular and useful documentation pages are listed below for convenience: 

In addition the documentation for Ahana Cloud for Presto – a fully integrated, cloud-native managed service built for AWS – can be found at https://docs.ahana.cloud/docs

That’s your summary of Presto documentation. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Query Example

Since Presto is an ANSI SQL query engine, its SQL will be very familiar to practically anyone who has used a database, despite the fact Presto is not technically a database since it stores no data itself.

Starting with the SELECT statement, the full range of keywords that you would expect to use are supported including DISTINCT,  FROM, WHERE, GROUP BY, HAVING, UNION | INTERSECT | EXCEPT, ORDER B, LIMIT and so on.  More details can be found on the Presto SELECT doc page: https://prestodb.io/docs/current/sql/select.html 

Like most other database products, CTEs are also supported – Common Table Expressions allow you to define a temporary named result set that is available in the execution scope of a statement such as SELECT or INSERT.  Here’s the same query written without and then with a CTE:

--Without CTE
SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
 
--WITH CTE
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Here’s a CTE Presto query example:

presto> WITH my_time_data (the_time_now) as ( select now() ) 
SELECT the_time_now 
FROM my_time_data;
 
             the_time_now              
---------------------------------------
 2021-01-18 12:23:12.167 Europe/London 
(1 row)

The above Presto query example and documentation link should get you up and running writing Presto queries. 

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Best Practices

We’ve put together a list of Presto best practices, based on our experience. We hope you find these helpful.

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before your start Presto.  A useful rule of thumb is:  In each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible.  Use the sql ANALYZE TABLE <tablename>  command to do this. Repeat the ANALYZE TABLE commands for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data arrived  / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore.  For example to sync the metastore with the partitioning in the table default.customer  use CALL system.sync_partition_metadata(‘default’, ‘customer’, ‘full’); Do this right after you create the table, and repeat this when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type=’AUTOMATIC’;  
    • SET session join_reordering_strategy=’AUTOMATIC’; 

(both SETs are needed – and are persisted until session has ended/log out).  

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 

We hope you find these Presto best practices useful. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Guide, Documentation & Support

Presto is a fast, flexible distributed SQL engine created and used by Facebook at scale. O’Reilly’s Learning and Operating Presto ebook is an approachable guide for using Presto and getting Presto into a production deployment. It is intended to make Presto more accessible to anyone who cares about their organization becoming data-driven. It will explain how to begin Presto operations at your organization to derive insights on datasets of any type and any size, wherever they reside. 

Regardless of whether you’re a current user of Presto, new to Presto, or going to provide Presto to your users, the O’Reilly Learning and Operating Presto guide book does an excellent job of explaining what you need to know.  By reading this Presto guide you will learn what Presto is, where it came from, and how it differs from other data warehousing solutions. You’ll discover why Facebook, Uber, Twitter, and cloud providers use Presto and how you can quickly deploy Presto in production. You’ll hear about who uses Presto and the ways they solve data challenges with Presto. 

You can supplement what you learn in this Presto guide by visiting the Presto website, reading the Ahana blog, and participating in Presto’s growing community, which is filled with experienced users, developers, and data architects. You can join the Slack channel, engage via Github Discussions, attend the Presto Virtual Meetups.  All these avenues will help guide you through Presto. You can also view official Presto documentation.

If you’re looking for official support for Presto, take a look at Ahana Cloud. We offer SaaS for Presto with pay as you go pricing and technical support.