Hive vs Presto vs Spark

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive provides an SQL-like interface called HiveQL to query large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Commonalities

  • All three projects are community-driven open-source software released under the Apache License.
  • They are distributed “Big Data” software frameworks.
  • BI tools connect to them using JDBC/ODBC.
  • They provide query capabilities on top of Hadoop and AWS S3.
  • They have been tested and deployed at petabyte-scale companies.
  • They can be run on-prem or in the cloud.

Differences

|  | Hive | Presto | Spark |
|---|---|---|---|
| Function | MPP SQL engine | MPP SQL engine | General-purpose execution framework |
| Processing type | Batch processing using Apache Tez or MapReduce compute frameworks | Executes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/O | Optimized directed acyclic graph (DAG) execution engine that actively caches data in memory |
| SQL support | HiveQL | ANSI SQL | Spark SQL |
| Usage | Optimized for query throughput | Optimized for latency | General purpose, often used for data transformation and Machine Learning workloads |
| Use cases | Large data aggregations | Interactive queries and quick data exploration | General purpose, often used for data transformation and Machine Learning workloads |

Conclusion

Which SQL engine to choose depends entirely on your requirements, but if the Presto engine is what you are looking for, we suggest you give Ahana Cloud for Presto a try.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto. It simplifies the ability of cloud and data platform teams of all sizes to provide self-service SQL analytics for their data analysts and scientists. In short, we’ve made it easy to harness the power of Presto without having to worry about thousands of tuning and config parameters, adding data sources, and so on.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.

Ahana Cloud for Presto Versus Amazon EMR

In this brief post, we’ll discuss some of the benefits of Ahana Cloud over Amazon Elastic MapReduce (EMR). While EMR offers optionality in the number of big data compute frameworks, that flexibility comes with operational and configuration burden. When it comes to low-latency interactive querying on big data that just works, Ahana Cloud for Presto offers much lower operational burden and Presto-specific optimizations.

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases. In fact, you can run federated queries across your data sources. Developed at Facebook, Presto is supported by the Presto Foundation, an independent nonprofit organization under the auspices of the Linux Foundation. Presto is used by leading technology companies, such as Facebook, Twitter, Uber, and Netflix.

Amazon EMR is a big data platform hosted in AWS. EMR allows you to provision a cluster with one or more big data technologies, such as Hadoop, Apache Spark, Apache Hive, and Presto. Ahana Cloud for Presto is the easiest cloud-native managed service for Presto, empowering data teams of all sizes. As a focused Presto solution, here are a few of Ahana Cloud’s benefits over Amazon EMR:

Less configuration. Born of the Hadoop era, Presto has many configuration parameters spread across several files that need to be set and tuned correctly. With EMR, you have to configure these yourself. With Ahana Cloud, we tune more than 200 parameters out of the box, so when you spin up a cluster you get excellent query performance from the get-go. Ahana Cloud also provides an out-of-the-box Apache Superset sandbox for administrators to validate connecting to, querying, and visualizing your data.

Easy-to-modify configuration. Ahana Cloud offers the ability not only to spin up and terminate clusters, but also to stop and restart them, allowing you to change the number of Presto workers and add or remove data sources. With EMR, any changes to the number of Presto workers or data sources require a new cluster or manually restarting the services yourself. Further, adding and removing data sources is done through a convenient user interface instead of by modifying low-level configuration files.

Optimizations. As a Presto managed service, Ahana Cloud continually provides Presto-specific optimizations. For example, Ahana recently released data lake I/O caching. Based on the RubiX open source project and enabled with a single click, the cache eliminates redundant reads from your data lake when the same data is read repeatedly. This caching yields up to 5x query performance improvement and up to 85% latency reduction for concurrent workloads. Finally, idle clusters processing no queries can automatically scale down to a single Presto worker to reduce costs while allowing for a quick warm-up.
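The effect of a read-through cache like this can be sketched in a few lines of Python. This is a toy model of the idea only, not the RubiX implementation: repeated reads of the same object are served from memory instead of going back to the data lake.

```python
# Toy read-through cache illustrating data lake I/O caching.
# Conceptual sketch only; RubiX caches file splits on local disk.
class ReadThroughCache:
    def __init__(self, backing_store):
        self.backing_store = backing_store  # simulates the data lake
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1          # served locally: no data lake I/O
            return self.cache[key]
        self.misses += 1            # first read: fetch from the lake and cache
        value = self.backing_store[key]
        self.cache[key] = value
        return value

lake = {"s3://bucket/part-0001": b"row data"}
cache = ReadThroughCache(lake)
for _ in range(5):
    cache.read("s3://bucket/part-0001")
print(cache.hits, cache.misses)  # 4 1
```

Five reads of the same split cost only one trip to the backing store; the other four are cache hits, which is where the latency reduction comes from.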

If you are experienced at tuning Presto and want full control of the infrastructure management, Amazon EMR may be the choice for you. If simplicity and accelerated go-to-market without needing to manage a complex infrastructure are what you seek, then Ahana Cloud for Presto is the way to go. Sign up for our free trial today.

Presto Geospatial Functions Tutorial

A question that often comes up is: “how do I do geospatial queries and spatial joins in Presto?” Fortunately, Presto supports a wealth of geospatial functions and spatial joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which will store four rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
  trip_id int,
  orig_long double,
  orig_lat double,
  dest_long double,
  dest_lat double,
  trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497, 48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table, with each row storing the shape representing an area (e.g. central London) and a name for the shape. We represent the shape as a sequence of coordinates that enclose a specific area:

create table city_table (
  geo_shape varchar,
  name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');

insert into city_table values ('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below).
  • I used a rectangle to represent central London.
  • In each case the first coordinate pair of the shape is the same as the last pair, so we’re describing an enclosed bounding box or polygon.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coords using WKT POLYGON(()) syntax. The double brackets are required.
[Figure: simple polygon representing downtown LA (credit: “Geospatial Data Analysis using SQL”)]
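For reference, the WKT polygon strings above are easy to work with outside Presto too. The following stdlib-only Python sketch (a hypothetical helper, not part of Presto) splits a POLYGON text into coordinate pairs and checks that the ring is closed:

```python
# Parse a WKT-style POLYGON string into a list of (x, y) coordinate pairs.
def parse_polygon(wkt):
    inner = wkt.strip()[len("POLYGON(("):-2]   # drop "POLYGON((" and "))"
    pairs = [tuple(float(v) for v in p.split()) for p in inner.split(",")]
    # A valid ring repeats the first pair as the last pair (closed shape).
    assert pairs[0] == pairs[-1], "polygon ring must be closed"
    return pairs

london = ("POLYGON((51.519287 -0.172316,51.519287 -0.084103,"
          "51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))")
print(parse_polygon(london))  # five pairs; first == last
```

This mirrors the rule in the bullets above: the closing pair must repeat the opening pair or the shape is not a valid enclosed polygon.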

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and use each journey’s originating long/lat coordinates to determine, with ST_Contains(), whether that point lies in any of our shapes. This function requires the polygon to be expressed as a special type, Geometry, so we convert our shape from text using the ST_GeometryFromText() function:

SELECT c.name AS City, count(*) AS Trips
FROM trips_table AS t
JOIN city_table AS c
  ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat))
GROUP BY 1;

         

         City          | Trips
-----------------------+-------
 Los Angeles, downtown |     1
 London, central       |     2
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is way outside the shape we defined for central London. This is because the query uses the originating coordinates for each trip, not the destination coordinates.

Also notice the Paris trip didn’t make it into the result; this is because we did not define a shape for Paris.
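Under the hood, a containment test like ST_Contains boils down to a point-in-polygon check. Here is a minimal ray-casting version in Python, an illustration of the idea only (not Presto’s implementation), using the central London rectangle and trip origins from above:

```python
# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses; an odd number of crossings means the point is inside.
def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge spans the ray's y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Central London rectangle from city_table (lat/long order as in the text).
london = [(51.519287, -0.172316), (51.519287, -0.084103),
          (51.496393, -0.084103), (51.496393, -0.172316)]
print(point_in_polygon(51.50953, -0.13467, london))   # trip 1 origin: True
print(point_in_polygon(48.858965, 2.293497, london))  # Paris origin: False
```

Trip 1’s origin (Piccadilly Circus) falls inside the rectangle, while the Eiffel Tower does not, matching the query result above.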

In this example you’ve seen how easily geospatial queries and spatial joins can be expressed in plain Presto SQL.

Presto’s Geospatial functions are listed in the Presto documentation.  

Check out our other Learning Center content  at https://ahana.io/learn/ where you’ll find useful tutorials and examples, from using Python with Presto, to Machine Learning and more.  We also cover comparisons between Presto and other technologies. 

Also check https://ahana.io/answers/ for answers to frequently asked questions on all things Presto.

How to Write a Python Script to Connect to a Presto Cluster

If you need a quick tutorial on how to write a Python script that connects to a Presto cluster and runs a query, you’ve come to the right place.

Prerequisites

In this tutorial we’ll be using a Python 3 environment and SQLAlchemy, the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

  1. You will need a working Python installation. Check that you have one with:
$ python3 --version
Python 3.9.1

With Python 3.4 or later you will also have the pip3 installer, which we’ll use to install two modules.

2. Install the sqlalchemy module using pip3:

$ pip3 install SQLAlchemy
Collecting SQLAlchemy
  Downloading SQLAlchemy-1.3.23-cp39-cp39-macosx_10_14_x86_64.whl (1.2 MB)
Installing collected packages: SQLAlchemy
Successfully installed SQLAlchemy-1.3.23

3. Install the pyhive interface for Presto. PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive. We want the Presto interface:

$ pip3 install 'pyhive[presto]'
Collecting pyhive[presto]
 ...
Successfully built pyhive future
Installing collected packages: six, urllib3, python-dateutil, idna, future, chardet, certifi, requests, pyhive
Successfully installed certifi-2020.12.5 chardet-4.0.0 future-0.18.2 idna-2.10 pyhive-0.6.3 python-dateutil-2.8.1 requests-2.25.1 six-1.15.0 urllib3-1.26.3

Test Script

Now that you have Python, SQLAlchemy and PyHive ready to roll, create this sample script and save it as presto-test.py. In it we’ll run a simple SELECT query that just returns the Presto cluster’s system timestamp. Just replace the user ID, password, cluster name, and metastore (e.g. the Ahana-integrated ahana_hive in this case) with your own. Here we’re using an HTTPS connection, so the default port number is 443:

# presto-test.py
# simple python script to run a query on a presto cluster and display the result set 
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine("presto://user:password@myPrestoClusterName:443/ahana_hive", connect_args={'protocol': 'https'})
 
with engine.connect() as con:
    rs = con.execute('SELECT now()')
    for row in rs:
        print(row)

Invocation

The script runs the SELECT statement and displays the result set, which is a single row:

$ python3 presto-test.py
('2021-02-10 13:16:47.247 UTC',)
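The row comes back as a tuple of strings. If you want the timestamp as a Python datetime, strip the zone suffix and parse the rest. A small stdlib-only sketch (the format string assumes the 'YYYY-MM-DD HH:MM:SS.mmm UTC' shape shown above):

```python
from datetime import datetime, timezone

def parse_presto_ts(value):
    # Presto renders the timestamp like '2021-02-10 13:16:47.247 UTC';
    # strptime has no directive for a trailing zone *name*, so split it off.
    text, zone = value.rsplit(" ", 1)
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return dt.replace(tzinfo=timezone.utc) if zone == "UTC" else dt

print(parse_presto_ts("2021-02-10 13:16:47.247 UTC"))
```

From here the value behaves like any timezone-aware datetime, so you can compare, format, or convert it as usual.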

There are other methods one can use, like the Presto python client which can be found at https://github.com/prestodb/presto-python-client 

We hope you enjoyed this Presto & Python tutorial.  All the above information works perfectly with Ahana Cloud for Presto too, the fully integrated, cloud-native, managed service for AWS.

Setting Up Presto Resource Groups

Before you start allowing users on your Presto cluster, it’s best practice to configure resource groups. A resource group is a query admission control and workload management mechanism that manages resource allocation. Once configured, Presto’s resource group manager places limits on CPU and/or memory usage, can enforce queueing policies on queries, or divide resources among groups and sub-groups. Each query belongs to a single resource group and consumes resources from that group (and its “ancestor” groups). Presto tracks memory and CPU usage of queries, and these statistics can be recorded in the jmx catalog.

Set-up is fairly straightforward, but there are some subtleties to be aware of. For example, once a query starts executing, the resource group manager has no control over it.

Here is a worked example which sets up four resource groups with controls based on memory usage:

  1. Group1 can use up to 40% of the available cluster memory.
  2. Group2 can use up to 30% of memory.
  3. An administration group with high priority for anyone logging in as “admin”.
  4. An adhoc group for all other queries.

Step 1: You’ll need to add an etc/resource-groups.properties file with the following contents to enable resource groups:

resource-groups.configuration-manager=file
resource-groups.config-file=etc/resource-groups.json

Step 2: Create your etc/resource-groups.json file with your group1 (40% mem) and group2 (30% mem). Some recommendations:

  • Generally, allow a maximum global memory usage of 80% for Presto user activity. The remaining 20% is the recommended overhead for Presto to use as workspace.
  • Configure your maxQueued parameter (required) to the maximum number of queued queries you want the cluster to tolerate. Once this limit is reached, new queries are rejected with an error message.
  • Configure hardConcurrencyLimit: the maximum query concurrency across the cluster.
  • Configure your softMemoryLimit (the maximum amount of distributed memory each group may use before new queries are queued) for each group. You can specify it as a percentage (e.g. 40%) of the cluster’s memory.
  • We’ll leave the cpuLimit parameters alone, which are optional anyway. These can be looked at later depending on your workload and performance requirements. 
  • Consider setting  “jmxExport”: true to store the statistics allowing you to monitor the resource behaviour. 
  • I recommend a group for admin use.  I’ve called this simply “admin” in the example json file below.  
  • I recommend an adhoc group for all other queries – a “catch-all”. I added this to the example json file below.
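Before restarting the cluster it’s worth sanity-checking the JSON you wrote. Here is a small Python sketch (my own helper, not part of Presto) that walks the group tree and flags a missing required maxQueued or a suspicious softMemoryLimit:

```python
import json

# Minimal sanity check for a resource-groups.json file: every group needs
# maxQueued, and softMemoryLimit should be a percentage or byte quantity.
def check_groups(groups, path=""):
    problems = []
    for g in groups:
        name = f"{path}.{g.get('name', '?')}".lstrip(".")
        if "maxQueued" not in g:
            problems.append(f"{name}: missing required maxQueued")
        limit = g.get("softMemoryLimit", "")
        if limit and not (limit.endswith("%") or limit[-2:] in ("kB", "MB", "GB")):
            problems.append(f"{name}: suspicious softMemoryLimit {limit!r}")
        problems += check_groups(g.get("subGroups", []), name)  # recurse
    return problems

config = json.loads("""{"rootGroups": [
  {"name": "global", "softMemoryLimit": "80%", "maxQueued": 1000,
   "subGroups": [{"name": "group1", "softMemoryLimit": "40%"}]}]}""")
print(check_groups(config["rootGroups"]))  # group1 is missing maxQueued
```

A check like this catches the most common cause of a cluster failing to start after a resource-group change: a required parameter dropped from one subgroup.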

Step 3: You also need to set up Selectors, which let the cluster know who or what you are so you can be assigned a resource group. Selection can be based on the ID of the user running the query, or on a “source” string literal that you provide on the CLI with the --source option (there’s a way to pass this via JDBC calls too). I have added a “group1” and a “group2” selector to the example configuration below, plus “admin”, and “adhoc” for everything else.

Step 4: Test it! Examine Presto’s log when you start the cluster to make sure your JSON config is valid. If all is well, run a query with the Presto CLI and specify the resource group like this:

$ presto --server localhost:8090 --source group1
presto> show catalogs;

Check in the Presto UI that the group is being picked up correctly: for my test query, the correct group (“group1”) was displayed.

There is a great example and more detailed description of all the params in the docs at https://prestodb.io/docs/current/admin/resource-groups.html

Once you get basic resource groups set up, you can tune them. Consider using the optional schedulingPolicy, which controls how queued queries are selected to run next. Also, if your two resource groups have differing importance, you can set their schedulingWeight (default is 1) to control how their queued queries are selected for execution: higher weight means higher priority. For example, users’ adhoc/interactive queries might be set to 10, while batch/ETL queries may be left at 1. You can also have Presto auto-recognise DDL and treat such queries differently with their own group.
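To see what a schedulingWeight of 10 versus 1 means in practice, here is a toy weighted scheduler in Python (a conceptual model only, not Presto’s algorithm): each group accumulates credit proportional to its weight, and the group with the most credit runs next.

```python
# Toy weighted scheduling: credit accrues per round in proportion to
# weight; the group with the most credit is picked and pays the total.
def schedule(weights, rounds):
    credit = {g: 0.0 for g in weights}
    order = []
    for _ in range(rounds):
        for g, w in weights.items():
            credit[g] += w
        chosen = max(credit, key=credit.get)
        credit[chosen] -= sum(weights.values())
        order.append(chosen)
    return order

picks = schedule({"adhoc": 10, "batch": 1}, 11)
print(picks.count("adhoc"), picks.count("batch"))  # 10 1
```

Over 11 scheduling rounds the weight-10 group runs 10 times and the weight-1 group once, which is the intuition behind giving interactive queries a higher schedulingWeight than batch jobs.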

Here’s a sample etc/resource-groups.json file with the four groups defined:

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "jmxExport": true,
      "subGroups": [
        {
          "name": "group1",
          "softMemoryLimit": "40%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "group2",
          "softMemoryLimit": "30%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "admin",
      "group": "admin"
    },
    {
      "source": ".*group1.*",
      "group": "global.group1"
    },
    {
      "source": ".*group2.*",
      "group": "global.group2"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ]
}
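Selector matching itself is just ordered pattern matching against the session’s user and source, with the first matching selector winning. A Python sketch of that logic (an illustration of the behaviour, not Presto’s code), using the selectors above:

```python
import re

# First-match-wins selector resolution, mimicking resource-groups.json.
selectors = [
    {"user": "admin", "group": "admin"},
    {"source": ".*group1.*", "group": "global.group1"},
    {"source": ".*group2.*", "group": "global.group2"},
    {"group": "global.adhoc.other.${USER}"},  # catch-all
]

def resolve(user, source):
    for s in selectors:
        if "user" in s and not re.fullmatch(s["user"], user):
            continue
        if "source" in s and not re.fullmatch(s["source"], source):
            continue
        return s["group"].replace("${USER}", user)
    return None

print(resolve("admin", "cli"))        # admin
print(resolve("bob", "etl-group1"))   # global.group1
print(resolve("alice", "superset"))   # global.adhoc.other.alice
```

This shows why selector order matters: a user called admin matches the first selector regardless of source, and anyone who matches nothing falls through to the per-user adhoc group.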

I hope that helps you get you up and running with Presto resource groups.

How To Connect To Presto

Connections to and from Presto fall into three main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. There are many connectors, listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive, which gives query access to data in HDFS and S3 stored in a variety of formats. Other connectors provide access to relational databases like Redshift, Postgres and MySQL, and to NoSQL sources like Elasticsearch and MongoDB.

2) Applications that work with Presto out of the box, including Apache Superset, Airpal, and the Presto CLI.

3) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax http://<prestoServerHost>:<port>/v1/query. The default port is 8080.

You can optionally specify a query ID; in this example the query ID is 20200926_204458_00000_68x9u: http://myHost:8080/v1/query/20200926_204458_00000_68x9u
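With just the Python standard library you can build these URLs and fetch the query list. A hedged sketch: the helper names are my own, and the fetch assumes a coordinator is actually reachable at the host and port you pass.

```python
import json
from urllib.request import urlopen

# Build the REST URL for Presto's query endpoint; query_id is optional.
def query_url(host, port=8080, query_id=None):
    url = f"http://{host}:{port}/v1/query"
    return f"{url}/{query_id}" if query_id else url

def recent_query_ids(host, port=8080):
    # Each entry in the JSON array carries a "queryId" field.
    with urlopen(query_url(host, port)) as resp:
        return [q["queryId"] for q in json.load(resp)]

print(query_url("myHost"))
print(query_url("myHost", query_id="20200926_204458_00000_68x9u"))
```

Calling recent_query_ids("myHost") against a live cluster returns the IDs you can then feed back into query_url to inspect an individual query.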

JDBC

Presto can be accessed using SQL from Java via the JDBC driver. The download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port
jdbc:presto://host:port/catalog
jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC drivers exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above How To helps you Connect To Presto.


How To Access Presto

There are several ways to access and use Presto. Here are 3 of the most common ways:

  1. Command Line Interface

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

Download presto-cli-0.nnn-executable.jar from Presto’s Downloads page at https://prestodb.io/download.html, rename it to presto, make it executable with chmod +x, then run it like this:

./presto --server localhost:8080 --catalog hive --schema default --user xxx --password

On a Mac you can also use Brew to install the Presto CLI.  Simply type brew install presto on the command line.

  2. Apache Superset

Superset enables users to consume data in many different ways: writing SQL queries, creating tables, and creating visualizations and sophisticated dashboards. Superset can be downloaded from https://superset.incubator.apache.org

  3. JDBC/ODBC Drivers

A huge number of tools and apps can be connected to Presto via JDBC and ODBC. JDBC driver details and download are here: https://prestodb.io/docs/current/installation/jdbc.html

For ODBC, several free and paid-for drivers exist.

In addition, some applications have a built-in Presto driver; Tableau is one such example.

More tools, APIs and information on how to access Presto can be found on the Presto Resources page at https://prestodb.io/resources.html.

Machine Learning With Presto

Presto is an invaluable tool for accessing and combining data, and for performing feature engineering and feature extraction in a scalable way, ready for machine learning and other analytics. Doing these tasks in familiar SQL is the fastest, most efficient way to prepare data for ML. Presto’s connectors, SQL engine and query federation capability give data scientists ready access to the data they need.

Presto also has functions specific to machine learning. These provide ML functionality as aggregation functions, enabling you to train Support Vector Machine (SVM) based classifiers and regressors for supervised learning problems. Note that SVM is the only supported model at this time.

In this short article we’ll take a look at these Presto ML Functions: 

  • learn_classifier() – receives training data and generates the model.
  • classify() – receives the model and test data, and returns the prediction.
  • features() – constructs a feature vector from numerical values: a representation of a data point as a sequence of (index, value) pairs.

A worked classification example follows, showing how these functions are used.

This is how the features() function is used to construct an indexed sequence of numerical values:

presto:default> SELECT features(2.0, 4.0, 6.5) AS test_features;
     test_features     
-----------------------
 {0=2.0, 1=4.0, 2=6.5} 
(1 row)

The returned features (more formally, an N-dimensional feature vector) are represented as a map-type value whose key is the index of each feature.
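In other words, features() is just a positional map from index to value. The equivalent in Python is a one-liner:

```python
# features(2.0, 4.0, 6.5) builds a map from feature index to value,
# i.e. a representation of an N-dimensional feature vector.
def features(*values):
    return {i: float(v) for i, v in enumerate(values)}

print(features(2.0, 4.0, 6.5))  # {0: 2.0, 1: 4.0, 2: 6.5}
```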
Let’s create some test data and demonstrate all the functions at work. The advantage of using Presto is that the data could be stored practically anywhere; for more complex problems there may be several datasets stored across different data stores, which Presto can effortlessly combine for us. In this well-known flower dataset, species is a label, and sepal_length, sepal_width, etc. are features. The data describes three species of Iris:

use ahana_hive.default;
 
CREATE TABLE iris (
  id int
, sepal_length double
, sepal_width double
, petal_length double
, petal_width double
, species varchar
);
 
INSERT INTO iris VALUES 
(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
(2, 4.9, 3, 1.4, 0.2, 'Iris-setosa'),
(3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
(4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'),
(5, 5, 3.6, 1.4, 0.2, 'Iris-setosa'),
(6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'),
(7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'),
(8, 5, 3.4, 1.5, 0.2, 'Iris-setosa'),
(9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa'),
(10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(11, 5.4, 3.7, 1.5, 0.2, 'Iris-setosa'),
(12, 4.8, 3.4, 1.6, 0.2, 'Iris-setosa'),
(13, 4.8, 3, 1.4, 0.1, 'Iris-setosa'),
(14, 4.3, 3, 1.1, 0.1, 'Iris-setosa'),
(15, 5.8, 4, 1.2, 0.2, 'Iris-setosa'),
(16, 5.7, 4.4, 1.5, 0.4, 'Iris-setosa'),
(17, 5.4, 3.9, 1.3, 0.4, 'Iris-setosa'),
(18, 5.1, 3.5, 1.4, 0.3, 'Iris-setosa'),
(19, 5.7, 3.8, 1.7, 0.3, 'Iris-setosa'),
(20, 5.1, 3.8, 1.5, 0.3, 'Iris-setosa'),
(21, 5.4, 3.4, 1.7, 0.2, 'Iris-setosa'),
(22, 5.1, 3.7, 1.5, 0.4, 'Iris-setosa'),
(23, 4.6, 3.6, 1, 0.2, 'Iris-setosa'),
(24, 5.1, 3.3, 1.7, 0.5, 'Iris-setosa'),
(25, 4.8, 3.4, 1.9, 0.2, 'Iris-setosa'),
(26, 5, 3, 1.6, 0.2, 'Iris-setosa'),
(27, 5, 3.4, 1.6, 0.4, 'Iris-setosa'),
(28, 5.2, 3.5, 1.5, 0.2, 'Iris-setosa'),
(29, 5.2, 3.4, 1.4, 0.2, 'Iris-setosa'),
(30, 4.7, 3.2, 1.6, 0.2, 'Iris-setosa'),
(31, 4.8, 3.1, 1.6, 0.2, 'Iris-setosa'),
(32, 5.4, 3.4, 1.5, 0.4, 'Iris-setosa'),
(33, 5.2, 4.1, 1.5, 0.1, 'Iris-setosa'),
(34, 5.5, 4.2, 1.4, 0.2, 'Iris-setosa'),
(35, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(36, 5, 3.2, 1.2, 0.2, 'Iris-setosa'),
(37, 5.5, 3.5, 1.3, 0.2, 'Iris-setosa'),
(38, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(39, 4.4, 3, 1.3, 0.2, 'Iris-setosa'),
(40, 5.1, 3.4, 1.5, 0.2, 'Iris-setosa'),
(41, 5, 3.5, 1.3, 0.3, 'Iris-setosa'),
(42, 4.5, 2.3, 1.3, 0.3, 'Iris-setosa'),
(43, 4.4, 3.2, 1.3, 0.2, 'Iris-setosa'),
(44, 5, 3.5, 1.6, 0.6, 'Iris-setosa'),
(45, 5.1, 3.8, 1.9, 0.4, 'Iris-setosa'),
(46, 4.8, 3, 1.4, 0.3, 'Iris-setosa'),
(47, 5.1, 3.8, 1.6, 0.2, 'Iris-setosa'),
(48, 4.6, 3.2, 1.4, 0.2, 'Iris-setosa'),
(49, 5.3, 3.7, 1.5, 0.2, 'Iris-setosa'),
(50, 5, 3.3, 1.4, 0.2, 'Iris-setosa'),
(51, 7, 3.2, 4.7, 1.4, 'Iris-versicolor'),
(52, 6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
(53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'),
(54, 5.5, 2.3, 4, 1.3, 'Iris-versicolor'),
(55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'),
(56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'),
(57, 6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'),
(58, 4.9, 2.4, 3.3, 1, 'Iris-versicolor'),
(59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'),
(60, 5.2, 2.7, 3.9, 1.4, 'Iris-versicolor'),
(61, 5, 2, 3.5, 1, 'Iris-versicolor'),
(62, 5.9, 3, 4.2, 1.5, 'Iris-versicolor'),
(63, 6, 2.2, 4, 1, 'Iris-versicolor'),
(64, 6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'),
(65, 5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'),
(66, 6.7, 3.1, 4.4, 1.4, 'Iris-versicolor'),
(67, 5.6, 3, 4.5, 1.5, 'Iris-versicolor'),
(68, 5.8, 2.7, 4.1, 1, 'Iris-versicolor'),
(69, 6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'),
(70, 5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'),
(71, 5.9, 3.2, 4.8, 1.8, 'Iris-versicolor'),
(72, 6.1, 2.8, 4, 1.3, 'Iris-versicolor'),
(73, 6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'),
(74, 6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'),
(75, 6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(76, 6.6, 3, 4.4, 1.4, 'Iris-versicolor'),
(77, 6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'),
(78, 6.7, 3, 5, 1.7, 'Iris-versicolor'),
(79, 6, 2.9, 4.5, 1.5, 'Iris-versicolor'),
(80, 5.7, 2.6, 3.5, 1, 'Iris-versicolor'),
(81, 5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'),
(82, 5.5, 2.4, 3.7, 1, 'Iris-versicolor'),
(83, 5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'),
(84, 6, 2.7, 5.1, 1.6, 'Iris-versicolor'),
(85, 5.4, 3, 4.5, 1.5, 'Iris-versicolor'),
(86, 6, 3.4, 4.5, 1.6, 'Iris-versicolor'),
(87, 6.7, 3.1, 4.7, 1.5, 'Iris-versicolor'),
(88, 6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'),
(89, 5.6, 3, 4.1, 1.3, 'Iris-versicolor'),
(90, 5.5, 2.5, 4, 1.3, 'Iris-versicolor'),
(91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'),
(92, 6.1, 3, 4.6, 1.4, 'Iris-versicolor'),
(93, 5.8, 2.6, 4, 1.2, 'Iris-versicolor'),
(94, 5, 2.3, 3.3, 1, 'Iris-versicolor'),
(95, 5.6, 2.7, 4.2, 1.3, 'Iris-versicolor'),
(96, 5.7, 3, 4.2, 1.2, 'Iris-versicolor'),
(97, 5.7, 2.9, 4.2, 1.3, 'Iris-versicolor'),
(98, 6.2, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(99, 5.1, 2.5, 3, 1.1, 'Iris-versicolor'),
(100, 5.7, 2.8, 4.1, 1.3, 'Iris-versicolor'),
(101, 6.3, 3.3, 6, 2.5, 'Iris-virginica'),
(102, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(103, 7.1, 3, 5.9, 2.1, 'Iris-virginica'),
(104, 6.3, 2.9, 5.6, 1.8, 'Iris-virginica'),
(105, 6.5, 3, 5.8, 2.2, 'Iris-virginica'),
(106, 7.6, 3, 6.6, 2.1, 'Iris-virginica'),
(107, 4.9, 2.5, 4.5, 1.7, 'Iris-virginica'),
(108, 7.3, 2.9, 6.3, 1.8, 'Iris-virginica'),
(109, 6.7, 2.5, 5.8, 1.8, 'Iris-virginica'),
(110, 7.2, 3.6, 6.1, 2.5, 'Iris-virginica'),
(111, 6.5, 3.2, 5.1, 2, 'Iris-virginica'),
(112, 6.4, 2.7, 5.3, 1.9, 'Iris-virginica'),
(113, 6.8, 3, 5.5, 2.1, 'Iris-virginica'),
(114, 5.7, 2.5, 5, 2, 'Iris-virginica'),
(115, 5.8, 2.8, 5.1, 2.4, 'Iris-virginica'),
(116, 6.4, 3.2, 5.3, 2.3, 'Iris-virginica'),
(117, 6.5, 3, 5.5, 1.8, 'Iris-virginica'),
(118, 7.7, 3.8, 6.7, 2.2, 'Iris-virginica'),
(119, 7.7, 2.6, 6.9, 2.3, 'Iris-virginica'),
(120, 6, 2.2, 5, 1.5, 'Iris-virginica'),
(121, 6.9, 3.2, 5.7, 2.3, 'Iris-virginica'),
(122, 5.6, 2.8, 4.9, 2, 'Iris-virginica'),
(123, 7.7, 2.8, 6.7, 2, 'Iris-virginica'),
(124, 6.3, 2.7, 4.9, 1.8, 'Iris-virginica'),
(125, 6.7, 3.3, 5.7, 2.1, 'Iris-virginica'),
(126, 7.2, 3.2, 6, 1.8, 'Iris-virginica'),
(127, 6.2, 2.8, 4.8, 1.8, 'Iris-virginica'),
(128, 6.1, 3, 4.9, 1.8, 'Iris-virginica'),
(129, 6.4, 2.8, 5.6, 2.1, 'Iris-virginica'),
(130, 7.2, 3, 5.8, 1.6, 'Iris-virginica'),
(131, 7.4, 2.8, 6.1, 1.9, 'Iris-virginica'),
(132, 7.9, 3.8, 6.4, 2, 'Iris-virginica'),
(133, 6.4, 2.8, 5.6, 2.2, 'Iris-virginica'),
(134, 6.3, 2.8, 5.1, 1.5, 'Iris-virginica'),
(135, 6.1, 2.6, 5.6, 1.4, 'Iris-virginica'),
(136, 7.7, 3, 6.1, 2.3, 'Iris-virginica'),
(137, 6.3, 3.4, 5.6, 2.4, 'Iris-virginica'),
(138, 6.4, 3.1, 5.5, 1.8, 'Iris-virginica'),
(139, 6, 3, 4.8, 1.8, 'Iris-virginica'),
(140, 6.9, 3.1, 5.4, 2.1, 'Iris-virginica'),
(141, 6.7, 3.1, 5.6, 2.4, 'Iris-virginica'),
(142, 6.9, 3.1, 5.1, 2.3, 'Iris-virginica'),
(143, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(144, 6.8, 3.2, 5.9, 2.3, 'Iris-virginica'),
(145, 6.7, 3.3, 5.7, 2.5, 'Iris-virginica'),
(146, 6.7, 3, 5.2, 2.3, 'Iris-virginica'),
(147, 6.3, 2.5, 5, 1.9, 'Iris-virginica'),
(148, 6.5, 3, 5.2, 2, 'Iris-virginica'),
(149, 6.2, 3.4, 5.4, 2.3, 'Iris-virginica'),
(150, 5.9, 3, 5.1, 1.8, 'Iris-virginica');

Next we use this dataset and the classify(), features() and learn_classifier() functions to predict the species of flower based on some arbitrary features (measurements) that we provide.  The exact features used in the following example are not present in the iris dataset so it’s not a simple lookup – the model predicts the label (the flower species) for us:

SELECT classify(features(5.8, 3, 5.1, 1.8), model) AS prediction
FROM (
  SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width)) AS model
  FROM iris
) t;
 
   prediction   
----------------
 Iris-virginica 
(1 row)

The model predicts our measurements are an “Iris-virginica”. How does it work? The classify() function returns the predicted label by using the trained model. The trained model is constructed by the learn_classifier() function.  Using this approach means we cannot save the model so we pass it using the nested query (SELECT learn_classifier…) each time.

We can optionally view the trained model, but it does not make much sense to a human. Run the function to train the classification model again but omit the classify() function:

SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width) ) AS model
FROM iris;
 
                      model                      
-------------------------------------------------
 3c 43 6c 61 73 73 69 66 69 65 72 28 76 61 72 63 
 68 61 72 29 3e                                  
(1 row)

The trained model is returned in a serialized format. 
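As an aside, the hex dump shown above is readable: decoding the bytes as ASCII reveals the type header of the serialized classifier. A quick Python check:

```python
# The hex dump returned by learn_classifier() is ASCII under the hood;
# decoding it shows the serialized model's type header.
hex_dump = "3c 43 6c 61 73 73 69 66 69 65 72 28 76 61 72 63 68 61 72 29 3e"
decoded = bytes.fromhex(hex_dump.replace(" ", "")).decode("ascii")
print(decoded)  # <Classifier(varchar)>
```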

We hope the above Presto Machine Learning article and Classification example prove useful.  Stay tuned for a follow-up article introducing Regression – another type of supervised learning problem that Presto can help with.

How to Make The Presto Query Engine Run Fastest

How do you make the Presto query engine run as fast as possible?  The following list of things to consider, check and tune will help you get the most out of your Presto clusters:

  • How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note that Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single-coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger server/cloud instance, as more CPU and memory could help.

  • Have you configured Presto’s memory usage correctly? It is often necessary to change the default memory configuration based on your cluster’s capacity. The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
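The 80% rule of thumb is easy to script when provisioning nodes; a minimal sketch (the function name is ours, not a Presto tool):

```python
def suggested_xmx_gb(physical_gb: float, fraction: float = 0.8) -> int:
    """Rule of thumb: start -Xmx at ~80% of physical memory, rounded down."""
    return int(physical_gb * fraction)

# e.g. jvm.config entries for a 64 GB and a 16 GB node:
print(f"-Xmx{suggested_xmx_gb(64)}G")  # -Xmx51G
print(f"-Xmx{suggested_xmx_gb(16)}G")  # -Xmx12G
```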

  • What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers – on AWS, for example, you can do this by checking each instance type’s “network performance” rating.
  • Optimize your metadata / data catalog: using Presto’s Hive connector for your metastore, as many users do, means practically every query will access the Hive metastore for table and partition details. During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider:
  1. Setting up multiple catalogs. Configure PrestoDB to use multiple Thrift metastore endpoints – Presto’s Hive connector supports configuring multiple Hive metastore endpoints which are tried in round-robin by the coordinator. https://prestodb.io/docs/current/connector/hive.html 
  2. Enabling Hive metastore caching, with cache eviction configurations and TTLs carefully tweaked to suit your data refresh policies
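To picture the round-robin behaviour, here is an illustrative Python sketch (the endpoint names are hypothetical; the real selection happens inside the Hive connector):

```python
from itertools import cycle

# Hypothetical metastore endpoints, as would be listed in the Hive
# connector's metastore URI configuration
endpoints = cycle([
    "thrift://metastore-1:9083",
    "thrift://metastore-2:9083",
])

def next_endpoint() -> str:
    # The coordinator tries each configured endpoint in turn
    return next(endpoints)

print(next_endpoint())  # thrift://metastore-1:9083
print(next_endpoint())  # thrift://metastore-2:9083
print(next_endpoint())  # thrift://metastore-1:9083 (wraps around)
```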

  • Do you have a separate coordinator node? With PrestoDB you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes but it’s obviously not optimal in terms of performance.  It is nearly always recommended to have the coordinator running on a separate node to the workers for anything other than sandbox use.  Tip:  Check your nodes’ Presto etc/config.properties files to determine which one is the coordinator (look for coordinator=true)

  • Is memory exhausted? Presto uses an in-memory, pipelining processing architecture, and its operation depends on the available JVM memory – which in turn depends on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running on.
    • The workers can be memory hungry when processing very large queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to using a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload and the resources available. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. 

The good news is there is memory information available in at least two places:

1. Presto’s built-in JMX catalog can help you monitor memory usage with various counters.  Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking.

2. There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.

  • When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can resolve all sorts of issues, including those caused by memory leaks and garbage collection.
  • Is your Presto cluster configured for autoscaling based on CPU usage? If so, check that the configuration is what you expect it to be.

  • Does IO and CPU utilisation look balanced? Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads can be increased, or that the number of splits in a batch is too low. You can check these stats in the Presto Console’s “Cluster Overview” UI.
  • Have you checked your data volumes recently? An obvious one to check but data volumes can grow unexpectedly and sometimes growth peaks occur without you noticing. The queries may simply be taking longer because there is x% more data than last month.

There are other configuration settings for task concurrency, initial splits per node, join strategy, driver tasks and more: PrestoDB has around 82 system configuration settings and 50+ Hive configuration settings which users can tweak, many at the query level. These are for advanced users, however, and fall outside the scope of this article. More information can be found in the PrestoDB documentation.

We hope the above information makes your Presto clusters run as fast as they can. Click here for more about Ahana

What Is A Presto Connection & How Does It Work?

Presto connections – connections to and from Presto – fall into two main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. The connectors are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive which gives query access to HDFS and S3 data stored in a variety of formats.

2) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query

The default port is 8080.
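A small Python sketch that builds these endpoint URLs (the host name is a placeholder; fetching the resulting URL with an HTTP client then returns the JSON):

```python
from urllib.parse import urlunsplit

def query_status_url(host: str, port: int, query_id: str = None) -> str:
    """Build the /v1/query endpoint URL, optionally for a single query ID."""
    path = "/v1/query" + (f"/{query_id}" if query_id else "")
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))

print(query_status_url("myHost", 8080))
# http://myHost:8080/v1/query
print(query_status_url("myHost", 8080, "20200926_204458_00000_68x9u"))
# http://myHost:8080/v1/query/20200926_204458_00000_68x9u
```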

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. The download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following Presto JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
// Connect as user "test" with no password; secured clusters need real credentials
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above information helps you with creating Presto Connections.

How to Presto Data Share

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables data sharing in two ways. 

1) With its broad list of data connectors, Presto provides a simple way to exploit data virtualization with federated SQL queries accessing data across multiple data sources.  Data in RDBMSs, NoSQL databases, legacy data warehouses, files in object storage can all be accessed,  and combined in a single SQL query if required.  Presto allows querying data where it lives, like in Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization by a variety of users and applications.  A complete list of Presto’s connectors can be found at  https://prestodb.io/docs/current/connector.html 

2) Sharing access to data by connecting your Presto cluster to other systems, tools and applications, which is achieved using the Presto API or using SQL via client libraries, ODBC or JDBC interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query 

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
// Connect as user "test" with no password; secured clusters need real credentials
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries 

When it comes to sharing and using data with Presto, we hope the above information is useful.

Presto Performance: Speed, Optimization & Tuning

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs). Performance is particularly important for ad-hoc analytical queries, so Presto has a number of design features to maximize speed, such as in-memory pipelined execution, a distributed scale-out architecture, and a massively parallel processing (MPP) design. In terms of specific performance features, Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • A cost-based optimizer, which depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. Many different types of storage can be queried by Presto, some faster than others, so choosing the fastest available data source will boost Presto’s speed. That may mean tuning the source to reduce latency, increase throughput, or both; switching from a busy, high-contention data source to an alternative such as a read replica of a database; creating indexes; accessing a pre-aggregated version of the data; or moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s performance-optimized file formats such as ORC or Parquet, and enabling compression. 

Presto Performance Tips

The following recommendations can help you achieve maximum performance from your Presto clusters:  

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s Java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb: in each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering, which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE TABLE <tablename> command to do this. Repeat the ANALYZE TABLE commands for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type='AUTOMATIC';
    • SET session join_reordering_strategy='AUTOMATIC';

(both SETs are needed, and they persist until the session ends / you log out).  

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for coordinator node overload. If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note that Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single-coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O. Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers – on AWS, for example, you can do this by checking each instance type’s “network performance” rating.

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 
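Resource groups are defined in a JSON file referenced from the resource groups configuration (see the documentation link above). A minimal, illustrative sketch – the group names, limits and selector pattern here are our own examples, not defaults:

```json
{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "subGroups": [
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 10,
          "maxQueued": 100
        }
      ]
    }
  ],
  "selectors": [
    { "source": ".*cli.*", "group": "global.adhoc" },
    { "group": "global" }
  ]
}
```

With this configuration, queries from CLI-like sources queue in the smaller global.adhoc group while everything else consumes the global group's limits.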


Finally, to assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service. 

We hope these Presto performance recommendations help you get the most out of Presto. If you’re interested in getting more information about Ahana, the easiest managed service for Presto on AWS, check out the case study comparing AWS Athena to Ahana.

Presto REST API

Presto’s HTTP API (Presto Rest API) is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]
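The response is ordinary JSON, so it is easy to post-process. For example, a small Python sketch that pulls a few fields from a trimmed-down copy of the payload above:

```python
import json

# A trimmed-down version of the /v1/query response shown above
payload = """[{"queryId": "20210119_192148_00000_r4adv",
               "state": "FINISHED",
               "query": "select now()",
               "queryStats": {"elapsedTime": "980.19ms", "totalCpuTime": "25.00ms"}}]"""

for q in json.loads(payload):
    # Print a one-line summary per query
    print(q["queryId"], q["state"], q["queryStats"]["elapsedTime"])
```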

Presto Parquet

Parquet is a columnar storage format for Hadoop, supported by Presto.  Storing data as columns as opposed to rows enables Presto to fulfill queries more efficiently. By discarding unwanted data in rows, and preventing unnecessary scans, columnar storage saves disk space and improves query performance for larger data sets.

In Parquet, data is first horizontally partitioned into groups of rows, then within each group, data is vertically partitioned into columns. Data for a particular column is stored together, compressed and encoded to save space and improve performance. Each Presto Parquet file has a footer that stores codecs and encoding information, as well as column-level statistics such as the minimum and maximum values of each column. 
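Those column-level min/max statistics are what make skipping possible: a reader can discard an entire row group when a predicate cannot match its value range. A simplified illustration of the idea (not the actual Parquet reader API):

```python
def can_skip_row_group(col_min, col_max, value) -> bool:
    """For a predicate `col = value`, the row group can be skipped
    when the value falls outside the footer's min/max range."""
    return value < col_min or value > col_max

# Row group whose column values span 100..200:
print(can_skip_row_group(100, 200, 50))   # True  - no match possible, skip it
print(can_skip_row_group(100, 200, 150))  # False - must be read
```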

There have been several improvements made to Presto’s Parquet reader by the community, most notably by Uber, to enhance performance with features such as pushdown.  

For developers, a useful tool-set for working with Parquet files is parquet-tools, which allows you to see the schema contained in a file, view data, examine data types, etc. Parquet-tools can be installed with pip (https://pypi.org/project/parquet-tools/), installed on a Mac with brew install parquet-tools, or cloned and built from https://github.com/apache/parquet-mr

Presto’s Parquet support is a popular choice for data storage.

Presto Speed

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design. In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • A cost-based optimizer, which depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. Many different types of storage can be queried by Presto, some faster than others, so choosing the fastest available data source will boost Presto’s speed. That may mean tuning the source to reduce latency, increase throughput, or both; switching from a busy, high-contention data source to an alternative such as a read replica of a database; creating indexes; accessing a pre-aggregated version of the data; or moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s performance-optimized file formats such as ORC or Parquet, and enabling compression. 


To assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more Presto speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service

Presto Tools

Several tools and libraries are available; some are bundled with core Presto and others are downloadable extras. See the documentation Resources page at https://prestodb.io/resources.html for details. 

Core Presto Tools:

  • Presto Console – Included with Presto, this browser-based UI allows administrators to monitor many aspects of their Presto deployment in real time, such as query workload, memory usage, execution plans etc. 

Community Contributed Tools 

Several community tools are listed at https://prestodb.io/resources.html.

Other tools & Libraries 

Presto Documentation

Presto’s documentation is maintained by the Presto Foundation and the current version is available here: https://prestodb.io/docs/current/ 

Releases can be found at: https://github.com/prestodb and the Release Notes are at: https://prestodb.io/docs/current/release.html 

Currently the documentation site has no search function. Use Google to search the docs instead – for example if you want to find the Presto documentation page for window functions just google site:prestodb.io window functions and click the first match.

Some of the most popular and useful documentation pages are listed below for convenience: 

In addition the documentation for Ahana Cloud for Presto – a fully integrated, cloud-native managed service built for AWS – can be found at https://docs.ahana.cloud/docs

That’s your summary of Presto documentation. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Query Example

Since Presto is an ANSI SQL query engine, its SQL will be very familiar to practically anyone who has used a database, despite the fact Presto is not technically a database since it stores no data itself.

Starting with the SELECT statement, the full range of keywords that you would expect to use are supported, including DISTINCT, FROM, WHERE, GROUP BY, HAVING, UNION | INTERSECT | EXCEPT, ORDER BY, LIMIT and so on. More details can be found on the Presto SELECT doc page: https://prestodb.io/docs/current/sql/select.html 

Like most other database products, CTEs are also supported – Common Table Expressions allow you to define a temporary named result set that is available in the execution scope of a statement such as SELECT or INSERT.  Here’s the same query written without and then with a CTE:

--Without CTE
SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
 
--With CTE
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Here’s a CTE Presto query example:

presto> WITH my_time_data (the_time_now) as ( select now() ) 
SELECT the_time_now 
FROM my_time_data;
 
             the_time_now              
---------------------------------------
 2021-01-18 12:23:12.167 Europe/London 
(1 row)

The above Presto query example and documentation link should get you up and running writing Presto queries. 

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Best Practices

We’ve put together a list of Presto best practices, based on our experience. We hope you find these helpful.

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s Java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb: in each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering, which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.
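The partitioning and bucketing clauses from the bullets above can be combined in a single CREATE TABLE AS statement. A hedged sketch (the table and column names here are hypothetical):

```sql
-- Create a partitioned, bucketed, ORC-format copy of an existing table.
-- Partition columns must appear last in the SELECT column list.
CREATE TABLE orders_partitioned
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['order_date'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 32
)
AS SELECT customer_id, total_price, order_date FROM orders;
```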

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE <tablename> command to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET SESSION join_distribution_type='AUTOMATIC';
    • SET SESSION join_reordering_strategy='AUTOMATIC';

(Both SETs are needed, and they persist until the session ends.)
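Putting the statistics, partition-sync, and CBO tips above together, a typical maintenance pass might look like this (default.customer is the example table name used above):

```sql
-- Refresh table statistics so the optimizer has current row counts.
ANALYZE default.customer;

-- Sync the metastore with any partitions added on storage.
CALL system.sync_partition_metadata('default', 'customer', 'full');

-- Enable cost-based join optimization for this session.
SET SESSION join_distribution_type='AUTOMATIC';
SET SESSION join_reordering_strategy='AUTOMATIC';
```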

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET SESSION enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O. Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers; on AWS, for example, you can do this by looking at each instance type’s “network performance” rating.

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 
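A minimal sketch of what a file-based resource-groups configuration might look like (the group names and limits are illustrative assumptions, not recommendations):

```json
{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "subGroups": [
        {
          "name": "adhoc",
          "softMemoryLimit": "20%",
          "hardConcurrencyLimit": 10,
          "maxQueued": 100
        }
      ]
    }
  ],
  "selectors": [
    { "group": "global.adhoc" }
  ]
}
```

This JSON file would be referenced from an etc/resource-groups.properties file (via resource-groups.configuration-manager=file and resource-groups.config-file); see the documentation link above for the full set of properties and selector options.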

We hope you find these Presto best practices useful. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Architecture and Design

The Presto architecture and design seamlessly integrates with an existing data ecosystem without needing to modify it. It’s an additional faster data access interface that turbocharges your existing stack. Presto separates compute from storage; its federated query engine lets teams run analytics for a variety of use cases across a wide range of data lakes and databases in-place, with no need to move data. It doesn’t store the data, which has the big advantage of being able to scale resources for queries up and down based on the demand. 

Most of the cloud deployments leverage object storage, which is already separated from the compute layer, and auto-scale to optimize resource costs. So the separation of compute and storage make it very suitable for cloud environments–though Presto can be deployed both in the cloud and on premises. 

Presto’s architecture is very similar to a classic database management system using cluster computing (MPP). Presto’s architecture consists of one coordinator node working together with multiple worker nodes. Clients submit their SQL queries, which are then parsed, analyzed and planned, with parallel tasks scheduled out to workers. Workers jointly process rows from the data sources and produce results that the client receives back. Presto does not write intermediate results to disk as its precursor Apache Hive did, so you get significantly faster query speed. Presto itself is written in the Java programming language.

A single Presto query can combine data from multiple sources. Presto uses connectors to access various data sources; some of the more popular include HDFS, Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Redis, Kudu, Amazon Redshift, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, and MongoDB. A connector is similar in concept to a driver for a database. Several connectors are built in to Presto, namely for Hive, JMX, TPC-H benchmark data, and a system connector which provides access to built-in system tables. Presto accesses other data sources using third-party connectors, and more are becoming available all the time.

Presto Guide, Documentation & Support

Presto is a fast, flexible distributed SQL engine created and used by Facebook at scale. O’Reilly’s Learning and Operating Presto ebook is an approachable guide for using Presto and getting Presto into a production deployment. It is intended to make Presto more accessible to anyone who cares about their organization becoming data-driven. It will explain how to begin Presto operations at your organization to derive insights on datasets of any type and any size, wherever they reside. 

Regardless of whether you’re a current user of Presto, new to Presto, or going to provide Presto to your users, the O’Reilly Learning and Operating Presto guide book does an excellent job of explaining what you need to know.  By reading this Presto guide you will learn what Presto is, where it came from, and how it differs from other data warehousing solutions. You’ll discover why Facebook, Uber, Twitter, and cloud providers use Presto and how you can quickly deploy Presto in production. You’ll hear about who uses Presto and the ways they solve data challenges with Presto. 

You can supplement what you learn in this Presto guide by visiting the Presto website, reading the Ahana blog, and participating in Presto’s growing community, which is filled with experienced users, developers, and data architects. You can join the Slack channel, engage via Github Discussions, attend the Presto Virtual Meetups.  All these avenues will help guide you through Presto. You can also view official Presto documentation.

If you’re looking for official support for Presto, take a look at Ahana Cloud. We offer SaaS for Presto with pay as you go pricing and technical support.

Presto Data Sources

Presto was designed and written from the ground up to efficiently query data against data sources of all sizes, ranging from gigabytes to petabytes. Presto connects to a wide variety of data sources, from HDFS to traditional relational databases, as well as NoSQL data sources such as Cassandra. Presto is particularly equipped to perform multiple concurrent interactive queries against a data source.

Presto is obviously a great fit for companies that have disparate data sources. For those organizations that can’t consolidate all of their data into one centralized store, Presto’s data federation capabilities can create a unified query layer that enables you to blend your data across different data sources together. With Presto, you can leverage many data sources at once, which means Presto can handle very large volumes of data.

Some popular combinations include Presto with AWS S3, Cassandra, and Accumulo. Here are more data sources Presto connects to:

  • Accumulo
  • Alluxio
  • Amazon Redshift
  • Amazon S3
  • Cassandra
  • Druid
  • Elasticsearch
  • HDFS
  • Kafka
  • Kudu
  • Microsoft SQL Server
  • MongoDB
  • Phoenix
  • Pinot
  • RDS PostgreSQL
  • RDS MySQL
  • Redis
  • Teradata

If you want to get up and running with Presto quickly, check out Ahana Cloud which is SaaS for Presto.

Presto Server Types

Coordinators and workers are the two types of Presto servers. Here’s a look at each server type in more detail.

A Presto coordinator parses statements, plans queries, and manages Presto worker nodes. The coordinator is also the node to which a client connects to submit statements for execution. Every Presto installation must have a coordinator alongside one or more workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.

The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages, which are then translated into a series of connected tasks running on a cluster of Presto workers.

A Presto worker is a server in a Presto installation. Its basic job is to execute tasks assigned by the coordinator and to process data. When it starts up, it advertises itself to the discovery server, which makes it available to the coordinator. Workers fetch data from connectors and exchange intermediate data with each other. The coordinator takes the final results from the workers and sends them back to the client.

Workers use an HTTP-based protocol to communicate with both the coordinator and other workers.

Is Presto Free?

Many folks may wonder: is Presto free? In fact, PrestoDB is a free, open source, federated, distributed SQL query engine used for ad hoc analytics. The PrestoDB AMI is 100% open source and available for use in production immediately. The PrestoDB AMI is the first and only completely open source and completely free edition of the software.

At Ahana, we have a team of PrestoDB experts who are actively contributing to the open source project and community. Our mission is to help you be successful with your PrestoDB deployments. 

You can sign up for Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto. It is the easiest Presto experience ever, and requires no installation, no AWS AMIs or CFTs, and no configuration. You can be running SQL queries on your AWS data within 30 minutes. You can sign up for free here.

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. An EMR cluster with Spark is very different from an EMR Presto cluster:

  • EMR is a big data platform that allows you to automate provisioning, tuning, etc. for big data workloads. Presto is a distributed SQL query engine, also called a federation middle tier. Using EMR, users can spin up, scale and deploy Presto clusters. You can connect to many different data sources; some common integrations are Elasticsearch, HBase, and AWS S3.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark core does not support SQL; for SQL support you install the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive or ad hoc queries, nor for federating data from multiple sources; for these, Presto is a better choice.

There are some similarities: EMR, Spark, and Presto share distributed and parallel architectures, and all are designed for dealing with big data. PrestoDB is included in Amazon EMR release version 5.0.0 and later.

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications. EMR offers easy provisioning, auto-scaling, and fault tolerance, and as you’d expect it has good integration with the AWS ecosystem, including S3, DynamoDB and Redshift. An EMR cluster may be configured as “long running” or as a transient cluster that auto-terminates once the processing job(s) have completed.

EMR comes with some disadvantages:

  • EMR does not offer support for Presto – users must create their own Presto metastore, configure connectors, and install and configure any tools they need.
  • EMR can be complex – if you have a database requirement, then provisioning EMR, Spark and S3, and ensuring you use the right file formats, networking, roles and security, can take much longer than deploying a packaged MPP database solution like Redshift.
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain data like you can with EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes). The only workaround is to store all your EMR data in S3 before each shutdown, and then ingest it all back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3, and manage the cost implications.
  • On its own, EMR doesn’t include any tools – no analytical tools, BI, visualisation, SQL Lab or notebooks. No HBase or Flume. Not even an HDFS access CLI. You have to roll your own tool integrations and deal with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time like you can with Presto, Cloudera, Spark, and most other frameworks. Similarly EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR – it uses the MapReduce execution framework which is designed for large batch processing and not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2, installing Hadoop and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster running Spark versus Presto, your choice ultimately boils down to the use cases you are trying to solve.

Spark SQL vs Presto

In this article we’ve tried to lay out the comparisons of Spark SQL vs Presto. There are some differences to be aware of:

Commonalities:

  • They are both open source, “big data” software frameworks
  • They are distributed, parallel, and in-memory
  • BI tools connect to them using JDBC/ODBC
  • Both have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud. They can be containerized

Differences:

  • Presto is an ANSI SQL:2003 query engine for accessing and unifying data from many different data sources.  It’s deployed as a middle-layer for federation.
  • Spark is a general-purpose cluster-computing framework. Core Spark does not support SQL – for SQL support you install the Spark SQL module which adds structured data processing capabilities. Spark SQL is also ANSI SQL:2003 compliant (since Spark 2.0).
  • Presto is more commonly used to support interactive SQL queries.  Queries are usually analytical but can perform SQL-based ETL.  
  • Spark is more general in its applications, often used for data transformation and Machine Learning workloads. 
  • Presto supports querying data in object stores like S3 by default, and has many connectors available. It also works really well with Parquet and ORC format data.
  • Spark must use Hadoop file APIs to access S3 (or pay for Databricks features). Spark has limited connectors for data sources. 

This lays out many of the differences between Presto and Spark SQL, and how the two can be compared.

If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.


Presto Catalogs

Presto has several important components that allow you to easily manage data. These components are catalogs, tables and schemas, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in each query. The catalogs contain schemas and information about where data is located. Every Presto catalog is associated with a specific Presto connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source. 

Catalogs are defined in properties files stored in the Presto configuration directory. A schema is what you use to organize your tables. Catalogs and schemas are how you define what will be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.
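For instance, the fully-qualified name above could be queried directly (assuming a hive catalog with that schema and table exists):

```sql
-- catalog.schema.table addressing
SELECT * FROM hive.test_data.test LIMIT 5;

-- Equivalent after setting the default catalog and schema:
USE hive.test_data;
SELECT * FROM test LIMIT 5;
```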

If you run a SQL statement in Presto, you are running it against one or more catalogs. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples include a Hive catalog to connect to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add additional properties file to etc/catalog with a different name, making sure it ends in .properties. For example, if you name the property file sales.properties, Presto creates a catalog named sales using the configured connector.
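A minimal catalog file for the sales example above might look like this (assuming a Hive connector; the metastore URI is a placeholder):

```properties
# etc/catalog/sales.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore-host:9083
```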

Spark Streaming Alternatives

When researching Spark Streaming alternatives, it really depends on your use case. Are you processing streaming data or batch data? Do you prefer an open source or closed-source/proprietary alternative? Do you need SQL support?

With that in mind let’s look at ten closed-source alternatives to Spark Streaming first:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice.
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – a streaming data integration and operational intelligence platform designed to enable continuous query and processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines. Features graphical stream visualizations.
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

And here are five open source alternatives to Spark Streaming:

  1. Apache Flink – considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for stream as well as the batch processing at scale. It provides a fault tolerant operator based model for streaming and computation rather than the micro-batch model of Apache Spark.
  2. Apache Beam – a workflow manager for batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
  3. Apache Apex – Enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – A distributed stream processing framework
  5. Apache Storm – distributed realtime computation system 

So there you have it. Hopefully you can now find a suitable alternative to Spark streaming. Learn more about Spark SQL vs Presto in our comparison article.

Presto Engine and Database Explained

What is Presto?

PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With the Presto SQL engine, you can query data where it lives across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is built in Java and can also integrate with other third-party data sources or infrastructure components.

As more organizations become data-driven, they need technologies like Presto to deliver ad-hoc analytics. Federated query engines like Presto simplify and unify data analytics on data anywhere. 

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto, and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between PrestoDB and other forks?

Presto originated from Facebook and was built specifically for Facebook. PrestoDB is backed by Linux Foundation’s Presto Foundation and is the original Facebook open source project. 

Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory used by Presto is usually in the context of the JVM itself; depending on query sizes and complexity of tasks, you can allocate more or less memory to the JVM. Presto itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data during the period of tasks in its buffer cache. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query against other disparate data sources. 

What is the Presto query execution model?

The Presto query execution model is split into a few different concepts: statement, query, stage, task, and split. After you issue a SQL query (or statement) to the query engine, it parses and converts it to a query plan. Presto executes the query by breaking it up into multiple stages. Stages are then split up into tasks across the multiple Presto workers; the tasks are what actually do the work and processing. Tasks use an exchange to share intermediate data and outputs with each other.
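You can see this model for yourself with EXPLAIN, which prints the planned fragment (stage) breakdown for a query before it runs (the table name here is hypothetical):

```sql
-- Show the distributed plan, i.e. how the query is divided into
-- fragments that are scheduled as tasks on the workers.
EXPLAIN (TYPE DISTRIBUTED)
SELECT a, COUNT(*) FROM hive.test_data.test GROUP BY a;
```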

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, Presto will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed to work within, or directly against, the Hadoop ecosystem. Presto reaches data across these sources via its connectors.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and typically are stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This data is stored in a database such as MySQL and accessed via the Hive metastore service. Presto via the Hive connector is able to access both these components. 

One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto Use Spark?

Presto and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

Presto is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.

Querying AWS S3 with Presto

If you’re looking for a query engine for AWS S3, Presto is a popular choice. Presto is an open source SQL query engine, well known for ad hoc analytics as well as reporting and dashboarding. While users may think of S3 buckets as a data source for the Presto query engine, from a Presto perspective, file systems and object stores are handled very differently than databases. Presto only needs access to the path of the data it needs to scan; it does not need to know details like the bucket name as part of the configuration. It gets the path from the catalog, which maps the file system / objects to databases and tables. There are two popular catalogs for object stores: Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). The Presto Hive Connector is used to access the data from the object store underneath.

https://prestodb.io/docs/current/connector/hive.html

How to query an AWS S3 bucket with Presto:

There are two primary ways for Presto clusters to get access to data stored in S3:

  1. IAM role via the instance Presto servers are running on 
  2. Access Key / Secret Key provided via the Hive connector properties file

IAM Role: Query S3 with Presto (Recommended Approach)

If using an IAM role, Presto needs to be configured with:

hive.s3.use-instance-credentials=true 

With this setting, the Presto server will have access to all the buckets that are accessible using the IAM role that the instance is bound to. The Hive Metastore also needs to have access to those buckets, and needs to be bound to the same IAM role.

Access Key / Secret Key: Query S3 with Presto

If AK/SK is used, Presto Hive connector needs to be configured with: 

hive.s3.aws-access-key    Default AWS access key to use.
hive.s3.aws-secret-key    Default AWS secret key to use.

With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected. 
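Putting this together, a sketch of a Hive catalog file using AK/SK might look like the following (the metastore URI and key values are placeholders, not real credentials):

```properties
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore-host:9083
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```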

If you’re interested in getting started with Presto on S3, check out the Ahana Cloud platform, a managed service for Presto in AWS. You can add data sources with the click of a button, query your S3-based data lake, and run a Presto AWS S3 query easily.

Optimize Presto on Amazon EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto for scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets. You can move large volumes of data into and out of AWS datastores like S3 with Amazon EMR. AWS EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. 

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop tools, including Presto EMR, Spark and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. In order to persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. It’s not really a question of Presto versus Hive on EMR; it’s about how to use them together. Presto allows users to access data from multiple sources, allowing for analytics across an entire organization. A typical combination might be Hadoop plus Presto for analytics.

Using Presto on Amazon EMR

Presto on Amazon EMR allows users to run interactive queries on large data sets with minimal setup time, since AWS EMR handles the provisioning, configuration, and tuning of the Hadoop cluster. If you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.
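As a sketch of what provisioning looks like, the AWS CLI can launch an EMR cluster with Presto and Hive installed. The cluster name, release label, key pair, and instance settings below are placeholders you would replace with your own:

```shell
# Launch a small EMR cluster with Presto and Hive installed.
# Key pair name, instance type/count, and release label are placeholders.
aws emr create-cluster \
  --name "presto-cluster" \
  --release-label emr-5.34.0 \
  --applications Name=Presto Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair
```

Once the cluster is running, you can SSH to the master node and use the Presto CLI there, or connect a BI tool over JDBC/ODBC.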

AWS EMR And Presto Configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on each cluster instance or read directly from S3. From EMR 5.12.0 onwards, Presto uses EMRFS by default to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications like Presto access to data stored in S3 without additional configuration or copying of the data. For earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.
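Outside of EMR’s automatic setup, the Hive connector is configured with a catalog properties file on each Presto node. A minimal sketch is below; the metastore host and the credential lines are placeholders (on EMR or EC2, instance-profile credentials are typically used instead):

```properties
# etc/catalog/hive.properties -- minimal sketch; metastore host is a placeholder
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083

# Optional: explicit S3 credentials (otherwise instance-profile credentials apply)
# hive.s3.aws-access-key=YOUR_ACCESS_KEY
# hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

With this catalog in place, tables registered in the Hive metastore (whether their files live in HDFS or S3) become queryable from Presto under the `hive` catalog name.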

Real world applications

Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget those users with relevant ads. It sits between advertisers and their audiences, so real-time bidding on media advertising space is critical for their business. The amount of data Jampp generates as part of the bidding cycle is massive – 1.7B events are tracked per day, 550K requests per second are received, and 100TB of data is processed by AWS elastic load balancers per day. PrestoDB plays a critical role in their data infrastructure. Jampp relies on Presto on AWS EMR for their ad hoc queries and runs over 3K ad hoc queries per day on over 600TB of queryable data.

Using Presto with Hadoop

How does Presto work with Hadoop?

You use Presto to run interactive queries on Hadoop. The difference between using Presto versus something like Hive, for instance, is that Presto is optimized for fast performance – this is crucial for interactive queries.

Presto’s distributed system runs on Hadoop and uses an architecture that’s similar to a massively parallel processing (MPP) database management system. 

Presto does not have its own storage system, so it is complementary to Hadoop. It can be installed with any implementation of Hadoop, such as Amazon’s EMR Hadoop distribution.

If you’re looking for a Presto Hadoop tutorial, check out our getting started page.

Can Presto connect to HDFS?

Yes, Presto connects with HDFS through the Hive connector. You can use the Hive connector to query data stored in HDFS (or AWS S3, for that matter). One of the big benefits of Presto is that you can query data files in varying formats, which makes it easy to analyze all of your HDFS data. You can check out more on the Hive connector in the prestodb docs.
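As a sketch, assuming a catalog named `hive` is configured with the Hive connector, you can register existing files in S3 (or HDFS) as an external table and query them in place. The table, column, and bucket names below are hypothetical:

```sql
-- Register existing Parquet files in S3 as a table (bucket/path are placeholders)
CREATE TABLE hive.default.page_views (
    user_id bigint,
    url varchar,
    view_time timestamp
)
WITH (
    external_location = 's3://my-bucket/page_views/',
    format = 'PARQUET'
);

-- Query the data where it lives, without copying it
SELECT url, count(*) AS views
FROM hive.default.page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

The same pattern works for an `hdfs://` location; only the `external_location` changes.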

Is Presto related to Hive? What are the differences?

Apache Hive was developed as a project at Facebook in 2008 so they could leverage SQL syntax in their Hadoop system. It simplifies complex Java MapReduce jobs into SQL-like queries while executing jobs at massive scale.

As their Hadoop deployments grew, the Facebook team found that Hive wasn’t optimized for the fast performance they needed for their interactive queries. So they built Presto in 2013, which could operate quickly at petabyte scale.

At a high level, Hive is optimized for ad-hoc analysis and ease of use with its SQL-like syntax while Presto is highly optimized for low latency and fast performance – it takes longer for Hive to complete a job versus Presto.

Presto on AWS

Presto and AWS

Presto is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources. Presto and AWS are a powerful combination.

If you want to run Presto in AWS, it’s easy to spin up a managed Presto cluster via the AWS Management Console, the AWS CLI, or the Amazon EMR API, and then connect to it with the Presto CLI.

You can also give Ahana Cloud a try, a managed service for Presto that takes care of the devops for you.

Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effectiveness of the cloud while allowing you to take advantage of Presto’s distributed query engine.

How does Presto work with AWS?

Some AWS services that work with Presto are Amazon EMR and Amazon Athena. They’re managed services that do the integration, testing, setup, configuration, and cluster tuning for you. Amazon Athena and EMR are widely used, but both come with some challenges.

There are some differences between Presto on EMR and Athena. AWS EMR enables you to provision as many compute instances as you want, in minutes. Amazon Athena lets you run Presto on the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune. Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Learn more about those challenges and why users are moving to Ahana Cloud, SaaS for Presto on AWS.

What is Presto and How Does Presto Work?

What is Presto and how does Presto work?

How does Presto work? PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components.
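Because each data source is exposed to Presto as a catalog, a single query can even join across sources. A minimal sketch, assuming `hive` and `mysql` catalogs are configured (the schema, table, and column names are hypothetical):

```sql
-- Join fact data in a Hive/S3 table against a dimension table in MySQL
SELECT o.order_id, c.customer_name, o.total
FROM hive.sales.orders o
JOIN mysql.crm.customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2021-01-01';
```

Presto pushes work down to each source where it can and combines the results itself, so no data needs to be copied into a central store first.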

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between Presto and other forks?

Presto originated at Facebook and was built specifically for Facebook’s needs. PrestoDB is backed by the Linux Foundation’s Presto Foundation and is the original Facebook open source project. Other versions of Presto are forks of that project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Presto’s memory usage is governed by the JVMs on its coordinator and worker nodes; depending on query sizes and the complexity of tasks, you can allocate more or less memory to these JVMs. Presto itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data in buffers for the duration of a query’s tasks. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that lets you query disparate data sources in place.

What is the Presto query execution model?

The Presto query execution model is split into a few different concepts: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the engine, it parses and converts it into a query. When Presto executes the query, it does so by breaking it up into multiple stages. Stages are in turn split into tasks that run across the Presto workers. Tasks are the units that actually do the work, processing the splits of data assigned to them. Tasks use Exchanges to share data and intermediate outputs between stages.
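You can see this breakdown for a particular query with `EXPLAIN`, whose distributed output is divided into fragments – the plan-time units that become stages at runtime. A sketch, using a hypothetical table name:

```sql
-- The distributed plan is printed as Fragments; each fragment becomes a
-- stage at runtime, executed as tasks over splits on the workers.
EXPLAIN (TYPE DISTRIBUTED)
SELECT url, count(*)
FROM hive.default.page_views
GROUP BY url;
```

Reading the fragments top-down roughly mirrors how results flow back from the workers to the coordinator.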

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, Presto will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 
– Capturing data
– Storing data
– Analysis
– Search
– Sharing
– Transfer
– Visualization
– Querying
– Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

Presto data sources are the systems that connect to Presto and that you can query. There are many in the Presto ecosystem, including AWS S3, Redshift, MongoDB, and more.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with the Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and are typically stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables; it is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, through the Hive connector, is able to access both of these components. One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto Use Spark?

Presto and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

Presto is not dependent on YARN as a resource manager. Instead it uses its own MPP-style architecture with dedicated coordinator and worker nodes, which does not depend on a Hadoop infrastructure to run.