Presto REST API

Presto’s HTTP API (Presto Rest API) is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select from now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto Architecture and Design

The Presto architecture and design seamlessly integrates with an existing data ecosystem without needing to modify it. It’s an additional faster data access interface that turbocharges your existing stack. Presto separates compute from storage; its federated query engine lets teams run analytics for a variety of use cases across a wide range of data lakes and databases in-place, with no need to move data. It doesn’t store the data, which has the big advantage of being able to scale resources for queries up and down based on the demand. 

Most of the cloud deployments leverage object storage, which is already separated from the compute layer, and auto-scale to optimize resource costs. So the separation of compute and storage make it very suitable for cloud environments–though Presto can be deployed both in the cloud and on premises. 

Presto’s architecture is very similar to a classic database management system using cluster computing (MPP). Presto’s architecture consists of one coordinator node working together with multiple worker nodes. Clients enter their SQL queries that then are parsed, analyzed and planned with parallel tasks scheduled out to workers. Workers jointly process rows from the data sources and produce results that the client receives back. Presto does not write intermediate results to disk as its precursor Apache Hive did, and so you get significantly faster query speed. For Presto programming, it’s written in the Java programming language.

A single Presto query can combine data from multiple sources. Presto uses connectors to access various data sources, some of the more popular combinations include Presto HDFS, Presto Amazon S3, Presto MySQL, Presto PostgreSQL, Presto Microsoft SQL Server, Presto Redis, Presto Kudu, Amazon Redshift, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, MongoDB and Redis. A connector is similar in concept to a driver for a database. Several of the connectors are built-in to Presto, namely for Hive, JMX, TPC-H benchmark data, and a system connector which provides access to built-in system tables. Presto accesses other data sources using third-party connectors, and more and more are becoming available.

Presto Data Sources

Presto was designed and written from the ground up to efficiently query data against data sources of all sizes, ranging from gigabytes to petabytes. Presto connects to a wide variety of data sources, from HDFS to traditional relational databases, as well as NoSQL data sources such as Cassandra. Presto is particularly equipped to perform multiple concurrent interactive queries against a data source.

Presto is obviously a great fit for companies that have disparate data sources. For those organizations that can’t consolidate all of their data into one centralized store, Presto’s data federation capabilities can create a unified query layer that enables you to blend your data across different data sources together. With Presto, you can leverage many data sources at once, which means Presto can handle very large volumes of data.

Some of popular combinations include Presto AWS S3, Presto Cassandra/cassandra presto, presto accumulo, and more. Here are more data sources Presto connects to:

Accumulo

Alluxio

Amazon Redshift

Amazon S3

Cassandra

Druid

Elastic

HDFS

Kafka

Kudu

Microsoft SQL Server

MongoDB

Phoenix

Pinto

RDS PostgreSQL

RDS MySQL

Redis

Teradata

If you want to get up and running with Presto quickly, check out Ahana Cloud which is SaaS for Presto.

Presto Server Types

When it comes to Presto Server Types, Coordinators and workers are the two types of Presto servers. Here’s a look at each server type in more detail.

A Presto coordinator parses statements, plans queries, and manages Presto worker nodes. A Presto installation is also the node to which a client connects to submit statements for execution. Every Presto installation must have a Presto coordinator alongside one or more Presto workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.

The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages, which are then translated into a series of connected tasks running on a cluster of Presto workers.

A Presto worker is actually a server. Its basic job is to execute tasks assigned by the coordinator and to process data. When it starts working, it alerts the discovery server, which makes it available to the coordinator. The worker then obtains the data from connectors and exchanges data with the connectors. The coordinator then takes the results from the worker and sends those results back to the client.

Workers use an HTTP-based protocol to communicate with both the coordinator and other workers.

Is Presto Free?

Many folks may wonder “is Presto free”? In fact, PrestoDB is a free, open source federated, distributed SQL query engine used for ad hoc analytics. The PrestoDB AMI is 100% open source and available for use in production immediately. PrestoDB AMI is the first and only completely open source and completely free edition of the software.

At Ahana, we have a team of PrestoDB experts who are actively contributing to the open source project and community. Our mission is to help you be successful with your PrestoDB deployments. 

You can sign up for Ahana Cloud for Presto, the first full integrated, cloud-native managed service for Presto. It is the easiest Presto experience ever, and requires no installation, no AWS AMIs or CFTs and no configuration. You can be running SQL queries on your AWS data within 30 minutes. You can sign up for free here.

Presto Catalogs

Presto has several important components that allow you to easily manage data. These components are catalogs, tables and schemas, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in each query. The catalogs contain schemas and information about where data is located. Every Presto catalog is associated with a specific Presto connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source. 

Catalogs are defined in properties files stored in the Presto configuration directory. Schema is what you use to organize your tables. Catalogues and schemas are how you define what will be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.

If you run a SQL statement in Presto, you are running it against one or more catalogues. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples of catalogs include the Hive catalog to connect to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add additional properties file to etc/catalog with a different name, making sure it ends in .properties. For example, if you name the property file sales.properties, Presto creates a catalog named sales using the configured connector.