How Presto Joins Data

Presto is a distributed system composed of a coordinator and workers. Each worker can connect to one or more data sources through the corresponding connectors.

The coordinator receives the query from the client, then optimises and plans the query execution, breaking it down into constituent parts to produce the most efficient execution steps. These steps are sent to the workers, which use the connectors to submit tasks to the data sources. A task could be a file read or a SQL statement, and it is optimised for the data source and the way in which the source organises its data, taking partitioning and indexing into account, for example.
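For example, the task handed to an RDBMS connector can be a SQL statement with the filter already applied, so only matching rows travel back to the workers. Here is a minimal sketch, assuming a MySQL catalog named mysql with a sales.orders table (all names hypothetical):

    -- The MySQL connector can push the WHERE clause down to the database,
    -- so filtering happens at the source rather than in Presto.
    SELECT customer_id, total_price
    FROM mysql.sales.orders
    WHERE order_date >= DATE '2021-01-01';

Against a Hive/S3 source the same pattern would instead become reads of only the relevant files or partitions.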

The data sources supported by Presto are numerous: an RDBMS, a NoSQL database, or Parquet/ORC files in an object store such as S3, for example. The data sources execute the low-level queries (scanning, filtering, partition pruning and so on) and return the results to the Presto workers. The workers perform the Presto join operation (and other processing) on the received data, consolidate it, and return the joined result set to the coordinator.
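As a concrete illustration, a single statement can join a table read through the Hive connector (say, Parquet files in S3) with a table in an RDBMS. A minimal sketch, assuming hive and mysql catalogs with the schemas and tables shown (all names hypothetical):

    -- Each connector scans and filters its own source;
    -- the join and the aggregation run in the Presto workers.
    SELECT c.name, SUM(o.total_price) AS lifetime_value
    FROM hive.web.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.name;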

You will notice Presto uses a “push model”, which differs from, for example, Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source and some happens in Presto’s workers. The workers also communicate with each other, and the processing takes place in memory, which makes it very efficient and well suited to interactive queries. Hive, on the other hand, reads/pulls a block of a data file, executes tasks, then waits for the next block, using the MapReduce framework. Hive’s approach is not suitable for interactive queries since it reads raw data from disk and stores intermediate data to disk, all via MapReduce, which is better suited to long-running batch processing. This diagram compares Hive’s and Presto’s execution approaches:

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core coordinator components and the kinds of tasks Presto’s workers handle. In this simplified example two data sources are being accessed: one worker is scanning a Hive data source, the other is scanning a MongoDB data source. Remember that Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker simply means it uses the Hive connector, with the metastore supplying the table and file-system information, and the raw source data living outside Presto, perhaps in HDFS in Parquet or ORC format, for example. The worker dealing with the MongoDB data is described here as being on the “probe side”: the MongoDB data is read, processed, normalized into Presto’s in-memory columnar format and then shuffled (or exchanged) across the cluster to the “build side” worker (the worker dealing with the Hive data here) for the actual Presto join to take place.
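To see how a particular query maps onto these scan, exchange and join steps, you can ask Presto for its distributed plan. A minimal sketch, reusing the hypothetical hive and mongodb catalog names from the diagram’s scenario:

    -- EXPLAIN (TYPE DISTRIBUTED) prints the plan fragments that the
    -- coordinator will schedule: a scan fragment per source, remote
    -- exchanges that shuffle rows between workers, and the join fragment.
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT o.order_id, u.name
    FROM hive.web.orders o
    JOIN mongodb.app.users u ON o.user_id = u.id;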

This is a simplified example; in reality Presto is more sophisticated – the join operation could run in parallel across multiple workers, with a final stage running on a single node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram, labeled “Output”. This OutputNode’s task is to stream the result set back to the coordinator, and from there back to the client.
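How the join inputs are distributed, and how many parallel partitions the join runs across, can be influenced per session. A sketch assuming the session property names documented for recent Presto releases (defaults and availability vary by version):

    -- Let the optimiser choose between broadcasting the build side
    -- and hash-partitioning both sides across workers.
    SET SESSION join_distribution_type = 'AUTOMATIC';  -- or 'BROADCAST' / 'PARTITIONED'

    -- Number of hash partitions (and hence parallel join tasks) for partitioned joins.
    SET SESSION hash_partition_count = 16;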

Joins – Summary

It’s easy to see how Presto earns the name “Polyglot Data Access Layer”: it doesn’t matter where your data lives, because any query can access any data, in place, without ETL, data shipping or duplication. It is true federation. Even when blending very different sources of data, such as JSON documents in Elasticsearch or MongoDB with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.
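A minimal sketch of such a blended query, assuming the MongoDB connector exposes a nested profile document as a ROW column on a users collection, joined to a MySQL orders table (all names hypothetical):

    -- Nested document fields are addressed with dot notation once the
    -- connector maps them to Presto ROW types.
    SELECT u.profile.country AS country, COUNT(*) AS orders_placed
    FROM mongodb.app.users u
    JOIN mysql.sales.orders o ON o.user_id = u.id
    GROUP BY u.profile.country;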

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.