PrestoDB, now hosted under the Linux Foundation, is an open source distributed SQL query engine.
We’ve curated this page with the best reads from the PrestoDB community.
Watch: Why is Presto so important, and how does it help today’s users?
Hear from Presto Foundation panel members Facebook, Uber, Ahana, and Alibaba in a virtual roundtable.
Presto is a federated, distributed SQL query engine that runs on a cluster of machines. It enables interactive, ad hoc analytics on large amounts of data. Presto queries data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. Because a single query can access data from multiple sources, Presto supports analytics across an entire organization.
A full Presto installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.
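The coordinator/worker split described above can be illustrated with a small toy model. This is only a sketch, not Presto code: `coordinator_run` and `worker_process` are hypothetical names, threads stand in for worker nodes, and the strided chunking is an assumption made for brevity. In this toy the coordinator also merges the partial results itself, which is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_process(chunk):
    """Hypothetical worker: computes a partial result over its share of the data."""
    return sum(chunk)


def coordinator_run(values, num_workers=3):
    """Hypothetical coordinator: plans the work, hands it out, combines results."""
    # "Plan": cut the input into one chunk per worker (strided slicing).
    chunks = [values[i::num_workers] for i in range(num_workers)]
    # "Distribute": each chunk is processed by a separate worker thread.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_process, chunks))
    # "Combine": merge the partial results into the final answer.
    return sum(partials)


total = coordinator_run(list(range(1, 101)))
print(total)  # 5050
```

The point of the sketch is the division of labor: one component decides how the work is cut up, and many components do the processing in parallel.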
Presto Server Types
Presto Coordinator: Responsible for parsing statements, planning queries, and managing Presto worker nodes.
Presto Worker: Responsible for executing tasks and processing data.
Presto Data Sources
Connector: A connector adapts Presto to a data source such as Hive or a relational database. Presto contains several built-in connectors, including JMX, Hive, and TPCH connectors.
Catalog: A Presto catalog contains schemas and references a data source via a connector.
Schema: Use schemas to organize tables for querying.
Table: A set of unordered rows organized into named columns with types.
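Together, these terms form a three-level hierarchy: a table is addressed in Presto by a fully qualified name of the form catalog.schema.table. The sketch below splits such a name into its parts. The helper `parse_table_name` and the example name `hive.web.clicks` are hypothetical, and the sketch assumes plain identifiers (it does not handle quoted identifiers that contain dots).

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str  # references a data source via a connector
    schema: str   # groups tables inside the catalog
    table: str    # rows organized into named, typed columns


def parse_table_name(qualified: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts (illustrative helper)."""
    parts = qualified.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {qualified!r}")
    return TableName(*parts)


name = parse_table_name("hive.web.clicks")  # hypothetical table name
print(name.catalog, name.schema, name.table)
```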
Query Execution Model
Presto executes SQL statements by turning them into queries that run across a distributed cluster consisting of a coordinator and workers.
Statement: When Presto parses a statement, it converts it into a query and creates a distributed query plan which is then realized as a series of interconnected stages running on Presto workers.
Stage: When Presto executes a query, it does so by breaking up the execution into a hierarchy of stages.
Task: Tasks are the “work horses” of the Presto architecture. A distributed query plan is deconstructed into a series of stages, which are translated into tasks that act upon or process splits.
Split: Tasks operate on splits, which are sections of a larger data set. Stages at the lowest level of a distributed query plan retrieve data via splits from connectors, and intermediate stages at a higher level of a distributed query plan retrieve data from other stages.
Driver: Drivers act upon data and combine operators to produce output that is then aggregated by a task and then delivered to another task in another stage.
Operator: An operator consumes, transforms and produces data.
Exchange: Exchanges transfer data between Presto nodes for different stages of a query.
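The relationship between splits, operators, drivers, tasks, and stages may be easier to see in a toy model. The following is a minimal sketch, not Presto's implementation: all function names are hypothetical, Python generators stand in for operators, everything runs in one process, and the "exchange" is reduced to passing a list from one stage to the next.

```python
def scan(split):
    """Source operator: yields rows from one split."""
    yield from split


def filter_positive(rows):
    """Transforming operator: keeps rows with a positive amount."""
    return (row for row in rows if row["amount"] > 0)


def partial_sum(rows):
    """Aggregating operator: emits one partial sum for the split."""
    yield sum(row["amount"] for row in rows)


def run_driver(split, operators):
    """A driver chains operators so each consumes the previous one's output."""
    stream = split
    for op in operators:
        stream = op(stream)
    return list(stream)


def run_task(splits, operators):
    """A task runs one driver per split and collects their output."""
    output = []
    for split in splits:
        output.extend(run_driver(split, operators))
    return output


# Two splits: sections of a larger data set.
splits = [
    [{"amount": 5}, {"amount": -2}],
    [{"amount": 7}, {"amount": 3}],
]

# Lower stage: each driver scans, filters, and partially aggregates its split.
partials = run_task(splits, [scan, filter_positive, partial_sum])

# "Exchange": hand the lower stage's output to the next stage (here, a plain list).
# Upper stage: combine the partial sums into the final result.
final = sum(partials)
print(partials, final)  # [5, 10] 15
```

In a real cluster the drivers run in parallel on different workers and the exchange moves data over the network; the sketch keeps only the shape of the data flow.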
Common Use Cases
Ad hoc querying
Use SQL to run ad hoc queries whenever you want, wherever your data resides.
Data lake analytics
Query data directly on a data lake without the need for transformation.
Query data across multiple sources such as databases and data lakes.