Define Presto
Presto is an open source, distributed SQL query engine that enables analytics on large amounts of data. With Presto, users access and query data in many different data sources using ANSI SQL. Presto queries data where it lives, including Hive, Amazon S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. Data from multiple sources can be combined in a single query, so users can analyze data across an entire organization.
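As a rough illustration of combining sources in one query, the sketch below builds a SQL statement that joins a table in a Hive catalog with a table in a MySQL catalog. Presto addresses tables as catalog.schema.table; the catalog, schema, table, and column names used here are hypothetical and stand in for whatever a real deployment defines.

```python
# Minimal sketch: compose a federated Presto query as a SQL string.
# All catalog, schema, table, and column names below are hypothetical.

def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name (catalog.schema.table)."""
    return f"{catalog}.{schema}.{table}"


def federated_query() -> str:
    """Join order data in Hive with customer data in MySQL."""
    orders = qualified("hive", "sales", "orders")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.region, SUM(o.total) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region"
    )


if __name__ == "__main__":
    print(federated_query())
```

The resulting statement could be submitted through any Presto client. The point is only that both tables appear in the same FROM clause even though they live in different systems.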
Presto is one of the fastest-growing open source projects in data analytics today, in part because it fits the broader shift toward data-driven organizations. Three reasons stand out: 1) Presto is based on ANSI SQL, so it's easy for people to get started with it, 2) the Presto connector architecture enables federated access to almost any data source, whether a database, data lake, or other data system, and 3) Presto can start from one node and scale to thousands.
A full Presto deployment includes a coordinator and multiple workers. Analysts submit queries to the coordinator through a client such as the Presto CLI. The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.
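The division of labor between coordinator and workers can be pictured with a toy model. The sketch below is not Presto's scheduler or execution engine; it only mimics the pattern of splitting the input, letting each worker compute a partial result in parallel, and combining the partial results. The function names are illustrative, and merging the results in one place is a simplification made for brevity.

```python
# Toy model of coordinator/worker processing; not Presto's actual implementation.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, num_workers):
    """Coordinator step: divide the input rows into one chunk per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_count_by_key(split):
    """Worker step: partial result for SELECT key, COUNT(*) ... GROUP BY key."""
    return Counter(key for key, _value in split)


def run_query(rows, num_workers=3):
    """Distribute the splits, run the workers in parallel, merge their counts."""
    splits = plan_splits(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_count_by_key, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
    print(run_query(data))
```

Adding workers in this model only changes how many chunks are processed at once, which loosely mirrors how a deployment scales by adding nodes.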
The Presto project is community owned and community driven. It is supported by the Presto Foundation, an independent nonprofit organization hosted under the Linux Foundation. Founding members of the Presto Foundation include Facebook, Alibaba, Twitter, and Uber. These members help drive the future direction of the project, with the goal of making Presto the fastest and most reliable SQL engine for massively distributed data processing.
Presto supports pluggable connectors that provide data for queries. The requirements vary by connector. Connector examples include Hive (for HDFS or object stores such as S3), MySQL, Elasticsearch, Cassandra, Kafka, and more.
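Each connector is set up through a catalog properties file, conventionally placed under etc/catalog, and the properties differ from one connector to the next. The sketch below renders two such files as text. The property keys follow the Presto documentation for the MySQL and Hive connectors as far as I know them; the host names, ports, and credentials are placeholders, and render_catalog is a hypothetical helper, not part of Presto.

```python
# Minimal sketch: render Presto catalog properties files as text.
# Hosts, ports, and credentials are placeholders, not real endpoints.

MYSQL_CATALOG = {
    "connector.name": "mysql",
    "connection-url": "jdbc:mysql://example.net:3306",
    "connection-user": "presto_user",
    "connection-password": "change-me",
}

HIVE_CATALOG = {
    "connector.name": "hive-hadoop2",
    "hive.metastore.uri": "thrift://example.net:9083",
}


def render_catalog(properties: dict) -> str:
    """Format key/value pairs in the key=value style of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


if __name__ == "__main__":
    # The file name (without .properties) becomes the catalog name in queries.
    print("# etc/catalog/mysql.properties")
    print(render_catalog(MYSQL_CATALOG))
    print("# etc/catalog/hive.properties")
    print(render_catalog(HIVE_CATALOG))
```

The MySQL catalog needs a JDBC URL and credentials, while the Hive catalog needs a metastore address, which is one concrete way requirements vary by connector.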