Presto Database & Engine Explained

What is Presto?

PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against many kinds of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives, across sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources and infrastructure components.

As more organizations become data-driven, they need technologies like Presto to deliver ad-hoc analytics. Federated query engines like Presto simplify analytics by providing a single SQL interface to data wherever it is stored.
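As a rough illustration of querying data where it lives, the sketch below builds one SQL statement that joins a table in a Hive catalog with a table in a MySQL catalog. Presto addresses tables as catalog.schema.table. The catalog names (hive, mysql), the schemas, tables, and columns, and the helper function are all hypothetical and chosen only for this example.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs, schemas, and tables, used only for illustration.
orders = qualified_name("hive", "sales", "orders")
customers = qualified_name("mysql", "crm", "customers")

# One statement spans both sources; the engine reads each side through
# the connector behind its catalog.
federated_query = f"""
SELECT c.name, sum(o.total) AS lifetime_value
FROM {orders} AS o
JOIN {customers} AS c ON o.customer_id = c.id
GROUP BY c.name
ORDER BY lifetime_value DESC
LIMIT 10
""".strip()

print(federated_query)
```

The point of the sketch is the naming: because each table carries its catalog, a single query can reference sources that live in entirely different systems.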

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto, and it does not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between PrestoDB and other forks?

Presto originated at Facebook and was built specifically for Facebook. PrestoDB is the original Facebook open source project and is backed by the Linux Foundation’s Presto Foundation.

Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Presto’s memory use takes place within the JVMs themselves. Depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. Presto itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data in its buffer cache for the duration of a task. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that lets you query other, disparate data sources.

What is the Presto query execution model?

The Presto query execution model is organized into several levels: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the query engine, Presto parses it and converts it into a query. To execute the query, Presto breaks it up into multiple stages. Each stage is then split into tasks that run across the Presto workers. Tasks are the units that do the actual processing: they operate on splits, which are sections of a larger data set. Tasks use an Exchange to transfer data between one another, passing the output of one stage to the next.
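The hierarchy described above can be sketched as a toy model. This is not Presto’s scheduler: the class names, the round-robin assignment of splits, and the row-count stand-in for a partial result are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    worker: str
    splits: List[str] = field(default_factory=list)


@dataclass
class Stage:
    stage_id: int
    tasks: List[Task]


def plan_stage(stage_id: int, splits: List[str], workers: List[str]) -> Stage:
    """Create one task per worker and deal the splits out round-robin."""
    tasks = [Task(worker=w) for w in workers]
    for i, split in enumerate(splits):
        tasks[i % len(tasks)].splits.append(split)
    return Stage(stage_id=stage_id, tasks=tasks)


workers = ["worker-1", "worker-2"]
splits = ["part-0", "part-1", "part-2", "part-3", "part-4"]

# Leaf stage: the task on each worker scans its share of the splits.
scan_stage = plan_stage(1, splits, workers)

# Each task produces a partial result (here, simply how many splits it read).
partials = [len(task.splits) for task in scan_stage.tasks]

# An exchange would hand those partials to the next stage, which combines them.
final_result = sum(partials)

for task in scan_stage.tasks:
    print(task.worker, task.splits)
print("combined:", final_result)
```

In this model the stage is the unit of planning, the task is the unit of work on a worker, and the split is the unit of data a task consumes.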

Does Presto Use MapReduce?

Hive’s execution model breaks a query down through MapReduce to work on constituent data in HDFS. Presto similarly breaks down and fans out the work of a given query, but it uses its own mechanism and does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract information from, and otherwise deal with data sets that are too large or complex for traditional data processing application software.

Presto queries data. Competitors in this space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be in different formats and are typically stored in HDFS or an S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This metadata is stored in a database such as MySQL and accessed via the Hive metastore service. Through the Hive connector, Presto is able to access both of these components.
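The split between metadata and data files can be sketched with two lookups. This is a simplified model, not the Hive metastore API: the dictionaries, the table, the bucket path, and the file names are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableMetadata:
    columns: List[Tuple[str, str]]  # (column name, type)
    file_format: str
    location: str  # directory holding the data files


# Stand-in for the Hive metastore: (schema, table) -> metadata.
metastore: Dict[Tuple[str, str], TableMetadata] = {
    ("web", "page_views"): TableMetadata(
        columns=[("user_id", "bigint"), ("url", "varchar"), ("ts", "timestamp")],
        file_format="ORC",
        location="s3://example-bucket/warehouse/web/page_views/",
    ),
}

# Stand-in for the storage layer: directory -> data files.
storage: Dict[str, List[str]] = {
    "s3://example-bucket/warehouse/web/page_views/": ["000000_0.orc", "000001_0.orc"],
}


def resolve_table(schema: str, table: str) -> List[str]:
    """Look up the table's metadata first, then list the data files it points to."""
    meta = metastore[(schema, table)]
    return [meta.location + name for name in storage[meta.location]]


print(resolve_table("web", "page_views"))
```

A query engine needs both halves: the metadata says what the table looks like and where it lives, and the storage layer holds the bytes to read.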

One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto Use Spark?

Presto and Spark are two different query engines. At a high level, Spark supports complex, long-running queries, while Presto is better suited to short, interactive queries.

Does Presto Use YARN?

Presto does not depend on YARN as a resource manager. Instead, it uses a very similar architecture of dedicated Coordinator and Worker nodes, which do not need a Hadoop infrastructure in order to run.