What is Presto and what are its frequently asked questions (FAQ)?
Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratch‐ pad for a long calculus problems.
In the past, distributed query engines like Hive were designed to persist intermediate results to disk. As the below figure illustrates, Presto saves time by executing the queries in the memory of the worker machines, including performing operations on intermediate datasets there, instead of persisting them to disk. The data can reside in HDFS or any database or any data lake, and Presto performs the executions in-memory across your workers, shuffling data between workers as needed. Avoiding the need for writing and reading from disk between stages ultimately speeds up the query execution time. Hive intermediate data sets are persisted to disk. Presto executes tasks in-memory.
If this distributed in-memory model sounds familiar, that’s because Apache Spark uses the same basic concept to effectively replace MapReduce-based technologies. However, Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.
Presto was developed with the following design considerations:
• High performance with in-memory execution
• High scalability from 1 to 1000s of workers
• Flexibility to support a wide range of SQL use cases
• Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
• Federation of data sources via Presto connectors
• Seamless integration with existing SQL systems by adhering to the ANSI SQL standard
Portions of this answer are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/
Presto is used as the distributed query engine as part of an analytics platform. The presto engine is used for many different types of analytical workloads to address a variety of problems. Leveraging a separate storage and compute model, also known as “disaggregated,” Presto can query one or more data stores, metastores, and SQL tools.
1. Ad hoc querying
Presto was originally designed for interactive analytics, or ad hoc querying. That means results are returned in a matter of a few seconds so an analyst can drill-down and ask questions of large amounts of data interactively. In today’s “Internet era” competitive world and with data systems able to collect granular amounts of data in near-real-time, engineers, analysts, data scientists, and product managers all want the ability to quickly analyze their data and become data-driven organizations that make superior decisions, innovating quickly and improving their businesses.
They either simply type in simple queries by hand or use a range of visualization, dashboarding, and BI tools. Depending on the tools chosen, they can run 10s of complex concurrent queries against a Presto cluster. With Presto connectors and their in-place execution, platform teams can quickly provide access to the data sets that users want. Not only do analysts get access, but they can run queries in seconds and minutes–instead of hours–with the power of Presto, and they can iterate quickly on innovative hypotheses with the interactive exploration of any data set, residing anywhere.
2. Reporting and dashboarding
Because of the design and architecture of Presto and its ability to query across multiple sources, Presto is a great backend for reporting and dashboarding. Unlike the first-generation static reporting and dashboarding, today’s interactive reporting and dashboards are very different. Analysts, data scientists, product managers, marketers and other users not only want to look at KPI’s, product statistics, telemetry data and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie. This requires the backend – the underlying system – to be able to process data fast wherever it may sit. To support this type of self-service analytics, platform teams are required to either consolidate data into one system via expensive pipelining approaches or test and support every reporting tool with every database, data lake and data system their end users want to access. Presto gives data scientists, analysts and other users the ability to query data across sources on their own so they’re not dependent on data platform engineers. It also greatly simplifies the task of the data platform engineers by absorbing the integration testing and allowing them to have a single abstraction and end point for a range of reporting and dashboarding tools.
3. ETL using SQL
Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, Presto can be used to run resource-efficient and high throughput queries. ETL can process all the data in the warehouse; it generates tables that are used for interactive analysis or feeding various downstream products and systems.
Presto as an engine is not an end-to-end ETL system, nor is Hive or Spark. Some additional tools can be easily added to coordinate and manage numerous on-going time-based jobs, a.k.a. cron jobs, which take data from one system and move it into another store, usually with a columnar format. Users can use a workflow manage‐ ment system like open source Apache Airflow or Azkaban. These automate tasks that would normally have to be run manually by a data engineer. Airflow is an open source project that programmatically authors, schedules and monitors ETL work‐ flows, and was built by Airbnb employees who were former Facebook employees. Azkaban, another open source project, is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
The queries in batch ETL jobs are much more expensive in terms of data volume and CPU than interactive jobs. As such the clusters tend to be much bigger. So some companies will separate Presto clusters: one for ETL and another one for ad hoc queries. This is operationally advantageous since it is the same Presto technology and requires the same skills. For the former, it’s much more important that the throughput of the entire system is good versus latency for an individual query.
4. Data Lake Analytics
Data lakes have grown in popularity along with the rise of Amazon S3-compatible object storage which Amazon AWS has made popular. A data lake enables you to store all your structured and unstructured data as-is and run different types of analytics on it.
A data warehouse is optimized to analyze relational data, typically from transactional systems and business applications, where the data is cleaned, enriched, and transformed. A data lake is different from a data warehouse in that it can store all your data–the relational data as well as non-relational data from a variety of sources, such as from mobile apps, social media, time-series data–and you can derive more new insights from the analysis of that broader data set. Again you can do so without necessarily needing to process and copy that data beforehand.
Presto is used to query data directly on a data lake without the need for additional ingestion or copies. You can query any type of data in your data lake, including both structured and unstructured data. As companies become more data-driven and need to make faster, more informed decisions, the need for analytics on an increasingly larger amount of data has become a higher priority in order to do business.
5. Real-time analytics with real-time databases
Real-time analytics is becoming increasingly used in conjunction with consumer- facing websites and services. This usually involves combining data that is being captured in real time with historical or archived data. Imagine if an e-commerce site had a history of your activity archived in an object store like S3, but your current session activity is getting written to a real-time database like Apache Pinot. Your current session activity may not make it into S3 for hours until the next snapshot. By using Presto to unite data across both systems, that website could provide you with real- time incentives so you don’t abandon your cart, or it could determine if there’s possible fraud happening earlier and with greater accuracy.
Portions of this article are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/. This includes a chapter of Presto with Pinot, a real-time database.
Apache Presto is a misnomer in that it is not a project hosted under the Apache Software Foundation (ASF). Incubator and top level ASF projects are subject to the naming conventions “Apache [Project Name]” One example of this is Apache Cassandra. Instead, the open source presto project is with another well-known project hosting body, The Linux Foundation. They make sub-foundations and Presto is specifically under the Presto Foundation. This is similar to how Kubernetes is the project in Cloud Native Computing Foundation, which is a sub-foundation of The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Presto is licensed with.
Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratch‐ pad for a long calculus problem.