Presto Use Cases
No matter which version you’re using, Presto is used for a wide range of analytical use cases and is particularly good for interactive and ad hoc querying.
PrestoDB is for everyone!
It doesn’t matter if you’re operating at Facebook-like scale or at just a few nodes – PrestoDB can help with your interactive and ad hoc querying.
How Presto is used at Facebook
In 2012, Facebook developed and deployed Presto. Today they’re using Presto to power many different use cases, including:
Interactive analytics: Quickly analyze data and execute queries in seconds or minutes using visualizations, dashboards, notebooks, and BI tools.
Batch ETL: Populate and process data in the warehouse for better throughput and efficiency.
A/B testing: Choose which data to slice and dice and evaluate product changes via hypothesis testing, with results in hours (not days).
App Analytics: Build external-facing custom reporting tools with low latency and hundreds of highly available concurrent queries.
More Common Use Cases
Ad hoc querying
Use SQL to run ad hoc queries whenever you want, wherever your data resides. Presto lets you query data where it’s stored, so you don’t have to ETL it into a separate system. With Presto connectors and their in-place execution, platform teams can quickly provide access to the datasets analysts are interested in. Along with that access comes the power of Presto to run queries in seconds instead of hours, so analysts can iterate quickly on new hypotheses by interactively exploring any dataset, wherever it resides.
Reporting & dashboarding
Query data across multiple sources to build one Presto view of reports and dashboards for self-service business intelligence (BI). Presto reporting gives data scientists and analysts the ability to query data across sources on their own, so they’re not dependent on data platform engineers.
Transformation using SQL (ETL)
Aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, use Presto to run resource-efficient, high-throughput queries.
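As a rough illustration of SQL-based ETL, the sketch below wraps an aggregation in a `CREATE TABLE ... AS` statement so the result is materialized as a new table. The catalog, schema, table, and column names (`hive.sales.orders`, `hive.analytics.daily_revenue`, `order_date`, `amount`) are hypothetical, and the `format` table property assumes the target catalog uses the Hive connector. The helper only assembles the SQL text; submitting it to a cluster is left to whichever Presto client you use.

```python
def build_ctas(target: str, select_sql: str, file_format: str = "PARQUET") -> str:
    """Wrap a SELECT in CREATE TABLE ... AS so its result is stored as a table.

    `format` is a Hive connector table property; this is an assumption
    about the target catalog, not something every connector accepts.
    """
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{file_format}')\n"
        f"AS\n{select_sql}"
    )


# Hypothetical source table and columns, used for illustration only.
DAILY_REVENUE_SELECT = (
    "SELECT order_date, SUM(amount) AS revenue\n"
    "FROM hive.sales.orders\n"
    "GROUP BY order_date"
)

etl_sql = build_ctas("hive.analytics.daily_revenue", DAILY_REVENUE_SELECT)

if __name__ == "__main__":
    print(etl_sql)
```

Keeping the transformation as a plain SELECT means the same query can be run interactively first and only wrapped in the `CREATE TABLE ... AS` once the result looks right.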
Data lake analytics
Query data directly on a data lake without the need for transformation. Presto enables you to query any type of data in your data lake, including both structured and unstructured data.
Federated querying across multiple data sources
Query data across many different data sources, including databases, data lakes, and lakehouses, whether on-premises or in the cloud. Presto lets you join and aggregate the results in memory within the Presto query engine itself, so answers from every source come back through a single query.
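To make the idea concrete: Presto exposes each source as a catalog and addresses tables as `catalog.schema.table`, so a federated query is ordinary SQL that joins across catalogs. The sketch below assembles such a query as text. The catalogs, schemas, tables, and columns (`hive.sales.orders`, `mysql.crm.customers`, `region`, `amount`) are hypothetical and depend on how a given cluster is configured.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_revenue_query() -> str:
    """Build a query joining a data lake table with an operational database table."""
    # Hypothetical sources: order events in a data lake (a Hive catalog)
    # and customer records in a MySQL database (a MySQL catalog).
    orders = qualified("hive", "sales", "orders")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.region, SUM(o.amount) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region\n"
        "ORDER BY revenue DESC"
    )


if __name__ == "__main__":
    print(federated_revenue_query())
```

The resulting statement reads from both sources in one pass; neither table has to be copied into the other system first.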
Real-world examples from the community
Presto powers analytics on Uber’s data lake and is so widely used that about half the company logs into Presto at least once a month to do something that’s important for their work.
Watch the 1-minute video from the Presto virtual roundtable.
Twitter developed a highly scalable query predictor service by applying machine learning algorithms to ~10 million historical Presto queries, classifying queries by their CPU time and peak memory usage. This service helps improve the performance of Presto clusters and provides expected execution statistics on business intelligence dashboards.
Alibaba leverages Presto for data lake analytics to provide serverless, federated analytics.
Watch the 1-minute video from the Presto virtual roundtable.
Community Presentations & Resources
Apache Hudi is a fast-growing data lake storage system that helps organizations build and manage petabyte-scale data lakes. Hudi brings stream-style processing to batch-like big data by using primitives such as upserts, deletes, and incremental pulls. These features help surface faster, fresher data on a unified serving layer. Hudi can run on the Hadoop Distributed File System (HDFS) or cloud stores, and it integrates well with popular query engines such as Presto, Apache Hive, Apache Spark, and Apache Impala.
Hudi integrates with Presto to support different table/query types.
LinkedIn extended its Presto Hive Catalog with a smart logical abstraction layer that is capable of reasoning about logical views with UDFs by using two core components, Coral and Transport UDFs. Coral is a view virtualization library, powered by Apache Calcite, that represents views using their logical query plans.
LinkedIn leverages Coral abstractions to decouple view expression language from the execution engine, and hence execute non-Presto-SQL views inside Presto, and achieve on-the-fly query rewrite for data governance and query optimization.
Amundsen is an open-source data discovery and metadata platform that is part of the LF AI & Data Foundation. Amundsen integrates with Presto to power data preview and data exploration.
Amundsen can also be customized and extended to fit other companies’ data ecosystems.