Ashish Tadose, Co-founder and Principal Software Engineer, Ahana
The need for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow explosively. Data platform teams are increasingly using the federated SQL query engine PrestoDB to run such analytics for a variety of use cases across a wide range of data lakes and databases in-place, without the need to move data. PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter.
Let’s look at some important characteristics of Presto that account for its growing adoption.
- Easier integration with ecosystem
Presto was designed to integrate seamlessly with an existing data ecosystem without requiring any modification to the systems already in place. It’s like turbocharging your existing stack with an additional, faster data access interface.
Presto provides an additional compute layer for faster analytics. It doesn’t store the data, which gives it the massive advantage of being able to scale query resources up and down based on demand.
This separation of compute and storage makes the Presto query engine extremely suitable for cloud environments. Most cloud deployments leverage object storage, which is already decoupled from the compute layer, and auto-scale to optimize resource costs.
- Unified SQL interface
SQL is by far the oldest and the most widely used language for data analysis. Analysts, data engineers and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, Power BI, and Looker.
Presto is a federated query engine that can query data not just from distributed file systems, but also from other sources such as NoSQL stores like Cassandra and Elasticsearch, relational databases, and even message queues like Kafka.
The Facebook team developed Presto because Apache Hive was not suitable for interactive queries. Hive’s underlying architecture, which executes queries as multiple MapReduce or Tez jobs, works very well for large, complex jobs but does not suffice for low-latency queries. The Hive project has more recently introduced in-memory caching with Hive LLAP; it works well for certain kinds of queries, but it also makes Hive more resource-intensive.
Similarly, Apache Spark works very well for large, complex jobs using in-memory computation. However, it is not as efficient as Presto for interactive BI queries.
Presto is built for high performance, with several key features and optimizations, such as code generation, in-memory processing, and pipelined execution. Presto queries share a long-lived Java Virtual Machine (JVM) process on worker nodes, which avoids the overhead of spawning new JVM containers.
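To make the idea of pipelined execution a little more concrete, here is a minimal conceptual sketch in Python. It is not Presto’s implementation; the operator names (`scan`, `where`, `total`) and the sample rows are invented for illustration. Each operator is a generator, so a row flows through the whole chain before the next row is read, and no intermediate result set is materialized.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]
events: List[str] = []  # records the order in which operators touch rows


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical scan operator: emits rows one at a time."""
    for row in table:
        events.append(f"scan {row['id']}")
        yield row


def where(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Hypothetical filter operator: passes on only matching rows."""
    for row in rows:
        if predicate(row):
            events.append(f"filter-pass {row['id']}")
            yield row


def total(rows: Iterable[Row], column: str) -> float:
    """Hypothetical aggregation: consumes the stream and sums one column."""
    result = 0.0
    for row in rows:
        events.append(f"aggregate {row['id']}")
        result += row[column]
    return result


# Illustrative data only.
table = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 3.0},
    {"id": 3, "amount": 7.5},
]

# Roughly: SELECT SUM(amount) FROM table WHERE amount > 5
answer = total(where(scan(table), lambda r: r["amount"] > 5), "amount")
print(answer)
print(events)
```

The recorded events show row 1 being scanned, filtered, and aggregated before row 2 is even read, which is the essence of pipelining: downstream operators start working as soon as upstream operators produce a row.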
- Query Federation
Presto provides a single unified SQL dialect that abstracts all supported data sources. This is a powerful feature that eliminates the need for users to understand the connections and SQL dialects of the underlying systems.
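As a rough, self-contained analogy for query federation, the sketch below uses Python’s built-in `sqlite3` module rather than Presto itself. Two separately attached in-memory databases stand in for two independent data sources, and a single SQL statement joins across them. The names `crm`, `lake`, `customers`, and `orders`, and all the data, are assumptions made up for this example.

```python
import sqlite3

# One connection plays the role of the single SQL endpoint.
conn = sqlite3.connect(":memory:")

# Two separately attached databases stand in for two independent sources
# (hypothetical names: "crm" for a relational store, "lake" for a data lake).
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)

# A single statement joins data that lives in two different sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

In Presto the same pattern is typically written with fully qualified `catalog.schema.table` names, where each catalog maps to a configured connector, so the user writes one query without handling each source’s native interface.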
- Design suitable for cloud
Presto’s fundamental design of running storage and compute separately makes it extremely convenient to operate in cloud environments. Since the Presto cluster doesn’t store any data, it can be auto-scaled depending on the load without causing any data loss.
As you can see, Presto offers numerous advantages for interactive ad hoc queries. No wonder data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place, without the need to move data.
To learn more about Presto, listen to the Tech Talk Series: Getting Started with Presto on demand at your convenience.