Why I’m betting on PrestoDB, and why you should too!


By Dipti Borkar, Ahana Cofounder, Chief Product Officer & Chief Evangelist

I’ve been in open source software companies and communities for over 10 years now, and in the database industry my whole career. So I’ve seen my fair share of the good, the bad, and the ugly when it comes to open source projects and the communities that surround them. And I’d like to think that all this experience has led me to where I am today – cofounder of Ahana, a company that’s betting big on an open source project: PrestoDB. Let’s first talk about the problem we’re trying to solve.

The Big Shift 

Organizations have been using costly, proprietary data warehouses as the workhorse for analytics for many years. And in the last few years, we’ve seen a shift to cloud data warehouses like Snowflake, Amazon Redshift, and BigQuery.

Couple that with the fact that organizations have much more data (from terabytes to tens and hundreds of terabytes, even petabytes) and more different types of data (telemetry, behavioral, IoT, and event data in addition to enterprise data), and there’s even greater urgency for users not to get locked into proprietary formats and proprietary systems. 

These shifts, along with AWS commoditizing a storage layer that is ubiquitous and affordable, mean that a lot more data now lives in cloud object stores like AWS S3, Google Cloud Storage, and Azure Blob Storage. 

So how do you query the data stored in these data lakes?  How do you ask questions of the data and pull the answers into reports and dashboards? 

The “SQL on S3” Problem 

  • Data lakes are only storage – there is no intelligence in the data lake 
  • If the data is structured (think tables and columns), SQL is hands down the best way to query it – hey, it’s survived 50+ years 
  • If the data is semi-structured (think nested formats like JSON), SQL can still be used to query it with extensions to the language – see the sketch after this list 
  • But SQL on data lakes was complicated. Hadoop tried it, and we know that didn’t work: while it sort of solved the problem, getting 70+ different components and projects to integrate and work together turned out to be a nightmare for data platform teams. 
  • There was really no simple yet performant way of querying S3
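
To make the semi-structured point concrete, here is a minimal sketch of what that looks like in Presto. The events table and its payload column are hypothetical; json_extract_scalar is one of the JSON extensions Presto adds on top of standard SQL for reaching into nested data.

```sql
-- A minimal sketch: querying semi-structured JSON with Presto.
-- "events" and its "payload" column are hypothetical names.
SELECT
  json_extract_scalar(payload, '$.user.id')   AS user_id,
  json_extract_scalar(payload, '$.device.os') AS device_os,
  count(*)                                    AS event_count
FROM events  -- assumed: a Hive-connector table over JSON files in S3
WHERE json_extract_scalar(payload, '$.type') = 'click'
GROUP BY 1, 2;
```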

That is where Presto comes in. 

Presto is the best engine built to directly query open formats and data lakes. Presto replaced Hive at Facebook. Presto is the heart of the modern Open Data Lake Analytics Stack – an analytics stack that includes open source, open formats, open interfaces, and open cloud. (You can read more details about this in my Dataversity article.) 

The fact of the matter is that this problem can be solved with many different open source projects/engines. At Ahana, we chose Presto – the engine that runs at Facebook, Uber and Twitter. 

Why PrestoDB?

1. Crazy good tech 

Presto is in-memory, scalable, and built like a database. And with the new innovations being added, Presto is only becoming bigger and better. 

Like I mentioned earlier, I’ve been in open source and the database space for a long time and built multiple database engines – both structured and semi-structured (aka NoSQL). I believe that PrestoDB is the query engine most aligned with the direction the analytics market is headed. At Ahana, we’re betting on PrestoDB and have built a managed service around it that solves the problem for Open Data Lake Analytics – the SQL on S3 problem. 

No other open source project has come close. Some that come to mind –

  • Apache Drill unfortunately lost its community. It had a great start, being based on the Dremel paper published by Google, but over time it didn’t get the support it should have from the vendor behind it, and the community fizzled. 
  • SparkSQL (built on Apache Spark) isn’t built like a database engine; instead, it’s bolted on top of a general-purpose computation engine.
  • Trino, a hard fork of Presto, is largely focused on a different problem of broad access across many different data sources versus the data lake. I fundamentally believe that all data sources are NOT equal and that data lakes will be the most important data source over time, overtaking the data warehouses. Data sources are not equal for a variety of reasons: 
    • Amount of data stored 
    • Type of information stored 
    • Type of analysis that can be supported on it 
    • Longevity of data stored
    • Cost of managing and processing the data 

Given that 80-90% of data lives in S3, I expect that 80-90% of analytics will run against this cheapest data source: the data lake. The need to perform correlations across other data sources comes up only 5-10% of the time, and only for a window of time until the data from those sources also gets ingested into the data lake. 

As an example: MySQL is the workhorse for operational transactional systems. A complex analytical query with multi-way joins from a federated engine could bring down the operational system. And while there may be a small window of time when the data in MySQL is not available in the data lake, it will eventually be moved over to the lake, and that’s where the bulk of the analysis will happen. 

2. Vendor-neutral Open Source, not Single-vendor Open Source

On top of being a great SQL query engine, Presto is open source and, most importantly, part of the Linux Foundation – governed with transparency and neutrality. On the other hand, Trino, the hard fork of Presto, is a single-vendor project, with most of the core contributors being employees of that vendor. This is problematic for any company planning to use a project as a core component of its data infrastructure, more so for a vendor like Ahana that needs to be able to support its customers and contribute to and enhance the source code. 

Presto is hosted under Linux Foundation in the Presto Foundation, similar to how Kubernetes is hosted by Cloud Native Computing Foundation (CNCF) under the Linux Foundation umbrella. Per the bylaws of the Linux Foundation, Presto will always stay open source and vendor neutral. It was a very important consideration for us that we could count on the project remaining free and open forever, given so many examples where we have seen single-vendor projects changing their licenses to be more restrictive over time to meet the vendor’s commercialization needs. 

For those of you who follow open source, you most likely saw the recent story on Elastic changing its open-source license, which created quite a ripple in the community. Without going into all the details, it’s clear that this move prompted a backlash from a good part of the Elastic community and its contributors. This wasn’t the first “open source” company to do this (see MongoDB, Redis, etc.), nor will it be the last. I believe that over time, users will always pick a project driven by a true gold-standard foundation like the Linux Foundation or the Apache Software Foundation (ASF) over a company-controlled project. Kubernetes over time won the hearts of engineers over alternatives like Apache Mesos and Docker Swarm and now has one of the biggest, most vibrant user communities in the world. I see this happening with Presto.

3. Presto runs in production @ Facebook, runs in production @ Uber & runs in production @ Twitter. 

The most innovative companies in the world run Presto in production for interactive SQL workloads. Not Apache Drill, not SparkSQL, not Apache Hive and not Trino. The data warehouse engine at Facebook is Presto. Ditto that for Uber, likewise for Twitter. 

When a technology is used at the scale these giants run at, you know you are not only getting technology created by the brightest minds but also technology tested at internet scale. 

The Conclusion 

And that’s why I picked Presto. That’s PrestoDB, and there’s only one Presto. 

In summary, I believe that Presto is the de facto standard for SQL analytics on data lakes. It is the heart of the modern open analytics stack and will be the foundation of the next 10 years of open data lake analytics. 

Join us on our mission to make PrestoDB the open source, de facto standard query engine for everyone.

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. An EMR cluster with Spark is very different from an EMR Presto cluster:

  • EMR is a big data platform that automates provisioning, tuning, and other operational work for big data workloads. Presto is a distributed SQL query engine, also called a federation middle tier. Using EMR, users can spin up, scale, and deploy Presto clusters, and Presto can connect to many different data sources; common integrations include Elasticsearch, HBase, and AWS S3.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark core does not support SQL – for SQL support you install the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive, ad hoc queries or for federating data from multiple sources; for those, Presto is a better choice.

There are some similarities: EMR, Spark, and Presto all share distributed, parallel architectures, and all are designed for dealing with big data. And PrestoDB is included in Amazon EMR release version 5.0.0 and later. 

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications. EMR offers easy provisioning, auto-scaling, and fault tolerance, and as you’d expect it has good integration with the AWS ecosystem, like S3, DynamoDB, and Redshift. An EMR cluster may be configured as “long running” or as a transient cluster that auto-terminates once the processing job(s) have completed.

EMR comes with some disadvantages:

  • EMR does not offer support for Presto – users must create their own Presto metastore, configure connectors, and install and configure the tools they need. 
  • EMR can be complex – if you have a database requirement, then provisioning EMR, Spark, and S3, and ensuring you use the right file formats, networking, roles, and security, can take much longer than deploying a packaged MPP database solution like Redshift. 
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain data the way you can with EC2 instances (even though EMR runs on EC2 instances under the covers). Data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes). The only workaround is to write all your EMR data to S3 before each shutdown and ingest it back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3, and manage the cost implications. 
  • On its own, EMR doesn’t include any tools – no analytical tools, BI, visualization, SQL lab, or notebooks. No HBase or Flume. Not even an HDFS CLI. You have to roll your own tool integrations and deal with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time, as you can with Presto, Cloudera, Spark, and most other frameworks. Similarly, EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test, and submit tasks, as all jobs have to go through a submission process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR stands for Elastic MapReduce, and the MapReduce execution framework is designed for large batch processing, not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2 directly, installing Hadoop, and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster running Spark against Presto, your choice ultimately boils down to the use cases you are trying to solve. 

Spark SQL vs Presto

In this article we’ve laid out a comparison of Spark SQL vs Presto. There are some commonalities and differences to be aware of: 

Commonality: 

  • They are both open source, “big data” software frameworks
  • They are distributed, parallel, and in-memory
  • BI tools connect to them using JDBC/ODBC
  • Both have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud. They can be containerized

Differences:

  • Presto is an ANSI SQL:2003 query engine for accessing and unifying data from many different data sources. It’s deployed as a middle layer for federation – see the federated query sketch after this list.
  • Spark is a general-purpose cluster-computing framework. Core Spark does not support SQL – for SQL support you install the Spark SQL module which adds structured data processing capabilities. Spark SQL is also ANSI SQL:2003 compliant (since Spark 2.0).
  • Presto is more commonly used to support interactive SQL queries.  Queries are usually analytical but can perform SQL-based ETL.  
  • Spark is more general in its applications, often used for data transformation and Machine Learning workloads. 
  • Presto supports querying data in object stores like S3 by default, and has many connectors available. It also works really well with Parquet and ORC format data.
  • Spark must use Hadoop file APIs to access S3 (or pay for Databricks features). Spark has limited connectors for data sources. 
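
As a sketch of what that federation difference looks like in practice, here is a hypothetical Presto query joining a fact table in a Hive catalog (data lake on S3) with a dimension table in a MySQL catalog. All catalog, schema, and table names here are assumptions; the point is that a single SQL statement can span both sources.

```sql
-- Hypothetical catalogs: "hive" (data lake) and "mysql" (operational DB)
SELECT c.customer_name,
       sum(o.order_total) AS lifetime_value
FROM hive.sales.orders AS o        -- fact table in the data lake
JOIN mysql.crm.customers AS c      -- dimension table in MySQL
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 10;
```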

Many users today are weighing these two engines. The points above lay out the main differences between Presto and Spark SQL and how the two compare.

If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Catalogs

Presto has several important components that allow you to easily organize and query data. These components are catalogs, schemas and tables, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in a single query. Catalogs contain schemas and information about where data is located. Every Presto catalog is associated with a specific Presto connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source. 

Catalogs are defined in properties files stored in the Presto configuration directory. A schema is what you use to organize your tables. Catalogs and schemas together define what can be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, the fully-qualified table name hive.test_data.test refers to the test table in the test_data schema of the hive catalog.

When you run a SQL statement in Presto, you are running it against one or more catalogs. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples include a Hive catalog that connects to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add an additional properties file to etc/catalog with a different name, making sure it ends in .properties. If you name the property file sales.properties, Presto creates a catalog named sales using the configured connector.
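
As a minimal sketch, that second Hive catalog might look like this (hive-hadoop2 is the Hive connector name in PrestoDB; the metastore address is a placeholder):

```properties
# etc/catalog/sales.properties -- defines a catalog named "sales"
connector.name=hive-hadoop2
# hypothetical metastore for the second Hive cluster
hive.metastore.uri=thrift://metastore.example.com:9083
```

After restarting the coordinator and workers, tables in that cluster are addressable by fully-qualified name, for example (schema and table names assumed):

```sql
SELECT count(*) FROM sales.web.page_views;  -- catalog.schema.table
```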

Where to Find Presto Source Code, and How to Work With it

The main branch: PrestoDB source code

Presto is an open source project developed in the open on a public GitHub repository: https://github.com/prestodb/presto. The prestodb repo is the original, master repo from when Presto was first developed at Facebook, which subsequently open sourced the code base on GitHub in 2013 under the Apache 2.0 license, a permissive license that lets anyone download and use the code. 

Whether you call it master, trunk, upstream, or mainline, those all refer to the prestodb repo: https://github.com/prestodb/presto. It is the single, shared, current state of the software project. Whenever you wish to start a new piece of work, you pull code from this origin into your local repository and begin coding. The master repo is the single, shared codeline and represents the central repo and single point of record for the project. Most Presto clones and forks come from prestodb. 

It is worth noting that the Facebook team continues to develop and run the prestodb project in production at Facebook at scale, so the community benefits from the development and testing done by Facebook and the other companies that run Presto. There are numerous other contributors of course, many of them working within organizations that have deployed PrestoDB at scale.

What about other forks?

As an open source project, community members can download the source code and work on it in their own public or private repos. Those members can decide to contribute changes back through the traditional GitHub development process of pull requests, reviews, and commits. If you are starting anew, it is generally recommended to pull source code from the master, mainline prestodb open source repo: https://github.com/prestodb/presto

Some members decide not to contribute back, and that Presto version becomes a fork of the code. There are always a number of forks out in the community, and as time goes on their development tends to diverge from the original codeline. A fork misses out on the upstream changes and testing that companies like Facebook and others do, unless it is merged back with upstream. 

[Figure: Git branching patterns diagram]
Source: https://martinfowler.com/articles/branching-patterns.html

What about PrestoSQL source code?

PrestoSQL is a fork of the original Presto project. It has a separate github repository here: https://github.com/prestosql/presto

PrestoDB is the main project of the Linux Foundation’s Presto Foundation. That foundation has a wide-ranging set of industry members, including Facebook, Uber, Twitter, and Alibaba; Starburst is another industry member that has joined. As such, PrestoDB is the project repo of Presto both today and in the future. 

How can I work with the Presto source code?

The easiest way is through your own GitHub account. If you don’t yet have a GitHub account, it is free and easy to sign up. Then go to https://github.com/prestodb/presto and clone the repository into your GitHub account. If you’d like to work off your laptop, you can use a wide variety of free tools, like Atom or GitHub Desktop. Now you can get coding and compiling! Feel free to join the PrestoDB Slack channel to ask questions of the community (or answer questions too!). There is also the Presto Users Google group at https://groups.google.com/g/presto-users and Presto developers are active on Stack Overflow: https://stackoverflow.com/questions/tagged/presto
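
As a quick sketch, getting and building the code locally looks something like this (the Maven flags here are typical assumptions for a first build; check the repo’s README for the current instructions):

```sh
# Clone the mainline PrestoDB repository
git clone https://github.com/prestodb/presto.git
cd presto

# Build with Maven, skipping tests to speed up a first local build
mvn clean install -DskipTests
```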

How can I contribute back to the PrestoDB source?

When you’re ready to contribute your code back to the community, you can use GitHub to raise a pull request and get your code reviewed. You’ll need to e-sign a Contributor License Agreement (CLA), as is standard for Linux Foundation projects. 

Note that there are other ways to contribute back to the community. The code base is wide and varied; there are opportunities to write or improve Presto connectors, or other parts of the Presto SQL engine. In addition to writing code, you could write documentation. 

Wrapping Up

Now you know how to find the Presto source code and understand the different forks that are out there. Hope to see you in the community!

Spark Streaming Alternatives

When researching Spark alternatives, the right choice really depends on your use case. Are you processing streaming data or batch data? Do you prefer an open source or a closed-source/proprietary alternative? Do you need SQL support?

With that in mind let’s look at ten closed-source alternatives to Spark Streaming first:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice.
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – a streaming data integration and operational intelligence platform designed to enable continuous query and processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines.  Features graphical stream visualizations
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

And here are five open source alternatives to Spark Streaming:

  1. Apache Flink – considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for stream as well as batch processing at scale. It provides a fault-tolerant, operator-based model for streaming and computation, rather than the micro-batch model of Apache Spark.
  2. Apache Beam – a workflow manager for batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
  3. Apache Apex – Enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – A distributed stream processing framework
  5. Apache Storm – distributed realtime computation system 

So there you have it. Hopefully you can now find a suitable alternative to Spark streaming. Learn more about Spark SQL vs Presto in our comparison article.

Presto Concepts

What is Presto?

PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is built on Java and can also integrate with other third-party data sources or infrastructure components. 

As more organizations become data-driven, they need technologies like Presto to deliver ad-hoc analytics. Federated query engines like Presto simplify and unify data analytics on data anywhere. 

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between PrestoDB and other forks?

Presto originated from Facebook and was built specifically for Facebook. PrestoDB is backed by Linux Foundation’s Presto Foundation and is the original Facebook open source project. 

Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory used by Presto usually refers to the JVMs themselves; depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. Presto itself, however, doesn’t use this memory to cache any data. 
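
As a sketch, that JVM and query memory is tuned through Presto’s standard configuration files. The property names below are real PrestoDB settings; the values are placeholder assumptions you would size to your own cluster:

```properties
# etc/jvm.config -- heap size for each Presto JVM (value is an assumption)
-Xmx16G

# etc/config.properties -- per-query memory limits (values are assumptions)
query.max-memory=30GB
query.max-memory-per-node=8GB
```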

How does Presto cache and store data?

Presto buffers intermediate data in memory for the duration of tasks. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that lets you query other, disparate data sources. 

What is the Presto query execution model?

The Presto query execution model is split into a few different concepts: Statement, Query, Stage, Task, and Split. After you issue a SQL query (or Statement) to the query engine, it parses and converts it to a query. When Presto executes the query, it does so by breaking it up into multiple stages. Stages are then split into tasks across the Presto workers. Think of tasks as the units that actually do the work and processing. Tasks use an Exchange to share data between tasks and the outputs of processes. 
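
You can see this breakdown yourself with EXPLAIN. For example, the distributed plan below shows the stages (plan fragments) a simple aggregation will run as; the tpch catalog is Presto’s built-in TPC-H connector, assuming it is configured:

```sql
-- Show the distributed (staged) execution plan for a simple aggregation
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderstatus, count(*)
FROM tpch.tiny.orders
GROUP BY orderstatus;
```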

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, Presto will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and typically are stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This data is stored in a database such as MySQL and accessed via the Hive metastore service. Presto via the Hive connector is able to access both these components. 

One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI. 

Does Presto Use Spark?

Presto and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

Presto is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.

Using Presto with Hadoop

How does Presto work with Hadoop? What is Presto Hadoop?

You use Presto to run interactive queries on Hadoop. The difference between using Presto versus something like Hive, for instance, is that Presto is optimized for fast performance – this is crucial for interactive queries.

Presto’s distributed system runs on Hadoop and uses an architecture that’s similar to a massively parallel processing (MPP) database management system. 

Presto does not have its own storage system, so it complements Hadoop. It can be installed with any implementation of Hadoop, like Amazon’s EMR Hadoop distribution.

If you’re looking for a Presto Hadoop tutorial, check out our getting started page.

Can Presto connect to HDFS?

Yes, Presto connects with HDFS through the Hive connector. You can use the Hive connector to query data stored in HDFS (or AWS S3, for that matter). One of the big benefits of Presto is that you can query data files in varying formats, which makes it easy to analyze all of your HDFS data. You can check out more on the Hive connector in the prestodb docs.

Is Presto related to Hive? What are the differences?

Apache Hive was developed as a project at Facebook in 2008 so they could leverage SQL syntax in their Hadoop system. It simplifies complex Java MapReduce jobs into SQL-like queries while executing jobs at massive scale.

As their Hadoop deployments grew, the Facebook team found that Hive wasn’t optimized for the fast performance they needed for their interactive queries. So they built Presto in 2013, which could operate quickly at petabyte scale.

At a high level, Hive is optimized for large batch jobs and ease of use with its SQL-like syntax, while Presto is highly optimized for low latency and fast performance – it takes longer for Hive to complete a job than it does for Presto.

Presto on AWS

Presto and AWS

Presto is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources, which makes Presto and AWS a powerful combination.

If you want to run Presto in AWS, it’s easy to spin up a managed Presto cluster via the AWS Management Console, the AWS CLI, or the Amazon EMR API – for example, via the CLI sketch below.
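
As a sketch, a minimal CLI invocation might look like the following; the release label, instance type, count, and key name are placeholder assumptions you would adapt to your environment:

```sh
# Spin up a small EMR cluster with Presto installed (values are assumptions)
aws emr create-cluster \
  --name "presto-cluster" \
  --release-label emr-5.33.0 \
  --applications Name=Presto \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key
```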

You can also give Ahana Cloud a try, a managed service for Presto that takes care of the devops for you.

Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effectiveness of the cloud while allowing you to take advantage of Presto’s distributed query engine. 

How does Presto work with AWS?

Two AWS services that work with Presto are Amazon EMR and Amazon Athena. They’re managed services that do the integration, testing, setup, configuration, and cluster tuning for you. Athena and EMR are widely used, but both come with some challenges.

There are some differences between Presto on EMR and Athena. AWS EMR enables you to provision as many compute instances as you want, in minutes. Amazon Athena lets you deploy Presto using the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune. Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Learn more about those challenges and why users are moving to Ahana Cloud, SaaS for Presto on AWS.
