Using Presto with Hadoop

How does Presto work with Hadoop? What is Presto Hadoop?

You use Presto to run interactive queries on data stored in Hadoop. The main difference from a tool like Hive is that Presto is optimized for low-latency execution, which is crucial for interactive queries.

Presto runs as its own distributed cluster alongside Hadoop, made up of a coordinator and a set of workers, and uses an architecture similar to that of a massively parallel processing (MPP) database management system.
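To make the MPP idea concrete, here is a toy sketch in Python of the scatter-gather pattern such systems rely on: the data is divided into splits, each split is processed independently, and the partial results are merged at the end. This is not Presto code. The function names, the `status` field, and the use of threads in place of worker machines are all assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_count(split):
    """Worker-side step (hypothetical): count matching rows in one split."""
    return sum(1 for row in split if row["status"] == "error")


def run_query(splits, workers=4):
    """Coordinator-side step (hypothetical): fan out splits, merge partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_count, splits))
    # The final result is the sum of the per-split partial counts.
    return sum(partials)


# Three made-up splits standing in for chunks of a larger table.
splits = [
    [{"status": "ok"}, {"status": "error"}],
    [{"status": "error"}, {"status": "error"}],
    [{"status": "ok"}],
]

total_errors = run_query(splits)
print(total_errors)
```

Because each split is handled independently, adding more workers lets more splits be processed at the same time, which is the property that makes this style of architecture a good fit for interactive queries.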

Presto does not have its own storage system, so it complements Hadoop rather than replacing it. It can be installed alongside any Hadoop distribution, such as Amazon EMR.

If you’re looking for a Presto Hadoop tutorial, check out our getting started page.

Can Presto connect to HDFS?

Yes, Presto connects to HDFS through the Hive connector. You can use the Hive connector to query data stored in HDFS (or in Amazon S3, for that matter). One of the big benefits of Presto is that it can query data files in a variety of formats, which makes it easier to analyze all of your HDFS data. You can read more about the Hive connector in the PrestoDB documentation.
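As a rough illustration of what setting up the Hive connector involves, the Python sketch below renders the contents of a minimal Hive catalog file. The `connector.name` and `hive.metastore.uri` properties come from the PrestoDB Hive connector documentation; the helper function name and the metastore host are hypothetical, and a real deployment will usually need more settings than these two.

```python
def render_hive_catalog(metastore_uri):
    """Return the text of a minimal Hive catalog properties file.

    Hypothetical helper for illustration; it only formats two properties.
    """
    if not metastore_uri.startswith("thrift://"):
        raise ValueError("expected a thrift:// metastore URI")
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri={metastore_uri}",
    ]
    return "\n".join(lines) + "\n"


# The host below is a placeholder, not a real endpoint.
catalog_text = render_hive_catalog("thrift://metastore.example.com:9083")
print(catalog_text)
```

Saved as etc/catalog/hive.properties on each Presto node, a file like this registers a catalog named hive, so tables can then be referenced in queries as hive.schema_name.table_name.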

Is Presto related to Hive? What are the differences?

Apache Hive was developed as a project at Facebook in 2008 so they could leverage SQL syntax in their Hadoop system. It simplifies complex Java MapReduce jobs into SQL-like queries while executing jobs at massive scale.

As their Hadoop deployments grew, the Facebook team found that Hive wasn’t optimized for the fast performance they needed for interactive queries. So they built Presto, released as open source in 2013, to run queries quickly at petabyte scale.

At a high level, Hive is optimized for ad-hoc analysis and ease of use with its SQL-like syntax, while Presto is highly optimized for low latency. In practice, a Hive job typically takes longer to complete than the equivalent Presto query.