Optimize Presto on Amazon EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto on scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets, and lets you move large volumes of data into and out of AWS data stores like S3. EMR runs on Amazon EC2 instances, which provide fast provisioning, scalability and high availability of compute power.

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop ecosystem tools, including Presto, Spark and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on the cluster instances for the lifetime of the cluster. HDFS data does not outlive the cluster: to persist it, you must copy it to S3 before the cluster is terminated.
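One common way to copy HDFS data to S3 before termination is to run s3-dist-cp as a cluster step. The sketch below only builds the step definition; the HDFS path, bucket name and step name are hypothetical placeholders, and s3-dist-cp is one option among several rather than a required method.

```python
def build_hdfs_to_s3_step(hdfs_path: str, s3_uri: str) -> dict:
    """Return an EMR step definition that copies hdfs_path to s3_uri with s3-dist-cp."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("destination must be an s3:// URI")
    return {
        "Name": "Copy HDFS output to S3",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step invoke tools such as s3-dist-cp
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", hdfs_path, "--dest", s3_uri],
        },
    }


# Hypothetical paths, for illustration only.
step = build_hdfs_to_s3_step("hdfs:///output/daily", "s3://example-bucket/output/daily")
print(step["HadoopJarStep"]["Args"])
```

A step built this way could be submitted to a running cluster with boto3's `add_job_flow_steps(JobFlowId=..., Steps=[step])` before the cluster is shut down.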

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. On EMR, the question is less Presto versus Hive and more how to use the two together. Because a single Presto query can access data from multiple sources, it supports analytics across an entire organization. A typical combination might be Presto running alongside Hadoop for analytics.
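To make the idea of a federated query concrete, the sketch below joins a Hive table with a MySQL table in one statement. The catalog, schema, table and column names are all hypothetical. The optional `run_query` helper assumes the `presto-python-client` package (`prestodb`) is installed and that the coordinator listens on port 8889, which is the usual port on EMR but should be checked for your cluster.

```python
# Hypothetical catalogs: `hive` (data in S3/HDFS) and `mysql` (a relational database).
FEDERATED_QUERY = """
SELECT c.region, count(*) AS events
FROM hive.web.page_views AS v
JOIN mysql.crm.customers AS c ON v.customer_id = c.id
GROUP BY c.region
ORDER BY events DESC
"""


def run_query(sql: str, host: str = "localhost", port: int = 8889, user: str = "hadoop"):
    """Run sql against a Presto coordinator and return all rows."""
    import prestodb.dbapi  # imported lazily so the query text is usable without the client

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog="hive", schema="default"
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


print(FEDERATED_QUERY.strip())
```

Each table is addressed as `catalog.schema.table`, which is how Presto tells the sources apart within a single query.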

Using Presto on Amazon EMR

Running Presto on Amazon EMR allows users to run interactive queries on large data sets with minimal setup time. AWS EMR handles the provisioning, configuration and tuning of Hadoop clusters, and Presto queries running on EMR can be optimized as well. Provided you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.
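As a rough illustration of launching such a cluster programmatically, the sketch below builds the arguments for boto3's `run_job_flow` call with Presto in the application list. The cluster name, release label, instance type and instance count are illustrative assumptions, and the IAM role names are the EMR defaults, which may differ in your account.

```python
def build_presto_cluster_request(
    name: str,
    release_label: str = "emr-5.36.0",  # illustrative; any 5.0.0+ release includes Presto
    instance_type: str = "m5.xlarge",   # illustrative instance type
    instance_count: int = 3,
) -> dict:
    """Return keyword arguments for boto3's EMR run_job_flow with Presto enabled."""
    major = int(release_label.split("-")[1].split(".")[0])
    if major < 5:
        raise ValueError("this sketch assumes EMR 5.0.0 or later")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Presto"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile name
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role name
    }


def launch(request: dict) -> str:
    """Submit the request and return the new cluster ID (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


request = build_presto_cluster_request("presto-adhoc")
print(request["Applications"])
```

Keeping the request builder separate from `launch` makes it possible to inspect or test the configuration without touching AWS.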

AWS EMR and Presto configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on the cluster instances or read from S3. From EMR 5.12.0 onwards, Presto on EMR uses EMRFS by default to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications like Presto access to data stored in S3 without additional configuration or copying of the data. For earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.
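Where the EMRFS default is not wanted, EMR lets you override connector properties through a configuration classification supplied at cluster launch. The sketch below assumes the `presto-connector-hive` classification and the `hive.s3-file-system-type` property as described in the EMR documentation; verify both against the release you are running.

```python
import json


def presto_s3_filesystem_config(use_emrfs: bool = True) -> list:
    """Return EMR configuration objects selecting how Presto reads from S3.

    An empty list keeps the EMRFS default (EMR 5.12.0 and later). Otherwise the
    Hive connector is asked to use Presto's own S3 file system implementation.
    """
    if use_emrfs:
        return []
    return [
        {
            "Classification": "presto-connector-hive",
            "Properties": {"hive.s3-file-system-type": "PRESTO"},
        }
    ]


configurations = presto_s3_filesystem_config(use_emrfs=False)
print(json.dumps(configurations, indent=2))
```

The resulting list is in the shape EMR expects for the `Configurations` argument when a cluster is created.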

Real world applications


Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget those users with relevant ads. It sits between advertisers and their audiences, so real-time bidding for advertising space is critical to its business. The amount of data Jampp generates as part of the bidding cycle is massive: 1.7B events are tracked per day, 550K requests are received per second, and 100TB of data is processed by AWS Elastic Load Balancers per day. PrestoDB plays a critical role in Jampp's data infrastructure. Jampp relies on Presto on AWS EMR for its ad hoc queries and runs over 3K ad hoc queries per day on over 600TB of queryable data.