Presto vs Spark With EMR Cluster
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. An EMR cluster running Spark is very different from an EMR cluster running Presto:
- EMR is a managed platform that lets you automate provisioning, tuning, and similar tasks for big data workloads. Presto is a distributed SQL query engine, sometimes described as a federation middle tier. Using EMR, users can spin up, scale and deploy Presto clusters. Presto can connect to many different data sources through its connectors; common integrations include Elasticsearch, HBase, and Amazon S3.
- Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark Core itself has no SQL interface; SQL support comes from the Spark SQL module, which adds structured data processing capabilities. Spark is not primarily designed for interactive or ad hoc queries, nor for federating data from multiple sources; for those workloads Presto is a better choice.
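To make the federation point concrete, here is a minimal sketch of how a single Presto query can address tables in two different sources through fully qualified `catalog.schema.table` names. The catalog names (`hive`, `elasticsearch`), the schemas, tables and the `customer_id` column are hypothetical and depend entirely on how connectors are configured on your cluster; the helper function is illustrative only and simply builds the SQL text.

```python
def federated_query(left_table: str, right_table: str, join_key: str) -> str:
    """Build a Presto SQL statement that joins tables from two catalogs.

    Each table is named catalog.schema.table; Presto routes each name to
    the connector configured for that catalog.
    """
    for name in (left_table, right_table):
        if name.count(".") != 2:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return (
        f"SELECT o.{join_key}, count(*) AS events\n"
        f"FROM {left_table} AS o\n"
        f"JOIN {right_table} AS e ON o.{join_key} = e.{join_key}\n"
        f"GROUP BY o.{join_key}"
    )


# Hypothetical names: an S3-backed Hive table joined with an Elasticsearch index.
QUERY = federated_query(
    "hive.sales.orders", "elasticsearch.default.clicks", "customer_id"
)
print(QUERY)
```

The resulting statement could be submitted through any Presto client; the join itself runs in Presto rather than in either source system.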
There are some similarities: EMR, Spark and Presto all share distributed, parallel architectures and are all designed for dealing with big data. PrestoDB is also included in Amazon EMR release version 5.0.0 and later.
A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications. EMR offers easy provisioning, auto-scaling (including for Presto clusters), fault tolerance, and, as you’d expect, good integration with the rest of the AWS ecosystem, such as S3, DynamoDB and Redshift. An EMR cluster may be configured as a “long running” cluster or as a transient cluster that auto-terminates once the processing job(s) have completed.
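The difference between a long-running and a transient cluster can be sketched as a single flag in the cluster request. The example below builds the keyword arguments one might pass to boto3's `run_job_flow`; the function name, instance types, instance counts, and the default role names are assumptions chosen for illustration, and the release label and log location are left as parameters because they depend on your account.

```python
def cluster_request(name: str, release_label: str, log_uri: str,
                    long_running: bool = False) -> dict:
    """Build keyword arguments for an EMR cluster with Spark and Presto.

    long_running=False describes a transient cluster: it terminates
    itself once it has no more steps to run.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,      # assumption: caller supplies a valid label
        "LogUri": log_uri,                  # assumption: an S3 location you own
        "Applications": [{"Name": "Spark"}, {"Name": "Presto"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False => the cluster shuts down when all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": long_running,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


transient = cluster_request("nightly-etl", "emr-5.36.0", "s3://example-bucket/logs/")
persistent = cluster_request("adhoc-sql", "emr-5.36.0", "s3://example-bucket/logs/",
                             long_running=True)
```

With boto3 installed and credentials configured, either dictionary could be passed as `boto3.client("emr").run_job_flow(**transient)`; the bucket name and release label shown here are placeholders.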
EMR comes with some disadvantages:
- EMR does not offer support for Presto: users must create their own Presto metastore, configure connectors, and install and configure any tools they need.
- EMR can be complex compared with Redshift: if you have a database requirement, then provisioning EMR, Spark and S3, and ensuring you use the right file formats, networking, roles and security, can take much longer than deploying a packaged MPP database solution like Redshift.
- When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain its data as you can with ordinary EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option for the cluster’s storage. The only workaround is to copy all the data you hold in EMR to S3 before each shutdown, and then ingest it back into EMR on start-up. Users must therefore develop a strategy to manage and preserve their data by writing to Amazon S3, and must manage the cost implications of doing so.
- On its own, a default EMR cluster includes few tools: no analytical tools, BI, visualisation, SQL lab or notebooks, and no HBase or Flume unless you add them. You have to roll your own by doing the tool integrations yourself, and deal with the configuration and debugging effort that entails. That can be a lot of work.
- EMR itself has no single built-in UI for tracking jobs in real time, as Presto, Cloudera, Spark, and most other frameworks do; you rely on each framework's own web interface instead. Similarly, EMR has no built-in workflow scheduler.
- EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process.
- EMR is unable to automatically replace unhealthy nodes.
- The clue is in the name: EMR originally stood for Elastic MapReduce, and its roots are in the MapReduce execution framework, which is designed for large batch processing and not for ad hoc, interactive processing such as analytical queries.
- Cost: EMR is usually more expensive than installing Hadoop yourself on EC2 and running an always-on cluster, because EMR adds its own per-instance charge on top of the underlying EC2 price. Persisting your EMR data in S3 adds to the cost.
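The "write to S3 before shutdown" strategy mentioned in the list above can be sketched as a final cluster step that copies HDFS output to S3 with `s3-dist-cp`, run through EMR's `command-runner.jar`. The function name, step name, HDFS path and bucket below are hypothetical, and the choice of `CANCEL_AND_WAIT` as the failure action is just one reasonable option, not a recommendation from the original text.

```python
def s3_backup_step(hdfs_path: str, s3_uri: str) -> dict:
    """Describe an EMR step that copies an HDFS directory to S3.

    The step shells out to s3-dist-cp via command-runner.jar, so the copy
    runs on the cluster itself before it is terminated.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"destination must be an s3:// URI, got {s3_uri!r}")
    return {
        "Name": "Copy HDFS output to S3",
        # If the copy fails, hold the cluster rather than lose the data.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", hdfs_path, "--dest", s3_uri],
        },
    }


# Hypothetical paths for illustration.
step = s3_backup_step("hdfs:///output/daily", "s3://example-bucket/output/daily/")
```

With boto3, a step like this could be appended to a running cluster using `add_job_flow_steps(JobFlowId=..., Steps=[step])`, scheduled as the last step so that results are in S3 before a transient cluster auto-terminates.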
When comparing Spark and Presto on an EMR cluster, your choice ultimately comes down to the use cases you are trying to address.