Hive vs Presto vs Spark for Data Analysis


Apache Hive, Apache Spark, and Presto are all popular open-source tools for working with data lakes and data lakehouses. However, these tools typically serve different functions. Some of those functions overlap, but the differences are large enough that the tools are usually complementary rather than competitive. Let’s compare Presto, Hive, and Spark and see how each of these tools can be used for large-scale data analysis.

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis. Hive provides an SQL-like interface called HiveQL to query large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.
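As a rough illustration of querying multiple data sources within a single query, the sketch below builds a Presto SQL statement that joins a table in a Hive catalog with a table in a MySQL catalog. Presto addresses tables as catalog.schema.table; the catalog, schema, table, and column names used here are hypothetical, and the helper only builds the SQL text rather than submitting it to a cluster.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    # Hypothetical names: a 'hive' catalog over the data lake, a 'mysql' catalog.
    orders = qualified("hive", "sales", "orders")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.region, SUM(o.amount) AS total_amount\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The resulting text could be pasted into any Presto client; which catalogs exist depends entirely on how the cluster’s connectors are configured.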

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto vs Hive vs Spark: The Comparison

Commonalities

  • All three projects are community-driven open-source software released under the Apache License.
  • They are distributed “Big Data” software frameworks.
  • BI tools connect to them using JDBC/ODBC.
  • They provide query capabilities on top of Hadoop and AWS S3.
  • They have been tested and deployed at petabyte-scale companies.
  • They can be run on-prem or in the cloud.

Differences

| | Hive | Presto | Spark |
|---|---|---|---|
| Function | MPP SQL engine | MPP SQL engine | General purpose execution framework |
| Processing Type | Batch processing using Apache Tez or MapReduce compute frameworks | Executes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/O | Optimized directed acyclic graph (DAG) execution engine and actively caches data in-memory |
| SQL Support | HiveQL | ANSI SQL | Spark SQL |
| Usage | Optimized for query throughput | Optimized for latency | General purpose, often used for data transformation and Machine Learning workloads |
| Use cases | Large data aggregations | Interactive queries and quick data exploration | General purpose, often used for data transformation and Machine Learning workloads |
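To make the “Processing Type” row more concrete, here is a toy sketch in plain Python. It is not how Hive or Presto are implemented: the lists stand in for intermediate results that a batch framework materializes between stages, and the generators stand in for rows streamed from one stage to the next. All names and sample rows are made up for illustration.

```python
from typing import Iterable, Iterator, List

ROWS = [
    {"user": "a", "bytes": 10},
    {"user": "b", "bytes": 0},
    {"user": "a", "bytes": 5},
]


def staged_total(rows: List[dict]) -> int:
    """Batch style: every stage finishes and stores its full output first."""
    filtered = [r for r in rows if r["bytes"] > 0]   # stage 1 output, fully built
    projected = [r["bytes"] for r in filtered]       # stage 2 reads stage 1 output
    return sum(projected)


def scan(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows one at a time, like a table scan feeding later stages."""
    for row in rows:
        yield row


def keep_nonzero(rows: Iterable[dict]) -> Iterator[dict]:
    """Filter stage: pass matching rows downstream as soon as they arrive."""
    for row in rows:
        if row["bytes"] > 0:
            yield row


def project_bytes(rows: Iterable[dict]) -> Iterator[int]:
    """Projection stage: emit only the column the aggregate needs."""
    for row in rows:
        yield row["bytes"]


def pipelined_total(rows: Iterable[dict]) -> int:
    """Pipelined style: rows flow through all stages without intermediate lists."""
    return sum(project_bytes(keep_nonzero(scan(rows))))


if __name__ == "__main__":
    print(staged_total(ROWS), pipelined_total(ROWS))
```

Both functions return the same answer; the difference is only in whether intermediate results are held (or written out) in full between stages, which is the trade-off the table describes between throughput-oriented batch processing and latency-oriented pipelining.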

Hive vs Presto

Both Presto and Hive are used to query data in distributed storage, but Presto is more focused on analytical querying whereas Hive is mostly used to facilitate data access. Hive provides a virtual data warehouse that imposes structure on semi-structured datasets, which can then be queried using Spark, MapReduce, or Presto itself. Presto is a compute and querying layer that can connect to the Hive Metastore or to other catalogs and table formats such as Apache Iceberg.

Conclusion

Which SQL engine is appropriate depends on your requirements. If Presto is the engine you are looking for, we suggest you try Ahana Cloud for Presto.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto. It makes it simpler for cloud and data platform teams of all sizes to provide self-service SQL analytics to their data analysts and scientists. We’ve made it easy to harness the power of Presto without having to worry about the thousands of tuning and configuration parameters, adding data sources, and so on.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.

Ahana Cloud for Presto Versus Amazon EMR

In this brief post, we’ll discuss some of the benefits of Ahana Cloud over Amazon Elastic MapReduce (EMR). EMR offers a choice of several big data compute frameworks, but that flexibility comes with an operational and configuration burden. When it comes to low-latency interactive querying on big data that just works, Ahana Cloud for Presto offers a much lower operational burden and Presto-specific optimizations.

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases. In fact, you can run federated queries across your data sources. Developed at Facebook, Presto is supported by the Presto Foundation, an independent nonprofit organization under the auspices of the Linux Foundation. Presto is used by leading technology companies, such as Facebook, Twitter, Uber, and Netflix.

Amazon EMR is a big data platform hosted in AWS. EMR allows you to provision a cluster with one or more big data technologies, such as Hadoop, Apache Spark, Apache Hive, and Presto. Ahana Cloud for Presto is the easiest cloud-native managed service for Presto, empowering data teams of all sizes. As a focused Presto solution, here are a few of Ahana Cloud’s benefits over Amazon EMR:

Less configuration. Born of the Hadoop era, Presto has many configuration parameters, spread across several files, that must be set and tuned correctly. With EMR, you have to configure these yourself. With Ahana Cloud, we tune more than 200 parameters out of the box, so when you spin up a cluster, you get excellent query performance from the get-go. Ahana Cloud also provides an Apache Superset sandbox out of the box, so administrators can validate connecting to, querying, and visualizing your data.

Easy-to-modify configuration. Ahana Cloud offers the ability to not only spin up and terminate clusters, but also stop and restart them, which lets you change the number of Presto workers and add or remove data sources. With EMR, any manual changes to the number of Presto workers and data sources require a new cluster or manually restarting the services yourself. Further, adding and removing data sources is done through a convenient user interface instead of modifying low-level configuration files.


Optimizations. As a Presto managed service, Ahana Cloud will continually provide optimizations relevant to Presto. For example, Ahana recently released data lake I/O caching. Based on the RubiX open source project and enabled with a single click, the caching eliminates redundant reads from your data lake if the same data is read over and over. This caching results in up to 5x query performance improvement and up to 85% latency reductions for concurrent workloads. Finally, idle clusters processing no queries can automatically scale down to a single Presto worker to preserve costs while allowing for a quick warm up.
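The idea behind eliminating redundant reads can be sketched with a small read-through cache. This is a generic least-recently-used cache written for illustration only; it is not RubiX’s or Ahana’s implementation, and `fake_data_lake_read` is a hypothetical stand-in for a slow read from object storage.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny LRU cache placed in front of a slow reader such as object storage."""

    def __init__(self, reader: Callable[[str], bytes], capacity: int = 2) -> None:
        self._reader = reader
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.remote_reads = 0  # how many times the slow reader was actually called

    def read(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.remote_reads += 1
        data = self._reader(key)
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


def fake_data_lake_read(key: str) -> bytes:
    # Hypothetical stand-in for fetching a file from the data lake.
    return f"contents of {key}".encode()


cache = ReadThroughCache(fake_data_lake_read)
for path in ["part-0.parquet", "part-0.parquet", "part-1.parquet", "part-0.parquet"]:
    cache.read(path)
print(cache.remote_reads)  # 2 remote reads for 4 requests
```

Repeated requests for the same file are served locally, so only the first request for each file reaches the data lake. The actual speed-up in a real deployment depends on how often the workload re-reads the same data.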


If you are experienced at tuning Presto and want full control of the infrastructure management, Amazon EMR may be the choice for you. If simplicity and accelerated go-to-market without needing to manage a complex infrastructure are what you seek, then Ahana Cloud for Presto is the way to go. Sign up for our free trial today.

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. An EMR cluster with Spark is very different from an EMR cluster with Presto:

  • EMR is a big data framework that allows you to automate provisioning, tuning, etc. for big data workloads. Presto is a distributed SQL query engine, also called a federation middle tier. Using EMR, users can spin up, scale and deploy Presto clusters. Presto can connect to many different data sources; common integrations include Elasticsearch, HBase, and AWS S3, among many others.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark core does not support SQL – for SQL support you use the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive or ad hoc queries, nor for federating data from multiple sources; for these, Presto is a better choice.

There are some similarities: EMR clusters, Spark, and Presto all share distributed, parallel architectures and are designed for dealing with big data. PrestoDB is also included in Amazon EMR release version 5.0.0 and later.

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications. EMR offers easy provisioning, auto-scaling (including for Presto clusters), fault tolerance, and, as you’d expect, good integration with the rest of the AWS ecosystem, such as S3, DynamoDB and Redshift. An EMR cluster may be configured as a “long running” cluster or as a transient cluster that auto-terminates once the processing jobs have completed.
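The difference between a long-running and a transient cluster can be sketched as a request for the EMR `run_job_flow` API as exposed by boto3. The function below only builds the keyword arguments and makes no AWS call. The cluster name, release label, instance types, and instance count are assumptions chosen for illustration, and the role names are the EMR defaults, which may differ in your account.

```python
from typing import Any, Dict, List


def emr_cluster_request(name: str, applications: List[str], transient: bool) -> Dict[str, Any]:
    """Build keyword arguments shaped like a boto3 EMR run_job_flow request."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: substitute the release you use
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "SlaveInstanceType": "m5.xlarge",   # assumption
            "InstanceCount": 3,                 # assumption
            # False lets the cluster terminate once its steps finish (transient);
            # True keeps it alive waiting for more work (long running).
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; may differ
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = emr_cluster_request("nightly-spark-etl", ["Spark"], transient=True)
    print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With credentials configured, a dictionary like this would typically be passed as `boto3.client("emr").run_job_flow(**request)`, usually together with a `Steps` list describing the jobs a transient cluster should run before it terminates.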

EMR comes with some disadvantages:

  • EMR does not offer support for Presto – users must create their own Presto metastore, configure connectors, and install and configure any tools they need.
  • EMR can be complex compared with a packaged database such as Redshift – if you have a database requirement, then provisioning EMR, Spark and S3 and ensuring you use the right file formats, networking, roles and security can take much longer than deploying a packaged MPP database solution like Redshift.
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and cannot be recovered. This means you can’t stop an EMR cluster and retain data like you can with EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes). The only workaround is to copy all your data from EMR to S3 before each shutdown, and then ingest it all back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3, and manage the cost implications.
  • On its own EMR doesn’t include any tools – no analytical tools, BI, visualisation, SQL Lab or notebooks. No HBase or Flume. Not even an HDFS access CLI. So you have to roll your own by doing the tool integrations yourself and dealing with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time like you can with Presto, Cloudera, Spark, and most other frameworks. Similarly EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR – it uses the MapReduce execution framework which is designed for large batch processing and not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2, installing Hadoop and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster with Spark vs Presto technologies your choice ultimately boils down to the use cases you are trying to solve. 

Spark SQL vs Presto

In this article we lay out how Spark SQL and Presto compare. The two have a good deal in common, but there are some differences to be aware of:

Commonality: 

  • They are both open source, “big data” software frameworks
  • They are distributed, parallel, and in-memory
  • BI tools connect to them using JDBC/ODBC
  • Both have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud. They can be containerized

Differences:

  • Presto is an ANSI SQL:2003 query engine for accessing and unifying data from many different data sources. It’s deployed as a middle layer for federation.
  • Spark is a general-purpose cluster-computing framework. Core Spark does not support SQL – for SQL support you use the Spark SQL module, which adds structured data processing capabilities. Spark SQL is also ANSI SQL:2003 compliant (since Spark 2.0).
  • Presto is more commonly used to support interactive SQL queries. Queries are usually analytical, but Presto can also perform SQL-based ETL.
  • Spark is more general in its applications, often used for data transformation and Machine Learning workloads.
  • Presto supports querying data in object stores like S3 by default, and has many connectors available. It also works really well with Parquet and ORC format data.
  • Spark must use Hadoop file APIs to access S3 (or pay for Databricks features). Spark has limited connectors for data sources.

Many users today are learning about both Presto and Spark. The points above lay out many of the differences between Presto and Spark SQL and how the two can be compared.

If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.

You can see our previous guide to compare the Spark execution engine vs Presto.


Spark Streaming Alternatives

The right Spark alternative depends on your use case. Are you processing streaming data or batch data? Do you prefer an open source or a closed-source/proprietary alternative? Do you need SQL support?


With that in mind let’s look at ten closed-source alternatives to Spark Streaming first:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice. Also 
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – a streaming data integration and operational intelligence platform designed to enable continuous query and processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines. Features graphical stream visualizations.
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

Here are five open source alternatives to Spark Streaming:

  • Apache Flink
  • Apache Apex
  • Apache Beam
  • Apache Samza
  • Apache Storm

Details about each alternative:

  1. Apache Flink – considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for both stream and batch processing at scale. It provides a fault-tolerant, operator-based model for streaming computation rather than the micro-batch model of Apache Spark.
  2. Apache Beam – a unified programming model for defining batch and streaming data processing pipelines that can run on multiple execution engines.
  3. Apache Apex – an enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – a distributed stream processing framework.
  5. Apache Storm – a distributed real-time computation system.
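The contrast between an operator-based streaming model and a micro-batch model, mentioned for Apache Flink above, can be sketched in plain Python. This is a simplification that uses neither Flink nor Spark: real micro-batches are usually cut by time interval rather than by event count, and the function names and sample events are made up for illustration.

```python
from typing import Iterable, Iterator, List


def per_event_counts(events: Iterable[str]) -> Iterator[int]:
    """Operator style: update state and emit a result for every single event."""
    count = 0
    for _ in events:
        count += 1
        yield count


def micro_batch_counts(events: Iterable[str], batch_size: int) -> Iterator[int]:
    """Micro-batch style: buffer events and emit one result per batch."""
    count = 0
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            count += len(batch)
            yield count
            batch = []
    if batch:  # flush the final, partially filled batch
        count += len(batch)
        yield count


clicks = ["c1", "c2", "c3", "c4", "c5"]
print(list(per_event_counts(clicks)))                 # [1, 2, 3, 4, 5]
print(list(micro_batch_counts(clicks, batch_size=2)))  # [2, 4, 5]
```

Both approaches reach the same final count. The per-event version produces an updated result as soon as each event arrives, while the micro-batch version trades some latency for processing events in groups.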

So there you have it. Hopefully you can now find a suitable alternative to Spark Streaming. Learn more about Spark SQL vs Presto in our comparison article, or learn about invoking the Spark engine from Presto.