Hive vs Presto vs Spark

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive provides an SQL-like interface called HiveQL to query large dataset stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Commonalities

  • All three projects are community-driven open-source software released under the Apache License.
  • They are distributed “Big Data” software frameworks
  • BI tools connect to them using JDBC/ODBC
  • They provide query capabilities on top of Hadoop and AWS S3
  • They have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud.

Differences

HivePrestoSpark
FunctionMPP SQL engineMPP SQL engineGeneral purpose execution framework
Processing TypeBatch processing using Apache Tez or MapReduce compute frameworksExecutes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/OOptimized directed acyclic graph (DAG) execution engine and actively caches data in-memory
SQL SupportHiveQLANSI SQLSpark SQL
UsageOptimized for query throughputOptimized for latencyGeneral purpose, often used for data transformation and Machine Learning workloads
Use casesLarge data aggregationsInteractive queries and quick data exploration.General purpose, often used for data transformation and Machine Learning workloads.

Conclusion

It totally depends on your requirement to choose the appropriate SQL engine but if the Presto engine is what you are looking for, we suggest you give a try to Ahana Cloud for Presto.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. Basically we’ve made it really easy to harness the power of Presto without having to worry about the thousands of tuning and config parameters, adding data sources, etc.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.