Spark SQL vs Presto
When comparing Spark SQL and Presto, there are both similarities and differences to be aware of:
- They are both open source, “big data” software frameworks
- They are distributed, parallel, and in-memory
- BI tools connect to them using JDBC/ODBC
- Both have been tested and deployed at petabyte-scale companies
- Both can run on-premises or in the cloud, and both can be containerized
- Presto is an ANSI SQL:2003 query engine for accessing and unifying data from many different data sources. It is typically deployed as a middle layer that federates queries across those sources.
- Spark is a general-purpose cluster-computing framework. Core Spark does not support SQL; SQL support comes from the Spark SQL module, which adds structured data processing capabilities. Spark SQL has also supported the ANSI SQL:2003 standard since Spark 2.0.
- Presto is more commonly used to support interactive SQL queries. Queries are usually analytical, but Presto can also perform SQL-based ETL.
- Spark is more general in its applications, often used for data transformation and Machine Learning workloads.
- Presto supports querying data in object stores like S3 by default, and has many connectors available. It also works well with data in the Parquet and ORC formats.
- Spark must go through the Hadoop file system APIs (for example, the S3A connector) to access S3, or you pay for Databricks features. Spark also has a more limited set of connectors for data sources.
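To make the federation point concrete, here is a minimal sketch of what a federated Presto query looks like. Presto addresses tables as catalog.schema.table, so a single SQL statement can join data from different sources. The catalog, schema, table, and column names below (hive.sales.orders, mysql.crm.customers, and so on) are hypothetical and only for illustration. The Python code just builds the SQL text; it does not connect to a cluster.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Compose one SQL statement that joins tables from two different catalogs."""
    # Hypothetical names: a Hive catalog (e.g. Parquet files in S3)
    # and a MySQL catalog. Neither comes from the article itself.
    orders = qualified_name("hive", "sales", "orders")
    customers = qualified_name("mysql", "crm", "customers")
    return (
        "SELECT c.region, SUM(o.amount) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region"
    )


if __name__ == "__main__":
    print(federated_join_sql())
```

In a real deployment, each catalog name would map to a connector configured on the Presto cluster, and the statement would be submitted through a client such as a JDBC/ODBC driver or the Presto CLI.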
If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.
After that, see this tutorial on how to manage individual Presto clusters.