Using Spark’s Execution Engine With Presto
Learn when to use Spark as an additional engine alongside open-source Presto, and how you can configure and invoke a Spark job from Presto.
Executing Presto queries on Spark is possible, but why leverage Spark as an execution framework for Presto’s queries when Presto is itself an efficient execution engine? The fact that both are in-memory execution engines tends to add further confusion.
Presto and Spark are different kinds of engines – Presto specializes in analytical query execution, while Spark’s emphasis is on general-purpose computation. The following should help clarify when Presto users would or would not want to use Spark as an additional engine.
A Quick Presto/Spark Comparison
Executing Presto queries on Spark can be useful for some very specific workloads: queries that need to run on thousands of nodes, queries requiring tens or hundreds of terabytes of memory, queries that may consume many CPU-years, or cases where extra execution resilience is needed. So if you have a particularly ugly, complex, and very long-running query, this may be an option.
Spark provides several features that can be desirable for certain workloads, like:
- Resource isolation
- Fine-grained resource management
- Spark’s scalable materialized exchange mechanism
- In terms of memory usage, both are in-memory execution engines, but with one important difference: Spark will write (spill) data to disk when memory is exhausted.
- Fault tolerance in data processing: if a worker node fails and its data partition is lost, that partition will be re-computed.
There are also some downsides, or at least considerations, when using Spark with Presto:
- Spark commits tasks and applies for resources dynamically as it needs them, so at each stage tasks grab their required resources, a strategy that can slow down processing. Presto, on the other hand, allocates its resources up front for the most part.
- The way in which data is processed by the two engines differs: Spark needs data to be fully processed before progressing to the next processing stage. Presto uses a pipeline processing approach where data is sent to the next task as soon as possible which can reduce total execution time.
- Spark SQL supports a subset of the ANSI SQL standard. Many features are missing, and for this reason many developers avoid Spark SQL.
When To Use Spark’s Execution Engine With Presto
It is recommended that you:
- Do not use Spark for interactive queries. This is not what Spark was designed for, so performance will be poor.
- Do use Spark for long-running, SQL-based batch ETL queries, provided Spark SQL supports the features and functions you need. You can run ETL/batch-type queries in Presto itself, but you may choose Spark as the execution engine when, for example, a developer wants to test a query ad hoc in Presto first and later convert it to run on Spark as a production batch pipeline.
Another scenario where Presto and Spark can be used together is when Spark has no connector for the data source you want to access: use Presto and its connectors to read the data, then access Presto from Spark via JDBC.
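As a sketch of that JDBC pattern, the snippet below assembles the options you would pass to Spark’s JDBC data source. The coordinator address, user, and query are hypothetical placeholders, and the presto-jdbc driver jar is assumed to be on Spark’s classpath:

```python
# Hypothetical Presto coordinator, with catalog "hive" and schema "default"
jdbc_url = "jdbc:presto://presto-coordinator.example.com:8080/hive/default"

jdbc_options = {
    "url": jdbc_url,
    # Driver class from the presto-jdbc jar (must be on Spark's classpath)
    "driver": "com.facebook.presto.jdbc.PrestoDriver",
    "user": "etl",
    # Spark pushes this query down to Presto; Presto's connector reads the source
    "query": "SELECT id, amount FROM orders WHERE ds = '2021-01-01'",
}

# On a live Spark cluster you would then run:
#   df = spark.read.format("jdbc").options(**jdbc_options).load()
#   df.show()
```

The table name and filter are illustrative; the point is that Spark only needs the generic JDBC source, while Presto’s connector does the source-specific work.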
Setting up Presto and Spark and submitting jobs is well documented in PrestoDB’s docs at https://prestodb.io/docs/current/installation/spark.html, and it is straightforward to set up. You will need Java and Scala installed, a working Presto cluster, and of course a Spark cluster. Then there are two packages to download and a file to configure. You then execute your queries from the Spark cluster by running spark-submit.
Tip: If you are using a Mac, standalone Spark can be installed very simply with brew install apache-spark. Similarly, Presto can be installed with brew install prestodb.
Verify your Spark cluster by running spark-shell on the command line.
Presto’s config.properties file needs its task.concurrency, task.max-worker-threads, and task.writer-count parameters set to values appropriate for your hardware. Ensure these values also match the corresponding parameters passed to the spark-submit command.
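As an illustrative fragment (the values below are assumptions for a four-cores-per-task setup, not recommendations), config.properties might contain:

```properties
task.concurrency=4
task.max-worker-threads=4
task.writer-count=4
```

With values like these, you would pass matching settings such as --executor-cores 4 and --conf spark.task.cpus=4 to spark-submit, so Presto’s per-task concurrency lines up with the CPUs Spark gives each task.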
Use the spark-submit command to invoke Spark, passing the sql query file as an argument. See https://prestodb.io/docs/current/installation/spark.html for an example spark-submit command.
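The linked page shows the authoritative command; as a rough template (the master URL, core counts, version wildcards, and file paths here are placeholders), the invocation has this shape:

```shell
spark-submit \
  --master yarn \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
    presto-spark-launcher-*.jar \
  --package presto-spark-package-*.tar.gz \
  --config ./config.properties \
  --catalogs ./catalogs \
  --catalog hive \
  --schema default \
  --file query.sql
```

The launcher jar and package tarball are the two downloads mentioned above; --file points at the SQL query to execute.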
This article should help you understand the how, why, and when of executing Presto queries on Spark.