What is Spark SQL?

Spark is a general-purpose computation engine for large-scale data processing. At Spark’s inception, the primary abstraction was the resilient distributed dataset (RDD), an immutable distributed collection of data. Since then, higher-level abstractions—called DataFrames and Datasets—that more closely resemble classic database tables have been introduced for working with structured data. Spark SQL is the Spark module for working with these abstractions and structured data.

In addition to DataFrames and Datasets, Spark SQL exposes a SQL interface for interacting with data stores and with DataFrames/Datasets. For example, suppose we wanted to return all records from a table called people with the basic SQL query SELECT * FROM people. With Spark SQL, we could express this programmatically in Python as follows:

people_dataframe = spark.sql("SELECT * FROM people")

Here, spark is an instance of the SparkSession class, the main Spark entry point for structured data abstractions, and the statement returns a Spark DataFrame. With the SQL API, you can express SQL queries and get DataFrames back; from DataFrames, you can register tables (or temporary views) on which you can then execute SQL queries. Because SQL is so widely known, this opens Spark-based data processing to a broader range of data practitioners, such as SQL analysts.

>> Learn more about using Spark’s execution engine with Presto.

Since Spark 3.0, Spark SQL has offered an experimental option to be strictly ANSI compliant instead of Hive compliant. Before that, Spark SQL supported both ANSI SQL and HiveQL. Please consult the official Apache Spark SQL Reference if you are interested in the specifics of supported syntax, semantics, and keywords.

Regardless of whether you express data processing directly with DataFrame/Dataset methods or with SQL, Spark SQL runs the same execution engine under the hood. Further, the structured nature of the data and the processing gives Spark additional context about the data itself—such as the column types—and about the workload. This additional context enables further optimization, often resulting in better performance.

While Spark SQL is a general-purpose engine, you might want to consider Presto if your target use cases are predominantly interactive, low-latency queries on structured data. We compare Spark SQL and Presto in this short article.