AWS Athena vs AWS Glue: What Are The Differences?

If you’re looking to improve your cloud architecture, you need to check out Ahana. Rather than struggling with the limitations of other interactive query tools, you can build your analytics foundation on the same open-source SQL engine that powers petabyte-scale queries at Meta and Uber – and use Ahana to hit the ground running with a managed platform. Find out why companies choose Ahana over Athena by scheduling a call with an Ahana solution architect.

AWS has over 200 products and services, which can make it hard to understand what each one does and how they relate. Here, we are going to talk about AWS Athena vs AWS Glue, an interesting pairing because the two are both complementary and competitive. So, what are they exactly?

| Features | AWS Athena | AWS Glue |
| --- | --- | --- |
| Implementation | Serverless implementation of Presto | Ecosystem of tools for schema discovery and ETL |
| Querying | Primarily used as a query tool for analytics | More of a transformation and data movement tool |
| Components | Only includes Athena | Includes Glue Metastore and Glue ETL |
| Metadata | Uses AWS Glue Catalog as a metadata catalog | Serves as a central Hive-compatible metadata catalog |
| Data Format | Supports structured and unstructured data | Supports CSV, Parquet, ORC, Avro, or JSON |
| Pricing | Costs $5 per terabyte scanned | Billed per DPU-hour for ETL jobs and crawlers |

What is AWS Athena?

AWS Athena is a serverless implementation of Presto, the open-source distributed SQL query engine. It is an interactive query service that allows you to query structured or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but it is more of an ecosystem of tools that lets you do schema discovery and ETL with auto-generated scripts, which you can modify either visually or by editing the script. The best-known components of Glue are Glue Metastore and Glue ETL. Glue Metastore is a serverless, Hive-compatible metastore that can be used in lieu of a Hive metastore you manage yourself. Glue ETL, on the other hand, is a Spark service that allows customers to run Spark jobs without worrying about configuring, managing, and operating the underlying Spark infrastructure. There are other services, such as Glue DataBrew, which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena into AWS Glue working with Athena is the Glue Catalog. The Glue Catalog is used as a central Hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services such as Glue ETL, Athena, EMR, Lake Formation, and AI/ML services. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics, while Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. Each table must have a property called classification, which identifies the format of the data. The classification value can be csv, parquet, orc, avro, or json. An excerpt from an example CREATE TABLE statement in Athena would be:

      col1 INT,
      col2 INT,
      str1 STRING,

  • Transforming data into a format that is better optimized for query performance in Athena, which also affects cost. For example, converting a CSV or JSON file into Parquet.
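To make the first example above more concrete, here is a minimal Python sketch that renders a complete CREATE EXTERNAL TABLE statement carrying the classification property. The helper name `create_table_ddl`, the table name `sample_table`, and the S3 location are all hypothetical and not taken from the original example; only the three column definitions come from the excerpt. The sketch covers only parquet and orc, since csv and json tables need additional ROW FORMAT clauses that are left out here.

```python
# Hypothetical helper: renders an Athena CREATE EXTERNAL TABLE statement whose
# 'classification' table property tells Glue the format of the data.

# Only formats with a simple STORED AS clause are covered in this sketch;
# csv and json need extra ROW FORMAT clauses that are omitted here.
STORED_AS = {"parquet": "PARQUET", "orc": "ORC"}


def create_table_ddl(table, columns, classification, location):
    """Return DDL for an external table with a Glue classification property."""
    if classification not in STORED_AS:
        raise ValueError(f"unsupported classification in this sketch: {classification}")
    column_lines = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS {STORED_AS[classification]}\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('classification' = '{classification}')"
    )


ddl = create_table_ddl(
    table="sample_table",  # hypothetical table name
    columns=[("col1", "INT"), ("col2", "INT"), ("str1", "STRING")],
    classification="parquet",
    location="s3://example-bucket/sample-data/",  # hypothetical bucket
)
print(ddl)
```

The printed statement can be pasted into the Athena query editor after swapping in a real table name and S3 location.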

Query S3 Using Athena & Glue

Now how about querying S3 data using both Athena and Glue? There are a few steps to set it up. First, we’ll assume a simple CSV file with IoT data in it, such as:

[Figure: sample CSV table of IoT data]

We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’, the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';
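The upload, crawl, and query steps described above could be scripted with boto3 roughly as follows. This is a sketch under several assumptions: the crawler already exists and points at the bucket, and the bucket, key, crawler name, database name, and results location are placeholders you supply. The function names `athena_request` and `crawl_and_query` are hypothetical. Running `crawl_and_query` requires boto3, AWS credentials, and the appropriate permissions.

```python
import time

# The query from the text, without the trailing semicolon.
QUERY = "SELECT * FROM my_table WHERE att2 = 'Z'"


def athena_request(database, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": QUERY,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def crawl_and_query(csv_path, bucket, key, crawler, database, output_location):
    """Upload the CSV, run the crawler, then run the query and return its rows."""
    import boto3  # third-party; imported here so athena_request works without it

    s3 = boto3.client("s3")
    glue = boto3.client("glue")
    athena = boto3.client("athena")

    # 1. Upload the source file to S3.
    s3.upload_file(csv_path, bucket, key)

    # 2. Start the crawler (assumed to already exist) and wait until it is
    #    idle again, at which point the table should be in the Glue catalog.
    glue.start_crawler(Name=crawler)
    while glue.get_crawler(Name=crawler)["Crawler"]["State"] != "READY":
        time.sleep(10)

    # 3. Run the query and poll until it reaches a terminal state.
    query_id = athena.start_query_execution(
        **athena_request(database, output_location)
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is fetched in this sketch.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

Keeping the request builder separate from the AWS calls makes the query parameters easy to inspect or unit test without touching any AWS resources.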

From here, you can perform any query you want. You can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. As a data analyst using Athena, you are insulated from the details of the backend, while the data engineers can optimize the source data for speed and cost using Glue.

AWS Athena is a great place to start if you are new to the cloud and want to test the waters at low cost and with minimal effort. However, Athena quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. Because it is priced per terabyte scanned, costs also rise in step with the volume of data your queries scan.
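As a rough illustration of how scan volume drives the bill, the sketch below applies the $5 per terabyte scanned figure quoted in this article. It assumes a decimal terabyte (10^12 bytes) and ignores any rounding or per-query minimums; the function name `athena_scan_cost` and the example workload sizes are hypothetical.

```python
PRICE_PER_TB_USD = 5.0  # figure quoted in this article
BYTES_PER_TB = 10**12   # assumption: decimal terabyte


def athena_scan_cost(bytes_scanned, queries=1):
    """Estimated USD cost of `queries` queries that each scan `bytes_scanned` bytes."""
    return PRICE_PER_TB_USD * bytes_scanned * queries / BYTES_PER_TB


# Hypothetical workload: a query that scans 200 GB, run once and then 100 times.
one_query = athena_scan_cost(200 * 10**9)
hundred_queries = athena_scan_cost(200 * 10**9, queries=100)
print(f"{one_query:.2f} USD, {hundred_queries:.2f} USD")
```

Since the cost is linear in bytes scanned, anything that reduces the scanned volume, such as converting CSV to Parquet with Glue, lowers the estimate proportionally.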

At Ahana, many of our customers are former Athena users who ran into challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.