AWS Athena vs AWS Glue: What Are The Differences?
Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?
What is AWS Athena?
What is AWS Glue?
AWS Glue is also serverless, but more of an ecosystem of tools to allow you to easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or via editing the script. The most commonly known components of Glue are Glue Metastore and Glue ETL. Glue Metastore is a serverless hive compatible metastore which can be used in lieu of your own managed Hive. Glue ETL on the other hand is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability and operationalization of the underlying Spark infrastructure. There are other services such as Glue Data Wrangler which we will keep outside the scope of this discussion.
AWS Athena vs AWS Glue
Where this turns from AWS Glue vs AWS Athena to AWS Glue working with Athena is with the Glue Catalog. The Glue catalog is used as a central hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake formation, AI/ML etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics and Glue is more of a transformation and data movement tool.
Some examples of how Glue and Athena can work together would be:
CREATE EXTERNAL TABLE sampleTable ( col1 INT, col2 INT, str1 STRING, ) STORED AS AVRO TBLPROPERTIES ( 'classification'='avro')
- Creating tables for Glue to use in ETL jobs. The table must have a property added to them called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena would be:
- Transforming data into a format that is better optimized for query performance in Athena, which will also impact cost as well. So, converting a CSV or JSON file into Parquet for example.
Query S3 Using Athena & Glue
Now how about querying S3 data utilizing both Athena and Glue? There are a few steps to set it up, first, we’ll assume a simple CSV file with IoT data in it, such as:
We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’, the query looks like this:
SELECT * FROM my_table WHERE att2 = 'Z';
From here, you can perform any query you want, you can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. You are insulated from the details of the backend as a data analyst using Athena, while the data engineers can optimize the source data for speed and cost using Glue.
AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and minimal effort. Athena however quickly runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows.
At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned cost. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/