Data analysts and data engineers need simpler ways to access business data stored on Amazon S3. Amazon Athena and S3 Select are two services that allow you to retrieve records on S3 using regular SQL. What are the differences, and when should you use one vs the other?
S3 Select vs Athena: What’s the Difference?
The short answer:
Both services allow you to query S3 using SQL. Athena is a fully-featured query engine that supports complex SQL and works across multiple objects while S3 Select is much more limited, and used to retrieve a subset of data from a single object in S3 using simple SQL expressions.
The long answer:
S3 Select is more appropriate for simple filtering and retrieval of specific subsets of data from S3 objects using basic SQL statements, with reduced data transfer costs and latency. Amazon Athena, on the other hand, is suitable for running complex, ad-hoc queries across multiple paths in Amazon S3, offering more comprehensive SQL capabilities, improved performance, and optimization options. Athena supports more file formats, compression types, and optimizations, while S3 Select is limited to CSV, JSON, and Parquet formats.
An alternative to Amazon Athena is Ahana Cloud, a managed service for Presto that offers up to 10x better price performance.
Here is a detailed comparison between the two services:
- S3 Select operates on a single object in S3, retrieving a specific subset of data using simple SQL expressions.
- Amazon Athena can query across multiple paths, including all files within those paths, making it suitable for more complex queries and aggregations.
- S3 Select supports basic SQL statements for filtering and retrieving data, with limitations on SQL expression length (256 KB) and record length (1 MB).
- Athena offers more comprehensive ANSI SQL compliant querying, including group by, having, window and geo functions, SQL DDL, and DML.
Data Formats and Compression:
- S3 Select works with CSV, JSON, and Parquet formats, supporting GZIP and BZIP2 (only for CSV and JSON) compression.
- Athena supports a wider range of formats, including CSV, JSON, Apache Parquet, Apache ORC, and TSV, with broader compression support.
Integration and Accessibility:
- S3 Select can be used with AWS SDKs, the SELECT Object Content REST API, the AWS CLI, or the Amazon S3 console.
- Athena is integrated with Amazon QuickSight for data visualization and AWS Glue Data Catalog for metadata management. It can be queried directly from the management console or SQL clients via JDBC.
Performance and Optimization:
- S3 Select is a rudimentary query service mainly focused on filtering data, reducing data transfer costs and latency.
- Athena offers various optimization techniques, such as partitioning and columnar storage, which improve performance and cost-efficiency.
- S3 Select queries are ad hoc and don’t require defining a data schema before issuing queries.
- Athena requires defining a data schema before running queries.
- According to the first source provided, the cost of S3 Select depends on three factors: the number of SELECT requests, the data returned, and the data scanned. As of Dec 2020, the cost for region US-EAST(Ohio) with Standard Storage is:
- Amazon S3 Select — $0.0004 per 1000 SELECT requests
- Amazon S3 Select data returned cost — $0.0007 per GB
- Amazon S3 Select data scanned cost — $0.002 per GB
- With Athena, you are charged $5.00 per TB of data scanned, rounded up to the nearest megabyte, with a 10MB minimum per query.
An alternative to AWS Athena and S3 Select is Ahana Cloud, which gives you the ability to run complex queries at better price performance than Athena. Get a demo today.
|Query Scope||Single object (e.g., single flat file)||Multiple objects, entire bucket|
|Use Cases||Ad-hoc data retrieval||Log processing, ad-hoc analysis, interactive queries, joins|
|SQL Capabilities||Basic queries, filtering||Complex, ANSI-compliant SQL queries, aggregations, joins|
|File Formats||CSV, JSON, Parquet||CSV, JSON, Parquet, TSV, ORC, and more|
|Integration||Serverless apps, Big Data frameworks||AWS Glue Data Catalog, ETL capabilities|
|Query Interface||S3 API (e.g., Python boto3 SDK)||Management Console, SQL clients via JDBC|
|Performance Optimization||Limited, basic filtering||Partitioning, columnar storage, and more|
|Schema Definition||Not required||Required|
When should you use Athena, and when should you use S3 select?
You should choose Amazon Athena for complex queries, analysis across multiple S3 paths, and integration with other AWS services, such as AWS Glue Data Catalog and Amazon QuickSight. Opt for S3 Select when you need to perform basic filtering and retrieval of specific subsets of data from a single S3 object.
Example 1: Log Analysis for a Web Application – Use Athena
Imagine you operate a web application, and you want to analyze log data stored in Amazon S3 to gain insights into user behavior and troubleshoot issues. In this scenario, you have multiple log files across different S3 paths, and you need to join and aggregate the data to derive meaningful insights.
In this case, you should use Amazon Athena because it supports complex SQL queries, including joins and aggregations, and can query across multiple paths in S3. With Athena, you can take advantage of its optimization features like partitioning and columnar storage to improve query performance and reduce costs.
Example 2: Filtering Customer Data for a Marketing Campaign – Use S3 Select
Suppose you have a customer data file stored in Amazon S3, and you want to retrieve a subset of records for a targeted marketing campaign. The data file is in JSON format, and you need to filter records based on specific criteria, such as customer location or spending habits.
In this scenario, S3 Select is the better choice, as it is designed for simple filtering and retrieval of specific subsets of data from a single S3 object using basic SQL expressions. Using S3 Select, you can efficiently retrieve the required records, reducing data transfer costs and latency.
Is S3 Select Faster than Athena?
Both S3 Select and Athena are serverless and rely on pooled resources provisioned by Amazon at the time the query is run. Neither is generally faster than the other. However, S3 Select can be faster than Athena for specific use cases, where retrieving a subset of the data is more efficient than processing the entire object. In cases where you only need the capabilities of S3 Select, it can also be easier to run compared to Athena, which requires a table schema to be defined.
Need a better SQL query engine for Amazon S3?
Ahana provides a managed Presto service that lets you run ad-hoc queries, interactive analytics, and BI workloads over your S3 storage. Learn more about Ahana or get a demo.
Sources used in this article:
https://ahana.io/answers/aws-s3-select-limitations/ https://aws.amazon.com/s3/pricing/ https://aws.amazon.com/athena/pricing/ https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html https://aws.amazon.com/blogs/storage/querying-data-without-servers-or-databases-using-amazon-s3-select/ https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 https://stackoverflow.com/questions/49102577/what-is-difference-between-aws-s3-select-and-aws-athena https://repost.aws/questions/QU1_wCZSxES6-QHh7QBTDYDA/s-3-select-vs-athena