Querying Amazon S3 Data Using AWS Athena
The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. In this article we’ll look more closely at AWS S3 and AWS Athena.
How does AWS Athena work with Amazon S3?
In AWS land, AWS S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.
The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.
How to query S3 using AWS Athena
The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for testing purposes a new bucket is probably helpful). This is the bucket you’ll query with Athena. Next, open up your AWS Management Console and go to the Athena home page. From there you have a few options for creating a table. For this example, just select the “Create table from S3 bucket data” option.
From there, AWS has made it fairly easy to get up and running with a quick four-step process. You’ll define the database, the table name, and the S3 folder where the data for this table will come from. You’ll then select the data format, define your columns, and set up your partitions (useful if you have a lot of data). Briefly laid out:
- Set up your Database, Table, and Folder Names & Locations
- Choose the data format you’ll be querying
- Define your columns so Athena understands your data schema
- Set up your Data Partitions if needed
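The four steps above map roughly onto the parts of a Hive-style CREATE EXTERNAL TABLE statement, which is what Athena uses to define a table over S3 data. Here is a minimal Python sketch that assembles such a statement as a string. The helper name build_create_table and the database, table, column, and bucket names are all hypothetical, and the sketch only covers the Parquet and ORC formats, so treat it as an illustration rather than what the console wizard actually generates.

```python
def build_create_table(database, table, columns, location,
                       data_format="PARQUET", partitions=None):
    """Build Hive-style DDL for an external table over data in S3.

    columns and partitions are lists of (name, type) pairs.
    Hypothetical helper for illustration only.
    """
    # This sketch only handles formats that need no ROW FORMAT clause.
    if data_format not in ("PARQUET", "ORC"):
        raise ValueError(f"unsupported format in this sketch: {data_format}")

    # Steps 1 and 3: database and table names, then the column schema.
    cols = ",\n".join(f"  `{name}` {type_}" for name, type_ in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n{cols}\n)\n"

    # Step 4: optional partition columns.
    if partitions:
        parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
        ddl += f"PARTITIONED BY ({parts})\n"

    # Step 2 and the rest of step 1: data format and S3 folder.
    ddl += f"STORED AS {data_format}\nLOCATION '{location}'"
    return ddl


# Example with made-up names; the bucket and folder are placeholders.
example_ddl = build_create_table(
    "demo_db",
    "events",
    [("user_id", "string"), ("amount", "double")],
    "s3://example-bucket/events/",
    partitions=[("dt", "string")],
)
print(example_ddl)
```

The printed statement could be pasted into the Athena query editor as an alternative to the wizard, assuming the folder really contains Parquet files that match the declared columns.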
Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.
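If you’d rather run those select statements from code than from the console, one option is the Athena client in boto3, the AWS SDK for Python. The sketch below is hedged: run_query is a hypothetical helper, and the database, table, and results bucket in the usage note are placeholders. It submits a query, polls until Athena reports a final state, and returns the first page of results as plain lists of strings.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL statement on Athena and return the first page of rows.

    client is expected to behave like boto3.client("athena").
    Hypothetical helper for illustration only.
    """
    # Submit the query; Athena writes result files to output_location in S3.
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Queries run asynchronously, so poll until a terminal state is reached.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {state}: {reason}")

    # Only the first page is fetched here; for SELECT statements the
    # first row holds the column names.
    result = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With boto3 installed and AWS credentials configured, a call might look like run_query(boto3.client("athena"), "SELECT * FROM events LIMIT 10", "demo_db", "s3://example-bucket/athena-results/"). Larger result sets would need pagination, which this sketch leaves out.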
What happens when AWS Athena hits its limits
While Athena is very easy to get up and running, it has known limitations that start impacting price performance as usage grows. These include query limits, partition limits, a lack of deterministic performance, and some others. That’s why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS.
Here’s a quick comparison between the two offerings:
Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.
We can help you benchmark Athena against Ahana Cloud. Get in touch with us today and let’s set up a call.