Querying Parquet Files Using Amazon Athena
Parquet is a columnar file format with many advantages over more commonly used formats like CSV and JSON. In particular, its speed and efficiency at storing large volumes of data in columnar form have made it widely adopted. It supports many optimizations and stores metadata about its internal contents to enable fast lookups and scans by modern distributed query/compute engines like PrestoDB, Spark, and Drill. Here are the steps to quickly get set up to query your Parquet files with a service like Amazon Athena.
Prerequisites
- Sample Parquet data: https://ahana-labs.s3.amazonaws.com/movielens/ratings/ratings.parquet
- An AWS account and role with access to the following services:
  - Amazon S3
  - AWS Glue (optional but highly recommended)
  - Amazon Athena
Setting up the Storage
For this example we will query the Parquet file from Amazon S3. To do this, we must first upload the sample data to an S3 bucket.
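Before uploading, you can sanity-check that the download really is a Parquet file: every Parquet file begins and ends with the 4-byte magic string "PAR1". Here is a minimal shell helper (the helper name is my own, not an AWS tool):

```shell
# A Parquet file starts and ends with the 4-byte magic "PAR1";
# this is a cheap container check, not a full schema validation.
is_parquet() {
  [ "$(head -c 4 "$1")" = "PAR1" ] && [ "$(tail -c 4 "$1")" = "PAR1" ]
}
```

For example, `is_parquet ratings.parquet && echo "looks like Parquet"` should print the message for the sample file.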
Log in to your AWS account and select the S3 service in the Amazon Console.
- Click on Create Bucket
- Choose a unique name. For this example I chose ‘athena-parquet-<your-initials>’. S3 bucket names must be globally unique, so include an identifier such as your initials to avoid colliding with an existing bucket.
- Scroll to the bottom and click Create Bucket
- Click on your newly created bucket
- Create a folder in the S3 bucket called ‘test-data’
- Click on the newly created folder
- Choose Upload and upload your Parquet file(s).
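If you prefer to script this setup, the same steps can be done with the AWS CLI. This is a sketch, assuming your credentials are already configured and using the placeholder bucket name from above:

```shell
# Create the bucket (name must be globally unique; replace <your-initials>)
aws s3 mb s3://athena-parquet-<your-initials>
# Upload the sample file into the 'test-data' folder
aws s3 cp ratings.parquet s3://athena-parquet-<your-initials>/test-data/
# Verify the upload
aws s3 ls s3://athena-parquet-<your-initials>/test-data/
```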
Running a Glue Crawler
Now that the data is in S3, we need to define the metadata for the file. Doing this by hand can be tedious: you would have to open the Parquet file in a separate reader to work out its column names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata automatically. To do this, first navigate to the AWS Glue service in the AWS Console.
- On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
- Click Add Crawler
- Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
- In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
- In Data Store under Include Path, enter the S3 path of your data. It should look like ‘s3://athena-parquet-<your-initials>/test-data/’.
- In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
- For Frequency leave the setting as default and choose Next
- For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
- Review and then choose Finish.
- AWS will ask whether you would like to run the crawler now. Choose Run it now, or run it manually later by refreshing the page, selecting the crawler, and choosing the Run action.
- Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.
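The crawler can also be run and monitored from the AWS CLI. This is a sketch, assuming the crawler and database names chosen above and configured credentials:

```shell
# Kick off the crawler created in the console
aws glue start-crawler --name athena-parquet-crawler
# Poll its state; wait until it returns to READY
aws glue get-crawler --name athena-parquet-crawler --query 'Crawler.State'
# Confirm the crawler registered a table for the uploaded file
aws glue get-tables --database-name athena-parquet --query 'TableList[].Name'
```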
Querying the Parquet file from AWS Athena
Now that the data and the metadata are in place, we can use Amazon Athena to query the Parquet file. Choose the Athena service in the AWS Console.
- Choose Explore the Query Editor, which takes you to the query editor UI.
- Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to ‘s3://athena-parquet-<your-initials>/test-results/’.
- Go back to the Editor and run the following statement: ‘SELECT * FROM test_data LIMIT 10;’. The table name is based on the name of the folder you created in the S3 storage step, with hyphens converted to underscores.
- The query should return the first ten rows of the sample ratings data.
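The same query can also be submitted outside the console with the AWS CLI. This is a sketch, assuming the database, bucket, and results prefix used above; Athena runs queries asynchronously, so results are fetched in a second step:

```shell
# Submit the query; this returns a QueryExecutionId
aws athena start-query-execution \
  --query-string "SELECT * FROM test_data LIMIT 10" \
  --query-execution-context Database=athena-parquet \
  --result-configuration OutputLocation=s3://athena-parquet-<your-initials>/test-results/
# Fetch the results once the query has succeeded
aws athena get-query-results --query-execution-id <query-execution-id>
```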
Some of these steps, like running a Glue crawler, aren’t strictly required, but they are the better approach when the schema of a Parquet file is unknown. Athena itself is a handy service for getting hands-on with the files themselves, but it does come with some limitations.
Those limitations include concurrency limits, price-performance impact, and no control over your deployment. Many companies are moving to a managed-service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL-on-S3 needs.