
How to Query Your JSON Data Using Amazon Athena
AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV and JSON files that are typically used for querying directly, or transformed and loaded into a data warehouse. Athena allows you to extract data from, and search for values and parse JSON data.
Using Athena to Query Nested JSON
To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:
[
{
"name": "Sam",
"age": 45,
"cars": {
"car1": {
"make": "Honda"
},
"car2": {
"make": "Toyota"
},
"car3": {
"make": "Kia"
}
}
},
{
"name": "Sally",
"age": 21,
"cars": {
"car1": {
"make": "Ford"
},
"car2": {
"make": "SAAB"
},
"car3": {
"make": "Kia"
}
}
},
{
"name": "Bill",
"age": 68,
"cars": {
"car1": {
"make": "Honda"
},
"car2": {
"make": "Porsche"
},
"car3": {
"make": "Kia"
}
}
}
]
We want to retrieve all “name”, “age” and “car2” values out of the array:
SELECT name, age, cars.car2.make FROM the_table;
name | age | cars.car2 |
Sam | 45 | Toyota |
Sally | 21 | SAAB |
Bill | 68 | Porsche |
That is a pretty simple use case of retrieving certain fields out of the JSON. The complexity was the cars column with the key/value pairs and we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.
How to Query a JSON Array with Athena
Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:
{
"dealership": "Family Honda",
"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}
We have to unnest the array and connect it to the original table:
SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) as t(cars)
dealership | models |
Family Honda | Civic |
Family Honda | Accord |
Family Honda | Odyssey |
Family Honda | Brio |
Family Honda | Pilot |
Finally we will show how to query nested JSON with an array of key values.
Query Nested JSON with an Array of Key Values
Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:
dealership:= Family Honda
employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]
To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:
select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee2)
dealership | employee_unnested |
Family Honda | {name=Allan, dept=service, age=45} |
Family Honda | {name=Bill, dept=sales, age=52} |
Family Honda | {name=Karen, dept=finance, age=32} |
Family Honda | {name=Terry, dept=admin, age=27} |
By using the “.key”, we can now retrieve a specific column:
select dealership,employee_unnested.name,employee_unnested.dept,employe_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)
dealership | employee_unnested.name | employee_unnested.dept | employee_unnested.age |
Family Honda | Allen | service | 45 |
Family Honda | Bill | sales | 52 |
Family Honda | Karen | finance | 32 |
Family Honda | Terry | admin | 27 |
Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows.
At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 dollars per terabyte scanned cost, depending on the region. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/