Data Lakehouse

How to Query Your JSON Data Using Amazon Athena

Querying semi-structured data in Athena can get expensive. Ahana provides the same SQL-based interactive analytics on your S3 data, with predictable pricing and performance. Take control of your analytics – learn more about Ahana on our website or in a quick, no-strings-attached call with a solution architect.

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV and JSON files that are typically used for querying directly, or transformed and loaded into a data warehouse. Athena allows you to extract data from, and search for values and parse JSON data.


JSON Data

Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table; 
name age cars.car2
Sam45 Toyota
Sally21 SAAB
Bill68 Porsche

That is a pretty simple use case of  retrieving certain fields out of the JSON. The complexity was the cars column with the key/value pairs and we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) as t(cars)
dealershipmodels
Family Honda Civic
Family HondaAccord
Family HondaOdyssey
Family HondaBrio
Family HondaPilot

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee2)
dealershipemployee_unnested
Family Honda {name=Allan, dept=service, age=45}
Family Honda{name=Bill, dept=sales, age=52}
Family Honda{name=Karen, dept=finance, age=32}
Family Honda{name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

select dealership,employee_unnested.name,employee_unnested.dept,employe_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)
dealershipemployee_unnested.nameemployee_unnested.deptemployee_unnested.age
Family HondaAllenservice45
Family HondaBillsales52
Family HondaKarenfinance32
Family HondaTerryadmin27

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous AWS Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 dollars per terabyte scanned cost, depending on the region. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 


Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Related Articles

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.