Data Lakehouse

How to Query Your JSON Data Using Amazon Athena

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV and JSON files that are typically used for querying directly, or transformed and loaded into a data warehouse. Athena allows you to extract data from, and search for values and parse JSON data.

Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table; 
name age cars.car2
Sam45 Toyota
Sally21 SAAB
Bill68 Porsche

That is a pretty simple use case of  retrieving certain fields out of the JSON. The complexity was the cars column with the key/value pairs and we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) as t(cars)
dealershipmodels
Family Honda Civic
Family HondaAccord
Family HondaOdyssey
Family HondaBrio
Family HondaPilot

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee2)
dealershipemployee_unnested
Family Honda {name=Allan, dept=service, age=45}
Family Honda{name=Bill, dept=sales, age=52}
Family Honda{name=Karen, dept=finance, age=32}
Family Honda{name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

select dealership,employee_unnested.name,employee_unnested.dept,employe_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)
dealershipemployee_unnested.nameemployee_unnested.deptemployee_unnested.age
Family HondaAllenservice45
Family HondaBillsales52
Family HondaKarenfinance32
Family HondaTerryadmin27

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 dollars per terabyte scanned cost, depending on the region. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 


You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/