How do I query a data lake with Presto?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Presto, an open source SQL query engine, can query the structured and semi-structured data in a data lake directly. This lets users store data as-is, without having to structure it first, and run many different types of analytics on it.

To query data in a data lake with Presto, the following technologies are needed:

  1. A SQL query engine – Presto was built to query data lakes, originally HDFS and now increasingly cloud object stores such as AWS S3 and Google Cloud Storage. 
  2. A big data catalog – there are two popular big data catalog systems, also called metastores: the Hive Metastore and the AWS Glue service. 
  3. Storage in the data lake, such as AWS S3 buckets.
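As a sketch, these three pieces are wired together in a Presto catalog properties file for the Hive connector. The metastore host and the credential values below are placeholders, not values from this article:

```properties
# etc/catalog/ahana_hive.properties -- hypothetical catalog file
# Hive connector: maps tables in the metastore to files in S3
connector.name=hive-hadoop2
# Thrift endpoint of your Hive Metastore (placeholder host)
hive.metastore.uri=thrift://example-metastore:9083
# Credentials for reading the S3 buckets (placeholders)
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

With a file like this in place, the catalog appears in Presto under the name ahana_hive, which is the catalog used in the examples below.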

What types of data can be queried by Presto? 

The following file formats can be queried by Presto: 

  1. ORC
  2. Parquet
  3. Avro
  4. RCFile
  5. SequenceFile
  6. JSON
  7. Text
  8. CSV

How does it work? 

First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. For example, if there is a CSV file in the lake, then once Presto is integrated with the Hive Metastore or Glue, users can run Presto commands to create a schema and then a CREATE TABLE statement that maps the file to a table and its columns. 

Example:

SHOW CATALOGS; 
USE ahana_hive.default; 

CREATE TABLE user ( 
    registration_dttm timestamp, 
    id                int,
    first_name        varchar,
    last_name         varchar,
    email             varchar,
    gender            varchar,
    ip_address        varchar,
    cc                varchar,
    country           varchar,
    birthdate         varchar,
    salary            double,
    title             varchar,
    comments          varchar
) WITH ( 
    format = 'CSV', 
    skip_header_line_count = 1,
    external_location = 's3a://ahana/userdata/' 
);

Once the table is created and mapped to the external location, it can immediately be queried. 

Example: 

SELECT * FROM user; 

You can run these commands in Presto using the presto-cli. More information is in the docs.
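As a sketch, assuming a Presto coordinator running locally on the default port, the statements above could be run from the presto-cli like this (the server address and catalog name are illustrative, not from this article):

```shell
# Hypothetical invocation; adjust --server and --catalog for your deployment
./presto-cli --server localhost:8080 --catalog ahana_hive --schema default
presto:default> SELECT * FROM user;
```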

If you are looking for better performance, it is recommended to convert formats like JSON and CSV into more optimized, columnar formats like Apache Parquet and Apache ORC. This will improve query performance greatly, and it can also be done with Presto using the CREATE TABLE AS command. More information on this here.
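For instance, here is a minimal sketch of converting the CSV-backed user table above into Parquet with CREATE TABLE AS (the target table name user_parquet is illustrative, not from this article):

```sql
-- Hypothetical conversion: materialize the CSV-backed table as Parquet
CREATE TABLE user_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM user;

-- Subsequent queries against user_parquet read the columnar files
SELECT count(*) FROM user_parquet;
```

Queries that touch only a few columns benefit the most, since Parquet lets Presto read just those columns instead of scanning whole CSV rows.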

Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and comes with a built-in Hive Metastore, so you don’t need to deploy and manage one. It can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started

Additional resources: 

Presto Docker Container 
Presto-cli Docker Container 
Presto Sandbox Docker Container