How do I query a data lake with Presto?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Structured and semi-structured data can be queried by Presto, an open source SQL engine. This allows users to store data as-is, without having to first structure the data, and run different types of analytics.
To query this data in data lakes, the following technologies are needed.
- A SQL query engine – Presto was built for querying data lakes like HDFS and now increasingly AWS S3 and Google Cloud Platform – Google Cloud Storage and others.
- A big data catalog – there are two popular big data catalog systems also called metastores – the Hive Metastore and AWS Glue service.
- Buckets in the data lake like AWS S3
What types of data can be queried by Presto?
The following file formats can be queried by Presto
How does it work?
First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. Example, if there is a CSV file, once Presto, Hive Metastore and Glue are integrated, users can use Presto commands to create a schema and then the
create table statement and map the file to a table and columns.
SHOW CATALOGS; USE ahana_hive.default; CREATE TABLE user ( registration_dttm timestamp, id int, first_name varchar, last_name varchar, email varchar, gender varchar, ip_address varchar, cc varchar, country varchar, birthdate varchar, salary double, title varchar, comments varchar ) WITH ( format = CSV, skip_header_line_count = 1, external_LOCATION = 's3a://ahana/userdata/' );
Once the table is created and mapped to the external location, it can immediately be queried.
Select * from user;
You can run these commands in Presto using the Presto-cli. More information in the docs.
If you are looking for better performance, it is recommended to convert formats like JSON, CSV into more optimized formats like Apache Parquet and Apache ORC. This will improve query performance greatly. This can also be done with Presto using the
CREATE TABLE AS command. More information on this here.
Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and also comes with a built-in Hive Metastore so that you don’t need to deploy and manage one. In addition, it can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started