Querying AWS S3 with Presto

If you’re looking for a query engine for AWS S3, Presto is a popular choice. Presto is an open source SQL query engine, well known for ad hoc analytics as well as reporting and dashboarding. While users may think of S3 buckets as a data source for the Presto query engine, from a Presto perspective, file systems and object stores are handled very differently than databases.

PrestoDB only needs access to the path of the data it has to scan; details such as the bucket name do not need to be part of its configuration. It gets the path from the catalog, which maps the files or objects to databases and tables. There are two popular catalogs for object stores: Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). The Presto Hive Connector is used to access the data in the underlying object store.
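
As a rough illustration of how the catalog choice shows up in configuration, a Hive connector catalog file using AWS Glue might look like the sketch below. The file name and the region are assumptions made for this example; note that no bucket name appears anywhere in it.

```properties
# etc/catalog/hive.properties (assumed file name; the catalog would then be called "hive")
connector.name=hive-hadoop2

# Use AWS Glue as the catalog. The region below is a placeholder.
hive.metastore=glue
hive.metastore.glue.region=us-east-1
```

With a Hive Metastore instead, the two Glue lines would typically be replaced by a hive.metastore.uri entry pointing at the metastore's Thrift endpoint.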

How to query an AWS S3 bucket with Presto:

You can use Presto to query S3 data. There are two primary ways for a Presto cluster to get access to data stored in S3:

  1. IAM role via the instance Presto servers are running on 
  2. Access Key / Secret Key provided via the Hive connector properties file

IAM Role: Query S3 with Presto (Recommended Approach)

If you use an IAM role, the Presto Hive connector needs to be configured with:

hive.s3.use-instance-credentials=true 

With this setting, the Presto server will have access to all the buckets that are accessible using the IAM role that the instance is bound to. The Hive Metastore also needs access to those buckets and needs to be bound to the same IAM role.
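
To put the setting in context, here is a minimal sketch of a complete catalog file for the IAM role approach. The file name and the metastore host are hypothetical; only the hive.s3.use-instance-credentials line comes from the text above.

```properties
# etc/catalog/hive.properties (assumed file name)
connector.name=hive-hadoop2

# Hypothetical Hive Metastore host; replace with your own Thrift endpoint.
hive.metastore.uri=thrift://metastore.example.internal:9083

# Take S3 credentials from the IAM role attached to the instance.
hive.s3.use-instance-credentials=true
```

No keys are stored in the file with this approach, so the set of reachable buckets is controlled entirely through the role's policy.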

Access Key / Secret Key: Query S3 with Presto

If an access key / secret key (AK/SK) pair is used, the Presto Hive connector needs to be configured with:

hive.s3.aws-access-key    Default AWS access key to use.
hive.s3.aws-secret-key    Default AWS secret key to use.

With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected. 
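
A catalog file for the AK/SK approach might look like the following sketch. The file name, metastore host, and key values are placeholders for illustration, not real values.

```properties
# etc/catalog/hive.properties (assumed file name)
connector.name=hive-hadoop2

# Hypothetical Hive Metastore host; replace with your own Thrift endpoint.
hive.metastore.uri=thrift://metastore.example.internal:9083

# Placeholder credentials; substitute the key pair for your AWS user.
hive.s3.aws-access-key=<YOUR_ACCESS_KEY_ID>
hive.s3.aws-secret-key=<YOUR_SECRET_ACCESS_KEY>
```

Because the keys sit in plain text in this file, it is worth restricting who can read it.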

If you’re interested in getting started with Presto on S3, check out the Ahana Cloud platform, a managed service for Presto in AWS. You can add data sources with the click of a button, query your S3-based data lake, and run Presto queries against AWS S3 easily.