Querying AWS S3 with Presto
In AWS, you can query S3 data with a technology like open source Presto. While users may think of S3 buckets as a data source for the Presto query engine, from Presto's perspective file systems and object stores are handled very differently from databases. To query S3, Presto only needs the path of the data it has to scan. It does not need details such as the bucket name as part of its configuration; it gets the path from the catalog, which maps files and objects to databases and tables. There are two popular catalogs for object stores: Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). The Presto Hive connector is used to access the data in the underlying object store.
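As an illustration, the catalog choice is made in the Hive connector's catalog properties file. The fragment below is a minimal sketch, assuming PrestoDB's `hive-hadoop2` connector name; the file path, metastore host, and region are placeholders, not values from this article.

```properties
# etc/catalog/hive.properties (hypothetical path and catalog name)
connector.name=hive-hadoop2

# Option 1: Hive Metastore (cloud-agnostic); the host below is a placeholder
hive.metastore.uri=thrift://metastore.example.internal:9083

# Option 2: AWS Glue (AWS-specific); use instead of hive.metastore.uri
# hive.metastore=glue
# hive.metastore.glue.region=us-east-1
```

Only one of the two options should be active in a given catalog file.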
How to query an AWS S3 bucket with Presto:
You can use Presto to query S3 data. There are two primary ways for Presto clusters to get access to data stored in S3:
- IAM role via the instance Presto servers are running on
- Access Key / Secret Key provided via the Hive connector properties file
IAM role – recommended approach
If using an IAM role, Presto needs to be configured with the Hive connector property hive.s3.use-instance-credentials=true.
With this setting, the Presto server has access to all the buckets that are accessible using the IAM role that the instance is bound to. The Hive Metastore also needs access to those buckets and must be bound to the same IAM role.
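A catalog file for the IAM role approach might look roughly like the sketch below. It assumes PrestoDB's `hive-hadoop2` connector and the `hive.s3.use-instance-credentials` property from the Hive connector; the file path and metastore host are placeholders.

```properties
# etc/catalog/hive.properties (hypothetical path)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.internal:9083

# Take S3 credentials from the IAM role bound to the EC2 instance
hive.s3.use-instance-credentials=true
```

No keys are stored in the file with this approach; access is governed entirely by the role's policies.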
Access Key / Secret Key
If an access key / secret key (AK/SK) pair is used, the Presto Hive connector needs to be configured with:
| Property | Description |
|---|---|
| hive.s3.aws-access-key | Default AWS access key to use. |
| hive.s3.aws-secret-key | Default AWS secret key to use. |
With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected.
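For comparison, a sketch of the same catalog file using an access key and secret key. The connector name, file path, and metastore host are assumptions, and the key values are obvious placeholders to be replaced with your own credentials.

```properties
# etc/catalog/hive.properties (hypothetical path)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.internal:9083

# Static credentials used for S3 access through this catalog (placeholders)
hive.s3.aws-access-key=YOUR_ACCESS_KEY_ID
hive.s3.aws-secret-key=YOUR_SECRET_ACCESS_KEY
```

Because the keys sit in plain text in the properties file, restricting read access to that file is worth considering, which is one reason the IAM role approach is the recommended one.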
If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS. You can add data sources with the click of a button, query your S3-based data lake, and run Presto queries against AWS S3 easily.