Querying AWS S3 data with Presto
Users often think of an S3 bucket as a data source for the Presto query engine, but Presto handles file systems and object stores very differently from databases. Presto only needs the path to the data it has to scan; details such as the bucket name are not part of its configuration. It gets the path from the catalog, which maps files and objects to databases and tables. Two catalogs are popular for object stores: Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). In both cases, the Presto Hive Connector is used to access the data in the underlying object store.
https://prestodb.io/docs/current/connector/hive.html
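As a rough illustration of how a catalog is wired to a metastore, a Hive catalog properties file might look like the sketch below. The file name, metastore host, and region are placeholders assumed for this example; they are not values from this article.

```properties
# etc/catalog/hive.properties (the file name is an assumption; it determines the catalog name)
connector.name=hive-hadoop2

# Option 1: Hive Metastore, reached over Thrift (hypothetical host and port)
hive.metastore.uri=thrift://metastore.example.com:9083

# Option 2: AWS Glue instead of a Hive Metastore (remove the URI line above if used)
# hive.metastore=glue
# hive.metastore.glue.region=us-east-1
```

Note that no bucket name appears in this file: table locations such as s3://... paths are stored in the metastore, not in the Presto configuration.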
There are two primary ways for Presto clusters to get access to data stored in S3:
- IAM role via the instance Presto servers are running on
- Access Key / Secret Key provided via the Hive connector properties file
IAM role – recommended approach
If you use an IAM role, Presto needs to be configured with:
hive.s3.use-instance-credentials=true
With this setting, the Presto server has access to all the buckets that are accessible through the IAM role the instance is bound to. The Hive Metastore also needs access to those buckets, so it needs to be bound to the same IAM role.
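A minimal sketch of a complete catalog file for the IAM role approach could look as follows. The file name and the metastore address are hypothetical placeholders; only the instance-credentials property comes from the text above.

```properties
# etc/catalog/hive.properties (assumed file name)
connector.name=hive-hadoop2

# Hypothetical metastore address; replace with your own
hive.metastore.uri=thrift://metastore.example.com:9083

# Use the credentials of the IAM role bound to the instance
hive.s3.use-instance-credentials=true
```

No keys are stored in the file with this approach, which is one reason it is usually preferred.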
Access Key / Secret Key
If an access key and secret key (AK/SK) are used, the Presto Hive connector needs to be configured with:
hive.s3.aws-access-key: Default AWS access key to use.
hive.s3.aws-secret-key: Default AWS secret key to use.
With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected.
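As a sketch, the same kind of catalog file with an access key and secret key might look like this. The key values are obvious placeholders, and the file name and metastore address are assumptions for illustration.

```properties
# etc/catalog/hive.properties (assumed file name)
connector.name=hive-hadoop2

# Hypothetical metastore address; replace with your own
hive.metastore.uri=thrift://metastore.example.com:9083

# Placeholder credentials; never commit real keys to version control
hive.s3.aws-access-key=YOUR_ACCESS_KEY_ID
hive.s3.aws-secret-key=YOUR_SECRET_ACCESS_KEY
```

Because the keys are set per catalog file, different catalogs can be given different keys and therefore reach different sets of buckets.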
If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto on AWS. You can add data sources with the click of a button and query your S3-based data lake easily.
We also have a free hands-on virtual lab on how to Query Data in AWS S3 with PrestoDB.