Querying AWS Presto S3 data

While users may think of S3 buckets as a datasource for the Presto query engine, from a Presto perspective, file systems and object stores are handled very differently than databases. Presto only needs to have access to the path for the data it needs to scan. It does not need to know the details like bucket name etc as part of the configuration. It gets the path from the catalog. The catalog maps the file system / objects to databases and tables. There are two popular catalogs for object stores. Hive Metastore (cloud-agnostic) and AWS Glue (AWS specific). The Presto Hive Connector is used to access the data from the object store underneath. 

https://prestodb.io/docs/current/connector/hive.html

There are two primary ways to Presto clusters get access to data stored in S3:

  1. IAM role via the instance Presto servers are running on 
  2. Access Key / Secret Key provided via the Hive connector properties file

IAM role – recommended approach

If using IAM role, Presto needs to be configured using

hive.s3.use-instance-credentials=true 

With this setting, the Presto server will have access to all the buckets that are accessible using the  IAM role that the instance is bound to. The Hive Metastore running also needs to have access to those buckets and need to be bound to the same IAM role. 

Access Key / Secret Key 

If AK/SK is used, Presto Hive connector needs to be configured with: 

hive.s3.aws-access-keyDefault AWS access key to use.
hive.s3.aws-secret-keyDefault AWS secret key to use.

With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected.