Creating tables in a S3 Bucket gives “Query failed External location must be a directory”

So here’s why you are here. You are using Presto’s Hive connector and you want to create a new table using an S3 folder that already exists. You have full read/write access to the folder, you can see the folder, but Presto tells you “External location must be a directory”.

Whiskey Tango Foxtrot – The directory isn’t a directory? 

In this example, you have already created your target directory myDir using the AWS S3 console, in a bucket called myBucket,  but attempts to create the customer table in that directory fail:

presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) ) 
WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );

Query 20210104_110546_00015_mhsym failed: External location must be a directory

The issue here is not immediately obvious, read on further to learn more.

Why is the issue occurring?

S3 is an immutable object store and not a traditional file system. Objects are stored as key-value pairs internally, the key is the “Path” and the value is the actual object. There is no such thing as a folder in the traditional filesystem sense of the world. All there is are objects and associated metadata. If the path doesn’t have the appropriate metadata associated with it (in this case content-type: ‘application/x-directory’) Presto will complain that the path is not a directory

You have 2 primary options of addressing the issue:

  1. Manually associate the metadata via the AWS Console UI
  2. Copy the object over again using S3 command line or python script

Note: Since S3 is an immutable store, so there is no way to simply “update the metadata”. Under the covers, if you use the AWS Console, the UI is performing a clone of the object and updating the metadata. If you need to do it via API you have to perform a copyObject and update the metadata. 

See snippet from the AWS docs:  https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html

“You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.

So how do I fix the issue?

Option 1: Change metadata using the AWS Console UI

Choose “Edit metadata” for the path in question

EditMetadata
Choose Edit metadata for the Path

Add ‘Content-Type’ as ‘application/x-directory’ as a system defined metadata

Add Content Type
Set Content-Type to ‘application/x-directory’

Option 2: Change metadata using tools (copy objects over again)

AWS Cli

aws s3 cp --content-type 'application/x-directory' s3://<bucket>/<object> s3://<bucket>/<object> --metadata-directive REPLACE

S3 API:

#Put with metadata
╰$ aws s3api put-object --bucket mybucket --key folder_name_xxx/ --content-type application/x-directory
{
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
}
 
#Verify the metadata
 
╰$ aws s3api head-object --bucket mybucket --key folder_name_xxx/
{
    "AcceptRanges": "bytes",
    "LastModified": "Fri, 23 Apr 2021 16:03:21 GMT",
    "ContentLength": 0,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
    "ContentType": "application/x-directory",
    "Metadata": {}
}

Python Script

import boto3
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key')
s3_object.metadata.update({'Content-Type':'application/x-directory'})
s3_object.copy_from(CopySource={'Bucket':'bucket-name', 'Key':'key'}, Metadata=s3_object.metadata, MetadataDirective='REPLACE')
 
Alternatively:
s3 = boto3.client('s3')
s3.put_object(Bucket=BUCKET, Key=KEY, ContentType='application/x-directory')

The above-mentioned steps should help you resolve the issue and your create table should work fine without complaining about the ‘External location must be a directory’ error.

--Now CREATE TABLE works fine
presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) )
 WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );
CREATE TABLE