Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need, you can use S3 Select to filter the data at the source (that is, in S3). This reduces the amount of data that Amazon S3 transfers, which in turn reduces cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently, Amazon S3 Select works only on objects stored in CSV, JSON, or Apache Parquet format. CSV and JSON objects can also be compressed with GZIP or BZIP2. The filtered results can be returned as CSV or JSON, and you can choose how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can run S3 Select SQL queries using the AWS SDKs, the SelectObjectContent REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console.
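As an illustration of the SDK route, here is a minimal Python sketch built around boto3's select_object_content call. The bucket name, object key, and query in the usage note are hypothetical, and the input is assumed to be a GZIP-compressed CSV object with a header row. The helper names (build_select_request, collect_records, run_select) are made up for this example.

```python
from typing import Any, Dict, Iterable


def build_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's S3 client select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumed input: a GZIP-compressed CSV object whose first line is a header.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Output: one JSON document per record, separated by newlines.
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


def collect_records(event_stream: Iterable[Dict[str, Any]]) -> str:
    """Join the payloads of all Records events in a response event stream."""
    chunks = []
    for event in event_stream:
        # Other event types (Stats, Progress, End, ...) carry no row data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Decode once at the end so multi-byte characters split across chunks survive.
    return b"".join(chunks).decode("utf-8")


def run_select(bucket: str, key: str, expression: str) -> str:
    """Run the query against S3. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("s3")
    response = client.select_object_content(
        **build_select_request(bucket, key, expression)
    )
    return collect_records(response["Payload"])
```

A call might look like run_select("example-bucket", "data/people.csv.gz", "SELECT s.name FROM S3Object s WHERE s.city = 'Berlin'"), where every argument is a placeholder for your own data.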

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 
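The limits above can be checked on the client before a request is sent. The following Python sketch is one way to do that. It assumes that 1 KB means 1024 bytes, that sizes are measured on UTF-8 encoded text, and that the storage classes are identified by their S3 API names (GLACIER for S3 Glacier Flexible Retrieval). The function name check_select_limits is hypothetical.

```python
from typing import Iterable, List

# Assumption: the documented limits use binary units (1 KB = 1024 bytes).
MAX_EXPRESSION_BYTES = 256 * 1024
MAX_RECORD_BYTES = 1024 * 1024

# S3 API names for the storage classes that S3 Select cannot query.
UNSUPPORTED_STORAGE_CLASSES = {"GLACIER", "DEEP_ARCHIVE", "REDUCED_REDUNDANCY"}


def check_select_limits(
    expression: str, records: Iterable[str], storage_class: str
) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems: List[str] = []
    if len(expression.encode("utf-8")) > MAX_EXPRESSION_BYTES:
        problems.append("SQL expression exceeds 256 KB")
    for index, record in enumerate(records):
        if len(record.encode("utf-8")) > MAX_RECORD_BYTES:
            problems.append(f"record {index} exceeds 1 MB")
    if storage_class.upper() in UNSUPPORTED_STORAGE_CLASSES:
        problems.append(f"storage class {storage_class} is not supported")
    return problems
```

This only covers the size and storage class limits. The nested-data restriction depends on the chosen output format and is not checked here.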

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimal form of pushdown to the source, with limited support for the ANSI SQL dialect. Presto, on the other hand, is a comprehensive, ANSI SQL compliant query engine that can work with various data sources. Here is a quick comparison table.

Comparison             | S3 Select                    | Presto
SQL Dialect            | Fairly limited               | Comprehensive
Data Format Support    | CSV, JSON, Parquet           | Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources           | S3 only                      | Various (over 26 open-source connectors)
Push-Down Capabilities | Limited to supported formats | Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed service for Presto. As such, the comparison between Athena and S3 Select is the same as the one outlined above. For a more detailed understanding of the difference between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled as a configuration option on your Hive catalog to push projection (SELECT) and predicate (WHERE) processing down to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default, and you should enable it in production only after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered out by the query. Filtering out a large number of rows should result in better performance. If the query doesn’t filter out any data, pushdown may not add any value, and you will still be charged for the S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see whether it is suitable for your workload. For more information on S3 Select request cost, please see Amazon S3 Cloud Storage Pricing.

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth. For the best performance on AWS, your cluster is ideally colocated in the same region as the data, and the VPC is configured to use the S3 Gateway endpoint.
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.
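The first two guidelines can be turned into a rough pre-check. The Python sketch below is a heuristic only, not a replacement for benchmarking. It assumes you already know (or can estimate) how many rows the query scans and returns, and the function name is_pushdown_candidate is hypothetical.

```python
from typing import Iterable

# Data types that the guidelines list as unsupported by S3 Select Pushdown.
UNSUPPORTED_PREDICATE_TYPES = {"TIMESTAMP", "REAL", "DOUBLE"}


def is_pushdown_candidate(
    rows_total: int, rows_returned: int, predicate_column_types: Iterable[str]
) -> bool:
    """Rough check: more than half the rows filtered out, and supported types."""
    if rows_total <= 0:
        return False
    filtered_fraction = 1 - rows_returned / rows_total
    types_ok = all(
        column_type.upper() not in UNSUPPORTED_PREDICATE_TYPES
        for column_type in predicate_column_types
    )
    return filtered_fraction > 0.5 and types_ok
```

For example, a query that returns 100 of 1,000 rows with predicates on varchar and decimal columns passes this check, while the same query with a predicate on a double column does not. Network placement and response size still need to be measured separately.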

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the configuration property, allowing you to enable or disable pushdown on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections, depending on your workload.
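To make the two mechanisms concrete, the Python sketch below renders the lines you might place in the Hive catalog properties file and the per-session statement. The catalog name hive and the max-connections value of 500 are assumptions for illustration, as are the helper names. Adjust them to match your deployment.

```python
def catalog_properties(enabled: bool, max_connections: int) -> str:
    """Render the catalog-level settings as properties-file lines."""
    lines = [
        f"hive.s3select-pushdown.enabled={str(enabled).lower()}",
        f"hive.s3select-pushdown.max-connections={max_connections}",
    ]
    return "\n".join(lines)


def session_statement(catalog: str, enabled: bool) -> str:
    """Render the SET SESSION statement that overrides the catalog setting."""
    return (
        f"SET SESSION {catalog}.s3_select_pushdown_enabled = {str(enabled).lower()}"
    )


if __name__ == "__main__":
    # Assumed values: catalog named "hive", 500 connections.
    print(catalog_properties(True, 500))
    print(session_statement("hive", False))
```

Because the session property takes precedence, the statement printed last would turn pushdown off for that session even though the catalog file enables it.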