Athena Partition Limits
This is our 3rd blog in the comparing AWS Athena to PrestoDB series. If you missed the first two, you can find them here:
Partitioning is a great way to increase performance, but AWS Athena partitioning limitations could lead to poor performance, query failures, and wasted time trying to diagnose query problems. Athena is an AWS serverless interactive service to query AWS data lakes on Amazon S3 using regular SQL. It is a low-cost service; you only pay for the queries you run. Athena engine v2 is built on an older version of Presto DB (v 0.217), and developers use Athena for analytics on data lakes and across data sources in the cloud. A common limitation that could lead to poor performance is Athena’s partitioning.
Partitioning splits your table into parts and keeps the related data together based on column values. Partitions are like virtual columns that help the system to scan less data per query. Queries that constrain on the partitioning column(s) will run substantially faster because the system can reduce the volume of data scanned by the query when using filters based on the partition.
Athena Hive partitioning
AWS Athena supports Apache Hive partitioning. You can partition your data by one or more columns. For example you can partition based on time, which can lead to a multi-level partitioning scheme. It means that you then have your data organized by key values that may look like this: …/month=01/day=01/year=2021/… This is perfect if your queries constrain on year and/or month and/or day.
Athena has the MSCK REPAIR TABLE command which updates the partition metadata stored in the catalog. It is an inefficient command when there are a large number of partitions however. The more partitions you have, the slower this command runs. This command gathers metadata by scanning all the partitions in the specified table and updates the metadata in the Hive metastore accordingly so the system knows about any new partitions. However the command could time out after 30 minutes if it has a lot of work to do, this being Athena’s default query time limit. The command should run to completion so that all the partitions are discovered and cataloged, and it should be run every time new partitions are added e.g. after each ETL/data ingest cycle.
AWS Glue partitioning
AWS Glue is an Extract-Transform-and-Load (ETL) service that has a central metadata repository called AWS Glue Data Catalog. You can use it to perform ETL operations and store metadata to enable data lake querying. Also, it helps you to partition your data; you can create a Glue table of an Amazon S3 folder, which is a partitioned table that has key values in its path like Hive. Automated Glue crawlers help you to keep the Glue catalog in sync with Amazon S3, which is something you can’t do with Hive.
AWS Athena partition limits
If you are using Hive metastore as your catalog with Athena, the max number of partitions per table is 20,000. You can request a quota increase from AWS.
If you are using AWS Glue with Athena, the Glue catalog limit is 1,000,000 partitions per table.
AWS Athena alternatives with no partitioning limitations
Open Source PrestoDB
Deploying PrestoDB on your own is one way to avoid Athena’s partitioning limitations. PrestoDB doesn’t have a hard partition limit, which helps boost your performance, and you are free to deploy the latest version of Presto and so benefit from all its features.
PrestoDB has the Hive system.sync_partition_metadata function to update partitions in metastore; it works better than the MSCK REPAIR TABLE command that AWS Athena uses. Because AWS Athena built on an older version of PrestoDB, it doesn’t have this function.
If you’re ready to take advantage of the full power of PrestoDB, you can use Ahana Cloud – the managed service for Presto in the cloud. Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes on Amazon S3 and many other data sources using PrestoDB.
Ahana runs as a SaaS offering and gives you the ability to deploy and manage Presto at any scale without having to manage PrestoDB or the underlying containerization, and without the restrictions of Athena’s partitioning or concurrency limits. You can easily and quickly deploy multiple Presto clusters, attach data sources, and scale clusters in and out on demand in seconds. You can use Hive or Glue catalogs with Ahana Cloud, and it has its own integrated, fully managed Hive metastore for convenience.
Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.