Athena Partition Limits

This is the third blog in our series comparing AWS Athena to PrestoDB. If you missed the first two, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits

Partitioning is a great way to increase performance, but AWS Athena’s partitioning limitations can lead to poor performance, query failures, and wasted time diagnosing query problems. Athena is a serverless, interactive AWS service for querying AWS data lakes on Amazon S3 using regular SQL. It is a low-cost service; you only pay for the queries you run. Athena engine v2 is built on an older version of PrestoDB (v0.217), and developers use Athena for analytics on data lakes and across data sources in the cloud. One common limitation that can lead to poor performance is Athena’s partitioning.

Partitioning data

Partitioning splits your table into parts and keeps related data together based on column values. Partition columns act like virtual columns: queries that constrain on them run substantially faster because the engine can skip every partition that does not match the filter, reducing the volume of data scanned.
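For illustration, here is a minimal sketch assuming a hypothetical clickstream table partitioned by an event_date string column; because the WHERE clause constrains on the partition column, only the matching partition is read:

```sql
-- Hypothetical table partitioned by event_date.
-- Only files under the event_date=2021-01-01 partition are scanned;
-- every other partition is pruned before any data is read.
SELECT user_id, COUNT(*) AS actions
FROM clickstream
WHERE event_date = '2021-01-01'
GROUP BY user_id;
```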

Athena Hive partitioning 

AWS Athena supports Apache Hive partitioning. You can partition your data by one or more columns. For example, you can partition based on time, which often leads to a multi-level partitioning scheme. Your data is then organized by key values in the S3 path, which may look like this: …/year=2021/month=01/day=01/… This is perfect if your queries constrain on year and/or month and/or day.
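As a hedged sketch (the table, column, and bucket names are hypothetical), the Athena DDL for such a multi-level layout and a query that benefits from it might look like this:

```sql
-- Data laid out in Hive-style folders, e.g.
--   s3://my-bucket/events/year=2021/month=01/day=01/...
CREATE EXTERNAL TABLE events (
  user_id string,
  action  string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Constraining on the partition keys means only the matching folders are read.
SELECT COUNT(*)
FROM events
WHERE year = '2021' AND month = '01' AND day = '01';
```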

Athena provides the MSCK REPAIR TABLE command, which updates the partition metadata stored in the catalog. The command is inefficient when there is a large number of partitions: the more partitions you have, the slower it runs. It gathers metadata by scanning all the partitions in the specified table and updates the Hive metastore accordingly, so the system knows about any new partitions. However, if it has a lot of work to do, the command can time out after 30 minutes, Athena’s default query time limit. The command should run to completion so that all partitions are discovered and cataloged, and it should be run every time new partitions are added, e.g. after each ETL/data ingest cycle.
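Continuing the hypothetical events table above, after new data lands in S3 you can either have Athena scan for all missing partitions or register a known partition explicitly, which avoids the full scan:

```sql
-- Scan the table's S3 location and add any partitions missing from the catalog.
-- Gets slower as the partition count grows and can hit the 30-minute timeout.
MSCK REPAIR TABLE events;

-- Cheaper alternative when you know exactly which partition was just added.
ALTER TABLE events ADD IF NOT EXISTS
  PARTITION (year = '2021', month = '01', day = '02')
  LOCATION 's3://my-bucket/events/year=2021/month=01/day=02/';
```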

AWS Glue partitioning 

AWS Glue is an Extract-Transform-Load (ETL) service with a central metadata repository called the AWS Glue Data Catalog. You can use it to perform ETL operations and to store the metadata that enables data lake querying. It also helps you partition your data: you can create a Glue table over an Amazon S3 folder, giving you a partitioned table whose paths contain Hive-style key values. Automated Glue crawlers keep the Glue catalog in sync with Amazon S3, which is something a plain Hive metastore cannot do.

AWS Athena partition limits

If you are using Hive metastore as your catalog with Athena, the max number of partitions per table is 20,000. You can request a quota increase from AWS.

If you are using AWS Glue with Athena, the Glue catalog limit is 1,000,000 partitions per table. 

AWS Athena alternatives with no partitioning limitations

Open Source PrestoDB

Deploying PrestoDB on your own is one way to avoid Athena’s partitioning limitations. PrestoDB doesn’t impose a hard partition limit, which helps boost performance, and you are free to deploy the latest version of Presto and benefit from all of its features.

PrestoDB’s Hive connector provides the system.sync_partition_metadata procedure to update partitions in the metastore; it works better than the MSCK REPAIR TABLE command that AWS Athena uses. Because AWS Athena is built on an older version of PrestoDB, it doesn’t have this procedure.
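For example, on a self-managed PrestoDB cluster (assuming a Hive catalog named hive and the hypothetical events table from above), newly landed partitions can be registered with a single procedure call:

```sql
-- Arguments: schema name, table name, mode.
-- 'ADD' registers partitions present in S3 but missing from the metastore;
-- 'DROP' removes stale entries; 'FULL' does both.
CALL hive.system.sync_partition_metadata('default', 'events', 'ADD');
```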

If you’re ready to take advantage of the full power of PrestoDB, you can use Ahana Cloud – the managed service for Presto in the cloud. Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes on Amazon S3 and many other data sources using PrestoDB. 

Ahana runs as a SaaS offering and gives you the ability to deploy and manage Presto at any scale without having to manage PrestoDB or the underlying containerization, and without the restrictions of Athena’s partitioning or concurrency limits. You can easily and quickly deploy multiple Presto clusters, attach data sources, and scale clusters in and out on demand in seconds. You can use Hive or Glue catalogs with Ahana Cloud, and it has its own integrated, fully managed Hive metastore for convenience. Try Ahana Cloud today in a free trial.

Athena Query Limits

Welcome to the 2nd blog in our series comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service.

See our first blog in this series on AWS Athena Limitations.

Taking a look at AWS Athena Query Limits

Amazon Athena allows users to perform ad-hoc analytics on data lakes without the need for time-consuming Extract Transform Load (ETL) cycles. To maintain the availability of the service when processing vast and diverse data, certain design decisions and limitations were introduced. Some of these are:

Inherent limits

  • Query string limit: 262,144 bytes.
  • Some statements like EXPLAIN, MERGE, and UPDATE are not supported.
  • Only one query per user can run at a time, and 5 to 200 concurrent queries are permitted per account.
  • Cross-region querying is limited to sixteen regions.
  • A CREATE TABLE AS SELECT (CTAS) or INSERT INTO query can create a maximum of 100 partitions in the destination table. To work around this limitation, you must use a series of CTAS/INSERT INTO statements that each add at most 100 partitions (see the sketch after this list).
  • Query timeouts can occur on tables with many thousands of partitions, particularly when many of those partitions are not of type string.
  • Concurrent-query limits are imposed per account to keep individual users from overwhelming the regional service.
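Here is a minimal sketch of the CTAS workaround mentioned above, assuming a hypothetical source_events table with roughly 300 daily partitions: the CTAS creates the destination table from the first 100 days, and each subsequent INSERT INTO adds at most 100 more partitions:

```sql
-- Create the partitioned destination table from the first batch (100 partitions).
CREATE TABLE events_by_day
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['event_date'],
  external_location = 's3://my-bucket/events_by_day/'
) AS
SELECT user_id, action, event_date
FROM source_events
WHERE event_date BETWEEN '2021-01-01' AND '2021-04-10';

-- Load the remaining partitions in batches of at most 100 each.
INSERT INTO events_by_day
SELECT user_id, action, event_date
FROM source_events
WHERE event_date BETWEEN '2021-04-11' AND '2021-07-19';

INSERT INTO events_by_day
SELECT user_id, action, event_date
FROM source_events
WHERE event_date BETWEEN '2021-07-20' AND '2021-10-27';
```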

Configurable quotas

  • Queries can be queued or canceled by the Athena service when exceeding a data threshold.
  • An adjustable quota permits 20 active DDL queries (Data Definition Language, like CREATE TABLE) with a 600-minute timeout, and 20-25 active DML queries (Data Manipulation Language, like SELECT and CREATE TABLE AS) with a 30-minute timeout.
  • Maximum 20,000 partitions per table. 
  • Maximum number of databases is 100. Each database can contain a maximum of 100 tables.

Other limitations

  • Full-table scans are necessary due to the lack of indices.
  • AWS Athena’s federated queries cannot be used with views, and the required Athena Data Source Connectors (which run on AWS Lambda) are complicated to set up.
  • Nodes can run out of memory when sorting columns.
  • There is no default support for stored procedures (though it can be added with an adapter).
  • Partition creation is restricted to 100 at once (workarounds exist).
  • The same server pools run queries for multiple accounts, so queries tend to get enqueued based on the service load.
  • …and much more. You can read more details about AWS Athena’s limitations in my other blog post on the topic.

Configuring and fine-tuning Presto or working around the limitations of Amazon Athena takes a lot of experience. A solution like Ahana Cloud can help.

Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes stored in Amazon S3, and many other data sources. Unlike Athena, it uses the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI. Anyone at any knowledge level can use it with ease; there is zero configuration effort and there are no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud. You can try Ahana Cloud today as a free trial.

Up next: AWS Athena Partition Limits

Building Managed Services: Architecting Ahana Cloud for Presto with the In-VPC Deployment Model

Ahana Cloud for Presto

Gary Stafford, Solutions Architect, AWS & James Mesney, Solutions Engineer, Ahana

Note: This article originally appeared on the AWS Startups blog.

Ahana is the startup that provides the first cloud-native managed service for Presto, the fast-growing, open source distributed SQL engine. Backed by GV (formerly known as Google Ventures) and Lux Ventures, the Ahana team includes experts in Presto, AWS, and big data. This blog post discusses how AWS users have evolved their big data requirements and how the team architected our managed service offering, highlighting the best practice of providing an “In-VPC” deployment. We hope other infrastructure software startups can benefit from sharing some of the key learnings that led to the launch of Ahana Cloud for Presto on AWS.

For some more background, Presto is an open source system for federated data analytics. Federation means the system can query multiple data stores through a single engine. It enables users to access data where it lives, across a wide variety of sources, via federated plug-in connectors, without moving or copying the data. Presto was originally developed by Facebook. Today, it’s deployed in large-scale production at some of the world’s most data-driven companies, including Uber and Twitter. Presto addresses the business need of leveraging all data within an organization to generate insights and drive decision-making faster than ever before. Presto also leads in delivering on the technology trends of today: the disaggregation of storage and compute, which has driven the rise of Amazon S3-based data lakes and on-demand cloud computing. You can learn more on the AWS Presto page.

While the SQL engine is the main component of an interactive ad hoc analytics system, the other components, such as the metadata catalog, the data sources, and the visualization tools or notebooks, require integration. Deploying and managing complex software in AWS can be challenging. Presto administrators must set up, configure, and connect one data store for Presto’s metadata. Typically, this is Apache Hive or AWS Glue. They must also create and configure connectors to access their data sources and then configure catalog entries for each data source. Presto requires admins to deal with many properties files to achieve this, which is both laborious and error-prone.

Ahana Cloud for Presto addresses these complexities and more with an easy-to-use cloud-native managed service. In 60 minutes or less, Ahana allows users to build an end-to-end deployment: multiple clusters of Presto, their Glue or Hive metadata catalogs, their AWS data sources, and user-facing tooling. Customers get the power of Presto with the capabilities of AWS for faster, more iterative, and interactive data discovery—without the complexity. Analysts, data engineers, and data scientists enjoy the freedom to rapidly use any data in the organization and do so in a more self-service way. Additionally, AWS customers can procure services the way they’re used to—quickly and easily—on an hourly pay-as-you-go (PAYGO) listing on AWS Marketplace, simply billed to their AWS accounts.

The “In-VPC” Deployment Approach that Data-Driven Customers Want

As cloud service adoption has grown, the way companies store and analyze their data has evolved. Early adopters focused on innovation: building and deploying applications quickly with AWS and other public cloud providers. Most mission-critical data was still produced and analyzed in data centers, mainly due to concerns about control of that data, such as where it could be copied, how it could be used, and who could access it. Now, as cloud adoption has become mainstream, we see companies with the majority of their data both created and stored in the cloud, especially in cost-efficient Amazon S3-based data lakes. The concerns about how and where data is sent, how it is used, and who can access it have shifted along with it. Users do not want to lose control of their data; they prefer not to have to ingest it into other environments. They want data to remain in their own Virtual Private Cloud (VPC).

A new cloud-native architecture model has emerged for data-focused managed services like Ahana. We call it the “In-VPC” deployment model, separating the control plane from the compute and data planes.

The Role of the Control Plane

Ahana Cloud has two major “planes”: the control plane, which is delivered as a SaaS, and the compute plane, where Presto clusters run, which is delivered as a managed service. The Ahana control plane, just as it sounds, oversees, orchestrates, and manages the rest of the environment. The control plane runs in its own VPC in the Ahana account, separate from the customer account VPC where the compute plane and data live. This makes management much easier without customers having to share control of their data with Ahana. This is important because users want their data to remain in their own VPC and not be ingested into any other environment (unlike some 1st-gen cloud data warehouse services). In fact, the Ahana control plane running in the Ahana VPC never sees any of the customer’s data; it is totally separate from the customer’s “In-VPC” compute plane deployment.

Integrated Metastore

For further ease of use, Ahana pre-integrates an Apache Hive metastore/catalog, which is automatically created, so it’s not essential to set up other components like AWS Glue. But if users have an existing metastore, including Glue, they can use that if they prefer.

Connectors Included

In terms of connectors, Ahana initially ships with support for AWS data services like Amazon S3 and Amazon RDS for MySQL and PostgreSQL, and others. More connectors for sources like MongoDB and Amazon Redshift will follow soon. Ahana automates the creation of connections and catalogs, removes the need to juggle configuration files, and eliminates the need for Presto restarts. Catalogs can be created once and used by multiple clusters.

Diagram: the Ahana Cloud for Presto In-VPC deployment model

In the diagram, there are two core components, both created and managed by Ahana Cloud:

1. The Ahana control plane (top) and its UI orchestrate the Presto environment. There’s consolidated application logging, query logging, and monitoring, which means users have full and easy management and control. There are security and access controls, pay-as-you-go hourly billing, and support.

  • The control plane runs in the Ahana Amazon account, external to the user’s environment.
  • Ahana and its employees have no access to the user’s data.
  • It is multi-tenant to scale with customer accounts.
  • The control plane supports SSO with Amazon Cognito, LDAP authentication, and SQL-based authorization for Presto (RBAC). In the future, there will be Apache Ranger support.

2. The Ahana compute plane (bottom) runs in each user’s VPC, deployed as a single-tenant environment within the user’s account. The control plane first creates a dedicated VPC for the compute plane. It then deploys Amazon EKS for a highly elastic, highly available environment to create Presto clusters. Once the control plane completes the initial set-up of the compute plane, users can create and manage any number of Presto clusters, which then get provisioned into the compute plane in Amazon EKS.

  • The compute plane, and the user data it interacts with, runs in the user’s account.
  • Each cluster is created in an individual node group to utilize the most advanced autoscaling and high-availability capabilities EKS provides.
  • Each Presto cluster comes pre-integrated with a Hive Metastore to store metadata for schemas and tables generated via Presto and an Amazon S3 data lake where data inserted into tables gets stored.
  • In addition to the pre-integrated catalog and Amazon S3 bucket, users can attach external Hive Metastores or AWS Glue catalogs pre-populated with metadata for structured data stored in Amazon S3 and databases running on Amazon RDS for MySQL or PostgreSQL.

This separation of the control, compute, and data planes is enabled by Amazon’s recommended approach of cross-account access via an external ID, a mechanism that uses trusted, secure token exchange. Users simply update their policy to include the Ahana ARNs (Amazon Resource Names). The In-VPC deployment approach offers greater security and cleaner management to users. For further details, we recommend this AWS blog on architecting successful SaaS services.

Summary

Ahana Cloud for Presto is the fully managed, end-to-end environment for Presto. It gives users an interactive multi-cluster UI with single-click cluster and data source management. It provides automatic set-up, security features, and resilience features. It leverages the “In-VPC” deployment, which separates the control, compute, and data planes for customers. Finally, Ahana is procured using a simple and affordable pay-as-you-go usage-based licensing model on AWS.

Sign up for free Early Access for Ahana Cloud today!