Comparing AWS Athena and PrestoDB Blog Series: Athena Alternatives

This is the fourth post in our series comparing AWS Athena to PrestoDB. If you missed the others, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits
Part 3: AWS Athena vs. PrestoDB Blog Series: Athena Partition Limits

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Athena is great for interactive querying on datasets already residing in S3, without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine version 2) also provides federated query capabilities, which allow you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Why would I not want to use Athena?

There are various reasons users look for alternative options to Athena, in spite of its advantages: 

  1. Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per region. If too many users leverage the service at the same time in a region, users across the board start seeing query queuing and higher latencies. Query concurrency can be challenging because of the limits imposed on accounts to prevent users from overwhelming the regional service.
  2. Cost per query: Athena charges based on terabytes of data scanned ($5 per TB). If your datasets are not very large and you don’t have many users querying the data often, Athena is a good fit for your needs. If, however, you run hundreds or thousands of queries that scan terabytes or petabytes of data, Athena may not be the most cost-effective choice.
  3. Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for queries. You have no visibility into the underlying infrastructure, or even into the details of why a query failed or how it is performing. This visibility matters for query tuning and consistency, and even for reducing the amount of data scanned in a query.
  4. Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
  5. Feature delays: Presto is evolving rapidly, with new performance features, SQL functions, and optimizations regularly contributed by the community and by companies such as Facebook, Alibaba, and Uber. Amazon caught up with version 0.217 only in Nov 2020. With the current version of PrestoDB being 0.248, if you need the performance, features, and efficiencies that newer versions provide, you will have to wait for some time.
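
To make the cost-per-query point concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the $5 per TB figure quoted above; the function names, the binary-terabyte conversion, and the 30-day month are assumptions made for illustration, and the sketch ignores any minimum charges, discounts, or other billing details.

```python
# Rough cost estimate based on the $5 per TB scanned figure from the text.
# Assumptions for illustration: 1 TB = 1024**4 bytes, and a 30-day month.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4


def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimated cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


def monthly_cost_usd(queries_per_day: int,
                     avg_tb_scanned_per_query: float,
                     days: int = 30) -> float:
    """Estimated monthly spend for a steady query workload."""
    total_tb = queries_per_day * days * avg_tb_scanned_per_query
    return total_tb * PRICE_PER_TB_USD


if __name__ == "__main__":
    # A light workload: 20 queries a day, about 10 GB scanned each.
    print(monthly_cost_usd(20, 10 / 1024))
    # A heavy workload: 1,000 queries a day, about half a terabyte each.
    print(monthly_cost_usd(1000, 0.5))
```

Under these assumptions the light workload costs a few dollars per month, while the heavy one runs into tens of thousands of dollars, which is the point at which a fixed-size cluster may become worth evaluating.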

What are the typical alternatives to Athena?

Depending on their business needs and the level of control desired, users leverage one or more of the following options:

DIY open-source PrestoDB

Instead of using Athena, users deploy open-source PrestoDB in their own environment (either on-premises or in the cloud). This mode of deployment gives the user the most flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage, and monitor the deployment.

Large-scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters, the FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and tech-savvy companies such as Uber and Pinterest, to name a few.

The cost of managing an additional PrestoDB cluster may be incremental for a customer already managing large distributed systems. For customers starting from scratch, however, it can mean a steep increase in cost.

Managed Hadoop and Presto

Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.

AWS provides EMR (Elastic MapReduce), Google provides Dataproc, and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their version of Hadoop.

This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would be in a DIY model, but are only responsible for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scaling out becomes as easy as the push of a button, and costs can be further optimized by autoscaling with either on-demand or spot instances.

You still need the expertise to get the most out of your deployment by tweaking configurations, instance sizes, and properties.

Managed Presto Service

If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.

Ahana Cloud provides a fully managed Presto cloud service, with support for a wide range of native Presto connectors, I/O caching, and configurations optimized for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana’s service is cloud-native and runs on Amazon’s Elastic Kubernetes Service (EKS) to provide resiliency, performance, and scalability, and it also helps reduce your operational costs.

A managed Presto service such as Ahana gives you the visibility you need into query performance, instance utilization, security, auditing, and query plans, and it lets you manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimum defaults, and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters. You can also scale up and down depending on your workload needs.

Ahana is a premier member of the Linux Foundation’s Presto Foundation and contributes many features back to the open-source Presto community, unlike Athena, Presto on EMR, Dataproc, and HDInsight.

Summary

You have a wide variety of options regarding your use of PrestoDB. 

If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you. 

On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go. 

If simplicity and accelerated go-to-market are what you seek without needing to manage a complex infrastructure, then Ahana’s Presto managed service is the way to go. Sign up for our free trial today.

Athena Limitations & AWS Athena Limits

Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. It has a serverless architecture, and Athena users pay per query (it’s priced at $5 per terabyte scanned). Some of the common Amazon Athena limits are technical limitations that include query limits, concurrent query limits, and partition limits. Due to these limitations, AWS Athena can run slowly and increase operational costs. Plus, AWS Athena is built on an old version of PrestoDB and only supports a subset of PrestoDB features.

An overview on AWS Athena limits

The AWS Athena query service has many limitations that can cause problems, and many data engineering teams have spent hours trying to diagnose them. Some limits are hard, while others are soft quotas that you can ask AWS to increase. One big limitation concerns queries: Athena users can only submit one query at a time and can only run up to five queries simultaneously per account by default.

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like SELECT and INSERT INTO) queries have the following limits:

1.    Athena DDL max query limit: 20 active DDL queries.

2.    Athena DDL query timeout limit: The Athena DDL query timeout is 600 minutes.

3.    Athena DML query limit: By default, Athena allows 25 active DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions.

4.    Athena DML query timeout limit: The Athena DML query timeout limit is 30 minutes. 

5.    Athena query string length limit: The Athena query string hard limit is 262,144 bytes. 
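
As a small illustration of the query string limit, the following Python sketch checks a statement's size before it is submitted. The 262,144-byte figure comes from the list above; the helper names are hypothetical, and measuring the string as UTF-8 bytes is an assumption about how the size is counted.

```python
# Client-side pre-check against the query string limit quoted in the text.
# Assumption for illustration: size is measured as UTF-8 encoded bytes.
MAX_QUERY_BYTES = 262_144


def query_size_bytes(sql: str) -> int:
    """Size of the query string in bytes when encoded as UTF-8."""
    return len(sql.encode("utf-8"))


def fits_query_limit(sql: str, limit: int = MAX_QUERY_BYTES) -> bool:
    """True if the query string is within the byte limit."""
    return query_size_bytes(sql) <= limit


if __name__ == "__main__":
    small = "SELECT 1"
    # A generated query with a very long IN list can exceed the limit.
    huge = "SELECT * FROM t WHERE id IN (" + ", ".join(["123456789"] * 30_000) + ")"
    print(fits_query_limit(small), fits_query_limit(huge))
```

A check like this is mostly useful for generated SQL, such as long IN lists, where the statement can grow past the limit without anyone noticing.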

AWS Athena partition limits

  1. Athena users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 per table, and Glue’s limit is 1,000,000 partitions per table.
  2. A CREATE TABLE AS SELECT (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation, you must manually split your data by running a series of INSERT INTO queries that each insert up to 100 partitions.
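
The workaround for the 100-partition limit can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: it assumes a single string-typed partition column, and the table names, column name, and helper functions are all hypothetical. It only builds the SQL strings; submitting them is left to whatever client you use.

```python
from typing import Iterator, List, Sequence

# Limit stated in the text: at most 100 partitions per CTAS or INSERT INTO.
MAX_PARTITIONS_PER_QUERY = 100


def chunk(values: Sequence[str],
          size: int = MAX_PARTITIONS_PER_QUERY) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` partition values."""
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def build_insert_statements(source_table: str,
                            dest_table: str,
                            partition_column: str,
                            partition_values: Sequence[str]) -> List[str]:
    """Build one INSERT INTO statement per batch of partition values.

    Assumes a single string-typed partition column (hypothetical schema).
    """
    statements = []
    for batch in chunk(partition_values):
        # Escape single quotes so each value is a valid SQL string literal.
        in_list = ", ".join("'" + v.replace("'", "''") + "'" for v in batch)
        statements.append(
            f"INSERT INTO {dest_table} SELECT * FROM {source_table} "
            f"WHERE {partition_column} IN ({in_list})"
        )
    return statements


if __name__ == "__main__":
    days = [f"day_{i:03d}" for i in range(250)]
    for sql in build_insert_statements("raw_events", "events_by_day", "dt", days):
        print(sql[:80], "...")
```

With 250 partition values this produces three statements covering 100, 100, and 50 partitions, which you would then run one after another.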

Athena database limits

AWS Athena is also subject to the following S3 bucket and database limits:

1.    The Amazon S3 bucket limit is 100 buckets per account by default; you can request an increase to up to 1,000 S3 buckets per account.

2.    Athena restricts each account to 100 databases, and each database cannot contain more than 100 tables.

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An AWS Athena alternative is deploying your own PrestoDB cluster. AWS Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won’t have AWS Athena limitations such as query limits, concurrent query limits, database limits, table limits, and partition limits. Plus, you’ll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation’s Presto Foundation. It has a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service without the limits of AWS Athena.

You can use Ahana to query and analyze AWS data lakes stored in Amazon S3, and many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI. Anyone at any knowledge level can use it with ease: there is zero configuration effort and there are no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud. You can try Ahana Cloud today as a free trial.

Up next: AWS Athena Query Limits