How do I get deterministic performance out of Amazon Athena?
What is Athena?
Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. This approach is advantageous for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse.
What is Athena great for?
Users love Athena for its simplicity, the ease of getting started, and the fact that there are no servers to manage. You pay only for the amount of data scanned. Athena is closely integrated with AWS Glue, so if you already use the Glue Data Catalog for your ETL workloads, all your tables are already defined and accessible. You can then make these tables available to your extended team for interactive analytics.
What are the trade-offs?
The simplicity of Athena's deployment architecture, however, comes at the price of inconsistent performance at scale, as many Athena users have already experienced.
There are two primary trade-offs with Athena. The first is a lack of transparency: "you get what you get." You have no visibility into the underlying infrastructure serving your queries, nor can you control or tune that infrastructure for your workloads. This works where performance and latency are not critical, but it can be a non-starter for users who need control. The second is that, despite its advantages, a shared serverless service model makes performance hard to predict.
AWS has documented best practices in its performance tuning guide to help users get the most out of Athena and avoid typical errors such as “Query exceeded local memory limit” and “Query exhausted resources at this scale factor”. These practices include using optimized columnar formats such as Parquet and ORC, avoiding large numbers of small files, and partitioning the data appropriately.
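As a rough illustration of the format and partitioning advice above, the sketch below builds an Athena CTAS (CREATE TABLE AS SELECT) statement that rewrites an existing table as partitioned Parquet. The helper name `build_parquet_ctas`, the table names, the column names, and the bucket path are all hypothetical and are not taken from the AWS guide. Treat it as a starting point, not a tuned recipe.

```python
from __future__ import annotations


def build_parquet_ctas(
    source_table: str,
    target_table: str,
    s3_location: str,
    partition_cols: list[str],
    data_cols: list[str],
) -> str:
    """Return a CTAS statement that rewrites a table as partitioned Parquet.

    All identifiers passed in are illustrative placeholders.
    """
    # In an Athena CTAS, partition columns must come last in the SELECT list.
    select_list = ", ".join(data_cols + partition_cols)
    partitions = ", ".join(f"'{col}'" for col in partition_cols)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{s3_location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


if __name__ == "__main__":
    # Hypothetical table, columns, and bucket, used only for demonstration.
    print(
        build_parquet_ctas(
            source_table="logs_raw",
            target_table="logs_parquet",
            s3_location="s3://example-bucket/logs_parquet/",
            partition_cols=["dt"],
            data_cols=["user_id", "event"],
        )
    )
```

Queries that filter on the partition column (here `dt`) can then skip whole partitions, which reduces the data scanned and therefore both cost and memory pressure.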
How does the serverless architecture impact my query performance?
At its core, Athena is a shared serverless service in each region. To prevent a handful of customers from driving usage out of control, Amazon restricts the usage, size, concurrency, and scale of the service, both per customer and for the region overall.
These limits and restrictions are well documented in the service limits. They include 20 active DDL queries and 20 active DML queries in most regions, a 30-minute maximum query timeout, API limits, and throttling, among others. (Some of these limits can be adjusted by working with Amazon support.) The limits are guardrails around the service so that one customer’s usage doesn’t adversely affect the experience of another. These guardrails are far from perfect, however, because each region has only a finite pool of resources to share across its customers. At Amazon’s scale, any excessive load from seasonal or unexpected spikes can easily consume the shared resources, causing contention and queuing.
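One way to stay inside limits like these is to cap concurrency on the client side and back off when the service throttles. The sketch below is a minimal illustration of that idea. `QueryGate` and `ThrottledError` are hypothetical names, the default of 20 slots simply mirrors the figure quoted above, and `submit` stands in for whatever function actually runs your query and waits for it to finish.

```python
import random
import threading
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling or quota error from the service."""


class QueryGate:
    """Caps in-flight queries on the client and retries throttled submissions."""

    def __init__(self, max_in_flight=20, max_retries=5, base_delay=0.5,
                 sleep=time.sleep):
        # One slot per allowed concurrent query.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep  # injectable so the backoff can be tested

    def run(self, submit):
        """Call submit() while holding a slot; back off on ThrottledError."""
        with self._slots:
            for attempt in range(self._max_retries + 1):
                try:
                    return submit()
                except ThrottledError:
                    if attempt == self._max_retries:
                        raise
                    # Exponential backoff with jitter to avoid retry storms.
                    delay = self._base_delay * (2 ** attempt)
                    self._sleep(delay * random.uniform(0.5, 1.0))
```

A caller would share one `QueryGate` across worker threads and wrap each query in `gate.run(...)`. This keeps your own account from tripping its limits, but it cannot remove queuing caused by other tenants in the region.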
In addition to the shared query infrastructure, Athena’s federated query connectors are based on Lambda, which is also serverless. Lambda scales out well and can be performant once warmed up, but consistent performance comes only with consistent use. Depending on how heavily a particular connector is used in the region and on the available capacity of the backend infrastructure, you could run into latency caused by cold starts, especially with connectors that are not accessed frequently, such as custom connectors.
When a large number of users run Athena at the same time, especially for large-scale queries, they often see their queries queue for extended periods. Although the query itself might run quickly once resources are available, the queuing significantly degrades the end-user experience for interactive workloads. Users have also reported that the execution time of the same query can vary from one run to the next, which again ties back to the shortcomings of the shared resources model.
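To see how much of a query's latency is queuing rather than execution, you can look at the statistics Athena returns for each query. The sketch below assumes a response shaped like the Athena GetQueryExecution API result, with `QueryQueueTimeInMillis` and `TotalExecutionTimeInMillis` under `QueryExecution.Statistics`. The helper names `queue_share` and `run_to_run_spread` are hypothetical, and the numbers in the example are made up for illustration.

```python
import statistics


def queue_share(query_execution: dict) -> float:
    """Fraction of total wall-clock time the query spent waiting in the queue."""
    stats = query_execution["QueryExecution"]["Statistics"]
    queued = stats.get("QueryQueueTimeInMillis", 0)
    total = stats.get("TotalExecutionTimeInMillis", 0)
    return queued / total if total else 0.0


def run_to_run_spread(engine_times_ms: list) -> float:
    """Coefficient of variation (std dev / mean) of repeated runs of one query."""
    mean = statistics.mean(engine_times_ms)
    return statistics.pstdev(engine_times_ms) / mean if mean else 0.0


if __name__ == "__main__":
    # Made-up response fragment; real values come from your own queries.
    sample = {
        "QueryExecution": {
            "Statistics": {
                "QueryQueueTimeInMillis": 3000,
                "EngineExecutionTimeInMillis": 1000,
                "TotalExecutionTimeInMillis": 4000,
            }
        }
    }
    print(queue_share(sample))
    print(run_to_run_spread([90, 110]))
```

Tracking these two numbers over time for a representative query gives a concrete measure of how predictable the service is for your workload, instead of relying on anecdote.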
So can I get deterministic performance out of Athena?
If your queries are simple and not latency-sensitive, and your infrastructure is in a less crowded region, you may rarely encounter performance predictability issues. However, your mileage depends on several factors, such as when you run the query, which region you run it in, the volume of data you access, and your account's service-limit configuration, to name just a few.
What are my options?
If your interactive query workload is latency-sensitive and you want deterministic control over the performance of your queries and the experience of your end users, you need dedicated managed infrastructure. A managed Presto service gives you the best of both worlds: it abstracts away the complexity of managing a distributed query service while still giving you the knobs to tune the service to your workload.
Ahana provides a managed Presto service that can scale up and down depending on your performance needs. You can segregate workloads into different clusters or share a single cluster. You can also choose beefier infrastructure for business-critical and time-critical workloads and set up separate clusters for less critical needs. You choose the balance of price, performance, and flexibility that fits your business objectives.