In this article we’ll look at two different technologies in the data space and share more about which to use based on your use case and workloads.
The High Level Overview
To set the stage, it’s important to note that Databricks and Amazon Athena are two different beasts, so a direct feature-by-feature comparison is of limited use given the breadth of functionality each tool provides. For the purposes of this article, we’ll give an overview of each and share more on when it makes sense to use each tool.
Amazon Athena is a serverless query engine based on open-source Presto technology that uses Amazon S3 as the storage layer, whereas Databricks is an ETL, data science, and analytics platform that offers a managed version of Apache Spark. Databricks is widely known for its data lakehouse approach, which gives you the data management capabilities of the warehouse coupled with the flexibility and affordability of the data lake.
One could conceivably use both tools within the same deployment, although there will be some overlap around data warehousing and ad-hoc workloads. This overlap might have grown larger recently with the release of Amazon Athena for Apache Spark.
An alternative to these offerings is Ahana Cloud, a managed service for Presto that gives you a prescriptive approach to building an open data lakehouse using open source technologies and open formats.
What is Databricks?
Databricks is a unified analytics platform built on open-source Apache Spark, which combines data science, engineering, and business analysis in an integrated workspace.
What is Amazon Athena?
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?
Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.
We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.
According to the vendor:
Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance.
The Databricks website claims that Databricks offers world-record-setting performance directly on data in the data lake, and that it is up to 12x better price/performance than traditional cloud data warehouses.
The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets.
Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, particularly when compared to Athena.
According to user reviews:
Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally view both Databricks and Athena as tools that provide good performance for big data workloads, but with some drawbacks when it comes to ongoing management.
Users mention that Databricks has good performance for big data workloads, and quick lakehouse deployment. Some users have noted that Databricks makes it hard to profile code inside the platform. Additionally, some users have mentioned issues with logging for jobs, job scheduling, and job portability.
Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment, which avoids the issues with concurrency and non-deterministic performance.
We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.
According to the vendor:
Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale.
The Databricks website claims that Databricks is highly scalable and comes with various enterprise readiness features such as security and user access control, as well as the ability to integrate with other parts of the user’s ecosystem.
Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.
According to user reviews:
Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Both tools offer auto-scaling, although Databricks can provide dedicated clusters which might provide more consistent performance.
Users were happy with Databricks’s ability to autoscale clusters. They also note its open source technologies, and the ability to use different programming languages in the platform. Some users have mentioned challenges around security, user access control, and integration with other parts of the ecosystem. Users also note that Databricks is not compatible with some AI/ML libraries and can get expensive.
Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of a standard relational table type, and difficulty in debugging queries.
Usability, Ease of Use and Configuration
We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.
According to the vendor:
Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use.
The Databricks website claims that Databricks is simple to install and operate, and that it uses familiar languages and syntaxes such as SQL, making it easy to use.
The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana is a managed service which means you get more control over your deployment than you would with Athena, but it also takes care of the configuration parameters under the hood.
According to user reviews:
Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s usability.
Multiple reviews mentioned that Databricks provides a good user experience and has a relatively simple setup process. On the other hand, users have mentioned that Databricks has a steep learning curve, which could make it difficult to use for those without specialized knowledge. Additionally, some users have noted that the UI can be confusing or repetitive.
Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables and views and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.
- Athena charges a flat price of $5 per terabyte of data scanned. As your datasets and workloads grow, your Athena costs can grow quickly, which can lead to sticker shock. That’s why many Ahana customers were previous Athena users who were seeing unpredictable costs associated with their Athena usage: due to Athena’s serverless nature, you can never predict how many resources will be available.
- Databricks pricing is based on compute usage. The cost of using Databricks is calculated by multiplying the number of DBUs (Databricks Units) that you consumed by a corresponding dollar rate. This rate is influenced by the cloud provider you’re working with (e.g., the cost AWS charges for EC2 machines), geographical region, subscription tier, and compute type.
- Ahana uses pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.
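To make the first two pricing models concrete, here is a minimal Python sketch of the arithmetic described above. The $5 per terabyte figure is the Athena rate quoted in this article; the function names and the example DBU rate are hypothetical and only there for illustration, so check each vendor’s pricing page for the numbers that apply to you.

```python
# Illustrative cost arithmetic only. The Athena rate is the one quoted in
# this article; everything else (names, DBU rate) is a hypothetical example.

ATHENA_USD_PER_TB_SCANNED = 5.0


def athena_cost(tb_scanned: float,
                usd_per_tb: float = ATHENA_USD_PER_TB_SCANNED) -> float:
    """Athena-style cost: terabytes scanned times a flat per-terabyte rate."""
    if tb_scanned < 0:
        raise ValueError("tb_scanned must be non-negative")
    return tb_scanned * usd_per_tb


def databricks_cost(dbus_consumed: float, usd_per_dbu: float) -> float:
    """Databricks-style cost: DBUs consumed times the applicable dollar rate.

    The rate depends on cloud provider, region, subscription tier and
    compute type, so the caller has to supply it.
    """
    if dbus_consumed < 0 or usd_per_dbu < 0:
        raise ValueError("inputs must be non-negative")
    return dbus_consumed * usd_per_dbu


if __name__ == "__main__":
    # A query that scans 2.5 TB at the flat Athena rate.
    print(athena_cost(2.5))
    # 100 DBUs at a made-up rate of $0.50 per DBU.
    print(databricks_cost(100, 0.5))
```

The point of the sketch is that Athena’s bill is driven by how much data each query reads, while a Databricks bill is driven by how long and on what kind of compute your workloads run.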
While you can find some figures on Databricks’s pricing page, understanding how much you will end up paying can be quite difficult as it will depend on the type and volume of data, as well as whatever discount you could negotiate with AWS. Many of the user reviews mention the price of running Databricks as prohibitive, especially when compared to open-source Apache Spark.
Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized.
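As a rough illustration of why unoptimized S3 data matters, the Python sketch below compares a monthly bill when every query scans the full dataset against one where partitioning or a columnar format lets each query read only a fraction of it. The workload numbers and the `scan_fraction` parameter are assumptions made up for this example, not measurements.

```python
# Hypothetical workload figures; only the $5/TB rate comes from this article.

USD_PER_TB_SCANNED = 5.0


def monthly_scan_cost(queries_per_month: int,
                      tb_per_query: float,
                      scan_fraction: float = 1.0) -> float:
    """Monthly cost when each query reads `scan_fraction` of `tb_per_query`.

    scan_fraction=1.0 models unoptimized data (full scan); smaller values
    model partition pruning or reading only the needed columns.
    """
    if not 0.0 <= scan_fraction <= 1.0:
        raise ValueError("scan_fraction must be between 0 and 1")
    tb_scanned = queries_per_month * tb_per_query * scan_fraction
    return tb_scanned * USD_PER_TB_SCANNED


if __name__ == "__main__":
    full_scan = monthly_scan_cost(1000, 0.5)
    pruned = monthly_scan_cost(1000, 0.5, scan_fraction=0.125)
    print(f"full scan: ${full_scan:.2f}, optimized: ${pruned:.2f}")
```

Because the price is linear in data scanned, any reduction in bytes read per query translates directly into the same proportional reduction in cost.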
Ahana’s pricing is much simpler and also very transparent thanks to the pricing calculator. Similar to Athena, the pricing will just be part of your AWS bill.
Need a better alternative?
Get a demo of Ahana to learn how we deliver superior price/performance, control and usability as compared to Amazon Athena. Ahana will give you the starting blocks needed to build your Open Data Lakehouse.