AWS Athena vs Snowflake

The High Level Overview

Snowflake and Amazon Athena are both cloud analytics tools, but are significantly different in terms of their architecture. Athena is a serverless query engine based on open-source Presto technology, which uses Amazon S3 as the storage layer; whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

Snowflake would more often be considered as an alternative to Redshift or other cloud data warehouse technologies – typically used for situations where workloads are predictable, or where organizations are willing to pay a premium to provide very fast query performance. Storing large volumes of semi-structured data in data warehouses will typically be expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Athena.

What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads; and that this performance is ensured even when working at larger scale.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for better price-performance, especially compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally have positive opinions about Snowflake’s performance but note its high cost. They also have generally positive opinions about Athena’s performance, but note potential performance issues and an inability to scale the service.

Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Athena

– Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. 

– Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment, which avoids issues with concurrency and non-deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads, without impacting performance; and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Athena

The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Users note potential limitations in certain features for both tools, although both are capable of querying large datasets.

Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.
Athena

– Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively.

– However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service which means you get more control over your deployment than you would with Athena, but it also takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s usability. Users generally have positive opinions about Snowflake’s ease of use and configuration. They are also happy with the ease of deploying Athena in their AWS account, but mention drawbacks such as the lack of support for stored procedures and unclear error messages when debugging queries.

Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.
Athena

– Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple.

– However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana uses pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized.
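To make the pricing arithmetic above concrete, here is a rough sketch in Python. It uses only the list prices quoted in this section. The function names, the workload figures, the assumption that the storage rate is billed per month, and the reading of the compute rate as a plain per-second charge are all illustrative assumptions rather than anything published by either vendor.

```python
# List prices quoted in this section.
ATHENA_USD_PER_TB_SCANNED = 5.00
SNOWFLAKE_USD_PER_TB_STORED = 23.00   # assumed here to be a monthly rate
SNOWFLAKE_USD_PER_SECOND = 0.00056    # Standard Edition rate quoted above


def athena_query_cost(tb_scanned: float) -> float:
    """Athena bills only for the data a query scans."""
    return tb_scanned * ATHENA_USD_PER_TB_SCANNED


def snowflake_monthly_cost(tb_stored: float, compute_seconds: float) -> float:
    """Snowflake bills storage and compute separately.

    Assumption: every second of compute is charged at the quoted
    per-second rate, regardless of warehouse size.
    """
    storage = tb_stored * SNOWFLAKE_USD_PER_TB_STORED
    compute = compute_seconds * SNOWFLAKE_USD_PER_SECOND
    return storage + compute


# Hypothetical workload: 100 queries a month over a 2 TB dataset.
raw = 100 * athena_query_cost(2.0)  # every query scans all 2 TB
# Assume compression plus partitioning cuts scanned data to 10% (made-up ratio).
optimized = 100 * athena_query_cost(2.0 * 0.10)
# Assume 0.5 TB of compressed storage and 20 hours of compute (made-up figures).
snowflake = snowflake_monthly_cost(0.5, 20 * 3600)

print(f"Athena, unoptimized data:      ${raw:,.2f}")
print(f"Athena, optimized data:        ${optimized:,.2f}")
print(f"Snowflake (storage + compute): ${snowflake:,.2f}")
```

The point of the sketch is the shape of the two formulas rather than the specific totals: Athena's bill moves with the volume of data scanned, while Snowflake's has a storage term plus a compute term that grows with runtime.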

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for your data lake and lakehouse architecture. Ahana gives you SQL on S3 with better price performance than Athena and no vendor lock-in as compared to Snowflake.


Starburst vs Snowflake

The High Level Overview

Starburst and Snowflake are both in the data analytics space but are significantly different in terms of their architecture and use cases. Starburst is the corporate entity behind Trino, a SQL query engine that is a fork of Presto, whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to Starburst is Ahana Cloud, a managed service for Presto.

Snowflake would more often be considered as an alternative to Redshift or other cloud data warehouse technologies – typically used for situations where workloads are predictable, or where organizations are willing to pay a premium to provide very fast query performance. Storing large volumes of semi-structured data in data warehouses will typically be expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Amazon Athena.

What is Starburst?

Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. It enables users to query, analyze, and process data from multiple sources.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst

Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads; and that this performance is ensured even when working at larger scale.
Ahana Cloud

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for better price-performance, especially compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally have positive opinions about Starburst’s performance, but find it difficult to customize and integrate with external databases; Snowflake’s performance is seen as an advantage, but users note that it is expensive for some use cases.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Ahana Cloud

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment, which avoids issues with concurrency and non-deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst

The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads, without impacting performance; and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Ahana Cloud

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Overall, users generally think that both Starburst and Snowflake are capable of handling larger volumes of data, but note other potential issues with scalability.

Starburst

– Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable.

– Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst

The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Ahana Cloud

Ahana is a managed service which means you get more control over your deployment than you would with Athena, but it also takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s usability. Overall, users generally find both Starburst and Snowflake to be easy to use, although they have noted some areas of improvement for each.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.

Cost

  • Starburst’s pricing is based on credits and cluster size. The examples given on the company’s pricing page suggest a minimum spend of a few thousand dollars per month.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana Cloud uses pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Starburst’s pricing can be difficult to predict based on the information available online, but the company is clearly leaning towards an enterprise pricing model that looks at annual commitment rather than pay-as-you-go.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto in the cloud.


S3 Select vs. AWS Athena – The Quick Comparison


Data analysts and data engineers need simpler ways to access business data stored on Amazon S3. Amazon Athena and S3 Select are two services that allow you to retrieve records on S3 using regular SQL. What are the differences, and when should you use one vs the other?

S3 Select vs Athena: What’s the Difference?

The short answer:

Both services allow you to query S3 using SQL. Athena is a fully-featured query engine that supports complex SQL and works across multiple objects while S3 Select is much more limited, and used to retrieve a subset of data from a single object in S3 using simple SQL expressions. 

The long answer:

S3 Select is more appropriate for simple filtering and retrieval of specific subsets of data from S3 objects using basic SQL statements, with reduced data transfer costs and latency. Amazon Athena, on the other hand, is suitable for running complex, ad-hoc queries across multiple paths in Amazon S3, offering more comprehensive SQL capabilities, improved performance, and optimization options. Athena supports more file formats, compression types, and optimizations, while S3 Select is limited to CSV, JSON, and Parquet formats.

An alternative to Amazon Athena is Ahana Cloud, a managed service for Presto that offers up to 10x better price performance.

Here is a detailed comparison between the two services:

Query Scope:

  • S3 Select operates on a single object in S3, retrieving a specific subset of data using simple SQL expressions.
  • Amazon Athena can query across multiple paths, including all files within those paths, making it suitable for more complex queries and aggregations.

SQL Capabilities:

  • S3 Select supports basic SQL statements for filtering and retrieving data, with limitations on SQL expression length (256 KB) and record length (1 MB).
  • Athena offers more comprehensive ANSI SQL compliant querying, including group by, having, window and geo functions, SQL DDL, and DML.

Data Formats and Compression:

  • S3 Select works with CSV, JSON, and Parquet formats, supporting GZIP and BZIP2 (only for CSV and JSON) compression.
  • Athena supports a wider range of formats, including CSV, JSON, Apache Parquet, Apache ORC, and TSV, with broader compression support.

Integration and Accessibility:

  • S3 Select can be used with AWS SDKs, the SelectObjectContent REST API, the AWS CLI, or the Amazon S3 console.
  • Athena is integrated with Amazon QuickSight for data visualization and AWS Glue Data Catalog for metadata management. It can be queried directly from the management console or SQL clients via JDBC.

Performance and Optimization:

  • S3 Select is a rudimentary query service mainly focused on filtering data, reducing data transfer costs and latency.
  • Athena offers various optimization techniques, such as partitioning and columnar storage, which improve performance and cost-efficiency.
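As a rough illustration of why partitioning reduces the amount of data a query has to read, the sketch below models a table stored under Hive-style `dt=...` prefixes and compares a full scan with a scan of a single partition. The object keys, sizes, and the `bytes_scanned` helper are hypothetical; a real engine makes this decision from table metadata rather than from key strings.

```python
from typing import Dict, Optional

# Hypothetical listing: S3 key -> object size in bytes, laid out in
# Hive-style date partitions (dt=YYYY-MM-DD).
objects: Dict[str, int] = {
    "logs/dt=2023-01-01/part-0.parquet": 400_000_000,
    "logs/dt=2023-01-02/part-0.parquet": 500_000_000,
    "logs/dt=2023-01-03/part-0.parquet": 600_000_000,
}


def bytes_scanned(listing: Dict[str, int], dt: Optional[str] = None) -> int:
    """Sum the sizes of the objects a query would have to read.

    With dt=None every object is read; with a dt value only objects
    under the matching partition prefix are read.
    """
    marker = f"/dt={dt}/"
    return sum(
        size for key, size in listing.items() if dt is None or marker in key
    )


full_scan = bytes_scanned(objects)
one_day = bytes_scanned(objects, dt="2023-01-02")
print(f"Full scan:        {full_scan:,} bytes")
print(f"Single partition: {one_day:,} bytes")
```

Because Athena bills by data scanned, a query that filters on the partition column pays only for the partitions it touches.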

Schema Management:

  • S3 Select queries are ad hoc and don’t require defining a data schema before issuing queries.
  • Athena requires defining a data schema before running queries.

Pricing:

  • According to the first source listed below, the cost of S3 Select depends on three factors: the number of SELECT requests, the data returned, and the data scanned. As of December 2020, the prices for the US East (Ohio) region with Standard storage are:
    • Amazon S3 Select — $0.0004 per 1000 SELECT requests
    • Amazon S3 Select data returned cost — $0.0007 per GB
    • Amazon S3 Select data scanned cost — $0.002 per GB
  • With Athena, you are charged $5.00 per TB of data scanned, rounded up to the nearest megabyte, with a 10MB minimum per query. 
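The two pricing models above can be written down as small functions. This is a sketch using only the prices quoted in this list; the function names and the choice of binary units (1 TB = 1024 × 1024 MB) are assumptions made for illustration, and current prices should be checked against the AWS pricing pages.

```python
import math

# Prices quoted above (S3 Select figures are from December 2020).
S3_SELECT_USD_PER_1000_REQUESTS = 0.0004
S3_SELECT_USD_PER_GB_RETURNED = 0.0007
S3_SELECT_USD_PER_GB_SCANNED = 0.002
ATHENA_USD_PER_TB_SCANNED = 5.00
ATHENA_MIN_MB_PER_QUERY = 10
MB_PER_TB = 1024 * 1024  # assumption: binary units


def s3_select_cost(requests: int, gb_scanned: float, gb_returned: float) -> float:
    """Total S3 Select cost: requests + data scanned + data returned."""
    return (
        requests / 1000 * S3_SELECT_USD_PER_1000_REQUESTS
        + gb_scanned * S3_SELECT_USD_PER_GB_SCANNED
        + gb_returned * S3_SELECT_USD_PER_GB_RETURNED
    )


def athena_cost(mb_scanned: float) -> float:
    """Cost of one Athena query: round up to a whole MB, 10 MB minimum."""
    billed_mb = max(math.ceil(mb_scanned), ATHENA_MIN_MB_PER_QUERY)
    return billed_mb / MB_PER_TB * ATHENA_USD_PER_TB_SCANNED


# Hypothetical comparison: 1,000 lookups that each touch about 1 MB.
select_total = s3_select_cost(requests=1000, gb_scanned=1.0, gb_returned=0.1)
athena_total = 1000 * athena_cost(1.0)  # each query is billed as 10 MB
print(f"S3 Select: ${select_total:.5f}")
print(f"Athena:    ${athena_total:.5f}")
```

The 10 MB minimum means that many tiny Athena queries are each billed as if they scanned 10 MB, which is one reason S3 Select can be cheaper for small single-object lookups.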

An alternative to AWS Athena and S3 Select is Ahana Cloud, which gives you the ability to run complex queries at better price performance than Athena. Get a demo today.

Summary

| Feature | S3 Select | Athena |
|---|---|---|
| Query Scope | Single object (e.g., single flat file) | Multiple objects, entire bucket |
| Use Cases | Ad-hoc data retrieval | Log processing, ad-hoc analysis, interactive queries, joins |
| SQL Capabilities | Basic queries, filtering | Complex, ANSI-compliant SQL queries, aggregations, joins |
| File Formats | CSV, JSON, Parquet | CSV, JSON, Parquet, TSV, ORC, and more |
| Integration | Serverless apps, Big Data frameworks | AWS Glue Data Catalog, ETL capabilities |
| Query Interface | S3 API (e.g., Python boto3 SDK) | Management Console, SQL clients via JDBC |
| Performance Optimization | Limited, basic filtering | Partitioning, columnar storage, and more |
| Schema Definition | Not required | Required |

When should you use Athena, and when should you use S3 Select?

You should choose Amazon Athena for complex queries, analysis across multiple S3 paths, and integration with other AWS services, such as AWS Glue Data Catalog and Amazon QuickSight. Opt for S3 Select when you need to perform basic filtering and retrieval of specific subsets of data from a single S3 object.

Example Scenarios

Example 1: Log Analysis for a Web Application – Use Athena

Imagine you operate a web application, and you want to analyze log data stored in Amazon S3 to gain insights into user behavior and troubleshoot issues. In this scenario, you have multiple log files across different S3 paths, and you need to join and aggregate the data to derive meaningful insights.

In this case, you should use Amazon Athena because it supports complex SQL queries, including joins and aggregations, and can query across multiple paths in S3. With Athena, you can take advantage of its optimization features like partitioning and columnar storage to improve query performance and reduce costs.
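A minimal sketch of what this scenario might look like from Python is shown below. The table names (`requests`, `sessions`), column names, database name, and results bucket are all hypothetical, and the tables are assumed to have been defined in Athena already. The `submit` helper uses the boto3 Athena client's `start_query_execution` call and needs AWS credentials to run.

```python
def build_athena_request(database: str, output_location: str) -> dict:
    """Build the arguments for an Athena query that joins and aggregates logs."""
    # Hypothetical schema: one table of HTTP requests, one of user sessions.
    query = """
        SELECT s.country,
               COUNT(*) AS error_count
        FROM requests r
        JOIN sessions s ON r.session_id = s.session_id
        WHERE r.status >= 500
        GROUP BY s.country
        ORDER BY error_count DESC
    """
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit(request: dict) -> str:
    """Send the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder database and bucket names.
request = build_athena_request("weblogs", "s3://example-athena-results/")
print(request["QueryString"])
```

Athena runs queries asynchronously, so a real client would poll for completion using the returned execution ID before reading results from the output location.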

Example 2: Filtering Customer Data for a Marketing Campaign – Use S3 Select

Suppose you have a customer data file stored in Amazon S3, and you want to retrieve a subset of records for a targeted marketing campaign. The data file is in JSON format, and you need to filter records based on specific criteria, such as customer location or spending habits.

In this scenario, S3 Select is the better choice, as it is designed for simple filtering and retrieval of specific subsets of data from a single S3 object using basic SQL expressions. Using S3 Select, you can efficiently retrieve the required records, reducing data transfer costs and latency.
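The request for this scenario could be sketched as follows. The bucket, object key, and field names (`customer_id`, `city`) are hypothetical, and the file is assumed to be newline-delimited JSON. The `fetch_records` helper uses the boto3 S3 client's `select_object_content` call and needs AWS credentials to run.

```python
def build_select_request(bucket: str, key: str, city: str) -> dict:
    """Build the arguments for an S3 Select call that filters customers by city."""
    # Double any single quotes so the value is safe inside the SQL literal.
    safe_city = city.replace("'", "''")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": (
            "SELECT s.customer_id, s.city FROM S3Object s "
            f"WHERE s.city = '{safe_city}'"
        ),
        # Assumption: one JSON document per line.
        "InputSerialization": {"JSON": {"Type": "LINES"}},
        "OutputSerialization": {"JSON": {}},
    }


def fetch_records(request: dict) -> str:
    """Run the request and return the matching records as text."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    s3 = boto3.client("s3")
    response = s3.select_object_content(**request)
    chunks = []
    # The response payload is a stream of events; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


# Placeholder bucket and key.
request = build_select_request("example-bucket", "customers/2023.json", "Berlin")
print(request["Expression"])
```

Only the matching records travel over the network, which is where the reduced transfer cost and latency come from.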

Is S3 Select Faster than Athena?

Both S3 Select and Athena are serverless and rely on pooled resources provisioned by Amazon at the time the query is run. Neither is generally faster than the other. However, S3 Select can be faster than Athena for specific use cases, where retrieving a subset of the data is more efficient than processing the entire object. In cases where you only need the capabilities of S3 Select, it can also be easier to run compared to Athena, which requires a table schema to be defined.

Need a better SQL query engine for Amazon S3?

Ahana provides a managed Presto service that lets you run ad-hoc queries, interactive analytics, and BI workloads over your S3 storage. Learn more about Ahana or get a demo.

Sources used in this article:

https://ahana.io/answers/aws-s3-select-limitations/ 

https://aws.amazon.com/s3/pricing/

https://aws.amazon.com/athena/pricing/ 

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html

https://aws.amazon.com/blogs/storage/querying-data-without-servers-or-databases-using-amazon-s3-select/

https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428

https://stackoverflow.com/questions/49102577/what-is-difference-between-aws-s3-select-and-aws-athena

https://repost.aws/questions/QU1_wCZSxES6-QHh7QBTDYDA/s-3-select-vs-athena

4 Trino Alternatives for Better Price / Performance


Trino, a distributed SQL query engine, is known for its ability to process large amounts of semi-structured data using familiar SQL semantics. However, there are situations where an alternative may be more suitable. In this article, we explore four Trino alternatives that offer better price/performance for specific use cases.

What is Trino?

Trino is a distributed SQL query engine that supports various data sources, including relational and non-relational sources, through its connector architecture. It is a hard fork of the original Presto project, which was started at Facebook and later open-sourced in 2013. 

The creators of Presto, who later became cofounders/CTOs of Starburst, began the hard fork named Trino in early 2019. 

Trino has since diverged from Presto, and many of the innovations that the community is driving in Presto are not available in Trino. Trino is not hosted under the Apache Software Foundation (ASF) or Linux Foundation, but rather under the Trino Software Foundation, a non-profit corporation controlled by the cofounders of Starburst.


When Should You Use Trino?

Presto-based services – including Trino and PrestoDB – are designed for ad-hoc querying and analytical processing over data lakes, and allow developers to run interactive analytics against massive amounts of semi-structured data. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations. 

Trino or Presto should be used when a user wants to perform fast queries against large amounts of data from different data sources using familiar SQL semantics. It is suitable for organizations that want to use their existing SQL skills to query data without having to learn new complex languages.

Other Trino use cases that have been mentioned in the context of data science workloads are for running a specific federated query that requires high performance and when you need to connect to data via Apache Hive as the backend. 
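To illustrate what a federated query looks like in practice, the sketch below joins a table in a data lake catalog with a table in a relational database catalog. The catalog names (`hive`, `mysql`), schemas, tables, host, and user are hypothetical. The `run` helper assumes the `trino` Python client package is installed and a coordinator is reachable.

```python
def build_federated_query(min_views: int) -> str:
    """Return SQL that joins data from two different connectors."""
    # Each table is addressed as catalog.schema.table, so a single query
    # can read from the data lake and the relational database at once.
    return f"""
        SELECT c.customer_id, c.email, COUNT(*) AS page_views
        FROM hive.web.page_views v
        JOIN mysql.crm.customers c ON v.customer_id = c.customer_id
        GROUP BY c.customer_id, c.email
        HAVING COUNT(*) >= {int(min_views)}
    """


def run(query: str, host: str = "localhost", port: int = 8080, user: str = "analyst"):
    """Execute the query against a cluster and return all rows."""
    from trino.dbapi import connect  # assumes the `trino` package is installed

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(query)
    return cur.fetchall()


query = build_federated_query(min_views=10)
print(query)
```

The same fully qualified naming style applies to PrestoDB, so the SQL itself would carry over to a Presto cluster with equivalent catalogs configured.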

When Should You Look at Alternatives to Trino?

While Trino is a powerful and popular framework, there are situations where you might want to consider an alternative. These include:

  • If you’re looking for an open-source project that has a strong governance structure and charter, then Trino is not the best choice since it is hosted under a vendor-controlled non-profit corporation. Users who prefer a project hosted under a well-known organization such as the ASF or The Linux Foundation may choose to use another tool instead of Trino.
  • If you are looking for services and support from vendors, you should compare the functionality and price/performance provided by Trino to alternative tools such as Ahana, Amazon Athena, or Dremio.
  • If you’re looking for a database management system that stores and manages data, then Trino is not suitable. Like Presto, Trino is a SQL query engine that queries the connected data stores and does not store data (although both tools have the option to write the results of a query back to object storage).

4 Alternatives to Trino 

If you’re looking for an alternative to Trino, consider one of the following:

  1. Open Source PrestoDB 
  2. Ahana, managed service for Presto on AWS
  3. Amazon Athena, serverless service for Presto/Trino on AWS
  4. Dremio

1. PrestoDB – the original Presto distribution used at Facebook

As mentioned above, Trino originated as a hard fork of PrestoDB. It was previously known as PrestoSQL before being rebranded in December 2020. The Presto Software Foundation was also rebranded as Trino Software Foundation to reflect the fact that these are two separate and divergent projects.

While Trino and PrestoDB share a common history, they have different development teams and codebases, and may have different features, optimizations, and bug fixes.

Some key differences between PrestoDB and Trino:

  • PrestoDB is tested by and used by Facebook, Uber, Bytedance, and other internet-scale companies, while Trino is not. 
  • Presto is one of the fastest-growing open-source projects in the data analytics space.
  • The Presto Foundation (part of The Linux Foundation) oversees PrestoDB, whereas Trino is mainly steered by a single company (Starburst). 
  • Presto offers access to recent and current innovations in PrestoDB including Project Aria, Project Presto Unlimited, additional user-defined functions, Presto-on-Spark, Disaggregated Coordinator, and RaptorX Project. 

See the full comparison: Presto vs Trino.

There are several ways you can get started with open source Presto, including running it on-premises, through a Docker container, and more (check out our getting started with Presto page). 

2. Ahana Cloud: managed service for Presto on AWS

Ahana, a member of the Presto Foundation and contributor to the PrestoDB project, offers a managed, cloud-native version of open-source Presto – Ahana Cloud. It gives you a managed service offering for Presto by taking care of the hundreds of configurations and tuning parameters under the hood while still giving you more control and flexibility as compared to a serverless offering.

Ahana also includes some features like Data Lake Caching for better performance and AWS Lake Formation integration to take advantage of granular data security.

Check out a demo of Ahana Cloud.

3. Amazon Athena: managed Presto/Trino service provisioned by AWS

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL. Originally based on PrestoDB, Athena now incorporates features from both Presto and Trino.

In our comparison between Athena and Trino-based Starburst, we concluded that:

  • Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3, but there are some key differences. 
  • Starburst has features like Cached Views and pushdown capabilities, while Athena is optimized for fast performance with Amazon S3 and executes queries automatically in parallel. 
  • Users generally regard both Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues. 
  • Users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging. 
  • In terms of cost, Athena charges a flat price of $5 per terabyte of data scanned, while Starburst’s pricing is more complex.

4. Dremio: serverless query engine based on Apache Arrow

Dremio, which is built on Apache Arrow, is another query engine that enables high-performance analytics directly on data lake storage. 

According to Dremio’s website, Dremio offers interactive analytics directly on the lake and is often used for BI dashboards, whereas Starburst primarily supports ad-hoc workloads only. Dremio provides self-service with a shared semantic layer for all users and tools, while Starburst lacks a semantic layer and data curation capabilities. 

On the other hand, Starburst touts a cost-based optimizer that helps define an optimal plan based on the table statistics and other info it receives from plugins. Starburst’s custom connectors are optimized to be run in parallel, taking advantage of Trino’s MPP architecture. 

While both platforms offer similar products, Dremio seems to be more focused on BI-oriented workloads reading from data lakes, whereas Starburst might be better suited for ad-hoc and federated queries.

Try Ahana Cloud’s managed Presto for free

If you’re evaluating SQL query engines, you’re in the right place. The easiest way to get started is with Ahana Cloud for Presto. You can try it for yourself, but we recommend scheduling a quick, no-strings-attached call with our solutions engineering team to understand your requirements and set up the environment. Get started now

Exploring Data Warehouse, Data Mesh, and Data Lakehouse: What’s right for you?

We’re hosting a free hands-on lab on building your own Data Lakehouse in AWS. You’ll get trained by Presto and Apache Hudi experts.

When it comes to data management, there are various approaches and architectures for storing, processing, and analyzing data. In this article we’ll discuss three of the more popular approaches in the market today – the data warehouse, data mesh, and data lakehouse. 

Each approach has its own unique features, advantages, and disadvantages, and understanding the differences between them is crucial for organizations to make informed decisions about their data strategy. We’ll take you through each one and help you determine which approach is best suited for your organization’s data needs.

Data Warehouse: Centralized but Inflexible

A Data Warehouse is a centralized repository that stores structured data from various sources for analysis and reporting. Typically it’s a relational database and optimized for read-heavy workloads with a schema-on-write approach. 

Advantages of a data warehouse are that it’s a single source of truth for structured data, it provides high performance querying and dashboarding/reporting capabilities, and it supports business intelligence and analytics use cases. 

On the other hand, some of its disadvantages are that it requires data to be pre-processed and structured, it has limited flexibility in handling unstructured data and new data types, and it can be expensive to implement and maintain.

Learn more about choosing between data warehouse and data lake.

Data Mesh: Flexible but Complicated

A Data Mesh is a distributed and decentralized approach to data architecture that focuses on domain-driven design and self-service data access. Key features include decentralized data ownership and control, data organized by domain rather than centralized by function, data treated as a product that is discoverable and reusable, and self-service data access for domain teams.

Advantages of a data mesh are that it offers agility and flexibility in handling complex and evolving data environments, it facilitates collaboration between data teams and domain teams, and it promotes data democratization and data-driven culture.

Disadvantages are that it requires a cultural shift and new ways of working to implement, distributed data ownership involves data governance and security challenges, and it requires strong data lineage and metadata management to ensure data quality and consistency. Performance can also be a problem if you’re doing joins across many data sources, because your query will only be as fast as your slowest connection.

Data Lakehouse: Hybrid Approach

A Data Lakehouse is a hybrid approach that combines the best features of data warehouses and data lakes. Those features include support for both structured and unstructured data, support for both read and write-heavy workloads, and a schema-on-read approach.
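The difference between the warehouse's schema-on-write approach and the lakehouse's schema-on-read approach can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular product implements it; the `SCHEMA` mapping, the column names, and both helper functions are hypothetical.

```python
import json

# Hypothetical schema for illustration: column name -> Python type.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema-on-write: validate and coerce a record before it is stored."""
    row = {}
    for column, col_type in SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = col_type(record[column])
    table.append(row)


def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is; structure is applied when queried."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            column: (col_type(record[column]) if column in record else None)
            for column, col_type in SCHEMA.items()
        }


# Warehouse-style load: the record must fit the schema up front.
warehouse_table = []
write_with_schema(warehouse_table, {"user_id": "7", "amount": "19.5"})

# Lake-style storage: raw lines with extra or missing fields are accepted,
# and the schema is only applied when the data is read.
lake_files = [
    '{"user_id": 7, "amount": 19.5, "note": "extra field"}',
    '{"user_id": 8}',
]
lake_rows = list(read_with_schema(lake_files))
```

In the write path an incomplete record is rejected before it lands, while in the read path the same kind of record is stored and simply surfaces with a missing value at query time.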

Advantages of a data lakehouse are that it offers flexibility in handling both structured and unstructured data, it supports real-time analytics and machine learning use cases, and it’s cost-effective compared to traditional data warehouses. Lakehouses are designed to handle both batch and real-time processing of data.

Disadvantages are that it requires data governance and management policies to prevent data silos and ensure data quality, complex data integration and transformation may require specialized skills and tools, and there may be performance issues for ad-hoc queries and complex joins.

Picking the data architecture that’s best for your use case

Below is a matrix we’ve put together that lists which of these approaches best fits specific requirements and use cases.

 | Data Warehouse | Data Mesh | Data Lakehouse
Structured data
Unstructured data
Fast access to data
Real-Time Data Processing: Possible with additional tools
Data Governance | Centralized | Decentralized | Centralized or Decentralized
Cost-effective: Depends on specific use case
Scalability
Self-service data discovery
Data Integration: Requires specialized work
Analytics capabilities: Evolving

As shown in the matrix, each architecture has its own strengths and weaknesses across different key capabilities.

A data warehouse architecture is well-suited for structured data, offers strong data governance, and mature analytics capabilities, but may be limited in its scalability and ability to handle unstructured data and real-time processing.

A data mesh architecture offers highly scalable and decentralized data management, high developer productivity, and flexible data governance, but may require additional tools for real-time processing and careful planning for data integration.

A data lakehouse architecture is well-suited for unstructured data, offers good scalability and data integration capabilities, and is reasonably cost-effective, but may be limited in its ability to handle highly structured data and may require varied data governance strategies.

The Open Data Lakehouse

At Ahana, we’re building the Open Data Lakehouse with Presto at its core. Presto, the open source SQL query engine, powers the analytics on your Open Data Lakehouse. We believe the data lakehouse approach strikes the best balance between flexibility, scalability, and cost-effectiveness, making it a favorable choice for organizations seeking a modern data management solution.

You can learn more about our approach to the Data Lakehouse by downloading our free whitepaper.

AWS Athena vs. Databricks

In this article we’ll look at two different technologies in the data space and share more about which to use based on your use case and workloads.

The High Level Overview

To set the stage, it’s important to note that Databricks and Amazon Athena are two different beasts, so a direct feature-by-feature comparison is of limited use given the breadth of functionality each tool provides. For the purposes of this article, we’ll give an overview of each and share more on when it makes sense to use each tool.

AWS Athena is a serverless query engine based on open-source Presto technology, which uses Amazon S3 as the storage layer; whereas Databricks is an ETL, data science, and analytics platform which offers a managed version of Apache Spark. Databricks is widely known for its data lakehouse approach which gives you the data management capabilities of the warehouse coupled with the flexibility and affordability of the data lake.

One could conceivably use both tools within the same deployment, although there will be some overlap around data warehousing and ad-hoc workloads. This overlap might have grown larger recently with the release of Amazon Athena for Apache Spark.

An alternative to these offerings is Ahana Cloud, a managed service for Presto that gives you a prescriptive approach to building an open data lakehouse using open source technologies and open formats.

What is Databricks?

Databricks is a unified analytics platform built on open-source Apache Spark, which combines data science, engineering, and business analysis in an integrated workspace.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Databricks

The Databricks website claims that Databricks offers world-record-setting performance directly on data in the data lake, and that it is up to 12x better price/performance than traditional cloud data warehouses.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for better price-performance, particularly when compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally view both Databricks and Athena as tools that provide good performance for big data workloads, but with some drawbacks when it comes to ongoing management.

Databricks

Users mention that Databricks has good performance for big data workloads, and quick lakehouse deployment. Some users have noted that Databricks makes it hard to profile code inside the platform. Additionally, some users have mentioned issues with logging for jobs, job scheduling, and job portability.
Athena

Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment – with no concurrency issues or non-deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Databricks

The Databricks website claims that Databricks is highly scalable and comes with various enterprise readiness features such as security and user access control, as well as the ability to integrate with other parts of the user’s ecosystem.
Athena

Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Both tools offer auto-scaling, although Databricks can provide dedicated clusters which might provide more consistent performance.

Databricks

Users were happy with Databricks’s ability to autoscale clusters. They also note its open source technologies, and the ability to use different programming languages in the platform. Some users have mentioned challenges around security, user access control, and integration with other parts of the ecosystem. Users also note that Databricks is not compatible with some AI/ML libraries, that securing and controlling access is difficult, and that it can get expensive.
Athena

Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Databricks

The Databricks website claims that Databricks is simple to install and operate, and that it uses familiar languages and syntaxes such as SQL, making it easy to use.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service which means you get more control over your deployment than you would with Athena, but it also takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s usability.

Databricks

Multiple reviews mentioned that Databricks provides a good user experience and has a relatively simple setup process. On the other hand, users have mentioned that Databricks has a steep learning curve, which could make it difficult to use for those without specialized knowledge. Additionally, some users have noted that the UI can be confusing or repetitive.
Athena

Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. As your datasets and workloads grow, your Athena costs can grow quickly, which can lead to sticker shock. That’s why many Ahana customers were previous Athena users who were seeing unpredictable costs associated with their Athena usage – due to Athena’s serverless nature, you can never predict how many resources will be available.
  • Databricks pricing is based on compute usage. The cost of using Databricks is calculated by multiplying the number of DBUs (Databricks Units) that you consumed by the corresponding dollar rate. This rate is influenced by the cloud provider you’re working with (e.g., the cost AWS charges for EC2 machines), geographical region, subscription tier, and compute type.
  • Ahana is pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.
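The two pricing formulas described above can be written down as a rough sketch. The $5 per terabyte figure is the one quoted in this article; the DBU count and the per-DBU rate in the example are hypothetical placeholders, since the real rate depends on cloud provider, region, tier, and compute type.

```python
ATHENA_USD_PER_TB_SCANNED = 5.0  # flat rate quoted in this article


def athena_cost(tb_scanned: float) -> float:
    """Athena: cost scales with the amount of data scanned by queries."""
    return tb_scanned * ATHENA_USD_PER_TB_SCANNED


def databricks_cost(dbus_consumed: float, usd_per_dbu: float) -> float:
    """Databricks: DBUs consumed multiplied by the applicable dollar rate."""
    return dbus_consumed * usd_per_dbu


# Hypothetical month: 12 TB scanned in Athena, 1,000 DBUs at an assumed $0.50 each.
athena_bill = athena_cost(12)
databricks_bill = databricks_cost(1000, 0.5)
print(f"Athena: ${athena_bill:.2f}, Databricks: ${databricks_bill:.2f}")
```

The sketch covers only the usage-based part of each bill; in the Databricks case, the underlying cloud infrastructure is a separate factor in what you end up paying.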

While you can find some figures on Databricks’s pricing page, understanding how much you will end up paying can be quite difficult as it will depend on the type and volume of data, as well as whatever discount you could negotiate with AWS. Many of the user reviews mention the price of running Databricks as prohibitive, especially when compared to open-source Apache Spark. 

Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized. 

Ahana’s pricing is much simpler and also very transparent thanks to the pricing calculator. Similar to Athena, the pricing will just be part of your AWS bill.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability as compared to Amazon Athena. Ahana will give you the starting blocks needed to build your Open Data Lakehouse.

Starburst vs. Athena: Evaluating different Presto vendors

The High Level Overview – Athena vs. Starburst

Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3. Athena is a serverless service based on open-source Presto technology, while Starburst is the corporate entity behind a fork of Presto called Trino. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

All of these tools will cover similar ground in terms of use cases and workloads. Understanding the specific limitations and advantages of each tool will help you decide which one is right for you.

What is Starburst?
Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Amazon Athena?
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?
Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst
Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Athena
The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana Cloud
Ahana has multi-level data caching with RaptorX which includes one-click caching built-in to every Presto cluster. This can give you up to 30X query performance improvements.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally regard Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues.

Starburst
Reviewers who were happy with Starburst’s performance mentioned that it provides quick and efficient access to data, is able to handle large volumes of data and concurrent queries, and has good pluggability, portability, and parallelism. Some reviewers noted that tuning can be cumbersome, and that storing metadata in the Hive metastore creates overheads which can slow down performance. Others mentioned the cost associated with customization, the need for technical expertise to deploy Starburst Enterprise, and occasional performance issues when dealing with large datasets.
Athena
Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst
The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Athena
The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Additionally, Athena integrates out-of-the-box with AWS Glue, which allows users to create a unified metadata repository across various services, crawl data sources to discover data and populate their Data Catalog with new and modified table and partition definitions, and maintain schema versioning.
Ahana Cloud
Ahana has an autoscaling feature that helps you manage your Presto clusters by automatically adjusting the number of worker nodes in the Ahana-managed Presto cluster. You can read the docs for more information.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Users see both tools as capable of operating at scale, but both have limitations in this respect as well.

Starburst
Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable. Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Athena
Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst
The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Athena
The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana Cloud
Ahana gives you Presto simplified – no installation, no AWS AMIs or CFTs, and no configuration needed. You can be running in 30 minutes, you get a built-in catalog and one-click integration to your data sources, and it’s all cloud native running on AWS EKS.

According to user reviews:

Overall, users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging.

Starburst
Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful. However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Athena
Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Starburst’s pricing is more complex as it is based on credits and cluster size. The examples given on the company’s pricing page hint at a minimum spend of a few thousand dollars per month.
  • Ahana Cloud is pay-as-you-go through your AWS bill based on the compute you use. There’s a pricing calculator you can use to get an idea.
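To see why compressing and partitioning data reduces an Athena bill, here is a back-of-the-envelope sketch. It assumes that cost is simply the data scanned multiplied by the $5 per terabyte rate quoted above; the compression ratio and the fraction of partitions a query reads are hypothetical inputs, and real savings depend on file format and query patterns.

```python
USD_PER_TB_SCANNED = 5.0  # flat rate quoted in this article


def scanned_tb(raw_tb: float, compression_ratio: float = 1.0,
               partition_fraction: float = 1.0) -> float:
    """Estimate the data a query scans after compression and partition pruning.

    compression_ratio: raw size divided by stored size (4.0 means 4x smaller).
    partition_fraction: share of the partitions the query actually reads.
    """
    return raw_tb / compression_ratio * partition_fraction


def query_cost(raw_tb: float, compression_ratio: float = 1.0,
               partition_fraction: float = 1.0) -> float:
    """Estimated cost in dollars of one query over the dataset."""
    return scanned_tb(raw_tb, compression_ratio, partition_fraction) * USD_PER_TB_SCANNED


# Hypothetical 10 TB dataset: a full scan versus a query over 4x-compressed
# data that only needs a quarter of the partitions.
full_scan = query_cost(10)
optimized = query_cost(10, compression_ratio=4.0, partition_fraction=0.25)
print(f"full scan: ${full_scan:.2f}, optimized: ${optimized:.3f}")
```

Because the two factors multiply, modest compression combined with partition pruning cuts the scanned volume, and therefore the per-query cost, far more than either does alone.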

While the specifics of your cloud bill will eventually depend on the way you use these tools and the amount of data you process in them, Athena and Ahana Cloud have a simpler cost structure and offer a more streamlined on-demand model.

Need a better alternative to Athena and Starburst?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto.

Hive vs Presto vs Spark for Data Analysis

Apache Hive, Apache Spark, and Presto are all popular open-source tools for working with data lakes and data lakehouses. However, these tools typically serve different functions – and while some of these overlap, there are also many differences, typically making them complementary rather than competitive. Let’s compare Presto, Hive, and Spark, and see how each of these tools can be used for large-scale data analysis.

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis. Hive provides an SQL-like interface called HiveQL to query large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.
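A query that spans multiple data sources relies on Presto’s fully qualified table names, which take the form catalog.schema.table. The sketch below only assembles such a query as a string; the catalog, schema, table, and column names are all hypothetical, and actually running it would require a Presto cluster with matching catalogs configured and a client to submit it, neither of which is shown here.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs: one backed by a Hive metastore over S3, one by MySQL.
orders = qualified("hive", "sales", "orders")
customers = qualified("mysql", "crm", "customers")

# A single SQL statement that joins tables living in two different systems.
FEDERATED_QUERY = f"""
SELECT c.region, SUM(o.total) AS revenue
FROM {orders} AS o
JOIN {customers} AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region
"""

print(FEDERATED_QUERY)
```

The engine resolves each catalog to its connector, so the join is expressed in one statement even though the tables are stored in separate systems.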

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto vs Hive vs Spark: The Comparison

Commonalities

  • All three projects – Presto, Hive, and Spark – are community-driven open-source software released under the Apache License 2.0. Hive and Spark are Apache Software Foundation projects.
  • They are distributed “Big Data” software frameworks.
  • BI tools connect to them using JDBC/ODBC.
  • They provide query capabilities on top of Hadoop and AWS S3.
  • They have been tested and deployed at petabyte-scale companies.
  • They can be run on-prem or in the cloud.

Differences

|  | Hive | Presto | Spark |
|---|---|---|---|
| Function | MPP SQL engine | MPP SQL engine | General purpose execution framework |
| Processing Type | Batch processing using Apache Tez or MapReduce compute frameworks | Executes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/O | Optimized directed acyclic graph (DAG) execution engine and actively caches data in-memory |
| SQL Support | HiveQL | ANSI SQL | Spark SQL |
| Usage | Optimized for query throughput | Optimized for latency | General purpose, often used for data transformation and Machine Learning workloads |
| Use cases | Large data aggregations | Interactive queries and quick data exploration | General purpose, often used for data transformation and Machine Learning workloads |
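
The difference between batch-style and pipelined processing in the comparison above can be sketched with a toy example. This is not how Hive or Presto are implemented. It only contrasts a style where each stage fully materializes its output before the next stage starts with a style where rows stream from one stage to the next. The function names and the "rows materialized" counter are hypothetical, introduced purely for illustration.

```python
from typing import Iterator, Tuple


def scan(n: int) -> Iterator[int]:
    """Pretend table scan that yields rows 0..n-1 one at a time."""
    for i in range(n):
        yield i


def batch_style(n: int) -> Tuple[int, int]:
    """Each stage writes out its full result before the next stage begins."""
    scanned = list(scan(n))                           # stage 1 output, fully stored
    filtered = [x for x in scanned if x % 2 == 0]     # stage 2 output, fully stored
    rows_materialized = len(scanned) + len(filtered)
    return sum(filtered), rows_materialized


def pipelined_style(n: int) -> Tuple[int, int]:
    """Rows flow through the stages one by one; nothing is stored in between."""
    filtered = (x for x in scan(n) if x % 2 == 0)     # lazy generator, no list built
    return sum(filtered), 0


print(batch_style(10))
print(pipelined_style(10))
```

Both styles compute the same answer. The batch style holds every intermediate row, while the pipelined style holds none, which is the intuition behind avoiding unnecessary I/O between stages.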

Hive vs Presto

Both Presto and Hive are used to query data in distributed storage, but Presto is more focused on analytical querying whereas Hive is mostly used to facilitate data access. Hive provides a virtual data warehouse that imposes structure on semi-structured datasets, which can then be queried using Spark, MapReduce, or Presto itself. Presto is a compute and querying layer that can connect to the Hive Metastore or other data catalogs, and can work with table formats such as Apache Iceberg.

  • Common use case: Query data stored in distributed storage
  • Hive: Facilitates data access
  • Presto: Focused on analytical querying

Conclusion

The right SQL engine depends on your requirements, but if Presto is the engine you are looking for, we suggest you give Ahana Cloud for Presto a try.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. Basically we’ve made it really easy to harness the power of Presto without having to worry about the thousands of tuning and config parameters, adding data sources, etc.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.

Ahana Cloud for Presto Versus Amazon EMR

In this brief post, we’ll discuss some of the benefits of Ahana Cloud over Amazon Elastic MapReduce (EMR). While EMR offers a choice of big data compute frameworks, that flexibility comes with an operational and configuration burden. When it comes to low-latency interactive querying on big data that just works, Ahana Cloud for Presto offers a much lower operational burden and Presto-specific optimizations.

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases. In fact, you can run federated queries across your data sources. Developed at Facebook, Presto is supported by the Presto Foundation, an independent nonprofit organization under the auspices of the Linux Foundation. Presto is used by leading technology companies, such as Facebook, Twitter, Uber, and Netflix.

Amazon EMR is a big data platform hosted in AWS. EMR allows you to provision a cluster with one or more big data technologies, such as Hadoop, Apache Spark, Apache Hive, and Presto. Ahana Cloud for Presto is the easiest cloud-native managed service for Presto, empowering data teams of all sizes. As a focused Presto solution, here are a few of Ahana Cloud’s benefits over Amazon EMR:

Less configuration. Born of the Hadoop era, Presto has many configuration parameters, spread across several files, that need to be set and tuned correctly. With EMR, you have to configure these yourself. With Ahana Cloud, we tune more than 200 parameters out of the box, so when you spin up a cluster, you get excellent query performance from the get-go. Out of the box, Ahana Cloud also provides an Apache Superset sandbox for administrators to validate connecting to, querying and visualizing your data.

Easy-to-modify configuration. Ahana Cloud offers the ability to not only spin up and terminate clusters, but also stop and restart them, allowing you to change the number of Presto workers and add or remove data sources. With EMR, any manual changes to the number of Presto workers and data sources require a new cluster or manually restarting the services yourself. Further, adding and removing data sources is done through a convenient user interface instead of modifying low-level configuration files.

Optimizations. As a Presto managed service, Ahana Cloud will continually provide optimizations relevant to Presto. For example, Ahana recently released data lake I/O caching. Based on the RubiX open source project and enabled with a single click, the caching eliminates redundant reads from your data lake when the same data is read repeatedly. This caching results in up to 5x query performance improvement and up to 85% latency reductions for concurrent workloads. Finally, idle clusters processing no queries can automatically scale down to a single Presto worker to save costs while allowing for a quick warm-up.
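
The general idea behind eliminating redundant reads can be sketched with a minimal read-through cache. This is a toy illustration of the caching concept only. It is not RubiX and not Ahana's implementation, and the class name, the fetch function, and the file names are all hypothetical.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Serve repeated reads locally; go to the backing store only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        data = self._fetch(key)      # the expensive remote read
        self._store[key] = data      # keep a local copy for later reads
        return data


remote_reads: List[str] = []


def fetch_from_data_lake(path: str) -> bytes:
    """Stand-in for a remote object-store read; records every call it receives."""
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fetch_from_data_lake)
for path in ["part-0.parquet", "part-1.parquet", "part-0.parquet", "part-0.parquet"]:
    cache.read(path)

print(len(remote_reads), cache.hits, cache.misses)
```

Four reads were requested, but only two reached the simulated data lake. A production cache would also need eviction and invalidation, which this sketch leaves out.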

If you are experienced at tuning Presto and want full control of the infrastructure management, Amazon EMR may be the choice for you. If simplicity and accelerated go-to-market without needing to manage a complex infrastructure are what you seek, then Ahana Cloud for Presto is the way to go. Sign up for our free trial today.

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, solely on AWS. An EMR cluster with Spark is very different to an EMR Presto cluster:

  • EMR is a managed big data platform that automates provisioning, tuning, and similar tasks for big data workloads. Presto is a distributed SQL query engine, sometimes described as a federation middle tier. Using EMR, users can spin up, scale and deploy Presto clusters, and connect them to many different data sources; common integrations include Elasticsearch, HBase, and Amazon S3.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark core does not support SQL; for SQL support you use the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive or ad hoc queries, nor for federating data from multiple sources; for these, Presto is a better choice.

There are some similarities: Spark and Presto on EMR clusters share distributed and parallel architectures, and both are designed for dealing with big data. PrestoDB is included in Amazon EMR release version 5.0.0 and later.

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications. EMR offers easy provisioning, auto-scaling (including for Presto clusters), fault tolerance, and, as you’d expect, good integration with AWS services such as S3, DynamoDB and Redshift. An EMR cluster may be configured as “long running” or as a transient cluster that auto-terminates once the processing job(s) have completed.

EMR comes with some disadvantages:

  • EMR does not offer support for Presto: users must create their own Presto metastore, configure connectors, and install and configure any tools they need.
  • EMR can be complex. If you have a database requirement, then provisioning EMR, Spark and S3 and ensuring you use the right file formats, networking, roles and security can take much longer than deploying a packaged MPP database solution like Redshift.
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain data like you can with EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes). The only workaround is to write all your EMR data to S3 before each shutdown, and then ingest it all back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3, and manage the cost implications.
  • On its own, EMR doesn’t include any tools: no analytical tools, BI, visualization, SQL Lab or notebooks; no HBase or Flume; not even an HDFS access CLI. So you have to roll your own by doing the tool integrations yourself and dealing with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time like you can with Presto, Cloudera, Spark, and most other frameworks. Similarly EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR – it uses the MapReduce execution framework which is designed for large batch processing and not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2, installing Hadoop and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster with Spark vs Presto technologies your choice ultimately boils down to the use cases you are trying to solve. 

Spark SQL vs Presto

Spark SQL and Presto have become increasingly popular due to their capabilities in processing large amounts of data from various sources. In this blog post, we will dive deeper into Spark SQL and Presto, discussing their similarities, differences, and how they can be utilized to meet your specific data processing needs.

How these two tools are similar:

  • Both of these software frameworks are open source and are designed to handle large amounts of data. They operate in a distributed, parallel, and in-memory manner, which enables them to process data at high speeds.
  • BI tools connect with these frameworks using JDBC/ODBC connections.
  • They have been thoroughly tested and deployed by companies that process petabytes of data.
  • These frameworks can be executed either on-premises or in the cloud, and they can be containerized for a flexible and scalable deployment option.

Differences:

  • Presto is a query engine that provides access to and consolidation of data from various data sources using ANSI SQL:2003. It is generally deployed as a middle layer for federation. Spark, on the other hand, is a versatile cluster-computing framework whose core does not natively support SQL. To add structured data processing capabilities to Spark, you use the Spark SQL module, which has also been ANSI SQL:2003 compliant since Spark 2.0.
  • Presto is frequently used to support interactive SQL queries that are mainly analytical in nature but can also execute SQL-based ETL operations. Spark has more general-purpose applications, often utilized for data transformation and machine learning workloads.
  • By default, Presto allows querying of data in object stores like S3 and has many connectors available. It also works exceptionally well with Parquet and ORC format data. In contrast, Spark must rely on Hadoop file APIs to access S3, or on purchased Databricks features. It also has limited connectors for data sources.

Many users today are learning about both Presto and Spark. The points above lay out many of the differences between Presto and Spark SQL, and how the two can be compared.

If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.

You can see our previous guide to compare the Spark execution engine vs Presto, or our comparison between Databricks and AWS Athena.

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Spark Streaming Alternatives

When researching Spark alternatives it really depends on your use case. Are you processing streaming data or batch data? Do you prefer an open or closed source/proprietary alternative?  Do you need SQL support?

With that in mind, let’s first look at ten closed-source alternatives to Spark Streaming:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice. Also 
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – a streaming data integration and operational intelligence platform designed to enable continuous query and processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines. Features graphical stream visualizations.
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

Here are five open source alternatives to Spark Streaming:

  • Apache Flink
  • Apache Apex
  • Apache Beam
  • Apache Samza
  • Apache Storm

Details about each alternative:

  1. Apache Flink – considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for both stream and batch processing at scale. It provides a fault-tolerant, operator-based model for streaming and computation, rather than the micro-batch model of Apache Spark.
  2. Apache Beam – a programming model for batch and streaming data processing jobs whose pipelines can be executed on multiple execution environments.
  3. Apache Apex – an enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – a distributed stream processing framework.
  5. Apache Storm – a distributed realtime computation system.
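
The contrast between an operator-based (per-event) streaming model and a micro-batch model, mentioned for Apache Flink above, can be sketched in a few lines. This toy example uses neither Flink nor Spark. It only shows when results become available in each style. The function names are hypothetical, and batches here are cut by event count for simplicity, whereas real micro-batch systems typically cut them by time interval.

```python
from typing import Iterable, Iterator, List


def per_event(events: Iterable[int]) -> Iterator[int]:
    """Operator-style processing: emit an updated running total after every event."""
    total = 0
    for value in events:
        total += value
        yield total


def micro_batch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batch processing: buffer events and emit a total once per batch."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            yield total
            batch = []
    if batch:                        # flush the final, possibly partial batch
        total += sum(batch)
        yield total


events = [3, 1, 4, 1, 5]
per_event_results = list(per_event(events))
micro_batch_results = list(micro_batch(events, batch_size=2))

print(per_event_results)
print(micro_batch_results)
```

Both styles end at the same final total, but the per-event version produces a result for every incoming event, while the micro-batch version only produces one per batch. That difference is where the latency gap between the two models comes from.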

So there you have it. Hopefully you can now find a suitable alternative to Spark Streaming. Learn more about Spark SQL vs Presto in our comparison article, or learn about invoking the Spark engine from Presto.