AWS Athena vs Snowflake

The High Level Overview

Snowflake and Amazon Athena are both cloud analytics tools, but they differ significantly in architecture. Athena is a serverless query engine based on open-source Presto technology that uses Amazon S3 as its storage layer, whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

Snowflake would more often be considered an alternative to Redshift or other cloud data warehouse technologies – typically used where workloads are predictable, or where organizations are willing to pay a premium for very fast query performance. Storing large volumes of semi-structured data in a data warehouse is typically expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Athena.

What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads, and that this performance holds even at larger scale.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, especially as compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s performance. Users generally have positive opinions about Snowflake’s performance but note its high cost; opinions about Athena’s performance are also generally positive, though users mention potential performance issues and the inability to scale the service.

Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Athena

– Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. 

– Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment – there are no concurrency issues, and performance is deterministic.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads without impacting performance, and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Athena

The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s scale. Users note potential limitations in certain features for both tools, although both are capable of querying large datasets.

Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.
Athena

– Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively.

– However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service, which means you get more control over your deployment than you would with Athena, while it still takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s usability. Users generally have positive opinions about Snowflake’s ease of use and configuration. They are also happy with the ease of deploying Athena in their AWS account, but mention drawbacks such as the lack of support for stored procedures and unclear error messages when debugging queries.

Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.
Athena

– Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple.

– However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana is pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized.
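
To make these pricing models concrete, here is a rough, hypothetical illustration (actual numbers will vary with your workload): an Athena workload that scans 10 TB of S3 data per month would cost about 10 × $5 = $50, and if compression and partitioning cut the scanned volume to 2 TB, the bill would drop to about $10. On Snowflake Standard Edition, a warehouse consuming a hypothetical 4 credits per hour would cost roughly 4 × 3,600 × $0.00056 ≈ $8.06 per hour of compute, plus $23 per compressed TB stored per month.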

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for your data lake and lakehouse architecture. Ahana gives you SQL on S3 with better price performance than Athena and no vendor-lock in as compared to Snowflake.


Starburst vs Snowflake

The High Level Overview

Starburst and Snowflake are both in the data analytics space, but they differ significantly in architecture and use cases. Starburst is the corporate entity behind Trino, a SQL query engine forked from Presto, whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to Starburst is Ahana Cloud, a managed service for Presto.

Snowflake would more often be considered an alternative to Redshift or other cloud data warehouse technologies – typically used where workloads are predictable, or where organizations are willing to pay a premium for very fast query performance. Storing large volumes of semi-structured data in a data warehouse is typically expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Amazon Athena.

What is Starburst?

Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. It enables users to query, analyze, and process data from multiple sources.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst

Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads, and that this performance holds even at larger scale.
Ahana Cloud

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, especially as compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s performance. Users generally have positive opinions about Starburst’s performance, but find it difficult to customize and integrate with external databases; Snowflake’s performance is seen as an advantage, but users note that it is expensive for some use cases.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Ahana Cloud

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment – there are no concurrency issues, and performance is deterministic.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst

The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads without impacting performance, and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Ahana Cloud

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s scale. Overall, users generally think that both Starburst and Snowflake are capable of handling larger volumes of data, but each may have other potential scalability issues.

Starburst

– Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable

– Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst

The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Ahana Cloud

Ahana is a managed service, which means you get more control over your deployment than you would with Athena, while it still takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s usability. Overall, users generally find both Starburst and Snowflake to be easy to use, although they have noted some areas of improvement for each.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.

Cost

  • Starburst’s pricing is based on credits and cluster size. The examples given on the company’s pricing page hint at a minimum spend of a few thousand dollars per month.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana Cloud is pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Starburst’s pricing can be difficult to predict based on the information available online, but the company is clearly leaning towards an enterprise pricing model that looks at annual commitment rather than pay-as-you-go.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto in the cloud.


Presto and ETL – Clarifying Common Misconceptions

Data integration and ETL are typically seen as crucial to gaining insights from data. But when it comes to Presto, things get trickier: Should you use Presto for ETL? Or should you run ETL flows before querying data with Presto?

In this article, we will explore using Presto for ETL, how it compares to traditional data warehouse architectures, and whether ETL is necessary before querying data with Presto.

Looking for a better solution to run Presto in public clouds? Try Ahana for superior price-performance.

What is ETL in big data?

ETL (Extract, Transform, Load) is an iterative data integration process used in data warehousing to bring data from multiple sources into a single, centralized data repository. The process involves extracting data from its original source, cleaning and validating it, transforming it into a suitable format, and loading it into the data warehouse.

This process is necessary because data from different sources may have different formats and structures, so it needs to be unified and organized before it can be used in data analysis and decision-making. ETL also allows data to be regularly updated, so that the data warehouse is always current. Additionally, ETL tools allow data to be migrated between a variety of sources, destinations, and analysis tools, and enable companies to address specific business intelligence needs through data analytics.

In big data environments, the target into which data is loaded might be a data warehouse such as BigQuery or Redshift, but it might also be a data lake or data lakehouse, where the transformed data is stored in object storage such as Amazon S3. For instance, data might be loaded into an S3 data lake in its raw form and then further processed, cleaned, and transformed into a format that is more suitable for analytical querying, such as Apache Parquet or ORC.
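
As a minimal sketch of that last step (using Presto’s Hive connector syntax and hypothetical schema, table, and column names), a CREATE TABLE AS statement can rewrite a raw table into partitioned Parquet on S3:

  -- Hypothetical example: materialize a raw table as partitioned Parquet on S3
  CREATE TABLE analytics.orders_parquet
  WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['order_date']  -- partition columns must come last in the SELECT
  )
  AS
  SELECT order_id, customer_id, total_amount, order_date
  FROM raw.orders_json;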

Using Presto for ETL

Presto is a distributed SQL query engine that is designed to be fast and responsive, allowing users to quickly analyze large datasets. It is an excellent choice for running interactive queries on large datasets, as it can execute queries in seconds or minutes, depending on the size of the dataset.

Even though Presto was designed to be an ad-hoc query engine, it can be a suitable choice if the ETL process does not require too many resources or is not too complex. However, it does not guarantee that long-running jobs will complete, and it lacks many features common in tools designed for ETL, such as caching.

It is worth noting that Presto is not a replacement for traditional ETL tools like Apache Airflow, which are designed specifically for complex ETL workflows. However, if your ETL process is relatively simple and you are already using Presto for data analysis, it may make sense to use Presto for ETL as well.

If you are using Presto for ETL, it is recommended to spin up a separate Presto cluster to avoid resource contention with interactive Presto workloads. You should also break complex queries into a series of shorter ones and create temporary tables for manual checkpointing.
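
A minimal sketch of that checkpointing pattern, with hypothetical table names, might look like this in Presto SQL:

  -- Step 1: materialize an intermediate checkpoint table instead of one long query
  CREATE TABLE tmp.daily_events_stage AS
  SELECT event_date, user_id, count(*) AS event_count
  FROM raw.events
  WHERE event_date >= DATE '2023-01-01'
  GROUP BY event_date, user_id;

  -- Step 2: build the final output from the checkpoint, then drop the staging table
  CREATE TABLE analytics.daily_active_users AS
  SELECT event_date, count(DISTINCT user_id) AS daily_active_users
  FROM tmp.daily_events_stage
  GROUP BY event_date;

  DROP TABLE tmp.daily_events_stage;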

Do you need to ETL your data before reading it in Presto?

Presto allows for ad-hoc querying of data wherever it resides, without the need for ETL into a separate system. Using Presto connectors, analysts can access the datasets they are interested in, while in-place execution means results are returned quickly — even when querying data directly from cloud object storage. This makes it much easier for data teams to access and analyze data in real time, without having to wait for ETL processes to complete. It also helps to reduce the cost of data storage, as it eliminates the need to duplicate data in a separate system.

For most Presto use cases, there is no need to pre-process data with ETL or ELT before querying it. As soon as the data is stored in the data lake, it can be accessed with Presto SQL. This is almost always the case with interactive analytics, but there could be exceptions when it comes to scheduled reporting or BI dashboards – in these situations, you might consider using aggregated or pre-summarized datasets to reduce compute costs.
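
For illustration, a federated query of this kind might look like the sketch below; the catalog, schema, and column names are hypothetical, and assume a Hive connector for S3 data and a MySQL connector for an operational database:

  -- Join Parquet files on S3 (Hive catalog) with an operational MySQL table, in place
  SELECT c.region, sum(o.total_amount) AS revenue
  FROM hive.sales.orders AS o
  JOIN mysql.crm.customers AS c
    ON o.customer_id = c.id
  WHERE o.order_date >= DATE '2023-01-01'
  GROUP BY c.region
  ORDER BY revenue DESC;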

Comparing Presto to Data Warehouses

The ‘zero ETL’ approach that Presto enables is one of its core advantages over traditional data warehouse architectures. ETL/ELT can involve a lot of manual work, data duplication, and errors, which can make analytics slower and more complex.

PrestoDB’s ability to run analytical queries on diverse data sources and raw semi-structured data can significantly expedite the process. Presto eliminates the need to load data into a data warehouse, since data can be queried directly from its source and schema changes can be implemented in real time; this also removes the need to perform costly transformations in order to apply a relational schema to file-based data.

If you’re interested in trying out Presto in the cloud, Ahana Cloud is a SaaS for Presto that makes it easy to deploy Presto in the cloud and run queries on your data. Check out a demo today.

S3 Select vs. AWS Athena – The Quick Comparison


Data analysts and data engineers need simpler ways to access business data stored on Amazon S3. Amazon Athena and S3 Select are two services that allow you to retrieve records on S3 using regular SQL. What are the differences, and when should you use one vs the other?

S3 Select vs Athena: What’s the Difference?

The short answer:

Both services allow you to query S3 using SQL. Athena is a fully featured query engine that supports complex SQL and works across multiple objects, while S3 Select is much more limited and is used to retrieve a subset of data from a single object in S3 using simple SQL expressions.

The long answer:

S3 Select is more appropriate for simple filtering and retrieval of specific subsets of data from S3 objects using basic SQL statements, with reduced data transfer costs and latency. Amazon Athena, on the other hand, is suitable for running complex, ad-hoc queries across multiple paths in Amazon S3, offering more comprehensive SQL capabilities, improved performance, and optimization options. Athena supports more file formats, compression types, and optimizations, while S3 Select is limited to CSV, JSON, and Parquet formats.

An alternative to Amazon Athena is Ahana Cloud, a managed service for Presto that offers up to 10x better price performance.

Here is a detailed comparison between the two services:

Query Scope:

  • S3 Select operates on a single object in S3, retrieving a specific subset of data using simple SQL expressions.
  • Amazon Athena can query across multiple paths, including all files within those paths, making it suitable for more complex queries and aggregations.

SQL Capabilities:

  • S3 Select supports basic SQL statements for filtering and retrieving data, with limitations on SQL expression length (256 KB) and record length (1 MB).
  • Athena offers more comprehensive ANSI SQL compliant querying, including group by, having, window and geo functions, SQL DDL, and DML.
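
As a small illustration of that gap, the hypothetical Athena query below uses a window function, which S3 Select’s expression language cannot express:

  -- Rank each customer's orders by value; table and column names are hypothetical
  SELECT customer_id,
         order_id,
         row_number() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS rank_by_value
  FROM orders
  WHERE order_date >= DATE '2023-01-01';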

Data Formats and Compression:

  • S3 Select works with CSV, JSON, and Parquet formats, supporting GZIP and BZIP2 (only for CSV and JSON) compression.
  • Athena supports a wider range of formats, including CSV, JSON, Apache Parquet, Apache ORC, and TSV, with broader compression support.

Integration and Accessibility:

  • S3 Select can be used with AWS SDKs, the SELECT Object Content REST API, the AWS CLI, or the Amazon S3 console.
  • Athena is integrated with Amazon QuickSight for data visualization and AWS Glue Data Catalog for metadata management. It can be queried directly from the management console or SQL clients via JDBC.

Performance and Optimization:

  • S3 Select is a rudimentary query service mainly focused on filtering data, reducing data transfer costs and latency.
  • Athena offers various optimization techniques, such as partitioning and columnar storage, which improve performance and cost-efficiency.

Schema Management:

  • S3 Select queries are ad hoc and don’t require defining a data schema before issuing queries.
  • Athena requires defining a data schema before running queries.

Pricing:

  • The cost of S3 Select depends on three factors: the number of SELECT requests, the data returned, and the data scanned. As of December 2020, pricing for the US East (Ohio) region with Standard Storage is:
    • Amazon S3 Select — $0.0004 per 1000 SELECT requests
    • Amazon S3 Select data returned cost — $0.0007 per GB
    • Amazon S3 Select data scanned cost — $0.002 per GB
  • With Athena, you are charged $5.00 per TB of data scanned, rounded up to the nearest megabyte, with a 10MB minimum per query. 
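
As a rough, hypothetical illustration of the difference: filtering a 10 GB object with S3 Select and returning 1 GB of matching records would cost roughly 10 × $0.002 + 1 × $0.0007 ≈ $0.021 (plus a negligible request charge), while an Athena query scanning the same 10 GB would cost roughly (10 / 1,024) × $5 ≈ $0.05.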

An alternative to AWS Athena and S3 Select is Ahana Cloud, which gives you the ability to run complex queries at better price performance than Athena. Get a demo today.

Summary

Feature | S3 Select | Athena
Query Scope | Single object (e.g., single flat file) | Multiple objects, entire bucket
Use Cases | Ad-hoc data retrieval | Log processing, ad-hoc analysis, interactive queries, joins
SQL Capabilities | Basic queries, filtering | Complex, ANSI-compliant SQL queries, aggregations, joins
File Formats | CSV, JSON, Parquet | CSV, JSON, Parquet, TSV, ORC, and more
Integration | Serverless apps, Big Data frameworks | AWS Glue Data Catalog, ETL capabilities
Query Interface | S3 API (e.g., Python boto3 SDK) | Management Console, SQL clients via JDBC
Performance Optimization | Limited, basic filtering | Partitioning, columnar storage, and more
Schema Definition | Not required | Required

When should you use Athena, and when should you use S3 select?

You should choose Amazon Athena for complex queries, analysis across multiple S3 paths, and integration with other AWS services, such as AWS Glue Data Catalog and Amazon QuickSight. Opt for S3 Select when you need to perform basic filtering and retrieval of specific subsets of data from a single S3 object.

Example Scenarios

Example 1: Log Analysis for a Web Application – Use Athena

Imagine you operate a web application, and you want to analyze log data stored in Amazon S3 to gain insights into user behavior and troubleshoot issues. In this scenario, you have multiple log files across different S3 paths, and you need to join and aggregate the data to derive meaningful insights.

In this case, you should use Amazon Athena because it supports complex SQL queries, including joins and aggregations, and can query across multiple paths in S3. With Athena, you can take advantage of its optimization features like partitioning and columnar storage to improve query performance and reduce costs.
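
A minimal sketch of what this could look like in Athena, with hypothetical bucket, table, and column names, follows:

  -- Define a partitioned external table over log files on S3 (names are hypothetical)
  CREATE EXTERNAL TABLE web_logs (
    request_time string,
    user_id      string,
    url          string,
    status       int
  )
  PARTITIONED BY (dt string)
  STORED AS PARQUET
  LOCATION 's3://my-log-bucket/web-logs/';

  -- Register the partitions that already exist under the table location
  MSCK REPAIR TABLE web_logs;

  -- Only the requested week of partitions is scanned, keeping cost and latency down
  SELECT url, count(*) AS errors
  FROM web_logs
  WHERE dt BETWEEN '2023-01-01' AND '2023-01-07'
    AND status >= 500
  GROUP BY url
  ORDER BY errors DESC
  LIMIT 20;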

Example 2: Filtering Customer Data for a Marketing Campaign – Use S3 Select

Suppose you have a customer data file stored in Amazon S3, and you want to retrieve a subset of records for a targeted marketing campaign. The data file is in JSON format, and you need to filter records based on specific criteria, such as customer location or spending habits.

In this scenario, S3 Select is the better choice, as it is designed for simple filtering and retrieval of specific subsets of data from a single S3 object using basic SQL expressions. Using S3 Select, you can efficiently retrieve the required records, reducing data transfer costs and latency.
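
For illustration, the S3 Select expression for such a filter could look like the sketch below (the field names are hypothetical); the statement is passed to the SelectObjectContent API, the AWS CLI, or an SDK along with input and output serialization settings:

  -- Retrieve only the matching records from a single JSON object in S3
  SELECT s.customer_id, s.email
  FROM S3Object s
  WHERE s.country = 'US' AND s.total_spend > 500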

Is S3 Select Faster than Athena?

Both S3 Select and Athena are serverless and rely on pooled resources provisioned by Amazon at the time the query is run. Neither is generally faster than the other. However, S3 Select can be faster than Athena for specific use cases, where retrieving a subset of the data is more efficient than processing the entire object. In cases where you only need the capabilities of S3 Select, it can also be easier to run compared to Athena, which requires a table schema to be defined.

Need a better SQL query engine for Amazon S3?

Ahana provides a managed Presto service that lets you run ad-hoc queries, interactive analytics, and BI workloads over your S3 storage. Learn more about Ahana or get a demo.

Sources used in this article:

https://ahana.io/answers/aws-s3-select-limitations/ 

https://aws.amazon.com/s3/pricing/

https://aws.amazon.com/athena/pricing/ 

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html

https://aws.amazon.com/blogs/storage/querying-data-without-servers-or-databases-using-amazon-s3-select/

https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428

https://stackoverflow.com/questions/49102577/what-is-difference-between-aws-s3-select-and-aws-athena

https://repost.aws/questions/QU1_wCZSxES6-QHh7QBTDYDA/s-3-select-vs-athena

4 Trino Alternatives for Better Price / Performance


Trino, a distributed SQL query engine, is known for its ability to process large amounts of semi-structured data using familiar SQL semantics. However, there are situations where an alternative may be more suitable. In this article, we explore four Trino alternatives that offer better price/performance for specific use cases.

What is Trino?

Trino is a distributed SQL query engine that supports various data sources, including relational and non-relational sources, through its connector architecture. It is a hard fork of the original Presto project, which was started at Facebook and later open-sourced in 2013. 

The creators of Presto, who later became cofounders/CTOs of Starburst, began the hard fork named Trino in early 2019. 

Trino has since diverged from Presto, and many of the innovations that the community is driving in Presto are not available in Trino. Trino is not hosted under the Apache Software Foundation (ASF) or Linux Foundation, but rather under the Trino Software Foundation, a non-profit corporation controlled by the cofounders of Starburst.


When Should You Use Trino?

Presto-based services – including Trino and PrestoDB – are designed for ad-hoc querying and analytical processing over data lakes, and allow developers to run interactive analytics against massive amounts of semi-structured data. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations. 

Trino or Presto should be used when a user wants to perform fast queries against large amounts of data from different data sources using familiar SQL semantics. It is suitable for organizations that want to use their existing SQL skills to query data without having to learn new complex languages.

Other Trino use cases mentioned in the context of data science workloads include running specific federated queries that require high performance, and connecting to data via Apache Hive as the backend.

When Should You Look at Alternatives to Trino?

While Trino is a powerful and popular framework, there are situations where you might want to consider an alternative. These include:

  • If you’re looking for an open-source project that has a strong governance structure and charter, then Trino is not the best choice, since it is hosted under a vendor-controlled non-profit corporation. Users who prefer a project hosted under a well-known organization like the ASF or The Linux Foundation may choose another tool instead of Trino.
  • If you are looking for services and support from vendors, you should compare the functionality and price/performance provided by Trino to alternative tools such as Ahana, Amazon Athena, or Dremio.
  • If you’re looking for a database management system that stores and manages data, then Trino is not suitable. Like Presto, Trino is a SQL query engine that queries the connected data stores and does not store data (although both tools have the option to write the results of a query back to object storage).

4 Alternatives to Trino 

If you’re looking for an alternative to Trino, consider one of the following:

  1. Open Source PrestoDB 
  2. Ahana, managed service for Presto on AWS
  3. Amazon Athena, serverless service for Presto/Trino on AWS
  4. Dremio

1. PrestoDB – the original Presto distribution used at Facebook

As mentioned above, Trino is originally a hard fork of PrestoDB. Trino was previously known as PrestoSQL before being rebranded in December 2020. The Presto Software Foundation was also rebranded as Trino Software Foundation to reflect the fact that these are two separate and divergent projects. 

While Trino and PrestoDB share a common history, they have different development teams and codebases, and may have different features, optimizations, and bug fixes.

Some key differences between PrestoDB and Trino:

  • PrestoDB is tested by and used by Facebook, Uber, Bytedance, and other internet-scale companies, while Trino is not. 
  • Presto is one of the fastest-growing open-source projects in the data analytics space.
  • The Presto Foundation (part of The Linux Foundation) oversees PrestoDB, whereas Trino is mainly steered by a single company (Starburst). 
  • Presto offers access to recent and current innovations in PrestoDB including Project Aria, Project Presto Unlimited, additional user-defined functions, Presto-on-Spark, Disaggregated Coordinator, and RaptorX Project. 

See the full comparison: Presto vs Trino.

There are several ways you can get started with open source Presto, including running it on-premises, through a Docker container, and more (check out our getting started with Presto page). 

2. Ahana Cloud: managed service for Presto on AWS

Ahana, a member of the Presto Foundation and contributor to the PrestoDB project, offers a managed, cloud-native version of open-source Presto – Ahana Cloud. It gives you a managed service offering for Presto by taking care of the hundreds of configurations and tuning parameters under the hood while still giving you more control and flexibility as compared to a serverless offering.

Ahana also includes some features like Data Lake Caching for better performance and AWS Lake Formation integration to take advantage of granular data security.

Check out a demo of Ahana Cloud.

3. Amazon Athena: managed Presto/Trino service provisioned by AWS

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL. Originally based off of PrestoDB, Athena now incorporates features from both Presto and Trino.

In our comparison between Athena and Trino-based Starburst, we concluded that:

  • Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3, but there are some key differences. 
  • Starburst has features like Cached Views and pushdown capabilities, while Athena is optimized for fast performance with Amazon S3 and executes queries automatically in parallel. 
  • Users generally regard both Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues. 
  • Users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging. 
  • In terms of cost, Athena charges a flat price of $5 per terabyte of data scanned, while Starburst’s pricing is more complex.

4. Dremio: serverless query engine based on Apache Arrow

Dremio, which is built on Apache Arrow, is another query engine that enables high-performance analytics directly on data lake storage. 

According to Dremio’s website, Dremio offers interactive analytics directly on the lake and is often used for BI dashboards, whereas Starburst primarily supports ad-hoc workloads only. Dremio provides self-service with a shared semantic layer for all users and tools, while Starburst lacks a semantic layer and data curation capabilities. 

On the other hand, Starburst touts a cost-based optimizer that helps define an optimal plan based on the table statistics and other info it receives from plugins. Starburst’s custom connectors are optimized to be run in parallel, taking advantage of Trino’s MPP architecture. 

While both platforms offer similar products, Dremio seems to be more focused on BI-oriented workloads reading from data lakes, whereas Starburst might be better suited for ad-hoc and federated queries.

Try Ahana Cloud’s managed Presto for free

If you’re evaluating SQL query engines, you’re in the right place. The easiest way to get started is with Ahana Cloud for Presto. You can try it for yourself, but we recommend scheduling a quick, no-strings-attached call with our solutions engineering team to understand your requirements and set up the environment. Get started now

Exploring Data Warehouse, Data Mesh, and Data Lakehouse: What’s right for you?


We’re hosting a free hands-on lab on building your own Data Lakehouse in AWS. You’ll get trained by Presto and Apache Hudi experts.

When it comes to data management, there are various approaches and architectures for storing, processing, and analyzing data. In this article we’ll discuss three of the more popular approaches in the market today – the data warehouse, data mesh, and data lakehouse. 

Each approach has its own unique features, advantages, and disadvantages, and understanding the differences between them is crucial for organizations to make informed decisions about their data strategy. We’ll take you through each one and help you determine which approach is best suited for your organization’s data needs.

Data Warehouse: Centralized but Inflexible

A Data Warehouse is a centralized repository that stores structured data from various sources for analysis and reporting. Typically it’s a relational database and optimized for read-heavy workloads with a schema-on-write approach. 

Advantages of a data warehouse are that it’s a single source of truth for structured data, it provides high performance querying and dashboarding/reporting capabilities, and it supports business intelligence and analytics use cases. 

On the other hand, some of its disadvantages are that it requires data to be pre-processed and structured, it has limited flexibility in handling unstructured data and new data types, and it can be expensive to implement and maintain.

Learn more about choosing between data warehouse and data lake.

Data Mesh: Flexible but Complicated

A Data Mesh is a distributed and decentralized approach to data architecture that focuses on domain-driven design and self-service data access. Key features include decentralized data ownership and control, data organized by domain rather than centralized by function, data treated as a discoverable and reusable product, and self-service data access for domain teams.

Advantages of a data mesh are that it offers agility and flexibility in handling complex and evolving data environments, it facilitates collaboration between data teams and domain teams, and it promotes data democratization and data-driven culture.

Disadvantages are that it requires a cultural shift and new ways of working to implement, distributed data ownership involves data governance and security challenges, and it requires strong data lineage and metadata management to ensure data quality and consistency. Performance can also be a problem if you’re doing joins across many data sources, because your query will only be as fast as your slowest connection.

Data Lakehouse: Hybrid Approach

A Data Lakehouse is a hybrid approach that combines the best features of data warehouses and data lakes. Those features include support for both structured and unstructured data, support for both read and write-heavy workloads, and a schema-on-read approach.

Advantages of a data lakehouse are that it offers flexibility in handling both structured and unstructured data, it supports real-time analytics and machine learning use cases, and it’s cost-effective compared to traditional data warehouses. They’re designed to handle both batch processing and real-time processing of data.

Disadvantages are that it requires data governance and management policies to prevent data silos and ensure data quality, complex data integration and transformation may require specialized skills and tools, and there may be performance issues for ad-hoc queries and complex joins.

Picking the data architecture that’s best for your use case

Below is a matrix we’ve put together that lists which of these approaches best fits specific requirements and use cases.

The matrix compares the Data Warehouse, Data Mesh, and Data Lakehouse across the following criteria: structured data, unstructured data, fast access to data, real-time data processing (possible with additional tools), data governance (centralized for the data warehouse, decentralized for the data mesh, and either centralized or decentralized for the data lakehouse), cost-effectiveness (depends on the specific use case), scalability, self-service data discovery, data integration (may require specialized work), and analytics capabilities (still evolving).

As shown in the matrix, each architecture has its own strengths and weaknesses across different key capabilities.

A data warehouse architecture is well-suited for structured data, offers strong data governance, and mature analytics capabilities, but may be limited in its scalability and ability to handle unstructured data and real-time processing.

A data mesh architecture offers highly scalable and decentralized data management, high developer productivity, and flexible data governance, but may require additional tools for real-time processing and careful planning for data integration.

A data lakehouse architecture is well-suited for unstructured data, offers good scalability and data integration capabilities, and is reasonably cost-effective, but may be limited in its ability to handle highly structured data and may require varied data governance strategies.

The Open Data Lakehouse

At Ahana, we’re building the Open Data Lakehouse with Presto at its core. Presto, the open source SQL query engine, powers the analytics on your Open Data Lakehouse. We believe the data lakehouse approach strikes the best balance between flexibility, scalability, and cost-effectiveness, making it a favorable choice for organizations seeking a modern data management solution.

You can learn more about our approach to the Data Lakehouse by downloading our free whitepaper.

AWS Athena vs. Databricks

In this article we’ll look at two different technologies in the data space and share more about which to use based on your use case and workloads.

The High Level Overview

To set the stage, it’s important to note that Databricks and Amazon Athena are quite different products, so a direct comparison is of limited value given the breadth of functionality each provides. For the purposes of this article, we’ll give an overview of each and share more on when it makes sense to use each tool.

AWS Athena is a serverless query engine based on open-source Presto technology, which uses Amazon S3 as the storage layer; whereas Databricks is an ETL, data science, and analytics platform which offers a managed version of Apache Spark. Databricks is widely known for its data lakehouse approach which gives you the data management capabilities of the warehouse coupled with the flexibility and affordability of the data lake.

One could conceivably use both tools within the same deployment, although there will be some overlap around data warehousing and ad-hoc workloads. This overlap might have grown larger recently with the release of Amazon Athena for Apache Spark.

An alternative to these offerings is Ahana Cloud, a managed service for Presto that gives you a prescriptive approach to building an open data lakehouse using open source technologies and open formats.

What is Databricks?

Databricks is a unified analytics platform built on open-source Apache Spark, which combines data science, engineering, and business analysis in an integrated workspace.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Databricks

The Databricks website claims that Databricks offers world-record-setting performance directly on data in the data lake, and that it is up to 12x better price/performance than traditional cloud data warehouses.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, especially as compared to Athena.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s performance. Users generally view both Databricks and Athena as tools that provide good performance for big data workloads, but with some drawbacks when it comes to ongoing management.

Databricks

Users mention that Databricks has good performance for big data workloads, and quick lakehouse deployment. Some users have noted that Databricks makes it hard to profile code inside the platform. Additionally, some users have mentioned issues with logging for jobs, job scheduling, and job portability.
Athena

Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment – there are no concurrency issues, and performance is deterministic.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Databricks

The Databricks website claims that Databricks is highly scalable and comes with various enterprise readiness features such as security and user access control, as well as the ability to integrate with other parts of the user’s ecosystem.
Athena

Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s scale. Both tools offer auto-scaling, although Databricks can provide dedicated clusters which might provide more consistent performance.

Databricks

Users were happy with Databricks’s ability to autoscale clusters. They also note its use of open-source technologies and the ability to use different programming languages in the platform. Some users have mentioned challenges around security, user access control, and integration with other parts of the ecosystem. Users also note that Databricks is not compatible with some AI/ML libraries, can be difficult to secure and control access to, and can get expensive.
Athena

Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of a standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Databricks

The Databricks website claims that Databricks is simple to install and operate, and that it uses familiar languages and syntaxes such as SQL, making it easy to use.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service, which means you get more control over your deployment than you would with Athena, while it still takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s usability.

Databricks

Multiple reviews mentioned that Databricks provides a good user experience and has a relatively simple setup process. On the other hand, users have mentioned that Databricks has a steep learning curve, which could make it difficult to use for those without specialized knowledge. Additionally, some users have noted that the UI can be confusing or repetitive.
Athena

Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views, and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. As your datasets and workloads grow, your Athena costs can grow quickly, which can lead to sticker shock. That’s why many Ahana customers are former Athena users who saw unpredictable costs associated with their Athena usage – due to Athena’s serverless nature, you can never predict how many resources will be available to your queries.
  • Databricks pricing is based on compute usage. The cost of using Databricks is calculated by multiplying the amount of DBUs (Databricks Units) that you consumed with a corresponding $ rate. This rate is influenced by the cloud provider you’re working with (e.g., the cost AWS charges for EC2 machines), geographical region, subscription tier, and compute type.
  • Ahana is pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

While you can find some figures on Databricks’s pricing page, understanding how much you will end up paying can be quite difficult, as it will depend on the type and volume of your workloads, as well as whatever discount you can negotiate with AWS. Many of the user reviews mention the price of running Databricks as prohibitive, especially when compared to open-source Apache Spark.

Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized. 
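One common way to keep the data-scanned charge down is to rewrite raw files as a compressed, columnar, partitioned copy and point your queries at that copy instead. The sketch below uses CTAS syntax supported by Athena and Presto’s Hive connector; the bucket, table, and column names are made up purely for illustration:

-- Hypothetical example: rewrite raw data as partitioned Parquet so queries
-- scan (and are billed for) far less data. All names are illustrative only.
CREATE TABLE curated_events
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['dt'],
  external_location = 's3://example-bucket/curated/events/'
) AS
SELECT user_id, event_type, event_time, dt
FROM raw_events;

Queries that filter on the dt partition column then read only the matching partitions rather than the whole dataset, which is what keeps the per-query cost predictable.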

Ahana’s pricing is much simpler and also transparent thanks to the pricing calculator. Similar to Athena, the pricing will just be part of your AWS bill.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability as compared to Amazon Athena. Ahana will give you the starting blocks needed to build your Open Data Lakehouse.

Starburst vs. Athena: Evaluating different Presto vendors


The High Level Overview – Athena vs. Starburst

Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3. Athena is a serverless service based on open-source Presto technology, while Starburst is the corporate entity behind a fork of Presto called Trino. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

All of these tools will cover similar ground in terms of use cases and workloads. Understanding the specific limitations and advantages of each tool will help you decide which one is right for you.

What is Starburst?
Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Amazon Athena?
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?
Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst
Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Athena
The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana Cloud
Ahana has multi-level data caching with RaptorX which includes one-click caching built-in to every Presto cluster. This can give you up to 30X query performance improvements.

According to user reviews:

Below is a summary of the claims made on user reviews in websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally regard Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues.

Starburst
Reviewers who were happy with Starburst’s performance mentioned that it provides quick and efficient access to data, is able to handle large volumes of data and concurrent queries, and has good pluggability, portability, and parallelism. Some reviewers noted that tuning can be cumbersome, and that storing metadata in the Hive metastore creates overheads which can slow down performance. Others mentioned the cost associated with customization, the need for technical expertise to deploy Starburst Enterprise, and occasional performance issues when dealing with large datasets.
Athena
Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst
The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Athena
The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Additionally, Athena integrates out-of-the-box with AWS Glue, which allows users to create a unified metadata repository across various services, crawl data sources to discover data and populate their Data Catalog with new and modified table and partition definitions, and maintain schema versioning.
Ahana Cloud
Ahana has an autoscaling feature that helps you manage your Presto clusters by automatically adjusting the number of worker nodes in the Ahana-managed Presto cluster. You can read the docs for more information.

According to user reviews:

Below is a summary of the claims made on user reviews in websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Users see both tools as capable of operating at scale, but both have limitations in this respect as well.

Starburst
Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable. Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Athena
Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of a standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst
The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Athena
The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana Cloud
Ahana gives you Presto simplified – no installation, no AWS AMIs or CFTs, and no configuration needed. You can be running in 30 minutes, you get a built-in catalog and one-click integration to your data sources, and it’s all cloud native running on AWS EKS.

According to user reviews:

Overall, users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging.

Starburst
Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful. However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Athena
Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Starburst’s pricing is more complex as it is based on credits and cluster size. The examples given on the company’s pricing page hint at a minimum spend of a few thousand dollars per month.
  • Ahana Cloud is pay-as-you-go through your AWS bill based on the compute you use. There’s a pricing calculator you can use to get an idea.

While the specifics of your cloud bill will eventually depend on the way you use these tools and the amount of data you process in them, Athena and Ahana Cloud have a simpler cost structure and offer a more streamlined on-demand model.

Need a better alternative to Athena and Starburst?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto.

PrestoDB on AWS

What is PrestoDB on AWS?

Tip: If you are looking to better understand PrestoDB on AWS then check out the free, downloadable ebook, Learning and Operating Presto. This ebook will break down what Presto is, how it started, and the best use cases.

To tackle this common question, what is PrestoDB on AWS, let’s first define Presto. PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. Presto was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. It was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables self-service ad-hoc analytics for its users on large amounts of data. With Presto, you can query data where it lives, including Hive, Amazon S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
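As a sketch of what such a federated query can look like, the statement below joins a table exposed through the Hive connector (data files in S3) with a table in MySQL. The catalog, schema, and table names are hypothetical – substitute whatever catalogs your own cluster has configured:

-- Hypothetical federated query: hive and mysql are catalog names that must
-- already be configured on the cluster; the schemas and tables are examples.
SELECT c.customer_name,
       count(*) AS page_views
FROM hive.weblogs.page_views AS v   -- files in Amazon S3, via the Hive connector
JOIN mysql.crm.customers AS c       -- rows in a MySQL database
  ON v.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10;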

AWS and Presto is a powerful combination. If you want to run PrestoDB on AWS, it’s easy to spin up a managed Presto cluster through the Amazon Management Console, the AWS CLI, or the Amazon EMR API, and running the Presto CLI against an EMR cluster is straightforward.

You can also give Ahana Cloud a try. Ahana is a managed service for Presto that takes care of the devops for you and provides everything you need to build your SQL Data Lakehouse using Presto.

Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effective features of the cloud while allowing you to take advantage of Presto’s distributed query engine. 

How does PrestoDB on AWS Work?

This is another very common question. The quickest answer is that PrestoDB is the compute engine on top of the data storage of your SQL Data Lakehouse. In this case, the storage is AWS S3. See the image below for an overview.

PrestoDB on AWS

There are AWS services that work with PrestoDB on AWS, most notably Amazon EMR and Amazon Athena. These are the main Amazon services for deploying Presto in the cloud; they are managed services that do the integration, testing, setup, configuration, and cluster tuning for you. Both are widely used, but both come with some challenges, such as price-performance and cost.

There are some differences when it comes to EMR Presto vs Athena. AWS EMR enables you to provision as many compute instances as you want, within minutes. Amazon Athena lets you deploy Presto using the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune.

Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Comparing Amazon Athena with Presto is common, and many users weigh the managed service against deploying PrestoDB themselves. Learn more about those challenges and why they’re moving to Ahana Cloud, SaaS for PrestoDB on AWS.

To get started with Presto for your SQL Data Lakehouse on AWS quickly, check out the services from Ahana Cloud. Ahana has two versions of their solution: a Full-Edition and a Free-Forever Community Edition. Each option has components of the SQL Lakehouse included, as well as support from Ahana. Explore Ahana’s managed service for PrestoDB.

Related Articles 

PrestoDB on Spark

Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads.

Price-Performance Ratio of AWS Athena Presto vs Ahana Cloud for PrestoDB

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project. The biggest difference between the two is that Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers.

Athena Query Limits | Comparing AWS Athena & PrestoDB

In this blog, we discuss AWS Athena vs Presto and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service, like AWS pricing.

Ahana Cloud for PrestoDB

What is Presto and How Does It Work?

What is Presto and how does it work?

How does PrestoDB work? PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives, across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is built on Java and can also integrate with other third-party data sources or infrastructure components.

Is Presto a database?

No, PrestoDB is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between Presto and other forks?

PrestoDB originated from Facebook and was built specifically for Facebook’s needs. PrestoDB is backed by the Linux Foundation’s Presto Foundation and is the original Facebook open source project. Other versions are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory in Presto is usually discussed in the context of the JVMs themselves: depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. PrestoDB itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data during the execution of tasks in its buffer cache. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query against other disparate data sources.

What is the Presto query execution model?

The query execution model is split up into a few different phases: Statement, Query, Stage, Task, and Split. After you issue a SQL query (or Statement) to the query engine, it parses and converts it to a query. When PrestoDB executes the query it does so by breaking it up into multiple stages. Stages are then split up into tasks that run across the workers; the tasks are what actually do the work and process the data. Tasks use an Exchange to share data between tasks and the outputs of processes.
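You can see this breakdown for any query with EXPLAIN. A distributed plan prints the fragments – the stages the coordinator will schedule as tasks on the workers. The example below assumes the built-in TPC-H connector is configured as a catalog named tpch; any table you already have works just as well:

-- Prints the plan fragments (stages); each fragment is split into tasks
-- that run in parallel on the workers. tpch is an assumed catalog name.
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderstatus, count(*) AS orders
FROM tpch.tiny.orders
GROUP BY orderstatus;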

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, PrestoDB will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 
– Capturing data
– Storing data
– Analysis
– Search
– Sharing
– Transfer
– Visualization
– Querying
– Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within or to work directly against the Hadoop ecosystem.

Presto data sources are sources that connect to PrestoDB and that you can query. There are a ton in the PrestoDB ecosystem including AWS S3, Redshift, MongoDB, and many more.
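A quick way to see which data sources your own cluster can reach is to list its catalogs and drill into one of them. The hive catalog name below is just a common convention – your cluster may use different names:

SHOW CATALOGS;                  -- one entry per configured connector (e.g. hive, mysql, jmx)
SHOW SCHEMAS FROM hive;         -- assumes a catalog named hive is configured
SHOW TABLES FROM hive.default;  -- tables registered in that schema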

What Is Presto Hive? 

Presto Hive typically refers to using PrestoDB with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and are typically stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables; it is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, via the Hive connector, is able to access both of these components. One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.
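A small, hypothetical sketch of that data-plus-metadata split: the first statement below registers a table in the Hive metastore (the metadata) while the Parquet files themselves live in S3 (the data), and Presto’s Hive connector reads both. The bucket, schema, and column names are made up:

-- Hypothetical: metadata goes into the Hive metastore, data files stay in S3.
CREATE TABLE hive.default.events (
  event_time timestamp,
  user_id    bigint,
  event_type varchar
)
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/events/'
);

-- Presto uses the metastore to locate the files, then reads them directly.
SELECT event_type, count(*) AS events
FROM hive.default.events
GROUP BY event_type;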

Does Presto Use Spark?

PrestoDB and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

PrestoDB is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.

Autoscale your Presto cluster in Ahana Cloud

Autoscaling is now available on Ahana Cloud. This feature monitors the average CPU utilization of your worker nodes and scales out when it reaches the 75% threshold.

Best Practices for Resource Management in PrestoDB

Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources.

Hive vs Presto vs Spark for Data Analysis


Apache Hive, Apache Spark, and Presto are all popular open-source tools for working with data lakes and data lakehouses. However, these tools typically serve different functions – and while some of their capabilities overlap, there are also many differences, typically making them complementary rather than competitive. Let’s look at Presto vs Hive vs Spark and see how each of these tools can be used for large-scale data analysis.

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis. Hive provides an SQL-like interface called HiveQL to query large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto vs Hive vs Spark: The Comparison

Commonalities

  • All three projects – Presto, Hive, and Spark – are community-driven open-source software; Hive and Spark are Apache Software Foundation projects.
  • They are distributed “Big Data” software frameworks
  • BI tools connect to them using JDBC/ODBC
  • They provide query capabilities on top of Hadoop and AWS S3
  • They have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud.

Differences

  • Function: Hive – MPP SQL engine; Presto – MPP SQL engine; Spark – general purpose execution framework
  • Processing Type: Hive – batch processing using the Apache Tez or MapReduce compute frameworks; Presto – executes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/O; Spark – optimized directed acyclic graph (DAG) execution engine that actively caches data in-memory
  • SQL Support: Hive – HiveQL; Presto – ANSI SQL; Spark – Spark SQL
  • Usage: Hive – optimized for query throughput; Presto – optimized for latency; Spark – general purpose
  • Use cases: Hive – large data aggregations; Presto – interactive queries and quick data exploration; Spark – general purpose, often used for data transformation and Machine Learning workloads

Hive vs Presto

Both Presto and Hive are used to query data in distributed storage, but Presto is more focused on analytical querying whereas Hive is mostly used to facilitate data access. Hive provides a virtual data warehouse that imposes structure on semi-structured datasets, which can then be queried using Spark, MapReduce, or Presto itself. Presto is a compute and querying layer that can connect to the Hive Metastore or other data catalogs such as Apache Iceberg.

Common use case: Query data stored in distributed storage
Hive: Facilitates data access
Presto: Focused on analytical querying

Conclusion

Choosing the appropriate SQL engine depends entirely on your requirements, but if the Presto engine is what you are looking for, we suggest you give Ahana Cloud for Presto a try.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. Basically we’ve made it really easy to harness the power of Presto without having to worry about the thousands of tuning and config parameters, adding data sources, etc.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.

Ahana Cloud for Presto Versus Amazon EMR

In this brief post, we’ll discuss some of the benefits of Ahana Cloud over Amazon Elastic MapReduce (EMR). While EMR offers optionality in the number of big data compute frameworks, that flexibility comes with operational and configuration burden. When it comes to low-latency interactive querying on big data that just works, Ahana Cloud for Presto offers much lower operational burden and Presto-specific optimizations.

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases. In fact, you can run federated queries across your data sources. Developed at Facebook, Presto is supported by the Presto Foundation, an independent nonprofit organization under the auspices of the Linux Foundation. Presto is used by leading technology companies, such as Facebook, Twitter, Uber, and Netflix.

Amazon EMR is a big data platform hosted in AWS. EMR allows you to provision a cluster with one or more big data technologies, such as Hadoop, Apache Spark, Apache Hive, and Presto. Ahana Cloud for Presto is the easiest cloud-native managed service for Presto, empowering data teams of all sizes. As a focused Presto solution, here are a few of Ahana Cloud’s benefits over Amazon EMR:

Less configuration. Born of the Hadoop era, Presto has several configuration parameters in several files to configure and tune to get right. With EMR, you have to configure these yourself. With Ahana Cloud, we tune more than 200 parameters out of the box, so when you spin up a cluster, you get excellent query performance from the get go. Out of the box, Ahana Cloud provides an Apache Superset sandbox for administrators to validate connecting to, querying and visualizing your data.

Easy-to-modify configuration. Ahana Cloud offers the ability not only to spin up and terminate clusters, but also to stop and restart them – allowing you to change the number of Presto workers and add or remove data sources. With EMR, any manual changes to the number of Presto workers or data sources require a new cluster or manually restarting the services yourself. Further, adding and removing data sources is done through a convenient user interface instead of by modifying low-level configuration files.


Optimizations. As a Presto managed service, Ahana Cloud will continually provide optimizations relevant to Presto. For example, Ahana recently released data lake I/O caching. Based on the RubiX open source project and enabled with a single click, the caching eliminates redundant reads from your data lake if the same data is read over and over. This caching results in up to 5x query performance improvement and up to 85% latency reductions for concurrent workloads. Finally, idle clusters processing no queries can automatically scale down to a single Presto worker to keep costs down while allowing for a quick warm-up.


If you are experienced at tuning Presto and want full control of the infrastructure management, Amazon EMR may be the choice for you. If simplicity and accelerated go-to-market without needing to manage a complex infrastructure are what you seek, then Ahana Cloud for Presto is the way to go. Sign up for our free trial today.

Presto Geospatial Functions Tutorial

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which will store four rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
  trip_id int,
  orig_long double,
  orig_lat double,
  dest_long double,
  dest_lat double,
  trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497, 48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table with each row storing the shape representing an area e.g. central London, and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area: 

create table city_table (
  geo_shape varchar,
  name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');
insert into city_table values ('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first coordinate pair of the shape is the same as the last pair – so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coords using the POLYGON(()) syntax. The double brackets are required.
Simple polygon to represent downtown LA 
(credit: “Geospatial Data Analysis using SQL”) 

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_Contains() – whether that point exists in any of our shapes. This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using the ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips
FROM trips_table as t
JOIN city_table as c
  ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat))
GROUP BY 1;

         

          City          | Trips
------------------------+-------
 Los Angeles, downtown  |     1
 London, central        |     2
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is well outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates.

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris. 
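If we instead wanted to count trips by where they end, we would simply join on the destination coordinates. With the sample data above, the airport trip should then drop out of the central London count:

SELECT c.name as City, count(*) as Trips
FROM trips_table as t
JOIN city_table as c
  ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.dest_long, t.dest_lat))
GROUP BY 1;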

In this example you’ve seen some of Presto’s geospatial capabilities, all of which are also available in Ahana Cloud for Presto.

Presto’s Geospatial functions are listed in the Presto documentation.  

Check out our other Learning Center content  at https://ahana.io/learn/ where you’ll find useful tutorials and examples, from using Python with Presto, to Machine Learning and more.  We also cover comparisons between Presto and other technologies. 

Also check https://ahana.io/answers/ for answers to frequently asked questions on all things Presto.

How to Write a Python Script to Connect to a Presto Cluster

If you need a quick tutorial on how to write a python script to connect to a Presto cluster and run a query, you came to the right place.

Prerequisites

In this tutorial we’ll be using a Python 3 environment and SQLAlchemy. SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

  1. You will need a working Python installation – check that you have one with:
$ python3 --version
Python 3.9.1

With Python 3.4 or later you will also have pip3 installer, which we’ll use to install two modules.

2. Install the sqlalchemy module using pip3:

$ pip3 install SQLAlchemy
Collecting SQLAlchemy
  Downloading SQLAlchemy-1.3.23-cp39-cp39-macosx_10_14_x86_64.whl (1.2 MB)
Installing collected packages: SQLAlchemy
Successfully installed SQLAlchemy-1.3.23

3. Install the pyhive interface for Presto. PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive. We want the Presto interface:

$ pip3 install 'pyhive[presto]'
Collecting pyhive[presto]
 ...
Successfully built pyhive future
Installing collected packages: six, urllib3, python-dateutil, idna, future, chardet, certifi, requests, pyhive
Successfully installed certifi-2020.12.5 chardet-4.0.0 future-0.18.2 idna-2.10 pyhive-0.6.3 python-dateutil-2.8.1 requests-2.25.1 six-1.15.0 urllib3-1.26.3

Test Script

Now you have Python, SQLAlchemy and PyHive ready to roll, create this sample script and save it as presto-test.py. In it we’ll run a simple SELECT query that just returns the Presto cluster’s system timestamp. Just replace the user ID, password, cluster name, and metastore (e.g. the Ahana-integrated ahana_hive in this case) with your own. Here we’re using an HTTPS connection, so the default port number is 443:

# presto-test.py
# simple python script to run a query on a presto cluster and display the result set 
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine("presto://user:password@myPrestoClusterName:443/ahana_hive", connect_args={'protocol': 'https'})
 
with engine.connect() as con:
    rs = con.execute('SELECT now()')
    for row in rs:
        print(row)

Invocation

The script runs the SELECT statement and displays the result set which is a single row:

$ python3 presto-test.py
('2021-02-10 13:16:47.247 UTC',)

There are other methods one can use, like the Presto python client which can be found at https://github.com/prestodb/presto-python-client 

We hope you enjoyed this Presto & Python tutorial.  All the above information works perfectly with Ahana Cloud for Presto too, the fully integrated, cloud-native, managed service for AWS.

Setting-up Presto Resource Groups

Before you start allowing users on your Presto cluster it’s best practice to configure resource groups.  A resource group is a query admission control and workload management mechanism that manages resource allocation. Once configured, Presto’s resource group manager places limits on CPU and/or memory resource usage, can enforce queueing policies on queries, or divide resources up among groups and sub-groups. Each query belongs to a single resource group, and consumes resources from that group (and its “ancestor” groups).  Presto tracks memory and cpu usage of queries and these statistics can be recorded in the jmx catalog. 
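If the jmx connector is configured as a catalog on your cluster, you can explore those counters with plain SQL. The exact MBean (table) names vary between Presto versions, so it helps to search for them first; the statements below only assume a catalog named jmx exists:

-- Assumes a catalog named jmx (the JMX connector). Locate the resource group
-- and memory MBeans exposed as tables, then query the ones present on your version.
SHOW TABLES FROM jmx.current LIKE '%resourcegroup%';
SHOW TABLES FROM jmx.current LIKE '%memory%';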

Setup is fairly straightforward, but there are some subtleties to be aware of – for example, once a query starts execution, the resource group manager has no control over the query.

There follows a worked example which sets up four resource groups with controls based on memory usage: 

  1. Group1 can use up to 40% of the (available) cluster memory
  2. Group2 can use up to 30% of memory
  3. An administration group which has high priority for anyone logging in as “admin”
  4. An adhoc group for all other queries. 

Step 1: You’ll need to add an etc/resource-groups.properties file with the following contents to enable resource groups:

resource-groups.configuration-manager=file
resource-groups.config-file=etc/resource-groups.json

Step 2: Create your etc/resource-groups.json file with your group1 (40% mem) and group2 (30% mem). Some recommendations:

  • Generally, allow a max global mem usage of 80% for Presto user activity. The remaining 20% is the recommended overhead for Presto to use as workspace.
  • Configure your maxQueued parameter (required) to the maximum number of queued queries you want the cluster to tolerate. Once this limit is reached new queries are rejected with an error message.
  • Configure hardConcurrencyLimit – this is the maximum query concurrency across the cluster. 
  • Configure your softMemoryLimit (your maximum amount of distributed memory each group may use before new queries are queued) for each group. You can specify it as a percentage (i.e. 40%) of the cluster’s memory. 
  • We’ll leave the cpuLimit parameters alone, which are optional anyway. These can be looked at later depending on your workload and performance requirements. 
  • Consider setting  “jmxExport”: true to store the statistics allowing you to monitor the resource behaviour. 
  • I recommend a group for admin use.  I’ve called this simply “admin” in the example json file below.  
  • I recommend an adhoc group for all other queries – a “catch-all”. I added this to the example json file below.

Step 3: You also need to set up Selectors, which let the cluster know who or what you are so you can be assigned a resource group. This can be based on the id of the user running the query, or a “source” string literal that you provide on the CLI with the --source option. There’s a way to pass this via JDBC calls too. I have added a “group1” and a “group2” selector to the example configuration below, plus “admin”, and “adhoc” for everything else.

Step 4: Test it! Examine Presto’s log when you start the cluster to make sure your json config is valid. If all is well run a query with the Presto CLI and specify the resource group like this:

$ presto --server localhost:8090 --source group1
presto> show catalogs;

Check that the group is being picked up correctly in the Presto UI – the right group (“group1”) should be displayed for the test query.


There is a great example and more detailed description of all the params in the docs at https://prestodb.io/docs/current/admin/resource-groups.html

Once you get the basic resource groups set up you can tune them. Consider using the optional schedulingPolicy, which controls how queued queries are selected to run next. Also, if your two resource groups have differing importance you can set their schedulingWeight (default is 1) to control how their queued queries are selected for execution – higher weight means higher priority; e.g. users’ adhoc/interactive queries might be set to 10, while batch/ETL-type queries may be left at 1. You can also have Presto auto-recognise DDL and treat such queries differently with their own group.

Here’s a sample etc/resource-groups.json file with the four groups defined:

{
  "rootGroups": [
    {
      "name": "global"
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "jmxExport": true,
      "subGroups": [
        {
          "name": "group1",
          "softMemoryLimit": "40%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "group2",
          "softMemoryLimit": "30%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "admin",
      "group": "admin"
    },
    {
      "source": ".*group1.*",
      "group": "global.group1"
    },
    {
      "source": ".*group2.*",
      "group": "global.group2"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ]
}

I hope that helps you get you up and running with Presto resource groups.

How To Connect To Presto

Connections to and from Presto fall into three main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. There are many connectors and they are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive, which gives query access to data in HDFS and S3 stored in a variety of formats. Other connectors provide access to relational databases like Redshift, Postgres and MySQL, and to NoSQL sources like Elasticsearch and MongoDB.

2) Applications that work with Presto out of the box – these include Apache Superset, Airpal, and the Presto CLI.

3) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax: http://<prestoServerHost>:<port>/v1/query. The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u: http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above How To helps you Connect To Presto.


How To Access Presto

There are several ways to access and use Presto. Here are 3 of the most common ways:

  1. Command Line Interface

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

Download presto-cli-0.nnn-executable.jar from Presto’s Downloads page at https://prestodb.io/download.html, rename it to presto, make it executable with chmod +x, then run it like this:

./presto --server localhost:8080 --catalog hive --schema default --user xxx --password

On a Mac you can also use Brew to install the Presto CLI.  Simply type brew install presto on the command line.

  2. Apache Superset

Superset enables users to consume data in many different ways: writing SQL queries, creating tables, and creating visualizations and sophisticated dashboards. Superset can be downloaded from https://superset.incubator.apache.org

  3. JDBC/ODBC Drivers

A huge number of tools and apps can be connected to Presto via JDBC and ODBC.  JDBC driver details and download is here: https://prestodb.io/docs/current/installation/jdbc.html

For ODBC, several free and paid-for driver options exist.

In addition some applications have drivers for Presto built-in – Tableau is one such  example.

More tools, APIs and information on how to access presto can be found on the Presto Resources page at https://prestodb.io/resources.html.

Machine Learning With Presto

Presto is an invaluable tool for accessing and combining data, and performing feature engineering and feature extraction, in a scalable way ready for machine learning and other analytics.  Doing these tasks using familiar SQL is the fastest, most efficient way to prepare data for ML. Presto’s connectors, SQL engine and query federation capability give data scientists ready access to the data they need. 

Presto also has functions specific to Machine Learning.  These functions provide machine learning functionality as an aggregation function, enabling you to train Support Vector Machine (SVM) based classifiers and regressors for supervised learning problems. Note SVM is the only supported model at this time. 

In this short article we’ll take a look at these Presto ML Functions: 

  • learn_classifier()  – Receives training data and generates the model.
  • classify() – Receives the model & test data and returns the prediction.
  • features() – Constructs a feature vector from numerical values. A feature vector is a representation of a data point composed of a sequence of index/value pairs.

There follows a worked Classification example showing how these functions are used.

This is how the features function is used to construct a numbered sequence of elements composed of numerical values: 

presto:default> SELECT features(2.0, 4.0, 6.5) AS test_features;
     test_features     
-----------------------
 {0=2.0, 1=4.0, 2=6.5} 
(1 row)

The returned features (more formally, an N-dimensional feature vector) are represented as a map-type value, whose keys are the indexes of the features.
Let’s create some test data and demonstrate all the functions at work.  The advantage of using Presto is that the data could be stored practically anywhere, and for more complex problems there may be several datasets stored across different data stores which Presto can effortlessly combine for us.  In this (well known) flower dataset, species is a label and sepal_length, sepal_width etc are features. This data describes three species of Iris:

use ahana_hive.default;
 
CREATE TABLE iris (
  id int
, sepal_length double
, sepal_width double
, petal_length double
, petal_width double
, species varchar
);
 
INSERT INTO iris VALUES 
(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
(2, 4.9, 3, 1.4, 0.2, 'Iris-setosa'),
(3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
(4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'),
(5, 5, 3.6, 1.4, 0.2, 'Iris-setosa'),
(6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'),
(7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'),
(8, 5, 3.4, 1.5, 0.2, 'Iris-setosa'),
(9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa'),
(10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(11, 5.4, 3.7, 1.5, 0.2, 'Iris-setosa'),
(12, 4.8, 3.4, 1.6, 0.2, 'Iris-setosa'),
(13, 4.8, 3, 1.4, 0.1, 'Iris-setosa'),
(14, 4.3, 3, 1.1, 0.1, 'Iris-setosa'),
(15, 5.8, 4, 1.2, 0.2, 'Iris-setosa'),
(16, 5.7, 4.4, 1.5, 0.4, 'Iris-setosa'),
(17, 5.4, 3.9, 1.3, 0.4, 'Iris-setosa'),
(18, 5.1, 3.5, 1.4, 0.3, 'Iris-setosa'),
(19, 5.7, 3.8, 1.7, 0.3, 'Iris-setosa'),
(20, 5.1, 3.8, 1.5, 0.3, 'Iris-setosa'),
(21, 5.4, 3.4, 1.7, 0.2, 'Iris-setosa'),
(22, 5.1, 3.7, 1.5, 0.4, 'Iris-setosa'),
(23, 4.6, 3.6, 1, 0.2, 'Iris-setosa'),
(24, 5.1, 3.3, 1.7, 0.5, 'Iris-setosa'),
(25, 4.8, 3.4, 1.9, 0.2, 'Iris-setosa'),
(26, 5, 3, 1.6, 0.2, 'Iris-setosa'),
(27, 5, 3.4, 1.6, 0.4, 'Iris-setosa'),
(28, 5.2, 3.5, 1.5, 0.2, 'Iris-setosa'),
(29, 5.2, 3.4, 1.4, 0.2, 'Iris-setosa'),
(30, 4.7, 3.2, 1.6, 0.2, 'Iris-setosa'),
(31, 4.8, 3.1, 1.6, 0.2, 'Iris-setosa'),
(32, 5.4, 3.4, 1.5, 0.4, 'Iris-setosa'),
(33, 5.2, 4.1, 1.5, 0.1, 'Iris-setosa'),
(34, 5.5, 4.2, 1.4, 0.2, 'Iris-setosa'),
(35, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(36, 5, 3.2, 1.2, 0.2, 'Iris-setosa'),
(37, 5.5, 3.5, 1.3, 0.2, 'Iris-setosa'),
(38, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'),
(39, 4.4, 3, 1.3, 0.2, 'Iris-setosa'),
(40, 5.1, 3.4, 1.5, 0.2, 'Iris-setosa'),
(41, 5, 3.5, 1.3, 0.3, 'Iris-setosa'),
(42, 4.5, 2.3, 1.3, 0.3, 'Iris-setosa'),
(43, 4.4, 3.2, 1.3, 0.2, 'Iris-setosa'),
(44, 5, 3.5, 1.6, 0.6, 'Iris-setosa'),
(45, 5.1, 3.8, 1.9, 0.4, 'Iris-setosa'),
(46, 4.8, 3, 1.4, 0.3, 'Iris-setosa'),
(47, 5.1, 3.8, 1.6, 0.2, 'Iris-setosa'),
(48, 4.6, 3.2, 1.4, 0.2, 'Iris-setosa'),
(49, 5.3, 3.7, 1.5, 0.2, 'Iris-setosa'),
(50, 5, 3.3, 1.4, 0.2, 'Iris-setosa'),
(51, 7, 3.2, 4.7, 1.4, 'Iris-versicolor'),
(52, 6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
(53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'),
(54, 5.5, 2.3, 4, 1.3, 'Iris-versicolor'),
(55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'),
(56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'),
(57, 6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'),
(58, 4.9, 2.4, 3.3, 1, 'Iris-versicolor'),
(59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'),
(60, 5.2, 2.7, 3.9, 1.4, 'Iris-versicolor'),
(61, 5, 2, 3.5, 1, 'Iris-versicolor'),
(62, 5.9, 3, 4.2, 1.5, 'Iris-versicolor'),
(63, 6, 2.2, 4, 1, 'Iris-versicolor'),
(64, 6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'),
(65, 5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'),
(66, 6.7, 3.1, 4.4, 1.4, 'Iris-versicolor'),
(67, 5.6, 3, 4.5, 1.5, 'Iris-versicolor'),
(68, 5.8, 2.7, 4.1, 1, 'Iris-versicolor'),
(69, 6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'),
(70, 5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'),
(71, 5.9, 3.2, 4.8, 1.8, 'Iris-versicolor'),
(72, 6.1, 2.8, 4, 1.3, 'Iris-versicolor'),
(73, 6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'),
(74, 6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'),
(75, 6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(76, 6.6, 3, 4.4, 1.4, 'Iris-versicolor'),
(77, 6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'),
(78, 6.7, 3, 5, 1.7, 'Iris-versicolor'),
(79, 6, 2.9, 4.5, 1.5, 'Iris-versicolor'),
(80, 5.7, 2.6, 3.5, 1, 'Iris-versicolor'),
(81, 5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'),
(82, 5.5, 2.4, 3.7, 1, 'Iris-versicolor'),
(83, 5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'),
(84, 6, 2.7, 5.1, 1.6, 'Iris-versicolor'),
(85, 5.4, 3, 4.5, 1.5, 'Iris-versicolor'),
(86, 6, 3.4, 4.5, 1.6, 'Iris-versicolor'),
(87, 6.7, 3.1, 4.7, 1.5, 'Iris-versicolor'),
(88, 6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'),
(89, 5.6, 3, 4.1, 1.3, 'Iris-versicolor'),
(90, 5.5, 2.5, 4, 1.3, 'Iris-versicolor'),
(91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'),
(92, 6.1, 3, 4.6, 1.4, 'Iris-versicolor'),
(93, 5.8, 2.6, 4, 1.2, 'Iris-versicolor'),
(94, 5, 2.3, 3.3, 1, 'Iris-versicolor'),
(95, 5.6, 2.7, 4.2, 1.3, 'Iris-versicolor'),
(96, 5.7, 3, 4.2, 1.2, 'Iris-versicolor'),
(97, 5.7, 2.9, 4.2, 1.3, 'Iris-versicolor'),
(98, 6.2, 2.9, 4.3, 1.3, 'Iris-versicolor'),
(99, 5.1, 2.5, 3, 1.1, 'Iris-versicolor'),
(100, 5.7, 2.8, 4.1, 1.3, 'Iris-versicolor'),
(101, 6.3, 3.3, 6, 2.5, 'Iris-virginica'),
(102, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(103, 7.1, 3, 5.9, 2.1, 'Iris-virginica'),
(104, 6.3, 2.9, 5.6, 1.8, 'Iris-virginica'),
(105, 6.5, 3, 5.8, 2.2, 'Iris-virginica'),
(106, 7.6, 3, 6.6, 2.1, 'Iris-virginica'),
(107, 4.9, 2.5, 4.5, 1.7, 'Iris-virginica'),
(108, 7.3, 2.9, 6.3, 1.8, 'Iris-virginica'),
(109, 6.7, 2.5, 5.8, 1.8, 'Iris-virginica'),
(110, 7.2, 3.6, 6.1, 2.5, 'Iris-virginica'),
(111, 6.5, 3.2, 5.1, 2, 'Iris-virginica'),
(112, 6.4, 2.7, 5.3, 1.9, 'Iris-virginica'),
(113, 6.8, 3, 5.5, 2.1, 'Iris-virginica'),
(114, 5.7, 2.5, 5, 2, 'Iris-virginica'),
(115, 5.8, 2.8, 5.1, 2.4, 'Iris-virginica'),
(116, 6.4, 3.2, 5.3, 2.3, 'Iris-virginica'),
(117, 6.5, 3, 5.5, 1.8, 'Iris-virginica'),
(118, 7.7, 3.8, 6.7, 2.2, 'Iris-virginica'),
(119, 7.7, 2.6, 6.9, 2.3, 'Iris-virginica'),
(120, 6, 2.2, 5, 1.5, 'Iris-virginica'),
(121, 6.9, 3.2, 5.7, 2.3, 'Iris-virginica'),
(122, 5.6, 2.8, 4.9, 2, 'Iris-virginica'),
(123, 7.7, 2.8, 6.7, 2, 'Iris-virginica'),
(124, 6.3, 2.7, 4.9, 1.8, 'Iris-virginica'),
(125, 6.7, 3.3, 5.7, 2.1, 'Iris-virginica'),
(126, 7.2, 3.2, 6, 1.8, 'Iris-virginica'),
(127, 6.2, 2.8, 4.8, 1.8, 'Iris-virginica'),
(128, 6.1, 3, 4.9, 1.8, 'Iris-virginica'),
(129, 6.4, 2.8, 5.6, 2.1, 'Iris-virginica'),
(130, 7.2, 3, 5.8, 1.6, 'Iris-virginica'),
(131, 7.4, 2.8, 6.1, 1.9, 'Iris-virginica'),
(132, 7.9, 3.8, 6.4, 2, 'Iris-virginica'),
(133, 6.4, 2.8, 5.6, 2.2, 'Iris-virginica'),
(134, 6.3, 2.8, 5.1, 1.5, 'Iris-virginica'),
(135, 6.1, 2.6, 5.6, 1.4, 'Iris-virginica'),
(136, 7.7, 3, 6.1, 2.3, 'Iris-virginica'),
(137, 6.3, 3.4, 5.6, 2.4, 'Iris-virginica'),
(138, 6.4, 3.1, 5.5, 1.8, 'Iris-virginica'),
(139, 6, 3, 4.8, 1.8, 'Iris-virginica'),
(140, 6.9, 3.1, 5.4, 2.1, 'Iris-virginica'),
(141, 6.7, 3.1, 5.6, 2.4, 'Iris-virginica'),
(142, 6.9, 3.1, 5.1, 2.3, 'Iris-virginica'),
(143, 5.8, 2.7, 5.1, 1.9, 'Iris-virginica'),
(144, 6.8, 3.2, 5.9, 2.3, 'Iris-virginica'),
(145, 6.7, 3.3, 5.7, 2.5, 'Iris-virginica'),
(146, 6.7, 3, 5.2, 2.3, 'Iris-virginica'),
(147, 6.3, 2.5, 5, 1.9, 'Iris-virginica'),
(148, 6.5, 3, 5.2, 2, 'Iris-virginica'),
(149, 6.2, 3.4, 5.4, 2.3, 'Iris-virginica'),
(150, 5.9, 3, 5.1, 1.8, 'Iris-virginica');

Next we use this dataset and the classify(), features() and learn_classifier() functions to predict the species of flower based on some arbitrary features (measurements) that we provide.  The exact features used in the following example are not present in the iris dataset so it’s not a simple lookup – the model predicts the label (the flower species) for us:

select classify(features(5.8, 3, 5.1, 1.8), model) AS prediction 
FROM (
 SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width)) AS model FROM iris
 ) t;
 
   prediction   
----------------
 Iris-virginica 
(1 row)

The model predicts our measurements are an “Iris-virginica”. How does it work? The classify() function returns the predicted label by using the trained model. The trained model is constructed by the learn_classifier() function.  Using this approach means we cannot save the model so we pass it using the nested query (SELECT learn_classifier…) each time.

We can optionally view the trained model, but it does not make much sense to a human. Run the function to train the classification model again but omit the classify() function:

SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width) ) AS model
FROM iris;
 
                      model                      
-------------------------------------------------
 3c 43 6c 61 73 73 69 66 69 65 72 28 76 61 72 63 
 68 61 72 29 3e                                  
(1 row)

The trained model is returned in a serialized format. 

We hope the above Presto Machine Learning article and Classification example proves useful.  Stay tuned for a follow-up article introducing Regression – another type of supervised learning problem that Presto can help with.

How to Make The Presto Query Engine Run Fastest

How do you make the Presto query engine run as fast as possible?  The following list of things to consider, check and tune will help you get the most out of your Presto clusters:

  • How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger server/cloud instance, as more CPU and memory could help.

  • Have you configured Presto’s memory usage correctly?  It is often necessary to change the default memory configuration based on your cluster’s capacity.  The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
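
As a rough illustration only, here is what a node’s etc/jvm.config might look like on an instance with roughly 80 GB of RAM, following the 80% rule of thumb above. The surrounding JVM flags are typical of the examples in the Presto deployment documentation; the -Xmx value is a placeholder you would set for your own hardware.

-server
-Xmx64G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError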

  • What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 type:
[Table: network performance ratings for AWS m4 instance types]
  • Optimize your metadata / data catalog:  Using Presto’s Hive connector for your metastore, like many users do, means practically every query will access the Hive metastore for table and partition details etc.  During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider the following (a hedged configuration sketch follows this list):
  1. Set up multiple metastore endpoints. Presto’s Hive connector supports configuring multiple thrift metastore endpoints, which are tried in round-robin by the coordinator. https://prestodb.io/docs/current/connector/hive.html 
  2. Enable Hive metastore caching and carefully tune the cache eviction configuration and TTLs to suit your data refresh policies
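
As a rough sketch only: a Hive catalog properties file (for example etc/catalog/hive.properties) with two metastore endpoints and metastore caching enabled might look something like the following. The hostnames are placeholders, and the exact cache property names can vary between Presto versions, so check the Hive connector documentation for your release.

connector.name=hive-hadoop2
# Two thrift metastore endpoints, tried in round-robin by the coordinator
hive.metastore.uri=thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083
# Cache metastore responses; tune the TTL and refresh interval to your data refresh cadence
hive.metastore-cache-ttl=20m
hive.metastore-refresh-interval=10m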

  • Do you have a separate coordinator node? With PrestoDB you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes but it’s obviously not optimal in terms of performance.  It is nearly always recommended to have the coordinator running on a separate node to the workers for anything other than sandbox use.  Tip:  Check your nodes’ Presto etc/config.properties files to determine which one is the coordinator (look for coordinator=true)
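
To illustrate, here is a minimal sketch of the kind of etc/config.properties you would expect on a dedicated coordinator versus a worker, based on the standard Presto deployment documentation; the port, memory limits and discovery URI are placeholders you would set for your own cluster.

# Coordinator node
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# Worker node
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery.uri=http://coordinator.example.com:8080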

  • Is memory exhausted? Presto uses an in-memory, pipelined processing architecture, and its operation depends on the available JVM heap, which in turn depends on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running on.
    • The workers can be memory hungry when processing very large queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload, the resources available, etc.  It’s not uncommon to see the coordinator alone consuming several tens of GB of memory.

The good news is there is memory information available in at least two places:

1. Presto’s built-in JMX catalog can help you monitor memory usage with various counters (see the example query after this list).  Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking.

2. There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.
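
For example, assuming you have configured a jmx catalog (a catalog properties file containing connector.name=jmx), you can explore memory-related MBeans and query runtime information with plain SQL. The exact MBean and column names can vary between Presto versions, so treat this as a sketch:

-- List memory-related MBeans exposed through the JMX connector
SHOW TABLES FROM jmx.current LIKE '%memory%';

-- JVM runtime information for every node in the cluster
SELECT node, vmname, uptime
FROM jmx.current."java.lang:type=runtime";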

  • When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can solve all sorts of issues, including memory leaks and garbage collection problems.
  • Is your Presto cluster configured for autoscaling based on CPU usage?  If so check the configuration is what you expect it to be.

  • Do IO and CPU utilisation look balanced?  Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads can be increased, or that the number of splits in a batch is not high enough. You can check these stats in the Presto Console’s UI “Cluster Overview”.
  • Have you checked your data volumes recently? An obvious one to check but data volumes can grow unexpectedly and sometimes growth peaks occur without you noticing. The queries may simply be taking longer because there is x% more data than last month.

There are other configuration settings for task concurrency, initial splits per node, join strategy, driver tasks and more: PrestoDB has around 82 system configuration settings and 50+ Hive configuration settings that users can tweak, many at the query level. These are, however, for advanced users and fall outside the scope of this article. More information can be found in the PrestoDB documentation.

We hope the above information makes your Presto clusters run as fast as they can. Click here for more about Ahana

How do I connect to Presto?

Presto connections – connections to and from Presto – fall into two main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. The connectors are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is the Hive connector, which gives query access to HDFS and S3 data stored in a variety of formats.

2) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

Presto JDBC Driver

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following Presto JDBC connector URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

Presto ODBC

Several free and paid-for options exist:

How to Connect to Presto Using Java

Connecting to Presto with Java requires the Presto JDBC driver (see above). By using the Presto JDBC driver, you can easily access and query your Presto cluster from Java. It allows you to run Presto queries from Java applications and provides an easy-to-use API for working with your Presto cluster.

After downloading and adding the jar file to your development environment, you can make a connection to the Presto cluster. To do this, you’ll need to provide the connection string and credentials. 

Here is an example of a connection string: jdbc:presto://host:port/catalog 

You can also provide additional parameters such as the catalog and schema, for example: jdbc:presto://host:port/catalog/schema 

Once connected, you can use a Statement object to execute SQL queries against the Presto cluster.

How to Connect Presto With Python

Python can be used to access Presto clusters, as long as the proper libraries are installed. The most popular library for connecting to Presto with Python is presto-python-client (imported as prestodb). This library provides a Python interface to the Presto HTTP API, allowing you to make queries and retrieve results from the Presto cluster.

The library can be installed with pip: pip install presto-python-client

Once this is done, you can use the library to make a connection to the Presto cluster. You will need to provide the host and port of the Presto server and the catalog and schema to use. Then you can use a cursor object to execute SQL statements against the Presto cluster.

You can find this example in the project’s GitHub repository; it queries the system.runtime.nodes table:

import prestodb

# Connect to the Presto coordinator (adjust host, port, user, catalog and schema)
conn = prestodb.dbapi.connect(
    host='localhost',
    port=8080,
    user='the-user',
    catalog='the-catalog',
    schema='the-schema',
)
cur = conn.cursor()
cur.execute('SELECT * FROM system.runtime.nodes')
rows = cur.fetchall()

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above information helps you with creating Presto Connections.

How to Presto Data Share

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables data sharing in two ways. 

1) With its broad list of data connectors, Presto provides a simple way to exploit data virtualization with federated SQL queries accessing data across multiple data sources.  Data in RDBMSs, NoSQL databases, legacy data warehouses, files in object storage can all be accessed,  and combined in a single SQL query if required.  Presto allows querying data where it lives, like in Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization by a variety of users and applications.  A complete list of Presto’s connectors can be found at  https://prestodb.io/docs/current/connector.html 

2) Sharing access to data by connecting your Presto cluster to other systems, tools and applications. This is achieved using the Presto API, or using SQL via client libraries, ODBC or JDBC interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query 

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for options exist:

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries 

When it comes to sharing and using data with Presto, we hope the above information is useful.

Presto Performance: Speed, Optimization & Tuning

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Performance is particularly important for ad-hoc analytical queries so Presto has a number of design features to maximize Presto speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design.  In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are MANY different types of storage that can be queried by Presto, some faster than others. So if you can choose the fastest data source this will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both. Or switching from accessing a data source that is busy dealing with lots of users and therefore high levels of contention, to an alternative – perhaps a read replica of a database. Or creating indexes, or accessing a pre-aggregated version of the data. Or perhaps moving portions of frequently used data from object storage to a faster storage layer like a RDBMS in order to meet a strict query SLA.  Other suggestions include switching to one of Presto’s supported file formats that features performance optimizations like ORC or Parquet, and consider enabling compression. 

Presto Performance Tips

The following recommendations can help you achieve maximum performance from your Presto clusters:  

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto.  A useful rule of thumb is:  In each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible.  Use the SQL ANALYZE <tablename> command to do this (note that Presto’s syntax is ANALYZE, without the TABLE keyword). Repeat the ANALYZE commands for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore.  For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'FULL'); Do this right after you create the table, and repeat this when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type = 'AUTOMATIC';  
    • SET session join_reordering_strategy = 'AUTOMATIC'; 

(both SETs are needed – and persist until the session ends or you log out).  

  • You should enable Dynamic Filtering when one or more joins are in play, especially if a smaller dimension table is being joined to a much larger fact table. Dynamic Filtering is pushed down to the ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve the performance of hash joins. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 


Finally, to assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 
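
For instance, assuming the built-in TPC-H connector is configured as a catalog named tpch (a catalog properties file containing connector.name=tpch), you can generate and query benchmark data at different scale factors without loading anything:

-- Row count for the lineitem table at scale factor 1 (roughly a 1GB dataset)
SELECT count(*) FROM tpch.sf1.lineitem;

-- The same table at scale factor 1000 approximates the 1TB dataset mentioned above
SELECT count(*) FROM tpch.sf1000.lineitem;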

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service. 

We hope these Presto performance recommendations help you get the most out of Presto. If you’re interested in getting more information about Ahana, the easiest managed service for Presto on AWS, check out the case study comparing AWS Athena to Ahana.

Presto REST API

Presto’s HTTP API (Presto Rest API) is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto Parquet

Parquet is a columnar storage format for Hadoop, supported by Presto.  Storing data as columns as opposed to rows enables Presto to fulfill queries more efficiently. By discarding unwanted data in rows, and preventing unnecessary scans, columnar storage saves disk space and improves query performance for larger data sets.

In Parquet, data is first horizontally partitioned into groups of rows, then within each group, data is vertically partitioned into columns. Data for a particular column is stored together via compression and encoding to save space and improve performance. Each Parquet file has a footer that stores codecs and encoding information, as well as column-level statistics, e.g., the minimum and maximum values of each column. 

There have been several improvements made to Presto’s Parquet reader by the community, most notably by Uber, to enhance performance with features such as pushdown.  

For developers, a useful tool-set for working with Parquet files is parquet-tools, which allows you to see the schema contained in a file, view data, examine data types, etc.  parquet-tools can be installed with pip (https://pypi.org/project/parquet-tools/), installed on a Mac with brew install parquet-tools, or cloned and built from https://github.com/apache/parquet-mr

Presto’s Parquet support is a popular choice for data storage.

Presto Speed

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design. 

How fast is Presto?

A well-optimized Presto cluster with enough nodes provisioned can provide blazing fast performance. In various benchmarking tests, Presto has been shown to outperform many popular data warehouses including Amazon Redshift.

In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory SQL query engine, it can only process data as fast as the storage layer can provide it. There are MANY different types of storage that can be queried by Presto, some faster than others. So if you can choose the fastest data source this will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both. Or switching from accessing a data source that is busy dealing with lots of users and therefore high levels of contention, to an alternative – perhaps a read replica of a database. Or creating indexes, or accessing a pre-aggregated version of the data. Or perhaps moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA.  Other suggestions include switching to one of Presto’s supported file formats that features performance optimizations like ORC or Parquet, and consider enabling compression. 


To assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more Presto speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service

Presto Tools

Several tools and libraries are available; some are bundled with core Presto and others are downloadable extras. See the documentation Resources page at https://prestodb.io/resources.html for details. 

Core Presto Tools:

  • Presto Console – Included with Presto, this browser-based UI allows administrators to monitor many aspects of their Presto deployment in real time, such as query workload, memory usage, execution plans etc. 

Community Contributed Tools 

Several community tools are listed at https://prestodb.io/resources.html including:

Other tools & Libraries 

Presto Documentation

Presto’s documentation is maintained by the Presto Foundation and the current version is available here: https://prestodb.io/docs/current/ 

Releases can be found at: https://github.com/prestodb and the Release Notes are at: https://prestodb.io/docs/current/release.html 

Currently the documentation site has no search function. Use Google to search the docs instead – for example if you want to find the Presto documentation page for window functions just google site:prestodb.io window functions and click the first match.

Some of the most popular and useful documentation pages are listed below for convenience: 

In addition the documentation for Ahana Cloud for Presto – a fully integrated, cloud-native managed service built for AWS – can be found at https://docs.ahana.cloud/docs

That’s your summary of Presto documentation. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Query Example

Since Presto is an ANSI SQL query engine, Presto’s SQL will be very familiar to practically anyone who has used a database, despite the fact Presto is not technically a database since it stores no data itself.

Starting with the SELECT statement, the full range of keywords that you would expect to use are supported including DISTINCT, FROM, WHERE, GROUP BY, HAVING, UNION | INTERSECT | EXCEPT, ORDER BY, LIMIT and so on.  More details can be found on the Presto SELECT doc page: https://prestodb.io/docs/current/sql/select.html 

Like most other database products, CTEs are also supported – Common Table Expressions allow you to define a temporary named result set that is available in the execution scope of a statement such as SELECT or INSERT.  Here’s the same query written without and then with a CTE:

--Without CTE
SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
 
--WITH CTE
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Here’s a CTE Presto query example:

presto> WITH my_time_data (the_time_now) as ( select now() ) 
SELECT the_time_now 
FROM my_time_data;
 
             the_time_now              
---------------------------------------
 2021-01-18 12:23:12.167 Europe/London 
(1 row)

The above Presto query example and documentation link should get you up and running writing Presto queries. 

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Best Practices to Improve Performance and Reduce Costs

We’ve put together a list of Presto best practices, based on our experience. We hope you find these helpful.

Configuration

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto.  A useful rule of thumb is:  In each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 

File Format

  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.

Partitioning and Bucketing

  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. A sketch combining ORC, partitioning and bucketing follows below. 
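
To illustrate the file format, partitioning and bucketing tips together, here is a hedged CTAS sketch against a Hive catalog. The catalog, schema, table and column names are made up for the example; note that in the Hive connector the partition columns must be the last columns in the SELECT list:

CREATE TABLE hive.default.orders_by_date
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['order_date'],
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 50
)
AS
SELECT order_id, total_price, customer_id, order_date
FROM hive.default.orders;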

Metastore

  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

Query Optimization

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible.  Use the SQL ANALYZE <tablename> command to do this (note that Presto’s syntax is ANALYZE, without the TABLE keyword). Repeat the ANALYZE commands for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore.  For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'FULL'); Do this right after you create the table, and repeat this when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type = 'AUTOMATIC';  
    • SET session join_reordering_strategy = 'AUTOMATIC'; 

(both SETs are needed – and persist until the session ends or you log out). A combined example covering statistics, partition metadata and the CBO session settings follows below.  
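
Putting these query-optimization tips together, a typical maintenance-and-session sequence might look like the sketch below. The schema and table names are placeholders; run the CALL with the Hive catalog selected as the session catalog (or prefix the procedure with the catalog name):

-- Refresh statistics after data has substantially changed (Presto uses ANALYZE, not ANALYZE TABLE)
ANALYZE orders_by_date;

-- Sync the metastore with partitions added outside of Presto
CALL system.sync_partition_metadata('default', 'orders_by_date', 'FULL');

-- Enable cost-based join distribution and re-ordering for this session
SET SESSION join_distribution_type = 'AUTOMATIC';
SET SESSION join_reordering_strategy = 'AUTOMATIC';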

Dynamic Filtering

  • You should enable Dynamic Filtering when one or more joins are in play, especially if a smaller dimension table is being joined to a much larger fact table. Dynamic Filtering is pushed down to the ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve the performance of hash joins. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 

Workload Management

  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 

We hope you find these Presto best practices useful. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto (and PrestoDB) Architecture and Design

The basics

The Presto architecture and design seamlessly integrates with an existing data ecosystem without needing to modify it. It’s an additional faster data access interface that turbocharges your existing stack. Presto separates compute from storage; its federated query engine lets teams run analytics for a variety of use cases across a wide range of data lakes and databases in-place, with no need to move data. It doesn’t store the data, which has the big advantage of being able to scale resources for queries up and down based on the demand. 

Presto supports a decoupled architecture

Most cloud deployments leverage object storage, which is already separated from the compute layer, and auto-scale to optimize resource costs. So the separation of compute and storage makes Presto very suitable for cloud environments – though Presto can be deployed both in the cloud and on premises. 

Learn more about how Presto works with Amazon S3.

Presto nodes and how PrestoDB handles querying

Presto’s architecture is very similar to a classic database management system using cluster computing (MPP). Presto’s architecture consists of one coordinator node working together with multiple worker nodes. Clients submit their SQL queries, which are then parsed, analyzed and planned, with parallel tasks scheduled out to workers. Workers jointly process rows from the data sources and produce results that the client receives back. Presto does not write intermediate results to disk as its precursor Apache Hive did, and so you get significantly faster query speeds. Presto itself is written in the Java programming language.

Presto data sources and connectors

A single Presto query can combine data from multiple sources. Presto uses connectors to access various data sources; some of the more popular ones include HDFS, Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Redis, Kudu, Amazon Redshift, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo and MongoDB. A connector is similar in concept to a driver for a database. Several of the connectors are built into Presto, namely for Hive, JMX, TPC-H benchmark data, and a system connector which provides access to built-in system tables. Presto accesses other data sources using third-party connectors, and more and more are becoming available.

Presto Guide, Documentation & Support

Presto is a fast, flexible distributed SQL engine created and used by Facebook at scale. O’Reilly’s Learning and Operating Presto ebook is an approachable guide for using Presto and getting Presto into a production deployment. It is intended to make Presto more accessible to anyone who cares about their organization becoming data-driven. It will explain how to begin Presto operations at your organization to derive insights on datasets of any type and any size, wherever they reside. 

Regardless of whether you’re a current user of Presto, new to Presto, or going to provide Presto to your users, the O’Reilly Learning and Operating Presto guide book does an excellent job of explaining what you need to know.  By reading this Presto guide you will learn what Presto is, where it came from, and how it differs from other data warehousing solutions. You’ll discover why Facebook, Uber, Twitter, and cloud providers use Presto and how you can quickly deploy Presto in production. You’ll hear about who uses Presto and the ways they solve data challenges with Presto. 

You can supplement what you learn in this Presto guide by visiting the Presto website, reading the Ahana blog, and participating in Presto’s growing community, which is filled with experienced users, developers, and data architects. You can join the Slack channel, engage via Github Discussions, attend the Presto Virtual Meetups.  All these avenues will help guide you through Presto. You can also view official Presto documentation.

If you’re looking for official support for Presto, take a look at Ahana Cloud. We offer SaaS for Presto with pay as you go pricing and technical support.

Presto Data Sources

Presto was designed and written from the ground up to efficiently query data against data sources of all sizes, ranging from gigabytes to petabytes. Presto connects to a wide variety of data sources, from HDFS to traditional relational databases, as well as NoSQL data sources such as Cassandra. Presto is particularly equipped to perform multiple concurrent interactive queries against a data source.

Presto is obviously a great fit for companies that have disparate data sources. For those organizations that can’t consolidate all of their data into one centralized store, Presto’s data federation capabilities can create a unified query layer that enables you to blend your data across different data sources together. With Presto, you can leverage many data sources at once, which means Presto can handle very large volumes of data.

Some popular combinations include Presto with AWS S3, Presto with Cassandra, Presto with Accumulo, and more. Here are more data sources Presto connects to:

  • Accumulo
  • Alluxio
  • Amazon Redshift
  • Amazon S3
  • Cassandra
  • Druid
  • Elasticsearch
  • HDFS
  • Kafka
  • Kudu
  • Microsoft SQL Server
  • MongoDB
  • Phoenix
  • Pinot
  • RDS PostgreSQL
  • RDS MySQL
  • Redis
  • Teradata

If you want to get up and running with Presto quickly, check out Ahana Cloud which is SaaS for Presto.

Presto Server Types

Coordinators and workers are the two types of Presto servers. Here’s a look at each server type in more detail.

A Presto coordinator parses statements, plans queries, and manages Presto worker nodes. The coordinator is also the node to which a client connects to submit statements for execution. Every Presto installation must have a Presto coordinator alongside one or more Presto workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.

The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages, which are then translated into a series of connected tasks running on a cluster of Presto workers.

A Presto worker is a server in a Presto installation. Its basic job is to execute tasks assigned by the coordinator and to process data. When a worker starts up, it registers with the discovery server, which makes it available to the coordinator. Workers fetch data from connectors and exchange intermediate data with each other. The coordinator gathers the results from the workers and returns the final result to the client.

Workers use an HTTP-based protocol to communicate with both the coordinator and other workers.
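
You can see which node is acting as the coordinator directly from SQL, using the built-in system connector; for example:

-- List every node in the cluster, its version, and whether it is the coordinator
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;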

Is Presto Free?

Many folks may wonder “is Presto free”? In fact, PrestoDB is a free, open source federated, distributed SQL query engine used for ad hoc analytics. The PrestoDB AMI is 100% open source and available for use in production immediately. PrestoDB AMI is the first and only completely open source and completely free edition of the software.

At Ahana, we have a team of PrestoDB experts who are actively contributing to the open source project and community. Our mission is to help you be successful with your PrestoDB deployments. 

You can sign up for Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto. It is the easiest Presto experience ever, and requires no installation, no AWS AMIs or CFTs and no configuration. You can be running SQL queries on your AWS data within 30 minutes. You can sign up for free here.

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, solely on AWS. An EMR cluster with Spark is very different to an EMR Presto cluster:

  • EMR is a big data framework that allows you to automate provisioning, tuning, etc. for big data workloads. Presto is a distributed SQL query engine, also called a federation middle tier. Using EMR, users can spin up, scale and deploy Presto clusters. You can connect to many different data sources, some common integrations are: Presto Elasticsearch, Presto HBase connector, Presto AWS S3, and much more.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR.  Spark core does not support SQL – for SQL support you install the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive or ad hoc queries and is not designed for federating data from multiple sources; for this Presto is a better choice.

There are some similarities: EMR, Spark and Presto share distributed and parallel architectures, and are all designed for dealing with big data.  And PrestoDB is included in Amazon EMR release version 5.0.0 and later. 

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications.  EMR offers easy provisioning, auto-scaling, fault tolerance, and as you’d expect it has good integration with the AWS ecosystem like S3, DynamoDB and Redshift. An EMR cluster may be configured as “long running” or as a transient cluster that auto-terminates once the processing job(s) have completed.

EMR comes with some disadvantages:

  • EMR does not offer support for Presto – users must create their own Presto metastore, configure connectors, and install and configure the tools they need. 
  • EMR can be complex – if you have a database requirement, then provisioning EMR, Spark and S3, and ensuring you use the right file formats, networking, roles and security, can take much longer than deploying a packaged MPP database solution like Redshift. 
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain data like you can with EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes).  The only workaround is to store all your  data in EMR to S3 before each shutdown, and then ingest it all back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3 and manage the cost implications. 
  • On its own EMR doesn’t include any tools – no analytical tools, BI, Visualisation, SQL Lab or Notebooks. No Hbase or Flume. No hdfs access cli even. So you have to roll your own by doing the tool integrations yourself and deal with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time like you can with Presto, Cloudera, Spark, and most other frameworks. Similarly EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR – it uses the MapReduce execution framework which is designed for large batch processing and not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2, installing Hadoop and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster running Spark with Presto, your choice ultimately boils down to the use cases you are trying to solve. 

Spark SQL vs Presto

Spark SQL and Presto have become increasingly popular due to their capabilities in processing large amounts of data from various sources. In this blog post, we will dive deeper into Spark SQL and Presto, discussing their similarities, differences, and how they can be utilized to meet your specific data processing needs.

How these two tools are similar:

  • Both of these software frameworks are open source and are designed to handle large amounts of data. They operate in a distributed, parallel, and in-memory manner, which enables them to process data at high speeds.
  • BI tools connect with these frameworks using JDBC/ODBC connections.
  • They have been thoroughly tested and deployed by companies that process petabytes of data.
  • These frameworks can be executed either on-premises or in the cloud, and they can be containerized for a flexible and scalable deployment option.

Differences:

  • Presto is a query engine that provides access to and consolidation of data from various data sources using ANSI SQL:2003. It is generally deployed as a middle layer for federation. On the other hand, Spark is a versatile cluster-computing framework that does not natively support SQL. To add structured data processing capabilities to Spark, you must install the Spark SQL module, which has also been ANSI SQL:2003 compliant since Spark 2.0.
  • Presto is frequently used to support interactive SQL queries that are mainly analytical in nature but can also execute SQL-based ETL operations. Spark has more general-purpose applications, often utilized for data transformation and machine learning workloads.
  • By default, Presto allows querying of data in object stores like S3 and has many connectors available. It also works exceptionally well with Parquet and ORC format data. In contrast, Spark must rely on Hadoop file APIs to access S3 or purchase Databricks features. It also has limited connectors for data sources.

Many users today are comparing Presto and Spark. The above lays out many of the differences between Presto and Spark SQL and how the two can be compared.

If you want to deploy a Presto cluster on your own, we recommend checking out how Ahana manages Presto in the cloud. We put together this free tutorial that shows you how to create a Presto cluster.

You can see our previous guide to compare the Spark execution engine vs Presto, or our comparison between Databricks and AWS Athena.

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Catalogs

Presto has several important components that allow you to easily manage data. These components are catalogs, tables and schemas, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in each query. The catalogs contain schemas and information about where data is located. Every Presto catalog is associated with a specific Presto connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source. 

Catalogs are defined in properties files stored in the Presto configuration directory. A schema is what you use to organize your tables. Catalogs and schemas are how you define what will be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.

If you run a SQL statement in Presto, you are running it against one or more catalogs. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples of catalogs include a Hive catalog to connect to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add additional properties file to etc/catalog with a different name, making sure it ends in .properties. For example, if you name the property file sales.properties, Presto creates a catalog named sales using the configured connector.
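
As a minimal sketch, if sales were a PostgreSQL database, etc/catalog/sales.properties might look something like the following (the host, database name and credentials are placeholders):

connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/sales
connection-user=presto
connection-password=secret

Queries could then address tables through the new catalog, for example SELECT * FROM sales.public.orders (assuming an orders table exists in the public schema).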

Spark Streaming Alternatives

When researching Spark alternatives it really depends on your use case. Are you processing streaming data or batch data? Do you prefer an open or closed source/proprietary alternative?  Do you need SQL support?


With that in mind let’s look at ten closed-source alternatives to Spark Streaming first:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice.
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – a streaming data integration and operational intelligence platform designed to enable continuous query and processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines.  Features graphical stream visualizations
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

Here are five open source alternatives to Spark Streaming

  • Apache Flink
  • Apache Apex
  • Apache Beam
  • Apache Samza
  • Apache Storm

Details about each alternative:

  1. Apache Flink – considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for stream as well as batch processing at scale. It provides a fault-tolerant, operator-based model for streaming and computation rather than the micro-batch model of Apache Spark.
  2. Apache Beam – a workflow manager for batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
  3. Apache Apex – Enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – A distributed stream processing framework
  5. Apache Storm – distributed realtime computation system 

So there you have it. Hopefully you can now find a suitable alternative to Spark Streaming. Learn more about Spark SQL vs Presto in our comparison article, or learn about invoking the Spark engine from Presto.

Presto Engine and Database Explained

What is Presto?

PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With the Presto SQL engine, you can query data where it lives across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components. 

As more organizations become data-driven, they need technologies like Presto to deliver ad-hoc analytics. Federated query engines like Presto simplify and unify data analytics on data anywhere. 

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto, and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between PrestoDB and other forks?

Presto originated at Facebook, where it was built specifically for Facebook’s internal analytics. PrestoDB is backed by the Linux Foundation’s Presto Foundation and is the original Facebook open source project.

Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory used by Presto usually refers to the memory of the JVMs themselves; depending on query size and task complexity, you can allocate more or less memory to those JVMs. Presto itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto holds intermediate data in buffers while tasks are executing. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that lets you query other, disparate data sources.

What is the Presto query execution model?

The Presto query execution model is split into a few different concepts: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the query engine, it parses the statement and converts it into a query. When Presto executes the query, it breaks it up into multiple stages. Stages are then split into tasks that are distributed across the Presto workers; the tasks are the units that actually do the processing. Tasks use exchanges to share intermediate data and outputs with one another.
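
One way to see this breakdown is Presto’s EXPLAIN statement: with the DISTRIBUTED type it prints the plan fragments that become stages at execution time. The catalog, schema, and table names below are hypothetical.

-- Show the distributed plan (fragments/stages) without running the query
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*) AS order_count
FROM hive.sales.orders
GROUP BY region;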

Does Presto Use MapReduce?

No. While Hive’s execution model breaks a query down into MapReduce jobs that work on constituent data in HDFS, Presto uses its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto sits in the querying vertical. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in this vertical are designed within, or to work directly against, the Hadoop ecosystem. Through its connectors, Presto can query and share data across many of these systems.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and typically are stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This data is stored in a database such as MySQL and accessed via the Hive metastore service. Presto via the Hive connector is able to access both these components. 
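
As a rough illustration, once a Hive catalog is configured, Presto can browse the metastore’s schemas and tables and inspect their layout; the schema and table names below are hypothetical.

-- List schemas and tables known to the Hive metastore, then inspect one table
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.sales;
DESCRIBE hive.sales.orders;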

One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto Use Spark?

No – Presto and Spark are two separate query engines. At a high level, Spark supports complex, long-running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

Presto does not depend on YARN as a resource manager. Instead, it uses its own architecture of dedicated coordinator and worker nodes, which does not require a Hadoop infrastructure in order to run.
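
As a small sketch of that coordinator/worker model, Presto’s built-in system catalog exposes the cluster topology, with no YARN involved:

-- List the coordinator and worker nodes in the cluster
SELECT * FROM system.runtime.nodes;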

Data Lakehouse

Querying AWS S3 with Presto

If you’re looking for a query engine for AWS S3, Presto is a popular choice. Presto is an open source SQL query engine, well known for ad hoc analytics as well as reporting and dashboarding. While users may think of S3 buckets as a data source for the Presto query engine, from a Presto perspective, file systems and object stores are handled very differently than databases.

PrestoDB only needs access to the path of the data it needs to scan; it does not need details like the bucket name as part of its configuration. It gets the path from the catalog, which maps the file system / objects to databases and tables. There are two popular catalogs for object stores: Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). The Presto Hive connector is used to access the data in the object store underneath.
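
As a sketch of how S3 data gets registered in such a catalog (assuming a Hive connector catalog named hive, and a hypothetical bucket, path, and table), you can point an external table at an S3 prefix and query it in place:

-- Map an S3 prefix to a schema and an external table, then query it
CREATE SCHEMA IF NOT EXISTS hive.weblogs
WITH (location = 's3://my-example-bucket/weblogs/');

CREATE TABLE IF NOT EXISTS hive.weblogs.requests (
    request_time timestamp,
    url varchar,
    status_code integer
)
WITH (
    external_location = 's3://my-example-bucket/weblogs/requests/',
    format = 'PARQUET'
);

SELECT status_code, count(*) AS hits
FROM hive.weblogs.requests
GROUP BY status_code;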

How to query an AWS S3 bucket with Presto:

You can use Presto to query S3 data. There are two primary ways for Presto clusters to get access to data stored in S3:

  1. An IAM role attached to the instances the Presto servers are running on
  2. Access Key / Secret Key provided via the Hive connector properties file

IAM Role: Query S3 with Presto (Recommended Approach)

If using an IAM role, Presto needs to be configured with:

hive.s3.use-instance-credentials=true 

With this setting, the Presto server will have access to all the buckets that are accessible using the IAM role that the instance is bound to. The Hive Metastore also needs access to those buckets and should be bound to the same IAM role.

Access Key / Secret Key: Query S3 with Presto

If an access key / secret key is used, the Presto Hive connector needs to be configured with:

hive.s3.aws-access-key – Default AWS access key to use.
hive.s3.aws-secret-key – Default AWS secret key to use.

With these settings, the Presto server will have access to all the buckets that are accessible using this access key / secret key when this catalog is selected.

If you’re interested in getting started with Presto on S3, check out the Ahana Cloud platform, a managed service for Presto in AWS. You can add data sources with the click of a button and easily run Presto queries against your S3-based data lake.

What is an Open Data Lake in the Cloud?

The Open Data Lake in the cloud is an approach to managing and analyzing massive volumes of data on cloud object storage using open formats. Many companies are adopting this architecture because of its better price-performance, scale, and non-proprietary formats.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses.

Optimize Presto on Amazon EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto for scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets. You can move large volumes of data into and out of AWS datastores like S3 with Amazon EMR. AWS EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. 

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop tools, including Presto, Spark, and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. To persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. It’s not really a question of Presto vs. Hive on EMR; it’s more about how to use them together. Presto allows users to access data from multiple sources, allowing for analytics across an entire organization. A typical combination might be Presto on Hadoop for analytics.

Using Presto on Amazon EMR

Running Presto on Amazon EMR allows users to run interactive queries on large data sets with minimal setup time, since AWS EMR handles the provisioning, configuration, and tuning of the Hadoop cluster. Provided you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.

AWS EMR And Presto Configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on each cluster instance or read directly from S3. From EMR 5.12.0 onwards, Presto on EMR uses EMRFS to connect to Amazon S3 by default. EMRFS extends the HDFS API to S3, giving Hadoop applications like Presto access to data stored in S3 without additional configuration or copying of the data. For earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.

Real world applications


Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget those users with relevant ads. It sits between advertisers and their audiences, so real-time bidding for media advertising space is critical for their business. The amount of data Jampp generates as part of the bidding cycle is massive – 1.7B events are tracked per day, 550K requests/sec are received, and 100TB of data is processed by AWS elastic load balancers per day. PrestoDB plays a critical role in their data infrastructure. Jampp relies on Presto on AWS EMR for their ad hoc queries and performs over 3K ad hoc queries/day on over 600TB of queryable data.

Using Presto with Hadoop

How does Presto work with Hadoop?

You use Presto to run interactive queries on Hadoop. The difference between using Presto versus something like Hive, for instance, is that Presto is optimized for fast performance – this is crucial for interactive queries.

Presto’s distributed system runs on Hadoop and uses an architecture that’s similar to a massively parallel processing (MPP) database management system. 

Presto does not have its own storage system, so it is complementary to Hadoop. It can be installed with any implementation of Hadoop, such as Amazon’s EMR Hadoop distribution.

If you’re looking for a Presto Hadoop tutorial, check out our getting started page.

Can Presto connect to HDFS?

Yes, Presto connects with HDFS through the Hive connector. You can use the Hive connector to query data stored in HDFS (or AWS S3, for that matter). One of the big benefits of Presto is that you can query data files in varying formats, which makes it easy to analyze all of your HDFS data. You can check out more on the Hive connector in the prestodb docs.

Is Presto related to Hive? What are the differences?

Apache Hive was developed as a project at Facebook in 2008 so they could leverage SQL syntax in their Hadoop system. It simplifies complex Java MapReduce jobs into SQL-like queries while executing jobs at massive scale.

As their Hadoop deployments grew, the Facebook team found that Hive wasn’t optimized for the fast performance they needed for their interactive queries. So they built Presto in 2013, which could operate quickly at petabyte scale.

At a high level, Hive is optimized for batch processing and ease of use with its familiar SQL-like syntax, while Presto is highly optimized for low latency and fast, interactive performance – the same job typically takes longer to complete in Hive than in Presto.