Blog Archive

AWS Athena vs Snowflake

The High Level Overview

Snowflake and Amazon Athena are both cloud analytics tools, but they differ significantly in architecture. Athena is a serverless query engine based on open-source Presto technology, which uses Amazon S3 as the storage layer, whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

Snowflake is more often considered an alternative to Redshift or other cloud data warehouse technologies – typically used where workloads are predictable, or where organizations are willing to pay a premium for very fast query performance. Storing large volumes of semi-structured data in a data warehouse is typically expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Athena.

What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Users typically see up to 5x better price-performance compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads; and that this performance is ensured even when working at larger scale.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, especially compared to Athena.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s performance. Users generally have positive opinions about Snowflake’s performance but note its high cost; they also view Athena’s performance positively, but note potential performance issues and the inability to scale the service.

Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Athena

– Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. 

– Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast, reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment, so you avoid concurrency issues and get more deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads, without impacting performance; and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Athena

The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s scale. Users note potential limitations in certain features for both tools, although both are capable of querying large datasets.

Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.
Athena

– Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively.

– However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest to accomplish their tasks. We assume that data tools that use familiar languages and syntax, such as SQL, are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service, which means you get more control over your deployment than you would with Athena, while it also takes care of configuration parameters under the hood.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s usability. Users generally have positive opinions about Snowflake’s ease of use and configuration. They are also happy with the ease of deploying Athena in their AWS account, but mention drawbacks such as the lack of support for stored procedures and unclear error messages when debugging queries.

Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.
Athena

– Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple.

– However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana uses pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.
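To make the pricing models above concrete, here is a rough cost sketch in Python. It uses the list prices quoted above; the workload figures (terabytes scanned, credits, runtime) are made-up inputs for illustration only.

```python
# Rough cost estimates using the list prices quoted above.
# The workload numbers below are hypothetical inputs for illustration.

def athena_monthly_cost(tb_scanned: float, price_per_tb: float = 5.0) -> float:
    """Athena charges a flat $5 per terabyte of data scanned."""
    return tb_scanned * price_per_tb

def snowflake_compute_cost(credits: float, seconds: float,
                           per_credit_second: float = 0.00056) -> float:
    """Snowflake Standard Edition: $0.00056 per second per credit consumed."""
    return credits * seconds * per_credit_second

# A workload scanning 20 TB per month on Athena:
print(round(athena_monthly_cost(20), 2))  # 100.0

# A 4-credit Snowflake warehouse running 2 hours/day for 30 days:
print(round(snowflake_compute_cost(4, 2 * 3600 * 30), 2))  # 483.84
```

Note how the two models respond to different levers: compressing and partitioning data lowers the Athena figure directly, since it reduces bytes scanned, while Snowflake cost is driven mainly by how long warehouses run.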

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for your data lake and lakehouse architecture. Ahana gives you SQL on S3 with better price performance than Athena and no vendor-lock in as compared to Snowflake.

Webinar On-Demand

Data Lake, Real-time Analytics, or Both?

Exploring Presto and ClickHouse

Big data these days means big, fast, or both, and there are a lot of technologies that promise to fulfill various pieces of that big data architecture.

Join us for this webinar, delivered in partnership with Altinity, where we’ll explore open source big data solutions. We’ll contrast Presto, the leading SQL query engine for data lakes, with ClickHouse, the DBMS champ for real-time analytics. After framing the problem with relevant use cases, we’ll dig into solutions using Presto and ClickHouse.

You can expect a deep dive into the plumbing that exposes key trade-offs between approaches. Join us for an enlightening discussion filled with practical advice for your next big data project!

Speakers

Robert Hodges
CEO, Altinity

Robert Hodges is an entrepreneur and CEO of Altinity, a leading software and services provider for ClickHouse. Robert has more than 30 years of experience with database systems and applications, including pre-relational databases such as M204, online SQL transaction processing, Hadoop, and analytics. In recent years, his work has focused on analytical databases, Kubernetes, and open source.

Rohan Pednekar
Product Manager, Ahana

Rohan Pednekar is a Product Manager at Ahana, the Presto company, and the Chairperson of the Presto Conformance Program. At Ahana he works on open data lake analytics, currently focusing on performance, reliability, table format support, and security features for Presto. Before joining Ahana, Rohan worked at Hortonworks and Cloudera on open source projects such as Apache HBase and Apache Phoenix.

Starburst vs Snowflake

The High Level Overview

Starburst and Snowflake are both in the data analytics space, but they differ significantly in architecture and use cases. Starburst is the corporate entity behind Trino, a SQL query engine forked from the original Presto project, whereas Snowflake is a cloud data warehouse that stores data in a proprietary format, although it utilizes cloud storage to provide elasticity. An alternative to Starburst is Ahana Cloud, a managed service for Presto.

Snowflake is more often considered an alternative to Redshift or other cloud data warehouse technologies – typically used where workloads are predictable, or where organizations are willing to pay a premium for very fast query performance. Storing large volumes of semi-structured data in a data warehouse is typically expensive, and in these cases many organizations would consider a serverless alternative such as Ahana or Amazon Athena.

What is Starburst?

Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Snowflake?

Snowflake is a cloud-based data warehouse that provides a SQL interface for querying, loading, and analyzing data. It also provides tools for data sharing, security, and governance.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. It enables users to query, analyze, and process data from multiple sources.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst

Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Snowflake

The Snowflake website claims that Snowflake’s multi-cluster resource isolation ensures reliable, fast performance for both ad-hoc and batch workloads; and that this performance is ensured even when working at larger scale.
Ahana Cloud

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance, especially compared to Athena.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s performance. Users generally have positive opinions about Starburst’s performance, but find it difficult to customize and integrate with external databases; Snowflake’s performance is seen as an advantage, but users note that it is expensive for some use cases.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Many reviewers have generally positive opinions about Snowflake’s performance – although it’s clear from the reviews that this performance comes at a high cost. They mention positive aspects such as its ability to handle multiple users at once, instantaneous cluster scalability, fast query performance, and automatic compute scaling.

– Negative aspects mentioned include credit limits, expensive pricing for real-time use cases or large queries, cost of compute, time required to learn Snowflake’s scaling, and missing developer features.
Ahana Cloud

Ahana is similar to Athena in that you get fast, reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment, so you avoid concurrency issues and get more deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst

The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Snowflake

The Snowflake website claims that Snowflake can instantly and cost-efficiently scale to handle virtually any number of concurrent users and workloads, without impacting performance; and that Snowflake is built for high availability and high reliability, and designed to support effortless data management, security, governance, availability, and data resiliency.
Ahana Cloud

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s scale. Overall, users generally think that both Starburst and Snowflake are capable of handling larger volumes of data, but may encounter other scalability issues.

Starburst

– Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable.

– Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Snowflake

– Reviewers note that Snowflake is capable of handling larger volumes of data. They also mention that it has features such as cluster scalability, flexible pricing models, and integrations with third-party tools that can help with scaling. 

– However, some reviewers also mention potential limitations such as the lack of full functionality for unstructured data, the difficulty of pricing out the product, and the lack of command line tools for integration.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest to accomplish their tasks.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst

The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Snowflake

The Snowflake website claims that Snowflake is a fully managed service, which can help users automate infrastructure-related tasks; and that Snowflake provides robust SQL support and the Snowpark developer framework for Python, Java, and Scala, allowing customers to work with data in multiple ways.
Ahana Cloud

Ahana is a managed service, which means you get more control over your deployment than you would with Athena, while it also takes care of configuration parameters under the hood.

According to user reviews:

Below is a summary of claims made in user reviews on websites such as G2, Reddit, and Stack Overflow related to each tool’s usability. Overall, users generally find both Starburst and Snowflake easy to use, although they note some areas for improvement in each.

Starburst

– Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful.

– However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Snowflake

– Reviewers have mostly positive opinions about Snowflake’s ease of use and configuration. Several mention that Snowflake is easy to deploy, configure, and use, with many online training options available and no infrastructure maintenance required. 

– On the negative side, some reviews mention that there are too many tiers with their own credit limits, making it economically non-viable, and that the GUI for SQL Worksheets (Classic as well as Snowsight) could be improved. Additionally, some reviews mention that troubleshooting error messages and missing documentation can be challenging, and that they would like to see better POSIX support.

Cost

  • Starburst’s pricing is based on credits and cluster size. The examples given on the company’s pricing page hint at a minimum spend of a few thousand dollars per month.
  • Snowflake is priced based on two consumption-based metrics: usage of compute and of data storage, with different tiers available. Storage costs begin at a flat rate of $23 USD per compressed TB of data stored, while compute costs are $0.00056 per second for each credit consumed on Snowflake Standard Edition, and $0.0011 per second for each credit consumed on Business Critical Edition. 
  • Ahana Cloud uses pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.

As we can see, Snowflake follows data warehouse pricing models, where users pay both for storage and compute. A recurring theme in many of the reviews is that costs are hard to control, especially for real-time or big data use cases. Starburst’s pricing can be difficult to predict based on the information available online, but the company is clearly leaning towards an enterprise pricing model that looks at annual commitment rather than pay-as-you-go.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto in the cloud.

Presto and ETL – Clarifying Common Misconceptions

Data integration and ETL are typically seen as crucial to gaining insights from data. But when it comes to Presto, things get trickier: Should you use Presto for ETL? Or should you run ETL flows before querying data with Presto?

In this article, we will explore using Presto for ETL, how it compares to traditional data warehouse architectures, and whether ETL is necessary before querying data with Presto.

Looking for a better solution to run Presto in public clouds? Try Ahana for superior price-performance.

What is ETL in big data?

ETL (Extract, Transform, Load) is an iterative data integration process used in data warehousing to bring data from multiple sources into a single, centralized data repository. The process involves extracting data from its original source, cleaning and validating it, transforming it into a suitable format, and loading it into the data warehouse.

This process is necessary because data from different sources may have different formats and structures, and so it needs to be unified and organized before it can be used in data analysis and decision-making. ETL also allows for data to be regularly updated, so that the data warehouse is always up-to-date. Additionally, ETL tools allow data to be migrated between a variety of sources, destinations, and analysis tools, and enable companies to address specific business intelligence needs through data analytics.
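As a minimal sketch of these extract, transform, and load steps, the snippet below uses only the Python standard library; the field names, cleaning rules, and in-memory SQLite database standing in for the warehouse are all hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (an inline CSV here, for the example)
raw_source = io.StringIO(
    "user_id,country,amount\n"
    "1, us ,10.50\n"
    "2,DE,\n"        # missing amount: will fail validation below
    "3,fr,7.25\n"
)
rows = list(csv.DictReader(raw_source))

# Transform: trim whitespace, normalize country codes, drop invalid records
clean = [
    {"user_id": int(r["user_id"]),
     "country": r["country"].strip().upper(),
     "amount": float(r["amount"])}
    for r in rows
    if r["amount"].strip()
]

# Load: insert the unified records into the "warehouse"
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (user_id INTEGER, country TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (:user_id, :country, :amount)", clean)

print(warehouse.execute("SELECT country, amount FROM sales ORDER BY user_id").fetchall())
# [('US', 10.5), ('FR', 7.25)]
```

Real pipelines replace the inline CSV with source systems and the SQLite target with a warehouse or data lake, but the three-stage shape stays the same.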

In big data, the target the data is being loaded into might be a data warehouse such as BigQuery or Redshift, but it might also be a data lake or data lakehouse — where the transformed data is stored in object storage such as Amazon S3. For instance, data might be loaded into an S3 data lake in its raw form and then further processed, cleaned and transformed into a format that is more suitable for analytical querying, such as Apache Parquet or ORC.

Using Presto for ETL

Presto is a distributed SQL query engine that is designed to be fast and responsive, allowing users to quickly analyze large datasets. It is an excellent choice for running interactive queries on large datasets, as it can execute queries in seconds or minutes, depending on the size of the dataset.

Even though Presto was designed as an ad-hoc query engine, it can be a suitable choice if the ETL process is not too resource-intensive or complex. However, it does not provide fault tolerance for long-running jobs (a failed query must be rerun from the beginning), and it lacks many features common in purpose-built ETL tools, such as scheduling and checkpointing.

It is worth noting that Presto is not a replacement for workflow orchestration tools such as Apache Airflow, which are designed specifically for complex ETL workflows. However, if your ETL process is relatively simple and you are already using Presto for data analysis, it may make sense to use Presto for ETL as well.

If you are using Presto for ETL, it is recommended to spin up a separate Presto cluster to avoid resource contention with interactive Presto workloads. You should also break complex queries into a series of shorter ones and create temporary tables for manual checkpointing.
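The manual-checkpointing advice above can be sketched as a pair of staged CREATE TABLE AS statements. SQLite stands in for a Presto cluster here so the example is self-contained; against Presto you would submit equivalent SQL through a client, and the table and column names are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, bytes INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)", [(1, 100), (1, 250), (2, 50)])

# Stage 1: checkpoint the expensive aggregation into a temporary table,
# instead of embedding it in one long query
db.execute("""
    CREATE TEMP TABLE user_totals AS
    SELECT user_id, SUM(bytes) AS total_bytes
    FROM events
    GROUP BY user_id
""")

# Stage 2: a shorter follow-up query reads from the checkpoint; if it fails,
# only this stage needs to be rerun
heavy_users = db.execute(
    "SELECT user_id FROM user_totals WHERE total_bytes > 100"
).fetchall()
print(heavy_users)  # [(1,)]
```

Splitting the work this way keeps each query short and gives you a durable intermediate result to restart from.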

Do you need to ETL your data before reading it in Presto?

Presto allows for ad-hoc querying of data wherever it resides, without the need for ETL into a separate system. Using Presto connectors, analysts can access the datasets they are interested in, while in-place execution means results are returned quickly, even when querying data directly from cloud object storage. This makes it much easier for data teams to access and analyze data in real time, without having to wait for ETL processes to complete. It also helps reduce the cost of data storage, as it eliminates the need to duplicate data in a separate system.

For most Presto use cases, there is no need to pre-process data with ETL or ELT before querying it. As soon as the data is stored in the data lake, it can be accessed with Presto SQL. This is almost always the case with interactive analytics, but there could be exceptions when it comes to scheduled reporting or BI dashboards – in these situations, you might consider using aggregated or pre-summarized datasets to reduce compute costs.

Comparing Presto to Data Warehouses

The ‘zero ETL’ approach that Presto enables is one of its core advantages over traditional data warehouse architectures. ETL/ELT can involve a lot of manual work, data duplication, and errors, which can make analytics slower and more complex.

PrestoDB’s ability to run analytical queries on diverse data sources and raw semi-structured data can significantly expedite the process. Presto eliminates the need to load data into a data warehouse, since it can be queried directly from its source and schema changes can be implemented in real-time; this also saves the need to perform costly transformations in order to apply a relational schema to file-based data.

If you’re interested in trying out Presto in the cloud, Ahana Cloud is a SaaS for Presto that makes it easy to deploy Presto in the cloud and run queries on your data. Check out a demo today.

S3 Select vs. AWS Athena – The Quick Comparison

Data analysts and data engineers need simpler ways to access business data stored on Amazon S3. Amazon Athena and S3 Select are two services that allow you to retrieve records on S3 using regular SQL. What are the differences, and when should you use one vs the other?

S3 Select vs Athena: What’s the Difference?

The short answer:

Both services allow you to query S3 using SQL. Athena is a fully featured query engine that supports complex SQL and works across multiple objects, while S3 Select is much more limited and is used to retrieve a subset of data from a single object in S3 using simple SQL expressions.

The long answer:

S3 Select is more appropriate for simple filtering and retrieval of specific subsets of data from S3 objects using basic SQL statements, with reduced data transfer costs and latency. Amazon Athena, on the other hand, is suitable for running complex, ad-hoc queries across multiple paths in Amazon S3, offering more comprehensive SQL capabilities, improved performance, and optimization options. Athena supports more file formats, compression types, and optimizations, while S3 Select is limited to CSV, JSON, and Parquet formats.

An alternative to Amazon Athena is Ahana Cloud, a managed service for Presto that offers up to 10x better price performance.

Here is a detailed comparison between the two services:

Query Scope:

  • S3 Select operates on a single object in S3, retrieving a specific subset of data using simple SQL expressions.
  • Amazon Athena can query across multiple paths, including all files within those paths, making it suitable for more complex queries and aggregations.

SQL Capabilities:

  • S3 Select supports basic SQL statements for filtering and retrieving data, with limitations on SQL expression length (256 KB) and record length (1 MB).
  • Athena offers more comprehensive ANSI SQL compliant querying, including group by, having, window and geo functions, SQL DDL, and DML.

Data Formats and Compression:

  • S3 Select works with CSV, JSON, and Parquet formats, supporting GZIP and BZIP2 (only for CSV and JSON) compression.
  • Athena supports a wider range of formats, including CSV, JSON, Apache Parquet, Apache ORC, and TSV, with broader compression support.

Integration and Accessibility:

  • S3 Select can be used with AWS SDKs, the SELECT Object Content REST API, the AWS CLI, or the Amazon S3 console.
  • Athena is integrated with Amazon QuickSight for data visualization and AWS Glue Data Catalog for metadata management. It can be queried directly from the management console or SQL clients via JDBC.

Performance and Optimization:

  • S3 Select is a rudimentary query service mainly focused on filtering data, reducing data transfer costs and latency.
  • Athena offers various optimization techniques, such as partitioning and columnar storage, which improve performance and cost-efficiency.
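To illustrate why partitioning helps: when S3 object keys encode a partition column (a Hive-style dt= prefix in this made-up layout), a query engine can skip entire objects before scanning any bytes. The sketch below mimics that pruning step.

```python
# Hypothetical partitioned layout: one key prefix per day
keys = [
    "logs/dt=2024-01-01/part-0.parquet",
    "logs/dt=2024-01-01/part-1.parquet",
    "logs/dt=2024-01-02/part-0.parquet",
    "logs/dt=2024-01-03/part-0.parquet",
]

def prune(keys, dt):
    """Keep only objects whose partition matches the query's date predicate."""
    return [k for k in keys if f"/dt={dt}/" in k]

# A query filtered to a single day scans one file instead of four:
print(prune(keys, "2024-01-02"))  # ['logs/dt=2024-01-02/part-0.parquet']
```

Since Athena bills per byte scanned, skipping three of the four files here cuts both latency and cost; columnar formats like Parquet prune further within each file.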

Schema Management:

  • S3 Select queries are ad hoc and don’t require defining a data schema before issuing queries.
  • Athena requires defining a data schema before running queries.

Pricing:

  • The cost of S3 Select depends on three factors: the number of SELECT requests, the amount of data returned, and the amount of data scanned. As of December 2020, pricing for the US East (Ohio) region with Standard Storage is:
    • Amazon S3 Select — $0.0004 per 1000 SELECT requests
    • Amazon S3 Select data returned cost — $0.0007 per GB
    • Amazon S3 Select data scanned cost — $0.002 per GB
  • With Athena, you are charged $5.00 per TB of data scanned, rounded up to the nearest megabyte, with a 10MB minimum per query. 
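As a rough sketch of how those rates translate into dollars (using binary units, 1 TB = 1024^4 bytes, and the December 2020 rates quoted above):

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4

def athena_cost(bytes_scanned: int) -> float:
    """$5.00 per TB scanned, rounded up to the nearest MB, 10 MB minimum."""
    mb_billed = max(math.ceil(bytes_scanned / MB), 10)
    return mb_billed * MB * 5.00 / TB

def s3_select_cost(select_requests: int, gb_returned: float, gb_scanned: float) -> float:
    """US East (Ohio), Standard Storage rates as of Dec 2020."""
    return (select_requests / 1000 * 0.0004
            + gb_returned * 0.0007
            + gb_scanned * 0.002)

# Scanning a full terabyte with Athena costs $5.00 ...
print(athena_cost(TB))             # 5.0
# ... while 1,000 S3 Select requests scanning and returning 1 GB each way
# cost a fraction of a cent.
print(s3_select_cost(1000, 1, 1))  # about $0.0031
```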

An alternative to AWS Athena and S3 Select is Ahana Cloud, which gives you the ability to run complex queries at better price performance than Athena. Get a demo today.

Summary

Feature | S3 Select | Athena
Query Scope | Single object (e.g., single flat file) | Multiple objects, entire bucket
Use Cases | Ad-hoc data retrieval | Log processing, ad-hoc analysis, interactive queries, joins
SQL Capabilities | Basic queries, filtering | Complex, ANSI-compliant SQL queries, aggregations, joins
File Formats | CSV, JSON, Parquet | CSV, JSON, Parquet, TSV, ORC, and more
Integration | Serverless apps, Big Data frameworks | AWS Glue Data Catalog, ETL capabilities
Query Interface | S3 API (e.g., Python boto3 SDK) | Management Console, SQL clients via JDBC
Performance Optimization | Limited, basic filtering | Partitioning, columnar storage, and more
Schema Definition | Not required | Required

When should you use Athena, and when should you use S3 Select?

You should choose Amazon Athena for complex queries, analysis across multiple S3 paths, and integration with other AWS services, such as AWS Glue Data Catalog and Amazon QuickSight. Opt for S3 Select when you need to perform basic filtering and retrieval of specific subsets of data from a single S3 object.

Example Scenarios

Example 1: Log Analysis for a Web Application – Use Athena

Imagine you operate a web application, and you want to analyze log data stored in Amazon S3 to gain insights into user behavior and troubleshoot issues. In this scenario, you have multiple log files across different S3 paths, and you need to join and aggregate the data to derive meaningful insights.

In this case, you should use Amazon Athena because it supports complex SQL queries, including joins and aggregations, and can query across multiple paths in S3. With Athena, you can take advantage of its optimization features like partitioning and columnar storage to improve query performance and reduce costs.
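A minimal sketch of what that can look like with boto3 (the bucket, database, table, and column names are all hypothetical; boto3 is imported inside the runner so the SQL helper can be used on its own):

```python
def build_log_query(day: str) -> str:
    # The partition predicate ("dt") keeps Athena from scanning the whole
    # dataset; the "app_logs" table and its columns are hypothetical.
    return (
        "SELECT user_id, COUNT(*) AS requests, "
        "COUNT_IF(status >= 500) AS errors "
        "FROM app_logs "
        f"WHERE dt = '{day}' "
        "GROUP BY user_id "
        "ORDER BY errors DESC"
    )

def run_athena_query(sql: str, database: str, output_s3: str) -> str:
    import boto3  # deferred so the helper above works without AWS access
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    # Athena queries run asynchronously; poll get_query_execution()
    # with this ID to check for completion.
    return resp["QueryExecutionId"]
```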

Example 2: Filtering Customer Data for a Marketing Campaign – Use S3 Select

Suppose you have a customer data file stored in Amazon S3, and you want to retrieve a subset of records for a targeted marketing campaign. The data file is in JSON format, and you need to filter records based on specific criteria, such as customer location or spending habits.

In this scenario, S3 Select is the better choice, as it is designed for simple filtering and retrieval of specific subsets of data from a single S3 object using basic SQL expressions. Using S3 Select, you can efficiently retrieve the required records, reducing data transfer costs and latency.
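Sketched with boto3 (the bucket, key, and JSON field names are hypothetical; `select_object_content` streams results back as an event stream):

```python
def select_customers(bucket: str, key: str, min_spend: int) -> dict:
    # Request arguments for a single-object S3 Select call; each line of
    # the object is expected to be one JSON record (JSONType "LINES").
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": (
            "SELECT s.name, s.email FROM S3Object s "
            f"WHERE s.location = 'US' AND s.spend > {min_spend}"
        ),
        "InputSerialization": {"JSON": {"Type": "LINES"}},
        "OutputSerialization": {"JSON": {}},
    }

def run_select(request: dict) -> bytes:
    import boto3  # deferred so the request builder works without AWS access
    s3 = boto3.client("s3")
    resp = s3.select_object_content(**request)
    # The response payload is an event stream; "Records" events carry data.
    return b"".join(
        event["Records"]["Payload"]
        for event in resp["Payload"]
        if "Records" in event
    )
```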

Is S3 Select Faster than Athena?

Both S3 Select and Athena are serverless and rely on pooled resources provisioned by Amazon at the time the query is run. Neither is generally faster than the other. However, S3 Select can be faster than Athena for specific use cases, where retrieving a subset of the data is more efficient than processing the entire object. In cases where you only need the capabilities of S3 Select, it can also be easier to run compared to Athena, which requires a table schema to be defined.

Need a better SQL query engine for Amazon S3?

Ahana provides a managed Presto service that lets you run ad-hoc queries, interactive analytics, and BI workloads over your S3 storage. Learn more about Ahana or get a demo.

Sources used in this article:

https://ahana.io/answers/aws-s3-select-limitations/ 

https://aws.amazon.com/s3/pricing/

https://aws.amazon.com/athena/pricing/ 

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html

https://aws.amazon.com/blogs/storage/querying-data-without-servers-or-databases-using-amazon-s3-select/

https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428

https://stackoverflow.com/questions/49102577/what-is-difference-between-aws-s3-select-and-aws-athena

https://repost.aws/questions/QU1_wCZSxES6-QHh7QBTDYDA/s-3-select-vs-athena

4 Trino Alternatives for Better Price / Performance


Trino, a distributed SQL query engine, is known for its ability to process large amounts of semi-structured data using familiar SQL semantics. However, there are situations where an alternative may be more suitable. In this article, we explore four Trino alternatives that offer better price/performance for specific use cases.

What is Trino?

Trino is a distributed SQL query engine that supports various data sources, including relational and non-relational sources, through its connector architecture. It is a hard fork of the original Presto project, which was started at Facebook and later open-sourced in 2013. 

The creators of Presto, who later became cofounders/CTOs of Starburst, began the hard fork named Trino in early 2019. 

Trino has since diverged from Presto, and many of the innovations that the community is driving in Presto are not available in Trino. Trino is not hosted under the Apache Software Foundation (ASF) or Linux Foundation, but rather under the Trino Software Foundation, a non-profit corporation controlled by the cofounders of Starburst.

When Should You Use Trino?

Presto-based services – including Trino and PrestoDB – are designed for ad-hoc querying and analytical processing over data lakes, and allow developers to run interactive analytics against massive amounts of semi-structured data. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations. 

Trino or Presto should be used when a user wants to perform fast queries against large amounts of data from different data sources using familiar SQL semantics. It is suitable for organizations that want to use their existing SQL skills to query data without having to learn new complex languages.

Other Trino use cases mentioned in the context of data science workloads include running a specific federated query that requires high performance, and connecting to data via Apache Hive as the backend. 

When Should You Look at Alternatives to Trino?

While Trino is a powerful and popular framework, there are situations where you might want to consider an alternative. These include:

  • If you’re looking for an open-source project with a strong governance structure and charter, Trino is not the best choice, since it is hosted by a vendor-controlled non-profit corporation. Users who prefer a project hosted under a well-known organization like the ASF or The Linux Foundation may choose another tool instead of Trino.
  • If you are looking for services and support from vendors, you should compare the functionality and price/performance provided by Trino to alternative tools such as Ahana, Amazon Athena, or Dremio.
  • If you’re looking for a database management system that stores and manages data, Trino is not suitable. Like Presto, Trino is a SQL query engine that queries the connected data stores and does not store data itself (although both tools can write the results of a query back to object storage).

4 Alternatives to Trino 

If you’re looking for an alternative to Trino, consider one of the following:

  1. Open Source PrestoDB 
  2. Ahana, managed service for Presto on AWS
  3. Amazon Athena, serverless service for Presto/Trino on AWS
  4. Dremio

1. PrestoDB – the original Presto distribution used at Facebook

As mentioned above, Trino originated as a hard fork of PrestoDB. Trino was previously known as PrestoSQL before being rebranded in December 2020. The Presto Software Foundation was also rebranded as the Trino Software Foundation to reflect the fact that these are two separate and divergent projects. 

While Trino and PrestoDB share a common history, they have different development teams and codebases, and may have different features, optimizations, and bug fixes.

Some key differences between PrestoDB and Trino:

  • PrestoDB is tested and used by Facebook, Uber, Bytedance, and other internet-scale companies, while Trino is not. 
  • Presto is one of the fastest-growing open-source projects in the data analytics space.
  • The Presto Foundation (part of The Linux Foundation) oversees PrestoDB, whereas Trino is mainly steered by a single company (Starburst). 
  • Presto offers access to recent and current innovations in PrestoDB including Project Aria, Project Presto Unlimited, additional user-defined functions, Presto-on-Spark, Disaggregated Coordinator, and RaptorX Project. 

See the full comparison: Presto vs Trino.

There are several ways you can get started with open source Presto, including running it on-premises, through a Docker container, and more (check out our getting started with Presto page). 
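Once a cluster is up (locally via Docker or otherwise), queries can be issued from Python with the community presto-python-client. A sketch, assuming a coordinator on localhost:8080 with a Hive catalog configured:

```python
def query_presto(sql: str, host: str = "localhost", port: int = 8080):
    # prestodb is the module name of the presto-python-client package;
    # imported here so the sketch can be read without the package installed.
    import prestodb
    conn = prestodb.dbapi.connect(
        host=host, port=port, user="demo", catalog="hive", schema="default"
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

# Example (requires a running coordinator):
# rows = query_presto("SELECT count(*) FROM my_table")
```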

2. Ahana Cloud: managed service for Presto on AWS

Ahana, a member of the Presto Foundation and contributor to the PrestoDB project, offers a managed, cloud-native version of open-source Presto – Ahana Cloud. It gives you a managed service offering for Presto by taking care of the hundreds of configurations and tuning parameters under the hood while still giving you more control and flexibility as compared to a serverless offering.

Ahana also includes some features like Data Lake Caching for better performance and AWS Lake Formation integration to take advantage of granular data security.

Check out a demo of Ahana Cloud.

3. Amazon Athena: managed Presto/Trino service provisioned by AWS

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL. Originally based on PrestoDB, Athena now incorporates features from both Presto and Trino.

In our comparison between Athena and Trino-based Starburst, we concluded that:

  • Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3, but there are some key differences. 
  • Starburst has features like Cached Views and pushdown capabilities, while Athena is optimized for fast performance with Amazon S3 and executes queries automatically in parallel. 
  • Users generally regard both Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues. 
  • Users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging. 
  • In terms of cost, Athena charges a flat price of $5 per terabyte of data scanned, while Starburst’s pricing is more complex.

4. Dremio: serverless query engine based on Apache Arrow

Dremio, which is built on Apache Arrow, is another query engine that enables high-performance analytics directly on data lake storage. 

According to Dremio’s website, Dremio offers interactive analytics directly on the lake and is often used for BI dashboards, whereas Starburst primarily supports ad-hoc workloads only. Dremio provides self-service with a shared semantic layer for all users and tools, while Starburst lacks a semantic layer and data curation capabilities. 

On the other hand, Starburst touts a cost-based optimizer that helps define an optimal plan based on the table statistics and other info it receives from plugins. Starburst’s custom connectors are optimized to be run in parallel, taking advantage of Trino’s MPP architecture. 

While both platforms offer similar products, Dremio seems to be more focused on BI-oriented workloads reading from data lakes, whereas Starburst might be better suited for ad-hoc and federated queries.

Try Ahana Cloud’s managed Presto for free

If you’re evaluating SQL query engines, you’re in the right place. The easiest way to get started is with Ahana Cloud for Presto. You can try it for yourself, but we recommend scheduling a quick, no-strings-attached call with our solutions engineering team to understand your requirements and set up the environment. Get started now

Exploring Data Warehouse, Data Mesh, and Data Lakehouse: What’s right for you?


We’re hosting a free hands-on lab on building your own Data Lakehouse in AWS. You’ll get trained by Presto and Apache Hudi experts.

When it comes to data management, there are various approaches and architectures for storing, processing, and analyzing data. In this article we’ll discuss three of the more popular approaches in the market today – the data warehouse, data mesh, and data lakehouse. 

Each approach has its own unique features, advantages, and disadvantages, and understanding the differences between them is crucial for organizations to make informed decisions about their data strategy. We’ll take you through each one and help you determine which approach is best suited for your organization’s data needs.

Data Warehouse: Centralized but Inflexible

A Data Warehouse is a centralized repository that stores structured data from various sources for analysis and reporting. Typically it’s a relational database and optimized for read-heavy workloads with a schema-on-write approach. 

Advantages of a data warehouse are that it’s a single source of truth for structured data, it provides high performance querying and dashboarding/reporting capabilities, and it supports business intelligence and analytics use cases. 

On the other hand, some of its disadvantages are that it requires data to be pre-processed and structured, it has limited flexibility in handling unstructured data and new data types, and it can be expensive to implement and maintain.

Learn more about choosing between data warehouse and data lake.

Data Mesh: Flexible but Complicated

A Data Mesh is a distributed and decentralized approach to data architecture that focuses on domain-driven design and self-service data access. Key features include decentralized data ownership and control, data that’s organized by domains rather than centralized by function, data is emphasized as a product that is discoverable and reusable, and data access is self-service for domain teams.

Advantages of a data mesh are that it offers agility and flexibility in handling complex and evolving data environments, it facilitates collaboration between data teams and domain teams, and it promotes data democratization and data-driven culture.

Disadvantages are that it requires a cultural shift and new ways of working to implement, distributed data ownership involves data governance and security challenges, and it requires strong data lineage and metadata management to ensure data quality and consistency. Performance can also be a problem if you’re doing joins across many data sources, because your query will only be as fast as your slowest connection.

Data Lakehouse: Hybrid Approach

A Data Lakehouse is a hybrid approach that combines the best features of data warehouses and data lakes. Those features include support for both structured and unstructured data, support for both read and write-heavy workloads, and a schema-on-read approach.

Advantages of a data lakehouse are that it offers flexibility in handling both structured and unstructured data, it supports real-time analytics and machine learning use cases, and it’s cost-effective compared to traditional data warehouses. They’re designed to handle both batch processing and real-time processing of data.

Disadvantages are that it requires data governance and management policies to prevent data silos and ensure data quality, complex data integration and transformation may require specialized skills and tools, and there may be performance issues for ad-hoc queries and complex joins.

Picking the data architecture that’s best for your use case

Below is a matrix we’ve put together that lists which of these approaches best fits specific requirements and use cases.

Capability | Data Warehouse | Data Mesh | Data Lakehouse
Structured data | Well-suited | Supported | May be limited if highly structured
Unstructured data | Limited | Supported | Well-suited
Fast access to data | High-performance querying | Only as fast as the slowest source | Good
Real-Time Data Processing | Limited | Possible with additional tools | Supported
Data Governance | Centralized | Decentralized | Centralized or Decentralized
Cost-effective | Can be expensive | Depends on specific use case | Reasonably cost-effective
Scalability | Limited | Highly scalable | Good
Self-service data discovery | Limited | Well-suited | Supported
Data Integration | Structured sources | Careful planning required | May require specialized skills
Analytics capabilities | Mature | Evolving | Supported

As shown in the matrix, each architecture has its own strengths and weaknesses across different key capabilities.

A data warehouse architecture is well-suited for structured data, offers strong data governance, and mature analytics capabilities, but may be limited in its scalability and ability to handle unstructured data and real-time processing.

A data mesh architecture offers highly scalable and decentralized data management, high developer productivity, and flexible data governance, but may require additional tools for real-time processing and careful planning for data integration.

A data lakehouse architecture is well-suited for unstructured data, offers good scalability and data integration capabilities, and is reasonably cost-effective, but may be limited in its ability to handle highly structured data and may require varied data governance strategies.

The Open Data Lakehouse

At Ahana, we’re building the Open Data Lakehouse with Presto at its core. Presto, the open source SQL query engine, powers the analytics on your Open Data Lakehouse. We believe the data lakehouse approach strikes the best balance between flexibility, scalability, and cost-effectiveness, making it a favorable choice for organizations seeking a modern data management solution.

You can learn more about our approach to the Data Lakehouse by downloading our free whitepaper.

AWS Athena vs. Databricks

In this article we’ll look at two different technologies in the data space and share more about which to use based on your use case and workloads.

The High Level Overview

To set the stage, it’s important to note that Databricks and Amazon Athena are two different beasts, so a direct comparison is of limited use given the breadth of functionality each tool provides. For the purposes of this article, we’ll give an overview of each and share more on when it makes sense to use each tool.

AWS Athena is a serverless query engine based on open-source Presto technology, which uses Amazon S3 as the storage layer; whereas Databricks is an ETL, data science, and analytics platform which offers a managed version of Apache Spark. Databricks is widely known for its data lakehouse approach which gives you the data management capabilities of the warehouse coupled with the flexibility and affordability of the data lake.

One could conceivably use both tools within the same deployment, although there will be some overlap around data warehousing and ad-hoc workloads. This overlap might have grown larger recently with the release of Amazon Athena for Apache Spark.

An alternative to these offerings is Ahana Cloud, a managed service for Presto that gives you a prescriptive approach to building an open data lakehouse using open source technologies and open formats.

What is Databricks?

Databricks is a unified analytics platform built on open-source Apache Spark, which combines data science, engineering, and business analysis in an integrated workspace.
What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?

Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Databricks

The Databricks website claims that Databricks offers world-record-setting performance directly on data in the data lake, and that it is up to 12x better price/performance than traditional cloud data warehouses.
Athena

The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana

Ahana has multi-level data lake caching that can give customers up to 30X query performance improvements. Ahana is also known for its better price-performance as compared to Athena especially.

According to user reviews:

Below is a summary of the claims made on user reviews in websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance.  Users generally view both Databricks and Athena as tools that provide good performance for big data workloads, but with some drawbacks when it comes to ongoing management.

Databricks

Users mention that Databricks has good performance for big data workloads, and quick lakehouse deployment. Some users have noted that Databricks makes it hard to profile code inside the platform. Additionally, some users have mentioned issues with logging for jobs, job scheduling, and job portability.
Athena

Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.
Ahana

Ahana is similar to Athena in that you get fast and reliable data analytics at scale. Unlike Athena, you get more control over your Presto deployment – no issues with concurrency or deterministic performance.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Databricks

The Databricks website claims that Databricks is highly scalable and comes with various enterprise readiness features such as security and user access control, as well as the ability to integrate with other parts of the user’s ecosystem.
Athena

Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable.
Ahana

Ahana has autoscaling built-in which automatically adjusts the number of worker nodes in an Ahana-managed Presto cluster. This allows for efficient performance and also helps to avoid excess costs.

According to user reviews:

Below is a summary of the claims made on user reviews in websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Both tools offer auto-scaling, although Databricks can provide dedicated clusters which might provide more consistent performance.

Databricks

Users were happy with Databricks’s ability to autoscale clusters. They also note its use of open-source technologies and the ability to work in different programming languages within the platform. However, some users have mentioned challenges around security, user access control, and integration with other parts of their ecosystem; others note that Databricks is incompatible with some AI/ML libraries, difficult to secure and control access to, and can get expensive.
Athena

Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue mentioned positively. However, other reviews suggest that Athena may not handle larger volumes of data effectively, citing issues such as lack of feature parity with Presto, lack of a standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest in order to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Databricks

The Databricks website claims that Databricks is simple to install and operate, and that it uses familiar languages and syntaxes such as SQL, making it easy to use.
Athena

The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana

Ahana is a managed service which means you get more control over your deployment than you would with Athena, but it also takes care of the configuration parameters under the hood.

According to user reviews:

Below is a summary of the claims made on user reviews in websites such as G2, Reddit, and Stack Overflow, related to each tool’s usability. 

Databricks

Multiple reviews mentioned that Databricks provides a good user experience and has a relatively simple setup process. On the other hand, users have mentioned that Databricks has a steep learning curve, which could make it difficult to use for those without specialized knowledge. Additionally, some users have noted that the UI can be confusing or repetitive.
Athena

Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views, and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. As your datasets and workloads grow, your Athena costs can grow quickly which can lead to sticker-shock. That’s why many Ahana customers were previous Athena users who were seeing unpredictable costs associated with their Athena usage – due to Athena’s serverless nature, you can never predict how many resources will be available.
  • Databricks pricing is based on compute usage. The cost of using Databricks is calculated by multiplying the amount of DBUs (Databricks Units) that you consumed with a corresponding $ rate. This rate is influenced by the cloud provider you’re working with (e.g., the cost AWS charges for EC2 machines), geographical region, subscription tier, and compute type.
  • Ahana is pay-as-you-go pricing based on your consumption. There’s a pricing calculator if you want to see what your deployment model would cost.
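The Databricks formula above reduces to simple multiplication. A sketch with a hypothetical per-DBU rate (actual rates vary by cloud provider, region, subscription tier, and compute type), alongside Athena's flat scan rate:

```python
def databricks_cost(dbus_consumed: float, dollar_rate_per_dbu: float) -> float:
    # DBUs consumed times the per-DBU rate; the rate itself depends on
    # provider, region, subscription tier, and compute type.
    return dbus_consumed * dollar_rate_per_dbu

def athena_cost_tb(tb_scanned: float) -> float:
    # Flat $5 per terabyte of data scanned.
    return tb_scanned * 5.0

# e.g. 100 DBUs at a hypothetical $0.55/DBU vs scanning 11 TB with Athena:
print(databricks_cost(100, 0.55))  # about $55
print(athena_cost_tb(11))          # 55.0
```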

While you can find some figures on Databricks’s pricing page, understanding how much you will end up paying can be quite difficult as it will depend on the type and volume of data, as well as whatever discount you could negotiate with AWS. Many of the user reviews mention the price of running Databricks as prohibitive, especially when compared to open-source Apache Spark. 

Athena’s pricing structure is simpler and based entirely on the amount of data queried, although it can increase significantly if the source S3 data is not optimized. 

Ahana’s pricing is much simpler and also very transparent thanks to the pricing calculator. Similar to Athena, the pricing will just be part of your AWS bill.

Need a better alternative?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability as compared to Amazon Athena. Ahana will give you the starting blocks needed to build your Open Data Lakehouse.

Sources

Ahana Adds New Awards and Industry Recognitions for Data and Analytics Innovations

Mountain View, Calif. – February 21, 2023 – Ahana, the only SaaS for Presto, today announced many new awards and industry accolades for its data and analytics innovations as it exited 2022 and kicked off 2023. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data. 

“Touted as the best of both the data warehouse and data lake worlds, the Data Lakehouse is giving customers the flexibility, scale and cost management benefits of the data lake coupled with the data management capabilities of the data warehouse,” said Steven Mih, Cofounder and CEO, Ahana. “With the Ahana Cloud for Presto managed service, we’ve delivered a prescriptive approach to building an open SQL data lakehouse that brings the best of the data warehouse and the data lake using open, non-proprietary technologies. We are excited to help our customers future-proof their business with this approach.” 

Recent award recognitions include:

  • DBTA, “Trend Setting Products in Data and Information Management for 2023” – These products, platforms and services range from long-established offerings that are evolving to meet the needs of their loyal constituents to breakthrough technologies that may only be in the early stages of adoption. However, the common element for all is that they represent a commitment to innovation and seek to provide organizations with tools to address changing market requirements. Ahana is included in this list of most significant products. 
  • CDO (Chief Data Officer) Magazine, “Global Data Founders List 2022.” The world of data and analytics is rapidly expanding its dominion, resulting in digital innovation that is reshaping the sector. The development of technology is creating a plethora of chances for businesses and individuals to flourish in the competitive tech world. With the rapid evolution of disruptive technologies, leading a business is more difficult than ever. CDO presents the Global Data Founders’ List 2022, which includes Ahana’s Founder and CEO Steven Mih.
  • 2022 – 2023 Cloud Awards – Ahana Cloud for Presto has been declared a finalist for two Cloud Awards in the international Cloud Awards competition, including Best Cloud-Native Project / Solution and Best Cloud Business Intelligence or Analytics Solution. Head of Operations for the Cloud Awards, James Williams, said: “Advancing to the next stage of The Cloud Awards program is a remarkable achievement and we’re excited to celebrate with all those finalists who made the cut.”
  • CRN, “The 10 Coolest Business and Analytic Tools of 2022” – Businesses and organizations increasingly rely on data and data analysis for everything from making day-to-day business decisions to long-range strategic planning. Data analytics also plays a critical role in major initiatives like business process automation and digital transformation. CRN lists Ahana Cloud for Presto to its list of 10 of the coolest business analytics tools that can help businesses find efficient ways to analyze and leverage data for competitive advantage. 
  • InsideBIGDATA, “IMPACT 50 List for Q1 2023” – Ahana earned an Honorable Mention as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 

Tweet this: @AhanaIO receives many industry #awards and #accolades for innovation in #BigData #Data #Analytics and #Presto https://bit.ly/3wZeeqR

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in Mountain View, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Starburst vs. Athena: Evaluating different Presto vendors


The High Level Overview – Athena vs. Starburst

Starburst and Amazon Athena are both query engines used to query data from object stores such as Amazon S3. Athena is a serverless service based on open-source Presto technology, while Starburst is the corporate entity behind a fork of Presto called Trino. An alternative to these offerings is Ahana Cloud, a managed service for Presto.

All of these tools will cover similar ground in terms of use cases and workloads. Understanding the specific limitations and advantages of each tool will help you decide which one is right for you.

What is Starburst?
Starburst Enterprise is a data platform that leverages Trino, a fork of the original Presto project, as its query engine. It enables users to query, analyze, and process data from multiple sources. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.
What is Amazon Athena?
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL.
What is Ahana Cloud?
Ahana Cloud is a managed service for Presto on AWS that gives you more control over your deployment. Typically users see up to 5x better price performance as compared to Athena.

Try Ahana for Superior Price-Performance

Run SQL workloads directly on Amazon S3 with a platform that combines the best parts of open-source Presto and managed infrastructure. Start for free or get a demo.

Performance

We are defining performance as the ability to maintain fast query response times, and whether doing so requires a lot of manual optimization.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ performance. 

Starburst
Starburst’s website mentions that the product provides enhanced performance by using Cached Views and pushdown capabilities. These features allow for faster read performance on Parquet files, the ability to generate optimal query plans, improved query performance and decreased network traffic.
Athena
The AWS website mentions that Athena is optimized for fast performance with Amazon S3 and automatically executes queries in parallel for quick results, even on large datasets. 
Ahana Cloud
Ahana has multi-level data caching with RaptorX which includes one-click caching built-in to every Presto cluster. This can give you up to 30X query performance improvements.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s performance. Users generally regard Starburst and Athena as having good performance, but note that Starburst may require more customization and technical expertise, and Athena may need more optimization and sometimes has concurrency issues.

Starburst
Reviewers who were happy with Starburst’s performance mentioned that it provides quick and efficient access to data, is able to handle large volumes of data and concurrent queries, and has good pluggability, portability, and parallelism. Some reviewers noted that tuning can be cumbersome, and that storing metadata in the Hive metastore creates overheads which can slow down performance. Others mentioned the cost associated with customization, the need for technical expertise to deploy Starburst Enterprise, and occasional performance issues when dealing with large datasets.
Athena
Many reviewers see Athena as fast and reliable, and capable of handling large volumes of data. Negative aspects mentioned include Athena not supporting stored procedures, the possibility of performance issues if too many partitions are used, concurrency issues, inability to scale the service, and the need to optimize queries and data.

Scale

We are defining scale as how effectively a data tool can handle larger volumes of data and whether it is a good fit for more advanced use cases.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ scale. 

Starburst
The Starburst website claims that Starburst offers fast access to data stored on multiple sources, such as AWS S3, Microsoft Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and more. It also provides unified access to Hive, Delta Lake, and Iceberg. It has features such as high availability, auto scaling with graceful scaledown, and monitoring dashboards.
Athena
The AWS website claims that Athena automatically executes queries in parallel, so results are fast, even with large datasets and complex queries. Athena is also highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Additionally, Athena integrates out-of-the-box with AWS Glue, which allows users to create a unified metadata repository across various services, crawl data sources to discover data and populate their Data Catalog with new and modified table and partition definitions, and maintain schema versioning.
Ahana Cloud
Ahana has an autoscaling feature that helps you manage your Presto clusters by automatically adjusting the number of worker nodes in the Ahana-managed Presto cluster. You can read the docs for more information.

According to user reviews:

Below is a summary of the claims made in user reviews on websites such as G2, Reddit, and Stack Overflow, related to each tool’s scale. Users see both tools as capable of operating at scale, though both have limitations in this respect.

Starburst
Multiple reviews note that Starburst Data is capable of handling larger volumes of data, can join disparate data sources, and is highly configurable and scalable. Potential issues with scalability noted in the reviews include the need for manual tuning, reliance on technical resources on Starburst’s side, and the need to restart a catalog after adding a new one. Issues with log files and security configurations are also mentioned.
Athena
Some reviews suggest that Athena is well-suited for larger volumes of data and more advanced use cases, with features such as data transfer speed and integration with Glue being mentioned positively. However, other reviews suggest that Athena may not be able to handle larger volumes of data effectively due to issues such as lack of feature parity with Presto, lack of a standard relational table type, and difficulty in debugging queries.

Usability, Ease of Use and Configuration

We define usability as whether a software tool is simple to install and operate, and how much effort users need to invest to accomplish their tasks. We assume that data tools that use familiar languages and syntaxes such as SQL are easier to use than tools that require specialized knowledge.

According to the vendor:

Below is a summary of the claims made in each vendor’s promotional materials related to their products’ ease of use. 

Starburst
The Starburst website claims that Starburst is easy to use and can be connected to multiple data sources in just a few clicks. It provides features such as Worksheets, a workbench to run ad hoc queries and explore configured data sources, and Starburst Admin, a collection of Ansible playbooks for installing and managing Starburst Enterprise platform (SEP) or Trino clusters.
Athena
The AWS website claims that Athena requires no infrastructure or administration setup. Athena is built on Presto, so users can run queries against large datasets in Amazon S3 using ANSI SQL.
Ahana Cloud
Ahana gives you Presto simplified – no installation, no AWS AMIs or CFTs, and no configuration needed. You can be running in 30 minutes, you get a built-in catalog and one-click integration to your data sources, and it’s all cloud native running on AWS EKS.

According to user reviews:

Overall, users have found Starburst and Athena to be relatively easy to use, but have also mentioned some drawbacks related to complex customization, lack of features, and difficulty debugging.

Starburst
Several reviewers mention that Starburst is easy to deploy, configure, and scale, and that the customer support is helpful. However, some reviews also mention negatives such as the need for complex customization to achieve optimal settings, difficulty in configuring certificates with Apache Ranger, and unclear error messages when trying to integrate with a Hive database.
Athena
Reviewers are happy with the ease of deploying Athena in their AWS account, and mention that setting up tables, views and writing queries is simple. However, some reviews also mention drawbacks such as the lack of support for stored procedures, and the lack of feature parity between Athena and Presto. Another issue that comes up is that debugging queries can be difficult due to unclear error messages.

Cost

  • Athena charges a flat price of $5 per terabyte of data scanned. Costs can be reduced by compressing and partitioning data.
  • Starburst’s pricing is more complex, as it is based on credits and cluster size. The examples given on the company’s pricing page hint at a minimum spend of a few thousand dollars per month.
  • Ahana Cloud is pay-as-you-go through your AWS bill based on the compute you use. There’s a pricing calculator you can use to get an idea.

While the specifics of your cloud bill will eventually depend on the way you use these tools and the amount of data you process in them, Athena and Ahana Cloud have a simpler cost structure and offer a more streamlined on-demand model.
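To make the trade-off between these cost structures concrete, here is a minimal Python sketch comparing a scan-priced model (like Athena’s published $5/TB) against a fixed-cluster model. The workload sizes, node count, and $2/hour node rate below are hypothetical illustration values, not vendor quotes.

```python
# Illustrative comparison of scan-priced vs. compute-priced query engines.
# Only the $5/TB Athena rate is a published price; everything else is a
# hypothetical workload assumption.

ATHENA_PRICE_PER_TB = 5.00  # $5 per TB of data scanned


def athena_monthly_cost(tb_scanned_per_month: float) -> float:
    """Monthly cost of a scan-priced service for a given workload."""
    return tb_scanned_per_month * ATHENA_PRICE_PER_TB


def cluster_monthly_cost(nodes: int, price_per_node_hour: float,
                         hours: float = 730) -> float:
    """Monthly cost of an always-on compute cluster (~730 hours/month)."""
    return nodes * price_per_node_hour * hours


# A light workload (100 TB scanned/month) favors pay-per-scan:
light = athena_monthly_cost(100)              # $500/month
# A heavy workload (10,000 TB scanned/month) may favor a fixed cluster,
# e.g. 10 hypothetical nodes at $2/hour:
heavy_scan = athena_monthly_cost(10_000)      # $50,000/month
heavy_cluster = cluster_monthly_cost(10, 2.0) # $14,600/month
```

The crossover point depends entirely on how much data your queries scan each month, which is why compression and partitioning matter so much under scan-based pricing.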

Need a better alternative to Athena and Starburst?

Get a demo of Ahana to learn how we deliver superior price/performance, control and usability for Presto.


Ahana Cofounders Make 2023 Data Predictions


Open Lakehouses, End User Simplicity, Open Source Managed Services and SQL Workloads Will Dominate

SAN MATEO, Calif., Dec. 19, 2022 — Ahana’s Cofounder and Chief Executive Officer, Steven Mih, and Cofounder and Chief Technology Officer, Dave Simmon, predict major developments in cloud, data analytics, open lakehouses and open source in 2023.

Steven Mih, Cofounder and CEO, outlines the major trends he sees on the horizon in 2023:

  • End user experience becomes a top priority: As deep integrations of data platforms become standard, the reduced complexity will usher in a focus on end user experience. The data platform will become abstracted even further from end users. Instead of worrying about the underlying engines and data structures, end users will be able to easily and seamlessly leverage powerful underlying engines for interactive, batch, real-time, streaming and ML workloads.
  • Industry accepted open lakehouse stacks will emerge: As the market further chooses open options for table formats, compute engines and interfaces, the Lakehouse version of the LAMP stack will emerge. Linux Foundation and Apache Software Foundation projects will constitute those components.
  • Open source SaaS market will shift toward open source managed services: As data and analytics workloads proliferate in the public cloud accounts, and as IT departments demand more control of their own data and applications, we’ll see the adoption of more cloud native managed services instead of full SaaS solutions.
  • Public cloud providers will make huge investments into open source software, and make more contributions back to the community: In the past cloud vendors have been accused of strip-mining OSS software projects. Cloud vendors will go on the “offensive” by contributing to open source more aggressively and even donating their own projects to open source communities.
  • SQL workloads will explode as more NLP (Natural Language Processing) and other Machine Learning (ML) applications generate SQL: While data analysts and scientists continue to uncover insights using SQL, increasingly we’ll see apps that “speak SQL” drive a large portion of the analytical compute. Natural Language Processing (NLP) applications both enable citizen data analysts and demand more compute on data platforms. Similarly, ML applications can dig into datasets in new ways which will blow through today’s level of demand for analytic compute. SQL is not only the ‘lingua franca’ of data analysis, SQL is the ‘lingua franca’ of ML and NLP too.

Dave Simmon, Co-founder and CTO, outlines a major trend he sees on the horizon in 2023:

  • Open lakehouse will more effectively augment the proprietary cloud enterprise data warehouse: As the architectural paradigm shift toward the lakehouse continues, the disaggregated stack will evolve from disjointed components into cohesive, fully-featured data management stacks that include metadata, security and transactions.

Tweet this: @AhanaIO announces 2023 #Data Predictions #cloud #opensource #analytics https://bit.ly/3VQeouV

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC
978-649-7189
beth@ahana.io 

How to Cut the Cost of Your Lakehouse Solution

Lakehouse solutions are becoming more and more popular as an augmentation or replacement for expensive locked-in data warehouses. However, many organizations still struggle with the cost of these implementations. Let’s discuss how to reduce the cost of your lakehouse solution. We will look at the drivers of cost and how open source can help reduce expenses. We will also examine the biggest cost drivers in a lakehouse solution and how they can be mitigated.

Leveraging an open data lakehouse offers countless advantages, from its distinct compute and storage architecture to a lack of vendor lock-in. Plus, you gain the freedom to opt for whichever engine is best suited for your needs and cut costs along the way!

As separated storage has become more affordable and available, compute engines have become the major cost driver for data lakehouses. When building a data lakehouse, storage, the metadata catalog, and table/data management are not the components that drive a significant increase in costs. Compute, however, is a major factor: as the number of jobs and queries that need to be executed continues to increase, more hardware and software are required, increasing costs significantly.

Fortunately, the majority of distributed computing engines, like Presto, are available as open source software that can be used absolutely free! All you have to pay for are servers or cloud instances. Although all computation engines share similar functions, certain ones have a more optimized design due to their underlying technology. These are far more efficient than others, resulting in significant cost savings due to the lower number of servers required.

The open source Presto engine is very efficient, and is becoming more so as the compute workers leverage native C++ vectorization technologies. Compared with systems that run on a Java virtual machine, native C++ code can be drastically more efficient: C++ compiles ahead of time to native code, gives developers direct control over memory allocation for increased speed and efficiency, and does not suffer from the JVM’s infamous garbage collection storms. A good illustration of this contrast is Apache Spark SQL, which runs on the JVM, versus Databricks’ recently introduced proprietary Photon engine, which is written in C++.

Running an AWS Lakehouse with Presto can potentially reduce your compute cost by two-thirds. Let’s take a look at a sample comparison of running an AWS Lakehouse with another solution vs. with Presto. Consider a 200TB lakehouse with 20 nodes of Presto, using current AWS pricing (December 2022): 20 x r5.8xl instances = $40/hour.

If used for 30 days, the compute would be $29K per month.

200TB of S3 per month = $4K per month.

Setting aside the data transfer charges, you’ll be spending 88% on the compute.

So if you have a compute engine that is 3 times more efficient, you would end up with 1/3 the compute nodes for the same workload:

7 X r5.8xl instances = $14/hour

If used for 30 days, the compute would be $10K per month.

200TB of S3 per month = $4K per month.

Data transfer and metadata fees are again negligible, so the comparison is roughly $33K vs. $14K per month.

The total savings would be on the order of 60%.
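The arithmetic in the worked example above can be reproduced in a few lines of Python. The $2/hour per-node figure approximates r5.8xl on-demand pricing from the example; the S3 figure is the example’s ~$4K/month for 200TB.

```python
# Reproducing the lakehouse cost comparison from the text.

HOURS_PER_MONTH = 720   # 30 days, as in the example
NODE_RATE = 2.0         # ~$2/hour per r5.8xl instance (20 nodes = $40/hour)
S3_MONTHLY = 4_000      # ~200TB of S3 storage per month


def monthly_total(nodes: int) -> float:
    """Total monthly cost: compute for the cluster plus S3 storage."""
    return nodes * NODE_RATE * HOURS_PER_MONTH + S3_MONTHLY


baseline = monthly_total(20)   # $32,800 -- quoted as ~$33K
efficient = monthly_total(7)   # $14,080 -- quoted as ~$14K
savings = 1 - efficient / baseline  # ~0.57, i.e. on the order of 60%
```

A 3x more efficient engine cuts the node count from 20 to 7, but the fixed S3 cost is unchanged, which is why the total savings land near 57% rather than the full two-thirds.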

Next: Learn more about Data Warehouse vs Data Mesh vs Data Lakehouse

The Case for Unbundling Your Lakehouse

When you are looking for a lakehouse, do you think about getting an all-in-one solution from one vendor? If so, you may be missing out on a great opportunity. By unbundling the lakehouse and using open components that work well together, you can get all of the benefits of owning a lakehouse without breaking the bank. One of the key components of this strategy is the Presto SQL query engine. Let’s take a closer look at what Presto can do for you!

With the rising popularity of data lakehouses, businesses should consider unbundling the ecosystem for greater efficiency and cost savings. A main advantage of a data lakehouse is its capability to process a variety of compute workloads as your organization’s needs evolve over time. Computing workloads can be divided into SQL queries or non-SQL based code, which could be used for Machine Learning training or data manipulation. Most firms realize that SQL is an ideal tool to help their analysts and developers explore data more effectively. Oftentimes, they begin with the introduction of a well-functioning SQL platform rather than other advanced workloads like ML training.

By leveraging open-source Presto, organizations are able to create a SQL Lakehouse and provide fast, reliable SQL queries. These Presto-based systems are often more cost-effective than a bundled, single-vendor option, which tends to have closed-source components, like a proprietary engine and metadata catalog.

Unbundling your lakehouse offers a number of distinct benefits. For the SQL lakehouse example, Linux Foundation Presto can be used as the powerful open source query engine, Apache Hudi for the table format, and Apache Hive Metastore for the catalog: all open components without lock-in. Additionally, unbundling gives platform engineers the opportunity to opt for other quickly evolving open source projects, finding the most cost-effective platform at any given time. Therefore, unbundling can provide unprecedented levels of functionality, scalability, flexibility, and performance at a decreased cost compared to traditional single-vendor lakehouse offerings.

Presto is a distributed query engine designed to enable fast, interactive analytics on all your data. Presto is an open-source system and has several key advantages over closed-source offerings: performance, scalability, and reliability. Use Presto for your unbundled SQL Lakehouse.

Ahana to Present About Presto’s Query Optimizer and the Velox Source Project at PrestoCon

Dec. 7-8 all-things Presto event features speakers from Uber, Meta, Ahana, Alibaba Cloud, MinIO, Upsolver and more

San Mateo, Calif. – November 30, 2022 – Ahana, the only SaaS for Presto, today announced its participation at PrestoCon, a day dedicated to all things Presto taking place virtually and in-person at the Computer History Museum in Mountain View, CA on December 7 – 8, 2022. In addition to being a premier sponsor of the event, Ahana will be presenting two sessions.

Ahana Sessions at PrestoCon

December 8 at 1:00 pm PT – “Building Large Scale Query Operators and Window Functions for Prestissimo Using Velox” by Aditi Pandit, Presto/Velox Contributor and Principal Software Engineer, Ahana. 

In this talk, Aditi will throw the covers back on some of the most interesting portions of working on Prestissimo and Velox. The talk will be based on the experience of implementing window functions in Velox. It will cover the nitty-gritty of the vectorized operator, memory management and spilling. This talk is perfect for anyone who is using Presto in production and wants to understand more about the internals, or someone who is new to Presto and is looking for a deep technical understanding of the architecture.

December 8 at 2:30 pm PT – “The Future of Presto’s Query Optimizer” by Bill McKenna, Query Optimizer Pioneer and Author and Principal Software Engineer, Ahana. 

In this talk, you will hear Bill, the architect of the query optimizer that became the code base of the Amazon Redshift query optimizer and co-author of “The Volcano Optimizer Generator: Extensibility and Efficient Search,” go into detail about the state of modern query optimizers, how Presto stacks up against them, and where it will go in the near future. This is a must-attend session for attendees interested in database theory.

Presto continues to be recognized as an industry-leading fast and reliable SQL engine for data analytics and the open data lakehouse. BigDATAwire (formerly Datanami) Readers’ and Editors’ Choice Awards recognized the innovation Presto is bringing to the open data lakehouse landscape naming it an Editors’ Choice: Top 3 Data and AI Open Source Projects to Watch. Determined through a nomination and voting process with the global BigDATAwire community, as well as selections from the BigDATAwire editors, the awards recognize the companies and products that have made a difference in the big data community this year, and provide insight into the state of the industry.

Additionally, from among more than 260 applicants, CRN staff selected products spanning the IT industry – including in cloud, infrastructure, security, storage and devices – that offer ground-breaking functionality, differentiation and partner opportunity. Ahana Cloud for Presto was named a finalist in the Business Intelligence and Analytics category in CRN’s 2022 Tech Innovator Awards. 

View all the PrestoCon 2022 sessions in the full program schedule.

PrestoCon 2022 is an in-person and virtual event on December 7 – 8. Registration is open.

Tweet this: @AhanaIO announces its participation at #PrestoCon #cloud #opensource #analytics #presto https://bit.ly/3UTJp12

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes.

Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC
978-649-7189
beth@ahana.io

Ahana CEO Steven Mih Recognized as Top 50 SaaS CEO

The Software Report award commends impactful leadership in support of company’s commitment to the open SQL data lakehouse powered by Presto

San Mateo, Calif. – October 25, 2022 – Ahana, the only SaaS for Presto, today announced that CEO Steven Mih has been recognized on The Software Report’s annual Top 50 SaaS CEOs list.

The Software Report acknowledges top CEOs in a variety of industries who demonstrate that with strong, innovative leadership, the best software solutions thrive and are rapidly adopted across the global economy. 2022’s awardees were selected based on thousands of nominations from colleagues, peers, and other software industry professionals and in-house industry research.

Under Steven’s leadership, Ahana was founded in April 2020 during the COVID-19 pandemic as an all-remote company. Ahana brings together decades of cloud, open source, database and distributed systems experience to be the only commercial company focused on PrestoDB, the project hosted by the Linux Foundation’s Presto Foundation. Presto is the de-facto open source distributed SQL query engine for data analytics and for the open Data Lakehouse. Ahana works closely with other Presto Foundation members, Meta & Uber, where Presto is battle-tested and runs at very large scale. Steven has taken the company from inception to Series A financing, with $32 million in capital raised to date.

“Since founding Ahana, with my talented team I’ve been focused on building a company that brings the reliability and performance of the data warehouse together with the flexibility and better price performance of the open data lake, enabling SQL and ML/AI use cases on data to engineers and data science teams,” said Mih. “We’re excited about making distributed compute infrastructure as easy as B2C SaaS applications enabling our users to deliver data-driven insights on large amounts of data.”

Tweet this: @AhanaIO CEO Steven Mih recognized as Top 50 #SaaS #CEO by @SoftwareReport1 #opensource #data #analytics #presto https://bit.ly/3VJrTgV

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes.

Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC
978-649-7189
beth@ahana.io

AWS Athena Alternatives: Best Amazon Athena Alternatives

Looking for Athena alternatives? Ahana gives you the scale and performance of PrestoDB – the same underlying technology that powers Athena, and which is used for petabyte-scale analytics at Meta and Uber – with none of the limitations. Get better price/performance and regain control over your cloud infrastructure with Ahana’s managed Presto solution for AWS. Request a demo today!

This is the 4th blog in our comparing AWS Athena to PrestoDB series. If you missed the others, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits
Part 3: AWS Athena vs. PrestoDB Blog Series: Athena Partition Limits

If you’re looking for Amazon Athena alternatives, you’ve come to the right place. In this blog post, we’ll explore some of the best AWS Athena alternatives out there.

Athena is a great tool for querying data stored in S3 – typically in a data lake or data lakehouse architecture – but it’s not the only option out there. There are a number of other alternatives that you might want to consider, including serverless options such as Ahana or Presto, as well as cloud data warehouses.

Each of these tools has its own strengths and weaknesses, and really the best choice depends on the data you have and what you want to do with it. In this blog post, we’ll compare Athena with each of these other options to help you make the best decision for your data.

What is AWS Athena?

AWS Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Amazon Athena is great for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine 2) also provides federated query capabilities, which allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Why would I not want to use AWS Athena?

There are various reasons users look for alternative options to Athena, in spite of its advantages: 

  1. Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per-region. If too many users leverage the service at the same time in a region, users across the board start seeing query queuing and latencies. Query concurrency can be challenging due to limits imposed on accounts to prevent users from overwhelming the regional service.
  2. Cost per query: Athena charges based on terabytes of data scanned ($5 per TB). If your datasets are not very large, and you don’t have many users querying the data often, Athena is the perfect solution for your needs. If, however, you run hundreds or thousands of queries over large datasets, scanning terabytes or petabytes of data, Athena may not be the most cost-effective choice.
  3. Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for the queries. You have no visibility into the underlying infrastructure or even into the details as to why the query failed or how it’s performing. This visibility is important from a query tuning and consistency standpoint and even to reduce the amount of data scanned in a query.
  4. Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
  5. Feature delays: Presto is evolving at an expedited rate, with new performance features, SQL functions, and optimizations being contributed periodically by the community as well as by companies such as Facebook, Alibaba, Uber, and others. Amazon caught up with version 0.217 only in November 2020. With the current version of PrestoDB being 0.248, if you need the performance, features, and efficiencies that newer versions provide, you are going to have to wait for some time.
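The cost-per-query point (reason 2) can be made concrete with a short sketch. Athena’s $5/TB scan price is AWS’s published rate; the table size, partition-pruning ratio, and query volume below are hypothetical illustration values.

```python
# How partitioning changes the bill under Athena's scan-based pricing.
# Hypothetical: a 10 TB table queried 1,000 times per month, where
# partition pruning lets a typical query scan only 5% of the table.

PRICE_PER_TB = 5.00  # Athena's published $5 per TB scanned

full_scan_cost = 10 * PRICE_PER_TB        # $50.00 per query, unpartitioned
pruned_scan_cost = 0.5 * PRICE_PER_TB     # $2.50 per query, 95% pruned

monthly_full = 1_000 * full_scan_cost     # $50,000/month
monthly_pruned = 1_000 * pruned_scan_cost # $2,500/month
```

Because you only control cost indirectly (via how much data each query scans), poorly partitioned or uncompressed data can make an otherwise cheap service expensive at high query volumes.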

What are the typical alternatives to AWS Athena?

  1. DIY open-source PrestoDB
  2. Managed Hadoop and Presto
  3. Managed Presto Service
  4. Cloud data warehouse such as Redshift or Snowflake

Depending upon their business needs and the desired level of control, users leverage one or more of the following options:

DIY open-source PrestoDB

Instead of using Athena, users deploy open-source PrestoDB in their own environment (either on-premises or in the cloud). This mode of deployment gives the user the most flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage, and monitor the deployment.

Large-scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters, FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and tech-savvy startups such as Uber and Pinterest, to name a few.

The cost of managing an additional PrestoDB cluster may be incremental for a customer already managing large distributed systems; for customers starting from scratch, however, it can be a substantial increase in cost.

Managed Hadoop and Presto

Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.

AWS provides EMR (Elastic MapReduce), Google provides Dataproc, and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their versions of Hadoop.

This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would in a DIY model, but only for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scale-out becomes as easy as the push of a button, and costs can be further optimized by autoscaling using either on-demand or spot instances.

You still need the expertise to get the most out of your deployment by tweaking configurations, instance sizes, and properties.

Managed Presto Service

If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.

Ahana Cloud provides a fully managed Presto cloud service, with support for a wide range of native Presto connectors, IO caching, and optimized configurations for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana’s service is cloud-native and runs on Amazon’s Elastic Kubernetes Service (EKS) to provide resiliency, performance, and scalability, and also helps reduce your operational costs.

A managed Presto service such as Ahana gives you the visibility you need into query performance, instance utilization, security, auditing, and query plans, as well as the ability to manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimal defaults, and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters, and scale up and down depending upon your workload needs.

Ahana is a premier member of the Linux Foundation’s Presto Foundation and, unlike Athena, Presto on EMR, Dataproc, and HDInsight, contributes many features back to the open-source Presto community.

Cloud Data Warehouse (Redshift, Snowflake)

Another alternative to Amazon Athena would be to use a data warehouse such as Snowflake or Redshift. This would require a paradigm shift from a decoupled open lakehouse architecture to a more traditional design pattern focused on a centralized storage and compute layer.

If you don’t have a lot of data and are mainly looking to run BI-type predictable workloads (rather than interactive analytics), storing all your data in a data warehouse such as Amazon Redshift or Snowflake would be a viable option. However, companies that work with larger amounts of data and need to run more experimental types of analysis will often find that data warehouses do not provide the required scale and cost-performance benefits and will gravitate towards a data lake.

In these cases, Athena or Presto can be used in tandem with a data warehouse and data engineers can choose where to run each workload on an ad-hoc basis. In other cases, the serverless option can replace the data warehouse completely.

Presto vs Athena: To Summarize

You have a wide variety of options regarding your use of PrestoDB. 

If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you. 

On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go. 

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Related Articles

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

There can be some confusion with the difference between AWS Redshift Spectrum and AWS Athena. Learn more about the differences in this article.

AWS Athena vs AWS Glue: What Are The Differences?

Here, we talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

Presto SQL Syntax: Learn to Write SQL Queries in Presto

Presto is powerful, but running it on your own can get complicated. If you’re looking for a managed Presto experience that can let you focus on querying your data rather than managing infrastructure, try Ahana Cloud today.

PrestoDB uses regular ANSI SQL to query big data stored in object storage. If you’ve used SQL in databases before, you should be very comfortable using Presto. However, there are some quirks you need to keep in mind, stemming from the fact that Presto is typically used to query semi-structured storage such as Amazon S3 rather than relational databases.

Below you’ll find some of our most popular resources relating to writing Presto SQL

Working with Date and Time Data

Working With Different Data Types

Manipulating Tables and Data

Additional Presto SQL Resources

Ahana to Deliver Session About the Open Data Lakehouse at AI & Big Data Expo North America

San Mateo, Calif. – September 28, 2022 – Ahana, the only SaaS for Presto, today announced it will lead a session at AI & Big Data Expo North America about the open data lakehouse. The hybrid event is being held October 5 – 6 at the Santa Clara Convention Center in Santa Clara, CA, and virtually.

Session Title: “Value of the Open Data Lakehouse”
Session Date & Time: Thursday, October 6 at 2:20 pm PT and virtual
Session Presenter: Ahana’s Rohan Pednekar, senior product manager
Session Details: With up to 80% of data stored in the data lake today, how do you unlock the value of the data lakehouse? The value lies in the compute engine that runs on top of an open data lakehouse. During this talk, Rohan will discuss the benefits of the emerging Open Data Lakehouse. He will also cover why Presto is the de facto query engine for the open data lake; what the open data lakehouse is; the benefits of moving to an open data lakehouse; and how the open source query engine Presto is critical to the open data lakehouse.

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, users can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis.

To register for AI & Big Data Expo, please go to the event’s registration page to purchase a registration.

Tweet this: @AhanaIO to deliver session about the open data lakehouse at @AI_Expo opensource #data #analytics #presto https://bit.ly/3xHp2uw

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes.

Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC
978-649-7189
beth@ahana.io


Top 4 Amazon Redshift Alternatives & Competitors

Introduction

In the last article we discussed the fundamental problems with Amazon Redshift. To add to that article, we’ll provide some information about where to look if you’re starting to explore new options. We’ll unearth some of the available Redshift alternatives and why they are worth looking into. Disclaimer: this list is not ranked in any particular order.

1. Ahana

Ahana offers the only managed service for Presto as a feature-rich, next-gen SQL query engine in Ahana Cloud. It plays a critical role for data platform users looking for an easy-to-use, fully integrated, cloud-native option for their SQL engine on their AWS S3 data lakes, as well as other data sources. Ahana Cloud has everything the user needs to get started with SQL on the Open Data Lakehouse. It’s a great choice as a Redshift alternative, or even to augment the warehouse.

Currently, Ahana is offering a free trial for its enterprise solution, as well as a free community edition.

2. BigQuery

BigQuery is another great AWS Redshift alternative. It’s a cloud data warehouse that ingests and processes queries at scale on Google Cloud Platform. If you’re on Google Cloud, it doesn’t require much effort to integrate it with other Google products.

You can run queries and analyze terabytes of data in seconds. BigQuery allows the user to leverage the power of Google’s infrastructure to load data. The user can also use Google Cloud Storage to bulk load data, or stream it in bursts of up to a thousand rows per second.

It’s supported by the BigQuery REST API that comes with client libraries like Java, PHP, and Python. While BigQuery is the most proven tool on this list, it’s not the easiest to use. If your team lacks an experienced data engineer, you’re going to have challenges as the learning curve is significant.

BigQuery pricing: queries are billed based on the amount of data processed, at $5 per TB (with one free TB per month).
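As a rough illustration of this on-demand model (the workload figures below are hypothetical; only the $5/TB rate and the monthly free terabyte come from the pricing above):

```python
# Rough BigQuery on-demand estimate: $5 per TB processed,
# after deducting the 1 free TB per month.
def monthly_bigquery_cost(tb_processed, price_per_tb=5.0, free_tb=1.0):
    billable = max(0.0, tb_processed - free_tb)
    return billable * price_per_tb

print(monthly_bigquery_cost(0.5))   # under the free tier: $0
print(monthly_bigquery_cost(11.0))  # 10 billable TB: $50
```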

3. Azure SQL Data Warehouse

As a Redshift alternative, Azure is a good choice. Azure SQL Data Warehouse is perfect for large businesses dealing with consumer goods, finance, utilities, and more. One of the most used services on Microsoft Azure, it’s a SQL server in the cloud, but fully managed and more intelligent.

Now absorbed into Azure Synapse Analytics, it’s a powerful cloud-based analytics platform you can use to design the data structure immediately (without worrying about potential implementation challenges). Its provisioned resources also allow users to query data quickly and at scale.

If you’re not familiar with the Azure environment, you’ll have to invest some time in understanding it. As it’s fully featured and well documented, there’s enough support to get you over the learning curve.

Like Redshift and Snowflake, Azure Synapse also follows a consumption-based pricing model. So, it’s best to have an experienced data engineer on-board to make “reasonably accurate guesstimates” before committing.
Azure pricing: follows an hourly data consumption model (and offers a 12-month free trial).

4. Snowflake

Like Redshift, Snowflake is a robust cloud-based data warehouse built to store data for effortless analysis. Snowflake is a good Redshift alternative developed for experienced data architects and data engineers; it leverages a SQL workbench and user permissions to allow multiple users to query and manage different types of data.

Snowflake also boasts robust data governance tools, security protocols, and the rapid allocation of resources. While the platform is powerful and efficient at managing different data types, it still proves to be a significant challenge for users who don’t hail from a strong data background.

Snowflake also lacks data integrations, so your data teams will have to use an external ETL to push the data into the warehouse. Whenever you use third-party tools, you’ll also have to consider the extra costs and overheads (such as setup and maintenance costs) that come with them.

Snowflake follows a consumption-based pricing model similar to that of Redshift. This is great for experienced users who can make an educated guess about their data consumption; others may face an unpleasant surprise at the end of the billing cycle. For a more in-depth look into Snowflake as a competitor, check the Snowflake breakdown.

Snowflake pricing: based on a per-second data consumption model (with an option of a 30-day free trial).

Test out the Alternative

Ready to see the alternative in action?

Using a Managed Service for Presto as a Redshift Alternative

Redshift, while a fantastic tool, does have some significant issues the user is going to have to overcome. Below are the most frequently stated causes of concern expressed by Amazon Redshift users, and the catalysts that drive the search for Redshift alternatives:

Price-Performance

Redshift gets expensive quickly. As data volumes increase, the cost of storage and compute in the warehouse becomes problematic. Redshift comes with a premium cost, especially if you use Spectrum outside of AWS Redshift. A solution is to reach for a tool focused on reducing overhead cost. As a biased example, Ahana Cloud is easy to run and allows users to pay only for what they use, without upfront costs. Simply put: the performance you’re used to, at a lower cost.

Closed & Inflexible

Working with a data warehouse, while having some perks, comes with drawbacks. In this environment the user loses flexibility: data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. Redshift does not support flexible or open data formats.

Other modern solutions allow the user to define and manage data sources. Ahana permits data teams to attach or detach data sources from any cluster with the click of a button, also taking care of configuring and restarting the clusters.

Vendor Lock-in

One of the biggest pain points driving the search for Redshift alternatives is vendor lock-in. Data warehouse vendors, like AWS Redshift, make it difficult to use your data outside of their services. To do so, data would need to be pulled out of the warehouse and duplicated, further driving up compute costs. Use the tools and integrations you need to get value from your data, without the proprietary data formats. Head over to this additional comparison for a solution that addresses vendor lock-in, price-performance, and flexibility.


Summary: Redshift Alternatives

If you are using Amazon Redshift now and are looking to solve some of the problems with it, check out this on-demand webinar providing instructions to augment your Redshift warehouse with an Open Data Lakehouse. This webinar also explains why so many of today’s companies are moving away from warehouses like Snowflake and Amazon Redshift towards other Redshift alternatives – specifically, a SQL Data Lakehouse with Presto.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article


The Fundamental Problems with Amazon Redshift

In the last article we discussed the difference between the data warehouse and Redshift Spectrum. To continue on this topic, let’s understand the problems with Amazon Redshift and some of the available alternatives that data teams can explore further.

Amazon Redshift made it easy for anyone to implement data warehouse use cases in the cloud. However, it is unable to provide the same benefits as newer, more advanced cloud data warehouses. When it was a relatively new technology, everyone was going through a learning curve.

Here are some of the fundamental problems with Amazon Redshift:


AWS Redshift’s Cost

Amazon Redshift is a traditional MPP platform where compute is closely integrated with storage. The advantage of the cloud is that, theoretically, compute and storage are completely independent of each other and storage is virtually unlimited. With Redshift, if you want more storage you have to purchase more compute power. As data volumes increase, the cost of storage and compute in the warehouse becomes challenging. AWS products, particularly Redshift and Spectrum, come at a premium cost, especially if you use Spectrum outside of AWS Redshift. This results in one of the most expensive cloud data warehouse solutions.

Vendor lock-in with Redshift

Data warehouse vendors, like AWS, make it difficult to use your data outside of their services. Data would need to be pulled out of the warehouse and duplicated, further driving up compute costs.

Proprietary data formats

Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. No flexible or open data formats available.

No Staging Area in Redshift

It is expensive to host data with Amazon, so duplication of data has to be avoided at all costs. In traditional RDBMS systems, we tend to have landing, staging, and warehouse layers in the same database, but for Amazon’s data warehouse the landing and staging layers have to be on S3. Only the data on which reports and analytics will be built should be loaded into Redshift, and this should happen on a need basis rather than keeping the entire dataset in the warehouse.

No Index support in Amazon Redshift

This warehouse does not support indexes like other data warehouse systems, hence it is designed to perform best when you select only the columns that you absolutely need to query. As Amazon’s data warehouse uses columnar storage, a construct called a distribution key needs to be used: a column based on which data is distributed across the different nodes of the cluster.
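The role of a distribution key can be illustrated with a toy sketch: rows are placed on nodes by hashing the key column, so rows sharing a key co-locate. This is a simplified stand-in, not Redshift’s actual placement algorithm, and the data is made up:

```python
# Toy illustration of key-based distribution: each row lands on the
# node determined by a hash of its distribution-key column.
import hashlib

def node_for(key, num_nodes):
    # Stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

rows = [("alice", 10), ("bob", 20), ("alice", 30), ("carol", 40)]
placement = {key: node_for(key, num_nodes=4) for key, _ in rows}
# All rows sharing a key land on the same node, so joins and
# aggregations on that column avoid shuffling data between nodes.
```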

Manual house-keeping

Performance issues need to be handled through proper maintenance: VACUUM and ANALYZE operations, sort keys, compression, distribution styles, etc.

Tasks like VACUUM and ANALYZE need to be run regularly, and they are expensive and time-consuming. There is no single frequency that suits all workloads, so a quick cost-benefit analysis is required before deciding how often to run them.

Disk space capacity planning

Control over disk space is a must with Amazon Redshift, especially when you’re dealing with analytical workloads. There is a high chance of oversubscribing the system, and reduced disk space not only degrades query performance but also makes it cost-prohibitive. Having a cluster filled above 75% isn’t good for performance.
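The 75% guideline can be captured in a trivial capacity check; an illustrative helper, not an AWS API:

```python
# Illustrative capacity check for the ~75% fill guideline noted above.
def needs_headroom(used_tb, total_tb, threshold=0.75):
    """Return True when cluster fill exceeds the performance threshold."""
    return used_tb / total_tb > threshold

print(needs_headroom(60, 100))  # 60% full: False
print(needs_headroom(80, 100))  # 80% full: True
```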

Concurrent query limitation

Above 10 concurrent queries, you start seeing issues. Concurrency scaling may mitigate queue times during bursts in queries. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the constraints on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

Conclusion

These were some of the fundamental problems vocalized by users that you need to keep in mind while using or exploring Amazon Redshift. If you are searching for more information about Redshift’s challenges and query limitations, check out the next article in this series.

Comparing AWS Redshift?

See how the alternatives rank

Amazon Redshift Pricing: An Ultimate Guide

AWS’ data warehouse is a completely managed cloud service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, it is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Ahana Joins Leading Open Source Innovators in its Commitment to the Velox Open Source Project Created by Meta

Extends engineering resources and names significant contributors  

San Mateo, Calif. – August 31, 2022 – Ahana, the only SaaS for Presto, today announced it is strengthening its commitment to the further development of the Velox Open Source Project, created by Meta, with the dedication of more engineers and significant contributors. Ahana joined Intel and ByteDance as the project’s primary contributors when it was open sourced in 2021.

Velox is a state-of-the-art, C++ database acceleration library. It provides high-performance, reusable, and extensible data processing components, which can be used to accelerate, extend, and enhance data computation engines. It is currently integrated with more than a dozen data systems at Meta, from analytical query engines such as Presto and Spark to stream processing platforms and machine learning libraries such as PyTorch. 

“Velox is poised to be another vibrant open source project created by Meta with significant industry impact. It caught our attention as it enables developers to build highly-efficient data processing engines,” said Steven Mih, Cofounder and CEO, Ahana. “It’s well understood that at Meta, there are diverse opportunities to improve data processing at scale, and, as a result, trailblazing innovations are developed. As data becomes central to every organization, we see many enterprise data teams facing similar challenges around consistency of diverse data systems, which Velox-based systems could solve. As a primary contributor from the start, we are furthering our commitment to grow a community of developers to collaboratively accelerate the project.”

“To our knowledge, Velox is a pioneer effort at unifying execution engines in a centralized open source library. Other than efficiency gains due to its state-of-art components, Velox also provides benefits in terms of increased consistency across big data engines, and by promoting reusability,” said Pedro Pedreira, Software Engineer, Meta. “We see Velox as an important step towards a more modular approach to architecting data management systems. Our long-term goal is to position Velox as a de-facto execution engine in the industry, following the path paved by Apache Arrow for columnar memory format. We are excited about the momentum the project is getting, and the engagement and partnership with Ahana’s engineers and other open source contributors.”

“We’re excited to work closely with Ahana, Meta, and Velox community,” said Dave Cohen, Senior Principal Engineer, Intel. “While there are other database acceleration libraries, the Velox project is an important, open-source alternative.”

Velox Project significant contributors from Ahana include:

Deepak Majeti, Principal Engineer and Velox Contributor. Deepak has been contributing to the Velox project since it was open sourced in 2021. Before joining Ahana, he was a technical lead at Vertica. He has expertise in Big Data and High-Performance computing with a Ph.D. from the Computer Science department at Rice University. Deepak is also an Apache ORC PMC member, Apache Arrow and Apache Parquet committer.

Aditi Pandit, Principal Engineer and Velox Contributor. Aditi has also been contributing to the Velox project since it was open sourced in 2021. Before joining Ahana, she was a senior software engineer at Google on the Ads Data Infrastructure team and prior to that, a software engineer with Aster Data and Informatica Corp. 

Ying Su, Principal Engineer and Velox Contributor.  Ying joined Deepak and Aditi as significant contributors to Velox in 2022.  Prior to Ahana, she was a software engineer at Meta and before that, a software engineer at Microsoft. Ying is also a Presto Foundation Technical Steering Committee (TSC) member and project committer. 

Supporting Resources

Meta blog introducing Velox is here.
Tweet this: @AhanaIO joins leading open source innovators in its commitment to the Velox Open Source Project created by Meta #opensource #data #analytics https://bit.ly/3pqwBkH

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Using JMeter with Presto

Apache JMeter is an open source application written in Java that is designed for load testing. This article presents how to install it, and how to create and run a test plan for testing SQL workloads on Presto clusters.

You will need Java on the system you are installing JMeter on. If you do not have Java installed on your system, see How do I install Java?

You will need a Presto cluster to configure JMeter to connect to and run the test plan on. You can create a Presto cluster for free in Ahana Cloud Community Edition.

Installing JMeter

To install JMeter, start by downloading the latest JMeter build and unzipping the downloaded file into a new directory. For this article, the new directory’s name is jmeter.

Next, download the version of the Presto JDBC driver that matches your Presto version. The file you want to download is called presto-jdbc-X.XXX.jar.

💡 As of this writing, the Presto JDBC driver version to use with an Ahana Presto cluster is presto-jdbc-0.272.jar. To find the Presto version for a cluster in Ahana, open the Manage view of the Presto cluster and look for Version of Presto in the Information section. The version shown will be similar to 0.272-AHN-0.1. Use the first four numbers to choose the Presto JDBC driver to download.
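Mapping the cluster’s version string to the driver jar name can be scripted; a small sketch using the example string above (the helper name is ours):

```python
# Derive the JDBC driver jar name from a Presto version string like
# "0.272-AHN-0.1": the driver version is the part before the first dash.
def jdbc_driver_jar(presto_version):
    base = presto_version.split("-")[0]
    return f"presto-jdbc-{base}.jar"

print(jdbc_driver_jar("0.272-AHN-0.1"))  # presto-jdbc-0.272.jar
```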

Copy the downloaded Presto JDBC driver jar file into the jmeter directory’s lib folder.

To run JMeter, change to the jmeter directory and run the command

bin/jmeter

Create a Test Plan in JMeter

In JMeter, select the Templates icon to show the Templates window. In the dropdown of the Templates window, select JDBC Load Test, then select Create.

Enter the JDBC endpoint in Database URL. In Ahana, you can find and copy the JDBC endpoint in Connection Details of the Presto cluster.

You can include either or both of the catalog and schema names in the Database URL, separated by slashes after the port number. For example:

jdbc:presto://report.youraccount.cp.ahana.cloud:443/tpch/tiny

If you do not include them in the URL, you must specify the catalog and schema names in the SQL query in JDBC Request.
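Assembling the Database URL from its parts can be sketched as follows; the host name is the same placeholder as in the example above, and the helper function is hypothetical:

```python
# Build a Presto JDBC URL, optionally appending catalog and schema.
def presto_jdbc_url(host, port=443, catalog=None, schema=None):
    url = f"jdbc:presto://{host}:{port}"
    if catalog:
        url += f"/{catalog}"
        if schema:
            url += f"/{schema}"
    return url

print(presto_jdbc_url("report.youraccount.cp.ahana.cloud",
                      catalog="tpch", schema="tiny"))
# jdbc:presto://report.youraccount.cp.ahana.cloud:443/tpch/tiny
```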

Enter com.facebook.presto.jdbc.PrestoDriver in JDBC Driver class.

Enter Username and Password of a Presto user attached to the Presto cluster.

In the navigation bar on the left, expand Thread Group, select JDBC Request, and enter the SQL query in Query.

💡 Do not include a semicolon at the end of the SQL query that you enter, or the test plan run will fail.

Set how many database requests run at once in Number of Threads (users).

In Ramp-up period (seconds), enter the time that JMeter should take to start all of the requested threads.

Loop Count controls how many times the thread steps are executed.

For example, if Number of Threads (users) = 10, Ramp-up period (seconds) = 100, and Loop Count = 1, JMeter creates a new thread every 10 seconds and the SQL query runs once in each thread.
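The arithmetic behind that example generalizes: JMeter starts one new thread every ramp-up ÷ threads seconds. A small sketch of the resulting start offsets:

```python
# JMeter starts threads evenly across the ramp-up period:
# one new thread every (ramp_up / num_threads) seconds.
def thread_start_times(num_threads, ramp_up_seconds):
    interval = ramp_up_seconds / num_threads
    return [round(i * interval, 3) for i in range(num_threads)]

print(thread_start_times(10, 100))
# [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
```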

You can add a report of the performance metrics for the requests to the test plan. To do so, right-click Test Plan in the left navigation bar, then select Add > Listener > Summary Report.

Select the Save icon in JMeter and enter a name for the test plan.

Run the Test Plan and View Results

To run the test plan, select the Start icon.

In the left navigation bar, select View Results Tree or Summary Report to view the output of the test plan.


Run JMeter in Command Line Mode

For best results from load testing, it is recommended to run without the GUI. After you have created and configured a test plan using the GUI, quit JMeter, then run it from the command line.

For example, to run JMeter with a test plan named testplan and create a report in a new directory named report, run the following command:

bin/jmeter -n -t bin/templates/testplan.jmx -l log.jtl -e -o report

When the test plan is run, JMeter creates an index.html file in the report directory summarizing the results.


Virtual Lab On-Demand:

Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

Learn how to build an open data lakehouse stack using Presto, Apache Hudi, and AWS S3 in this on-demand virtual lab.

What you’ll learn:

  • A quick overview on the open data lakehouse stack, including what is Presto (query engine) and what is Apache Hudi (transaction layer)
  • How to get HUDI support on Presto
  • Querying HUDI data with Presto  
  • How to use Presto to query your AWS S3 Data Lake
  • Future – What additional HUDI support is coming to Presto

By the end of this lab, you’ll know how to run queries with Presto and Hudi to optimize your AWS S3 data lake.

Additional Resources

Blog: Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Ahana Community Office Hours: August 24 at 10:30am PT/1:30pm ET

Join us for our Ahana Community Office Hours. Our experts will answer your questions about getting started with Ahana Cloud.

Speakers

Sivabalan Narayanan

Software Engineer, Onehouse


Jalpreet Singh Nanda

Software Engineer, Ahana


Ahana Awarded Many Industry Recognitions and Accolades for Big Data, Data Analytics and Presto Innovations

San Mateo, Calif. – August 3, 2022 – Ahana, the only SaaS for Presto, today announced many new industry accolades in 1H 2022. Presto, originally created by Meta (Facebook), which open-sourced and donated the project to the Linux Foundation’s Presto Foundation, is the fast and reliable SQL query engine for data analytics and the data lakehouse. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data.

“Businesses are looking for ways to bring the reliability of the data warehouse together with the scale and flexibility of the data lake,” said Steven Mih, Cofounder and CEO, Ahana. “We believe the Data Lakehouse offers a new paradigm for a self-service data platform built on open-source foundations, leveraging the scalability of modern cloud services.  With the Ahana Cloud for Presto managed service, we’ve delivered an open SQL data lakehouse that brings the best of the data warehouse and the data lake. We are very excited to see its reception in the marketplace as time and time again it is recognized for its innovation and the benefits it delivers to customers.” 

Recent award recognitions include:

  • CRN, “The 10 Coolest Big Data Tools of 2022 (so far)” – Data is an increasingly valuable asset for businesses and a critical component of many digital transformation and business automation initiatives. CRN named Ahana Cloud for Presto Community Edition to its list of 10 cool tools in the big data management and analytics space that made their debut in the first half of the year.
  • CRN, “Emerging Big Data Vendors to Know in 2022” – As data becomes an increasingly valuable asset for businesses—and a critical component of many digital transformation and business automation initiatives—demand is growing for next-generation data management and data analytics technology. Ahana is listed among 14 startups providing that technology, with its Presto SQL query engine on AWS and its vision to simplify open data lake analytics.
  • CRN, “The Coolest Business Analytics Companies of the 2022 Big Data 100” – CRN’s Big Data 100 includes a look at the vendors solution providers should know in the big data business analytics space. Ahana, which offers Ahana Cloud for Presto, a SQL data analytics managed service based on Presto, the high-performance distributed SQL query engine for data residing in a variety of sources, was named to this prestigious list.
  • Database Trends & Applications, “DBTA 100 2022: The Companies That Matter Most in Data” – Business leadership understands that creating resilient IT systems and pipelines for high-quality, trustworthy data moving into employees’ workflows for decision making is essential. To help bring new resources and innovation to light, each year Database Trends and Applications magazine presents the DBTA 100, a list of forward-thinking companies, such as Ahana, seeking to expand what’s possible with data for their customers.
  • InsideBIGDATA, “IMPACT 50 List for Q1, Q2 and Q3 2022” – Ahana earned an Honorable Mention for all of the last three quarters of the year as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 
  • 2022 SaaS Awards Shortlist – Ahana was recognized by the SaaS Awards as a finalist for Best SaaS Newcomer and Best Data Innovation in a SaaS Product on the 2022 shortlist.
  • 2022 American Business Awards, “Stevie Awards” – Ahana was named the winner of a Silver Stevie® Award in the Big Data Solution category in the 20th Annual American Business Awards®. The winners were determined by the average scores of more than 250 professionals worldwide in a three-month judging process.

Tweet this: @AhanaIO receives many industry #awards and #accolades for innovation in  #BigData #Data #Analytics and #Presto https://bit.ly/3OHiVvX 

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Redshift vs Redshift Spectrum: A Complete Comparison

Amazon Redshift is a cloud-based data warehouse service offered by Amazon. Redshift is a columnar database optimized to handle the sorts of analytical queries that run against enterprise star schemas and snowflake schemas.

Redshift Spectrum is an extension of Amazon Redshift. As a feature of Redshift, Spectrum allows the user to query data stored in S3. With Amazon Redshift Spectrum, you can continue to store and grow your data in S3 and use Redshift as one of the compute options to process it (other options include EMR, Athena, or Presto).

There are many differences between Amazon Redshift and Redshift Spectrum; here are some of them:

Architecture

Image Source: https://docs.aws.amazon.com/

An Amazon Redshift cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. The client application interacts directly only with the leader node; the compute nodes are transparent to external applications.

Image Source: aws.amazon.com

Redshift Spectrum queries, by contrast, are submitted to the leader node of your Amazon Redshift cluster. The Amazon Redshift compute nodes generate multiple requests depending on the number of objects that need to be processed and submit them concurrently to Redshift Spectrum. The Redshift Spectrum worker nodes scan, filter, and aggregate your data from Amazon S3 and send it back to your Amazon Redshift cluster for processing. The final join and merge operations are then performed locally in your cluster and the results are returned to your client.

Redshift Spectrum is a service that uses dedicated servers to handle the S3 portion of your queries. The AWS Glue Data Catalog is used to maintain the definitions of the external tables. Redshift loosely connects to S3 data by the following route:


External database, schema, and table definitions in Redshift use an IAM role to interact with the Glue Data Catalog and Spectrum, which handles the S3 portion of the queries.

Use case 

Amazon Redshift is a fully managed data warehouse that is efficient at storing historical data from various sources. It is designed to ease the process of data warehousing and analytics.

Redshift Spectrum is used to perform analytics directly on data in Amazon S3 using the nodes of an Amazon Redshift cluster. This allows users to separate storage and compute and to scale them independently.

You can use Redshift Spectrum, an add-on to Amazon Redshift, to query data in S3 files alongside existing information in the Redshift data warehouse. In addition to querying the data in S3, you can join it with tables residing in Redshift.

Performance

Because Amazon Redshift controls how data is stored, compressed, and queried, it has many more options for optimizing a query. Redshift Spectrum, on the other hand, only has control over how the data is queried, since how it is laid out in S3 is up to you. Performance of Redshift Spectrum therefore depends on your Redshift cluster resources and on how well your S3 storage is optimized.

That said, Spectrum offers the convenience of not having to import your data into Redshift; essentially, you trade some performance for that simplicity. Many companies use Spectrum to query infrequently accessed data and then move the data of interest into Redshift for more regular access.
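As an illustration of that pattern, a single Spectrum query can join archived data in S3 with a hot table stored in Redshift. This is only a sketch: the schema, table, column names, and role ARN below are hypothetical.

```sql
-- Register an external schema backed by the Glue Data Catalog
-- (database name and IAM role ARN are placeholders).
CREATE EXTERNAL SCHEMA spectrum_archive
FROM DATA CATALOG
DATABASE 'archive_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Join infrequently accessed order history in S3 with a
-- frequently accessed customer table stored in Redshift.
SELECT c.customer_name, SUM(o.order_total) AS lifetime_total
FROM spectrum_archive.orders_2015 AS o
JOIN public.customers AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```

The external table is scanned by Spectrum workers in S3, while the join and aggregation finish on your Redshift cluster.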

Conclusion

This article provides a quick recap of the major differences between Amazon Redshift and Redshift Spectrum, taking into consideration today’s data platform needs.

Put simply, Amazon Redshift can be classified as a tool in the “Big Data as a Service” category, whereas Amazon Redshift Spectrum is grouped under “Big Data Tools”.

If you are an existing Amazon Redshift customer looking for the best price-performance solution to run SQL on an AWS S3 data lake, try our community edition or 14-day free trial.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


How to Use AWS Redshift Spectrum in AWS Lake Formation

As we’ve covered previously in What is Redshift Used For?, AWS Redshift is a cloud data warehouse used for online analytical processing (OLAP) and business intelligence (BI). Due to Redshift’s coupled architecture and relatively high costs at larger data volumes, businesses often seek to limit the workloads running on Redshift, while utilizing other analytic services, including open-source Presto, as part of a data lakehouse architecture.

Lake Formation makes it easier to set up the data lake, and to incorporate Redshift as part of the compute layer alongside other analytics tools and services. Developers can optimize their costs by using AWS Redshift for frequently accessed data and move less frequently accessed data to the Amazon S3 data lake, where it can be queried using serverless query engines such as Athena, Ahana, and Redshift Spectrum.

Two main reasons you would want to use Redshift with Lake Formation:

  • Granting and revoking permissions: Within Lake Formation, there is an independent permissions model in addition to the general IAM permissions set on an AWS account. This enables granular control over who can read data from a lake. You can grant and revoke permissions to the Data Catalog objects, such as databases, tables, columns, and underlying Amazon S3 storage. With Redshift following the Lake Formation permissions model out-of-the-box, you can ensure that the users querying data in Redshift are only accessing data they are meant to access. 
  • Creating external tables and running queries: Amazon Redshift Spectrum can be used as a serverless query option to join data stored in Redshift with data residing on S3. Lake Formation allows you to create virtual tables that correspond to S3 file locations and register them in the Data Catalog. A Redshift Spectrum query would then be able to consume this data without additional configuration.

How to Integrate AWS Redshift in Lake Formation

Lake Formation relies on the AWS Glue Crawler to store table locations in the Glue Data Catalog, which can then be used to control access to S3 data for other analytics services, including Redshift. This AWS blog post suggests a reference architecture for connecting the various services involved:

  • Data stored in an Amazon S3 lake is crawled using AWS Glue Crawler.
  • Glue Crawler then stores the table and database definitions in the AWS Glue Data Catalog.
  • The S3 bucket is registered as the data lake location with Lake Formation. Lake Formation is natively integrated with the Glue Data Catalog.
  • Lake Formation grants permissions at the database, table, and column level to the defined AWS Identity and Access Management (IAM) roles.
  • Developers create external schemas within Amazon Redshift to manage access for other business teams.
  • Developers provide access to the user groups to their respective external schemas and associate the appropriate IAM roles to be assumed. 
  • Users now can assume their respective IAM roles and query data using the SQL query editor to their external schemas inside Amazon Redshift.
  • After the data is registered in the Data Catalog, each time users try to run queries, Lake Formation verifies access to the table for that specific principal. Lake Formation vends temporary credentials to Redshift Spectrum, and the query runs.
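The external-schema step in the flow above might look like the following sketch in SQL; the database name and role ARN are placeholders, and Lake Formation must already have granted the assumed role access to the underlying tables.

```sql
-- External schema over a Lake Formation-governed Glue database
-- (names and the role ARN are hypothetical).
CREATE EXTERNAL SCHEMA lakehouse
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLakeFormationRole';

-- At query time, Lake Formation verifies this principal's grants on the
-- table and vends temporary credentials to Redshift Spectrum.
SELECT region, COUNT(*) AS events
FROM lakehouse.web_events
GROUP BY region;
```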

Using Lake Formation as Part of an Open Data Lakehouse 

One of the advantages of a data lake is its open nature, which allows businesses to use a variety of best-in-breed analytics tools for different workloads. This replaces database-centric architectures, which require storing data in proprietary formats and getting locked in with a particular vendor.

Implementing Lake Formation makes it easier to move more data into your lake, where you can store it in open-source file formats such as Apache Parquet and ORC. You can then use a variety of tools that interface with the Glue Data Catalog and read data directly from S3. This provides a high level of flexibility, prevents vendor lock-in, and strongly decouples storage from compute, reducing your overall infrastructure costs. (You can read more about this topic in our new white paper: The SQL Data Lakehouse and Foundations for the New Data Stack.)

If you’re looking for a truly open and flexible option for serverless querying, you should check out Ahana Cloud. Ahana Cloud and AWS Lake Formation make it easy to build and query secure S3 data lakes. Using the native integration, data platform teams can seamlessly connect Presto with AWS Glue, AWS Lake Formation, and AWS S3 while providing granular data security. Enabling the integration in Ahana Cloud is a single click when creating a new Presto cluster.

Learn more about Ahana Cloud’s integration with AWS Lake Formation.

Related Articles


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Ahana to Present About Presto on the Open Data Lakehouse at PrestoCon Day; Ahana Customer Blinkit to Discuss Its Presto on AWS Use Case

July 21 all-things Presto event features speakers from Uber, Meta, Ahana, Blinkit, Platform24, Tencent, Bytedance and more

San Mateo, Calif. – July 14, 2022 Ahana, the only SaaS for Presto, today announced its participation in PrestoCon Day, a day dedicated to all things Presto taking place virtually on Thursday, July 21, 2022. In addition to being a premier sponsor of the event, Ahana will be participating in three sessions and Ahana customer Blinkit will also be presenting its Presto use case.

Ahana and Ahana Customer Sessions at PrestoCon

July 21 at 9:35 am PT – “Free-Forever Managed Service for Presto for your Cloud-Native Open SQL Lakehouse,” by Wen Phan, Director of Product Management, Ahana

Getting started with a do-it-yourself approach to standing up an open SQL Lakehouse can be challenging and cumbersome.  Ahana Cloud Community Edition dramatically simplifies it and gives users the ability to learn and validate Presto for their open SQL Lakehouse—for free.  In this session, Wen will show how easy it is to register for, stand up, and use the Ahana Cloud Community Edition to query on top of a lakehouse.

July 21 at 10:30 am PT – “How Blinkit is Building an Open Data Lakehouse with Presto on AWS,” by Akshay Agarwal, Software Engineer, Blinkit; and Satyam Krishna, Engineering Manager, Blinkit

Blinkit, India’s leading instant delivery service, uses Presto on AWS to help them deliver on their promise of “everything delivered in 10 minutes”. In this session, Satyam and Akshay will discuss why they moved to Presto on S3 from their cloud data warehouse for more flexibility and better price performance. They’ll also share more on their open data lakehouse architecture which includes Presto as their SQL engine for ad hoc reporting, Ahana as SaaS for Presto, Apache Hudi and Iceberg to help manage transactions, and AWS S3 as their data lake.

July 21 at 11:00 am PT – “Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy,” by George Wang, Principal Software Engineer, Ahana

Today, Presto supports broadcast join by having one worker fetch data from a small data source, build a hash table, and then send the entire table over the network to all other workers for hash lookups probed by the large data source. This can be optimized by a new query execution strategy in which source data from small dimension tables is pulled directly by all workers, known as replicated reads from dimension tables. This feature comes with a nice caching property: since all N worker nodes now participate in scanning the data from remote sources, the table scan for dimension tables is cacheable on every worker node. In addition, resource utilization improves because the Presto scheduler can reduce the number of plan fragments to execute, as the same workers run tasks in parallel within a single stage, reducing data shuffles.

July 21 at 2:25 pm PT – “Presto for the Open Data Lakehouse,” panel session moderated by Eric Kavanagh, CEO, Bloor Group with Dave Simmen, CTO & Co-Founder, Ahana; Girish Baliga, Chair of Presto Foundation & Sr. Engineering Manager, Uber; Biswapesh Chattopadhyay, Tech Lead, DI Compute, Meta; and Ali LeClerc, Chair of Presto Outreach Committee and Head of Community, Ahana


Today’s digital-native companies need a modern data infra that can handle data wrangling and data-driven analytics for the ever-increasing amount of data needed to drive business. Specifically, they need to address challenges like complexity, cost, and lock-in. An Open SQL Data Lakehouse approach enables flexibility and better cost performance by leveraging open technologies and formats. Join us for this panel where leading technologists from the Presto open source project will share their vision of the SQL Data Lakehouse and why Presto is a critical component.

View all the sessions in the full program schedule

PrestoCon Day is a free virtual event and registration is open

Tweet this: @AhanaIO announces its participation in #PrestoCon Day #cloud #opensource #analytics #presto https://bit.ly/3ImlAcU

# # #


Hands-on Presto Tutorial: How to run Presto on Kubernetes


What is Presto?

Tip: looking for a more technical guide to understanding Presto? Get the free ebook, Learning and Operating Presto.

To learn how to run Presto on Kubernetes, let’s cover the basics first. Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores – as well as search and indexing systems such as Elasticsearch and Druid.

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.
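For example, a single Presto query can join data across catalogs backed by entirely different systems. The mysql and hive catalog names below match this tutorial’s later setup, but the schemas, tables, and columns are illustrative.

```sql
-- Illustrative federated query: join users in a MySQL database with
-- event data in a Hive/Glue table on S3, all in one ANSI SQL statement.
SELECT u.user_id,
       u.email,
       count(*) AS page_views
FROM mysql.demodb.users AS u
JOIN hive.weblogs.events AS e
  ON e.user_id = u.user_id
WHERE e.event_date >= DATE '2022-01-01'
GROUP BY u.user_id, u.email;
```

Each connector translates its side of the query into the source’s native access pattern; the user only writes SQL.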

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, DataProc) 
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone <repository url>

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started with Presto on Kubernetes

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes, e.g. reducing the resource requirements of the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it’s ‘presto-cluster’), you should go in and add a node group with sufficient capacity for the Presto Docker containers to run on. I plan on using r5.2xlarge nodes. I set up a node group of 4 nodes (you can tweak your Presto Docker container settings accordingly and use smaller nodes if required).


Figure 1: Creating a new nodegroup


Figure 2: Setting the instance type and node count

Once your node group shows active, you are ready to move on to the next step.


Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let’s first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the supported environment variables, the repository can be found here.

The details of the specific configuration parameters used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing them via the --env option for docker, or by using name-value pairs in the Kubernetes YAML file, as we will see later.
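As a rough sketch of what those name-value pairs look like on the Kubernetes side (the exact layout is defined by the repository’s presto.yaml; values here are placeholders), each --env option becomes an entry under the container’s env section:

```yaml
# Excerpt of a container spec: docker --env options map to env name/value
# pairs (role ARN and keys are placeholders, not real credentials).
containers:
  - name: presto-coordinator
    image: asifkazi/presto-on-docker:latest
    env:
      - name: PRESTO_CATALOG_HIVE_S3_IAM_ROLE
        value: "arn:aws:iam::<your account id>:role/<your role>"
      - name: PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY
        value: "<your access key>"
    ports:
      - containerPort: 8080
```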

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (you can use the same or a different user):

 arn:aws:iam::<your account id>:user/<your user>

  2. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>


Figure 4: Assume role privileges


Figure 5: Trust relationships


  3. Access to the latest docker image for this tutorial: asifkazi/presto-on-docker:latest

Warning: The permissions shown above are pretty lax, giving the user many privileges – not just assume-role, but also broad access to S3 and Glue. DO NOT use these permissions as-is in production. It is highly recommended to tighten the privileges using the principle of least privilege (grant only the minimal access required).

Run the following commands:

  1. Create a network for the nodes

docker network create presto

  2. Start a mysql docker instance

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  3. Start the presto single-node cluster on docker

docker run -d --name presto \

 --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>"  \

--env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \

--env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \

-p 8080:8080 \

--network=presto \

asifkazi/presto-on-docker:latest

  4. Make sure the containers came up correctly:

docker ps 

  5. Interactively log into the docker container:

docker exec -it presto bash

  6. From within the docker container, verify that everything is working correctly by running the following command:

presto

  7. From within the presto cli, run the following:

show schemas from mysql

The command should show the mysql databases.

  8. From within the presto cli, run the following:

show schemas from hive

The command should show the databases from Glue. If you are using Glue for the first time, you might only see the information_schema and default databases.
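If your Glue catalog is still empty, you can create a schema and an external table from the same presto cli session. The bucket, paths, and table below are placeholders for illustration only:

```sql
-- Create a Glue-backed schema over an S3 prefix (bucket is a placeholder).
CREATE SCHEMA hive.demo WITH (location = 's3://my-bucket/demo/');

-- Register an external Parquet table over existing S3 data.
CREATE TABLE hive.demo.events (
  user_id bigint,
  event_type varchar,
  event_date date
) WITH (format = 'PARQUET', external_location = 's3://my-bucket/demo/events/');

-- Query it like any other table.
SELECT event_type, count(*) FROM hive.demo.events GROUP BY event_type;
```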

Using Presto with Kubernetes

We have validated that the docker container itself is working fine as a single-node cluster (worker and coordinator on the same node). We will now move to getting this environment working in Kubernetes. But first, let’s clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto;docker rm mysql presto;


How to get started running Presto on Kubernetes

To get Presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes-specific properties, we will provide all the docker env properties via name-value pairs.

  1. Create a namespace for the presto cluster

kubectl create namespace presto

  2. Override the env settings in the presto.yaml file for both the coordinator and worker sections
  3. Apply the yaml file to the Kubernetes cluster

kubectl apply -f presto.yaml --namespace presto

  4. Let’s also start a mysql instance. We will first start by creating a persistent volume and claim.

kubectl apply -f ./mysql-pv.yaml --namespace presto

  5. Create the actual instance

kubectl apply -f ./mysql-deployment.yaml --namespace presto

  6. Check the status of the cluster and make sure there are no errored or failing pods

kubectl get pods -n presto

  7. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You will need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the -- and bash is required

  8. Querying seems to be working, but is the Kubernetes deployment a multi-node cluster? Let’s check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

  9. Let’s see what happens if we destroy one of the pods (simulating a failure)

kubectl delete pod presto-worker-k9xw8 -n presto

  10. What does the current deployment look like?

What? The pod was replaced by a new one, presto-worker-tnbsb!

  11. Now we’ll modify the number of replicas for the workers in the presto.yaml
  12. Set replicas to 4
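The change itself is a one-line edit to the worker Deployment in presto.yaml. This excerpt omits the surrounding fields, which depend on the repository’s file:

```yaml
# Worker Deployment excerpt: raising replicas from its previous value to 4
# tells Kubernetes to keep four worker pods running at all times.
spec:
  replicas: 4
```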

Apply the changes to the cluster

kubectl apply -f presto.yaml --namespace presto

Check the number of running pods for the workers


kubectl get pods -n presto

Wow, we have a fully functional Presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing availability and resiliency.

Summary

In this tutorial we set up a single-node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs, as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Next Lesson

Ready for your next Presto lesson from Ahana? Check out our guide to running Presto with AWS Glue as catalog on your laptop.

Data Warehouse: A Comprehensive Guide

Introduction

A data warehouse is a data repository that is typically used for analytic systems and business intelligence tools. It is typically composed of operational data that has been aggregated and organized in a way that facilitates the requirements of data teams. Data consumers need to be able to do their work at very high speed to make decisions. By design, there is usually some latency before data appears in a warehouse; keep that in mind when designing your systems and gathering requirements from your users. In this article, we’re going to review the different types of data warehouse architecture and the different warehouse model types.

Data Warehouse Architecture Types

The various data warehouse architecture types break down into three categories:

Single-tier architecture – The objective of this architecture is to dramatically reduce data duplication and produce a dense set of data. While this design keeps the volume of data as low as possible, it is not appropriate for complex data requirements that include numerous data sources.

Two-tier architecture – This architecture design splits the physical data from the warehouse itself, making use of a system and a database server. This design is typically used for a data mart in a small organization, and while efficient at data storage, it is not a scalable design and can only support a relatively small number of users.

Three-tier architecture – The three-tier architecture is the most common type of data warehouse as it provides a well-organized flow of your raw information to provide insights. It is comprised of the following components:

  • Bottom tier – comprises the database of the warehouse servers. It creates an abstraction layer on the various information sources to be used in the warehouse. 
  • Middle tier – includes an OLAP server to provide an abstracted view of the database for the users. Being pre-built into the architecture, this tier can be used as an OLAP-centric warehouse.
  • Top tier – comprises the client-level tools and APIs that are used for data analysis and reporting. 

Data Warehouse Model Types 

The data warehouse model types break down into four categories:

  1. Enterprise Data Warehouse

An EDW is a centralized warehouse that collects all the information on subjects across the entire organization. These tend to be a collection of databases rather than one monolith, providing a unified approach to querying data by subject.

  2. Data Mart 

A data mart consists of a subset of a warehouse that is useful to a specific group of users. Consider a marketing data mart populated with data from ads, analytics, social media engagement, email campaigns, and so on. This enables the marketing department to rapidly analyze its data without scanning through volumes of unrelated data. A data mart can be further classified as “independent”, where the data stands alone, or “dependent”, where the data comes from the warehouse.

  3. Operational Data Store

The ODS might seem slightly counterintuitive at first, as it is used for operational reporting, and typically we don’t want to run reporting and analytic workloads against operational data. It is a synergistic component to the previously mentioned EDW, used for reporting on operational types of data. Low-velocity data that is managed in real time, such as customer records or employee records, is typical of this kind of store.

  4. Virtual Warehouse

The Virtual Warehouse is perhaps a questionable inclusion, but an important one nonetheless. It is implemented as a set of views over your operational database. Virtual warehouses tend to be limited in what they can make available, both because of the relationships in the data and because you don’t want to destroy your operational database’s performance by running large numbers of analytic queries against it at the same time.
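A minimal sketch of the idea, using Python’s in-memory SQLite as a stand-in for the operational database (table and column names are illustrative):

```python
import sqlite3

# In-memory SQLite stands in for an operational database; the "virtual
# warehouse" is simply a view layered on top of the operational tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 100.0), (2, "east", 50.0), (3, "west", 75.0)],
)

# No data is copied: every query against the view hits the
# operational table directly, which is why heavy analytic use
# can hurt operational performance.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

result = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(result)  # {'east': 150.0, 'west': 75.0}
```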

Summary

A warehouse provides an environment for drill-down analysis of your data in search of insights. When a data analyst is looking for trends or actionable insights, the ability to navigate easily through the various data dimensions is paramount. The warehouse approach lets you store and analyze vast amounts of information, but that comes at a cost in storage and compute. You can mitigate some of these costs by optimizing your warehouse for data retrieval: pick a DW design and stick with it, and ensure that your data has been cleansed and standardized prior to loading.
An alternative to the warehouse is the growing data lake approach, where information is read in place from an object store such as AWS S3. Advantages include reduced cost and latency, since the load into the DW is no longer necessary. The Community Edition of the Presto managed service from Ahana is a great way to try out the data lake against your requirements.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Typically a data warehouse contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from diverse data sources. It requires an Extract, Transform, and Load (ETL) process that pulls from diverse data sources and creates another copy within the data warehouse to support SQL queries and analysis. 

The following data design techniques facilitate data retrieval for analytical processing:

Star Schema: This is the foundational and simplest schema in data warehouse modeling. It contains one or more fact tables referencing any number of dimension tables. Its graphical representation looks like a star, hence the name. Fact tables are usually very large compared to dimension tables, and dimension tables can contain redundant data since they are not required to be normalized. 

Snowflake Schema: An extension of the star schema in which a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables. The entity-relationship diagram of this schema resembles a snowflake, hence the name.

Data Warehouse Example

Consider a fact table that stores sales quantities for each product and customer at a given time. Sales quantity is the measure, and the primary keys of the customer, product, and time dimension tables flow into the fact table as foreign keys. Additionally, products can be grouped into product families stored in a separate table, whose primary key also goes into the product table as a foreign key. Such a construct is called a snowflake schema, as the product dimension is further snowflaked into the product family.

[Figure 1: typical snowflake schema design]
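The example above can be sketched in code. The following uses Python’s in-memory SQLite as a stand-in for the warehouse; all table and column names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The product dimension is snow-flaked one level: product -> product_family.
db.executescript("""
CREATE TABLE product_family (family_id INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT,
                      family_id INTEGER REFERENCES product_family(family_id));
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
-- Fact table: one row per (product, customer, day), measuring quantity sold.
CREATE TABLE sales_fact (product_id INTEGER, customer_id INTEGER,
                         sale_date TEXT, quantity INTEGER);

INSERT INTO product_family VALUES (1, 'Beverages');
INSERT INTO product VALUES (10, 'Cold Brew', 1), (11, 'Green Tea', 1);
INSERT INTO customer VALUES (100, 'Acme');
INSERT INTO sales_fact VALUES (10, 100, '2022-06-01', 3),
                              (11, 100, '2022-06-01', 2);
""")

# Rolling up to the family level walks the snowflake:
# fact -> product -> product_family.
row = db.execute("""
    SELECT f.family_name, SUM(s.quantity)
    FROM sales_fact s
    JOIN product p ON p.product_id = s.product_id
    JOIN product_family f ON f.family_id = p.family_id
    GROUP BY f.family_name
""").fetchone()
print(row)  # ('Beverages', 5)
```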

ETL or ELT—Extracting, Transforming, and Loading Data

Besides differences in data modeling and schemas, building a data warehouse involves the critical task of ETL – compiling data into the warehouse from other sources.

[Figure: ETL vs. ELT]

In data extraction, we move data out of source systems. These could be relational databases, NoSQL databases, or streaming data sources. The challenge during this step is to identify the right data and manage access control. 

In a data pipeline or batch workloads, we frequently move a large amount of data from different source systems to the data warehouse. Here the challenges are to plan a realistic SLA and to have a reliable and fast network and infrastructure. 

In data transformation, we format data so that it can be represented consistently in the data warehouse. The original data might reside in different databases using different data types or in different table formats, or in different file formats in different file systems. 

Finally, in data loading, we load data into the fact tables correctly, with an error-handling procedure to catch records that fail to load.
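A minimal Python sketch of the extract, transform, and load steps, assuming two hypothetical source shapes (a relational row and an epoch-timestamped event):

```python
from datetime import datetime, timezone

# Two hypothetical "source systems" expose the same kind of sale
# in different shapes and types.
relational_rows = [{"sold_on": "2022-06-01", "qty": "3"}]   # extract: database
event_stream = [{"ts": 1654041600, "quantity": 2}]          # extract: stream

def transform(rec):
    """Normalize both shapes into the fact-table layout: (ISO date, int qty)."""
    if "sold_on" in rec:
        return (rec["sold_on"], int(rec["qty"]))
    day = datetime.fromtimestamp(rec["ts"], tz=timezone.utc).date()
    return (day.isoformat(), int(rec["quantity"]))

fact_rows = [transform(rec) for rec in relational_rows + event_stream]
# Load: in a real pipeline these tuples would be bulk-inserted into the
# fact table, with failed records routed to an error-handling path.
print(fact_rows)  # [('2022-06-01', 3), ('2022-06-01', 2)]
```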

Data Warehouse To Data Lake To Data Lakehouse

A data lake is a centralized file system or storage designed to store, process, and secure large amounts of structured, semistructured, or unstructured data. It can store data in its native format and process any variety of it. Examples of a data lake include HDFS, AWS S3, ADLS or GCS.

Data lakes use the ELT (Extract, Load, Transform) process, while data warehouses use ETL (Extract, Transform, Load). With a SQL engine like Presto, you can run interactive queries, reports, and dashboards directly on a data lake, without creating yet another data warehouse, copying your data, and adding operational overhead. 

A data lake is just one element of an Open Data Lakehouse, which takes the benefits of both a data warehouse and a data lake. However, an Open Data Lakehouse is much more than that; it is the entire stack. In addition to hosting a data lake (AWS S3) and a SQL engine (Presto), it also allows for governance (AWS Lake Formation) and ACID transactions. Transaction support is achieved using technologies and projects such as Apache Hudi, while Presto is the SQL engine that sits on top of the cloud data lake you’re querying. In addition to this, there is Ahana Cloud. Ahana is a managed service for Presto, designed to simplify the process of configuring and operating Presto. 


As cloud data warehouses become more cost-prohibitive and limited by vendor lock-in, and as the data mesh, or data federation, approach falls short on performance, more and more companies are migrating their workloads to an Open Data Lakehouse. If all your data is going to end up in cloud-native storage like Amazon S3, ADLS Gen2, or GCS, then the most optimized and efficient data strategy is to leverage an Open Data Lakehouse stack, which provides much more flexibility and remedies the challenges noted above. Building an Open Data Lakehouse is difficult, however. As an introduction to the process, check out this on-demand presentation, How to Build an Open Data Lakehouse Stack. In it you’ll see how to build your stack in more detail, incorporating technologies like Ahana, Presto, Apache Hudi, and AWS Lake Formation.

Related Articles

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.


Ahana Will Co-Lead Session At Data & AI Summit About Presto Open Source SQL Query Engine

San Mateo, Calif. – June 23, 2022 – Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at Data & AI Summit about Presto, the Meta-born open source high performance, distributed SQL query engine. The event is being held June 27 – 30 in San Francisco, CA and virtual.

Session Title: “Presto 101 – An Introduction to Open Source Presto.”

Session Time: On Demand

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell.

Session Details: Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, users can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Rohan and Philip will introduce the Presto technology and share why it’s becoming so popular. In fact, companies like Facebook, Uber, Twitter, Alibaba, and many others use Presto for interactive ad hoc queries, reporting & dashboarding data lake analytics, and much more. This session will show a quick demo on getting Presto running in AWS.

To register for Data & AI Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @AhanaIO to present at #DataAISummit about #Presto https://bit.ly/3n8YDQt #OpenSource #Analytics #Cloud

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Data Lakehouse

AWS Redshift Query Limits

What is AWS Redshift?

At its heart, AWS Redshift is Amazon’s petabyte-scale data warehouse product, based on PostgreSQL version 8.0.2. It has since evolved into a powerful distributed system that can provide speedy results across millions of rows. Conceptually it is based on node clusters, with a leader node and compute nodes. The leader generates the execution plan for queries and distributes those tasks to the compute nodes. Scalability is achieved with elastic scaling that can quickly add or modify compute nodes as needed. We’ll discuss the details in the article below.

Limitations of Using AWS Redshift

There are of course Redshift limitations on many parameters, which Amazon refers to as “quotas”. There is a Redshift query limit, a database limit, a Redshift query size limit, and many others. These have default values from Amazon and are per AWS region. Some of these quotas can be increased by submitting an Amazon Redshift Limit Increase Form. Below is a table of some of these quota limitations.

Quota | Value | Adjustable
Nodes per cluster | 128 | Yes
Nodes per region | 200 | Yes
Schemas per DB per cluster | 9,900 | No
Tables per node type | 9,900 – 100,000 | No
Query limit | 50 | No
Databases per cluster | 60 | No
Stored procedures per DB | 10,000 | No
Query size limit | 100,000 rows | Yes
Saved queries | 2,500 | Yes
Correlated subqueries | Need to be rewritten | No

AWS Redshift Performance

To start, Redshift stores data in a compressed, columnar format. This means there is less area on disk to scan and less data to move around. Add indexing, and you have the base recipe for high performance. In addition, Redshift maintains a results cache, so frequently executed queries are going to be highly performant. This is aided by the query plan optimization done in the leader node. Redshift also partitions data in a highly efficient manner to complement the optimizations in the columnar data algorithms.

Scaling

Redshift offers a robust set of scaling strategies. With just a few clicks in the AWS Redshift console, or even with a single API call, you can change node types, add nodes, and pause/resume the cluster. You can also use Elastic Resize to dynamically adjust your provisioned capacity within a few minutes. A resize scheduler is available as well, letting you schedule changes, say for month-end processing. There is also Concurrency Scaling, which can automatically provision additional capacity for dynamic workloads.
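As a sketch of the “single API call” path, the snippet below builds the keyword arguments accepted by boto3’s Redshift `resize_cluster` call. The cluster name, node type, and node count are illustrative, and the live call is left commented out so the example can run without an AWS account:

```python
# boto3's Redshift client exposes resize operations; the identifiers below
# (cluster name, node type, node count) are illustrative, not real resources.
def elastic_resize_request(cluster_id, node_type, num_nodes):
    """Build the keyword arguments for redshift.resize_cluster()."""
    return {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "NumberOfNodes": num_nodes,
        "Classic": False,  # False requests an elastic (minutes-long) resize
    }

request = elastic_resize_request("analytics-cluster", "dc2.large", 4)
# In a live environment you would submit it like this:
# import boto3
# boto3.client("redshift").resize_cluster(**request)
print(request["NumberOfNodes"])  # 4
```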

Pricing

A lot of variables go into Redshift pricing, depending on the scale and features you choose. All of the details and a pricing calculator can be found on the Amazon Redshift Pricing page. To give a quick overview, however, prices start as low as $0.25 per hour. Pricing is based on compute time and size and goes up to $13.04 per hour. Amazon provides some incentives to get you started and try out the service.

First, similar to the Ahana Cloud Community Edition, Redshift has a “Free Tier”: if your company has never created a Redshift cluster, you are eligible for a two-month DC2 large node trial. This provides 750 free hours per month, enough to run that DC2 node continuously, with 160GB of compressed SSD storage. Once your trial expires or your usage exceeds 750 hours per month, you can either keep it running with “on-demand” pricing or shut it down.

Next, there is a $500 credit available to use their Amazon Redshift Serverless option if you have never used it before. This applies to both the compute and storage and how long it will last depends entirely on the compute capacity you selected, and your usage.

Then there is “on-demand” pricing. This option lets you pay for provisioned capacity by the hour with no commitments or upfront costs; partial hours are billed in one-second increments. Amazon allows you to pause and resume these nodes when you aren’t using them, so you don’t continue to pay for compute while still preserving what you have; you’ll only pay for backup storage.
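The per-second billing described above is simple arithmetic. This sketch uses the $0.25/hour entry price quoted earlier; the cluster size and runtime are illustrative:

```python
# On-demand billing is per provisioned node-hour, metered in one-second
# increments. The rate below is the $0.25/hour entry price quoted above.
RATE_PER_NODE_HOUR = 0.25

def on_demand_cost(num_nodes, seconds_running):
    """Cost in dollars for a cluster of num_nodes run for seconds_running."""
    hours = seconds_running / 3600
    return round(num_nodes * hours * RATE_PER_NODE_HOUR, 4)

# A 2-node cluster paused after 90 minutes accrues only 1.5 billable hours:
print(on_demand_cost(2, 90 * 60))  # 0.75
```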

Summary

Redshift provides a robust, scalable environment that is well suited to managing data in a data warehouse. Amazon provides a variety of ways to easily give Redshift a try without getting too tied in. Not all analytic workloads make sense in a data warehouse, however, and if you are already landing data into AWS S3, then you have the makings of a data lakehouse that can offer better price/performance. A managed Presto service, such as Ahana, can be the answer to that challenge.

Want to learn more about the value of the data lake?

In our free whitepaper, Unlocking the Business Value of the Data Lake, we’ll show you why companies are moving to an open data lake architecture and how they are getting the most out of that data lake to drive their business initiatives.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What is AWS Redshift Spectrum?

Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Learn more about its performance and price.


Ahana Announces Additional $7.2 Million Funding Led by Liberty Global Ventures and Debuts Free Community Edition of Ahana Cloud for Presto for the Open Data Lakehouse

Only SaaS for Presto now available for free with Ahana Community Edition; Additional capital raise validates growth of the Open Data Lakehouse market

San Mateo, Calif. – June 16, 2022 – Ahana, the only Software as a Service for Presto, today announced an additional investment of $7.2 million from Liberty Global Ventures with participation from existing investor GV, extending the company’s Series A financing to $27.2 million. Liberty Global is a world leader in converged broadband, video and mobile communications services. This brings the total amount of funding raised to date to $32 million. Ankur Prakash, Partner, Liberty Global Ventures, will join the Ahana Board of Directors as a board observer. Ahana will use the funding to continue to grow its technical team and product development; evangelize the Presto community; and develop go-to-market programs to meet customer demand. 

Ahana also announced today Ahana Cloud for Presto Community Edition, designed to simplify the deployment, management and integration of Presto, an open source distributed SQL query engine, for the Open Data Lakehouse. Ahana Community Edition is immediately available to everyone, including users of the 100,000+ downloads of Ahana’s PrestoDB Sandbox on DockerHub. It provides simple, distributed Presto cluster provisioning and tuned out-of-the-box configurations, bringing the power of Presto to data teams of all sizes for free. Instead of downloading and installing open source Presto software, data teams can quickly learn about Presto and deploy initial SQL data lakehouse use cases in the cloud. Community Edition users can easily upgrade to the full version of Ahana Cloud for Presto, which adds increased security including integration with Apache Ranger and AWS Lake Formation, price-performance benefits including multi-level caching, and enterprise-level support.

“Over the past year we’ve focused on bringing the easiest managed service for Presto to market, and today we’re thrilled to announce a forever-free community edition to drive more adoption of Presto across the broader open source user community. Our belief in Presto as the best SQL query engine for the Open Data Lakehouse is underscored by our new relationship with Liberty Global,” said Steven Mih, Cofounder and CEO, Ahana. “With the Community Edition, data platform teams get unlimited production use of Presto at a good amount of scale for lightning-fast insights on their data.”

“Today we’re seeing more companies embrace cloud-based technologies to deliver superior customer experiences. An underlying architectural pattern is the leveraging of an Open Data Lakehouse, a more flexible stack that solves for the high costs, lock-in, and limitations of the traditional data warehouse,” said Ankur Prakash, Partner, Liberty Global Ventures. “Ahana has innovated to address these challenges with its industry-leading approach to bring the most high-performing, cost-effective SQL query engine to data platforms teams. Our investment in Ahana reflects our commitment to drive more value for businesses, specifically in the next evolution of the data warehouse to Open Data Lakehouses.” 

Details of Ahana Cloud for Presto Community Edition include:

●        Free to use, forever

●        Use of Presto in an Open Data Lakehouse with open file formats like Apache Parquet and advanced lake data management like Apache Hudi

●        A single Presto cluster with all supported instance types except Graviton

●        Pre-configured integrations to multiple data sources including the Hive Metastore for Amazon S3, Amazon OpenSearch, Amazon RDS for MySQL, Amazon RDS for PostgreSQL, and Amazon Redshift

●        Community support through public Ahana Community Slack channel plus a free 45 minute onboarding session with an Ahana Presto engineer

●        Seamless upgrade to the full version which includes enterprise features like data access control, autoscaling, multi-level caching, and SLA-based support

“Enterprises continue to embrace ‘lake house’ platforms that apply SQL structures and querying capabilities to cloud-native object stores,” said Kevin Petrie, VP of Research, Eckerson Group. “Ahana’s new Community Edition for Presto offers a SQL query engine that can help advance market adoption of the lake house.”

Supporting Resources:

Get Started with the Ahana Community Edition

Join the Ahana Community Slack Channel

Tweet this:  @AhanaIO announces additional $7.2 million Series A financing led by Liberty Global Ventures; debuts free community edition of Ahana #Cloud for #Presto on #AWS https://bit.ly/3xlAVW4 

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Will Co-Lead Session At Open Source Summit About Presto SQL Query Engine

San Mateo, Calif. – June 14, 2022 — Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at the Linux Foundation’s Open Source Summit about Presto, the Meta-born open source high performance, distributed SQL query engine. The event is being held June 20 – 24 in Austin, TX and virtual.

Session Title: “Introduction to Presto – The SQL Engine for Data Platform Teams.”

Session Time: Tuesday, June 21 at 11:10am – 11:50am CT

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell. 

Session Details: Presto is an open-source high performance, distributed SQL query engine. Born at Facebook in 2012, Presto was built to run interactive queries on large Hadoop-based clusters. Today it has grown to support many users and use cases including ad hoc query, data lake analytics, and federated querying. In this session, we will give an overview of Presto including architecture and how it works, the problems it solves, and most common use cases. We’ll also share the latest innovation in the project as well as what’s on the roadmap.

To register for Open Source Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @Ahana to present at #OpenSourceSummit about #Presto https://bit.ly/3xMGQ7M #OpenSource #Analytics #Cloud 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Cloud for PrestoDB

What are the Benefits of a Managed Service?

Managed Services – Understanding the basics

What are the operational benefits of using a managed service for Presto with Ahana Cloud? To answer this question, first let’s hear from an AWS Solution Architect about his experience using Ahana as a solution for his data lakehouse: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So, the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service. The customer of Ahana does not have to do a lot of work. Everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service.”

The architect goes on to state, “the customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC.”

In addition to this the AWS architect continues to state, “this is beneficial. As the user, you have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have some context from a current user, let’s get more specific about the reasons a user would select a managed service as their SQL engine for data lakehouse analytics and reporting.

For example, let’s say you want to create a new cluster. It’s just a couple of clicks with Ahana Cloud, rather than an arduous process without the facilitation of a managed service. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible.

To take the illustration further, instead of using the Hive metastore provided by Ahana Cloud, you can bring your own Amazon Glue catalog. This allows users to maintain control and streamline their tasks.

Then, of course, it’s easy to add additional data sources. For that, you can add JDBC endpoints for your databases; Ahana has those integrated. After a connection is added, Ahana Cloud automatically restarts the cluster.

Compared to EMR or other distributions, the process is more cumbersome for the user. Without a managed service, all of the following has to be done manually:

  • You have to create a catalog properties file for each data source
  • Restart the cluster on your own
  • Scale the cluster manually
  • Add your own query logs and statistics
  • Rebuild everything when you stop and restart clusters
Managed service for Presto

With Ahana Cloud as a managed service for PrestoDB, all of this manual work and complexity is taken away, allowing data analysts and users to focus on their work rather than spending large amounts of time on labor-intensive processes and complicated configuration as a prerequisite to getting started with analytical tasks.

For scaling up, if you want to grow your analytics jobs over time, you can add nodes seamlessly. Ahana Cloud, like other distributions, can add nodes to the cluster while your services are still up and running. But the part that isn’t seamless or simple elsewhere, as it is with Ahana, is stopping and restarting the entire cluster.

In addition to the provisioning of all the workers and the coordinator, the configuration, the cluster connections to the data sources, and the Hive metastore are all maintained by Ahana Cloud. When you as the user restart the cluster, everything comes up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics work.

This is important because otherwise the operator has to manage everything on their own, including configuration management and reconfiguration of the catalog services. With EMR, for example, when you terminate a cluster you lose track of that cluster altogether; you have to start from scratch and reintegrate the whole system.

Reduce Frustration When Configuring

See how Ahana simplifies your SQL Engine

Serverless SQL Engine

Next Steps – Exploring a Managed Service

As you and your team members look to reduce friction in your data analytics stack, learn how Ahana Cloud reduces frustration and configuration time for data teams. The number one reason for selecting a managed service is that it makes your life easier. Check out our customer stories to see how organizations like Blinkit, Carbon, and Adroitts were able to increase price-performance and bring control back to their data teams – all while simplifying their processes and bringing a sense of ease to their in-house data management.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.


Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube

See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs

Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements the data federation feature: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single-point-of-access for multiple databases (or file systems) requires Ops skills and an additional effort.
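To make the federation idea concrete, a single Presto query can reference tables from several catalogs at once. The catalog, schema, and table names below are placeholders, not objects from this tutorial:

```sql
-- Hypothetical example: join a Hive/S3-backed table with a PostgreSQL table
SELECT o.order_id, c.customer_name
FROM s3.sales.orders AS o
JOIN postgresql.public.customers AS c
  ON o.customer_id = c.customer_id;
```

Each catalog in such a query maps to a connector that someone still has to configure and keep running.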

However, no data engineer wants to do the Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana hides the burden of infrastructure management and allows you to focus on processing your data.

What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.


This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.

Integration

Cube’s battle-tested Presto driver provides the out-of-the-box connectivity to Ahana.

You just need to provide the credentials: Presto host name and port, user name and password, and the Presto catalog and schema. You’ll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.
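For reference, a self-hosted Cube instance reads the same settings from environment variables. The values below are placeholders, and the variable names assume Cube’s Presto driver:

```
CUBEJS_DB_TYPE=prestodb
CUBEJS_DB_HOST=<your-ahana-presto-endpoint>
CUBEJS_DB_PORT=443
CUBEJS_DB_USER=<presto-user>
CUBEJS_DB_PASS=<presto-password>
CUBEJS_DB_CATALOG=<catalog>
CUBEJS_DB_SCHEMA=<schema>
CUBEJS_DB_SSL=true
```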

Check the docs to learn more about connecting Cube to Ahana.

Example: Parsing logs from multiple data sources with Ahana and Cube

Let’s build a real-world data application with Ahana and Cube.

We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the new versions and gather the predictions from the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. Additionally, you track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model

How to configure Ahana?

Allowing Ahana to access your AWS account

First, let’s log in to Ahana and connect it to your AWS account by creating an IAM role that grants Ahana access.

On the setup page, click the “Open CloudFormation” button. We get redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.

The Ahana setup page

Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the “Create new cluster” button.

Ahana create new cluster

In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the “Create cluster” button. Ahana needs around 20-30 minutes to set up a new cluster.

Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to the cluster. In this example, the Sagemaker Endpoint logs are stored in S3, and the actual values are in PostgreSQL.

Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.

In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.

PostgreSQL data source configuration

Adding an S3 bucket to Ahana

AWS Sagemaker Endpoint stores their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to log in to the AWS console, open the AWS Glue page, and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.

AWS Glue crawlers

After typing the crawler’s name and clicking the “Next” button, we will see the Source Type page. On this page, we have to choose “Data stores” and “Crawl all folders” (in our case, “Crawl new folders only” would work too).

Here we specify where the crawler should look for new data

On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!
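To visualize the layout (the endpoint and variant names here are hypothetical), the crawler sees something like:

```
s3://the_bucket_name/sagemaker/logs/     <- prefix given to the crawler
└── my-endpoint/                         <- endpoint-name partition
    └── variant-a/                       <- model-variant partition
        └── 2022/05/06/10/               <- year/month/day/hour partitions
            └── <capture-file>
```

Everything below the prefix becomes partition columns instead of being baked into the table location.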

IAM role configuration

Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:


Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

configuring the crawler's schedule

In the output configuration, we need to customize the settings.

First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”

We pick “Add new columns only” because we will make manual changes in the table definition, and we don’t want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table” box.

Crawler's output configuration

Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:

Default partition names replaced with their actual names

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.

Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.

AWS Glue data catalog setup in Ahana

Let’s select our AWS region and put the AWS account id in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:

The “Outputs” tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the “Add data source” button.

Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources” section. In this section, we click the “Manage data sources” button.

We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.

Adding data sources to an Ahana cluster

Writing the Presto queries

The actual structure of the input and output of an AWS Sagemaker Endpoint is up to us. We can send any JSON request and return a custom JSON object.

Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We will need those ids to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, …, 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single “prediction” key and a decimal value:

{"prediction": 21.266147618448954}

A single request in Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}}

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we need to calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row for every model-variant pair. That’ll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we will use the filter expression to count the number of values within buckets. Additionally, we return the bucket boundaries. We need to use an aggregate function, max (or any other aggregate function), because of the group by clause. That won’t affect the result because we returned a single row for every model-variant pair in the previous query.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant

How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.

If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.

At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:

Example chart: both models perform almost the same

If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):

Example chart: an improved second version of the model

Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready-to-use Cube cluster. However, if you prefer self-hosting, follow this tutorial.

First, please create a new Cube Cloud deployment. Then, open the “Deployments” page and click the “Create deployment” button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the “Apply” button. Remember to enable the SSL connection!

Presto configuration page

Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.

Let’s open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file

In the next window, type the file name errorpercentiles.js and click “Create file.”


In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file

We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measures. We can write them out one by one like this:


measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
A part of the error percentiles configuration in Cube

This notation is repetitive and quite verbose. We can shorten the measure definitions by using JavaScript to generate them.

Note that the following code must be placed before the call to the cube function!

First, we have to create an array of column names:


const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:


const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

Finally, we can replace the measure definitions with:

measures: measures

After changing the file content, click the “Save All” button.

The top section of the schema view

And click the Continue button in the popup window.

The popup window shows the URL of the test API

In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):

An example result in the Playground view

Configuring access control in Cube

In the Schema view, open the cube.js file.

We will use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We will put the identifier of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token variables.

For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user’s access rights and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.
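For instance, a payload granting access to two hypothetical models could look like this (the models claim is the one our queryRewrite configuration checks):

```json
{
  "models": ["model_a", "model_b"]
}
```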

The Security Context window in the Playground view

After rejecting unauthorized access, we add a filter to all queries.

We can tell which dataset the user is accessing by inspecting the query itself. We need this because we must filter by the modelName property of the correct table.

In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'errorpercentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};

Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren’t interested in refreshing the data so often. Sagemaker Endpoints store the logs in JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long time, the query may take some time. Below, we will show you how to configure the caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.

For the sake of the example, we will refresh the data only twice a day.

Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.

In PostgreSQL, we grant permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.

If you want to use a different database name, follow the instructions in our documentation.


Caching configuration in Cube

Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can also deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:

An empty “Commit Changes & Push” view

When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.

We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly

Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. In the web part, we will have only the code required to display the charts. In the back-end, we will handle authentication and authorization. The backend service will also send requests to the Cube REST API.

Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page

We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.

The location of the Copy API URL link

Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.

After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
CUBE_URL = '';
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don’t store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  
  return cubejsToken;
}

We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.

We create a new Express application running on a Node server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. It will return only the models we are allowed to see because we have configured the Cube security context.

Our function returns a list of objects containing the modelName and modelVariant fields:

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, cubeRes, body) => {
    if (err) {
      console.log(err);
      return res.status(500).send(JSON.stringify({ error: 'Cube request failed' }));
    }
    body = JSON.parse(body);
    const response = body.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};

Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.

{
  "query": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in every bucket and prepares bucket labels:

const response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['errorpercentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})
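
Rather than hard-coding all twenty measures, the query shown earlier can also be built programmatically. As a sketch (buildPercentilesQuery is our own helper name, not part of the Cube API; the query shape matches the JSON above):

```javascript
// Build the Cube query for one model variant. perc_N holds the bucket count
// and perc_N_value the bucket boundary, as in the hand-written JSON above.
const PERCENTILES = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];

function buildPercentilesQuery(model, variant) {
  return {
    query: {
      measures: PERCENTILES.flatMap((p) => [
        `errorpercentiles.perc_${p}`,
        `errorpercentiles.perc_${p}_value`,
      ]),
      dimensions: [
        'errorpercentiles.modelName',
        'errorpercentiles.modelVariant',
      ],
      filters: [
        { member: 'errorpercentiles.modelName', operator: 'equals', values: [model] },
        { member: 'errorpercentiles.modelVariant', operator: 'equals', values: [variant] },
      ],
    },
  };
}

console.log(buildPercentilesQuery('m', 'v').query.measures.length); // prints 20
```

The generated object can be passed as the request body in place of the literal JSON shown above.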

Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS style part.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName name="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We will also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
  },
  legend: {
    display: false
  },
  tooltip: {
    enabled: false
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with error distributions.

Charts displaying the error distribution for different model variants

Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.

Data Lakehouse

Price-Performance Ratio of AWS Athena vs Ahana Cloud for Presto

What does AWS Athena cost? Understand the price-performance ratio of Amazon Athena vs. Ahana. Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project, which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation.

To explain AWS Athena pricing compared to Ahana pricing, let’s first cover what is different between them. The biggest difference is that AWS Athena is a serverless architecture, while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model: with AWS Athena you pay for the amount of data scanned rather than for the compute used, whereas Ahana Cloud is priced by the amount of compute used. This can make a huge difference in price-performance. Before we get into price-performance specifically, here’s an overview of the comparison:

AWS Athena (serverless Presto) vs. Ahana Cloud for Presto (managed service):

  • Cost dimension: Athena charges for the amount of data scanned on a per-query basis, at USD $5 per terabyte scanned; it may be hard to estimate how much data your queries will scan. Ahana charges only for EC2 usage on a per-node, per-hour basis for EC2 and Ahana.
  • Cost effectiveness: With Athena, you only pay while a query is scanning, not for idle time. With Ahana, you only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service.
  • Scale: AWS Athena can scale query workloads but has concurrency limits. Ahana can easily scale query workloads without concurrency limits.
  • Operational overhead: Athena has the lowest operational overhead: no need to patch the OS, since AWS handles that. Ahana has low operational overhead: no need to patch the OS, since Ahana Cloud handles that and the operation of the servers.
  • Update frequency: Athena updates the platform infrequently and is not current with PrestoDB, sitting over 60 releases behind. Ahana updates frequently; Presto on Ahana Cloud is typically upgraded quarterly to keep up with the most recent releases.

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

Price–performance ratio
From Wikipedia, the free encyclopedia
In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we’ll look at the public case study of real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you’ll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you’ll also be able to apply the same method when doing a quick trial (we’re here to help, too).

Here’s a few things that the Carbon public case study showed:

  • While you cannot tell how many or what type of EC2 instances Athena V2 uses, they determined that they could get similar performance with 10 c5.xlarge workers on Ahana Cloud.
  • Athena V2 would start to queue queries after 2 other queries were already running, meaning that the total wall-clock time was extended as a result.

AWS Athena is constrained by AWS concurrency limits

AWS Athena cost

Ahana has higher concurrency so queries finish faster

Ahana cost
  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 x c5.xlarge workers has total costs of:

    Type                         Instance   Price/hr  Qty.  Cost/hr
    Presto Worker                c5.xlarge  $0.17     10    $1.70
    Presto Coordinator           c5.xlarge  $0.17     1     $0.17
    Ahana Cloud                             $0.10     11    $1.10
    Ahana Cloud Managed Service             $0.08     1     $0.08
    Total                                                   $3.45

So, you can run many queries for one hour that scan any amount of data for only $3.45, compared to a single Athena query scanning one terabyte of data costing $5.00.
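
Using the numbers above, the comparison can be sketched as simple back-of-the-envelope arithmetic (illustrative list prices only, not a quote):

```javascript
// Athena bills $5 per TB scanned; the example Ahana cluster above totals
// $3.45 per hour regardless of how much data the queries scan.
function athenaCost(tbScanned) {
  return tbScanned * 5.0; // $5 per TB scanned, regardless of runtime
}

function ahanaClusterCost(hours) {
  return hours * 3.45; // the 10-worker example cluster's hourly total
}

// One hour of the example cluster costs less than a single 1 TB Athena scan
console.log(ahanaClusterCost(1) < athenaCost(1)); // prints true
```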

Summary

While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy cloud-native managed service with pay-as-you-go-pricing for your PrestoDB deployment. 

Ready to Compare Ahana to Athena?

Start a free trial today and experience better price performance

AWS Athena Limitations

AWS Athena Alternatives

Amazon Athena is a useful query tool – but sometimes you need more control over price, performance, and scale. Ahana runs on the same powerful underlying technology but gives you back control – so you can scale your data lake analytics without exploding your cloud bill. Learn more or request a demo today.

Welcome to our blog series comparing AWS Athena, a serverless Presto service, to open-source PrestoDB. In this series we’ll compare Amazon’s Athena service with PrestoDB and discuss some of the reasons why you’d choose to deploy PrestoDB yourself rather than use the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. Athena’s serverless architecture is a benefit; however, one of its drawbacks is cost. Users currently pay per query, priced at $5 per terabyte scanned. Common Amazon Athena limits are technical limitations that include query limits, concurrent query limits, and partition limits. These limits hurt performance: queries run slowly and operational costs grow. In addition, AWS Athena is built on an older version of PrestoDB and only supports a subset of the PrestoDB features.

An overview on AWS Athena limits

AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Most of the limitations associated with Athena are rather challenging, but luckily some are soft quotas that you can ask AWS to increase. One big issue is Athena’s restriction on concurrency: by default, Athena users can only submit one query at a time and can only run up to five queries simultaneously per account.
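
Because of this, clients often end up queueing work themselves. A minimal sketch of such a client-side limiter (our own helper, not an AWS API; the limit of five mirrors the default mentioned above):

```javascript
// Run promise-returning tasks with at most `limit` in flight at once.
// Each task here stands in for one Athena query submission.
async function runWithLimit(tasks, limit = 5) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;            // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  // start at most `limit` workers that drain the shared task list
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}

runWithLimit([1, 2, 3].map((n) => () => Promise.resolve(n * 2)), 2)
  .then((r) => console.log(r)); // prints [ 2, 4, 6 ]
```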

AWS Athena Alternatives

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits: 

1. Athena DDL max query limit: 20 active DDL queries.

2. Athena DDL query timeout limit: the Athena DDL query timeout is 600 minutes.

3. Athena DML query limit: Athena only allows 25 DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions by default.

4. Athena DML query timeout limit: the Athena DML query timeout is 30 minutes.

5. Athena query string length limit: the Athena query string hard limit is 262,144 bytes.

Ready To Work Without Limitations?

Get Started Today for Free With Ahana Cloud

AWS Athena partition limits

  1. Athena users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 partitions per table, while Glue’s limit is 1,000,000 partitions per table.
  2. A Create Table As (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation you must manually chop up your data by running a series of INSERT INTOs that insert up to 100 partitions each.
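
Scripting that workaround amounts to generating one INSERT INTO per batch of at most 100 partition values. A sketch (table, column, and partition names here are hypothetical):

```javascript
// Split the partition values into batches of 100 and emit one INSERT INTO
// statement per batch. target_table, source_table, and dt are made-up names.
function batchedInserts(partitionValues, batchSize = 100) {
  const statements = [];
  for (let i = 0; i < partitionValues.length; i += batchSize) {
    const batch = partitionValues.slice(i, i + batchSize);
    statements.push(
      'INSERT INTO target_table SELECT * FROM source_table ' +
      `WHERE dt IN (${batch.map((v) => `'${v}'`).join(', ')})`
    );
  }
  return statements;
}

// 250 partition values produce 3 statements (100 + 100 + 50)
console.log(batchedInserts(Array.from({ length: 250 }, (_, i) => `p${i}`)).length); // prints 3
```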

Athena database limits

AWS Athena also has the following S3 bucket limitations: 

1. The Amazon S3 bucket limit is 100* buckets per account by default; you can request to increase it up to 1,000 S3 buckets per account.

2. Athena restricts each account to 100* databases, and databases cannot include over 100* tables.

*Note, recently Athena has increased this to 10K databases per account and 200K tables per database.

Summary of Athena DB limits:

  • Amazon S3 bucket limit: 1,000 buckets per account
  • Database limit: 10,000 databases per account
  • Tables per database: 200,000

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An Amazon Athena alternative is deploying your own PrestoDB cluster. Amazon Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won’t face AWS Athena limitations such as the concurrent query limits, database limits, table limits, partition limits, and so on. Plus, you’ll get the very latest version of Presto. PrestoDB is an open-source project hosted by The Linux Foundation’s Presto Foundation, with a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service, without the limitations of AWS Athena.

You can use Ahana to query and analyze AWS data lakes stored in Amazon S3, and many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI; anyone at any knowledge level can use it, with zero configuration effort and no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Up next: AWS Athena Query Limits

Related Articles 

Athena vs Presto

Learn the differences between Presto and Ahana and understand the pros and cons.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

Best Athena Alternative

Discover the 4 most popular choices to replace Amazon Athena.

Data Lakehouse

What is AWS Redshift Spectrum?

What is Redshift Spectrum?

Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Spectrum allows you to do federated queries from within the Redshift SQL query editor to data in S3, while also being able to combine it with data in Redshift.

Since it shares a name with AWS Redshift, there is some confusion as to what AWS Redshift Spectrum is. To discuss that, however, it’s important to know what AWS Redshift is: an Amazon data warehouse product based on PostgreSQL version 8.0.2.

Benefits of AWS Redshift Spectrum

When compared to a similar object-store SQL engine available from Amazon, such as Athena, Redshift Spectrum delivers significantly higher and more consistent performance. Athena uses pooled resources, while Spectrum’s capacity is based on your Redshift cluster size and is, therefore, a known quantity.

Spectrum allows you to access your data lake files from within your Redshift data warehouse, without having to go through an ingestion process. This makes data management easier. This also reduces data latency since you aren’t waiting for ETL jobs to be written and processed.

With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift. This means there are no new tools to learn and it allows you to leverage your existing skillsets to query Redshift. Under the hood, Spectrum is breaking the user queries into filtered subsets that run concurrently. These can be distributed across thousands of nodes to enhance the performance and can be scaled to query exabytes of data. The data is then sent back to your Redshift cluster for final processing.

AWS Redshift Spectrum Performance & Price

AWS Redshift Spectrum is going to be as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, the performance will be extremely slow. Connecting to a well-partitioned collection of column-based Parquet stores on the other hand will be much faster. Not having indexes on the object stores means that you really have to rely on the efficient organization of the files to get higher performance.

As to price, Spectrum follows the terabyte scan model that Amazon uses for a number of its products. You are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.
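
The billing rule above can be sketched as a small calculator. This is our own helper using 1024-based units; AWS's exact rounding and accounting may differ:

```javascript
// Scan-based billing as described above: $5 per TB scanned, rounded up to
// the next megabyte, with a 10 MB minimum per query.
function spectrumCost(bytesScanned) {
  const MB = 1024 ** 2;
  const TB = 1024 ** 4;
  const billedMb = Math.max(Math.ceil(bytesScanned / MB), 10); // 10 MB minimum
  return ((billedMb * MB) / TB) * 5.0;
}

console.log(spectrumCost(10 * 1024 ** 3).toFixed(2)); // 10 GB scan: prints 0.05
console.log(spectrumCost(1024 ** 4).toFixed(2));      // 1 TB scan: prints 5.00
```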

Redshift and Redshift Spectrum Use Case

An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your historical order history is contained in your Redshift data warehouse. However, real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make an order decision for particular items because there is a long lead time. Redshift knows what you have done historically, but that S3 data is only processed monthly into Redshift. With Spectrum, the query can combine what is in Redshift and join that with the Parquet files on S3 to get an up-to-the-minute view of order volume so a more informed decision can be made.

Summary

Amazon Redshift Spectrum adds a layer of functionality to Redshift, letting you interact with object stores in AWS S3 without building a whole separate tech stack. It makes sense for companies that are using Redshift and need to stay there but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and going entirely to the data lake. Redshift Spectrum does not make sense if all your files are already in the data lake: Spectrum becomes very expensive as the data grows and offers no visibility into the queries. This is where a managed service like Ahana Cloud for Presto fits in.

Run SQL on your Data Lakehouse

At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Lake-Formation-architecture

Building a Data Lake: How to with Lake Formation on AWS

What is an AWS Lake Formation?

Briefly, AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time: days instead of the months that are common with a traditional data lake approach. AWS Lake Formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can’t be your administrator for your data lake, so the first thing you want to do is create a new user that has full admin rights. Go to IAM and create that user and give them AdministratorAccess capability. Next, to get started with building a data lake, create an S3 bucket and any data directories you are going to use if you don’t already have something configured. Do that in the S3 segment of AWS as you would normally. If you already have an S3 location setup, you can skip that step. In either case, we then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

Data Lake Formation

Now with your Lake Formation registered data sources, you can create a database from those sources in Lake Formation, and from there, create your Glue Crawlers as the next step of building a data lake. The crawler will take that database that you created, and go into the S3 bucket, read the directory structure and files to create your tables and fields within the database. Once you’ve run your Crawler, you’ll see the tables and fields reflected under “Tables”. The crawler creates a meta-data catalog that provides the descriptions of the underlying data that is then presented to other tools to access, such as AWS Quicksight and Ahana Presto. Amazon provides this diagram:

AWS Lake Formation

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location

Benefits of Building a Data Lake with AWS Lake Formation

Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

what is a data lake

Where the major advantages lie, however, is with “LF-Tags” and “LF-Tag permissions”. This is where granular security can be applied in a way that will greatly simplify your life. Leveraging Lake Formation, we have two ways to assign and manage permissions on our catalog resources: “named”-based access and “tag”-based access.

data lake permissions

Named-based access is what most people are familiar with: you select the principal, which can be an AWS user or group of users, and assign it access to a specific database or table. The tag-based access control method uses Lake Formation tags, called “LF-Tags”. These are attributes assigned to data catalog resources, such as databases, tables, and even columns, and granted to principals in our AWS account to manage authorization to these resources. This is especially helpful in environments that are growing and/or changing rapidly, where policy management can be onerous. Tags are essentially key/value pairs with these limits:

  • Tag keys can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

AWS Lake Formation Use Cases

If we wanted to control access to an employee table for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

Building a Data Lake

We have resources “employees” and “sales”, each with multiple tables and multiple named fields. In a conventional security model, you would give the HR group full access to the employees resource, but the corp group would only have access to the “details” table. What if you needed to give the corp group access to position.title and payroll.date as well? We would simply add the corp group LF-Tag to those fields in addition to the details table, and the group could then read those specific fields from the other two tables, in addition to everything it can read in the details table. The corp group LF-Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date

If we were to control access by named resources, each named principal would have to be specifically granted access to those databases and tables, and often there is no ability to control access by column, so that part wouldn’t even be possible at the data level.
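
The tag-matching idea can be illustrated with a simplified model. This is our own sketch of the concept only, not the Lake Formation API:

```javascript
// A simplified model of tag-based authorization: a principal may access a
// resource if one of its LF-Tag values matches a tag on that resource.
function canAccess(principalTags, resourceTags) {
  return Object.entries(resourceTags).some(([key, values]) =>
    (principalTags[key] || []).some((v) => values.includes(v))
  );
}

// employees.position.title tagged for HR and corp, per the example above
const columnTags = { Employees: ['HR', 'corp'] };

console.log(canAccess({ Employees: ['corp'] }, columnTags));   // prints true
console.log(canAccess({ Employees: ['public'] }, columnTags)); // prints false
```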

Building a Data Lake: Summary

AWS Lake Formation really simplifies the process of building a data lake, whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.

What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse

Reporting and dashboarding diagram

The Open Data Lakehouse – a quick intro

Understanding the necessity of building a data lakehouse is critical in today’s data landscape. If you’re looking to get started with constructing a data lakehouse analytics stack, book time with an engineer to expedite the development process.

Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.

The cloud transformation has triggered the disaggregation of compute and storage, which brings cost benefits and the flexibility to store data coming from many sources. All this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open-source and open-format technologies such as Presto and Hudi. In this blog you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.

What is an Open Data Lakehouse?

The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in including licensing, data formats, interfaces and infrastructure.

Four key elements include:

Open source – The technologies on the stack we will be exploring for Open Data Lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community. 

Open formats – The stack doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet, and others.

Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on. 

Open cloud – The stack is cloud agnostic; because compute is decoupled from storage, it aligns naturally with containers and can be run on any cloud.

Why Open Data Lakehouses?

Open data lakehouses allow consolidation of structured and unstructured data in a central repository, the open data lake, at a lower cost. They remove the complexity of running ETL, delivering high performance while reducing the cost and time it takes to run analytics.

  • Bringing compute to your data (decouple of compute and storage)
  • Flexibility at the governance/transaction layer
  • Flexibility and low cost to store structured and semi/unstructured data
  • Flexibility at every layer – pick and choose which technology works best for your workloads/use case

Open Data Lakehouse architecture

Now let’s dive into the stack itself and each of its layers. We’ll discuss what problems each layer solves.


BI/Application tools – Data Visualization, Data Science tools

Plug in your BI/analytical application tool of choice. The Open Data Lake Analytics stack supports JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, and more based on your use case and workload.

Presto – SQL Query Engine for the Data Lake

Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores. 

Presto is built for high-performance interactive querying with in-memory execution.

Key characteristics include: 

  • High scalability from 1 to 1000s of workers
  • Flexibility to support a wide range of SQL use cases
  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
  • Federation of data sources particularly data lakes via Presto connectors
  • Seamless integration with existing SQL systems with ANSI SQL standard
Deploying Presto Clusters

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses and analyzes the query and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into a data storage tier like S3.
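For example, a single federated query can join files on S3 with rows in an operational database. The following is a hypothetical sketch: the catalog names (hive, mysql) and the schema/table/column names are illustrative, not from any real deployment.

```sql
-- Join clickstream data on S3 (via the Hive connector) with a live MySQL table.
-- All catalog, schema, and table names here are assumptions for illustration.
SELECT c.customer_id, c.region, SUM(e.amount) AS total_spend
FROM hive.weblogs.events AS e      -- Parquet files on S3
JOIN mysql.crm.customers AS c      -- operational MySQL table
  ON e.customer_id = c.customer_id
WHERE e.event_date >= DATE '2022-01-01'
GROUP BY c.customer_id, c.region
ORDER BY total_spend DESC
LIMIT 100;
```

The coordinator plans the query so that each connector scans its own source, and only intermediate results flow between workers.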

Apache Hudi – Streaming Transactions in the Open Data Lake

One of the big drawbacks of traditional data warehouses is keeping the data updated. Doing so requires building data marts/cubes and then running constant ETL from source to destination mart, resulting in additional time, cost, and duplication of data. Data in the data lake, by contrast, needs to stay updated and consistent without that operational overhead.

A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following: 

  • Ingesting incremental data
  • Change data capture, including inserts and deletions
  • Incremental data processing
  • ACID transactions

Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source transaction layer with storage abstraction for analytics, originally developed at Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses the open file formats Parquet and Avro for data storage, and two internal table types known as Copy-On-Write and Merge-On-Read.

It has built-in integration with Presto, so you can query Hudi datasets stored in these open file formats.
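Once a Hudi table is registered in the Hive metastore, Presto queries it like any other table through the Hive connector. A minimal sketch, in which the catalog, schema, table, and column names are made up for illustration:

```sql
-- Query a Copy-On-Write Hudi table on S3 through Presto's Hive connector.
-- Names below are hypothetical.
SELECT ride_id, driver_id, fare
FROM hive.lake.trips
WHERE partition_date = '2022-06-01';
```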

Hudi Data Management

Hudi uses a table format based on a directory structure: a table has partitions, which are folders containing the data files for that partition. It has indexing capabilities to support fast upserts. Hudi has two table types that define how data is indexed and laid out, which in turn determines how the underlying data is exposed to queries.

Hudi data management

(Image source: Apache Hudi)

  • Copy-On-Write (COW): Data is stored in the Parquet file format (columnar storage), and each update creates a new version of the files during a write. Updating an existing set of rows results in a rewrite of the entire Parquet file containing those rows.
  • Merge-On-Read (MOR): Data is stored in a combination of the Parquet (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which produces new versions of the columnar files.

Based on the two table types Hudi provides three logical views for querying data from the Data Lake.

  • Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Queries see only the new data written to the table since a commit/compaction. This helps build incremental data pipelines and analytics.
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline
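In practice, Hudi’s Hive sync typically registers a Merge-On-Read table as two tables, with `_ro` (read optimized) and `_rt` (real time) suffixes, so you can choose between the views above per query. A hedged sketch with made-up table names:

```sql
-- Read-optimized view: latest compacted columnar (Parquet) data only.
SELECT COUNT(*) FROM hive.lake.trips_ro;

-- Real-time view: merges row-based delta logs inline for fresher results,
-- at the cost of more work per query.
SELECT COUNT(*) FROM hive.lake.trips_rt;
```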

AWS S3 – The Data Lake

The data lake is the central location for storing structured, semi-structured, and unstructured data from disparate sources, in open formats, on object storage such as AWS S3.

Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.

Getting Started: How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3

Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.

Ingest your data into AWS S3 and query with Presto

Data can be ingested into the data lake from different sources such as Kafka and other databases. By introducing Hudi into the data pipeline, the required Hudi tables are created or updated, and the data is stored in either Parquet or Avro format in the S3 data lake, depending on the table type. BI tools and applications can then query the data using Presto, and results reflect updates as the underlying data changes.

Conclusion

The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.

The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, along with the transactional support and change data capture capabilities of Hudi, make a strong open source and open format solution for data lake analytics. A missing component, however, is data lake governance, which allows queries on S3 to run more securely. AWS recently introduced Lake Formation, a data governance solution for the data lake, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lake with fine-grained access to data.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time: days instead of months, as is common with a traditional data lake approach.

Data Lakehouse

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a fully managed cloud data warehouse service with the ability to scale on demand, and it is compatible with a multitude of AWS tools and technologies. Redshift is the preferred cloud data warehouse for many customers, but its pricing is not simple, since it tries to accommodate different use cases and customers. Let’s walk through the pricing details of Amazon Redshift.

Understanding Redshift cluster pricing

A Redshift cluster consists of multiple nodes, allowing it to process data in parallel. This means Redshift performance depends on the node type and the number of nodes. The node types can be Dense Compute nodes or RA3 nodes with Redshift managed storage.

Dense Compute nodes: These nodes offer physical memory up to 244 GB and SSD storage capacity up to 2.5 TB.

RA3 with managed storage nodes: These nodes have physical memory up to 384 GB and SSD storage capacity up to 128 TB. Additionally, when storage runs out on the nodes, Redshift offloads data to S3. The RA3 pricing below does not include the cost of managed storage.


Redshift spectrum pricing

Redshift Spectrum is a serverless offering that allows running SQL queries directly against an AWS S3 data lake. Spectrum is priced based on the amount of data scanned, per TB.


Concurrency scaling

Amazon Redshift allows you to grab additional resources as needed and release them when they are no longer needed. For every day of typical usage, up to one hour of concurrency scaling is free; every second beyond that is charged for the additional resource usage.

As a pricing example stated by Amazon Redshift: a 10-node DC2.8XL Redshift cluster in US-East costs $48 per hour. Consider a scenario where two transient clusters are utilized for 5 minutes beyond the free concurrency scaling credits. The per-second on-demand rate is $48 × 1/3600 = $0.013 per second. The additional cost for concurrency scaling in this case is $0.013 per second × 300 seconds × 2 transient clusters = $8.

Amazon Redshift managed storage (RMS) pricing

Managed storage comes with RA3 node types. Usage of managed storage is calculated hourly based on the total GB stored. Managed storage does not include backup storage charges due to automated and manual snapshots.


Pricing example for managed storage pricing

100 GB stored for 15 days: 100 GB × 15 days × 24 hours/day = 36,000 GB-hours

100 TB stored for 15 days: 100 TB × 1,024 GB/TB × 15 days × 24 hours/day = 36,864,000 GB-hours

Total usage in GB-hours: 36,000 GB-hours + 36,864,000 GB-hours = 36,900,000 GB-hours

Total usage in GB-months: 36,900,000 / 720 hours per month = 51,250 GB-months

Total charges for the month: 51,250 GB-months × $0.024 = $1,230

Limitations of Redshift Pricing

As you can see, Redshift has only a few instance types, with limited storage.

Customers can easily hit the ceiling on node storage, and Redshift managed storage becomes expensive as data grows.

Redshift Spectrum (the serverless option), at $5 per TB scanned, can be an expensive option, and it removes the customer’s ability to scale nodes up and down to meet performance requirements.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced use cases in Redshift, or to manage increasing cloud storage costs, check out Ahana. Ahana is a powerful managed service for Presto that provides SQL on S3. Unlike Redshift Spectrum, Ahana lets customers choose the right instance type and scale up or down as needed, and it comes with a simple pricing model based on the number of compute instances.

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


On-Demand Presentation

From Lake to Shining Lakehouse, A New Era In Data

From data warehousing to data lakes, and now with so-called Data Lakehouses, we’re seeing an ever greater appreciation for the importance of architecture. The success of Snowflake proved that data warehousing is alive and well, but that’s not to say that data lakes aren’t viable. The key is to find a balance of both worlds, enabling both the rapid analysis afforded by warehousing and the strategic agility of exploratory ad hoc queries that data lakes provide. During this episode of DM Radio you will learn from experts K Raj of General Dynamics Information Technology and Wen Phan of Ahana.


Speakers

K Raj, General Dynamics Information Technology

Wen Phan, Ahana

Eric Kavanaugh, The Bloor Group

PrestoDB on AWS


What is PrestoDB on AWS?

Tip: If you are looking to better understand PrestoDB on AWS then check out the free, downloadable ebook, Learning and Operating Presto. This ebook will breakdown what Presto is, how it started, and best use cases.

To tackle this common question, what is PrestoDB on AWS, let’s first define Presto. PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. Presto was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. It was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables self-service ad-hoc analytics for its users on large amounts of data. With Presto, you can query data where it lives. This includes Hive, Amazon S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

AWS and Presto are a powerful combination. If you want to run PrestoDB on AWS, it’s easy to spin up a managed Presto cluster through the AWS Management Console, the AWS CLI, or the Amazon EMR API.

You can also give Ahana Cloud a try. Ahana is a managed service for Presto that takes care of the devops for you and provides everything you need to build your SQL Data Lakehouse using Presto.

Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effective features of the cloud while allowing you to take advantage of Presto’s distributed query engine. 

How does PrestoDB on AWS Work?

This is another very common question. The quickest answer is that PrestoDB is the compute engine on top of the data storage of your SQL Data Lakehouse. In this case, the storage is AWS S3. See the image below for an overview.

PrestoDB on AWS

Some AWS services work with PrestoDB on AWS, namely Amazon EMR and Amazon Athena. They are managed services that do the integration, testing, setup, configuration, and cluster tuning for you. Both are widely used, but both come with challenges, particularly around price-performance and cost.

There are some differences between EMR Presto and Athena. AWS EMR enables you to provision as many compute instances as you want, within minutes. Amazon Athena lets you run Presto queries on the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune.

Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Learn more about those challenges and why users are moving to Ahana Cloud, SaaS for PrestoDB on AWS.

To get started with Presto for your SQL Data Lakehouse on AWS quickly, check out the services from Ahana Cloud. Ahana has two versions of their solution: a Full-Edition and a Free-Forever Community Edition. Each option has components of the SQL Lakehouse included, as well as support from Ahana. Explore Ahana’s managed service for PrestoDB.

Related Articles 

PrestoDB on Spark

Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads.

Price-Performance Ratio of AWS Athena Presto vs Ahana Cloud for PrestoDB

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project. The biggest difference between the two is that Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers.

Athena Query Limits | Comparing AWS Athena & PrestoDB

In this blog, we discuss AWS Athena vs Presto and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service, like AWS pricing.

Ahana Cloud for PrestoDB

What is Presto and How Does It Work?

What is Presto and how does It work?

How does PrestoDB work? PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives, across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components.

Is Presto a database?

No, PrestoDB is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between Presto and other forks?

PrestoDB originated at Facebook and was initially built for Facebook’s needs. PrestoDB is backed by the Linux Foundation’s Presto Foundation and is the original Facebook open source project. Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory in Presto is usually discussed in the context of the JVM itself: depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVM. PrestoDB itself, however, doesn’t use this memory to cache any data.

How does Presto cache and store data?

Presto stores intermediate data in its buffer cache for the duration of a query’s tasks. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query disparate data sources.

What is the Presto query execution model?

The query execution model is split into a few different concepts: statements, queries, stages, tasks, and splits. After you issue a SQL query (or statement) to the query engine, it parses the statement and converts it to a query. When PrestoDB executes the query, it does so by breaking it up into multiple stages. Stages are then split up into tasks that run across multiple workers. Tasks do the actual work and processing, and they use exchanges to share data between tasks and the outputs of processes.
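You can see this model for yourself with EXPLAIN, which prints the distributed plan broken into fragments (stages). The table in this sketch is hypothetical:

```sql
-- Show the distributed plan: each fragment corresponds to a stage that is
-- split into tasks across the workers.
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, COUNT(*) AS order_count
FROM hive.sales.orders
GROUP BY region;
```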

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, PrestoDB will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 
– Capturing data
– Storing data
– Analysis
– Search
– Sharing
– Transfer
– Visualization
– Querying
– Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Other technologies in this space include Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

Presto data sources are sources that connect to PrestoDB and that you can query. There are a ton in the PrestoDB ecosystem including AWS S3, Redshift, MongoDB, and many more.

What Is Presto Hive? 

Presto Hive typically refers to using PrestoDB with the Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files can be in different formats and are typically stored in an HDFS- or S3-type system. The metadata describes the data files and how they map to schemas and tables; it is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, via the Hive connector, accesses both of these components. One thing to note is that Hive also has its own query execution engine, so running a Presto query against a Hive-defined table is different from running the same query directly through the Hive CLI.
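As a quick illustration, assuming a Hive connector configured under a catalog named hive (the schema and table names below are made up), you can browse the metastore and read the underlying files directly from Presto:

```sql
-- Metadata comes from the Hive metastore; data is read from HDFS/S3 files.
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.default;
SELECT * FROM hive.default.orders LIMIT 10;
```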

Does Presto Use Spark?

PrestoDB and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

PrestoDB is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.

Autoscale your Presto cluster in Ahana Cloud

Autoscaling is now available in Ahana Cloud. This feature monitors the average CPU utilization of your worker nodes and scales out when the 75% threshold is reached.

Best Practices for Resource Management in PrestoDB

Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources.

Presto vs Snowflake: Data Warehousing Comparisons

Data Lakehouse

Snowflake vs Presto

This article touches on several basic elements to compare Presto and Snowflake.

To start, let’s define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics, well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database: you can’t store data in Presto; rather, you use it as a compute engine for your data lakehouse. You can use Presto not only in the public cloud but also on private cloud infrastructure (on-premises or hosted).

Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Use cases: Snowflake vs. Presto

Snowflake is a cloud solution for traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads, moving traditional batch-based reporting and dashboard-based analytics to the cloud. I discuss this limitation under Scalability and Concurrency below.

Presto is not only a solution for reporting and dashboarding. With its connectors and their in-place execution, platform teams can quickly provide access to the datasets analysts are interested in. Presto can return query results in seconds. You can aggregate terabytes of data across multiple data sources and run efficient ETL queries. With Presto, users can query data across many different data sources including databases, data lakes, and data lakehouses.
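As a sketch of an ETL-style use, Presto can materialize an aggregate from raw lake data with CREATE TABLE AS (all catalog, schema, table, and column names here are hypothetical):

```sql
-- Aggregate raw orders into a Parquet summary table on the data lake.
CREATE TABLE hive.analytics.daily_revenue
WITH (format = 'PARQUET') AS
SELECT order_date, SUM(amount) AS revenue
FROM hive.raw.orders
GROUP BY order_date;
```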

Open Source Or Vendor lock-in

Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in. 

Vendor Lock-In can lead to:

  • Excessive cost as you grow your data warehouse
  • Data ingested into a closed-source system is typically locked into that system’s formats
  • No community innovations or ways to leverage other innovative technologies and services to process that same data

Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem. 

Open File Formats

Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, and Apache ORC. Data can be imported from these open formats into Snowflake’s internal file format, but users then miss out on the optimizations these open formats can bring to an engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge ones.

On the other hand, Presto users can run ad-hoc, real-time analytics, and even deep learning, on those same source files without needing to copy them, so users get more flexibility with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without an expensive migration.

Open transaction format

Many organizations are adopting the Data Lakehouse architecture to augment their current data warehouse. This brings the need for a transaction manager layer, which can be provided by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats; Presto supports all of them natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for the Open Data Lakehouse. Moreover, the Snowflake data warehouse doesn’t support semi-/unstructured data workloads or AI/ML/data science workloads, whereas the data lakehouse does.

Data Ownership

While Snowflake did decouple storage and compute, it did not decouple data ownership: Snowflake still owns the compute layer as well as the storage layer. This means users must ingest data into Snowflake in a proprietary format, creating yet another copy of the data and requiring users to move it out of their own environment. Users lose ownership of their data.

On the other hand, Presto is a truly disaggregated stack that allows you to run queries in a federated manner without moving your data or creating multiple copies. With Ahana, users can define Presto clusters and orchestrate and manage them in their own AWS account using cross-account roles.

Scalability and Concurrency

With Snowflake, you hit a limit on the maximum number of concurrent users per virtual warehouse. If you have more than eight concurrent users, you need to spin up another virtual warehouse. Query performance is good for simple queries; however, performance degrades as you apply more complex joins on large datasets, and the only options are limiting the data you query or adding more compute. Parallel writes also impact read operations, and the recommendation is to use separate virtual warehouses.

Presto is designed from the ground up for fast analytic queries against data sets of any size. It has been proven on petabytes of data and supports tens of concurrent queries at a time.

Cost of Snowflake

Users think of Snowflake as an easy, low-cost model. However, it can become very expensive to ingest data into Snowflake. Very large amounts of data and enterprise-grade, long-running queries require adding more virtual warehouses, which can rapidly escalate costs. Basic performance features like materialized views come at additional cost. Because Snowflake is not fully decoupled, data is copied into Snowflake’s managed cloud storage layer within Snowflake’s account; users therefore end up paying Snowflake more than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.

Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database. 

Snowflake vs. Presto: In Summary

Snowflake is a well-known cloud data warehouse, but sometimes users need more than that:

  1. Immediate access to data as soon as it is written, in a federated manner
  2. No lag from ETL migration, since you can query the source directly
  3. A flexible environment for unstructured/semi-structured data or machine learning workloads
  4. Support for open file formats and storage standards to build an open data lakehouse
  5. Open-source technologies to avoid vendor lock-in
  6. A cost-effective solution optimized for high concurrency and scalability

Presto can meet all of these user needs in a more flexible, open-source, secure, scalable, and cost-effective way.

SaaS for Presto

If you want to use Presto, we’ve made it easy to get started in AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you run in containers on Amazon EKS, making the service highly scalable and available. Presto clusters scale compute up and down as necessary, which helps companies control costs. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation to address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and the data access and security policies you want to apply.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

AWS Athena vs AWS Glue: What Are The Differences?

Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).

You might already be familiar with Redshift basics. However, in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.

Redshift Data Warehouse Architecture and Main Components

Redshift internal architecture explained

As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse using native AWS services such as Amazon AppFlow, or uploaded through a variety of third-party apps such as Fivetran and Matillion. Many of these tools also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.
Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node and compute nodes, which are further divided into node slices; each cluster can also contain multiple databases. This design allows Redshift to dynamically allocate resources in order to answer queries efficiently.

Breaking Down the Redshift Cluster Components

  • The leader node is Redshift’s ‘brain’ and manages communications with external client programs. It also manages the internal communication between compute nodes. When a query is made, the leader node will parse it, compile the code and create an execution plan.
  • Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
  • The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
  • Intermediate results are sent back to the leader node. This then performs the final aggregation and sends the results to client applications via ODBC or JDBC. These would frequently be reporting and visualization tools such as Tableau or Amazon Quicksight, or internal software applications that read data from Redshift.
  • Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
  • Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.
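To make the flow above concrete, here is a minimal Python sketch (purely illustrative, not Redshift code) of how a leader might allocate rows to node slices, have each slice compute a partial aggregate in parallel, and then perform the final aggregation itself:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the MPP flow: the "leader" partitions work across
# node slices, each slice aggregates its portion in parallel, and the
# leader combines the partial results before returning them to the client.

def slice_partial_sum(rows):
    # Each node slice aggregates only the rows it was allocated.
    return sum(rows)

def leader_sum(table, num_slices=4):
    # The execution plan assigns each slice a portion of the data.
    slices = [table[i::num_slices] for i in range(num_slices)]
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        partials = list(pool.map(slice_partial_sum, slices))
    # Final aggregation happens on the leader before results go to clients.
    return sum(partials)

sales = list(range(1, 101))  # toy table: values 1..100
print(leader_sum(sales))     # same answer as a single-node sum: 5050
```

The point of the sketch is only the shape of the work distribution: partial results flow back to one coordinator that produces the final answer.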

This AWS presentation offers more details about Redshift’s internal architecture, along with a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum.

Additional Performance Features

In addition to these core components, Redshift has multiple built-in features meant to improve performance:

  • Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and indexing.
  • Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations. 
  • Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time. This ensures final aggregations are returned faster.
  • Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that could include multi-table joins and subqueries. 
  • Result caching: The results of certain types of queries can be stored in-memory on the leader node, which can also reduce query execution time.
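A short, self-contained Python sketch (not Redshift internals) shows why column orientation helps: values of one column sit together, so repetitive columns compress well with run-length encoding, and a scan can read only the columns it needs:

```python
# Illustrative only: row layout vs. column layout for the same toy table.
rows = [
    ("2023-01-01", "US", 100),
    ("2023-01-01", "US", 250),
    ("2023-01-01", "EU", 75),
    ("2023-01-02", "EU", 310),
]

# Column store: one list per column instead of one tuple per record.
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(["date", "region", "amount"])}

def rle(values):
    """Run-length encode a list: [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(rle(columns["region"]))  # [('US', 2), ('EU', 2)] -- 2 entries for 4 rows
print(sum(columns["amount"]))  # 735 -- the scan touches only 'amount'
```

Real columnar engines use far more sophisticated encodings, but the intuition is the same: sorted, repetitive columns shrink dramatically, and analytic queries avoid reading columns they never reference.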

Redshift vs Traditional Data Warehouses

While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:

  • Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
  • Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
  • Serverless capabilities: Redshift Serverless can be used to automatically provision compute resources when a SQL query is made, further abstracting infrastructure management by removing the need to size your cluster in advance.

Redshift Costs and Performance

Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:

  • On-demand instances are charged by the hour, with no long-term commitment or upfront fees. 
  • Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time. 
  • Serverless instances are charged based on usage, so customers only pay for the capacity they consume.
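A back-of-the-envelope comparison makes the trade-off between the three models visible. The rates below are made up for illustration; check the AWS pricing page for real numbers for your region and node type:

```python
# Hypothetical rates -- for illustration only, not actual AWS prices.
ON_DEMAND_PER_NODE_HOUR = 1.00   # $/node-hour
RESERVED_DISCOUNT = 0.40         # 40% discount for a longer-term commitment
SERVERLESS_PER_RPU_HOUR = 0.45   # $/RPU-hour, billed only while queries run

def on_demand_monthly(nodes, hours=730):
    return nodes * hours * ON_DEMAND_PER_NODE_HOUR

def reserved_monthly(nodes, hours=730):
    return on_demand_monthly(nodes, hours) * (1 - RESERVED_DISCOUNT)

def serverless_monthly(rpus, busy_hours):
    return rpus * busy_hours * SERVERLESS_PER_RPU_HOUR

# A 4-node cluster running 24/7 vs. a serverless workload busy 60 h/month:
print(on_demand_monthly(4))        # 2920.0
print(reserved_monthly(4))         # 1752.0
print(serverless_monthly(8, 60))   # 216.0
```

The shape of the result is the useful part: reserved pricing rewards predictable, always-on workloads, while serverless pricing rewards bursty, intermittent ones.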

The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets, you might find yourself paying for very large clusters.

Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query. 

Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.

Reducing Redshift Costs with a Lakehouse Architecture

We’ve worked with many companies who started out using Redshift. This was sufficient for their needs when they didn’t have much data, but they found it difficult and costly to scale as their needs evolved. 

Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.

At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse. This would leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.

Redshift Lakehouse Architecture Explained

In this architecture, companies can continue to use Redshift for workloads that require consistent performance, such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data without having to constantly resize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.
To learn more about what considerations you should be thinking about as you look at data warehouses or data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.

Still Choosing Between a Data Warehouse and a Data Lake?

Watch the webinar, Data Warehouse or Data Lake, which one do I choose? In this webinar, we’ll discuss the data landscape and why many companies are moving to an open data lakehouse.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand, but its pricing is not simple. This article breaks down how the pricing works.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Data Lakehouse

What Is Trino? Is Trino a Database?

What is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine, which was forked from the original Presto project whose GitHub repo was called PrestoDB. As such, it was designed from the ground up for fast queries against any amount of data. It supports relational and non-relational data sources via its connector architecture.

What is the history of Trino?

Trino is a hard fork of the widely popular open source Presto project, which started out at Facebook, running large-scale interactive analytic queries against a 300PB data lake using Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive. In November 2013, Facebook open sourced Presto under the Apache 2.0 license and made it available in the public GitHub repository named “prestodb”. In early 2019, the hard fork named Trino was started by the creators of Presto, who later became cofounders/CTOs of the commercial vendor Starburst. In the meantime, Presto became part of the openly governed Presto Foundation, hosted under the guidance and experience of The Linux Foundation. Trino has subsequently diverged from Presto, and many of the innovations the community is driving in Presto are not available outside of Presto. Today, only Presto – not Trino – is running at companies like Facebook, Uber, Twitter, and Alibaba.

Why is Trino so fast?

As a hard fork of the original Presto project, Trino carries with it some of the original elements that make Presto so fast, namely the in-memory execution architecture. Prior to Presto, distributed query engines such as Hive were designed to write intermediate results to disk.
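A toy Python contrast (not Hive or Presto code) illustrates the difference: a disk-staged pipeline persists each intermediate result and reads it back, while an in-memory pipeline streams rows from one operator to the next:

```python
import json
import os
import tempfile

data = list(range(10))

def disk_staged(rows):
    # Stage 1: filter, persist the intermediate result to disk (Hive-style).
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        json.dump([r for r in rows if r % 2 == 0], f)
    # Stage 2: read the intermediate back from disk and aggregate.
    with open(path) as f:
        intermediate = json.load(f)
    os.remove(path)
    return sum(intermediate)

def in_memory(rows):
    # Operators pipeline in memory; nothing is materialized to disk.
    evens = (r for r in rows if r % 2 == 0)
    return sum(evens)

print(disk_staged(data), in_memory(data))  # same answer: 20 20
```

Both paths produce the same answer; the in-memory path simply skips the serialize/write/read cycle between stages, which is where the speedup comes from at scale.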

How does Trino work?

It’s a distributed system that uses an architecture similar to massively parallel processing (MPP) databases, with one coordinator node working with multiple worker nodes. Users submit SQL to the coordinator, which uses its query planner and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It supports standard ANSI SQL, including complex queries, joins, aggregations, and outer joins.

What is Apache Trino?

Actually, this is a misnomer, in that Trino is not a project hosted under the well-known Apache Software Foundation (ASF). Apache Incubator and top-level projects are subject to the naming convention “Apache [Project Name]” – an example of which is Apache Mesos. Instead, the Trino project, a hard fork of Presto, sits with a vendor-controlled non-profit called the Trino Software Foundation, which is not affiliated with established project hosting organizations like the ASF or The Linux Foundation. The misnomer may have arisen from the fact that most open source projects, Trino included, use the Apache 2.0 license.

Is Trino OLAP?

It’s an open source distributed SQL query engine and a hard fork of the original Presto project created by Facebook. It lets developers run interactive analytics against large volumes of data, and organizations can use their existing SQL skills to query data without having to learn new, complex languages. The architecture is quite similar to traditional online analytical processing (OLAP) systems that use distributed computing architectures, in which one coordinator node coordinates multiple worker nodes.

What is the Trino Software Foundation?

The Trino Software Foundation is a non-profit corporation controlled by the cofounders of the commercial vendor Starburst, and it hosts the open source Trino project – a hard fork of the Presto project, which is separate and hosted by The Linux Foundation. The Trino website offers only two sentences about the foundation: “The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.” What is not mentioned is any form of charter or governance. These are table stakes for Linux Foundation projects, where project governance is central to the project.

What SQL does Trino use?

Just like the original Presto, Trino is built with a familiar SQL query interface that allows interactive SQL on many data sources. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations.

What Is A Trino database?

Trino is a distributed system that queries Hadoop/HDFS and other data sources, using a classic MPP (massively parallel processing) model. The Java-based system has a coordinator node (master) working in conjunction with a scalable set of worker nodes. Users send their SQL query through a client to the Trino coordinator, which plans and schedules a distributed query plan across all its worker nodes. Both Trino and Presto are SQL query engines and thus are not databases by themselves: they do not store any data. From a user perspective, however, Trino can appear as a database because it queries the connected data stores.
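The connector idea can be sketched in a few lines of Python. The class and method names below are illustrative, not Trino’s actual SPI: the engine stores no data itself, and each connector adapts some external store to a common scan interface, so one query can read from several systems at once:

```python
# Hypothetical sketch of a pluggable-connector design.
class Connector:
    def scan(self, table):
        raise NotImplementedError

class MemoryConnector(Connector):
    """Stands in for a relational source like MySQL."""
    def __init__(self, tables):
        self.tables = tables
    def scan(self, table):
        return list(self.tables[table])

class CsvConnector(Connector):
    """Stands in for a file-based source like S3 or HDFS."""
    def __init__(self, files):
        self.files = files
    def scan(self, table):
        rows = []
        for line in self.files[table].strip().splitlines():
            user_id, region = line.split(",")
            rows.append((int(user_id), region))
        return rows

def join_on_first_column(left_rows, right_rows):
    # A trivial hash join across two connectors -- the "query engine".
    index = {r[0]: r for r in right_rows}
    return [l + index[l[0]][1:] for l in left_rows if l[0] in index]

mysql = MemoryConnector({"orders": [(1, 250), (2, 75)]})
s3 = CsvConnector({"users": "1,US\n2,EU\n3,APAC"})
print(join_on_first_column(mysql.scan("orders"), s3.scan("users")))
# [(1, 250, 'US'), (2, 75, 'EU')]
```

Because every source looks the same behind `scan()`, adding a new system means writing one adapter rather than changing the engine – which is why Presto and Trino accumulated so many connectors.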

What is the difference between Presto and Trino?

There are technical innovations and differences between Presto and Trino that include:
– Presto is developed, tested, and runs at scale at Facebook, Uber, and Twitter
– Presto uses 6X less memory and repartitions 2X faster with project Aria
– “Presto on Spark” today can run massive batch ETL jobs.
– Presto today is 10X faster with project RaptorX, providing caching at multiple levels
– The Presto community is making Presto more reliable and scalable with multiple coordinators instead of the single point of failure of one coordinator node.  

Trino can query data where it is stored, without needing to move it into a separate warehouse or analytics database. Queries are executed in parallel in the memory of distributed worker machines, and most results return in seconds. Whereas Trino is a newer fork, Presto continues to be used by many well-known companies: Facebook, Uber, Twitter, AWS. Trino is a vendor-driven project, as it is hosted in a non-profit organization owned by the cofounders of the Trino vendor Starburst. In comparison, Presto is hosted by the Presto Foundation, a sub-foundation under The Linux Foundation. Multiple vendors support Presto, including the Presto as a Service (SaaS) offerings Ahana Cloud for Presto and AWS Athena, which is based on Presto, not Trino.

As the diagram below illustrates, Presto saves time by running operations on intermediate datasets in the memory of the worker machines instead of persisting them to disk, which is much faster. It also shuffles data amongst the workers as needed, which obviates the writes to disk between stages. Hive, by contrast, persists intermediate datasets to disk, while Presto executes tasks in-memory.

While Presto and Trino share the pipelining approach, Presto has a number of performance innovations that are not shared, such as caching. For more about the differences, see the April 2021 talk by Facebook at PrestoCon Day, which describes what they, along with others like Ahana, are doing to push the technology forward.

Trino is a distributed SQL query engine that is best used for running interactive analytic workloads on your data lakes and data sources. It is used for similar use cases to those the original Presto project was designed for. It allows you to query many different data sources, whether it’s HDFS, Postgres, MySQL, Elasticsearch, or an S3-based data lake. Trino is built in Java and can also integrate with other third-party data sources or infrastructure components.

Trino SQL

After the query is parsed, Trino processes the workload into multiple stages across workers. Computing is done in-memory with staged pipelines.

To make Trino extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. Because of this, it has a lot of connectors, including non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, and Microsoft SQL Server. Like the original community-driven open source Presto project, the data is queried where it is stored, without the need to move it into a separate analytics system.

Want to hear more about Ahana – the easiest Presto managed service ever made? Learn more about Ahana Cloud.

Webinar On-Demand
Data Warehouse or Data Lake, which one do I choose?

(Hosted by Dataversity)

Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). There are pros and cons to each approach. While data warehouses give you strong data management with analytics, they don’t handle semi-structured and unstructured data well, tightly couple storage and compute, and often involve expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.


Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.


In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.

Speaker

Ali LeClerc

Head of Community, Ahana


Ahana Announces New Presto Query Analyzer to Bring Instant Insights into Presto Clusters

Free-to-use Presto Query Analyzer by Ahana enables data platform teams to analyze Presto workloads and ensure top performance

San Mateo, Calif. – May 18, 2022 – Ahana, the only SaaS for Presto, today announced a new tool for Presto users called the Presto Query Analyzer. With the Presto Query Analyzer, data platform teams can get instant insights into their Presto clusters including query performance, bandwidth bottlenecks, and much more. The Presto Query Analyzer was built for the Presto community and is free to use.

Presto has become the SQL query engine of choice for the open data lakehouse. The open data lakehouse brings the reliability and performance of the data warehouse together with the flexibility and simplicity of the data lake, enabling data warehouse workloads to run alongside machine learning workloads. Presto on the open data lakehouse enables much better price performance as compared to expensive data warehousing solutions. As more companies are moving to an open data lakehouse approach with Presto as its engine, having more insights into query performance, workloads, resource consumption, and much more is critical.

“We built the Presto Query Analyzer to help data platform teams get deeper insights into their Presto clusters, and we are thrilled to be making this tool freely available to the broader Presto community,” said Steven Mih, Cofounder & CEO, Ahana. “As we see the growth and adoption of Presto continue to skyrocket, our mission is to help Presto users get started and be successful with the open source project. The Presto Query Analyzer will help teams get even more out of their Presto usage, and we look forward to doing even more for the community in the upcoming months.”

Key benefits of the Presto Query Analyzer include:

  • Understand query workloads: Break down queries by operators, CPU time, memory consumption, and bandwidth. Easily cross-reference queries for deep drill down.
  • Identify popular data: See which catalogs, schemas, tables, and columns are most and least frequently used, and by whom.
  • Monitor resource consumption: Track CPU and memory utilization across the users in a cluster.

The Presto Query Analyzer by Ahana is free to download and use. Download it to get started today.

More Resources

Free Download: Presto Query Analyzer by Ahana

Presto Query Analyzer sample report

Tweet this:  @AhanaIO announces #free #Presto Query Analyzer for instant insights into your Presto clusters https://bit.ly/3lo2rMM

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Data Lakehouse

What is Amazon Redshift Used For?

Introduction

Amazon Redshift is one of the most widely-used services in the AWS ecosystem, and is a familiar component in many cloud architectures. In this article, we’ll cover the key facts you need to know about this cloud data warehouse, and the use cases it is best suited for. We’ll also discuss the limitations and scenarios where you might want to consider alternatives.

What is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse offered by AWS, first introduced in 2012. Today Redshift is used by thousands of customers, typically for workloads ranging from hundreds of gigabytes to petabytes of data.

Redshift is based on PostgreSQL 8.0.2 and supports standard SQL for database operations. Under the hood, various optimizations are implemented to provide fast performance even at larger data scales, including massively parallel processing (MPP) and read-optimized columnar storage.

What is a Redshift Cluster?

A Redshift cluster represents a group of nodes provisioned as resources for a specific data warehouse. Each cluster consists of a leader node and compute nodes. When a query is executed, Redshift’s MPP design automatically distributes the processing needed to return the results across the available nodes.

The appropriate cluster size depends on the amount of data stored in your database, the number of queries being executed, and the desired performance.

Scaling and managing clusters can be done through the Redshift console, the AWS CLI, or programmatically through the Redshift Query API.

What Makes Redshift Unique?

When Redshift was first launched, it represented a true paradigm shift from traditional data warehouses provided by the likes of Oracle and Teradata. As a fully managed service, Redshift allowed development teams to shift their focus away from infrastructure and toward core application development. The ability to add compute resources automatically with just a few clicks or lines of code, rather than having to set up and configure hardware, was revolutionary and allowed for much faster application development cycles.

Today, many modern cloud data warehouses offer similar linear scaling and infrastructure-as-a-service functionality – notable examples include Snowflake and Google BigQuery. However, Redshift remains a very popular choice and is tightly integrated with other services in the AWS cloud ecosystem.

Amazon continues to improve Redshift, and in recent years has introduced federated query capabilities, a serverless option, and AQUA (a hardware-accelerated cache).

Redshift Use Cases

Redshift’s Postgres roots mean it is optimized for online analytical processing (OLAP) and business intelligence (BI) – typically executing complex SQL queries on large volumes of data rather than transactional processing which focuses on efficiently retrieving and manipulating a single row.

Some common use cases for Redshift include:

  • Enterprise data warehouse: Even smaller organizations often work with data from multiple sources such as advertising, CRM, and customer support. Redshift can be used as a centralized repository that stores data from different sources in a unified schema and structure to create a single source of truth. This can then feed enterprise-wide reporting and analytics.
  • BI and analytics: Redshift’s fast query execution against terabyte-scale data makes it an excellent choice for business intelligence use cases. Redshift is often used as the underlying database for BI tools such as Tableau (which otherwise might struggle to perform when querying or joining larger datasets).
  • Embedded analytics and analytics as a service: Some organizations might choose to monetize the data they collect by exposing it to customers. Redshift’s data sharing, search, and aggregation capabilities make it viable for these scenarios, as it allows exposing only relevant subsets of data per customer while ensuring other databases, tables, or rows remain secure and private.
  • Production workloads: Redshift’s performance is consistent and predictable, as long as the cluster is adequately-resourced. This makes it a popular choice for data-driven applications, which might use data for reporting or perform calculations on it.
  • Change data capture and database migration: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexible analytical capabilities, or when migrating from legacy data warehouses.

Redshift Challenges and Limitations 

While Amazon Redshift is a powerful and versatile data warehouse, it still suffers from the limitations of any relational database, including:

  • Costs: Because storage and compute are coupled, Redshift costs can quickly grow very high. This is especially noticeable when working with larger datasets, or with streaming sources such as application logs.
  • Complex data ingestion: Unlike Amazon S3, Redshift does not support unstructured object storage. Data needs to be stored in tables with predefined schemas. This can often require complex ETL or ELT processes to be performed when data is written to Redshift. 
  • Access to historical data: Due to the above limiting factors, most organizations choose to store only a subset of raw data in Redshift, or limit the number of historical versions of the data that they retain. 
  • Vendor lock-in: Migrating data between relational databases is always a challenge due to the rigid schema and file formats used by each vendor. This can create significant vendor lock-in and make it difficult to use other tools to analyze or access data.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced Redshift use cases, or managing increasing cloud storage costs, check out Ahana. Ahana Cloud is a powerful managed service for Presto which provides SQL on S3. 

Start Running SQL on your Data Lakehouse

Go from 0 to Presto in 30 minutes and drop the limitations of the data warehouse

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

There can be some confusion with the difference between AWS Redshift Spectrum and AWS Athena. Learn more about the differences in this article.

ETL process diagram

ETL and ELT in Data Warehousing

What is the difference between ETL and ELT?

What is ETL used for?

If you’re looking to understand ETL and ELT differences, let’s start by explaining what they are. ETL, or Extract Transform Load, is when an ETL tool or series of homegrown programs extracts data from one or more data sources, often relational databases, and performs transformation functions. Those transformations could be data cleansing, standardization, enrichment, etc. The tool then writes (loads) that data into a new repository, often a data warehouse.

In the ETL process, an ETL tool or series of programs extracts the data from different RDBMS source systems, and then transforms the data, by applying calculations, concatenations, etc., and then loads the data into the Data Warehouse system.

What is ELT used for?

ELT, or Extract Load Transform turns the ETL process around a little bit and has you extract the raw data out from the data source and directly load it into the destination, without any processing in between. The transformation process is then done “in place” in the destination repository. Generally, the raw data is stored indefinitely so various transformations and enrichments can all be done by users with access to it, using tools they are familiar with.

Both are data integration styles and have much in common with their ultimate goals, but are implemented very differently. Knowing what they are, and understanding the ETL and ELT processes, let’s dive deeper into how they differ from one another.

What is the difference between ETL and ELT?

So how does ETL vs ELT break down?

  • Definition – ETL: Data is extracted from ‘n’ number of data sources, transformed in a separate process, then loaded into the destination repository. ELT: Data is extracted from ‘n’ number of data sources and loaded directly into the destination repository; transformation occurs inside the destination.
  • Transformation – ETL: Data is transformed within an intermediate processing step that is independent of extract and load. ELT: Data can be transformed on an ad-hoc basis during reads, or in batch and stored in another set of tables.
  • Code-based transformations – ETL: Primarily executed in the compute-intensive transformation process. ELT: Primarily executed in the database, but also done ad-hoc through analysis tools.
  • Data lake support – ETL: Only in the sense that the lake can be utilized as storage for the transformation step. ELT: Well oriented for the data lake.
  • Cost – ETL: Specialized servers for transformation can add significant costs. ELT: Object stores are very inexpensive, requiring no specialized servers.
  • Maintenance – ETL: Additional servers add to the overall maintenance burden. ELT: Fewer systems mean less to maintain.
  • Loading – ETL: Data has to be transformed prior to loading. ELT: Data is loaded directly into the destination system.
  • Maturity – ETL: Tools and methods have been around for decades and are well understood. ELT: Relatively new on the scene, with emerging standards and less experience.
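The two flows can be sketched on the same toy records. The function and field names below are illustrative, not any particular tool’s API:

```python
raw = [
    {"name": " Alice ", "country": "us"},
    {"name": "Bob", "country": "US"},
]

def transform(record):
    # Cleansing/standardization step shared by both styles.
    return {"name": record["name"].strip(), "country": record["country"].upper()}

# ETL: transform first, then load only the cleaned data into the warehouse.
warehouse = [transform(r) for r in raw]

# ELT: load the raw data as-is, then transform "in place" at query time.
lake = list(raw)                 # the raw copy is retained indefinitely
elt_view = [transform(r) for r in lake]

print(warehouse == elt_view)     # True -- same result, different step order
print(lake[0]["name"])           # ' Alice ' -- ELT still holds the raw data
```

The end results match; the practical difference is that the ELT side keeps the untouched raw data around, so other users can apply different transformations to it later.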

Use Cases

Let’s take HIPAA as an example of data that would lend itself to ETL rather than ELT. The raw HIPAA data contains a lot of sensitive information about patients that isn’t allowed to be shared, so you would need to go through the transformation process prior to loading it to remove any of that sensitive information. Say your analysts were trying to track cancer treatments for different types of cancer across a geographic region. You would scrub your data down in the transformation process to include treatment dates, location, cancer type, age, gender, etc., but remove any identifying information about the patient.

An ELT approach makes more sense with a data lake, where you have lots of structured, semi-structured, and unstructured data. This can also include high-velocity data where you are trying to make decisions in near real-time. Consider an MMORPG where you want to offer incentives to players in a particular region who have performed a particular task. That data is probably coming in through a streaming platform such as Kafka, and analysts are running transformation jobs on the fly to distill it down to the information needed to fuel the desired action.

Differences between ETL and ELT

Summary

In summary, the difference between ETL and ELT in data warehousing really comes down to how you are going to use the data as illustrated above. They satisfy very different use cases and require thoughtful planning and a good understanding of your environment and goals. If you’re exploring whether to use a data warehouse or a data lake, we have some resources that might be helpful. Check out our white paper on Unlocking the Business Value of the Data Lake which discusses the data lake approach in comparison to the data warehouse. 

Ready to Modernize Your Data Stack?

In this free whitepaper you’ll learn what the open data lakehouse is, and how it overcomes challenges of previous solutions. Get the key to unlocking lakehouse analytics.

Related Articles

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.

Data Lakehouse

Presto has evolved into a unified engine for SQL queries on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.

Presto's Redshift connector allows you to run SQL queries on data stored in an external Amazon Redshift cluster. This can be used to join data between different systems, like Redshift and Hive, or between two different Redshift clusters. 

How to Run SQL Queries in Redshift with Presto

Step 1: Set up a Presto cluster with Kubernetes

Set up your own Presto cluster on Kubernetes using our Presto on Kubernetes tutorial, or use Ahana's managed service for Presto.

Step 2: Set up an Amazon Redshift cluster

Create an Amazon Redshift cluster from the AWS Console and make sure it's up and running, with the dataset and tables as described here.

The screenshot below shows the Amazon Redshift cluster, "redshift-presto-demo".


You will also need the JDBC URL from the cluster to set up the Redshift connector with Presto.

You can skip this section if you want to use your existing Redshift cluster; just make sure your Redshift cluster is accessible from Presto, because AWS services are secure by default. Even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the cluster can prevent inbound connections to the database. In simple terms, the security group settings of the Redshift database act as a firewall and can block inbound database connections over port 5439. Find the assigned security group and check its inbound rules. 
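If you manage that inbound rule with the AWS SDK, the call looks roughly like the sketch below; the security group id and CIDR range are placeholder values, not ones from this tutorial:

```python
# Sketch: allow inbound Redshift traffic (port 5439) from the Presto
# compute plane. The CIDR and group id below are placeholder values.

ingress_rule = {
    "IpProtocol": "tcp",
    "FromPort": 5439,
    "ToPort": 5439,
    "IpRanges": [{"CidrIp": "10.0.0.0/16",           # placeholder CIDR
                  "Description": "Presto compute plane"}],
}

def open_redshift_port(group_id: str) -> None:
    """Attach the ingress rule to the given security group."""
    import boto3  # requires AWS credentials to actually run
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(GroupId=group_id,
                                         IpPermissions=[ingress_rule])

# open_redshift_port("sg-0123456789abcdef0")  # placeholder group id
```

You can of course do the same thing from the EC2 console by editing the security group's inbound rules directly.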

If your Presto Compute Plane VPC and data sources are in a different VPC then you need to configure a VPC peering connection.

Step 3: Configure Presto Catalog for Amazon Redshift Connector

At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.

Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connector.name=redshift
connection-url=jdbc:postgresql://example.net:5439/database
connection-user=root
connection-password=secret

This is what my catalog properties look like:

  my_redshift.properties: |
      connector.name=redshift
      connection-user=awsuser
      connection-password=admin1234
      connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev

Step 4: Check the available datasets, schemas, and tables, and run SQL queries with the Presto client to access the Redshift database

After successfully connecting to Amazon Redshift, you can connect with the Presto CLI, check that the Redshift catalog gets picked up, and run show schemas and show tables to see the available data.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog my_redshift --schema <schema_name> --user <presto_username> --password

In the example below you can see that a new catalog called "my_redshift" has been initiated for the Redshift database.

presto> show catalogs;
   Catalog   
-------------
 ahana_hive  
 jmx         
 my_redshift 
 system      
 tpcds       
 tpch        
(6 rows)
 
Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Next, you can check all of the schemas available in your Amazon Redshift database from Presto.

presto> show schemas from my_redshift;
       Schema       
--------------------
 catalog_history    
 information_schema 
 pg_catalog         
 pg_internal        
 public             
(5 rows)
 
Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:01 [5 rows, 85B] [4 rows/s, 72B/s]

Here, I have used the sample data that comes with the Redshift cluster setup. I have chosen the schema "public", which is part of the "dev" Redshift database.

presto> show tables from my_redshift.public;
  Table   
----------
 category 
 date     
 event    
 listing  
 sales    
 users    
 venue    
(7 rows)
 
Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:03 [7 rows, 151B] [2 rows/s, 56B/s]

Next, you can explore individual tables, such as "sales" in the example below.

presto> select * from my_redshift.public.sales LIMIT 2;
 salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission |        saletime         
---------+--------+----------+---------+---------+--------+---------+-----------+------------+-------------------------
   33095 |  36572 |    30047 |     660 |    2903 |   1827 |       2 | 234.00    | 35.10      | 2008-01-01 01:41:06.000 
   88268 | 100813 |    45818 |     698 |    8649 |   1827 |       4 | 836.00    | 125.40     | 2007-12-31 23:26:20.000 
(2 rows)
 
Query 20210810_185527_00212_krtkp, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]

Following are some more complex queries you can run against sample data:

presto:public> -- Find top 10 buyers by quantity
            -> SELECT firstname, lastname, total_quantity 
            -> FROM   (SELECT buyerid, sum(qtysold) total_quantity
            ->         FROM  sales
            ->         GROUP BY buyerid
            ->         ORDER BY total_quantity desc limit 10) Q, users
            -> WHERE Q.buyerid = userid
            -> ORDER BY Q.total_quantity desc;
 firstname | lastname | total_quantity 
-----------+----------+----------------
 Jerry     | Nichols  |             67 
 Armando   | Lopez    |             64 
 Kameko    | Bowman   |             64 
 Kellie    | Savage   |             63 
 Belle     | Foreman  |             60 
 Penelope  | Merritt  |             60 
 Kadeem    | Blair    |             60 
 Rhona     | Sweet    |             60 
 Deborah   | Barber   |             60 
 Herrod    | Sparks   |             60 
(10 rows)
 
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
 
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
            -> SELECT eventname, total_price 
            -> FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
            ->        FROM (SELECT eventid, sum(pricepaid) total_price
            ->              FROM   sales
            ->              GROUP BY eventid)) Q, event E
            ->        WHERE Q.eventid = E.eventid
            ->        AND percentile = 1
            -> ORDER BY total_price desc;
      eventname       | total_price 
----------------------+-------------
 Adriana Lecouvreur   | 51846.00    
 Janet Jackson        | 51049.00    
 Phantom of the Opera | 50301.00    
 The Little Mermaid   | 49956.00    
 Citizen Cope         | 49823.00    
 Sevendust            | 48020.00    
 Electra              | 47883.00    
 Mary Poppins         | 46780.00    
 Live                 | 46661.00    
(9 rows)
 
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]

Step 5: Run SQL queries to join data between different systems like Redshift and Hive

Another great use case of Presto is data federation. In this example I will join an Apache Hive table with an Amazon Redshift table and run a JOIN query that accesses both tables from Presto.

Here, I have two catalogs: "ahana_hive" for the Hive database and "my_redshift" for Amazon Redshift, with the ahana_hive.default.customer and my_redshift.public.users tables in their respective schemas.

The following simple SQL queries join these tables, the same way you would join two tables from the same database.

presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> select * from ahana_hive.default.customer x join my_redshift.public.users y on x.nationkey = y.userid;

Advanced SQL on Redshift

Understanding Redshift’s Limitations

Running SQL queries on Redshift has its advantages, but there are some shortcomings associated with Amazon Redshift. If you are looking for more information about Amazon Redshift, check out the pros and cons and some of the limitations of Redshift in more detail.


Start Running SQL Queries on your Data Lakehouse

We made it simple to run SQL queries on Presto in the cloud.
Get started with Ahana Cloud and start running SQL in a few mins.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse, permitting the execution of SQL queries, offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. Users can easily run SQL queries on Redshift, but there are some limitations.

Data Lakehouse

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

Before we begin: Redshift Spectrum vs Redshift

While the thrust of this article is an AWS Redshift Spectrum vs Athena comparison, there can be some confusion about the difference between AWS Redshift Spectrum and AWS Redshift. Very briefly: Redshift is the storage layer/data warehouse, while Redshift Spectrum is an extension of Redshift that serves as a query engine.


What is Amazon Athena?

Athena is Amazon's standalone, serverless SQL query engine implementation of Presto, used to query data stored on Amazon S3. It is fully managed by Amazon; there is nothing to set up, manage, or configure. This also means that performance can be very inconsistent, as you have no dedicated compute resources.

What is Amazon Redshift Spectrum?

Redshift Spectrum is an extension of Amazon Redshift. It is a serverless query engine that can query both AWS S3 data and tabular data in Redshift using SQL. This enables you to join data stored in external object stores with data stored in Redshift to perform more advanced queries.

Key Features & Differences: Redshift vs Athena

Athena and Redshift Spectrum offer similar functionality: serverless querying of S3 data using SQL. That makes them easy to manage, and it is also cost-effective, as there is nothing to set up and you are only charged based on the amount of data scanned. S3 storage is significantly less expensive than a database on AWS for the same amount of data.

  • Pooled vs allocated resources: Both are serverless, however Spectrum resources are allocated based on your Redshift cluster size. Athena, however, relies on non-dedicated, pooled resources.
  • Cluster management: Spectrum actually does need a bit of cluster management, but Athena is truly serverless.
  • Performance: Performance for Athena depends on your S3 optimization, while Spectrum, as previously noted, depends on your Redshift cluster resources and S3 optimization. If you need a specific query to run more quickly, then you can allocate additional compute resources to it.
  • Standalone vs feature: Redshift Spectrum runs in tandem with Amazon Redshift, while Athena is a standalone query engine for querying data stored in Amazon S3.
  • Consistency: Spectrum provides more consistency in query performance while Athena has inconsistent results due to the pooled resources.
  • Query types: Athena is great for simpler interactive queries, while Spectrum is more oriented towards large, complex queries.
  • Pricing: The cost for both is the same: $5 per terabyte of data scanned. However, with Spectrum you must also consider the Redshift compute costs.
  • Schema management: Both use AWS Glue for schema management, and while Athena is designed to work directly with Glue, Spectrum needs external tables to be configured for each Glue catalog schema.
  • Federated query capabilities: Both support federated queries.

Athena vs Redshift: Functionality

The functionality of each is very similar: using standard SQL to query the S3 object store. If you are working with Redshift, then Spectrum can join information in S3 with tables stored in Redshift directly. Athena also has a Redshift connector that allows for similar joins, but if you are already using Redshift, it would likely make more sense to use Spectrum.

Athena vs Redshift: Integrations

Keep in mind that when working with S3 objects, these are not traditional databases, which means there are no indexes to be scanned or used for joins. If you are working with high-cardinality files and trying to join them, you will likely see very poor performance.

When connecting to data sources other than S3, Athena has a connector ecosystem to work with. This system provides a collection of sources that you can directly query with no copy required. Federated queries were added to Spectrum in 2020 and provide a similar capability with the added benefit of being able to perform transformations on the data and load it directly into Redshift tables.

AWS Athena vs Redshift: To Summarize

If you are already using Redshift, then Spectrum makes a lot of sense, but if you are just getting started with the cloud, then the Redshift ecosystem is likely overkill. AWS Athena is a good place to start if you are just getting started in the cloud and want to test the waters at low cost and with minimal effort. Athena, however, quickly runs into challenges around limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will also increase significantly as the scanned data volume grows.

At Ahana, many of our customers are previous Athena and/or Redshift users who saw challenges around price-performance (Redshift) and concurrency/deployment control (Athena). Keep in mind that Athena and Redshift Spectrum both charge $5 per terabyte scanned, while Ahana is priced purely on instance hours – giving you the power of Presto, ease of setup and management, better price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Understanding AWS Athena Costs with Examples

Athena costs can be unpredictable and hard to control. Ahana is the PrestoDB solution that gives you back control over both performance and your cloud bill with pay-as-you-go pricing and the ability to provision more resources to tackle harder queries. Learn more or request a demo today

What Is Amazon Athena? 

Since you're reading this to understand Athena costs, you likely already know what it is, so we'll only touch on it very briefly. Amazon Athena is a managed, serverless version of Presto that provides a SQL query engine for analyzing data stored in AWS S3. Because the service has no dedicated resources, it will not perform in a consistent fashion, so it is best suited to situations where reliable speed and scalability are not particularly important: testing ideas, small use cases, and quick ad-hoc analysis.

How Much Does AWS Athena Cost?

An Athena query costs from $5 to $7 per terabyte scanned, depending on the region. Most materials only quote the $5 figure, but some regions cost $7, so keep that in mind. For our examples, we'll use $5 per terabyte as our base. There is no charge for failed queries, but other charges, such as S3 storage, apply as usual for any service you are using.

AWS Athena Pricing Example

In this example, we have a screenshot from the Amazon Athena pricing calculator where we assume 1 query per work day, so 20 queries a month, each scanning 4 TB of data. At $5 per TB scanned, a query that scans 4 TB of data costs $20; running that query 20 times per month gives 20 * $20 = $400 per month.

  • Price per TB scanned: $5
  • Queries per month: 20
  • TB of data scanned per query: 4
  • Total monthly cost: $400

You can mitigate these costs by storing your data compressed, if that is an option for you. Even a conservative 2:1 compression ratio would cut your costs in half, to $200 per month. If you also store your data in a columnar format like ORC or Parquet, you can reduce costs further by scanning only the columns you need instead of the entire row every time. Applying the same 50% reduction again brings the cost down to $100 per month.

Let's try a larger example, and not even an outlandish one if you are using a data lake and doing serious processing. Say you run 20 queries per day against 100 TB of uncompressed, row-based data.

That's right: $304,000 per month. Twenty queries per day isn't even unrealistic if you have some departments wanting to run dashboard queries to get updates on various metrics.
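Both totals are easy to check with a few lines of arithmetic (the ~30.4 days-per-month figure is an assumption, inferred from the quoted $304,000 total):

```python
# Back-of-the-envelope Athena cost model: $5 per TB of data scanned.
PRICE_PER_TB = 5

def monthly_cost(queries_per_month, tb_scanned_per_query):
    """Total monthly Athena bill in dollars."""
    return queries_per_month * tb_scanned_per_query * PRICE_PER_TB

# First example: 20 queries per month, 4 TB scanned by each query.
base = monthly_cost(20, 4)        # $400
compressed = base / 2             # ~2:1 compression halves the scan -> $200
columnar = compressed / 2         # columnar pruning halves it again -> $100

# Larger example: 20 queries per day on 100 TB of uncompressed row data,
# assuming ~30.4 days per month (inferred from the quoted total).
large = monthly_cost(20 * 30.4, 100)  # $304,000
```

The same function makes it easy to plug in your own query counts and scan volumes before you get a surprise bill.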

Summary

While we learned the details of Athena pricing, we also saw how easy it would be to get hit with a giant bill unexpectedly. If you have just dumped a bunch of CSV or JSON files into S3 without compressing or reformatting your data to reduce costs, you can be in for a nasty surprise. The same goes for giving your data consumers unrestricted access to Athena: a lot of queries over a lot of data adds up quickly. Fortunately, it's not hard to estimate the cost of a specific usage pattern, and Amazon provides the tools to do it.

If you’re an Athena user who’s not happy with costs, you’re not alone. We see many Athena users wanting more control over their deployment and in turn, costs. That’s where we can help – Ahana is SaaS for Presto (the same technology that Athena is running) that gives you more control over your deployment. Typically our customers see up to 5.5X price performance improvements on their queries as compared to Athena. 

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article

Data Warehouse Architecture

5 Components of Data Warehouse Architecture


Tip: If you are struggling to get value from your data warehouse due to vendor lock-in or handling semi and unstructured data, set up a time to chat with an engineer about migrating to a Data Lakehouse.

What are the components of a data warehouse?

Most data warehouses will be built around a relational database system, either on-premise or in the cloud, where data is both stored and processed. Other components would include metadata management and an API connectivity layer allowing the warehouse to pull data from organizational sources and provide access to analytics and visualization tools.

A typical data warehouse is often described as having four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.

The data warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load).

To learn more about the internal architecture of a data warehouse and its various components such as nodes and clusters, check out our previous article on Redshift data warehouse architecture.

The schema is made up of metadata (data about the data), so users can easily find what they are looking for. The data warehouse can also connect to many different data sources, making it an easier way to manage all of a company's data for analysis.

As data warehouse architecture evolved and grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Here’s a quick high level overview of the data warehouse architecture:

Data Warehouse Architecture

In this article we'll look at the contextual requirements of data warehouse architecture, and the five components of a data warehouse.

The 5 components of a data warehouse architecture are:
  1. ETL
  2. Metadata
  3. SQL Query Processing
  4. Data layer
  5. Governance/security

ETL 

As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works, but there are many different resources where you can learn about ETL.

Metadata

Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include authors, dates, or locations of an article, create date of a file, the size of a file, etc. Think of it like the titles of a column in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.

SQL Query Processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.

Data Layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.

Governance/Security

This is related to the data layer in that you need to be able to provide fine grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

+ Data Warehouse Access Tools

While access tools are external to your data warehouse, they can be seen as its business-user friendly front end. This is where you’d find your reporting and visualization tools, used by data analysts and business users to interact with the data, extract insights, and create visualizations that the rest of the business can consume. Examples of these tools include Tableau, Looker, and Qlik.

Challenges with a Data Warehouse Architecture

Now that I've laid out the five key components of a data warehouse architecture, let's discuss some of the challenges of the data warehouse. As companies start housing more data and needing more advanced analytics across a wider range of data, the data warehouse becomes expensive and inflexible. If you want to analyze unstructured or semi-structured data, the data warehouse won't work.

We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Open Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:


The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads. If you are looking to upgrade from data warehouse architecture, then developing an Open Data Lakehouse is the way to go.


If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse, or read about the differences between a data lakehouse, a data mesh, and a data warehouse.


Related Articles

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understanding of what it is and how it's used.

Data Warehouse Concepts for Beginners

A relational database that is designed for query and analysis rather than for transaction processing. Learn more here.

What is a Data Lakehouse Architecture?

Overview

The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse architecture is, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.

What is a Data Lakehouse?

First, it’s best to explain a Data Warehouse and a Data Lake.

Data Warehouse

A data warehouse is one central place where you can store specific, structured data. Most of the time that's relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the Data Warehouse with very good price/performance. Using a data warehouse typically means you're locked into that warehouse's proprietary formats – the trade-off for the speed and price/performance is that your data is ingested into and locked inside that warehouse, so you lose the flexibility of a more open solution.

Data Lake

On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular data lake storage services are AWS S3, Azure Blob Storage, and Google Cloud Storage. Data Lakes are widely popular because they are very cheap and easy to use – you can store a virtually unlimited amount of any kind of data at very low cost. However, the data lake doesn't provide built-in mechanisms for query, analytics, etc. You need a query engine and data catalog on top of the data lake to query your data and make use of it (that's where Ahana Cloud comes in, but more on that later).


Data Lakehouse

Now let's look at the Data Lake vs the Lakehouse. A new data lakehouse architecture has emerged that takes the best of the Data Warehouse and the Data Lake: it's open, flexible, offers good price/performance, and scales like the Data Lake, while also supporting transactions and strong security like the Data Warehouse.

Data Lakehouse Architecture Explained

Here’s an example of a Data Lakehouse architecture:


You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools. 

There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.

Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in, which can provide that extreme performance to give you similar, if not better, price/performance for your queries.

Building your Data Lakehouse with Ahana Cloud for Presto

At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, a really easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:


Ahana comes built-in with a data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without having to manage the overhead – Ahana takes care of it for you under the hood. The stack also includes and integrates with transaction managers like Apache Hudi, Delta Lake, and AWS Lake Formation.

We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.

Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.

What is an Open Data Lake in the Cloud?

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses.


Webinar On-Demand
An introduction to Ahana Cloud for Presto on AWS

The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake. At the heart of the data lakehouse is Presto – the open source SQL query engine for ad hoc analytics on your data. During this webinar we will share how to build an open data lakehouse with Presto and AWS S3 using Ahana Cloud.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. Ahana is a managed service for Presto that gives data platform teams of all sizes the power of SQL for their data lakehouse.

In this webinar we will cover:

  • What an Open Data Lakehouse is
  • How you can use Presto to underpin the lakehouse in AWS
  • A demo on how to get started building your Open Data Lakehouse in AWS

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana


Data Lakehouse

Enterprise Data Lake Formation & Architecture on AWS

What is Enterprise Data Lake

An enterprise data lake is simply a data lake for enterprise-wide sharing and storing of data. Its key purpose is to enable analytics that unlock business insights from the stored data.

Why AWS Lake formation for Enterprise Data Lake

The key purpose of an enterprise data lake is to run analytics to gain business insights. As part of that process, governance of data becomes more important in order to secure access to data between different roles in the enterprise. AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for data lakes on AWS S3.

Enterprise Data Lake Formation & Architecture

Enterprise data platforms need a simpler, scalable, and centralized way to define and enforce access policies on their data lakes: a policy-based approach that lets data lake consumers use the analytics service of their choice, best suited to the operations they want to perform on the data. Although the existing method of using Amazon S3 bucket policies to manage access control is an option, it stops being practical for enterprises as the number of combinations of access levels and users grows.

Data Lake Storage diagram

AWS Lake Formation allows enterprises to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). 

Benefits of Lake formation for Enterprise Data Lakes

  •  One schema, shareable with no dependency on architecture
  •  Share Lake Formation databases and tables with any Amazon account
  •  No Amazon S3 policy edits required
  •  Receivers of the data can use an analytics service provider like Ahana to run analytics
  •  No dependency between roles on how the data will be further shared
  •  Centralized logging

AWS Enterprise Lake Formation: To Summarize

AWS Lake Formation has been integrated with AWS partners like Ahana Cloud, a managed service for SQL on data lakes. These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.

Related Articles

The Role of Blueprints in Lake Formation

A Lake Formation Blueprint allows you to easily stamp out and create workflows. Learn more about what is in this article.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become very popular over the last year or so. Learn more about what it is and how it’s used.


Webinar On-Demand
How to build an Open Data Lakehouse Analytics stack

As more companies are leveraging the Data Lake to run their warehouse workloads, we’re seeing many companies move to an Open Data Lakehouse stack. The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake.

Join us for this webinar where we’ll show you how you can build an open data lakehouse stack. At the heart of this stack is Presto, the open source SQL query engine for the data lake, and the transaction manager / governance layer, which includes technologies like Apache Hudi, Delta Lake, and AWS Lake Formation.

You’ll Learn:

  • What an Open Data Lakehouse Analytics Stack is
  • How Presto, the de facto query engine for the data lakehouse, underpins that stack
  • How to get started building your open data lakehouse analytics stack today

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana


Data Lakehouse

How to Query Your JSON Data Using Amazon Athena

Querying semi-structured data in Athena can get expensive. Ahana provides the same SQL-based interactive analytics on your S3 data, with predictable pricing and performance. Take control of your analytics – learn more about Ahana on our website or in a quick, no-strings-attached call with a solution architect.

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV, and JSON files, which are typically either queried directly or transformed and loaded into a data warehouse. Athena allows you to extract data from, search for values in, and parse JSON data.



Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table;

name    age   cars.car2.make
Sam     45    Toyota
Sally   21    SAAB
Bill    68    Porsche

That is a pretty simple use case of retrieving certain fields out of the JSON. The complexity was the cars column with its key/value pairs, where we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values”, or “array of key=value” expressions. We’ll illustrate the latter two next.
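Outside of Athena, the same dot-path extraction is easy to sketch with Python’s standard json module. This is an illustrative sketch of the logic only (not Athena itself), using the documents from the example above trimmed to the fields the query touches:

```python
import json

# The nested JSON documents from the example above, trimmed to the
# fields that SELECT name, age, cars.car2.make actually reads.
data = json.loads("""
[
  {"name": "Sam",   "age": 45, "cars": {"car2": {"make": "Toyota"}}},
  {"name": "Sally", "age": 21, "cars": {"car2": {"make": "SAAB"}}},
  {"name": "Bill",  "age": 68, "cars": {"car2": {"make": "Porsche"}}}
]
""")

# Equivalent of: SELECT name, age, cars.car2.make FROM the_table;
rows = [(d["name"], d["age"], d["cars"]["car2"]["make"]) for d in data]
for name, age, make in rows:
    print(name, age, make)
```

The comprehension mirrors the SELECT list: one output tuple per JSON document, with each dotted path becoming a chain of dictionary lookups.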

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) AS t(cars);

dealership     cars
Family Honda   Civic
Family Honda   Accord
Family Honda   Odyssey
Family Honda   Brio
Family Honda   Pilot
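CROSS JOIN UNNEST is essentially a flattening operation. A minimal stdlib Python sketch of the same result (the data is copied from the dealership example above; this illustrates the logic, not the Athena engine):

```python
# The dealership record from the example above.
record = {
    "dealership": "Family Honda",
    "models": ["Civic", "Accord", "Odyssey", "Brio", "Pilot"],
}

# Equivalent of:
#   SELECT dealership, cars FROM dataset
#   CROSS JOIN UNNEST(models) AS t(cars)
# Each array element is paired with the row's scalar columns.
rows = [(record["dealership"], model) for model in record["models"]]
for dealership, model in rows:
    print(dealership, model)
```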

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

SELECT dealership, employee_unnested FROM dataset
CROSS JOIN UNNEST(dataset.employee) AS t(employee_unnested);

dealership     employee_unnested
Family Honda   {name=Allan, dept=service, age=45}
Family Honda   {name=Bill, dept=sales, age=52}
Family Honda   {name=Karen, dept=finance, age=32}
Family Honda   {name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

SELECT dealership, employee_unnested.name, employee_unnested.dept, employee_unnested.age FROM dataset
CROSS JOIN UNNEST(dataset.employee) AS t(employee_unnested);

dealership     name    dept      age
Family Honda   Allan   service   45
Family Honda   Bill    sales     52
Family Honda   Karen   finance   32
Family Honda   Terry   admin     27

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous AWS Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 per terabyte scanned, depending on the region. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.
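Since Athena’s bill is driven by bytes scanned, a quick back-of-the-envelope estimate makes the growth easy to see. This is a hypothetical helper using the per-terabyte figures above; actual rates vary by region:

```python
def athena_scan_cost(tb_scanned: float, price_per_tb: float = 5.0) -> float:
    """Estimate an Athena bill (USD) from terabytes scanned."""
    return tb_scanned * price_per_tb

# 10 TB of scans per day for 30 days, at the low end of the range:
print(athena_scan_cost(10 * 30))       # 1500.0
# The same scans at the high end of the range:
print(athena_scan_cost(10 * 30, 7.0))  # 2100.0
```

Converting raw CSV/JSON to a columnar format like Parquet shrinks the bytes each query scans, which is why the format conversions discussed elsewhere in this post directly reduce cost.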


Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Related Articles

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.

Tutorial: How to run SQL queries with Presto on BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Setup a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana’s managed service for Presto.

Step 2: Setup a Google BigQuery Project with Google Cloud Platform

Create a Google BigQuery project from Google Cloud Console and make sure it’s up and running with dataset and tables as described here.

The screen below shows a Google BigQuery project with the table “Flights”:


Step 3: Set up a key and download Google BigQuery credential JSON file.

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here

A sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x505/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro-Tip: Before you move to the next step, try using your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This confirms that your credentials have valid access rights and helps isolate any issue with the credentials themselves.
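Before wiring the file into Presto, it can also save time to sanity-check that the key file has the fields shown above. This is a hypothetical helper; the key list is taken from the sample file and is an informal checklist, not an official schema validation:

```python
import json

# Keys present in the sample service-account file above; treat this
# as an informal checklist, not an official schema.
REQUIRED_KEYS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "token_uri",
}

def missing_credential_keys(path: str) -> set:
    """Return the checklist keys absent from a service-account JSON file."""
    with open(path) as f:
        creds = json.load(f)
    return REQUIRED_KEYS - creds.keys()

# Usage: an empty result means the basic structure looks right.
# missing = missing_credential_keys("bigquery-credentials.json")
```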

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done by editing the config map so the change is reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

Following are the catalog properties that need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=patch/for/bigquery-credentials.json

Following are sample entries for the catalog yaml file:

bigquery.properties: |
  connector.name=bigquery
  bigquery.project-id=poised-journey-317806
  bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector,

  1. Load the content of the credential file as bigquery-credential.json in the Presto coordinator’s configmap:

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  2. Add a new volumeMounts section for the credential file in the coordinator’s deployment file:

    kubectl edit deployment presto-coordinator -n <cluster_name> 

Following is a sample configuration that you can append at the end of the volumeMounts section in your coordinator’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  3. Load the content of the credential file as bigquery-credential.json in the Presto worker’s configmap:

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  4. Add a new volumeMounts section for the credential file in the worker’s deployment file:

kubectl edit deployment presto-worker -n <cluster_name> 

Following is a sample configuration that you can append at the end of the volumeMounts section in your worker’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json

Step 6: Setup database connection with Apache Superset

Create your database connection URL to query from Superset with the syntax below:

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>


Step 7: Check for available datasets, schemas and tables, etc

After successfully connecting the database to Superset, run the following queries to make sure the bigquery catalog gets picked up, and use show schemas and show tables to understand the available data.

show catalogs;


show schemas from bigquery;


show tables from bigquery.rohan88;


Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT 1;


You can run similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

The following example shows how you can join a Google BigQuery table with a Hive table from S3 and run SQL queries across both.


At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Related Articles

How do I query a data lake with Presto?

Learn how to get up and running with Presto.

What is Presto (and FAQ’s about Presto)

This article will explain what Presto is and what it’s used for.

What is an Open Data Lake in the Cloud?

Data Lake in the Cloud

Problems that necessitate a data lake

In today’s competitive landscape, more and more companies are leveraging their data to make better decisions, provide value to their customers, and improve their operations. This is obvious when you consider the environment in which these companies operate. Data-driven insights can help business and product leaders hone in on customer needs and find untapped opportunities through the development of evidence-based strategies. Analytics dashboards can also be presented to customers for added value.

Traditionally, insights are gleaned from rather small amounts of enterprise data which is what you would expect – historical information about products, services, customers, and sales. But now, the modern business must deal with 1000s of times more data, which encompasses more types of data formats and is far beyond Enterprise Data. Some current examples include 3rd party data feeds, IoT sensor data, event data, geospatial and other telemetry data.

The problem with having 1000s of times the data is that databases, and specifically data warehouses, can be very expensive when used to handle this amount. And to add to this, data warehouses are optimized to handle relational data with a well-defined structure and schema. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from them altogether. 

Data Lake insights

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture. 

The Open Data Lake in the cloud centers on S3-based object storage. In AWS, there can be many S3 buckets across an organization. In Google Cloud, the equivalent service is Google Cloud Storage (GCS), and in Microsoft Azure it is Azure Blob Storage. The data lake can store the relational data that typically comes from business apps, just as the data warehouse does. But the data lake also stores non-relational data from a variety of sources, as mentioned above. The data lake can store structured, semi-structured, and/or unstructured data.

With all this data stored in the data lake, companies can run different types of analytics directly, such as SQL queries, real-time analytics, and AI/Machine Learning. A metadata catalog of the data enables the analytics of the non-relational data. 

Why Open for a Data Lake

As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best-practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats of a modern data infrastructure are open, such as Apache Parquet and ORC. They are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, it would follow to run open source engines like Presto on it. Ahana Cloud is a Presto managed service which makes it easy, secure, and cost efficient to run SQL on the Open Data Lake. 

If you want to learn more about why you should be thinking about building an Open Data Lake in the cloud, check out our free whitepaper on Unlocking the Business Value of the Data Lake – how open and flexible cloud services help provide value from data lakes.

Helpful Links

Best Practices for Resource Management in PrestoDB

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

Data Lakehouse

AWS Athena vs AWS Glue: What Are The Differences?

If you’re looking to improve your cloud architecture, you need to check out Ahana. Rather than struggling with the limitations of other interactive query tools, you can build your analytics foundation on the same open-source SQL engine that powers petabyte-scale queries at Meta and Uber – and use Ahana to hit the ground running with a managed platform. Find out why companies choose Ahana over Athena by scheduling a call with an Ahana solution architect.

Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

Here is how the two compare, feature by feature:

  • Implementation – Athena: serverless implementation of Presto; Glue: ecosystem of tools for schema discovery and ETL
  • Querying – Athena: primarily used as a query tool for analytics; Glue: more of a transformation and data movement tool
  • Components – Athena: only includes Athena itself; Glue: includes Glue Metastore and Glue ETL
  • Metadata – Athena: uses the AWS Glue Catalog as a metadata catalog; Glue: used as a central Hive-compatible metadata catalog
  • Data format – Athena: supports structured and unstructured data; Glue: supports CSV, Parquet, ORC, Avro, or JSON
  • Pricing – Athena: costs $5 per terabyte scanned; Glue: priced purely at instance hours

What is AWS Athena?

AWS Athena is a serverless implementation of Presto, an interactive query engine that allows you to query structured or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but is more of an ecosystem of tools that lets you easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or by editing the script. The most commonly known components of Glue are the Glue Metastore and Glue ETL. The Glue Metastore is a serverless, Hive-compatible metastore which can be used in lieu of your own managed Hive. Glue ETL, on the other hand, is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability, and operationalization of the underlying Spark infrastructure. There are other services, such as Glue Data Wrangler, which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena to AWS Glue working with Athena is the Glue Catalog. The Glue Catalog is used as a central Hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services: Glue ETL, Athena, EMR, Lake Formation, AI/ML, etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics, while Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. The table must have a property added to it called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena would be:

CREATE EXTERNAL TABLE sampleTable (
  col1 INT,
  col2 INT,
  str1 STRING
) STORED AS AVRO
TBLPROPERTIES (
  'classification'='avro');

  • Transforming data into a format that is better optimized for query performance in Athena, which will also reduce cost. For example, converting a CSV or JSON file into Parquet.

Query S3 Using Athena & Glue

Now how about querying S3 data utilizing both Athena and Glue? There are a few steps to set it up, first, we’ll assume a simple CSV file with IoT data in it, such as:


We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’, the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';

From here, you can perform any query you want. You can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. As a data analyst using Athena, you are insulated from the details of the backend, while the data engineers can optimize the source data for speed and cost using Glue.
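Conceptually, the Athena query above is doing what this stdlib Python sketch does over the CSV. The att1/att2 columns follow the query in the text; device_id and the row values are made-up illustrative data, since the original screenshot of the file is not reproduced here:

```python
import csv
import io

# A stand-in for the IoT CSV uploaded to S3 (illustrative rows).
raw = """device_id,att1,att2
dev1,A,Z
dev2,B,Y
dev3,C,Z
"""

# Equivalent of: SELECT * FROM my_table WHERE att2 = 'Z';
rows = [r for r in csv.DictReader(io.StringIO(raw)) if r["att2"] == "Z"]
for r in rows:
    print(r["device_id"], r["att1"], r["att2"])
```

The difference in practice is that Athena pushes this scan-and-filter work to a distributed engine over S3, with the Glue Catalog supplying the schema that DictReader gets here from the header row.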

AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and minimal effort. Athena however quickly runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue Metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned. Ahana is priced purely at instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Best Practices for Resource Management in PrestoDB

Data Lakehouse

Resource management in databases allows administrators to control resources and assign priorities to sessions, ensuring the most important transactions get the major share of system resources. In a distributed environment, resource management makes data more accessible and manages resources across a network of autonomous computers (i.e., a distributed system). Resource sharing is also at the basis of resource management in distributed systems.

PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Written for the Hadoop ecosystem, PrestoDB is built to scale to tens of thousands of nodes and process petabytes of data. In order to be usable at a production scale, PrestoDB was built to serve thousands of queries to multiple users without facing bottle-necking and “noisy neighbor” issues. PrestoDB makes use of resource groups in order to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups as well as best practices and considerations to think about before setting up a production system with resource grouping.

Getting Started

Presto has multiple “resources” whose quotas it can manage. The two main resources are CPU and memory. Additionally, more granular resource constraints can be specified, such as concurrency, time, and cpuTime. All of this is done via a pretty ugly JSON configuration file, shown in the example below from the PrestoDB doc pages.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "jmxExport": true,
      "subGroups": [
        {
          "name": "data_definition",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            },
            {
              "name": "bi-${tool_name}",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 10,
              "maxQueued": 100,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 3,
                  "maxQueued": 10
                }
              ]
            }
          ]
        },
        {
          "name": "pipeline",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 45,
          "maxQueued": 100,
          "schedulingWeight": 1,
          "jmxExport": true,
          "subGroups": [
            {
              "name": "pipeline_${USER}",
              "softMemoryLimit": "50%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 100
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "bob",
      "group": "admin"
    },
    {
      "source": ".*pipeline.*",
      "queryType": "DATA_DEFINITION",
      "group": "global.data_definition"
    },
    {
      "source": ".*pipeline.*",
      "group": "global.pipeline.pipeline_${USER}"
    },
    {
      "source": "jdbc#(?<tool_name>.*)",
      "clientTags": ["hipri"],
      "group": "global.adhoc.bi-${tool_name}.${USER}"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ],
  "cpuQuotaPeriod": "1h"
}

Okay, so there is clearly a LOT going on here, so let's start with the basics and work our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limits.

Penalties

Presto doesn't enforce resource limits at execution time. Rather, Presto introduces the concept of a 'penalty' for users who exceed their resource specification. For example, if user 'bob' were to kick off a huge query that ended up taking vastly more CPU time than allotted, then 'bob' would incur a penalty, which translates to an amount of time that bob's queries would be forced to wait in a queued state before they could run again. To see this scenario in action, let's split the cluster resources in half and see what happens when two users each attempt to submit 5 queries at the same time.

Resource Group Specifications

The example below is a resource specification of how to evenly distribute CPU resources between two different users.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1h"
}

The above resource config defines two main resource groups called ‘query1’ and ‘query2’. These groups will serve as buckets for the different queries/users. A few parameters are at work here:

  • hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
  • maxQueued sets the limit on how many queries can be queued
  • schedulingPolicy ‘fair’ determines how queries within the same group are prioritized

Kicking off a single query as each user has no effect, but subsequent queries will stay QUEUED until the first completes. This at least confirms the hardConcurrencyLimit setting. Testing queuing 6 queries also shows that the maxQueued is working as intended as well.
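To make those semantics concrete, here is a small Python sketch of the admission behavior. It is a toy model of how hardConcurrencyLimit and maxQueued interact, not Presto's actual scheduler:

```python
# Toy model of resource-group admission: a query either RUNs immediately,
# waits in the QUEUE, or is REJECTED when the queue is full.
# This mirrors the hardConcurrencyLimit / maxQueued semantics described
# above; it is not Presto's real implementation.

def admit(running: int, queued: int,
          hard_concurrency_limit: int, max_queued: int) -> str:
    if running < hard_concurrency_limit:
        return "RUN"
    if queued < max_queued:
        return "QUEUED"
    return "REJECTED"

# With hardConcurrencyLimit=1 and maxQueued=5 (the 'query1' group above):
# the first query runs, the next five queue, and a seventh is rejected.
states = []
running, queued = 0, 0
for _ in range(7):
    decision = admit(running, queued, 1, 5)
    states.append(decision)
    if decision == "RUN":
        running += 1
    elif decision == "QUEUED":
        queued += 1

print(states)
# ['RUN', 'QUEUED', 'QUEUED', 'QUEUED', 'QUEUED', 'QUEUED', 'REJECTED']
```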

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

Introducing the soft CPU limit will penalize any query that is caught using too much CPU time in a given CPU quota period. Here the cpuQuotaPeriod is set to 1 minute and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, subsequent queries were queued for an inordinately long time. A look at the Presto source code shows why. The softCpuLimit and hardCpuLimit are based on the total number of cores combined with the cpuQuotaPeriod. For example, on a 10 node cluster of r5.2xlarge instances, each Presto worker node has 8 vCPU. This gives a total of 80 vCPU for the cluster, which results in 80 vCPU-minutes in the given 1-minute cpuQuotaPeriod. Therefore, the correct values are shown below.
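The sizing arithmetic can be checked in a couple of lines of Python (the node counts below are just this example's assumptions):

```python
# softCpuLimit / hardCpuLimit budgets are expressed in total vCPU-time
# available across the cluster per cpuQuotaPeriod.
worker_nodes = 10         # example cluster: 10 r5.2xlarge workers
vcpus_per_node = 8        # r5.2xlarge has 8 vCPU
quota_period_minutes = 1  # cpuQuotaPeriod = "1m"

total_vcpu_minutes = worker_nodes * vcpus_per_node * quota_period_minutes
print(total_vcpu_minutes)       # 80 vCPU-minutes per quota period
print(total_vcpu_minutes // 2)  # 40 -> a fair softCpuLimit of "40m" per group
```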

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

In testing, the above resource group spec results in two queries completing, using a total of 127 minutes of CPU time. From there, all further queries block for about 2 minutes before they run again. This blocked time adds up because for every minute of cpuQuotaPeriod, each group is granted 40 minutes back on its penalty. Since the first minute's queries exceeded the quota by 80+ minutes, it takes about 2 cpuQuotaPeriods to bring the penalty back down toward zero so queries can be submitted again.
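The arithmetic behind that roughly two-minute block can be sketched directly; the numbers come from the test run above, so this is only an illustrative model of the penalty behavior:

```python
# Illustrative model of the resource-group penalty described above.
cpu_used_minutes = 127   # CPU time consumed by the two completed queries
soft_cpu_limit = 40      # per-group softCpuLimit ("40m")
refund_per_period = 40   # penalty paid back each 1-minute cpuQuotaPeriod

excess = cpu_used_minutes - soft_cpu_limit
print(excess)  # 87 -> the "80+ minutes" the quota was exceeded by

# Each quota period refunds 40 minutes of penalty, so after two periods
# (~2 minutes of wall time) the penalty is nearly cleared and queries run.
remaining_after_two_periods = excess - 2 * refund_per_period
print(remaining_after_two_periods)  # 7
```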

Conclusion

Resource group implementation in Presto definitely has some room for improvement. Most obviously, ad hoc users who may not understand the cost of a query before executing it will be heavily penalized until they submit only very low cost queries. However, this mechanism limits the damage that a single user can inflict on a cluster over an extended duration, and it averages out in the long run. Overall, resource groups are better suited for scheduled workloads that depend on variable input data, so that a scheduled job doesn't arbitrarily end up taking over a large chunk of resources. For resource partitioning between multiple users or teams, the best approach still seems to be running and maintaining multiple segregated Presto clusters.


Ready to get started with Presto? Check out our tutorial series where we cover the basics: Presto 101: Installing & Configuring Presto locally.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become popular over the last year as more customers migrate their workloads to the cloud. This article explains what a Data Lakehouse is, the common architecture, and how companies are using it in production today.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

Query editor

Querying Parquet Files using Amazon Athena

Struggling to maintain consistent performance in Amazon Athena? Ahana is the Presto solution that gives you control and scalability. Increase the number of nodes to handle any amount of queries through a simple dashboard to get all the benefits of Athena, without the limitations. Learn more or schedule a demo now!

If you’re working with large volumes of data, you’re likely using the Apache Parquet file format. Here are steps to quickly get set up and query your parquet files with a service like Amazon Athena.

Is Parquet better than CSV for Athena?

Generally, the answer is yes. Parquet is a modern file format with many advantages over commonly used formats like CSV and JSON. Specifically, Parquet's speed and efficiency at storing large volumes of data in a columnar layout are big advantages that have made it widely used. It supports many optimizations and stores metadata about its internal contents to support fast lookups and searches by modern distributed query/compute engines like PrestoDB, Spark, Drill, etc.

How to query Parquet files

Parquet is a columnar storage format optimized for analytical querying. Data warehouses such as Redshift support Parquet for optimized performance. Parquet files stored on Amazon S3 can be queried directly using regular SQL by serverless query engines such as Amazon Athena or open source Presto.

Below we show a simple example of running such a query.

Prerequisites

  • Sample Parquet data: download from Ahana
  • AWS accounts & roles (required): Amazon S3, Amazon Athena
  • AWS accounts & roles (optional): AWS Glue (optional but highly recommended)

Setting up the Storage

For this example we will be querying the parquet files from AWS S3. To do this, we must first upload the sample data to an S3 bucket. 

Log in to your AWS account and select the S3 service in the Amazon Console.

  1. Click on Create Bucket
  2. Choose a name that is unique. For this example I chose 'athena-parquet-<your-initials>'. S3 bucket names are global, so include a unique identifier to avoid clashing with an existing bucket.
  3. Scroll to the bottom and click Create Bucket
  4. Click on your newly created bucket
  5. Create a folder in the S3 bucket called ‘test-data’
  6. Click on the newly created folder
  7. Choose Upload Data and upload your parquet file(s).

Running a Glue Crawler

Now that the data is in S3, we need to define the metadata for the file. This can be tedious and involve using a different reader program to read the parquet file to understand the various column field names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata auto-magically. To do this, first navigate to the AWS Glue service in the AWS Console.

  1. On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
  2. Click Add Crawler
  3. Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
  4. In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
  5. In Data Store under Include Path, type in the URL of your S3 bucket. It should be something like ‘s3://athena-parquet-<your-initials>/test-data/’.
  6. In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
  7. For Frequency leave the setting as default and choose Next
  8. For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
  9. Review and then choose Finish.
  10. AWS will prompt you if you would like to run the crawler. Choose Run it now or manually run the crawler by refreshing the page and selecting the crawler and choosing the action Run.
  11. Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.

Querying the Parquet file from AWS Athena

Now that the data and the metadata are created, we can use AWS Athena to query the parquet file. Choose the Athena service in the AWS Console.

  1. Choose Explore the Query Editor and it will take you to a page where you should immediately see a UI like this:
Query Editor
  2. Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to 's3://athena-parquet-<your-initials>/test-results/'.
  3. Go back to the Editor and type the following statement: 'SELECT * FROM test_data LIMIT 10;' The table name will be based on the folder name you chose in the S3 storage step.
  4. The final result should look something like this:
Query Editor v2
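The same query can also be submitted programmatically. Below is a minimal sketch using boto3, the AWS SDK for Python; the bucket naming, database name, and helper function are this tutorial's assumptions, and the final call requires valid AWS credentials:

```python
# Minimal sketch of running the tutorial's query through the Athena API.
# The bucket and database names match this tutorial and are assumptions.

def tutorial_query_params(initials: str) -> dict:
    """Parameters for Athena's StartQueryExecution call (hypothetical helper)."""
    return {
        "QueryString": "SELECT * FROM test_data LIMIT 10",
        "QueryExecutionContext": {"Database": "athena-parquet"},
        "ResultConfiguration": {
            "OutputLocation": f"s3://athena-parquet-{initials}/test-results/"
        },
    }

def run_tutorial_query(initials: str) -> str:
    """Submit the query and return its execution id (needs AWS credentials)."""
    import boto3  # imported here so the sketch loads without boto3 installed
    athena = boto3.client("athena")
    response = athena.start_query_execution(**tutorial_query_params(initials))
    return response["QueryExecutionId"]

params = tutorial_query_params("abc")
print(params["ResultConfiguration"]["OutputLocation"])
# s3://athena-parquet-abc/test-results/
```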

Conclusion

Some of these steps, like using Glue Crawlers, aren't required, but they are a better approach for handling Parquet files where the schema definition is unknown. Athena itself is a pretty handy service for getting hands-on with the files themselves, but it does come with some limitations.

Those limitations include concurrency limits, price performance impact, and no control of your deployment. Many companies are moving to a managed service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL on S3 needs.

Learn how you can get better price/performance when querying S3: schedule a free consultation call with an Ahana solution architect.

Configuring RaptorX – a multi-level caching with Presto


RaptorX Background and Context

Meta introduced a multi-level cache at PrestoCon 2021. Code-named the "RaptorX Project," it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to make it usable for the community by fixing open issues and by tuning and testing heavily with other workloads. This is a unique and very powerful feature only available in PrestoDB and not in other versions or forks of the Presto project.

Presto is a disaggregated compute-storage query engine, which helps customers and cloud providers scale independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, since scanning huge amounts of data between the storage tier and the compute tier is IO-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to not perform any I/O at all. This means that memory utilization and caching structures are of utmost importance.

Let's walk through the normal workflow of the Presto Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition info)
  2. Scheduler sends requests to remote storage to get a list of files and does the scheduling
  3. On the worker node, first, it receives the list of files from the scheduler and sends a request to remote storage to open a file and read the file footers
  4. Based on the footer, Presto determines which data blocks or chunks need to be read from remote storage
  5. Once workers read them, Presto performs computation on the leaf worker nodes based on join or aggregation and does the shuffle back to send query results to the client.

These steps involve a lot of RPC calls, not just to the Hive Metastore to get partition information but also to the remote storage to list files, schedule those files, open them, and then retrieve and read the data. Each of these IO paths in the Hive connector is a bottleneck on query performance, which is why RaptorX builds a multi-layer cache, so that you can maximize the cache hit rate and boost query performance.

RaptorX introduces a total of five types of caches plus a scheduler. This cache system is only applicable to Hive connectors.

Multi-layer Cache       | Type       | Affinity Scheduling | Benefits
Data IO                 | Local disk | Required            | Reduced query latency
Intermediate Result Set | Local disk | Required            | Reduced query latency and CPU utilization for aggregation queries
File Metadata           | In-memory  | Required            | Reduced CPU utilization and query latency
Metastore               | In-memory  | N/A                 | Reduced query latency
File List               | In-memory  | N/A                 | Reduced query latency
Table: Summary of Presto Multi-Layer Cache Implementation

Further, this article explains how you can configure and test various layers of RaptorX cache in your Presto cluster.

#1 Data(IO) cache

This cache makes use of a library built on the Alluxio LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB. It uses an LRU policy for evictions, and local disks are required to enable this cache.

To enable this cache, update the worker configuration with the properties below in

etc/catalog/<catalog-name>.properties 

cache.enabled=true 
cache.type=ALLUXIO 
cache.alluxio.max-cache-size=150GB 
cache.base-directory=file:///mnt/disk1/cache

The max cache size can be adjusted based on your requirements.

Also add the Alluxio property below to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto

#2 Fragment result set cache

This is an intermediate result set cache that stores partially computed result sets on the worker's local SSD drive. It prevents duplicated computation across repeated queries, which improves query performance and decreases CPU usage.

Add the following properties in etc/config.properties:

fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h

#3 Metastore cache

A Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of cached metadata.

To enable the metastore cache, set the properties below in etc/catalog/<catalog-name>.properties:

hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

#4 File List cache

A Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only in-memory cache.

Enable the file list cache by setting the properties below in

etc/catalog/<catalog-name>.properties 

# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*

#5 File metadata cache

This cache stores open file descriptors and stripe/file footer information in worker memory. These pieces of data are accessed most frequently when reading files. This cache is useful not just for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache suitable for the ORC and Parquet file formats.

For ORC, it includes the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches the file and block level metadata.

To enable the file metadata cache, set the properties below in etc/catalog/<catalog-name>.properties:

# For ORC metadata cache: 
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h

The <catalog-name> in the above configuration should be replaced by the catalog name that you are setting these in. For example, If the catalog properties file name is ahana_hive.properties then it should be replaced with “ahana_hive”. 
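As a quick illustration of that substitution, the hypothetical Python helper below renders the Parquet cache properties for a given catalog name:

```python
# Render the Parquet metadata-cache properties for a given catalog name,
# illustrating the <catalog-name> substitution described above.
# This helper is only for illustration; in practice you edit the
# properties file by hand.
PARQUET_CACHE_TEMPLATE = [
    "{catalog}.parquet.metadata-cache-enabled=true",
    "{catalog}.parquet.metadata-cache-size=100MB",
    "{catalog}.parquet.metadata-cache-ttl-since-last-access=6h",
]

def render_properties(catalog: str) -> str:
    return "\n".join(line.format(catalog=catalog) for line in PARQUET_CACHE_TEMPLATE)

print(render_properties("ahana_hive"))
# ahana_hive.parquet.metadata-cache-enabled=true
# ahana_hive.parquet.metadata-cache-size=100MB
# ahana_hive.parquet.metadata-cache-ttl-since-last-access=6h
```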

#6 Affinity scheduler

With affinity scheduling, the Presto coordinator schedules requests that process certain data/files to the same Presto worker node to maximize cache hits. Consistently sending requests for the same data to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling, so this property must be enabled for the RaptorX Data IO, fragment result, and file metadata caches to work.

To enable the affinity scheduler, set the property below in etc/catalog/<catalog-name>.properties:

hive.node-selection-strategy=SOFT_AFFINITY
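To see why this matters, here is a toy Python sketch of the idea behind affinity scheduling; the hashing scheme is purely illustrative, not Presto's actual algorithm:

```python
# Toy illustration of affinity scheduling: map each file path to a
# preferred worker deterministically, so repeated reads of the same file
# land on the same node (and hit its local cache). Presto's real
# implementation is more sophisticated; this only shows the idea.
import hashlib

WORKERS = ["worker-1", "worker-2", "worker-3"]

def preferred_worker(file_path: str) -> str:
    digest = hashlib.md5(file_path.encode()).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]

# The same file always maps to the same worker, maximizing cache hits.
path = "s3://bucket/part-0001.orc"
assert preferred_worker(path) == preferred_worker(path)
print(preferred_worker(path))
```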

How can you test or debug your RaptorX cache setup with JMX metrics?

Each section below describes the queries to run and the JMX metrics tables to query to verify cache usage.

Note: If your catalog is not named ‘ahana_hive’, you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.

Data IO Cache

Queries to trigger Data IO cache usage

USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;

Queries to verify Data IO data cache usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages(of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';

Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE] 
Output layout: [count_3] Output partitioning: SINGLE [] Stage Execution 
Strategy: UNGROUPED_EXECUTION 
- Aggregate(PARTIAL) => [count_3:bigint] count_3 := "presto.default.count"(*) 
- TableScan[TableHandle {connectorId='hive', 
connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, 
analyzePartitionValues=Optional.empty}', 
layout='Optional[default.customer_orc{}]'}, gr Estimates: {rows: 150000 (0B), 
cpu: 0.00, memory: 0.00, network: 0.00} LAYOUT: default.customer_orc{}

Queries to trigger fragment result cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query Fragment Set Result cache JMX metrics.

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";

ORC metadata cache

Queries to trigger ORC cache usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query ORC Metadata cache JMX metrics

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";

Parquet metadata cache

Queries to trigger Parquet metadata cache

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;

Query Parquet Metadata cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";

File List cache

Query File List cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";

In addition to this, we have enabled these multi-layer caches on Presto for Ahana Cloud by adding S3 support as the external filesystem for the Data IO cache, along with more optimized scheduling and tooling to visualize cache usage. 

Figure: Multi-level Data Lake Caching with RaptorX

Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all these steps so that data platform users can enable Data Lake caching seamlessly with just one click. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It's SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar where we share how you can build an Open Data Lake Analytics stack.


The Role of Blueprints in Lake Formation on AWS

Why does this matter?

There are two major steps to creating a Data Lakehouse on AWS: the first is to set up your S3-based Data Lake, and the second is to run analytical queries on your data lake. A popular SQL engine you can use is Presto. This article focuses on the first step and how AWS Lake Formation Blueprints can make it easy and automated. Before you can run analytics to get insights, you need your data continuously pooling into your lake!

AWS Lake Formation helps with the time-consuming data wrangling involved in maintaining a Data Lake, making it simple and secure. Lake Formation includes the Workflows feature; a workflow encompasses a complex set of ETL jobs that load and update data. 


What is a Blueprint?

A Lake Formation Blueprint allows you to easily stamp out and create workflows. This is an automation capability within Lake Formation. There are 3 types: Database snapshots, incremental database, and log file blueprints.

The database blueprints support automated data ingestion from sources like MySQL, PostgreSQL, and SQL Server into the Open Data Lake. It's a point-and-click service with simple forms in the AWS console.

A Database snapshot blueprint does what it sounds like: it loads all the tables from a JDBC source into your lake. This is good when you want time-stamped end-of-period snapshots to compare later.

An Incremental database blueprint also does what it sounds like, taking only the new data, or deltas, into the data lake. This is faster and keeps the latest data in your data lake. The Incremental database blueprint uses bookmarks on columns for each successive incremental run. 

The Log file blueprint takes logs from various sources and loads them into the data lake. ELB logs, ALB logs, and CloudTrail logs are examples of popular log files that can be loaded in bulk. 

Summary and how about Ahana Cloud?

Getting data into your data lake is easy, automated, and consistent with AWS Lake Formation. Once you have your data ingested, you can use a managed service like Ahana Cloud for Presto to enable fast queries on your data lake to derive important insights for your users. Ahana Cloud has integrations with AWS Lake Formation governance and security policies. See that page here: https://ahana.io/aws-lake-formation 


Presto equivalent of mysql group_concat

As you may know, PrestoDB supports ANSI SQL and includes support for several SQL dialects, including the MySQL dialect, making it easy to group and aggregate data in a variety of ways. However, not ALL MySQL functions are supported by PrestoDB! 


Now, let us look at the really useful MySQL and MariaDB SQL function GROUP_CONCAT(). This function is used to concatenate data in column(s) from multiple rows into one field. It is an aggregate (GROUP BY) function that returns a String, assuming the group contains at least one non-NULL value (otherwise it returns NULL). GROUP_CONCAT() is an example of a function that is not yet supported by PrestoDB, and this is the error you will see if you try using it to get a list of customers that have ordered something, along with their order priorities:

presto> use tpch.sf1;

presto:sf1> select custkey, GROUP_CONCAT(DISTINCT orderpriority ORDER BY orderpriority SEPARATOR ',') as OrderPriorities from orders GROUP BY custkey;

Query 20200925_105949_00013_68x9u failed: line 1:16: Function group_concat not registered

Is there a way to handle this? If so what’s the workaround? There is!

array_join() and array_agg() to the rescue! 

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority)),',') as OrderPriorities from orders group by custkey;
 custkey |                OrderPriorities                 
---------+------------------------------------------------
   69577 | 2-HIGH,1-URGENT,3-MEDIUM,5-LOW,4-NOT SPECIFIED 
   52156 | 4-NOT SPECIFIED,3-MEDIUM,1-URGENT,5-LOW,2-HIGH 
  108388 | 5-LOW,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,1-URGENT 
  111874 | 5-LOW,1-URGENT,2-HIGH,4-NOT SPECIFIED          
  108616 | 1-URGENT,5-LOW,4-NOT SPECIFIED,3-MEDIUM,2-HIGH 
(only the first 5 rows displayed) 

If you do not want to use the DISTINCT operator (you want duplicates in your result set in other words) then there is an easy solution. To skip the DISTINCT operator, simply drop the array_distinct() function from your query:

presto:sf1> select custkey,array_join(array_agg(orderpriority),',') as OrderPriorities from orders group by custkey;
 custkey | OrderPriorities                             
---------+-------------------------------------------------------------------------------- 
   24499 | 5-LOW,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,2-HIGH,4-NOT SPECIFIED,3-MEDIUM,1-URGENT,2-HIGH,3-MEDIUM,1-URGENT,5-LOW,3-MEDIUM,4-NOT SPECIFIED,4-NOT SPECIFIED,4-NOT SPECIFIED,3-MEDIUM,3-MEDIUM,5-LOW,1-URGENT,1-URGENT,4-NOT SPECIFIE
   58279 | 4-NOT SPECIFIED,2-HIGH,5-LOW,1-URGENT,1-URGENT,5-LOW,5-LOW,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,5-LOW,3-MEDIUM,1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,5-LOW,5-LOW,3-MEDIUM,3-MEDIUM,1-URGENT,3-MEDIUM,2-HIGH,5-LOW
  142027 | 1-URGENT,2-HIGH,2-HIGH,1-URGENT,3-MEDIUM,1-URGENT,5-LOW,4-NOT SPECIFIED,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,2-HIGH,1-URGENT,3-MEDIUM,5-LOW,3-MEDIUM,4-NOT SPECIFIED,2-HIGH,1-URGENT,5-LOW,2-HIGH,5-LOW,1-URGENT,4-NOT SPECIFIED,2-HIG
   94169 | 1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,5-LOW,4-NOT SPECIFIED,2-HIGH,5-LOW,4-NOT SPECIFIED                                                                                                                                                                                                        
   31607 | 4-NOT SPECIFIED,2-HIGH,4-NOT SPECIFIED,2-HIGH,2-HIGH,5-LOW 

You can, of course, specify the separator character. In the examples shown above, I used a comma as the separator.
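If it helps to see the semantics outside of SQL, the workaround behaves like this small Python model (an illustration only, not how Presto implements these functions):

```python
# Python model of the Presto workaround:
#   array_join(array_distinct(array_agg(col)), ',')
# array_agg collects values, array_distinct keeps the first occurrence of
# each value (preserving order), and array_join concatenates with a separator.

def array_agg(values):
    return list(values)

def array_distinct(arr):
    return list(dict.fromkeys(arr))  # order-preserving de-duplication

def array_join(arr, sep):
    return sep.join(arr)

rows = ["2-HIGH", "1-URGENT", "2-HIGH", "5-LOW", "1-URGENT"]
print(array_join(array_distinct(array_agg(rows)), ","))
# 2-HIGH,1-URGENT,5-LOW
```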

It's worth noting that, like PrestoDB, Microsoft SQL Server historically had no T-SQL equivalent of MySQL's GROUP_CONCAT() function. However, T-SQL now has the STRING_AGG() function, available from SQL Server 2017 onwards.

And hey, presto, you now have a working Presto equivalent of mysql group_concat.


Understanding the Presto equivalent of mysql group_concat – Now what?

If you are looking for more tips and tricks, here's your next step. Check out our Answers section to learn more about PrestoDB, competitor reviews and comparisons, and more technical guides and resources to get you started.


Related Articles

What is Presto?

What’s Presto, how did it start, and what is it for? Ready to answer these questions? Take a deeper dive into understanding Presto. Learn what PrestoDB is, how it got started, and the benefits for Presto users.

How to Build a Data Lake Using Lake Formation on AWS

What is AWS Lake Formation? AWS lake formation helps users to build, manage, and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach. Learn more about AWS Lake Formation, including the pros and cons of Amazon Lake Formation.

Data Warehouse: A Comprehensive Guide

Looking to learn more about data warehouses? Start here for a deeper look. A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understanding of what a data warehouse is, how it’s used, and its pros and cons compared to a data lake.

AWS Athena Alternatives

Querying Amazon S3 Data Using AWS Athena

The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. In this article we’ll look more closely at AWS S3 and AWS Athena.

How Does AWS Athena work with Amazon S3

In AWS land, AWS S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.

The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.

How to query S3 using AWS Athena

The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for the purposes of testing it out, creating a new bucket is probably helpful). You’ll use Athena to query S3 buckets. Next, open up your AWS Management Console and go to the Athena home page. From there you have a few options for creating a table; for this example, just select the “Create table from S3 bucket data” option.

From there, AWS has made it fairly easy to get up and running with a quick four-step process: you’ll define the database, table name, and S3 folder where the data for this table will come from; select the data format; define your columns; and then set up your partitions (if you have a lot of data). Briefly laid out:

  1. Set up your Database, Table, and Folder Names & Locations
  2. Choose the data format you’ll be querying
  3. Define your columns so Athena understands your data schema
  4. Set up your Data Partitions if needed

Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.
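As a sketch of what those steps produce (table, bucket, and column names below are hypothetical), the DDL and a first query might look like this:

```sql
-- Hypothetical example: an external table over CSV files in S3
CREATE EXTERNAL TABLE mydb.events (
    event_id   string,
    event_time string,
    user_id    string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/events/';

-- A simple ad hoc query against the data lake
SELECT user_id, COUNT(*) AS event_count
FROM mydb.events
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;
```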

What happens when AWS Athena hits its limits

While Athena is very easy to get up and running, it has known limitations that start impacting price performance as usage grows. These include query limits, partition limits, unpredictable performance, and others. It’s actually why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS.

Here’s a quick comparison between the two offerings:

AWS Athena replacement

Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.

We can help you benchmark Athena against Ahana Cloud. Get in touch with us today and let’s set up a call.

Related Articles


What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse

Managed service for SQL

Ahana Announces New Security Capabilities to Bring Next Level of Security to the Data Lake

Advancements include multi-user support, deep integration with Apache Ranger, and audit support 

San Mateo, Calif. – February 23, 2022 Ahana, the only SaaS for Presto, today announced significant new security features added to its Ahana Cloud for Presto managed service. They include multi-user support for Presto and Ahana, fine-grained access control for data lakes with deep Apache Ranger integration, and audit support for all access. These are in addition to the recently announced one-click integration with AWS Lake Formation, a service that makes it easy to set up a secure data lake in a matter of hours.

The data lake isn’t just the data storage it used to be. More companies are using the data lake to store business-critical data and running critical workloads on top of it, making security on that data lake even more important. With these latest security capabilities, Ahana is bringing an even more robust offering to the Open Data Lake Analytics stack with Presto at its core.

“From day one we’ve focused on building the next generation of open data lake analytics. To address the needs of today’s enterprises that leverage the data lake, we’re bringing even more advanced security features to Ahana Cloud,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “The challenge with data lake security is in its shared infrastructure, and as more data is shared across an organization and different workloads are run on the same data, companies need fine-grained security policies to ensure that data is accessed by the right people. With these new security features, Ahana Cloud will enable faster adoption of advanced analytics with data lakes with advanced security built in.”

“Over the past year, we’ve been thrilled with what we’ve been able to deliver to our customers. Powered by Ahana, our data platform enables us to remain lean, bringing data to consumers when they need it,” said Omar Alfarghaly, Head of Data Science, Cartona. “With advanced security and governance, we can ensure that the right people access the right data.”

New security features include:

  • Multi-user support for Presto: Data platform admins can now seamlessly manage users without complex authentication files and add or remove users for their Presto clusters. Unified user management is also extended across the Ahana platform and can be used across multiple Presto clusters. For example, a data analyst gets access to the analytics cluster but not to the data science cluster.
  • Multi-user support for Ahana: Multiple users are now supported in the Ahana platform. An admin can invite additional users via the Ahana console. This is important for growing data platform teams.
  • Apache Ranger support: Our open source plugin allows users to enable authorization in Ahana-managed Presto clusters with Apache Ranger for both the Hive Metastore or Glue Catalog queries, including fine-grained access control up to the column level across all clusters. In this newest release of the Ahana and Apache Ranger plug-in, all of the open source Presto and Apache Ranger work is now available in Ahana and it’s now incredibly easy to integrate through just a click of a button. With the Apache Ranger plugin, customers can easily add role-based authorization. Policies from Apache Ranger are also now cached in the plugin to enable little to no query time latency impact.  Previously, support for Apache Ranger was only available in open source using complicated config files.
  • Audit support: With extended Apache Ranger capabilities, Ahana customers can enable centralized auditing of user access on Ahana-managed Presto clusters for comprehensive visibility. For example, you can track when users request access to data and if those requests are approved or denied based on their permission levels.
  • AWS Lake Formation integration: Enforce AWS Lake Formation fine-grained data lake access controls with Ahana-managed Presto clusters.

“We’re seeing an increasing proportion of organizations using the cloud as their primary data lake platform to bring all of an enterprise’s raw structured and unstructured data together, realizing significant benefits such as creating a competitive advantage and helping lower operational costs,” said Matt Aslett, VP and Research Director, Ventana Research. “Capabilities such as governance mechanisms that allow for fine-grained access control remain important given the simplicity of the cloud. Innovations that allow for better data governance on the data lake, such as those Ahana has announced today, will help propel usage of more sophisticated use cases.”

Supporting Resources:

Tweet this:  @AhanaIO announces new security capabilities for the data lake #analytics #security #Presto https://bit.ly/3H0Hr7p

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

AWS Lake Formation vs AWS Glue – What are the differences?

Last updated: October 2022

As you start building your analytics stack in AWS, there are several AWS technologies to understand as you begin. In this article we’ll discuss two key technologies:

  • AWS Lake Formation for security and governance; and
  • AWS Glue, a data catalog.

While both of these services are typically used to build, manage, and operationalize AWS data lakes, they fulfil completely different roles. AWS Lake Formation is built around AWS Glue, and both services share the same AWS Glue Data Catalog; however, Lake Formation provides a wider breadth of governance and data management functionality, whereas Glue is focused on ETL and data processing.

What is AWS Lake Formation? 

AWS Lake Formation makes it easier for you to build, secure, and manage data lakes. It provides a means to address some of the challenges around unstructured data lake storage – including security, access control, governance, and performance.

How it works: AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and ML services.

For AWS users who want to get governance on their data lake, AWS Lake Formation makes it easy to set up a secure data lake very quickly (in a matter of days). 

To provide better query performance when using services such as Athena or Presto, Lake Formation creates Glue workflows that integrate source tables, extract the data, and load it into the Amazon S3 data lake.

When should you use AWS Lake Formation? 

At its core, Lake Formation is built to simplify the process of moving your data to a data lake, cataloging the data, and making it available for querying. Typical scenarios where this comes into play include:

  • Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  • Add Authorization on your Data Lake  – You can centrally define and enforce security, governance, and auditing policies.
  • Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

To understand how this works in practice, check out our article on using Redshift Spectrum in Lake Formation.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake.  Using the AWS Glue Data Catalog, users can easily find and access data.

Glue ETL can be used to run managed Apache Spark jobs in order to prepare the data for analytics, perform transformations, compact data, and convert it into columnar formats such as Apache Parquet.

Read more: What’s the difference between Athena and Glue?

When should you use AWS Glue?

To make data in your data lake accessible, some type of data catalog is essential. Glue is often the default option as it’s well-integrated into the broader AWS ecosystem, although you could consider open-source alternatives such as Apache Iceberg. Glue ETL is one option to process data, where alternatives might include running your own Spark cluster on Amazon EMR or using Databricks.

Typical scenarios where you might use Glue include:

  • Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
  • Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake
  • Build ETL pipelines to ingest data into your S3 data lake. 

The data workflows initiated from an AWS Lake Formation blueprint are executed as AWS Glue jobs. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.

AWS Lake Formation vs AWS Glue: A Summary

AWS Lake Formation simplifies security and governance on the data lake, whereas AWS Glue simplifies metadata management and data discovery for data lake analytics. While both of these services are used as data lake building blocks, they are complementary. Glue provides basic functionality needed to enable analytics, including data cataloging and ETL; Lake Formation offers a simplified way to manage your data lake, including the underlying Glue jobs.

Check out our community roundtable where we discuss how you can build simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack

Presto platform

Presto Platform Overview

The Presto platform is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. PrestoDB was developed from the ground up by the engineers at Meta. Currently, some of the world’s most well-known, innovative, and data-driven companies like Twitter, Uber, Walmart, and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook, for example, still uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data per day.

The Presto platform was designed and written from scratch to handle interactive analytics; it approaches the speed of commercial data warehouses while scaling to the size of organizations like Airbnb or Twitter.

Presto allows users to effectively query data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, and even proprietary data stores. A single Presto query can combine data from multiple sources, which in turn allows for quick and accurate analytics across your entire organization. Presto is an in-memory, distributed, parallel system.

Presto is targeted at data analysts and data scientists who expect response times ranging from sub-second to minutes. The Presto platform breaks the false choice between fast analytics on an expensive commercial solution and a slow “free” solution that requires excessive hardware.
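As noted above, a single Presto query can combine data from multiple sources. A hedged sketch of such a federated query (the catalog, schema, and column names here are hypothetical):

```sql
-- Join data in a Hive/S3 table with a customers table in MySQL,
-- all in one Presto query
SELECT c.name, SUM(o.total_price) AS revenue
FROM hive.sales.orders o
JOIN mysql.crm.customers c
  ON o.customer_id = c.id
GROUP BY c.name
ORDER BY revenue DESC;
```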

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source, such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. 
  • A cost-based query optimizer, execution engine, parser, planner, and scheduler.
  • Drivers for connecting tools, including JDBC; the Presto CLI tool; and the Presto Console. 

In terms of organization, the community-owned and community-driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. Presto software is released under the Apache License 2.0.

Curious about how you can get going with the Presto platform? Ahana offers a managed service for Presto in the cloud. You can get started for free today with the Ahana Community Edition or a free trial for the full edition. The Ahana Community Edition is a free forever version of the Ahana Cloud managed service.

What is an Open Data Lake in the Cloud?

Have you been hearing the term “Open Data Lakehouse” more often? Learn what the Open Data Lake in the cloud actually is, and how it’s a solution to the massive data problem. Many companies are adopting it because of better price-performance, scale, and its non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses including their strengths and weaknesses.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple. Amazon Redshift tries to accommodate different use cases, but the pricing model does not fit all users. Learn more about the pricing of Amazon Redshift.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product. Redshift is based on PostgreSQL version 8.0.2. Learn more about the pros and cons of Amazon Redshift.

Data Lakehouse

Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need,  you can use S3 Select to filter this data at the source (i.e. S3). This reduces the amount of data that Amazon S3 transfers, which reduces the cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently Amazon S3 Select only works on objects stored in CSV, JSON, or Apache Parquet format. The stored objects can be compressed with GZIP or BZIP2 (for CSV and JSON objects only). The returned filtered results can be in CSV or JSON, and you can determine how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can perform S3 Select SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. 
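For instance, a typical S3 Select expression looks like standard SQL run against the special S3Object table. This is a hedged sketch (the column names are hypothetical, and it assumes a CSV object queried with the header-row option enabled):

```sql
-- Filter a CSV object server-side: only matching rows leave S3
SELECT s.name, s.city
FROM S3Object s
WHERE s.country = 'US'
```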

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimalistic version of pushdown to source, with limited support for the ANSI SQL dialect. Presto, on the other hand, is a comprehensive ANSI SQL-compliant query engine that can work with various data sources. Here is a quick comparison table.

Comparison             | S3 Select                     | Presto
SQL Dialect            | Fairly limited                | Comprehensive
Data Format Support    | CSV, JSON, Parquet            | Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources           | S3 only                       | Various (over 26 open-source connectors)
Push-Down Capabilities | Limited to supported formats  | Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed service for Presto. As such, the comparison between Athena and S3 Select is the same as outlined above. For a more detailed understanding of the difference between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled on your Hive catalog as a configuration option to push down projection (SELECT) and predicate (WHERE) processing to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default and you should enable it in production after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered by the query. Filtering a large number of rows should result in better performance. If the query doesn’t filter any data then pushdown may not add any additional value and the user will be charged for S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your workload. For more information on S3 Select request cost, please see Amazon S3 Cloud Storage Pricing.

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth (For the best performance on AWS, your cluster is ideally colocated in the same region and the VPC is configured to use the S3 Gateway endpoint).
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption is not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, allowing you to enable or disable it on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections depending upon your workload.
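For example, to enable it cluster-wide in the Hive catalog properties file (the max-connections value shown is illustrative; tune it to your workload):

```
# etc/catalog/hive.properties
hive.s3select-pushdown.enabled=true
hive.s3select-pushdown.max-connections=500
```

Alternatively, enable it for a single query by first running SET SESSION hive.s3_select_pushdown_enabled = true; in your Presto session.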


What is AWS Lake Formation?

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

We’re seeing more companies move to the data lake because it’s flexible, cheaper, and much easier to use than a data warehouse. You’re not locked into proprietary formats, nor do you have to ingest all of your data into a proprietary technology. As more companies leverage the data lake, security becomes even more important because more people need access to that data and you want to be able to control who sees what.

AWS Lake Formation can help address security on the data lake. For Amazon S3 users, it’s a seamless integration that allows you to get granular security policies in place on your data. AWS Lake Formation gives you three key capabilities:

  1. Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  2. Simplify security management – You can centrally define and enforce security, governance, and auditing policies.
  3. Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

If you’re currently using AWS S3 or planning to, we recommend looking at AWS Lake Formation as an easy way to get security policies in place on your data lake. As part of your stack, you’ll also need a query engine that will allow you to get analytics on your data lake. The most popular engine to do that is Presto, an open source SQL query engine built for the data lake.

At Ahana, we’ve made it easy to get started with this stack: AWS S3 + Presto + AWS Lake Formation. We provide SaaS for Presto with out of the box integrations with S3 and Lake Formation, so you can get a full data lake analytics stack up and running in a matter of hours.


Check out our webinar where we share more about our integration with AWS Lake Formation and how you can actually enforce security policies across your organization.

Data Lakehouse

How does Presto Work With LDAP?

What is LDAP?

To learn how Presto works with LDAP, let’s first cover what LDAP is. The Lightweight Directory Access Protocol (LDAP) is an open, vendor-neutral, industry-standard application protocol used for directory services authentication. With LDAP authentication, the Presto server validates user credentials against an external LDAP server before allowing clients to communicate with it.

Presto & LDAP

Presto can be configured to enable LDAP authentication over HTTPS for clients, such as the Presto CLI, or the JDBC and ODBC drivers. At present only a simple LDAP authentication mechanism involving username and password is supported. The Presto client sends a username and password to the coordinator and the coordinator validates these credentials using an external LDAP service.

To enable LDAP authentication for Presto, the Presto coordinator configuration file needs to be updated with LDAP-related configurations. No changes are required to the worker configuration; only the communication from the clients to the coordinator is authenticated. However, if you want to secure the communication between Presto nodes then you should configure Secure Internal Communication with SSL/TLS.

Summary of Steps to Configure LDAP Authentication with Presto:

Step 1: Gather configuration details about your LDAP server

Presto requires Secure LDAP (LDAPS), so make sure you have TLS enabled on your LDAP server as well.

Step 2: Configure SSL/TLS on the Presto Coordinator

Access to the Presto coordinator must be through HTTPS when using LDAP authentication.

Step 3: Configure Presto Coordinator with config.properties for LDAP

Step 4: Create a Password Authenticator Configuration (etc/password-authenticator.properties) file on the coordinator

Step 5: Configure Client / Presto CLI with either a Java Keystore file or Java Truststore for its TLS configuration.

Step 6: Restart your Presto cluster and invoke the LDAP-enabled CLI with the --keystore-* or --truststore-* properties (or both) to secure the TLS connection.
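As a hedged sketch of steps 3 and 4 (the hostname, keystore path, password, and bind pattern below are placeholders; adjust them to your directory’s layout):

```
# etc/config.properties (coordinator) -- step 3
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto/keystore.jks
http-server.https.keystore.key=changeit

# etc/password-authenticator.properties (coordinator) -- step 4
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com
```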

Reference: https://prestodb.io/docs/current/security/ldap.html

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our presentation with AWS on how to get started in 30min with Presto in the cloud.


What is Apache Ranger?

What is Apache Ranger? In a Nutshell

Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the data platform. It is an open-source authorization solution that provides access control and audit capabilities for big data platforms through centralized security administration.

Its open data governance model and plugin architecture have enabled the extension of access control to other projects beyond the Hadoop ecosystem, and the platform is widely adopted by major cloud vendors such as AWS, Azure, and GCP.

With the help of the Apache Ranger console, admins can easily manage centralized, fine-grained access control policies, including file, folder, database, table and column-level policies across all clusters. These policies can be defined at user level, role level or group level.

Apache Service Integration

Apache Ranger uses plugin architecture in order to allow other services to integrate seamlessly with authorization controls.


Figure: Simple sequence diagram showing how the Apache Ranger plugin enforces authorization policies with Presto Server.

Apache Ranger also supports centralized auditing of user access and administrative actions for comprehensive visibility into sensitive data usage. A centralized audit store tracks all access requests in real time and supports multiple backends, including Elasticsearch and Solr.

Many companies today are looking to leverage the Open Data Lake Analytics stack, the open and flexible alternative to the data warehouse. In this stack, you have flexibility when it comes to your storage, compute, and security to get SQL on your data lake. With Ahana Cloud, the stack includes AWS S3, Presto, and, in this case, our Apache Ranger integration.

Ahana Cloud for Presto and Apache Ranger

Ahana-managed Presto clusters can take advantage of Ranger integration to enforce access control policies defined in Apache Ranger. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar, hosted with DZone, where we share how you can build an Open Data Lake Analytics stack.

Related Articles

What are the differences between Presto and Apache Drill?

Drill is an open source SQL query engine that began life as an implementation of the ideas in the paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

What Is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called PrestoDB.

Benchmark Presto | Benchmarking Warehouse Workloads

Benchmark Presto

TPC-H Benchmark Presto

To learn how to benchmark Presto, let’s first start by covering the basics. Presto is an open source MPP query engine designed from the ground up for high performance with linear scaling. Businesses looking to solve their analytics workloads using Presto need to understand how to evaluate Presto performance, and this technical guide will help in that endeavor. Learn how to get started with running your own benchmark.

To help users who would like to benchmark Presto, we have written a detailed, informative guide on how to set up your PrestoDB benchmark using Benchto. Benchto is an open source framework that provides an easy and manageable way to define, run, and analyze macro benchmarks in clustered environments.
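To give a feel for the setup, a minimal Benchto benchmark descriptor is a small YAML file along these lines — the query names and run counts here are illustrative, so check the guide for the exact layout:

```yaml
# Illustrative Benchto benchmark descriptor
datasource: presto
query-names: presto/tpch/${query}.sql
runs: 5
prewarm-runs: 2
variables:
  1:
    query: q01, q06, q14
```

Each run executes the listed TPC-H queries against the configured Presto cluster, and Benchto records the timings so that different runs (for example, cache on versus cache off) can be compared.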

Running a benchmark on PrestoDB can help you to identify things like: 

  • system resource requirements 
  • resource usage during various operations 
  • performance metrics for such operations
  • …and more, depending on your workload and use case

This technical guide provides an overview of TPC-H, the industry standard for benchmarking. It also explains in detail how to configure and use the open-source Benchto tool to benchmark Presto. In addition, it shows an example comparing results between two runs of an Ahana-managed Presto cluster, with and without cache enabled.


We hope you find this useful! Happy benchmarking.

AWS & Ahana Lake Formation

Webinar On-Demand
How to Build and Query Secure S3 Data Lakes with Ahana Cloud and AWS Lake Formation

AWS Lake Formation is a service that allows data platform users to set up a secure data lake in days. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and what data access and security policies you want to apply.
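As a sketch of what defining such a policy programmatically looks like (the account ID, role name, and column names are invented placeholders), the snippet below builds the request shape for Lake Formation's GrantPermissions API; with boto3 and AWS credentials configured, it would be submitted as shown in the trailing comment:

```python
# Shape of a lakeformation:GrantPermissions request that grants a role
# SELECT on every column of sales.transactions EXCEPT the credit card
# number. All identifiers below are illustrative placeholders.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/sales-analyst"
    },
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "transactions",
            # Column wildcard minus exclusions = column-level security.
            "ColumnWildcard": {"ExcludedColumnNames": ["credit_card_number"]},
        }
    },
    "Permissions": ["SELECT"],
}

# With AWS credentials in place this would be submitted as:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant_request)
print(grant_request["Permissions"])
```

Once a grant like this is in place, any integrated engine that honors Lake Formation policies enforces it at query time.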

In this webinar, we’ll share more on the recently announced AWS Lake Formation and Ahana integration. The AWS & Ahana product teams will cover:

  • Quick overview of AWS Lake Formation & Ahana Cloud
  • The details of the integration
  • How data platform teams can seamlessly integrate Presto natively with AWS Glue, AWS Lake Formation and AWS S3 through a demo

Join AWS Solution Architect Gary Stafford and Ahana Principal Product Manager Wen Phan for this webinar where you’ll learn more about AWS Lake Formation from an AWS expert and get an insider look at how you can now build a secure S3 data lake with Presto and AWS Lake Formation.


Webinar Transcript

SPEAKERS

Ali LeClerc | Ahana, Wen Phan | Ahana, Gary Stafford | AWS

Ali LeClerc | Ahana 

All right, I think we have folks joining, so thanks everyone for getting here bright and early if you’re on the West Coast, or in your afternoon if you’re on the East Coast. We will get started here in just a few minutes.

Ali LeClerc | Ahana 

I’ll play some music to get people in the right mindset to learn about Lake Formation and Ahana Cloud for Presto. Wen, do you want to share the title slide of your slide deck are you going to start with something else? Up to you.

Wen Phan | Ahana 

I’ll bring it up in a second.

Ali LeClerc | Ahana 

Alright folks, thanks for joining. We’re going to just wait a few more minutes until we get things kicked off here, just to let people join, so give us a few minutes enjoy the music.


Ali LeClerc | Ahana 

All right. We are three minutes past the hour. So let’s go ahead and get started. Welcome folks to today’s Ahana webinar “How to Build and Secure AWS S3 Data Lakes with Ahana Cloud and AWS Lake Formation.” I’m Ali LeClerc, and I will be moderating today’s webinar. So before we get started, just a few housekeeping items. One is this session is recorded. So afterwards, you’ll get a link to both the recording and the slides. No need to take copious amounts of notes, you will get both the slides and the recording. Second is we did have an AWS speaker Gary Stafford, who will be joining us, he unfortunately had something come up last minute, but he will be joining as soon as he can finish that up. So you will have an AWS expert join. If you do have questions, please save those. And he will be available to take them later on. Last, like I just mentioned, we are doing Q&A at the end. So there’s a Q&A box, you can just pop your questions into that Q&A box at the bottom of your control panel. And again, we have allotted a bunch of time at the end of this webinar to take those questions. So with that, I want to introduce our speaker Wen Phan. Wen is our principal product manager at Ahana, has been working extensively with the AWS Lake Formation team to build out this integration and is an expert in all things Ahana Cloud and AWS Lake Formation. Before I turn things over to him to get started, I want to share or launch a poll, just to get an idea of the audience that we have on the webinar today. How familiar are you with Presto, with data lakes, and with Lake Formation? So if you could take just a few seconds to fill that in, that would be super appreciated. And we can kind of get a sense of who we have on today’s webinar. Wen is going to kind of tailor things on the fly based on the results here. So looks like good. We have some results coming in. Wen can you see this? Or do I need to end it for you to see it? Can you see any of the results?

Wen Phan | Ahana 

I cannot see the results.

Ali LeClerc | Ahana 

No worries. So I’m going to wait we have 41% – 50% participation. I’m going to wait a few more seconds here. And then I will end the poll and show it looks like just to kind of give real time familiarity with Presto, most people 75% very little data lakes, I think it’s more spread across the board. 38% very little 44% have played around 90% using them today. Familiar already with Lake formation. 50% says very little. So it looks like most folks are fairly new to these concepts. And that is great to know. So I’ll just wait maybe a few more seconds here. Looks like we have 64% participation. Going up a little, do 10 more seconds. Give people a minute and a half of this and then I will end the poll here. We’re getting closer, we’re inching up. All righty. Cool. I’m going to end the poll. I’m going to share the results. So everybody can kind of see the audience makeup here. Alrighty. Cool. So with that, Wen, I will turn things over to you.

Wen Phan | Ahana 

Awesome. Thanks, Ali. Thanks, everyone, for taking that poll, that was very, very useful. Like Ali said, I’m a product manager here at Ahana. I’m really excited to be talking about Ahana Cloud and Lake Formation today. It’s been a project that I’ve been working on for several months now, and I’m excited to have it released. So here’s the agenda we’ll go through today. Pretty straightforward. We’ll start with some overview of AWS Lake Formation, what it is, then transition to what Ahana is, and then talk about the integration between Ahana Cloud for Presto and AWS Lake Formation. So let’s get into it, AWS Lake Formation. So this is actually an AWS slide. Like Ali mentioned, Gary had something come up, so I’ll go ahead and present it. The bottom line is everybody, and companies, want more value from their data. And what you see here on the screen are some of the trends that we’re seeing in terms of the data growing, coming from multiple sources, being very diverse — images and text. It’s being democratized more throughout the organization, and more workloads are using the data. So traditional BI workloads are still there, but you’ll see a lot more machine learning and data science type workloads. The paradigm that is emerging to support this proliferation of data, with low-cost storage as well as allowing for multiple applications to consume it, is essentially the data lake.

Today, for folks that are building and securing data lakes, it’s taking a while, and this is what AWS is seeing. This is the impetus of why they built AWS Lake Formation. There are three kind of high-level components to Lake Formation. The first one is to just streamline the process and make building data lakes a lot faster — try to compress what used to take months into days, and provide tooling that can make it easier to move, store, update, and catalog data. The second piece is the security piece. This is actually the cornerstone of what we’ll be demonstrating and talking about today: once you have your data in your data lake, how do you go about securing it, enforcing policies and an authorization model? And although the data lake is very centralized, sharing the data across the organization is very important. So another tenet of AWS Lake Formation is to actually make it easier to discover and share your data.

So that’s a high level of Lake Formation. Now, we’ll go into Ahana and kind of why we went and built this and worked with AWS at an early stage to integrate with Lake Formation. So first, for those of you who don’t know, Ahana is the Presto company. And I think there are a few of you who are very new to Presto, so this is a single slide essentially giving a high-level overview of what Presto is. Presto is a distributed query engine. It’s not a database; it is a way for us to allow you to access different data sources using ANSI SQL and query them. The benefit of this distributed query nature is you can scale up as you need for the data. So that’s really the second point: Presto offers very low-latency performance that can scale to large amounts of data. The third piece is Presto was also created with a pluggable architecture for connectors, and what this really translates to is it supports many data sources. And one prominent use case for Presto, in addition to low-latency interactive querying, is federated querying, or querying across data sources.

The final high-level takeaway for Presto: it is open source. It was originally developed at Meta, aka Facebook, and it’s currently under the auspices of the Linux Foundation. And at the bottom of this slide, here are typical use cases of why organizations go ahead and deploy Presto, given the properties that I’ve mentioned above. Here is an architecture diagram of Presto. I just saw a question — it’s MPP, to answer that question.

Ali LeClerc | Ahana 

Can you repeat the question? So everybody knows what it was.

Wen Phan | Ahana 

Yeah, the question is, is your architecture MPP or SMP? It’s MPP. And this is the way it’s laid out — again, very high level. So, the bottom layer, you have a bunch of sources, and you can see it’s very, very diverse. We have everything from NoSQL-type databases to typical relational databases, things in the cloud, streaming, Hadoop. And so Presto is kind of this query layer between your storage — wherever your data is — to be able to query it. And at the top layer are the consumers of the query engine, whether it be a BI tool, a visualization tool, a notebook. Today, I’ll be using a very simple CLI to access Presto, use the Presto engine to query the data on the data lake across multiple sources, and get your results back. So this all sounds amazing. But if you were to use Presto and try to stand up Presto yourself today, you’re potentially going to run into some challenges. Basically, maintaining, managing, and spinning up a Presto environment can still be complex today. First of all, it is open source, but if you were to just get the open-source bits, you still have to do a lot of legwork to get the remaining infrastructure to actually start querying. So you still need a catalog. I know some of you are new to data lakes — essentially, you have files in some kind of file store. Before, it used to be distributed file systems like Hadoop’s HDFS; today, the predominant one is S3, which is an object store. So you have a bunch of files, but those files don’t really mean anything in terms of a query until you have some kind of catalog. So if you were to use Presto, at least the open source version, you still have to go figure out — well, what catalog am I going to use to map those files into some kind of relational entity, a mental model for you to query? The other one is Presto has actually been around for quite a while, and it was born of the Hadoop era; it has a ton of configurations.
And so if you were to spin this up, you’d have to go figure out what those configurations need to be, figure out the settings — there’s a lot of complexity there — and, tied to the configuration, you wouldn’t know how to really tune it, so you might have poor out-of-the-box performance. So all of these challenges, in addition to the proliferation of data lakes, is why Ahana was born and the impetus for our product, which is Ahana Cloud for Presto.

We aim to get you from zero to Presto in 30 minutes or less. It is a managed cloud service — I will be using it today, so you will be able to see it in action. But as a managed cloud service, there is no installation or configuration. We specifically designed this for data teams of all experience levels. In fact, a lot of our customers don’t have huge engineering teams and just really need an easy way of managing this infrastructure and providing the Presto query engine for their data practitioners. Unlike other solutions, we take away most of the complexity, but we still give you enough knobs to tune things — we allow you to select the number of workers, the size of the workers that you want, things like that. And obviously, we have many Presto experts within the company to assist our customers. So that’s just a little bit about Ahana Cloud for Presto. If you want to try it, it’s pretty simple: just go to our website at that address above — like Ali said, you’ll get this recording — and you can go ahead to that site and sign up. You will need an AWS account, but if you have one, we can go ahead and provision the infrastructure in your account, and you can get up and running with your first Presto cluster pretty quickly. I’ll pause here and see if there’s another question.

Ali LeClerc | Ahana 

Looks like we have a few. What format is the RDBMS data stored in S3?

Wen Phan | Ahana 

Yeah, so we just talked about data. I would say the de facto standard today is Parquet. You can do any kind of delimited format, CSV, ORC files, things like that. And then it just depends on your reader to go ahead and interpret those files. And again, you have to structure that directory layout with your catalog to properly map those files to a table. And then you’ll have another entity called the database on top of the table. You’ll see some of that as well — I won’t go to that low level, but you’ll see the databases and tables when I show the AWS Lake Formation integration.
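For example, with Presto's Hive connector, the mapping from an S3 prefix full of Parquet files to a queryable table is declared in DDL something like the following — the catalog, bucket, and column names are made up for illustration:

```sql
-- Map Parquet files under an S3 prefix to a table in the catalog
CREATE TABLE glue.sales.transactions (
    txn_ts      timestamp,
    customer_id bigint,
    cc_number   varchar,
    category    varchar,
    amount      double
)
WITH (
    external_location = 's3://example-data-lake/sales/transactions/',
    format = 'PARQUET'
);
```

The catalog (Glue, a Hive metastore, or Lake Formation's data catalog) stores this table definition, which is what lets the engine treat the raw files as rows and columns.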

Ali LeClerc | Ahana 

Great. And then, I just want to take a second, Gary actually was able to join. So welcome Gary. Gary is Solutions Architect at AWS and, obviously, knows a lot about Lake Formation. Great to have you on Gary. And he’s available for questions, if anybody has specific Lake Formation questions, so carry on Wen.

Wen Phan | Ahana 

Hey, Gary, thanks for joining. Okay, so I’ll try to really keep it tight. So, just quickly about Lake Formation, since many of you are new to it. Again, there are three pieces — making it easier to stand up the data lake, the security part, and the third part being the sharing. What we’re focused on primarily with our integration, and you’ll see this, is the security part: how do we use Lake Formation as a centralized source of authorization information, essentially. So what are the benefits? Why did we build this integration? First of all, many folks we’re seeing have invested in AWS as their data lake infrastructure of choice. S3 is huge, and a lot of folks are already using Glue today. Lake Formation leverages both Glue and S3, so it was a very natural decision for us seeing this particular trend. And for folks that have already invested in S3 and Glue, this is basically a native integration for you. So this is a picture of how it works. Essentially, you have your files stored in your data lake storage — Parquet, CSV, or ORC — the data catalog is mapping that to databases and tables, all of that good stuff. And then the thing that we’re going to be applying is Lake Formation’s access control. So you have these databases, you have these tables, and what we’ll see is how you can control access — which user has access to which table. Actually, we’ll be able to see which users have access to which columns and which rows. And so that’s basically the integration that we’ve built in. So someone — the data lake admin — will go ahead and not only define the schemas but define the access, and Ahana Cloud for Presto will be able to take advantage of those policies that have been centrally defined.

We make this very easy to use — this is a core principle in our product as well, as I kind of alluded to at the beginning. We’re trying to really reduce complexity, make things easy to use, and really democratize this capability. So doing this is very few clicks, through a very simple UI. So today, if you were going into Ahana — and I’m going to show this with the live application — essentially, it’s an extension of Glue. So you would have Glue, and we have a single click called “Enable AWS Lake Formation.” When you go ahead and click that, we make it very easy: we actually provide a CloudFormation template, or stack, that you can run that will go ahead and hook up your Ahana Cloud for Presto to your Lake Formation. And that’s it. The second thing that we do is you’ll notice that we have a bunch of users here. So, you have all these users, and then you can map them to, essentially, your IAM roles, which are what the policies are tied to in Lake Formation. So, in Lake Formation, you’re going to create policies based on these roles. You can say, for example, the data admin can see everything, the HR analyst can only see tables in the HR database, whatever. But you have these users that then will be mapped to these roles. And once we know what that mapping is, when you log into Presto as these users, the policies tied to those roles are enforced in your queries. And I will show this. But again, the point here is we make it easy, right? There’s a simple user interface for you to go ahead and make the mapping, and a simple user interface for you to go ahead and enable the integration.

Wen Phan | Ahana 

Are we FedRAMP certified in AWS? At this moment, we are not. That is an inbound request that we have had, and that we are exploring, depending on, I think, the need. Today, we are not FedRAMP certified. Then the final piece is the fine-grained access control, leveraging Lake Formation. I mentioned this: you’re going to build your data lake, you’re going to have databases, you’re going to have tables. And, you know, AWS Lake Formation has had database-level security and table-level security for quite some time, and we offer that. More recently, they’ve added more fine-grained access control. So not only can you control the database and the table you have access to, but also the columns and the specific rows you have access to — the row-level one being announced just a little over a month ago at the most recent re:Invent. We’re actually one of the earliest partners to go ahead and integrate with this feature that essentially just went GA. I’ll show this. Okay, so that was a lot of talking. I’m going to do a quick time check — we’re good on time. I’m going to pause here before I go into the demo and see what we have for other questions. Okay, great. I answered the FedRAMP one.

Ali LeClerc | Ahana 

Here’s one that came in — can Presto integrate with AzureAD / AWS SSO for user management and SSO?

Wen Phan | Ahana 

Okay, so the specific AD question, I don’t know the answer to that. This is probably a two-level question. So, there’s Presto — just native Presto that you get out of the box and how you can authenticate to that — and then there is the Ahana managed service. What I can say is single sign-on has been a request, and we are working on providing more single sign-on capabilities through our managed service. For the open-source Presto itself, I am not aware of any direct AzureAD integration there. If you are interested, I can definitely follow up with a more thorough answer. Whoever asked that, if you actually are interested, feel free to email us and we can reach out to you.

Ali LeClerc | Ahana 

Thanks, Wen.

Wen Phan | Ahana 

Okay, so we’re going to do the demo. Before I get to the nitty gritty of the demo, let me give you some overview and texture, and orient everyone to the application first. So, many of you are new to Ahana. Once you have Ahana installed, this is what the UI looks like. It’s pretty simple, right? You can go ahead and create a cluster; you can name your cluster whatever, [example] Acme. We have a few settings — how large you want your instances, what kind of auto scaling you want. Like we mentioned, out of the box, if you need a catalog, we can provide a catalog for you. You can create users so that users can log into this cluster; we have available ones here, and you can always create a new one. Step one is create your cluster. And then we’ve separated the notion of a cluster from a data source. That way, you can have multiple clusters and reuse configuration that you have with your data source. For example, if you go to a data source, I could go ahead and create a Glue data source. And as you select different data sources, you provide the configuration information specific to that data source. In my case, I’ll do a Lake Formation one. So, I’m going to Lake Formation — you’ll select what region your Lake Formation service is in. You can use vanilla Glue as well; you don’t have to use Lake Formation if you don’t want to use the fine-grained access control. If you want to, and you want to use your policies, you enable Lake Formation, and then you go ahead and run the CloudFormation stack, and that’ll go ahead and do the integration for you. If you want to do it yourself, or you’re very capable, we do provide information about that in our documentation. So again, we try to make things easy, but we also try to be very transparent if you want more control on your own. But that’s it. And then you can map the roles, as I mentioned before, and then you go ahead and add the data source.
And it will go ahead and create the data source. In the interest of time, I’ve already done this.

You can see I have a bunch of data sources. I have some Netflix data on Postgres — it’s not real data, it’s just what we call it. We have another data source for MySQL, I have vanilla Glue, and I have a Lake Formation one. I have a single cluster right now that’s been idle for some time, for two hours, called “Analysts.” Once it’s up, you can see by default it has three workers. It’s scaled down to one — not a really big deal, because the queries I’m going to run aren’t going to be very large. This is actually a time-saving feature. But once it’s up, you can connect to it — you’ll have an endpoint, and whatever tool you want can connect via JDBC or the endpoint; we have Superset built in. I’m going to go ahead and use a CLI. But that was just a high-level overview of the product, since folks probably are new to it. Pretty simple. The final thing is you can create your users, and you can see how many clusters your users are attached to. All right, so let’s go back to slideware for a minute and set the stage for what you’re going to see. We’re going to query some data, and we’re going to see the policies in Lake Formation in action.

I’ve set up some data so we can have a scenario that we can follow along and see the various fine-grained access control capabilities in Lake Formation. So imagine we work for a company with different departments — a sales department and an HR department. Let’s say the sales department has their own database, and inside there they have transactions data about sales transactions, plus information on the customers; and we have another database for human resources, or HR, with employees. So here’s a sample of what the transaction data could look like. You have your timestamp, you have some customer ID, you have a credit card number, you have perhaps the category that transaction was made in, and you have whatever the amount for that transaction was. Customer data: you have the customer ID, which is just a primary key — first name, last name, gender, date of birth, where they live. Again, fictitious data, but kind of representative of some use cases that you’ll run into. And then HR — instead of customers, pretend you have another table with just your employees. Okay? All right. So let’s say I am the admin, and my name is Annie. I want to log in, and I’m an admin — I should have access to everything. Let’s go ahead and try this. So again, my cluster is already up, and I have the endpoints.

Wen Phan | Ahana 

I’m going to log in as Annie, and let’s take a look at what we see. Presto has different terminology, and it might seem a little confusing, so I’ll go ahead and decode it for those of you that are not familiar. Each connector essentially becomes what is called a catalog. Now, this is very different than a data catalog that we talk about — it’s just what they call it. In my case, the Lake Formation data source that I created is called LF, for Lake Formation. I also called it LF because I didn’t want to type as much. Just to tie this back to what you are seeing: if we go back to here, you notice that the data source is called LF, and I’ve attached it to this analyst cluster that I created. So that’s why you see the catalog name as LF. So that’s great. And LF is attached to Lake Formation, which has native integration with Glue and S3. If I actually look at what databases — they’re called schemas in Presto — I have in LF, I should see the databases that I just showed you. So, you see them — ignore the information schema, that’s just kind of metadata information — you see sales, and you see HR. And I can actually take a look at what tables I have in the sales database, and I have customers and transactions. And, you know, I’m an admin, so I should be able to see everything in the transactions table, for example. And I’ve set this policy already in Lake Formation. So I go here, and I should see all the tables — the same data that I showed you in the PowerPoint. So you see the transaction, the customer ID, the credit card number, category, etc. So great, I’m an admin, I can do stuff.
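Reconstructed from the demo narration (exact table contents aside), the exploration as Annie comes down to a few statements:

```sql
SHOW SCHEMAS FROM lf;        -- information_schema, sales, hr
SHOW TABLES FROM lf.sales;   -- customers, transactions

-- As the data admin, Annie can read every column
SELECT * FROM lf.sales.transactions LIMIT 10;
```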

Let me see some questions. What do you type to invoke Presto? Okay, let’s be very specific for this question. So Presto is already up, right — I’ve already provisioned this cluster through Ahana. So when I went and said create cluster, that spun up all the nodes of the Presto cluster, set it up, configured it, did the coordinator, all of that behind the scenes. That’s Presto — it’s a cluster, it’s a distributed query engine. Then Presto exposes endpoints — [inaudible] endpoint, a JDBC endpoint — that you can then have a client attach to. You can have multiple clients; most BI tools will be able to access this.

In this case, for the simplicity of this demo, I just use a CLI. So I would basically download the CLI, which is just another Java utility. So you need to have Java installed. And then you run the CLI with some parameters. So I have the CLI, it’s actually called Presto, that’s what the binary is, then I pass it some parameters. And I said, Okay, what’s the server? Here’s the endpoint. So it’s actually connecting from my local desktop to that cluster in AWS with this, but you can’t just access it, you need to provide some credentials. So I’m saying I’m going to authenticate myself with a password.

The user I want to access that cluster with is “Annie.” Why is this valid? Well, this is valid because when I created this cluster, I specified which users are available in that cluster. So I have Annie, I have Harry, I have Olivia, I have Oscar, I have Sally, I have Wally. Okay, so again, just to summarize: I didn’t invoke Presto from my desktop, my local machine. I’m just using a client — in this case, the Presto CLI — to connect to the cluster that I provisioned via Ahana Cloud, in the cloud. And I’m just accessing that. As part of that, the cluster is already configured to use Lake Formation. The first thing I did was log in as Annie, and as we mentioned, Annie is my admin, and as an admin, she can access everything — she has access to all the databases, all the tables, etc.
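The invocation Wen describes would look something like the session below — the flags are standard Presto CLI options, while the endpoint is a placeholder:

```
$ ./presto --server https://<cluster-endpoint> \
           --catalog lf \
           --user Annie \
           --password
Password:
presto>
```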

Wen Phan | Ahana 

Okay, so let’s do another, more interesting case. Let’s say instead of Annie, I log in as Sally, who is a sales analyst. As a platform owner, I know that Sally, in order to do her job, all she needs to look at are transactions — because let’s say she’s going to forecast what the sales are, or she’s going to do some analysis on what types of transactions have been good. So if we go back and look at the transactions table, this is what it looks like. Now, when I do this, though, I do notice that there’s a credit card number, and I know that I don’t really want to expose a credit card number to my analysts, because they don’t need it for their work. So I’m going to go ahead — also in this policy, for financial information — and say, you know, any sales analyst, in this case Sally, can only have access to the transactions table, and when she accesses the transactions table, she will not be able to see the credit card number. Okay. So let’s go see what this looks like. So instead of Annie, I’m going to log in as Sally. Let’s just see what we did here: if we actually look at the data source, Annie got mapped to the role of data admin, so she can see everything. Sally is mapped to the role of “sales analyst,” and therefore can only do what a sales analyst is defined to do in Lake Formation. But the magic is it’s defined in Lake Formation, and Ahana Cloud for Presto can take advantage of that policy.

So I’m going to go ahead and log in as Sally. Let’s first take a look at the databases that I can see — they’re called schemas — in LF. So the first thing you’ll notice is Sally does not see HR, because she doesn’t need to, and she has been restricted, so she can only see sales. Now let’s see what tables Sally can see. Sally can only see transactions; Sally cannot actually see the customers table. But she doesn’t know this. She’s just running queries, and she’s saying, “Well, this is all I can see, and it’s what I need to do my job, so I’m okay with it.” So let’s actually try to query the transactions table now — LF, sales, transactions. When I try to do this, I actually get an Access Denied. Why? The reason I get an Access Denied here is because I cannot actually look at all the columns — I’ve been restricted; there’s only a subset of the columns that I can look at. As I mentioned, we are not able to see the credit card number. So when I try to do a select star, I can’t really do a star, because I can’t see the credit card number. We are making an improvement where we won’t do an explicit deny and will just return the columns that you have access to — otherwise, this can be a little annoying. But at the end of the day, you can see the authorization being enforced. You have Presto, and the policies that are set in Lake Formation are being enforced.
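In SQL terms, and with illustrative column names, the behavior demonstrated as Sally is:

```sql
-- Fails: the wildcard expands to credit_card_number, which the
-- sales-analyst policy does not allow
SELECT * FROM lf.sales.transactions;
-- => Access Denied

-- Succeeds: only the permitted columns are listed explicitly
SELECT txn_ts, customer_id, category, amount
FROM lf.sales.transactions;
```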

So now, instead of doing a star, I specifically list the columns I have access to, and I can see the data and do whatever I need to do. I can do a group by to see which categories are doing well. I can do a time series analysis on revenue and then a forecast for the next three months, whatever I need to do as a sales analyst. So that’s great. Okay, so I’m going to go ahead and log out. So let’s go back to this. We know Sally’s world. Now let’s say the marketing manager, Ali, has two marketing analysts, and she has them responsible for different regions, and we want to get some demographics on our customers. So we have Wally. If you look at the customers data, there’s a bunch of PII: first name, last name, date of birth. So a couple of things. We can automatically say, you know what, they don’t need to see this PII; we’re going to go ahead and mask it with Lake Formation. And like I mentioned, Ali kind of segments her analysts into different regions across the Pacific West Coast. Wally is really responsible only for Washington. So we decided to say, hey, on a need-to-know basis, you’re only going to get rows back from customers that live in Washington. All right, so let’s go ahead and do that.
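Selecting only the permitted columns succeeds. The column names below (`txn_id`, `category`, `amount`) are hypothetical stand-ins; the demo only confirms that the credit card number is the excluded column:

```sql
-- Works: the star is replaced with the columns Sally can see
SELECT txn_id, category, amount
FROM lf.sales.transactions;

-- A typical analyst query on the permitted columns, e.g. revenue by category
SELECT category, SUM(amount) AS revenue
FROM lf.sales.transactions
GROUP BY category
ORDER BY revenue DESC;
```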

Wen Phan | Ahana 

I’m going to log in as Wally, and let’s actually look at the databases again, just to show you the different layers of the authorization. So Wally can see sales, not HR. Now let’s see what tables Wally can see. Wally should only see customers, and indeed he can only see customers; he cannot see the transactions table, because he’s been restricted from it. Let’s try the same thing again: select star from sales.customers. We expect this to error out. Why? Because again, it’s PII data; we cannot do a star. We don’t allow first name, last name, date of birth, all of that. If I do this and take those columns out, I’ll see the columns that I want, and I only see the rows that come from Washington. I technically did not have to select the state; I just want to prove that I’m only getting records from Washington.

Let’s try another analyst, Olivia. And guess what, Olivia is responsible only for Oregon. So she’s basically a peer to Wally, but she’s responsible for Oregon. I’m going to go ahead and run the same query, which I saved, and see what happens. In this case, Olivia can only see Oregon. What you’re seeing here is basically fine-grained access control: you’re seeing database-level restriction, table-level restriction, column-level restriction, and row-level restriction. And you can do as many of these as you want. So we talked about Wally, and we know Olivia can only see Oregon. One more persona, actually two more personas, and then we’re done. I think you all get the point.
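Wally’s and Olivia’s sessions would look roughly like this. The column names (`customer_id`, `city`, `state`) are hypothetical stand-ins, and the row filter itself lives in Lake Formation, not in the query:

```sql
-- Run as Wally: no WHERE clause needed; Lake Formation's row-level
-- filter returns only customers with state = 'WA'
SELECT customer_id, city, state
FROM lf.sales.customers;

-- The identical query run as Olivia returns only state = 'OR' rows.
```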

I think I’ve probably offered sufficient proof that we can in fact enforce policies. The last one is Harry, who’s in HR. If I log in as Harry, Harry should only be able to see the HR data set. So I log in as Harry and show the tables. First of all, just to be complete, let’s say I want to see the sales data: you can see Harry cannot see the transactions. And then, since I already know what the schema is, I can go ahead and look at all the employees in this database. And I’ll see everything, because I’m in HR, so I can see personal information and it doesn’t restrict me.

Okay. And the final thing is: what happens if I have a user that I haven’t mapped any policies to? I actually have one user here, Oscar, and I didn’t give Oscar any policies whatsoever. So let me go ahead here. Notice that Oscar is in the cluster, but Oscar is not mapped to any role whatsoever. I go back to my cluster, and Oscar is here; he is a valid user in the cluster, but Oscar has no role. And by default, if you have no role, we deny you access. That’s essentially what’s going to happen, but just to prove it: Oscar is using this cluster, and if I show catalogs, you’ll see the LF catalog. But if I try to see what’s in that connector: Access Denied. Because there is no mapping, you can’t see anything; we’re not going to tell you anything. We’re not going to tell you what databases are in there. No tables, nothing. So in the case where you don’t have explicit access, it’s very secure: you don’t get any information. Okay, I’ve been in demo mode for a while, so I just wanted to check if there are any questions in the chat. All right, none.
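Oscar’s default-deny session, sketched in the same style as the earlier ones (output text is illustrative):

```sql
-- Oscar is a valid cluster user, so the catalog itself is visible
SHOW CATALOGS;            -- includes "lf"

-- But with no Lake Formation role mapping, everything inside is denied
SHOW SCHEMAS FROM lf;
-- Query failed: Access Denied
```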

So let’s just do a summary of what we saw, and then wrap it up for Q&A. We’re good on time, actually. And I’ll give you some pointers on where to get more information if you want to dig in deep.

So first, the review. We had all these users, and you see the roles. We saw a case where you have all access, and a case where you have no access. And I did a bunch of other demos where you saw varying degrees of access: table, database, column, row, all of that. That’s what this integration really brings to folks that have a data lake today. You’ve got all your data there. Inside your data lake, you’ve decided that Presto is the way to go in terms of interactive querying, because it scales and it can manage all your data. And now you want to roll it out to all your analysts and data practitioners, but you want to do it in a secure way, you want to enforce it, and you want to do it in one place. Lake Formation doesn’t only integrate with Ahana; it can integrate with other tools within the AWS ecosystem. So you define these policies in one place, and Ahana-managed Presto clusters can take advantage of that.

There was a more technical talk on this, if you’re interested in the technical details, that we just presented at PrestoCon with my colleague Jalpreet, who is the main engineer on this, as well as another representative from AWS, Roy. If you’re interested, go ahead and Google it and go to YouTube; you can watch it there, and they’ll give you more of the nitty-gritty under the hood. And that is all I have for planned content.

Ali LeClerc | Ahana 

Wen, what a fantastic demo, thanks for going through all of those. Fantastic. So I wanted to give Gary a chance to share his perspective on the integration and his thoughts on what this means from the AWS point of view. So Gary, if you don’t mind turning on your video, that would be awesome. Say hi to everyone, and I’ll let you share your thoughts.

Gary Stafford | AWS 

That’s much better than that corporate picture that was up there. Yeah, thank you. I would also recommend, as Wen said, viewing the PrestoCon video with Roy and Jalpreet; I think they go into a lot of detail with respect to how the integration works under the covers. And maybe I’ll share two links, Ali, I’ll paste them in there. One link is on what’s new with AWS Lake Formation. Roy mentioned some of the new features that were announced, and it’s a very actively developed project with a lot of new features coming out, so I’ll share that link. Also, Jalpreet mentioned a lot of the API features; Lake Formation has a number of APIs, so I’ll drop a link in there too that discusses some of those available endpoints and APIs a little better. I’ll also share my personal vision. I think of services like Amazon EventBridge, which has a partner integration that makes it very easy for SaaS partners to integrate with customers on the AWS platform. I think it’d be phenomenal at some point if Lake Formation progresses to that point, with some of the features that Roy mentioned and Wen demonstrated today, where partners like Ahana could integrate with Lake Formation and get, out of the box, a way to create a data lake and a way to secure a data lake, and simply add their analytics engine with their special sauce on top of that, without having to do that heavy lifting. I hope that’s the direction Lake Formation is headed in. I think it’ll be phenomenal to have a better integration story with our partners on AWS.

Ali LeClerc | Ahana 

Fantastic. Thanks, Gary. With that, we have a few questions that have come in. Again, if you have a question, you can pop it into the Q&A box or the chat box. So Wen, I think this one’s for you: can you share a little bit more detail about what happens when you enable the integration?

Wen Phan | Ahana

Sure, I will answer this question in two ways. First, I will show you what we’re doing under the hood, this API exchange. This is a recent release. So let me go ahead and share my screen again, and whoever asked the question, if I didn’t answer it, let me know. When you go to the data source, like I mentioned, it’s pretty simple, and we do that on purpose. When you enable Lake Formation, you can go ahead and launch this CloudFormation template, which will do the integration. What is it actually doing under the hood? First of all, this is a good time for me to introduce our documentation. If you go to ahana.io, all of this is documented. You go to docs; Lake Formation is tightly coupled with Glue, so you go to manage data sources, then Glue, and this will walk you through it. And there’s a section here for when you don’t want to use the CloudFormation template, or you simply want to understand what it is really doing, so you can go ahead and read about it. Essentially, like Roy mentioned, there are a bunch of APIs, and one of them is the data lake settings API in Lake Formation. If you use the AWS CLI, you can actually see this and get a response. What we’re doing is setting a bunch of flags: you have to allow Ahana Presto to do the filtering on your behalf. We’re going to get the data, look at the policies, and block out anything you’re not supposed to see. And we are also a partner, so the service needs to know that this is a valid partner interacting with the Lake Formation service. That’s all this is doing. You could do it all manually with the CLI if you really wanted to; we just take care of it for you, on your behalf. So that’s what’s going on to enable the integration.
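For reference, those settings can be inspected with the AWS CLI. `get-data-lake-settings` is a real Lake Formation CLI command; the fields highlighted in the comments (`AllowExternalDataFiltering`, `ExternalDataFilteringAllowList`) are the flags that permit a third-party engine to filter on your behalf, though the exact response shape may vary by CLI version:

```
# Inspect the current Lake Formation data lake settings for the account
aws lakeformation get-data-lake-settings --region us-east-1

# In the JSON response, look for the external-filtering flags, e.g.:
#   "AllowExternalDataFiltering": true,
#   "ExternalDataFilteringAllowList": [
#     { "DataLakePrincipalIdentifier": "<account-or-partner-id>" }
#   ]
```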
The second part, which that talk goes into in a lot more detail, is what’s actually happening under the hood. I’m just going to show a quick slide for this. Essentially, you define everything in AWS, and when you make a query, our service, which in this case is a third-party application, goes and talks to Lake Formation. Once you set this up, we talk to Lake Formation and get temporary credentials. Then we know what the policies are, and we are able to access only the data that you’re allowed to see. We process the query, and you see the results in the client; in my case, that’s what you saw in the CLI.

Ali LeClerc | Ahana 

Cool, thanks Wen, thorough answer. The next question that came in is: is this product a competitor to Redshift? I’m assuming when you say product, you mean Ahana, but maybe you can talk about both Ahana and Presto, Wen?

Wen Phan | Ahana 

Yeah, I mean, it all comes down to your use case. Redshift is more like a data warehouse, and that’s great; it has its own use cases. And again, Presto can connect to Redshift. So it depends on what you want. Presto can talk to the data lake, so if you have use cases that make more sense on a data lake, Presto is one way to access it. And if you have use cases that need to span both the data lake and Redshift, Presto can federate that query as well. So it’s just another piece in the ecosystem. I don’t necessarily think it’s a competitor; as with many things, figure out your use case and pick the right tool for it.

Ali LeClerc | Ahana 

Great. I think you just mentioned something around Glue, Wen. So somebody asked: do I need to use Glue for my catalog if I’m using Lake Formation with Ahana Cloud?

Wen Phan | Ahana 

Yes, you do. It’s a tightly coupled AWS stack, which works very well, so you do have to use Glue.

Ali LeClerc | Ahana 

All right. I think we’ve answered a ton of questions along the way, as well as just now. If there are no more, and it looks like no more have come in, then I think we can wrap up here. Any last parting thoughts, Wen and Gary, before we say goodbye to everybody? On that note, I’m going to post our link in here. I don’t know if Wen mentioned it, maybe he did: we have a 14-day free trial. No commitment; you can check out Ahana Cloud for Presto on AWS free for 14 days, play around with it, and get started with Lake Formation. If you’re interested in learning more, we’ll make sure to put you in touch with Wen, who again is the local expert on this at Ahana. And Gary, of course, is always able to help as well. So feel free to check out our 14-day free trial. And with that, I think that’s it. All right, everyone. Thanks, Wen, fantastic demo, fantastic presentation. Appreciate it. Gary, thanks for being available; appreciate all of your support in getting this integration off the ground and into the hands of our customers. Thanks, everybody, for joining and sticking with us till the end. You’ll get a link to the recording and the slides, and we’ll see you next time.

Speakers

Gary Stafford

Solutions Architect, AWS

Wen Phan

Principal Product Manager, Ahana


Ahana Responds to Growing Demand for its SaaS for Presto on AWS with Appointment of Chief Revenue Officer

Enterprise Sales Exec Andy Sacks Appointed to Strengthen Go-to-Market Team

San Mateo, Calif. – January 11, 2022 – Ahana, the only SaaS for Presto, today announced the appointment of Andy Sacks as Chief Revenue Officer, reporting to Cofounder and CEO Steven Mih. In this role, Andy will lead Ahana Cloud’s global revenue strategy. With over 20 years of enterprise experience, Andy brings expertise in developing significant direct and indirect routes to market across both pre- and post-sales organizations.

Ahana Cloud for Presto is a cloud-native managed service for AWS that gives customers complete control, better price-performance, and total visibility of Presto clusters and their connected data sources. “We’ve seen rapidly growing demand for our Presto managed service offering which brings SQL to AWS S3, allowing for interactive, ad hoc analytics on the data lake,” said Mih. “As the next step, we are committed to building a world-class Go-To-Market team with Andy at the helm to run the sales organization. His strong background building enterprise sales organizations, as well as his deep experience in the open source space, makes him the ideal choice.”

“I am excited to join Ahana, the only company that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Sacks. “I am looking forward to leveraging my experiences to help drive Ahana’s growth through innovative Presto use cases for customers without the complexities of managing cloud deployments.”

Prior to Ahana, Andy spent several years as an Executive Vice President of Sales. Most recently at Alloy Technologies, and prior to that at Imply Data and GridGain Systems, he developed and led each company’s global Sales organization, while posting triple digit growth year over year. At both Imply and GridGain, he created sales organizations from scratch. Prior to GridGain, he spent over six years at Red Hat, where he joined as part of the JBoss acquisition. There he developed and led strategic sales teams while delivering substantial revenue to the company. Prior to Red Hat, he held sales leadership roles at Bluestone Software (acquired by HP), RightWorks (acquired by i2) and Inktomi (acquired by Yahoo! and Verity), where he was instrumental in developing the company’s Partner Sales organization. Andy holds a Bachelor of Science degree in Computer Science from California State University, Sacramento.

Supporting Resources

Download a head shot of Andy Sacks https://ahana.io/wp-content/uploads/2022/01/Andy-Sacks.jpg 

Tweet this:  @AhanaIO bolsters Go-To-Market team adding Chief Revenue Officer Andy Sacks #CRO #newhire #executiveteam https://bit.ly/3zJCvBL

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Cofounders Make Data Predictions for 2022

Open Data Lake Analytics Stack, Open Source, Data Engineering and More SaaS and Containers Top the List

San Mateo, Calif. – January 5, 2022 – Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Executive Officer, Steven Mih, predict major developments in cloud, data analytics, databases and data warehousing in 2022.

The COVID-19 pandemic continues to propel businesses to make strategic data-driven shifts. Today more companies are augmenting the traditional cloud data warehouse with cloud data lakes for much greater flexibility and affordability. Combined with more Analytics and AI applications, powerful, cloud-native open source technologies are empowering data platform teams to analyze that data faster, easier and more cost-effectively in SaaS environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2022:

  • OpenFlake – the Open Data Lake for Warehouse Workloads: Data warehouses like Snowflake are the new Teradata with proprietary formats. 2022 will be about the Open Data Lake Analytics stack that allows for open formats, open source, open cloud and no vendor lock-in.
  • More Open Source Behind Analytics & AI – As the momentum behind the Open Data Lake Analytics stack to power Analytics & AI applications continues to grow, we’ll see a bigger focus on leveraging Open Source to address flexibility and cost limitations from traditional enterprise data warehouses. Open source cloud-native technologies like Presto, Apache Spark, Superset, and Hudi will power AI platforms at a larger scale, opening up new use cases and workloads.
  • Database Engineering is Cool Again – With the rise of the Data Lake tide, 2022 will make database engineering cool again. The database benchmarking wars will be back and the database engineers who can build a data lake stack with data warehousing capabilities (transactions, security) but without the compromises (lock-in, cost) will win. 
  • A Post-Pandemic Data-Driven Strategic Shift to Out-Of-The-Box Solutions – The pandemic has brought about massive change across every industry and the successful “pandemic” companies were able to pivot from their traditional business model. In 2022 we’ll see less time spent on managing complex, distributed systems and more time focused on delivering business-driven innovation. That means more out-of-the-box cloud solution providers that reduce cloud complexities so companies can focus on delivering value to their customers.
  • More SaaS, More Containers – When it comes to 2022, abstracting the complexities of infrastructure will be the name of the game. Containers provide scalability, portability, extensibility and availability advantages, and technologies like Kubernetes alleviate the pain around building, delivering, and scaling containerized apps. As the SaaS space continues to explode, we’ll see even more innovation in the container space. 

Steven Mih, Co-founder and Chief Executive Officer, outlines a major trend he sees on the horizon in 2022:

  • Investment & Adoption of Managed Services for Open Source Will Soar – More companies will adopt managed services for open source in 2022 as more cloud-native open source technologies become mainstream (Spark, Kafka, Presto, Hudi, Superset). Open source companies offering easier-to-use, managed service versions of installed software enable companies to take advantage of these powerful systems without the resource overhead so they can focus on business-driven innovation.

Tweet this: @AhanaIO announces 2022 #Data Predictions #cloud #opensource #analytics https://bit.ly/3pT0KtZ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana and Presto Praised for Technology Innovation and Leadership in Open Source, Big Data and Data Analytics with Recent Industry Awards

San Mateo, Calif. – December 15, 2021 – Ahana, the only SaaS for Presto, today announced the addition of many new industry accolades in 2H 2021. Presto, originally created by Meta (Facebook), which open sourced and donated the project to Linux Foundation’s Presto Foundation, is the SQL query engine for the data lake. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data.

Recent award recognitions, include:

  • 2021 BIG Awards for Business, “Start-up of the Year” –  Ahana is recognized by the Business Intelligence Group as a winner of the 2021 BIG Awards for Business Program in the Start-up of the Year category as a company leading its respective industry.
  • CRN, “Emerging Vendors for 2021” – As part of CRN’s Emerging Vendors for 2021, here are 17 hot big data startups, founded in 2015 or later, that solution providers should be aware of. Ahana is listed for its cloud-native managed service for the Presto distributed SQL query engine for Amazon Web Services.
  • CRN, “2021 Tech Innovator Awards” – From among 373 applicants, CRN staff selected products spanning the IT industry—including in cloud, infrastructure, security, software and devices—that offer both strong differentiation and major partner opportunities. Ahana Cloud for Presto was named a finalist in the Big Data category. 
  • DBTA, “Trend Setting Products in Data and Information Management for 2022” – These products, platforms and services range from long-established offerings that are evolving to meet the needs of their loyal constituents to breakthrough technologies that may only be in the early stages of adoption. However, the common element for all is that they represent a commitment to innovation and seek to provide organizations with tools to address changing market requirements. Ahana is included in this list of most significant products. 
  • Infoworld, “The Best Open Source Software of 2021” – InfoWorld’s 2021 Bossie Awards recognize the year’s best open source software for software development, devops, data analytics, and machine learning. Presto, an open source, distributed SQL engine for online analytical processing that runs in clusters, is recognized with a prestigious Bossie award this year. The Presto Foundation oversees the development of Presto. Meta, Uber, Twitter, and Alibaba founded the Presto Foundation and Ahana is a member.  
  • InsideBIGDATA, “IMPACT 50 List for Q3 and Q4 2021” – Ahana earned an Honorable Mention for both of the last two quarters of the year as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 
  • Solutions Review, “Coolest Data Analytics and Business Intelligence CEOs of 2021” – This list of the coolest data analytics CEOs which includes Ahana’s Cofounder and CEO Steven Mih is based on a number of factors, including the company’s market share, growth trajectory, and the impact each individual has had on its presence in what is becoming the most competitive global software market. One thing that stands out is the diversity of skills that these chief executives bring to the table, each with a unique perspective that allows their company to thrive. 
  • Solutions Review, “6 Data Analytics and BI Vendors to Watch in 2022” – This list is an annual listing of solution providers Solutions Review believes are worth monitoring, which includes Ahana. Companies are commonly included if they demonstrate a product roadmap aligning with Solutions Review’s meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication.

“We are proud that Ahana’s managed service for Presto has been recognized by top industry publications as a solution that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Steven Mih, cofounder and CEO, Ahana. “In less than a year, Ahana’s innovation has been proven with innovative use cases delivering interactive, ad-hoc analytics with Presto without having to worry about the complexities of managing cloud deployments.”

Tweet this:  @AhanaIO praised for technology innovation and leadership with new industry #awards @CRN @DBTA @BigDataQtrly @insideBigData @Infoworld @SolutionsReview #Presto #OpenSource #Analytics #Cloud #DataManagement https://bit.ly/3ESDnWy 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Cloud for Presto Delivers Deep Integration with AWS Lake Formation Through Participation in Launch Program

Integration enables data platform teams to seamlessly integrate Presto with their existing AWS data services while providing granular security for data

San Mateo, Calif. – December 9, 2021 – Ahana, the only SaaS for Presto, today announced Ahana Cloud for Presto’s deep integration with AWS Lake Formation, an Amazon Web Services, Inc. (AWS) service that makes it easy to set up a secure data lake, manage security, and provide self-service access to data with Amazon Simple Storage Service (Amazon S3). As an early partner in the launch program, Ahana’s integration allows data platform teams to quickly set up a secure data lake and run ad hoc analytics on that data lake with Presto, the de facto SQL query engine for data lakes.

Amazon S3 has quickly become the de facto storage for the cloud, widely used as a data lake. As more data is stored in the data lake, query engines like Presto can directly query the data lake for analytics, opening up a broader set of Structured Query Language (SQL) use cases including reporting and dashboarding, data science, and more. Security of all this data is paramount because unlike databases, data lakes do not have built-in security and the same data can be used across multiple compute engines and technologies. This is what AWS Lake Formation solves for.

AWS Lake Formation enables users to set up a secure data lake in days. It simplifies the security on the data lake, allowing users to centrally define security, governance, and auditing policies in one place, reducing the effort in configuring policies across services and providing consistent enforcement and compliance. With this integration, AWS users can integrate Presto natively with AWS Glue, AWS Lake Formation and Amazon S3, seamlessly bringing Presto to their existing AWS stack. In addition to Presto, data platform teams will get unified governance on the data lake for many other compute engines like Apache Spark and ETL-focused managed services in addition to the already supported AWS native services like Amazon Redshift and Amazon EMR.

“We are thrilled to announce our work with AWS Lake Formation, allowing AWS Lake Formation users seamless access to Presto on their data lake,” said Dipti Borkar, Cofounder and Chief Product Officer at Ahana. “Ahana Cloud for Presto coupled with AWS Lake Formation gives customers the ability to stand up a fully secure data lake with Presto on top in a matter of hours, decreasing time to value without compromising security for today’s data platform team. We look forward to opening up even more use cases on the secure data lake with Ahana Cloud for Presto and AWS Lake Formation.”

The Ahana Cloud and AWS Lake Formation integration has already opened up new use cases for customers. One use case centers around making Presto accessible to internal data practitioners like data engineers and data scientists, who can then in turn develop downstream artifacts (e.g. models, dashboards). Another use case is exposing the data platform to external clients, which is how Ahana customer Metropolis is leveraging the integration. In Metropolis’ case, they can provide their external customers transparency into internal operational data and metrics, enabling them to provide an exceptional customer experience.

“Our business relies on providing analytics across a range of data sources for our clients, so it’s critical that we provide both a transparent and secure experience for them,” said Ameer Elkordy, Lead Data Engineer at Metropolis. “We use Amazon S3 as our data lake and Ahana Cloud for Presto for ad hoc queries on that data lake. Now, with the Ahana and AWS Lake Formation integration, we get even more granular security with data access control that’s easy to configure and native to our AWS stack. This allows us to scale analytics out to our teams without worrying about security concerns.”

Ahana Cloud for Presto on AWS Lake Formation is available today. You can learn more and get started at https://ahana.io/aws-lake-formation

Supporting Resources:

TWEET THIS: @Ahana Cloud for #Presto delivers deep integration with AWS Lake Formation  #OpenSource #Analytics #Cloud https://bit.ly/3Ix9L35

About Ahana

Ahana, the only SaaS for Presto, offers a managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the Ahana Cloud for Presto integration with AWS Lake Formation


We’re excited to announce that Ahana Cloud for Presto now integrates with AWS Lake Formation, including support for the recent general availability of row-level security.

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services. Over the past several months we’ve worked closely with the AWS Lake Formation team to bring Lake Formation capabilities to Presto on AWS.  Further, we’re grateful to our customers who were willing to preview early versions of our integration.

Today, Ahana Cloud for Presto allows customers to use Presto to query their data protected with AWS Lake Formation fine-grained permissions with a few clicks.  Our customers can bring Presto to their existing AWS stack and scale their data teams without compromising security.  We’re thrilled that the easiest managed service for Presto on AWS just got easier and more secure.

Here’s a quick video tutorial that shows you how easy it is to get started with AWS Lake Formation and Ahana:

Additionally, we’ve put together a list of resources where you can learn more about the integration.

What’s Next?

If you’re ready to get started with AWS Lake Formation and Ahana Cloud, head over to our account sign up page where you can start with a free 14-day trial of Ahana Cloud. You can also drop us a note at product@ahana.io and we can help get you started. Happy building!


Advanced SQL Tutorial

Advanced SQL: JSON

Advanced SQL queries with JSON

Presto has a wide range of JSON functions supporting advanced SQL queries. Consider this JSON test input data (represented in the query using the VALUES function), which contains three key/value elements. The key is “name” and the value is a dog breed. To select the first (0th) key/value pair we would write:

SELECT json_extract(v, '$.dogs') AS all_json, 
  json_extract(v, '$.dogs[0].name') AS name_json, 
  json_extract_scalar(v, '$.dogs[0].name') AS name_scalar 
FROM 
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
 
                         all_json                         | name_json | name_scalar 
----------------------------------------------------------+-----------+-------------
 [{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle"  | Beagle      
(1 row)

All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html 
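Beyond extraction, Presto also offers JSON inspection functions such as json_array_length and json_size. Here is a small sketch against the same test data; both calls should report the three elements of the dogs array:

```sql
-- json_array_length takes a JSON array; json_size takes JSON plus a JSONPath.
SELECT json_array_length(json_extract(v, '$.dogs')) AS num_dogs,
       json_size(v, '$.dogs') AS dogs_size
FROM
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
```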

Advanced SQL: Arrays, Un-nesting, and Lambda functions 

Consider the following array of test data elements, and a simple query to multiply each element by 2:

SELECT elements,
    ARRAY(SELECT v * 2
          FROM UNNEST(elements) AS v) AS my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
 
    elements    | my_result
----------------+---------------------
 [1, 2]         | [2, 4]
 [1, 3, 9]      | [2, 6, 18]
 [1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)

The above advanced SQL query is an example of nested relational algebra, which provides a fairly elegant and unified way to query and manipulate nested data. 

Now here’s the same query, but written using a lambda expression. Why use lambda expressions? They make advanced SQL queries over nested data less complex and the code simpler to read, develop, and debug, especially as the logic gets more complicated:

SELECT elements, 
transform(elements, v -> v * 2) as my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);

Both queries return the same result. The transform function and the “x -> y” notation simply mean “do y to my variable x”.

To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html 
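transform is not the only function that accepts a lambda; filter and reduce follow the same “x -> y” notation. A short sketch against the same test arrays, where filter keeps elements greater than 2 and reduce sums each array:

```sql
SELECT elements,
       filter(elements, v -> v > 2) AS greater_than_2,            -- keep values > 2
       reduce(elements, 0, (acc, v) -> acc + v, acc -> acc) AS total  -- running sum
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
```

For the three rows, total should come out as 3, 13, and 85 respectively.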

Advanced SQL: Counting Distinct Values

Running a count(distinct xxx) aggregation is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines; the Presto CLI will even display a warning reminding you of this.

A useful alternative is the approx_distinct function, which uses a different algorithm (HyperLogLog) to estimate the number of distinct values. The result is an approximation, and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (though it can be higher with unusual data).

Here’s an example comparing count(distinct) and approx_distinct on a table containing 160.7 million rows. Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):

presto:amazon> select count(distinct product_id) from review;
 
  _col0   
----------
 21460962 
(1 row)
 
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
 
 
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
 
presto:amazon> select approx_distinct(product_id) from review;
  _col0   
----------
 21567368 
(1 row)
 
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.


Ahana Joins Leading Members of the Presto® Community at PrestoCon as Platinum Sponsor, Will Share Use Cases and Technology Innovations

Ahana to deliver keynote and co-present sessions with Uber, Intel and AWS; Ahana customers to present case studies

San Mateo, Calif. – December 1, 2021 Ahana, the only SaaS for Presto, announced today their participation at PrestoCon, a day dedicated to the open source Presto project taking place on Thursday, December 9, 2021. Presto was originally created by Facebook who open sourced and donated the project to Linux Foundation’s Presto Foundation. Since then it has massively grown in popularity with data platform teams of all sizes.

PrestoCon is a day-long event for the PrestoDB community by the PrestoDB community that will showcase more of the innovation within the Presto open source project as well as real-world use cases. In addition to being the platinum sponsor of the event, Ahana will be participating in 5 sessions and Ahana customer Adroitts will also be presenting their Presto use case. Ahana and Intel will also jointly be presenting on the next-generation Presto which includes the native C++ worker.

“PrestoCon is the marquee event for the Presto community, showcasing the latest development and use cases in Presto,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana, Program Chair of PrestoCon and Chair of the Presto Foundation Outreach Committee. “In addition to contributors from Meta, Uber, Bytedance (TikTok) and Twitter sharing their work, we’re excited to highlight more within the Presto ecosystem including contributions like Databricks’ delta lake connector for Presto, Twitter’s Presto Iceberg Connector, and Presto on Spark. Together with our customers like Adroitts, Ahana will be presenting the latest technology innovations including governance on data lakes with Apache Ranger and AWS Lake Formation. We look forward to the best PrestoCon to date.”

“PrestoCon continues to be the showcase event for the Presto community, and we look forward to building on the success of this event over the past year to share even more innovation and use of the open source project with the larger community,” said Chris Aniszczyk, Vice President, Developer Relations, The Linux Foundation. “Presto Foundation continues to focus on community adoption, and PrestoCon is a big part of that in helping bring the Presto community together for a day of deep learning and connecting.”

“As members of the Presto Foundation focused on driving innovation within the Presto open source project, we’re looking forward to sharing our work on the new PrestoDB C++ execution engine with the community at this year’s PrestoCon,” said Arijit Bandyopadhyay, CTO of Enterprise Analytics & AI, Head of Strategy – Cloud and Enterprise, Data Platforms Group, Intel. “Through collaboration with other Presto leaders Ahana, Bytedance, and Meta on this project, we’ve been able to innovate at a much faster pace to bring a better and faster Presto to the community.”

Ahana Customers Speaking at PrestoCon

Ahana Sessions at PrestoCon

  • Authoring Presto with AWS Lake Formation by Jalpreet Singh Nanda, software engineer, Ahana and Roy Hasson, Principal Product Manager, Amazon 
  • Updates from the New PrestoDB C++ Execution Engine by Deepak Majeti, principal engineer, Ahana and Dave Cohen, senior principal engineer, Intel
  • Presto Authorization with Apache Ranger by Reetika Agrawal, software engineer, Ahana
  • Top 10 Presto Features for the Cloud by Dipti Borkar, cofounder & CPO, Ahana

Additionally, industry leaders Bytedance (TikTok), Databricks, Meta, Uber, Tencent, and Twitter will be sharing the latest innovation in the Presto project, including Presto Iceberg Connector, Presto on Velox, Presto on Kafka, new Materialized View in Presto, Data Lake Connectors for Presto, Presto on Elastic Capacity, and Presto Authorization with Apache Ranger.

View all the sessions in the full program schedule.  

PrestoCon is a free virtual event and registration is open.

Other Resources

Tweet this: @AhanaIO announces its participation in #PrestoCon #cloud #opensource #analytics #presto https://bit.ly/3l50AwJ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is Presto on Spark?

Presto Queries

Overview

Presto was originally designed to run interactive queries against data warehouses. However, now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. Popular workloads on data lakes include:

1. Reporting and dashboarding

This includes serving custom reporting for both internal and external developers for business insights; many organizations also use Presto for interactive A/B testing analytics. A defining characteristic of this use case is a requirement for low latency: tens to hundreds of milliseconds at very high QPS. Not surprisingly, this use case uses Presto almost exclusively, since that is what Presto was designed for.

2. Data science with SQL notebooks

This use case is one of ad hoc analysis and typically needs moderate latency, ranging from seconds to minutes. These are the queries of data scientists and business analysts who want to perform compact ad hoc analysis to understand product usage, for example user trends and how to improve the product. The QPS is relatively lower because users have to manually initiate these queries.

3. Batch processing for large data pipelines

These are scheduled jobs that run every day, hour, or whenever the data is ready. They often contain queries over very large volumes of data: latency can reach tens of hours, processing can range from CPU-days to CPU-years, and data volumes from terabytes to petabytes.

Presto works exceptionally well today for ad hoc or interactive queries, and even some batch queries, with the constraint that the entire query must fit in memory and run quickly enough that fault tolerance is not required. Most ETL batch workloads that don’t fit in this box run on “very big data” compute engines like Apache Spark. Having multiple compute engines with different SQL dialects and APIs makes managing and scaling these workloads complicated for data platform teams. Hence, Facebook decided to simplify this and build Presto on Spark as the path to further scale Presto. Before we get into Presto on Spark, let me explain a bit more about the architecture of each of these two popular engines.

Presto’s Architecture


Presto is designed for low latency and follows the classic MPP architecture; it uses in-memory streaming shuffle to achieve low latency. Presto has a single shared coordinator per cluster with an associated pool of workers. Presto tries to schedule as many queries as possible on the same Presto worker (shared executor), in order to support multi-tenancy.

This architecture provides very low latency scheduling of tasks and allows concurrent processing of multiple stages of a query, but the tradeoff is that the coordinator is a single point of failure and a bottleneck, and queries are poorly isolated across the entire cluster.

Additionally, streaming shuffle does not allow for much fault tolerance, further impacting the reliability of long-running queries.

Spark’s Architecture


On the other hand, Apache Spark was designed for scalability from the very beginning, and it implements a MapReduce architecture. Shuffle is fully materialized to disk between stages of execution, with the capability to preempt or restart any task. Spark maintains an isolated Driver to coordinate each query and runs tasks in isolated containers scheduled on demand. These differences improve reliability and reduce overall operational overhead.

Why isn’t Presto alone a good fit for batch workloads?

Scaling an MPP-architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1]. To simplify this, let’s examine the aggregation query below. Essentially this query goes over the orders table in TPCH, groups on the customer key, and sums the total price. Presto leverages in-memory shuffle: after reading the data, it shuffles on the customer key and then aggregates for the same key on each worker.
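Based on that description, the query looks roughly like this (a sketch assuming the standard TPCH orders schema, where the customer key column is custkey and the price column is totalprice):

```sql
SELECT custkey,
       sum(totalprice) AS total_price
FROM tpch.sf1.orders   -- Presto's built-in TPCH connector, scale factor 1
GROUP BY custkey;
```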


Doing an in-memory shuffle means the producer buffers data in memory and waits for the data to be fetched by the consumer. As a result, we have to execute all the tasks before and after the exchange at the same time; in MapReduce terms, all the mappers and all the reducers have to run concurrently. This makes in-memory shuffle an all-or-nothing execution model.

This makes scheduling inflexible, and scaling query size becomes more difficult because everything runs concurrently. In the aggregation phase the query may exceed the memory limit, because everything has to be held in memory in hash tables in order to track each group (custkey).

Additionally we are limited by the size of a cluster in how many nodes we can hash partition the data across to avoid having to fit it all in memory. Using distributed disk (Presto-on-Spark, Presto Unlimited) we can partition the data further and are only limited by the number of open files and even that is a limit that can be scaled quite a bit by a shuffle service.

This makes Presto difficult to scale to very large and complex batch pipelines, which can run for hours joining and aggregating over huge amounts of data. That motivated the development of Presto Unlimited, which adapts Presto’s MPP design to large ETL workloads and improves the user experience at scale.

Presto Unlimited

While Presto Unlimited solved part of the problem by allowing shuffle to be partitioned over distributed disk, it didn’t fully solve fault tolerance. Additionally, it did nothing to improve isolation and resource management.

Presto on Spark

Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library with Spark’s RDD API used to manage execution of Presto’s embedded evaluation. This is similar to how Google chose to embed F1 Query inside their MapReduce framework.

The high-level goal is to bring a fully disaggregated shuffle to Presto’s MPP runtime, and we achieved this by adding a materialization step right after the shuffle. The materialized shuffle is modeled as a temporary partition table, which brings more flexible execution after the shuffle and allows partition-level retries. With Presto on Spark, we can do a fully disaggregated shuffle on custkey for the above query on both the mapper and reducer side, which means all mappers and reducers can be independently scheduled and independently retried.


Presto On Spark at Intuit

Superglue is a homegrown tool at Intuit that helps users build, manage and monitor data pipelines. Superglue was built to democratize data for analysts and data scientists. Superglue minimizes time spent developing and debugging data pipelines, and maximizes time spent on building business insights and AI/ML.

Many analysts at Intuit use Presto (AWS Athena) to explore data in the data lake on S3. These analysts would spend several hours converting exploration SQL written for Presto into Spark SQL in order to operationalize and schedule it as data pipelines in Superglue. To minimize SQL-dialect conversion issues and the associated productivity loss for analysts, the Intuit team explored various options, including query translation, query virtualization, and Presto on Spark. After a quick POC, Intuit decided to go with Presto on Spark, because it leverages Presto’s compiler/evaluation as a library (no query conversion is required) along with Spark’s scalable data processing capabilities.

Presto on Spark is now in production at Intuit. Within three months, hundreds of critical pipelines with thousands of jobs were running on Presto on Spark via Superglue.

Presto on Spark runs as a library that is submitted with spark-submit or Jar Task on the Spark cluster. Scheduled batch data pipelines are launched on ephemeral clusters to take advantage of resource isolation, manage cost, and minimize operational overhead. DDL statements are executed against Hive and DML statements are executed against Presto. This enables analysts to write Hive-compatible DDL and the user experience remains unchanged.

This solution helped enable a performant and scalable platform with seamless end-to-end experience for analysts to explore and process data. It thereby improved analysts’ productivity and empowered them to deliver insights at high speed.

When To Use Presto on Spark

Spark is the tool of choice across the industry for running large-scale, complex batch ETL pipelines. Presto on Spark heavily benefits pipelines written in Presto that operate on terabytes or petabytes of data, taking advantage of Spark’s large-scale processing capabilities. The biggest win is that no query conversion is required, and you can leverage Spark for:

  • Scaling to larger data volumes
  • Scaling Presto’s resource management to larger clusters
  • Increasing the reliability and elasticity of Presto as a compute engine

Why Presto on Spark matters

We tried to achieve the following to adapt ‘Presto on Spark’ to Internet-scale batch workloads [2]:

  • Fully disaggregated shuffles
  • Isolated executors
  • Presto resource management, Different Scheduler, Speculative Execution, etc.

A unified option for batch and ad hoc data processing is important for creating the experience of queries that scale: queries can grow without requiring rewrites between different SQL dialects. We believe this is only a first step towards more confluence between the Spark and Presto communities, and a major step towards a unified SQL experience across interactive and batch use cases. Today internet giants like Facebook have moved over to Presto on Spark, and we have seen many organizations, including Intuit, start running their complex data pipelines in production with Presto on Spark.

“Presto on Spark” is one of the most active development areas in Presto. Feel free to check it out, and please give it a star! If you have any questions, ask in the PrestoDB Slack Channel.

Reference

[1] MapReduce: Simplified Data Processing on Large Clusters 

[2] Presto-on-Spark: A Tale of Two Computation Engines


Ahana Achieves AWS Data & Analytics ISV Competency Status

AWS ISV Technology Partner demonstrates AWS technical expertise 

and proven customer success

San Mateo, Calif. – November 10, 2021 Ahana, the Presto company, today announced that it has achieved Amazon Web Services (AWS) Data & Analytics ISV Competency status. This designation recognizes that Ahana has demonstrated technical proficiency and proven success in helping customers evaluate and use the tools, techniques, and technologies of working with data productively, at any scale, to successfully achieve their data and analytics goals on AWS.

Achieving the AWS Data & Analytics ISV Competency differentiates Ahana as an AWS ISV Partner in the AWS Partner Network (APN) that possesses deep domain expertise in data analytics platforms based on the open source Presto SQL distributed query engine, having developed innovative technology and solutions that leverage AWS services.

AWS enables scalable, flexible, and cost-effective solutions from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise. 

“Ahana is proud to achieve the AWS Data & Analytics ISV Competency, which adds to our AWS Global Startups and AWS ISV Accelerate Partner status,” said Steven Mih, Co-Founder and CEO at Ahana. “Our team is dedicated to helping companies bring SQL to their AWS S3 data lake for faster time-to-insights by leveraging the agility, breadth of services, and pace of innovation that AWS provides.”

TWEET THIS: @Ahana Cloud for #Presto achieves AWS Data and #Analytics Competency Status #OpenSource #Cloud https://bit.ly/3EZXpy1

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Tutorial: How to define SQL functions with Presto across all connectors


Presto is a widely used SQL query engine for data lakes that comes equipped with many built-in functions to serve most use cases. However, there may be certain situations where you need to define your own function. Luckily, Presto allows you to define dynamic expressions as SQL functions, which are stored separately from the Presto source code and managed by a functions namespace manager. You can even set up this manager with a MySQL database. This feature is so popular at Facebook that they have over a thousand functions defined in their instance of Presto.

Function Namespace Manager

By organizing data in catalogs and schemas, Presto allows users to easily access and manipulate data from multiple sources as if they were a single database.

A catalog is a logical namespace that represents a collection of data sources that can be queried in Presto. Each catalog contains one or more schemas, which are essentially named containers that hold tables, views, and other objects.

A function namespace is a special catalog.schema that stores functions, in a format like mysql.test. Each catalog.schema can be made a function namespace. A function namespace manager is a type of plugin that handles a collection of these function catalog schemas. Catalogs can be assigned to connectors in Presto, which allows the Presto engine to carry out tasks like creating, modifying, and deleting functions.

This user-defined function management is separated from the connector API for flexibility, so these SQL functions can be used across all connectors. Further, a query is guaranteed to use the same version of a function throughout its execution, and any modification to functions is versioned. 

Implementation

Today, the function namespace manager is implemented with the help of MySQL, so users need a running MySQL service to initialize the MySQL-based function namespace manager. 

Step 1: Provision a MySQL server and generate a JDBC URL for further access.

Suppose the MySQL server can be reached at localhost:3306; an example database URL would be:

jdbc:mysql://localhost:3306/presto?user=root&password=password

Step 2: Create a database in MySQL to store function namespace manager data

 CREATE DATABASE presto;
 USE presto;

Step 3: Configure at Presto [2]

Create Function namespace manager configuration under etc/function-namespace/mysql.properties:

function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306/presto?user=root&password=password
function-namespaces-table-name=function_namespaces
functions-table-name=sql_functions

And restart the Presto Service.

Step 4: Create a new function namespace

After starting the Presto server, the following tables, which are used to manage function namespaces, will appear in the presto database in MySQL:

mysql> show tables;
+---------------------+
| Tables_in_presto    |
+---------------------+
| enum_types          |
| function_namespaces |
| sql_functions       |
+---------------------+
3 rows in set (0.00 sec)

To create a new function namespace “ahana.default”, insert a row into the function_namespaces table:

INSERT INTO function_namespaces (catalog_name, schema_name)
    VALUES('ahana', 'default');
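As an optional sanity check (run against the same MySQL presto database), you can confirm the namespace was registered:

```sql
SELECT catalog_name, schema_name
FROM function_namespaces;
```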

Step 5: Create a function and query from Presto [1]


Here is a simple example of a SQL function for COSECANT: 

presto>CREATE OR REPLACE FUNCTION ahana.default.cosec(x double)
RETURNS double
COMMENT 'Cosecant trigonometric function'
LANGUAGE SQL
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN 1 / sin(x);

More examples can be found at https://prestodb.io/docs/current/sql/create-function.html#examples [1]

Step 6: Apply the newly created function and SQL query


You must use the fully qualified function name when calling it in a SQL query.

Following is an example of using the cosec SQL function in a query:

presto> select ahana.default.cosec (50) as Cosec_value;
     Cosec_value     
---------------------
 -3.8113408578721053 
(1 row)

Query 20211103_211533_00002_ajuyv, FINISHED, 1 node
Splits: 33 total, 33 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Here is another simple example: an EpochTimeToLocalDate function, created under the ahana.default function namespace, that converts Unix time to the local timezone.

presto> CREATE FUNCTION ahana.default.EpochTimeToLocalDate (x bigint) 
     -> RETURNS timestamp 
     -> LANGUAGE SQL 
     -> DETERMINISTIC RETURNS NULL ON NULL INPUT 
     -> RETURN from_unixtime (x);
CREATE FUNCTION

presto> select ahana.default.EpochTimeToLocalDate(1629837828) as date;
          date           
-------------------------
 2021-08-24 13:43:48.000 
(1 row)

Note

The function-namespaces-table-name property (the name of the table that stores all the function namespaces managed by this manager) can be used if you need to instantiate multiple function namespace managers; otherwise, functions created in a single function namespace manager can be used across all databases and connectors. [2]

At Ahana we have simplified all of these steps: the MySQL container, schemas, databases, tables, and the additional configuration required to manage functions. Data platform users just need to create their own SQL functions and use them in SQL queries; there is no need to worry about provisioning and managing additional MySQL servers. 

Future Roadmap

Remote function support with the remote UDF Thrift API

This allows you to run arbitrary functions that are either not safe or not possible to run within the worker JVM: unreliable Java functions, C++, Python, etc.

References

[1] DDL Syntax to use FUNCTIONS

[2] Function Namespace Manager Documentation

Ahana Cofounder Will Present Session At Next Gen Big Data Platforms Meetup hosted by LinkedIn About Open Data Lake Analytics


San Mateo, Calif. – November 2, 2021 — Ahana, the Presto company, today announced that its Cofounder and Chief Product Officer Dipti Borkar will present a session at Next Gen Big Data Platforms Meetup hosted by LinkedIn about open data lake analytics. The event is being held on Wednesday, November 10, 2021.

Session Title: “Unlock the Value of Data with Open Data Lake Analytics.”

Session Time: Wednesday, November 10 at 4:10 pm PT / 7:10 pm ET

Session Presenter: Ahana Cofounder and Chief Product Officer and Presto Foundation Chairperson, Outreach Team, Dipti Borkar

Session Details: Favored for its affordability, data lake storage is becoming standard practice as data volumes continue to grow. Data platform teams are increasingly looking at data lakes and building advanced analytical stacks around them with open source and open formats to future-proof their platforms. This meetup will help you gain clarity around the choices available for data analytics and the next generation of the analytics stack with open data lakes. The presentation will cover: the generations of analytics, choosing data lakes vs. data warehouses, how these approaches differ from the Hadoop generation, why open matters, use cases and workloads for data lakes, and an intro to the data lakehouse stack. 

To register for the Next Gen Big Data Platforms Meetup, please visit the event registration page.

TWEET THIS: @Ahana to present at Next Gen Big Data Platforms Meetup about Open Data Lake Analytics #Presto #OpenSource #Analytics #Cloud https://bit.ly/3vXKl8S

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Presto SQL Engine

How to Manage Presto Queries Running Slow

There are a few reasons why Presto queries might run slower than expected. Below we share some diagnostic steps and Presto tuning tips, as well as possible solutions, to help you address the issue and improve performance.

Troubleshoot Presto Queries

  1. How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analyzing, planning and optimizing queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note: Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single-coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help. You may also run into issues when many Presto users share the cluster.
  2. Have you configured Presto’s memory usage correctly? It is often necessary to change the default memory configuration based on your cluster’s capacity. The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
  3. What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are I/O intensive, so the amount of network I/O available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network I/O instances for the workers – for example, on AWS you can do this by checking each instance type’s “network performance” rating, as published for the m4 family and others.
  4. Optimize your metadata / data catalog: Using Presto’s Hive connector for your metastore, as many users do, means practically every query will access the Hive metastore for table and partition details. During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider:
    • Setting up multiple catalogs. Configure PrestoDB to use multiple Thrift metastore endpoints – Presto’s Hive connector supports configuring multiple Hive metastore endpoints, which are tried in round-robin by the coordinator. See https://prestodb.io/docs/current/connector/hive.html 
    • Enabling Hive metastore caching, and carefully tweaking the cache eviction configuration and TTLs to suit your data refresh policies.
  5. Do you have a separate coordinator node? Keep in mind you can have a single node act as both coordinator and worker, which can be useful for tiny clusters like testing sandboxes, but it’s obviously not optimal in terms of performance. It is nearly always recommended to run the coordinator on a separate node from the workers for anything other than sandbox use. Tip: check each node’s Presto etc/config.properties file to determine which one is the coordinator (look for coordinator=true).
  6. Is memory exhausted? If so, this will delay your Presto queries and affect performance. Presto uses an in-memory, pipelined processing architecture, and its operation depends on the available JVM heap, which in turn depends on how much memory Presto is configured to use and how much memory is physically available on the server or instance it is running on.
    • The workers can be memory hungry when processing very large Presto queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary, and switch to a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload and the resources available. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. 
    • The good news is that memory information is available in at least two places:
      • Presto’s built-in JMX catalog can help you monitor memory usage with various counters. Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking
      • There is also the Presto Console, which reveals, for each query, the reserved, peak and cumulative memory usage.
  7. When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can resolve all sorts of issues, including memory leaks and garbage collection problems, which in turn can speed up your Presto queries.
  8. Is your Presto cluster configured for autoscaling based on CPU usage? If so, check that the configuration is what you expect it to be.
  9. Do I/O and CPU utilization look balanced? Check CPU usage on the Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads can be increased, or that the number of splits in a batch is not high enough.
  10. Have you checked your data volumes recently? An obvious one to check, but data volumes can grow in fits and starts, and peaks sometimes occur unexpectedly. Your Presto queries may simply be taking longer because there is x% more data than last month.
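The jvm.config rule of thumb above (give the Presto JVM roughly 80% of node memory, then tune) can be sketched as a small helper. This is only an illustration of the heuristic, not a hard limit:

```python
def xmx_setting(node_mem_gb, fraction=0.8):
    """Return a jvm.config -Xmx flag giving Presto ~80% of node memory.

    Adjust the fraction later based on monitoring of your workloads.
    """
    return f"-Xmx{int(node_mem_gb * fraction)}G"

# e.g. a 64 GB worker
print(xmx_setting(64))  # -Xmx51G
```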

There are other configuration settings for task concurrency, initial splits per node, join strategy, driver tasks, and more: PrestoDB has around 82 system configuration settings and 50+ Hive configuration settings, many tweakable at the query level. These are for advanced users, however, and fall outside the scope of this article; making alterations here without care can slow down your Presto queries. More information can be found in the PrestoDB documentation.

As you can tell, there’s a lot to configure and tune when it comes to addressing Presto performance issues. To make it easier, you can use Ahana Cloud, a SaaS offering for Presto. It’s available in the AWS Marketplace and is pay-as-you-go. Check out our free trial at https://ahana.io/sign-up


Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.

Presto 105: Running Presto with AWS Glue as catalog on your Laptop

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto 104: Running Presto with Hive Metastore

Presto is an open source distributed SQL query engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.

As mentioned in the Presto 104 tutorial, Presto is a disaggregated database engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but not other components of the database, including the system catalog. In the data lake world, the system catalog, where the database schema resides, is simply called a catalog. Two popular catalogs have emerged: the Hive Metastore and the AWS Glue catalog.

What is AWS Glue?

AWS Glue is a serverless data integration service provided by AWS that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog maps database tables and columns to the objects or files that reside in the data lake – for example, immutable objects in AWS S3.

In this tutorial, we will focus on using Presto with the AWS Glue on your laptop.   

This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.

Implementation steps

Step 1: 

Create a Docker network namespace so that the containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2: 

Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the command below to download the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete                                                          

e8f8aa933633: Pull complete                                                          
b7cf38297b9f: Pull complete                                                          
a4205d42b3be: Pull complete                                                          
81b659bbad2f: Pull complete                                                          
ef606708339: Pull complete                                                          
979857535547: Pull complete                                                          
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the PrestoDB sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run on this container, without a worker node, so let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the Docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator with the worker service disabled.

# docker restart coordinator

Step 6:

Create three more containers using the ahanaio/prestodb-sandbox image.

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3
ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively for each worker, and finally point discovery.uri at the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install the AWS CLI and configure AWS Glue access on the coordinator and worker containers.

# yum install -y aws-cli

Step 9: 

Create a glue user and attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies.

aws iam create-user --user-name glueuser
{
    "User": {
        "Path": "/",
        "UserName": "glueuser",
        "UserId": "AXXXXXXXXXXXXXXXX",
        "Arn": "arn:aws:iam::XXXXXXXXXX:user/glueuser",
        "CreateDate": "2021-10-07T01:07:28+00:00"
    }
}

aws iam list-policies | grep AmazonS3FullAccess
            "PolicyName": "AmazonS3FullAccess",
            "Arn": "arn:aws:iam::aws:policy/AmazonS3FullAccess",

aws iam list-policies | grep AWSGlueConsoleFullAccess
            "PolicyName": "AWSGlueConsoleFullAccess",
            "Arn": "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

Step 10:

Create access key

% aws iam create-access-key --user-name glueuser
{
   "AccessKey": {
       "UserName": "glueuser",
        "AccessKeyId": "XXXXXXXXXXXXXXXXXX", 
       "Status": "Active",
        "SecretAccessKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "CreateDate": "2021-10-13T01:50:45+00:00"
    }
}

Step 11:

Run aws configure and enter the access key and secret key generated in the previous step.

aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXX
AWS Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]:

Step 12:

Create the /opt/presto-server/etc/catalog/glue.properties file to add the AWS Glue properties to Presto; this file needs to be added on both the coordinator and the worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.

connector.name=hive-hadoop2
hive.metastore=glue
hive.non-managed-table-writes-enabled=true
hive.metastore.glue.region=us-east-2
hive.metastore.glue.aws-access-key=<your AWS key>
hive.metastore.glue.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run presto-cli with glue as the catalog.

bash-4.2# presto-cli --server localhost:8080 --catalog glue

Step 15:

Create a schema using S3 location.

presto:default> create schema glue.demo with (location= 's3://Your_Bucket_Name/demo');
CREATE SCHEMA
presto:default> use demo;

Step 16:

Create table under glue.demo schema

presto:demo> create table glue.demo.part with (format='parquet') AS select * from tpch.tiny.part;
CREATE TABLE: 2000 rows
    
Query 20211013_034514_00009_6hkhg, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:06 [2K rows, 0B] [343 rows/s, 0B/s]

Step 17:

Run select statement on the newly created table.

presto:demo> select * from glue.demo.part limit 10; 
partkey |                   name                   |      mfgr      |  brand
---------+------------------------------------------+----------------+---------
       1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13
       2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13
       3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42
       4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34
       5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32
       6 | bisque cornflower lawn forest magenta    | Manufacturer#2 | Brand#24
       7 | moccasin green thistle khaki floral      | Manufacturer#1 | Brand#11
       8 | misty lace thistle snow royal            | Manufacturer#4 | Brand#44
       9 | thistle dim navajo dark gainsboro        | Manufacturer#4 | Brand#43
      10 | linen pink saddle puff powder            | Manufacturer#5 | Brand#54

Summary

In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.


0 to Presto in 30 minutes with AWS & Ahana Cloud

On-Demand Webinar

Data lakes are widely used and have become extremely affordable, especially with the advent of technologies like AWS S3. During this webinar, Gary Stafford, Solutions Architect at AWS, and Dipti Borkar, Cofounder & CPO at Ahana, will share how to build an open data lake stack with Presto and AWS S3.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. 

In this webinar, you’ll learn:

  • What an Open Data Lake Analytics stack is
  • How you can use Presto to underpin that stack in AWS
  • A demo on how to get started building your Open Data Lake Analytics stack in AWS

Speakers

Gary Stafford

Solutions Architect, AWS


Dipti Borkar

Cofounder & CPO, Ahana


Webinar On-Demand
How to Build an Open Data Lake Analytics Stack

While data lakes are widely used and extremely affordable, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake?

The answer is the Open Data Lake Analytics stack. In this webinar, we’ll discuss how to build this stack using 4 key components: open source technologies, open formats, open interfaces & open cloud. Additionally, you’ll learn why open source Presto has become the de facto query engine for the data lake, enabling ad hoc data discovery using SQL.

You’ll learn:

• What an Open Data Lake Analytics Stack is

• How Presto, the de facto query engine for the data lake, underpins that stack

• How to get started building your open data lake analytics stack today

Speaker

Dipti Borkar

Cofounder & CPO, Ahana


Presto 104: Running Presto with Hive Metastore on your Laptop

Introduction

This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto is an open source distributed SQL query engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.

Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but not other components of the database, including the system catalog. In the data lake world, the system catalog, where the database schema resides, lives in what is called a catalog. Two popular catalogs have emerged. From the Hadoop world, the Hive Metastore continues to be widely used. Note that this is different from the Hive query engine; this is the system catalog, where information about table schemas and their locations lives. In AWS, the Glue catalog is also very popular. 

In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.   

What is the Hive Metastore?

The Hive Metastore maps database tables and columns to the objects or files that reside in the data lake. This could be a file system when using HDFS, or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workloads, Ahana Cloud, which provides Presto as a managed service with a Hive Metastore, is a good choice if you are looking for an easy and performant solution for SQL on AWS S3.


Implementation steps

Step 1

Create a Docker network namespace so that the containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the command below to download the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the PrestoDB sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator
ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run on this container, without a worker node, so let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the Docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator with the worker service disabled.

# docker restart coordinator

Step 6:

Create three more containers using the ahanaio/prestodb-sandbox image.

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3  ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively for each worker, and finally point discovery.uri at the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install and configure Hive on the coordinator container.

Install wget, procps, tar, and less.

# yum install -y wget procps tar less

Step 9:

Download and install the Hive and Hadoop packages, and set the HOME and PATH variables for Java, Hive, and Hadoop.

#HIVE_BIN=https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
#HADOOP_BIN=https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz


#wget --quiet ${HIVE_BIN}
#wget --quiet ${HADOOP_BIN}


#tar -xf apache-hive-3.1.2-bin.tar.gz -C /opt
#tar -xf hadoop-3.3.1.tar.gz -C /opt
#mv /opt/apache-hive-3.1.2-bin /opt/hive
#mv /opt/hadoop-3.3.1 /opt/hadoop


#export JAVA_HOME=/usr
#export HIVE_HOME=/opt/hive
#export HADOOP_HOME=/opt/hadoop
#export PATH=$PATH:${HADOOP_HOME}:${HADOOP_HOME}/bin:${HIVE_HOME}/bin:.
#cd /opt/hive

Step 10:

Download additional jars needed to run with S3

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.10.6/aws-java-sdk-core-1.10.6.jar

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar

#wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.jar

#cp aws-java-sdk-core-1.10.6.jar /opt/hadoop/share/hadoop/tools/lib/
#cp aws-java-sdk-s3-1.10.6.jar  /opt/hadoop/share/hadoop/tools/lib/
#cp hadoop-aws-2.8.4.jar  /opt/hadoop/share/hadoop/tools/lib/

echo "export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar" >> /opt/hive/conf/hive-env.sh

Step 11:

Configure and start hive

cp /opt/hive/conf/hive-default.xml.template /opt/hive/conf/hive-site.xml
mkdir -p /opt/hive/hcatalog/var/log
bin/schematool -dbType derby -initSchema
bin/hcatalog/sbin/hcat_server.sh start

Step 12:

Create the /opt/presto-server/etc/catalog/hive.properties file to add the Hive endpoint to Presto; this file needs to be added on both the coordinator and the worker containers.

If you choose to validate using an AWS S3 bucket, provide the security credentials for it.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Key>
hive.s3.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run presto-cli with hive as the catalog.

bash-4.2# presto-cli --server localhost:8080 --catalog hive

Step 15:

Create a schema using a local or S3 location.

presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;

If you have access to an S3 bucket, use the following create command with s3 as the destination. 

presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;

Step 16:

Hive has the option to create two types of tables:

  • Managed tables 
  • External tables

Managed tables are tightly coupled with the data at the destination, which means that if you delete a table, the associated data will also be deleted.

External tables are loosely coupled with the data; they maintain a pointer to it, so deleting the table will not delete the data at the external location.

Transactional (ACID) semantics are only supported on managed tables.

We will create a managed table under the hive.tpch schema.

presto:tpch> create table hive.tpch.lineitem with (format='PARQUET') AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows
Query 20210921_051649_00015_uvkq7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1:48 [6M rows, 0B] [55.4K rows/s, 0B/s]

Step 17:

Describe the table to view its schema.

presto> desc hive.tpch.lineitem     
-> ;    
Column     |    Type     | Extra | Comment
---------------+-------------+-------+--------- 
orderkey      | bigint      |       | 
partkey       | bigint      |       | 
suppkey       | bigint      |       | 
linenumber    | integer     |       | 
quantity      | double      |       | 
extendedprice | double      |       | 
discount      | double      |       | 
tax           | double      |       | 
returnflag    | varchar(1)  |       | 
linestatus    | varchar(1)  |       | 
shipdate      | date        |       | 
commitdate    | date        |       | 
receiptdate   | date        |       | 
shipinstruct  | varchar(25) |       | 
shipmode      | varchar(10) |       | 
comment       | varchar(44) |       |
(16 rows)
Query 20210922_224518_00002_mfm8x, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:08 [16 rows, 1.04KB] [1 rows/s, 129B/s]

Summary

In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.


Webinar On-Demand
Unlocking the Value of Your Data Lake

Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.

During this webinar, Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.

Dipti will cover:

  • Open Data Lake analytics – what it is and what use cases it supports
  • Why companies are moving to an open data lake analytics approach
  • Why the open source data lake query engine Presto is critical to this approach

Speaker

Dipti Borkar

Cofounder & CPO, Ahana


Connect Superset to Presto

Presto with Superset

This blog post will provide you with an understanding of how to connect Superset to Presto.

TL;DR

Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.
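To make the TL;DR concrete, here is a minimal Python sketch that assembles such a URI; the user, host, and catalog values are hypothetical placeholders, not a real cluster:

```python
def presto_uri(user, host, port, catalog, password=None):
    """Assemble the SQLAlchemy URI Superset expects for a Presto catalog.

    Clusters without authentication simply omit the password, dropping the
    ":password" segment from the URI.
    """
    auth = f"{user}:{password}" if password else user
    return f"presto://{auth}@{host}:{port}/{catalog}"

print(presto_uri("analyst", "presto.example.com", 8080, "glue"))
# presto://analyst@presto.example.com:8080/glue
```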

Superset and SQLAlchemy

Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer over relational data sources. Superset uses a SQLAlchemy URI as the connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.

SQLAlchemy defines a dialect as the system it uses to communicate with the specifics of various databases (e.g. the flavor of SQL) and their DB-APIs, the low-level Python APIs used to talk to specific relational data sources. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver used to connect to Presto. A single dialect can choose between multiple DB-API drivers; for example, the PostgreSQL dialect supports the following DB-API drivers: psycopg2, pg8000, psycopg2cffi, and pygresql. Typically, a single DB-API driver is set as the default for a dialect and is used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.

The term database can be confusing since it is heavily overloaded. A given data source, such as PostgreSQL, can have multiple logical groupings of tables, which are called “databases”. In a way, these “databases” provide namespaces for tables; identically named tables can exist in two different “databases” without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.

In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.

superset@localhost:superset> \l

+-----------+----------+----------+------------+------------+-----------------------+
| Name      | Owner    | Encoding | Collate    | Ctype      | Access privileges     |
|-----------+----------+----------+------------+------------+-----------------------|
| postgres  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| superset  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| template0 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
| template1 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
+-----------+----------+----------+------------+------------+-----------------------+

We can look into the superset database and see the tables in that database.

The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that is referred to in a particular dialect.

superset@localhost:superset> \c superset

You are now connected to database "superset" as user "superset"

+--------+----------------------------+-------+----------+
| Schema | Name                       | Type  | Owner    |
|--------+----------------------------+-------+----------|
| public | Clean                      | table | superset |
| public | FCC 2018 Survey            | table | superset |
| public | ab_permission              | table | superset |
| public | ab_permission_view         | table | superset |
| public | ab_permission_view_role    | table | superset |
| public | ab_register_user           | table | superset |
| public | ab_role                    | table | superset |
| public | ab_user                    | table | superset |
| public | ab_user_role               | table | superset |
| public | ab_view_menu               | table | superset |
| public | access_request             | table | superset |
| public | alembic_version            | table | superset |
| public | alert_logs                 | table | superset |
| public | alert_owner                | table | superset |
| public | alerts                     | table | superset |
| public | annotation                 | table | superset |
| public | annotation_layer           | table | superset |
| public | bart_lines                 | table | superset |
| public | birth_france_by_region     | table | superset |
| public | birth_names                | table | superset |
| public | cache_keys                 | table | superset |
| public | channel_members            | table | superset |
| public | channels                   | table | superset |
| public | cleaned_sales_data         | table | superset |
| public | clusters                   | table | superset |
| public | columns                    | table | superset |
| public | covid_vaccines             | table | superset |
:

With an understanding of dialects, drivers, and databases under our belt, let’s solidify it with a few examples. Let’s assume we want to create a Superset database to a PostgreSQL data source, and in particular the PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432, and we will log in as sonny (password is foobar). Here are three SQLAlchemy URIs we could use (adapted from the SQLAlchemy documentation):

  1. postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
  2. postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
  3. postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.
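To see how the scheme portion is read, here is a small sketch that splits a URI into its dialect and optional driver. This is plain string handling for illustration, not SQLAlchemy’s actual parser:

```python
# Split the scheme of a SQLAlchemy URI into dialect and optional driver.
# A missing driver means SQLAlchemy falls back to the dialect's default
# (psycopg2 for postgresql).
def dialect_and_driver(uri):
    scheme = uri.split("://", 1)[0]
    if "+" in scheme:
        dialect, driver = scheme.split("+", 1)
        return dialect, driver
    return scheme, None

print(dialect_and_driver("postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase"))
# ('postgresql', 'pg8000')
print(dialect_and_driver("postgresql://sonny:foobar@pghost:5432/mydatabase"))
# ('postgresql', None)
```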

Superset lists its recommended Python packages for database drivers in the public documentation.

Presto Catalogs

Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.

In Presto, the equivalent notion of a “database” (i.e. a logical collection of tables) is called a schema. Access to a specific schema (“database”) in a data source is defined in a catalog.

As an example, the listing below is the equivalent catalog configuration to connect to the example mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table would be select * from postgresql.mydatabase.Clean.

connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar
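Since catalog files are plain Java properties, rendering one programmatically is straightforward. Below is a sketch; the helper name is ours, and the values simply mirror the mydatabase example rather than any real endpoint:

```python
# Render a Presto PostgreSQL catalog file like the listing above.
# All connection values are the illustrative ones from this post.
def postgresql_catalog(host, port, database, user, password):
    return "\n".join([
        "connector.name=postgresql",
        f"connection-url=jdbc:postgresql://{host}:{port}/{database}",
        f"connection-user={user}",
        f"connection-password={password}",
    ])

print(postgresql_catalog("pghost", 5432, "mydatabase", "sonny", "foobar"))
```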

Superset to Presto

Going back to Superset, to create a Superset database that connects to Presto, we specify the Presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we need to provide (and authenticate against) are the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto, through its catalog configuration, authenticates to the backing data source with the appropriate credentials (e.g. sonny and foobar). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>

The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.

New Superset Database Connection UI

In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.

Try It Out!

In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto Tutorial 103: PrestoDB cluster on GCP

Introduction

This tutorial is Part III of our Getting started with PrestoDB series. As a reminder, Prestodb is an open source distributed SQL query engine. In tutorial 102 we covered how to run a three node prestodb cluster on a laptop. In this tutorial, we’ll show you how to run a prestodb cluster in a GCP environment using VM instances and GKE containers.

Environment

This guide was developed on GCP VM instances and GKE containers.

Presto on GCP with VMs

Implementation steps for prestodb on vm instances

Step 1: Create a GCP VM instance using the CREATE INSTANCE tab and name it presto-coordinator. Next, create three more VM instances named presto-worker1, presto-worker2 and presto-worker3 respectively.

Step 2: By default, GCP blocks all network ports, so prestodb will need ports 8080-8083 enabled. Use the firewall rules tab to enable them.

Step 3: 

Install Java and Python.

Step 4:

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  721M  100  721M    0     0   245M      0  0:00:02  0:00:02 --:--:--  245M
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 12.7M  100 12.7M    0     0  15.2M      0 --:--:-- --:--:-- --:--:-- 15.1M
user@presto-coordinator-1:~$

Step 5:

Use gunzip and tar to unzip and untar the presto-server

user@presto-coordinator-1:~$ gunzip presto-server-0.235.1.tar.gz; tar -xf presto-server-0.235.1.tar

Step 6: (optional)

Rename the directory without version number

user@presto-coordinator-1:~$ mv presto-server-0.235.1 presto-server

Step 7:  

Create etc, etc/catalog and data directories

user@presto-coordinator-1:~/presto-server$ mkdir etc etc/catalog data

Step 8:

Define the etc/node.properties, etc/config.properties, etc/jvm.config, etc/log.properties and etc/catalog/jmx.properties files as below for the Presto coordinator server.

user@presto-coordinator-1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/user/presto-server/data

user@presto-coordinator-1:~/presto-server$ cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

user@presto-coordinator-1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-coordinator-1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-coordinator-1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step 9:

Check the cluster UI status. It should show the active worker count as 0, since we have enabled only the coordinator.

Step 10: 

Repeat steps 3 to 8 on the remaining three VM instances, which will act as worker nodes.

For the worker nodes, set coordinator to false and set http-server.http.port to 8081, 8082 and 8083 for worker1, worker2 and worker3 respectively.

Also make sure node.id and http-server.http.port are different for each worker node.

user@presto-worker1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffffe
node.data-dir=/home/user/presto-server/data
user@presto-worker1:~/presto-server$ cat etc/config.properties
coordinator=false
http-server.http.port=8083
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://presto-coordinator-1:8080

user@presto-worker1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-worker1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-worker1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx
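Since only the coordinator flag, node.id, and http-server.http.port change between workers, the per-worker config.properties content can be generated rather than hand-edited. A small sketch, assuming the coordinator hostname presto-coordinator-1 and the 8080 + N port scheme used in this tutorial:

```python
# Generate config.properties for worker N (N = 1, 2, 3); worker N
# listens on port 8080 + N, matching the tutorial's port assignments.
def worker_config(n, coordinator="presto-coordinator-1"):
    return "\n".join([
        "coordinator=false",
        f"http-server.http.port={8080 + n}",
        "query.max-memory=50GB",
        "query.max-memory-per-node=1GB",
        "query.max-total-memory-per-node=2GB",
        f"discovery.uri=http://{coordinator}:8080",
    ])

print(worker_config(3))  # config for presto-worker3, port 8083
```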

Step 11: 

Check cluster status, it should reflect the three worker nodes as part of the prestodb cluster.

Step 12:

Verify the prestodb environment by running the Presto CLI with a simple JMX query.

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator


Implementation steps for Prestodb on GKE containers

Step 1: