Blog Archive

AWS Athena Alternatives: Best Amazon Athena Alternatives

This is the 4th blog in our comparing AWS Athena to PrestoDB series. If you missed the others, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits
Part 3: AWS Athena vs. PrestoDB Blog Series: Athena Partition Limits

If you’re looking for Amazon Athena alternatives, you’ve come to the right place. In this blog post, we’ll explore some of the best AWS Athena alternatives out there.

Athena is a great tool for querying data stored in S3 – typically in a data lake or data lakehouse architecture – but it’s not the only option out there. There are a number of other alternatives that you might want to consider, including serverless options such as Ahana or Presto, as well as cloud data warehouses.

Each of these tools has its own strengths and weaknesses, and really the best choice depends on the data you have and what you want to do with it. In this blog post, we’ll compare Athena with each of these other options to help you make the best decision for your data.

What is AWS Athena?

AWS Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Amazon Athena is great for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine 2) also provides federated query capabilities, which allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.
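
For example, a single federated query can join clickstream data on S3 with customer records in a relational database. The sketch below is illustrative only: the catalog, schema, and table names (s3_lake, mysql_crm, and so on) are hypothetical and stand in for data sources you would have configured.

-- Hypothetical federated query joining S3 data with a MySQL table.
-- Catalog, schema, and table names are placeholders.
SELECT c.customer_name, COUNT(*) AS page_views
FROM s3_lake.web.clickstream v
JOIN mysql_crm.sales.customers c ON v.customer_id = c.customer_id
WHERE v.event_date >= DATE '2022-01-01'
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10;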

Why would I not want to use AWS Athena?

There are various reasons users look for alternative options to Athena, in spite of its advantages: 

  1. Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per region. If too many users hit the service at the same time in a region, users across the board start seeing query queuing and latency. Query concurrency can also be challenging because of the limits imposed on accounts to keep users from overwhelming the regional service.
  2. Cost per query: Athena charges based on terabytes of data scanned ($5 per TB). If your datasets are not very large and you don't have a lot of users querying the data often, Athena is the perfect solution for your needs. If, however, your workloads run hundreds or thousands of queries that each scan terabytes or petabytes of data, Athena may not be the most cost-effective choice: at $5 per TB, a single query scanning 10 TB costs $50, so a thousand such queries a month adds up to $50,000 in scan charges alone.
  3. Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for the queries. You have no visibility into the underlying infrastructure or even into the details as to why the query failed or how it’s performing. This visibility is important from a query tuning and consistency standpoint and even to reduce the amount of data scanned in a query.
  4. Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
  5. Feature delays: Presto is evolving at a rapid pace, with new performance features, SQL functions, and optimizations contributed regularly by the community as well as companies such as Facebook, Alibaba, and Uber. Athena only caught up to Presto version 0.217 in November 2020. With the current PrestoDB release at 0.248, if you need the performance, features, and efficiencies that newer versions provide, you are going to have to wait for some time.

What are the typical alternatives to AWS Athena?

  1. DIY open-source PrestoDB
  2. Managed Hadoop and Presto
  3. Managed Presto Service
  4. Cloud data warehouse such as Redshift or Snowflake

Depending on their business needs and the level of control desired, users leverage one or more of the following options:

DIY open-source PrestoDB

Instead of using Athena, users deploy open-source PrestoDB in their own environment (either on-premises or in the cloud). This mode of deployment gives the user the most flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage, and monitor the deployment.

Large-scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters, FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and tech-savvy startups such as Uber and Pinterest, to name a few.

The cost of managing an additional PrestoDB cluster may be incremental for a customer already running large distributed systems; for customers starting from scratch, however, it can be a dramatic increase in cost.

Managed Hadoop and Presto

Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.

AWS provides EMR (Elastic MapReduce), Google provides Dataproc, and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their version of Hadoop.

This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would traditionally do in a DIY model, but instead are only responsible for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scale-out becomes easier at the push of a button, as costs can be further optimized by autoscaling using either on-demand or spot instances.

You still need the expertise to get the most out of your deployment by tweaking configurations, instance sizes, and properties.

Managed Presto Service

If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.

Ahana Cloud provides a fully managed Presto cloud service with support for a wide range of native Presto connectors, IO caching, and configurations optimized for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana's service is cloud-native and runs on Amazon's Elastic Kubernetes Service (EKS) to provide resiliency, performance, and scalability, and it also helps reduce your operational costs.

A managed Presto Service such as Ahana gives you the visibility you need in terms of query performance, instance utilization, security, auditing, query plans as well as gives you the ability to manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimum defaults and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters. You can also scale up and down depending upon your workload needs.

Ahana is a premier member of the Linux Foundation's Presto Foundation and contributes many features back to the open-source Presto community, unlike Athena, Presto on EMR, Dataproc, and HDInsight.

Cloud Data Warehouse (Redshift, Snowflake)

Another alternative to Amazon Athena would be to use a data warehouse such as Snowflake or Redshift. This would require a paradigm shift from a decoupled open lakehouse architecture to a more traditional design pattern built around a centralized storage and compute layer.

If you don’t have a lot of data and are mainly looking to run BI-type predictable workloads (rather than interactive analytics), storing all your data in a data warehouse such as Amazon Redshift or Snowflake would be a viable option. However, companies that work with larger amounts of data and need to run more experimental types of analysis will often find that data warehouses do not provide the required scale and cost-performance benefits and will gravitate towards a data lake.

In these cases, Athena or Presto can be used in tandem with a data warehouse and data engineers can choose where to run each workload on an ad-hoc basis. In other cases, the serverless option can replace the data warehouse completely.

Presto vs Athena: To Summarize

You have a wide variety of options regarding your use of PrestoDB. 

If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you. 

On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go. 

If simplicity and accelerated go-to-market are what you seek without needing to manage a complex infrastructure, then Ahana’s Presto managed service is the way to go. Sign up for our free trial today.

We also have a case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment. You can download it here.

Related Articles

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

There can be some confusion with the difference between AWS Redshift Spectrum and AWS Athena. Learn more about the differences in this article.

AWS Athena vs AWS Glue: What Are The Differences?

Here, we talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

Presto SQL Syntax: Learn to Write SQL Queries in Presto

Presto is powerful, but running it on your own can get complicated. If you’re looking for a managed Presto experience that can let you focus on querying your data rather than managing infrastructure, try Ahana Cloud today.

PrestoDB uses regular ANSI SQL to query big data stored in object storage. If you’ve used SQL in databases before, you should be very comfortable using Presto. However, there are some quirks you need to keep in mind, stemming from the fact Presto is typically used to query semi-structured storage such as Amazon S3 rather than relational databases.
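
As a quick illustration of those quirks, the hypothetical query below parses string timestamps and flattens an array column, two patterns that come up constantly when querying semi-structured data on S3. The catalog, schema, table, and column names are assumptions for the sake of the example.

-- Hypothetical query over semi-structured data stored on S3.
-- date_parse converts string timestamps; UNNEST flattens an array column.
SELECT date_parse(e.event_time, '%Y-%m-%d %H:%i:%s') AS event_ts, t.tag
FROM hive.weblogs.events e
CROSS JOIN UNNEST(e.tags) AS t (tag)
WHERE e.event_date = '2022-06-01';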

Below you'll find some of our most popular resources relating to writing Presto SQL:

Working with Date and Time Data

Working With Different Data Types

Manipulating Tables and Data

Additional Presto SQL Resources


Top 4 Amazon Redshift Alternatives & Competitors

Introduction

In the last article we discussed the fundamental problems with Amazon Redshift. To build on that article, we'll point out where to look if you're starting to explore new options. We'll walk through some of the available Redshift alternatives and why they are worth looking into. Disclaimer: the list below is not ranked in any particular order.

1. Ahana

Ahana offers the only managed service for Presto, a feature-rich, next-gen SQL query engine, in Ahana Cloud. It plays a critical role for data platform teams searching for an easy-to-use, fully integrated, cloud-native SQL engine for their AWS S3 data lakes as well as other data sources. Ahana Cloud has everything the user needs to get started with SQL on the Open Data Lakehouse. It's a great choice as a Redshift alternative, or even to augment the warehouse.

Currently, Ahana offers a free trial for its enterprise solution, as well as a free community edition.

2. BigQuery

BigQuery is another great AWS Redshift alternative. It's a cloud data warehouse for ingesting data and processing queries at scale on Google Cloud Platform. If you're on Google Cloud, it doesn't require much effort to integrate with other Google products.

You can run queries and analyze terabytes of data in seconds. BigQuery lets users leverage the power of Google's infrastructure to load data. Users can also use Google Cloud Storage to bulk load their data, or stream it in bursts of up to a thousand rows per second.

It’s supported by the BigQuery REST API that comes with client libraries like Java, PHP, and Python. While BigQuery is the most proven tool on this list, it’s not the easiest to use. If your team lacks an experienced data engineer, you’re going to have challenges as the learning curve is significant.

BigQuery pricing: queries are billed on the amount of data processed, at $5 per TB (with one free TB per month).

3. Azure SQL Data Warehouse

As a Redshift alternative, Azure is a good choice. Azure SQL Data Warehouse is well suited to large businesses in consumer goods, finance, utilities, and more. One of the most used services on Microsoft Azure, it's essentially SQL Server in the cloud, but fully managed and more intelligent.

Now absorbed into Azure Synapse Analytics, it’s a powerful cloud-based analytics platform you can use to design the data structure immediately (without worrying about potential implementation challenges). Its provisioned resources also allow users to query data quickly and at scale.

If you’re not familiar with the Azure environment, you’ll have to invest some time in understanding it. As it’s fully featured and well documented, there’s enough support to get you over the learning curve.

Like Redshift and Snowflake, Azure Synapse also follows a consumption-based pricing model, so it's best to have an experienced data engineer on board to make “reasonably accurate guesstimates” before committing.

Azure pricing: follows an hourly data consumption model (and offers a 12-month free trial).

4. Snowflake

Like Redshift, Snowflake is a robust cloud-based data warehouse built to store data for effortless analysis. Snowflake is a good Redshift alternative: developed for experienced data architects and data engineers, it leverages a SQL workbench and user permissions to allow multiple users to query and manage different types of data.

Snowflake also boasts robust data governance tools, security protocols, and the rapid allocation of resources. While the platform is powerful and efficient at managing different data types, it still proves to be a significant challenge for users who don’t hail from a strong data background.

Snowflake also lacks data integrations, so your data teams will have to use an external ETL to push the data into the warehouse. Whenever you use third-party tools, you’ll also have to consider the extra costs and overheads (such as setup and maintenance costs) that come with them.

Snowflake follows a consumption-based pricing model similar to that of Redshift. This is great for experienced users who can make an educated guess about their data consumption. Others may have to deal with an unpleasant surprise at the end of the billing cycle. For a more in-depth look at Snowflake as a competitor, check the Snowflake breakdown.

Snowflake pricing: based on a per-second data consumption model (with an option of a 30-day free trial).

Test out the Alternative

Ready to see the alternative in action?

Using a Managed Service for Presto as a Redshift Alternative

Redshift, while a fantastic tool, does have some significant issues the user is going to have to overcome. Below are the most frequently stated concerns expressed by Amazon Redshift users and the catalysts for seeking out Redshift alternatives:

Price-Performance

Redshift gets expensive quickly. As data volumes increase, the cost of storage and compute in the warehouse becomes problematic. Redshift comes with a premium cost, especially if you use Spectrum outside of AWS Redshift. A solution is to reach for a tool focused on reducing overhead cost. As a biased example, Ahana Cloud is easy to run and lets users pay only for what they use, with no upfront costs: the performance you're used to, for less.

Closed & Inflexible

Working with a data warehouse, while it has some perks, comes with drawbacks. In this environment the user loses flexibility. Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse; Redshift does not offer flexible or open data formats.

Other, more modern solutions allow the user to define and manage data sources. Ahana lets data teams attach or detach data sources from any cluster with the click of a button, and it takes care of configuring and restarting the clusters.

Vendor Lock-in

One of the biggest pain points, and a major reason to look for Redshift alternatives, is vendor lock-in. Data warehouse vendors, like AWS Redshift, make it difficult to use your data outside of their services. To do so, data would need to be pulled out of the warehouse and duplicated, further driving up compute costs. Use the tools and integrations you need to get value from your data without the proprietary data formats. Head over to this additional comparison for a solution to vendor lock-in, price-performance, and flexibility.


Summary: Redshift Alternatives

If you are using Amazon Redshift now and are looking to solve some of the problems with it, check out this on-demand webinar with instructions for augmenting your Redshift warehouse with an Open Data Lakehouse. The webinar also explains why so many of today's companies are moving away from warehouses like Snowflake and Amazon Redshift toward other Redshift alternatives – specifically, a SQL Data Lakehouse with Presto.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article


The Fundamental Problems with Amazon Redshift

In the last article we discussed the difference between the data warehouse and Redshift Spectrum. To continue on this topic, let's look at the problems with Amazon Redshift and some of the available alternatives that data teams can explore further.

Amazon Redshift made it easy for anyone to implement data warehouse use cases in the cloud. However, it can no longer deliver the same benefits as newer, more advanced cloud data warehouses. When it was a relatively new technology, everyone was going through a learning curve.

Here are some of the fundamental problems with Amazon Redshift:


AWS Redshift’s Cost

Amazon Redshift is a traditional MPP platform where the compute is closely integrated with the storage. The advantage of the cloud is that, in principle, compute and storage are completely independent of each other and storage is virtually unlimited. With Redshift, if you want more storage you have to purchase more compute power. As data volumes increase, the cost of storage and compute in the warehouse becomes challenging. AWS products, particularly Redshift and Spectrum, come at a premium, especially if you use Spectrum outside of AWS Redshift. The result is one of the most expensive cloud data warehouse solutions.

Vendor lock-in with Redshift

Data warehouse vendors, like AWS, make it difficult to use your data outside of their services. Data would need to be pulled out of the warehouse and duplicated, further driving up compute costs.

Proprietary data formats

Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. No flexible or open data formats available.

No Staging Area in Redshift

It is expensive to host data with Amazon, so duplication of data has to be avoided at all costs. In traditional RDBMS systems, we tend to have landing, staging, and warehouse layers in the same database. With Amazon's data warehouse, however, the landing and staging layers have to live on S3. Only the data on which reports and analytics will be built should be loaded into Redshift, and only on an as-needed basis rather than keeping the entire dataset in the warehouse.
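
A common pattern is to keep the raw data staged on S3 and load only the curated, report-ready subset into Redshift with a COPY command, roughly as sketched below (the bucket path, table name, and IAM role are placeholders).

-- Load only the report-ready subset from the S3 staging area into Redshift.
-- Bucket path, table name, and IAM role ARN are placeholders.
COPY analytics.daily_sales
FROM 's3://my-staging-bucket/curated/daily_sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS PARQUET;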

No Index support in Amazon Redshift

This warehouse does not support indexes like other data warehouse systems, so it is designed to perform best when you select only the columns you absolutely need to query. Because Amazon's data warehouse uses columnar storage, a construct called a distribution key is used instead: a column on which data is distributed across the different nodes of the cluster.
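
For instance, a table definition along these lines (the table and column names are illustrative) declares the distribution key and sort key up front, which is how Redshift compensates for the lack of traditional indexes.

-- Illustrative Redshift table using a distribution key and sort key
-- in place of traditional indexes.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);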

Manual house-keeping

Performance depends on regular housekeeping: maintenance such as VACUUM and ANALYZE, sort keys, compression, distribution styles, and so on.

Tasks like VACUUM and ANALYZE need to be run regularly, and they are expensive and time-consuming. There is no single frequency that suits everyone, so a quick cost-benefit analysis is needed before deciding how often to run them.
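
The housekeeping itself is plain SQL; a minimal example (the table name is a placeholder) looks like this, and the open question is simply how often to schedule it.

-- Reclaim space and re-sort rows, then refresh the planner's statistics.
VACUUM FULL sales;
ANALYZE sales;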

Disk space capacity planning

Control over disk space is a must with Amazon Redshift, especially when you're dealing with analytical workloads. There is a high chance of oversubscribing the system, and running low on disk space not only degrades query performance but also makes the cluster cost prohibitive. Having a cluster filled above 75% isn't good for performance.
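
One way to stay ahead of this is to check Redshift's system views regularly; a minimal sketch using the standard svv_table_info view is shown below.

-- List the largest tables with the share of cluster space they use
-- and how much of each table is unsorted.
SELECT "table", size AS size_mb, pct_used, unsorted
FROM svv_table_info
ORDER BY size DESC
LIMIT 20;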

Concurrent query limitation

Above 10 concurrent queries, you start seeing issues. Concurrency scaling may mitigate queue times during bursts of queries; however, simply enabling concurrency scaling didn't fix all of our concurrency problems. The limited impact is likely due to the constraints on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

Conclusion

These are some of the fundamental problems voiced by users that you should keep in mind while using or exploring Amazon Redshift. If you are searching for more information about the challenges and benefits of AWS products, check out the next article in this series on AWS Redshift query limits.

Comparing AWS Redshift?

See how the alternatives rank

Amazon Redshift Pricing: An Ultimate Guide

AWS’ data warehouse is a completely managed cloud service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Ahana Joins Leading Open Source Innovators in its Commitment to the Velox Open Source Project Created by Meta

Extends engineering resources and names significant contributors  

San Mateo, Calif. – August 31, 2022 Ahana, the only SaaS for Presto, today announced it is strengthening its commitment to the further development of the Velox Open Source Project, created by Meta, with the dedication of more engineers and significant contributors. Ahana joined Intel and ByteDance as the project’s primary contributors when it was open sourced in 2021.

Velox is a state-of-the-art, C++ database acceleration library. It provides high-performance, reusable, and extensible data processing components, which can be used to accelerate, extend, and enhance data computation engines. It is currently integrated with more than a dozen data systems at Meta, from analytical query engines such as Presto and Spark to stream processing platforms and machine learning libraries such as PyTorch. 

“Velox is poised to be another vibrant open source project created by Meta with significant industry impact. It caught our attention as it enables developers to build highly-efficient data processing engines,” said Steven Mih, Cofounder and CEO, Ahana. “It’s well understood that at Meta, there are diverse opportunities to improve data processing at scale, and, as a result, trailblazing innovations are developed. As data becomes central to every organization, we see many enterprise data teams facing similar challenges around consistency of diverse data systems, which Velox-based systems could solve. As a primary contributor from the start, we are furthering our commitment to grow a community of developers to collaboratively accelerate the project.”

“To our knowledge, Velox is a pioneer effort at unifying execution engines in a centralized open source library. Other than efficiency gains due to its state-of-art components, Velox also provides benefits in terms of increased consistency across big data engines, and by promoting reusability,” said Pedro Pedreira, Software Engineer, Meta. “We see Velox as an important step towards a more modular approach to architecting data management systems. Our long-term goal is to position Velox as a de-facto execution engine in the industry, following the path paved by Apache Arrow for columnar memory format. We are excited about the momentum the project is getting, and the engagement and partnership with Ahana’s engineers and other open source contributors.”

“We’re excited to work closely with Ahana, Meta, and Velox community,” said Dave Cohen, Senior Principal Engineer, Intel. “While there are other database acceleration libraries, the Velox project is an important, open-source alternative.”

Velox Project Significant Contributors from Ahana include:

Deepak Majeti, Principal Engineer and Velox Contributor. Deepak has been contributing to the Velox project since it was open sourced in 2021. Before joining Ahana, he was a technical lead at Vertica. He has expertise in Big Data and High-Performance computing with a Ph.D. from the Computer Science department at Rice University. Deepak is also an Apache ORC PMC member, Apache Arrow and Apache Parquet committer.

Aditi Pandit, Principal Engineer and Velox Contributor. Aditi has also been contributing to the Velox project since it was open sourced in 2021. Before joining Ahana, she was a senior software engineer at Google on the Ads Data Infrastructure team and prior to that, a software engineer with Aster Data and Informatica Corp. 

Ying Su, Principal Engineer and Velox Contributor.  Ying joined Deepak and Aditi as significant contributors to Velox in 2022.  Prior to Ahana, she was a software engineer at Meta and before that, a software engineer at Microsoft. Ying is also a Presto Foundation Technical Steering Committee (TSC) member and project committer. 

Supporting Resources

Meta blog introducing Velox is here.
Tweet this: @AhanaIO joins leading open source innovators in its commitment to the Velox Open Source Project created by Meta #opensource #data #analytics https://bit.ly/3pqwBkH

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Using JMeter with Presto

Apache JMeter is an open source application written in Java that is designed for load testing. This article presents how to install it, and how to create and run a test plan for testing SQL workloads on Presto clusters.

You will need Java on the system you are installing JMeter on. If you do not have Java installed on your system, see How do I install Java?

You will need a Presto cluster to configure JMeter to connect to and run the test plan on. You can create a Presto cluster for free in Ahana Cloud Community Edition.

Installing JMeter

To install JMeter, start by downloading the latest JMeter build and unzipping the downloaded file into a new directory. For this article, the new directory's name is jmeter.

Next, download the version of the Presto JDBC driver that matches your Presto version. The file you want to download is called presto-jdbc-X.XXX.jar.

💡 As of this writing, the Presto JDBC driver version to use with an Ahana Presto cluster is presto-jdbc-0.272.jar. To find the Presto version for a cluster in Ahana, open the Manage view of the Presto cluster and look for Version of Presto in the Information section. The version shown will be similar to 0.272-AHN-0.1. Use the first four numbers to choose the Presto JDBC driver to download.

Copy the downloaded Presto JDBC driver jar file into the jmeter directory’s lib folder.

To run JMeter, change to the jmeter directory and run the command:

bin/jmeter

Create a Test Plan in JMeter

In JMeter, select the Templates icon to show the Templates window. In the dropdown of the Templates window, select JDBC Load Test, then select Create.

Enter the JDBC endpoint in Database URL. In Ahana, you can find and copy the JDBC endpoint in Connection Details of the Presto cluster.

You can include either or both of the catalog and schema names in the Database URL separated by slashes after the port number. For example:

jdbc:presto://report.youraccount.cp.ahana.cloud:443/tpch/tiny

If you do not, you must specify the catalog and schema names in the SQL query in JDBC Request.
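
For example, if the Database URL stops at the port number, a query entered in JDBC Request needs fully-qualified table names, as in the sketch below (note that JMeter queries are entered without a trailing semicolon, as described further down).

-- Fully-qualified catalog.schema.table name because the JDBC URL
-- does not include a catalog or schema. No trailing semicolon.
SELECT count(*) FROM tpch.tiny.orders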

Enter com.facebook.presto.jdbc.PrestoDriver in JDBC Driver class.

Enter Username and Password of a Presto user attached to the Presto cluster.

In the navigation bar on the left, expand Thread Group, select JDBC Request, and enter the SQL query in Query.

💡 Do not include a semicolon at the end of the SQL query that you enter, or the test plan run will fail.

Set how many database requests run at once in Number of Threads (users).

In Ramp-up period (seconds), enter the time that JMeter should take to start all of the requested threads.

Loop Count controls how many times the thread steps are executed.

For example, if Number of Threads (users) = 10, Ramp-up period (seconds) = 100, and Loop Count = 1, JMeter creates a new thread every 10 seconds and the SQL query runs once in each thread.

You can add a report of the performance metrics for the requests to the test plan. To do so, right-click Test Plan in the left navigation bar, then select Add > Listener > Summary Report.

Select the Save icon in JMeter and enter a name for the test plan.

Run the Test Plan and View Results

To run the test plan, select the Start icon.

In the left navigation bar, select View Results Tree or Summary Report to view the output of the test plan.


Run JMeter in Command Line Mode

For best results from load testing, it is recommended to run JMeter without the GUI. After you have created and configured a test plan using the GUI, quit JMeter and then run it from the command line.

For example, to run JMeter with a test plan named testplan and create a report in a new directory named report, run the following command:

bin/jmeter -n -t bin/templates/testplan.jmx -l log.jtl -e -o report

When the test plan is run, JMeter creates an index.html file in the report directory summarizing the results.


Virtual Lab On-Demand:

Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

Learn how to build an open data lakehouse stack using Presto, Apache Hudi, and AWS S3 in this on-demand virtual lab.

What you’ll learn:

  • A quick overview of the open data lakehouse stack, including what Presto (query engine) and Apache Hudi (transaction layer) are
  • How to get Hudi support in Presto
  • Querying Hudi data with Presto
  • How to use Presto to query your AWS S3 data lake
  • Future – what additional Hudi support is coming to Presto

By the end of this lab, you’ll know how to run queries with Presto and Hudi to optimize your AWS S3 data lake.

Additional Resources

Blog: Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Ahana Community Office Hours: August 24 at 10:30am PT/1:30pm ET

Join us for our Ahana Community Office Hours. Our experts will answer your questions about getting started with Ahana Cloud.

Speakers

Sivabalan Narayanan

Software Engineer, Onehouse


Jalpreet Singh Nanda

Software Engineer, Ahana


Ahana Awarded Many Industry Recognitions and Accolades for Big Data, Data Analytics and Presto Innovations

San Mateo, Calif. – August 3, 2022 Ahana, the only SaaS for Presto, today announced many new industry accolades in 1H 2022. Presto, originally created by Meta (Facebook) which open sourced and donated the project to Linux Foundation’s Presto Foundation, is the fast and reliable SQL query engine for data analytics and the data lakehouse. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data. 

“Businesses are looking for ways to bring the reliability of the data warehouse together with the scale and flexibility of the data lake,” said Steven Mih, Cofounder and CEO, Ahana. “We believe the Data Lakehouse offers a new paradigm for a self-service data platform built on open-source foundations, leveraging the scalability of modern cloud services.  With the Ahana Cloud for Presto managed service, we’ve delivered an open SQL data lakehouse that brings the best of the data warehouse and the data lake. We are very excited to see its reception in the marketplace as time and time again it is recognized for its innovation and the benefits it delivers to customers.” 

Recent award recognitions, include:

  • CRN, “The 10 Coolest Big Data Tools of 2022 (so far)” – Data is an increasingly valuable asset for businesses and a critical component of many digital transformation and business automation initiatives. CRN named Ahana Cloud for Presto Community Edition to its list of 10 cool tools in the big data management and analytics space that made their debut in the first half of the year.
  • CRN, “Emerging Big Data Vendors to Know in 2022” – As data becomes an increasingly valuable asset for businesses—and a critical component of many digital transformation and business automation initiatives—demand is growing for next-generation data management and data analytics technology. Ahana is listed among 14 startups that are providing it with its Presto SQL query engine on AWS with the vision to simplify open data lake analytics.
  • CRN, “The Coolest Business Analytics Companies of the 2022 Big Data 100” – CRN’s Big Data 100 includes a look at the vendors solution providers should know in the big data business analytics space. Ahana offers Ahana Cloud for Presto, a SQL data analytics managed service based on Presto, the high-performance, distributed SQL query engine for distributed data residing in a variety of sources, and was named to this prestigious list.
  • Database Trends & Applications, “DBTA 100 2022: The Companies That Matter Most in Data” – Business leadership understands that creating resilient IT systems and pipelines for high-quality, trustworthy data moving into employees’ workflows for decision making is essential. To help bring new resources and innovation to light, each year Database Trends and Applications magazine presents the DBTA 100, a list of forward-thinking companies, such as Ahana, seeking to expand what’s possible with data for their customers.
  • InsideBIGDATA, “IMPACT 50 List for Q1, Q2 and Q3 2022” – Ahana earned an Honorable Mention for all of the last three quarters of the year as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 
  • 2022 SaaS Awards Shortlist – Ahana was recognized by the SaaS Awards as a finalist for Best SaaS Newcomer and Best Data Innovation in a SaaS Product finalist on the 2022 shortlist.
  • 2022 American Business Awards, “Stevie Awards” – Ahana was named the winner of a Silver Stevie® Award in the Big Data Solution category in the 20th Annual American Business Awards®. The winners were determined by the average scores of more than 250 professionals worldwide in a three-month judging process.

Tweet this: @AhanaIO receives many industry #awards and #accolades for innovation in  #BigData #Data #Analytics and #Presto https://bit.ly/3OHiVvX 

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Redshift vs Redshift Spectrum: A Complete Comparison

Amazon Redshift is a cloud-based data warehouse service offered by Amazon. Redshift is a columnar database optimized to handle the kinds of queries typically run against enterprise star and snowflake schemas.

Redshift Spectrum is an extension of Amazon Redshift. As a feature of Redshift, Spectrum allows the user to query data stored on S3. With Amazon Redshift Spectrum, you can continue to store and grow your data in S3 and use Redshift as one of the compute options to process it (other options could be EMR, Athena, or Presto).

There are many differences between Amazon Redshift and Redshift Spectrum; here are some of them:

Architecture

[Image: Amazon Redshift cluster architecture. Source: https://docs.aws.amazon.com/]

An Amazon Redshift cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. The client application interacts directly only with the leader node; the compute nodes are transparent to external applications.

[Image: Amazon Redshift Spectrum architecture. Source: aws.amazon.com]

Redshift Spectrum queries, by contrast, are submitted to the leader node of your Amazon Redshift cluster. The Amazon Redshift compute nodes generate multiple requests depending on the number of objects that need to be processed and submit them concurrently to Redshift Spectrum. The Redshift Spectrum worker nodes scan, filter, and aggregate your data from Amazon S3 and send the results back to your Amazon Redshift cluster. The final join and merge operations are then performed locally in your cluster and the results are returned to your client.

Redshift Spectrum is a service that uses dedicated servers to handle the S3 portion of your queries. The AWS Glue Data Catalog is used to maintain the definitions of the external tables. Redshift loosely connects to S3 data by the following route:


External database, schema, and table definitions in Redshift use an IAM role to interact with the Glue Data Catalog and Spectrum, which handles the S3 portion of the queries.

Use case 

Amazon Redshift is a fully managed data warehouse that is efficient at storing historical data from various sources. It is designed to ease the process of data warehousing and analytics.

Redshift Spectrum is used to perform analytics directly on data in Amazon S3 from an Amazon Redshift cluster. This allows users to separate storage and compute and scale them independently.

You can use Redshift Spectrum, an add-on to Amazon Redshift, to query data in files on S3 alongside existing information in the Redshift data warehouse. In addition to querying the data in S3, you can join the data from S3 to tables residing in Redshift.
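
A rough sketch of that pattern is shown below; it assumes an external schema named spectrum_schema already exists, and the table, bucket, and column names are placeholders.

-- Register S3 data as a Redshift Spectrum external table,
-- then join it with a table stored in Redshift itself.
CREATE EXTERNAL TABLE spectrum_schema.clickstream (
    user_id    BIGINT,
    event_time TIMESTAMP,
    url        VARCHAR(2048)
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/';

SELECT c.customer_name, COUNT(*) AS clicks
FROM spectrum_schema.clickstream s
JOIN public.customers c ON s.user_id = c.customer_id
GROUP BY c.customer_name;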

Performance

Because Amazon Redshift has full control over how data is stored, compressed, and queried, it has many more options for optimizing a query. Redshift Spectrum, on the other hand, only controls how the data is queried (how it is stored is up to AWS S3). The performance of Redshift Spectrum depends on your Redshift cluster resources and how the S3 storage is laid out.

That said, Spectrum offers the convenience of not having to import your data into Redshift. Basically you’re trading performance for the simplicity of Spectrum. Lots of companies use Spectrum as a way to query infrequently accessed data and then move the data of interest into Redshift for more regular access.

Conclusion

This article provides a quick recap of the major differences between Amazon Redshift and Redshift Spectrum. It takes into consideration today’s data platform needs. 

Simply, Amazon Redshift can be classified as a tool in the “Big Data as a Service” category, whereas  Amazon Redshift Spectrum is grouped under “Big Data Tools”. 

If you are an existing customer of Amazon Redshift looking for the best price-performance solution to run SQL on an AWS S3 data lake, try our community edition or 14-day free trial.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


How to Use AWS Redshift Spectrum in AWS Lake Formation

As we’ve covered previously in What is Redshift Used For?, AWS Redshift is a cloud data warehouse used for online analytical processing (OLAP) and business intelligence (BI). Due to Redshift’s coupled architecture and relatively high costs at larger data volumes, businesses often seek to limit the workloads running on Redshift, while utilizing other analytic services including open-source Presto as part of a data lake house architecture.

Lake Formation makes it easier to set up the data lake and to incorporate Redshift as part of the compute layer alongside other analytics tools and services. Developers can optimize their costs by using AWS Redshift for frequently accessed data and moving less frequently accessed data to the Amazon S3 data lake, where it can be queried using serverless query engines such as Athena, Ahana, and Redshift Spectrum.

Two main reasons you would want to use Redshift with Lake Formation:

  • Granting and revoking permissions: Within Lake Formation, there is an independent permissions model in addition to the general IAM permissions set on an AWS account. This enables granular control over who can read data from a lake. You can grant and revoke permissions to the Data Catalog objects, such as databases, tables, columns, and underlying Amazon S3 storage. With Redshift following the Lake Formation permissions model out-of-the-box, you can ensure that the users querying data in Redshift are only accessing data they are meant to access. 
  • Creating external tables and running queries: Amazon Redshift Spectrum can be used as a serverless query option to join data stored in Redshift with data residing on S3. Lake Formation allows you to create virtual tables that correspond to S3 file locations and register them in the Data Catalog. A Redshift Spectrum query would then be able to consume this data without additional configuration.

How to Integrate AWS Redshift in Lake Formation

Lake Formation relies on the AWS Glue Crawler to store table locations in the Glue Data Catalog, which can then be used to control access to S3 data for other analytics services, including Redshift. This AWS blog post suggests a reference architecture for connecting the various services involved:

  • Data stored in an Amazon S3 lake is crawled using AWS Glue Crawler.
  • Glue Crawler then stores the data in tables and databases in the AWS Glue Data Catalog.
  • The S3 bucket is registered as the data lake location with Lake Formation. Lake Formation is natively integrated with the Glue Data Catalog.
  • Lake Formation grants permissions at the database, table, and column level to the defined AWS Identity and Access Management (IAM) roles.
  • Developers create external schemas within Amazon Redshift to manage access for other business teams (see the sketch after this list).
  • Developers provide access to the user groups to their respective external schemas and associate the appropriate IAM roles to be assumed. 
  • Users now can assume their respective IAM roles and query data using the SQL query editor to their external schemas inside Amazon Redshift.
  • After the data is registered in the Data Catalog, each time users try to run queries, Lake Formation verifies access to the table for that specific principal. Lake Formation vends temporary credentials to Redshift Spectrum, and the query runs.
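
As a rough sketch of the external-schema step above, the statements below map a Glue Data Catalog database to a Redshift external schema and grant a business team's user group access to it; the database, schema, role, and group names are placeholders.

-- Map a Glue Data Catalog database to a Redshift external schema,
-- then grant a user group access to it.
CREATE EXTERNAL SCHEMA marketing_lake
FROM DATA CATALOG
DATABASE 'marketing_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-spectrum-role';

GRANT USAGE ON SCHEMA marketing_lake TO GROUP marketing_analysts;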

Using Lake Formation as Part of an Open Data Lakehouse 

One of the advantages of a data lake is its open nature, which allows businesses to use a variety of best-in-breed analytics tools for different workloads. This replaces database-centric architectures, which requires storing data in proprietary formats and getting locked-in with a particular vendor.

Implementing Lake Formation makes it easier to move more data into your lake, where you can store it in open file formats such as Apache Parquet and ORC. You can then use a variety of tools that interface with the Glue Data Catalog and read data directly from S3. This provides a high level of flexibility, avoids vendor lock-in, and strongly decouples storage from compute, reducing your overall infrastructure costs. (You can read more about this topic in our new white paper: The SQL Data Lakehouse and Foundations for the New Data Stack.)
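
For example, Presto's Hive connector can write query results straight back to the lake in Parquet; in the sketch below the catalog, schema, and table names are illustrative.

-- Write query results back to the data lake as Parquet via the Hive connector.
CREATE TABLE hive.curated.daily_orders
WITH (format = 'PARQUET')
AS
SELECT orderdate, COUNT(*) AS order_count
FROM hive.raw.orders
GROUP BY orderdate;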

If you’re looking for a truly open and flexible option for serverless querying, you should check out Ahana Cloud. Ahana Cloud and AWS Lake Formation make it easy to build and query secure S3 data lakes. Using the native integration, data platform teams can seamlessly connect Presto with AWS Glue, AWS Lake Formation, and AWS S3 while providing granular data security. Enabling the integration in Ahana Cloud is a single click when creating a new Presto cluster.

Learn more about Ahana Cloud’s integration with AWS Lake Formation.

Related Articles


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Ahana to Present About Presto on the Open Data Lakehouse at PrestoCon Day; Ahana Customer Blinkit to Discuss Its Presto on AWS Use Case

July 21 all-things Presto event features speakers from Uber, Meta, Ahana, Blinkit, Platform24, Tencent, Bytedance and more

San Mateo, Calif. – July 14, 2022 Ahana, the only SaaS for Presto, today announced its participation in PrestoCon Day, a day dedicated to all things Presto taking place virtually on Thursday, July 21, 2022. In addition to being a premier sponsor of the event, Ahana will be participating in three sessions and Ahana customer Blinkit will also be presenting its Presto use case.

Ahana and Ahana Customer Sessions at PrestoCon

July 21 at 9:35 am PT – “Free-Forever Managed Service for Presto for your Cloud-Native Open SQL Lakehouse,” by Wen Phan, Director of Product Management, Ahana

Getting started with a do-it-yourself approach to standing up an open SQL Lakehouse can be challenging and cumbersome.  Ahana Cloud Community Edition dramatically simplifies it and gives users the ability to learn and validate Presto for their open SQL Lakehouse—for free.  In this session, Wen will show how easy it is to register for, stand up, and use the Ahana Cloud Community Edition to query on top of a lakehouse.

July 21 at 10:30 am PT – “How Blinkit is Building an Open Data Lakehouse with Presto on AWS,” by Akshay Agarwal, Software Engineer, Blinkit; and Satyam Krishna, Engineering Manager, Blinkit

Blinkit, India’s leading instant delivery service, uses Presto on AWS to help them deliver on their promise of “everything delivered in 10 minutes”. In this session, Satyam and Akshay will discuss why they moved to Presto on S3 from their cloud data warehouse for more flexibility and better price performance. They’ll also share more on their open data lakehouse architecture which includes Presto as their SQL engine for ad hoc reporting, Ahana as SaaS for Presto, Apache Hudi and Iceberg to help manage transactions, and AWS S3 as their data lake.

July 21 at 11:00 am PT – “Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy,” by George Wang, Principal Software Engineer, Ahana

Today, Presto supports broadcast joins by having one worker fetch data from a small data source to build a hash table and then send the entire table over the network to all other workers for hash lookups probed by the large data source. This can be optimized by a new query execution strategy in which the source data from small tables is pulled directly by all workers, known as replicated reads from dimension tables. The feature comes with a nice caching property: because all N worker nodes now participate in scanning the data from remote sources, the table scan of the dimension tables is cacheable on every worker node. In addition, resource utilization improves because the Presto scheduler can reduce the number of plan fragments to execute, as the same workers run tasks in parallel within a single stage, reducing data shuffles.
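
For context, Presto already lets you request a broadcast join explicitly today; the sketch below assumes the join_distribution_type session property (available in recent Presto releases) and uses illustrative table names.

-- Ask Presto to broadcast the small dimension table to all workers
-- instead of repartitioning both sides of the join.
SET SESSION join_distribution_type = 'BROADCAST';

SELECT d.region, SUM(f.amount) AS total_amount
FROM hive.sales.fact_orders f
JOIN hive.sales.dim_region d ON f.region_id = d.region_id
GROUP BY d.region;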

July 21 at 2:25 pm PT – “Presto for the Open Data Lakehouse,” panel session moderated by Eric Kavanagh, CEO, Bloor Group with Dave Simmen, CTO & Co-Founder, Ahana; Girish Baliga, Chair of Presto Foundation & Sr. Engineering Manager, Uber; Biswapesh Chattopadhyay, Tech Lead, DI Compute, Meta; and Ali LeClerc, Chair of Presto Outreach Committee and Head of Community, Ahana


Today’s digital-native companies need a modern data infra that can handle data wrangling and data-driven analytics for the ever-increasing amount of data needed to drive business. Specifically, they need to address challenges like complexity, cost, and lock-in. An Open SQL Data Lakehouse approach enables flexibility and better cost performance by leveraging open technologies and formats. Join us for this panel where leading technologists from the Presto open source project will share their vision of the SQL Data Lakehouse and why Presto is a critical component.

View all the sessions in the full program schedule

PrestoCon Day is a free virtual event and registration is open

Tweet this: @AhanaIO announces its participation in #PrestoCon Day #cloud #opensource #analytics #presto https://bit.ly/3ImlAcU

# # #

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Hands-on Presto Tutorial: How to run Presto on Kubernetes


What is Presto?

Tip: looking for a more technical guide to understanding Presto? Get the free ebook, Learning and Operating Presto.

To learn how to run Presto on Kubernetes, let’s cover the basics first. Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores – including search and indexing systems such as Elasticsearch and Druid.

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as part of a Hadoop cluster, either as part of open-source or commercial Hadoop deployments (e.g. Cloudera) or as part of managed Hadoop (e.g. EMR, Dataproc)
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone <repository url>

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes' ability to ease operational management of the application significantly simplifies the Presto deployment: resiliency, configuration management, and ease of scaling in and out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started with Presto on Kubernetes

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it’s ‘presto-cluster’), go in and add a node group that has sufficient capacity for the Presto Docker containers to run on. I plan on using r5.2xlarge nodes and set up a node group of 4 nodes (you can tweak your Presto Docker container settings accordingly and use smaller nodes if required).


Figure 1: Creating a new nodegroup


Figure 2: Setting the instance type and node count

Once your node group shows as active, you are ready to move on to the next step.


Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let’s first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the supported environment variables, the repository can be found here.

The details of the specific configuration parameters used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by passing them via the --env option for docker, or by using name-value pairs in the Kubernetes YAML file, as we will see later.

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (You can use the same or different user): 

 arn:aws:iam::<your account id>:user/<your user>

  2. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>


Figure 4: Assume role privileges


Figure 5: Trust relationships


  3. Access to the latest docker image for this tutorial: asifkazi/presto-on-docker:latest

Warning: The permissions provided above are pretty lax, giving the user broad privileges, not just to assume the role but also over the operations the user can perform on S3 and Glue. DO NOT use these permissions as-is for production use. It’s highly recommended to tighten the privileges using the principle of least privilege (only provide the minimal access required).

Run the following commands:

  1. Create a network for the nodes

docker network create presto

  2. Start a mysql docker instance

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  3. Start the presto single node cluster on docker

docker run -d --name presto \
  --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \
  --env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \
  --env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \
  --env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \
  --env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \
  --env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \
  -p 8080:8080 \
  --network=presto \
  asifkazi/presto-on-docker:latest

  4. Make sure the containers came up correctly:

docker ps 

  5. Interactively log into the docker container:

docker exec -it presto bash

  6. From within the docker container, verify that everything is working correctly by running the following command:

presto

  7. From within the presto cli run the following:

show schemas from mysql;

The command should list the MySQL databases.

  8. From within the presto cli run the following:

show schemas from hive;

The command should show the databases from Glue. If you are using Glue for the first time, you might only see the information_schema and default databases.
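If you want to go one step further than listing schemas, you can run a quick sanity query against the Glue-backed hive catalog; the table name below is just a placeholder for whatever exists in your own catalog:

-- List the tables registered in the Glue "default" database
show tables from hive.default;

-- Query any table that exists in your catalog (placeholder name)
select * from hive.default.<your_table> limit 10;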

Using Presto with Kubernetes

We have validated that the docker container itself is working fine as a single node cluster (worker and coordinator on the same node). We will now move on to getting this environment working in Kubernetes. But first, let’s clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto;docker rm mysql presto;


How to get started running Presto on Kubernetes

To get Presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes-specific properties, we will provide all the Docker env properties via name-value pairs.

  1. Create a namespace for the presto cluster

kubectl create namespace presto

  2. Override the env settings in the presto.yaml file for both the coordinator and worker sections
  3. Apply the yaml file to the Kubernetes cluster

kubectl apply -f presto.yaml --namespace presto

  4. Let’s also start a mysql instance. We will first start by creating a persistent volume and claim.

kubectl apply -f ./mysql-pv.yaml --namespace presto

  5. Create the actual instance

kubectl apply -f ./mysql-deployment.yaml --namespace presto

  6. Check the status of the cluster and make sure there are no errored or failing pods

kubectl get pods -n presto

  7. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You are going to need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the -- and bash is required

  8. Querying seems to be working, but is the Kubernetes deployment a multi-node cluster? Let’s check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

  9. Let’s see what happens if we destroy one of the pods (simulate failure)

kubectl delete pod presto-worker-k9xw8 -n presto

  10. What does the current deployment look like?

What? The pod was replaced by a new one, presto-worker-tnbsb!

  11. Now we’ll modify the number of replicas for the workers in the presto.yaml
  12. Set replicas to 4

Apply the changes to the cluster

kubectl apply -f presto.yaml --namespace presto

Check the number of running pods for the workers


kubectl get pods -n presto

Wow, we have a fully functional presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency. 

Summary

In this tutorial we set up a single node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs, as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Next Lesson

Ready for your next Presto lesson from Ahana? Check out our guide to running Presto with AWS Glue as catalog on your laptop.

Data Warehouse: A Comprehensive Guide

Introduction

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. It is typically composed of operational data that has been aggregated and organized in a way that facilitates the requirements of the data teams. Data consumers need to be able to work at a very high speed to make decisions. By design, there is usually some level of latency involved in data appearing in a warehouse; keep that in mind when designing your systems and defining the requirements of your users. In this article, we’re going to review the data warehouse types, the different types of architecture, and the different warehouse model types.

Data Warehouse Architecture Types

The various data warehouse architecture types break down into three categories:

Single-tier architecture – The objective of this architecture is to dramatically reduce data duplication and produce a dense set of data. While this design keeps the volume of data as low as possible, it is not appropriate for complex data requirements that include numerous data sources.

Two-tier architecture – This architecture design splits the physical data from the warehouse itself, making use of a system and a database server. This design is typically used for a data mart in a small organization, and while efficient at data storage, it is not a scalable design and can only support a relatively small number of users.

Three-tier architecture – The three-tier architecture is the most common type of data warehouse, as it provides a well-organized flow of your raw information to provide insights. It comprises the following components:

  • Bottom tier – comprises the database of the warehouse servers. It creates an abstraction layer on the various information sources to be used in the warehouse. 
  • Middle tier – includes an OLAP server to provide an abstracted view of the database for the users. Being pre-built into the architecture, this tier can be used as an OLAP-centric warehouse.
  • Top tier – comprises the client-level tools and APIs that are used for data analysis and reporting. 

Data Warehouse Model Types 

The data warehouse model types break down into four categories:

  1. Enterprise Data Warehouse

An EDW is a centralized warehouse that collects all the information on subjects across the entire organization. These tend to be a collection of databases, rather than one monolith, that together provide a unified approach to querying data by subject.

  2. Data Mart

A data mart consists of a subset of a warehouse that is useful for a specific group of users. Consider a marketing data mart that is populated with data from ads, analytics, social media engagement, email campaign data, etc. This enables the marketing department to rapidly analyze its data without the need to scan through volumes of unrelated data. A data mart can be further broken into “independent,” where the data stands alone, or “dependent,” where the data comes from the warehouse.

  3. Operational Data Store

The ODS might seem slightly counterintuitive at first, as it is used for operational reporting, and typically we don’t want to run reporting and analytic workloads on operational data. It is a synergistic component to the previously mentioned EDW and is used for reporting on operational types of data. Low-velocity data that is managed in real time, such as customer records or employee records, is typical of this kind of store.

  4. Virtual Warehouse

The Virtual Warehouse is perhaps a questionable inclusion, but nonetheless important. It is implemented as a set of views over your operational database. These tend to be limited in what they can make available due to the relationships in the data, and the fact that you don’t want to destroy your operational database’s performance by having large numbers of analytic activities taking place on it at the same time. A simple SQL sketch of the last two models follows below.
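As a rough illustration of the last two models, the sketch below shows a dependent marketing data mart built as a table derived from warehouse tables, and a virtual warehouse exposed as a view over operational tables; every table and column name here is hypothetical:

-- Dependent data mart: a subject-specific table derived from warehouse tables
CREATE TABLE marketing_mart AS
SELECT c.campaign_id, c.channel,
       SUM(f.spend) AS total_spend,
       SUM(f.clicks) AS total_clicks
FROM warehouse.fact_ad_events f
JOIN warehouse.dim_campaign c ON f.campaign_key = c.campaign_key
GROUP BY c.campaign_id, c.channel;

-- Virtual warehouse: a view defined directly over the operational database
CREATE VIEW sales_by_region AS
SELECT r.region_name, SUM(o.order_total) AS revenue
FROM operational.orders o
JOIN operational.regions r ON o.region_id = r.region_id
GROUP BY r.region_name;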

Summary

A warehouse provides an environment that fosters drill-down analysis of your data in the search for insights. As a data analyst looks for trends or actionable insights, the ability to navigate easily through various data dimensions is paramount. The warehouse approach allows you to store and analyze vast amounts of information, but it also comes at a cost for storage and compute. You can mitigate some of these costs by optimizing your warehouse for data retrieval, picking a DW design and sticking with it, and ensuring that your data has been cleansed and standardized prior to loading.
An alternative to the warehouse is the growing data lake approach, where information can be read in place from an object store such as (AWS) S3. Some advantages are reduced cost and latency, as the load into the DW is no longer necessary. The Community Edition of the Presto managed service from Ahana is a great way to try out the data lake and test your requirements.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Typically a data warehouse contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from diverse data sources. It requires an Extract, Transform, and Load (ETL) process to pull data from those diverse sources and create another copy within the data warehouse to support SQL queries and analysis.

The following data design techniques are used to facilitate data retrieval for analytical processing:

Star Schema: This is the foundational and simplest schema in data warehouse modeling. It contains one or more fact tables referencing any number of dimension tables. Its graphical representation looks like a star, hence the name star schema. Fact tables are usually very large compared to dimension tables, and dimension tables can contain redundant data, as they are not required to be normalized.
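To make the star shape concrete, a typical query joins the central fact table directly to each dimension it needs; the table and column names below are hypothetical:

-- Star schema query: one fact table joined to denormalized dimension tables
SELECT d.calendar_month, p.product_name,
       SUM(f.sales_quantity) AS units_sold
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.calendar_month, p.product_name;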

Snowflake Schema: This is an extension of the star schema, in which a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables. The entity-relationship diagram of this schema resembles a snowflake shape, hence the name snowflake schema.

Data Warehouse Example

Consider a fact table that stores sales quantities for each product and customer at a certain time. Sales quantity is the measure here, and the primary keys from the customer, product, and time dimension tables flow into the fact table as foreign keys. Additionally, all of the products can be further grouped under different product families stored in a separate table, with the primary key of the product family table also going into the product table as a foreign key. Such a construct is called a snowflake schema, as the product table is further snowflaked into the product family.


Figure 1 explains the typical snowflake schema design. 
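A minimal DDL sketch of that example might look like the following; the column names and types are illustrative rather than a prescribed design, and the customer and time dimensions are omitted for brevity:

-- The product dimension is "snowflaked" into a normalized product_family table
CREATE TABLE product_family (
  family_key  INT PRIMARY KEY,
  family_name VARCHAR(100)
);

CREATE TABLE product (
  product_key  INT PRIMARY KEY,
  product_name VARCHAR(100),
  family_key   INT REFERENCES product_family (family_key)
);

-- The fact table holds the sales quantity measure plus foreign keys to the dimensions
CREATE TABLE sales_fact (
  product_key    INT REFERENCES product (product_key),
  customer_key   INT,  -- references the customer dimension (not shown)
  time_key       INT,  -- references the time dimension (not shown)
  sales_quantity INT
);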

ETL or ELT—Extracting, Transforming, and Loading Data

Besides the differences in data modeling and schemas, building a data warehouse involves the critical task of ETL: compiling data into the warehouse from other sources.

Figure: ETL vs. ELT diagram

In data extraction, we move data out of source systems. It could be relational databases, NoSQL databases or streaming data sources. The challenge during this step is to identify the right data and manage access control. 

In a data pipeline or batch workloads, we frequently move a large amount of data from different source systems to the data warehouse. Here the challenges are to plan a realistic SLA and to have a reliable and fast network and infrastructure. 

In data transformation, we format data so that it can be represented consistently in the data warehouse. The original data might reside in different databases using different data types or in different table formats, or in different file formats in different file systems. 

Finally, in data loading, we load the data into the fact tables, with an error-handling procedure in place for records that fail to load.
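As a simplified illustration of the transform and load steps combined, a batch job might standardize types while inserting into a fact table; the staging and target table names here are hypothetical:

-- Cast source columns to the warehouse's canonical types while loading the fact table
INSERT INTO sales_fact (product_key, customer_key, time_key, sales_quantity)
SELECT CAST(s.product_id  AS INT),
       CAST(s.customer_id AS INT),
       CAST(s.time_id     AS INT),
       CAST(s.quantity    AS INT)
FROM staging_sales s
WHERE s.quantity IS NOT NULL;  -- rows failing this check would be routed to error handling instead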

Data Warehouse To Data Lake To Data Lakehouse

A data lake is a centralized file system or storage designed to store, process, and secure large amounts of structured, semistructured, or unstructured data. It can store data in its native format and process any variety of it. Examples of a data lake include HDFS, AWS S3, ADLS or GCS.

Data lakes use the ELT (Extract, Load, Transform) process while data warehouses use the ETL (Extract, Transform, Load) process. With a SQL engine like Presto, you can run interactive queries, reports, and dashboards directly on a data lake, without the need to create yet another data warehouse or copy of your data and add operational overhead.
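For example, with Presto pointed at a Hive/Glue catalog over S3, an interactive query on raw lake files can be as simple as the following; the catalog, schema, table, and column names are placeholders:

-- Query files sitting in S3 directly through the hive catalog, no warehouse load required
SELECT event_date, count(*) AS events
FROM hive.lake.web_events
WHERE event_date >= date '2022-01-01'
GROUP BY event_date
ORDER BY event_date;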

A data lake is just one element of an Open Data Lakehouse, which takes the benefits of both a data warehouse and a data lake. However, an Open Data Lakehouse is much more than that; it is the entire stack. In addition to hosting a data lake (AWS S3) and a SQL engine (Presto), it also allows for governance (AWS Lake Formation) and ACID transactions. Transaction support is achieved using technologies and projects such as Apache Hudi, while Presto is the SQL engine that sits on top of the cloud data lake you’re querying. In addition to this, there is Ahana Cloud. Ahana is a managed service for Presto, designed to simplify the process of configuring and operating Presto.


As cloud data warehouses become more cost-prohibitive and limited by vendor lock-in, and as the data mesh (or data federation) approach falls short on performance, more and more companies are migrating their workloads to an Open Data Lakehouse. If all your data is going to end up in cloud-native storage like Amazon S3, ADLS Gen2, or GCS, then the most optimized and efficient data strategy is to leverage an Open Data Lakehouse stack, which provides much more flexibility and remedies the challenges noted above. Taking on the task of creating an Open Data Lakehouse is difficult. As an introduction to the process, check out this on-demand presentation, How to build an Open Data Lakehouse stack. In it you’ll see how you can build your stack in more detail, while incorporating technologies like Ahana, Presto, Apache Hudi, and AWS Lake Formation.

Related Articles

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.


Ahana Will Co-Lead Session At Data & AI Summit About Presto Open Source SQL Query Engine

San Mateo, Calif. – June 23, 2022 – Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at Data & AI Summit about Presto, the Meta-born open source high performance, distributed SQL query engine. The event is being held June 27 – 30 in San Francisco, CA and virtually.

Session Title: “Presto 101 – An Introduction to Open Source Presto.”

Session Time: On Demand

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell.

Session Details: Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, users can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Rohan and Philip will introduce the Presto technology and share why it’s becoming so popular. In fact, companies like Facebook, Uber, Twitter, Alibaba, and many others use Presto for interactive ad hoc queries, reporting & dashboarding data lake analytics, and much more. This session will show a quick demo on getting Presto running in AWS.

To register for Data & AI Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @AhanaIO to present at #DataAISummit about #Presto https://bit.ly/3n8YDQt #OpenSource #Analytics #Cloud

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


AWS Redshift Query Limits

What is AWS Redshift?

At its heart, AWS Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. It has evolved and been enhanced since then into a powerful distributed system that can provide speedy results across millions of rows. Conceptually it is based on node clusters, with a leader node and compute nodes. The leader generates the execution plan for queries and distributes those tasks to the compute nodes. Scalability is achieved with elastic scaling that can add/modify worker nodes as needed and quickly. We’ll discuss the details in the article below.

Limitations of Using AWS Redshift

There are of course Redshift limitations on many parameters, which Amazon refers to as “quotas”. There is a Redshift query limit, a database limit, a Redshift query size limit, and many others. These have default values from Amazon and are per AWS region. Some of these quotas can be increased by submitting an Amazon Redshift Limit Increase Form. Below is a table of some of these quota limitations.

Quota                      | Value                | Adjustable
Nodes per cluster          | 128                  | Yes
Nodes per region           | 200                  | Yes
Schemas per DB per cluster | 9,900                | No
Tables per node type       | 9,900 – 100,000      | No
Query limit                | 50                   | No
Databases per cluster      | 60                   | No
Stored procedures per DB   | 10,000               | No
Query size limit           | 100,000 rows         | Yes
Saved queries              | 2,500                | Yes
Correlated Subqueries      | Need to be rewritten | No

AWS Redshift Performance

To start, Redshift stores data in a compressed, columnar format. This means that there is less area on disk to scan and less data that has to be moved around. Add to that sort keys and zone maps, Redshift’s alternative to traditional indexes, and you have the base recipe for high performance. In addition, Redshift maintains a results cache, so frequently executed queries are going to be highly performant. This is aided by the query plan optimization done in the leader node. Redshift also optimizes the data partitioning in a highly efficient manner to complement the optimizations done in the columnar data algorithms.
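Because the storage is columnar, selecting only the columns you need (rather than SELECT *) is one of the simplest ways to reduce the amount of data scanned; here is a hedged example with made-up table and column names:

-- Reads only two columns instead of the whole row, so far less data comes off disk
SELECT order_date, SUM(order_total) AS daily_revenue
FROM sales
WHERE order_date BETWEEN '2022-01-01' AND '2022-01-31'
GROUP BY order_date;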

Scaling

Redshift offers a robust set of scaling strategies. With just a few clicks in the AWS Redshift console, or even with a single API call, you can change node types, add nodes, and pause or resume the cluster. You can also use Elastic Resize to dynamically adjust your provisioned capacity within a few minutes. A resize scheduler is available as well, so you can schedule changes, say for month-end processing. Finally, Concurrency Scaling can automatically provision additional capacity for dynamic workloads.

Pricing

A lot of variables go into Redshift pricing depending on the scale and features you go with. All of the details and a pricing calculator can be found on the Amazon Redshift Pricing page. To give you a quick overview, however, prices start as low as $.25 per hour. Pricing is based on compute time and size and goes up to $13.04 per hour. Amazon provides some incentives to get you started and try out the service.

First, similar to the Ahana Cloud Community Edition, Redshift has a “Free Tier”: if your company has never created a Redshift cluster, you are eligible for a DC2 Large node trial for two months. This provides 750 hours per month for free, which is enough to continuously run that DC2 node with 160GB of compressed SSD storage. Once your trial expires or your usage exceeds 750 hours per month, you can either keep it running with on-demand pricing or shut it down.

Next, there is a $500 credit available to use their Amazon Redshift Serverless option if you have never used it before. This applies to both the compute and storage and how long it will last depends entirely on the compute capacity you selected, and your usage.

Then there is “on-demand” pricing. This option allows you to pay for provisioned capacity by the hour with no commitments or upfront costs; partial hours are billed in one-second increments. Amazon allows you to pause and resume these nodes when you aren’t using them so you don’t continue to pay while still preserving what you have; while paused, you pay only for backup storage.

Summary

Redshift provides a robust, scalable environment that is well suited to managing data in a data warehouse. Amazon provides a variety of ways to easily give Redshift a try without getting too tied in. Not all analytic workloads make sense in a data warehouse, however, and if you are already landing data into AWS S3, then you have the makings of a data lakehouse that can offer better price/performance. A managed Presto service, such as Ahana, can be the answer to that challenge.

Want to learn more about the value of the data lake?

In our free whitepaper, Unlocking the Business Value of the Data Lake, we’ll show you why companies are moving to an open data lake architecture and how they are getting the most out of that data lake to drive their business initiatives.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What is AWS Redshift Spectrum?

Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Learn more about its performance and price.


Ahana Announces Additional $7.2 Million Funding Led by Liberty Global Ventures and Debuts Free Community Edition of Ahana Cloud for Presto for the Open Data Lakehouse

Only SaaS for Presto now available for free with Ahana Community Edition; Additional capital raise validates growth of the Open Data Lakehouse market

San Mateo, Calif. – June 16, 2022 Ahana, the only Software as a Service for Presto, today announced an additional investment of $7.2 million from Liberty Global Ventures with participation from existing investor GV, extending the company’s Series A financing to $27.2 million. Liberty Global is a world leader in converged broadband, video and mobile communications services. This brings the total amount of funding raised to date to $32 million. Ankur Prakash, Partner, Liberty Global Ventures, will join the Ahana Board of Directors as a board observer. Ahana will use the funding to continue to grow its technical team and product development; evangelize the Presto community; and develop go-to-market programs to meet customer demand. 

Ahana also announced today Ahana Cloud for Presto Community Edition, designed to simplify the deployment, management and integration of Presto, an open source distributed SQL query engine, for the Open Data Lakehouse. Ahana Community Edition is immediately available to everyone, including users of the 100,000+ downloads of Ahana’s PrestoDB Sandbox on DockerHub. It provides simple, distributed Presto cluster provisioning and tuned out-of-the-box configurations, bringing the power of Presto to data teams of all sizes for free. Instead of downloading and installing open source Presto software, data teams can quickly learn about Presto and deploy initial SQL data lakehouse use cases in the cloud. Community Edition users can easily upgrade to the full version of Ahana Cloud for Presto, which adds increased security including integration with Apache Ranger and AWS Lake Formation, price-performance benefits including multi-level caching, and enterprise-level support.

“Over the past year we’ve focused on bringing the easiest managed service for Presto to market, and today we’re thrilled to announce a forever-free community edition to drive more adoption of Presto across the broader open source user community. Our belief in Presto as the best SQL query engine for the Open Data Lakehouse is underscored by our new relationship with Liberty Global,” said Steven Mih, Cofounder and CEO, Ahana. “With the Community Edition, data platform teams get unlimited production use of Presto at a good amount of scale for lightning-fast insights on their data.”

“Today we’re seeing more companies embrace cloud-based technologies to deliver superior customer experiences. An underlying architectural pattern is the leveraging of an Open Data Lakehouse, a more flexible stack that solves for the high costs, lock-in, and limitations of the traditional data warehouse,” said Ankur Prakash, Partner, Liberty Global Ventures. “Ahana has innovated to address these challenges with its industry-leading approach to bring the most high-performing, cost-effective SQL query engine to data platforms teams. Our investment in Ahana reflects our commitment to drive more value for businesses, specifically in the next evolution of the data warehouse to Open Data Lakehouses.” 

Details of Ahana Cloud for Presto Community Edition include:

●        Free to use, forever

●        Use of Presto in an Open Data Lakehouse with open file formats like Apache Parquet and advanced lake data management like Apache Hudi

●        A single Presto cluster with all supported instance types except Graviton

●        Pre-configured integrations to multiple data sources including the Hive Metastore for Amazon S3, Amazon OpenSearch, Amazon RDS for MySQL, Amazon RDS for PostgreSQL, and Amazon Redshift

●        Community support through public Ahana Community Slack channel plus a free 45 minute onboarding session with an Ahana Presto engineer

●        Seamless upgrade to the full version which includes enterprise features like data access control, autoscaling, multi-level caching, and SLA-based support

“Enterprises continue to embrace ‘lake house’ platforms that apply SQL structures and querying capabilities to cloud-native object stores,” said Kevin Petrie, VP of Research, Eckerson Group. “Ahana’s new Community Edition for Presto offers a SQL query engine that can help advance market adoption of the lake house.”

Supporting Resources:

Get Started with the Ahana Community Edition

Join the Ahana Community Slack Channel

Tweet this:  @AhanaIO announces additional $7.2 million Series A financing led by Liberty Global Ventures; debuts free community edition of Ahana #Cloud for #Presto on #AWS https://bit.ly/3xlAVW4 

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Will Co-Lead Session At Open Source Summit About Presto SQL Query Engine

San Mateo, Calif. – June 14, 2022 — Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at the Linux Foundation’s Open Source Summit about Presto, the Meta-born open source high performance, distributed SQL query engine. The event is being held June 20 – 24 in Austin, TX and virtual.

Session Title: “Introduction to Presto – The SQL Engine for Data Platform Teams.”

Session Time: Tuesday, June 21 at 11:10am – 11:50am CT

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell.

Session Details: Presto is an open-source high performance, distributed SQL query engine. Born at Facebook in 2012, Presto was built to run interactive queries on large Hadoop-based clusters. Today it has grown to support many users and use cases including ad hoc query, data lake analytics, and federated querying. In this session, we will give an overview of Presto including architecture and how it works, the problems it solves, and most common use cases. We’ll also share the latest innovation in the project as well as what’s on the roadmap.

To register for Open Source Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @Ahana to present at #OpenSourceSummit about #Presto https://bit.ly/3xMGQ7M #OpenSource #Analytics #Cloud 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Cloud for PrestoDB

What are the Benefits of a Managed Service?

Managed Services – Understanding the basics

What are the operational benefits of using a managed service for Presto with Ahana Cloud? To answer this question, first let’s hear from an AWS Solution Architect about his experience using Ahana as a solution for his data lakehouse: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So, the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service. The customer of Ahana does not have to do a lot of work. Everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service.”

The architect goes on to state, “the customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC.”

In addition to this the AWS architect continues to state, “this is beneficial. As the user, you have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have some context from a current user and the solution described by this data architect, let’s get more specific about the reasons a user would want to select a managed service for their SQL engine for data lakehouse analytics and reporting.

For example, let’s say you want to create a new cluster. It’s just a couple of clicks with Ahana Cloud, rather than an entire arduous process without the facilitation of a managed service. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible.

To take the illustration further, instead of using the Ahana Cloud-provided Hive metastore, you can bring your own Amazon Glue catalog. This allows the user to maintain control and streamline their tasks.

Then, of course, it’s easy to add additional data sources. For that, you can add in JDBC endpoints for your databases; Ahana has those integrated. After the connection is added, Ahana Cloud automatically restarts the cluster.

When compared to EMR or with other distributions, this is more cumbersome for the user. All of this has to be manually completed by the user when they are not using a managed service:

  • You have to create a catalog properties file for each data source
  • Restart the cluster on your own
  • Scale the cluster manually
  • Add your own query logs and statistics
  • Rebuild everything when you stop and restart clusters

With Ahana Cloud as a managed service for PrestoDB, all of this manual action and complexity is taken away, which in turn allows data analysts and users to focus on their work rather than spending a large amount of time on labor-intensive processes and complicated configurations as a prerequisite to getting started with analytical tasks.

For scaling up, if you want to grow your analytics jobs over time, you can add nodes seamlessly. Ahana Cloud, as a managed service, and other distributions can add nodes to the cluster while your services are still up and running. What isn’t as seamless or simple with other distributions, unlike with Ahana, is stopping and restarting the entire cluster.

With Ahana Cloud, in addition to all the workers and the coordinator being provisioned, the configuration, the cluster connections to the data sources, and the Hive metastore are all maintained. When you, as the user, restart the cluster, everything comes back up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics work.

This is important because, otherwise, the operator has to manage all of this on their own, including the configuration management and reconfiguration of the catalog services. With EMR specifically, for example, when you terminate a cluster you lose track of that cluster altogether; you have to start from scratch and reintegrate the whole system.


Next Steps – Exploring a Managed Service

As you and your team members look to reduce friction in your data analytics stack, learn how Ahana Cloud reduces frustration and time spent configuring for data teams. The number one reason for selecting a managed service is that it will make your life easier. Check out our customer stories to see how organizations like Blinkit, Carbon, and Adroitts were able to increase price-performance and bring control back to their data teams, all while simplifying their processes and bringing a sense of ease to their in-house data management outfits.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.


Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube

See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs

Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements the data federation feature: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single-point-of-access for multiple databases (or file systems) requires Ops skills and an additional effort.

However, no data engineer wants to do the Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana hides the burden of infrastructure management and allows you to focus on processing your data.

What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.


This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.

Integration

Cube’s battle-tested Presto driver provides the out-of-the-box connectivity to Ahana.

You just need to provide the credentials: the Presto host name and port, user name and password, and the Presto catalog and schema. You’ll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.

Check the docs to learn more about connecting Cube to Ahana.

Example: Parsing logs from multiple data sources with Ahana and Cube

Let’s build a real-world data application with Ahana and Cube.

We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the versions and gather the predictions from the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. Additionally, you also track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model

How to configure Ahana?

Allowing Ahana to access your AWS account

First, let’s login to Ahana, and connect it to your AWS account. We must create an IAM role allowing Ahana to access our AWS account.

On the setup page, click the “Open CloudFormation” button. After clicking the button, we get redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.

The Ahana setup page

Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the “Create new cluster” button.

Ahana create new cluster

In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the “Create cluster” button. Ahana needs around 20-30 minutes to set up a new cluster.

Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to the cluster because, in this example, the Sagemaker Endpoint logs are stored in S3 and PostgreSQL.

Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.

In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.

PostgreSQL data source configuration

Adding an S3 bucket to Ahana

AWS Sagemaker Endpoint stores their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to login to the AWS console, open the AWS Glue page and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.

AWS Glue crawlers

After typing the crawler’s name and clicking the “Next” button, we will see the Source Type page. On this page, we have to choose “Data stores” and “Crawl all folders” (in our case, “Crawl new folders only” would work too).

Here we specify where the crawler should look for new data

On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!

IAM role configuration

Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:


Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

configuring the crawler's schedule

In the output configuration, we need to customize the settings.

First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”

We pick the “Add new columns only” because we will make manual changes in the table definition, and we don’t want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table.” box.

Crawler's output configuration

Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:

Default partition names replaced with their actual names
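Once those partition columns are in place, queries that filter on them let Presto skip the S3 prefixes it does not need; a quick sketch, assuming the renamed partition columns include year, month, and day:

-- Only the matching partitions (S3 prefixes) are scanned
SELECT count(*) AS logged_requests
FROM s3.sagemaker_logs.logs
WHERE year = '2022' AND month = '05' AND day = '06';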

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.

Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.

AWS Glue data catalog setup in Ahana

Let’s select our AWS region and put the AWS account id in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:

The "Outputs" tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana
The “Outputs” tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the “Add data source” button.

Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources” section. In this section, we click the “Manage data sources” button.

We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.

Adding data sources to an Ahana cluster

Writing the Presto queries

The actual structure of the input and output from an AWS Sagemaker Endpoint depends on us. We can send any JSON request and return a custom JSON object.

Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We will need those ids to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, …, 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single “prediction” key and a decimal value:

{"prediction": 21.266147618448954}

A single request in Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we need to calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row per every model-variant pair. That’ll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we will use the filter expression to count the number of values within buckets. Additionally, we return the bucket boundaries. We need to use an aggregate function max (or any other aggregate function) because of the group by clause. That won’t affect the result because we returned a single row per every model-variant pair in the previous query.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant

How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.

If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.

At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:

Example chart: both models perform almost the same

If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):

Example chart: an improved second version of the model

Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready to use Cube cluster. However, if you prefer self-hosting, then follow this tutorial.

First, please create a new Cube Cloud deployment. Then, open the “Deployments” page and click the “Create deployment” button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the “Apply” button. Remember to enable the SSL connection!

Presto configuration page

Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.

Let’s open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file

In the next window, type the file name errorpercentiles.js and click “Create file.”


In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

// Generate a `max` measure definition for every column name
const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file

We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measurements. We can write them out one by one like this:


measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
A part of the error percentiles configuration in Cube

The notation we have shown you has lots of repetition and is quite verbose. We can shorten the measurements defined in the code by using JavaScript to generate them.

Note that the following code must be placed before the cube function call.

First, we have to create an array of column names:


const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:


// Generate a `max` measure definition for every column name
const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

Finally, we can replace the measure definitions with:

measures: measures

After changing the file content, click the “Save All” button.

The top section of the schema view

And click the Continue button in the popup window.

The popup window shows the URL of the test API

In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):

An example result in the Playground view

Configuring access control in Cube

In the Schema view, open the cube.js file.

We will use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We will put the identifier of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token variables.

For example, we can send a JWT token with the payload shown below, essentially a models field listing the identifiers of the models the user is allowed to access. Of course, in the application sending queries to Cube, we must check the user’s access rights and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.

The Security Context window in the Playground view

After rejecting unauthorized access, we add a filter to all queries.

We can distinguish between the datasets accessed by the user by looking at the data specified in the query. We need to do it because we must filter by the modelName property of the correct table.

In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'percentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};

Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren’t interested in refreshing the data that often. Sagemaker Endpoints store the logs in JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long time, the query may take some time. Below, we will show you how to configure the caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.

For the sake of the example, we will retrieve the value only twice a day.

Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.

In PostgreSQL, we grant permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.

If you want to use a different database name, follow the instructions in our documentation.


Caching configuration in Cube

Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can also deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:

An empty “Commit Changes & Push” view

When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.

We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly

Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. In the web part, we will have only the code required to display the charts. In the back-end, we will handle authentication and authorization. The backend service will also send requests to the Cube REST API.

Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to the CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page

We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.

The location of the Copy API URL link

Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.

After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
CUBE_URL = '';
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don’t store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  
  return cubejsToken;
}

We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.

We create a new express application running in the node server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. It will return only the models we are allowed to see because we have configured the Cube security context:

Our function returns a list of objects containing the modelName and modelVariant fields.

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, res_, body) => {
    if (err) {
      console.log(err);
    }
    body = JSON.parse(body);
    response = body.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};

Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.

{
  "query": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in every bucket and prepares bucket labels:

response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['percentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})

Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS style part.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName name="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We will also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
  },
  legend: {
    display: false
  },
  tooltip: {
    enabled: false
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with error distributions.

Charts displaying the error distribution for different model variants

Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.

Data Lakehouse

Price-Performance Ratio of AWS Athena vs Ahana Cloud for Presto

What does AWS Athena cost? Understand the price-performance ratio of Amazon Athena vs. Ahana. Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. There are a handful of popular services that use Presto, including both AWS Athena and Ahana Cloud. 

To explain AWS Athena pricing compared to Ahana pricing, let’s first cover what is different between them. The biggest difference between the two is that AWS Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model: with AWS Athena you pay for the amount of data scanned rather than for the amount of compute used, while Ahana Cloud is priced by the amount of compute used. This can make a huge difference in price/performance. Before we get into the price-performance specifically, here’s an overview of the comparison:

AWS Athena (serverless Presto) vs. Ahana Cloud for Presto (managed service)

Cost dimension
  • AWS Athena: Pay for the amount of data scanned on a per-query basis at USD $5 per terabyte scanned. It may be hard to estimate how much data your queries will scan.
  • Ahana Cloud: Pay only for EC2 and Ahana usage on a per node/hour basis.
Cost effectiveness
  • AWS Athena: Only pay while the query is scanning, not for idle times.
  • Ahana Cloud: Only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service.
Scale
  • AWS Athena: Can scale query workloads but has concurrency limits.
  • Ahana Cloud: Can easily scale query workloads without concurrency limits.
Operational overhead
  • AWS Athena: Lowest operational overhead: no need to patch the OS, AWS handles that.
  • Ahana Cloud: Low operational overhead: no need to patch the OS, Ahana Cloud handles that and the operation of the servers.
Update frequency
  • AWS Athena: Infrequent updates to the platform. Not current with PrestoDB, over 60 releases behind.
  • Ahana Cloud: Frequent updates to the platform. Typically, Presto on Ahana Cloud is upgraded on a quarterly basis to keep up with the most recent releases.

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

Price–performance ratio
From Wikipedia, the free encyclopedia
In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we’ll look at the public case study on the real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you’ll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you’ll also be able to apply the same method when doing a quick trial (we’re here to help too).

Here are a few things that the Carbon public case study showed:

  • While you cannot tell how many EC2 instances (or of what type) Athena V2 uses, they determined that they could get similar performance with 10 c5.xlarge workers on Ahana Cloud.
  • Athena V2 would start to queue queries once 2 other queries were running, meaning that total wall-clock time was extended as a result.

AWS Athena is constrained by AWS concurrency limits

AWS Athena cost

Ahana has higher concurrency so queries finish faster

Ahana cost
  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 X c5.xlarge workers has total costs of:
Type                          Instance    Price/hr    Qty.    Cost/hr
Presto Worker                 c5.xlarge   17 cents    10      $1.70
Presto Coordinator            c5.xlarge   17 cents    1       $0.17
Ahana Cloud                               10 cents    11      $1.10
Ahana Cloud Managed Service               8 cents     1       $0.08
Total                                                         $3.45

So, you can run many queries for one hour that scan any amount of data for only $3.45 compared to one query of Athena scanning one TB of data costing $5.00.

Summary

While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy cloud-native managed service with pay-as-you-go-pricing for your PrestoDB deployment. 

Ready to Compare Ahana to Athena?

Start a free trial today and experience better price performance

AWS Athena Limitations

AWS Athena Alternatives

Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB. We’ll also discuss some of the reasons why you’d choose to deploy PrestoDB yourself, rather than using the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. Athena has a serverless architecture, which is a benefit. However, one of the drawbacks is the cost of AWS Athena: users currently pay per query, priced at $5 per terabyte scanned. Some of the common Amazon Athena limits are technical limitations that include query limits, concurrent query limits, and partition limits. These AWS Athena limits hurt performance, as queries run slowly and operational costs rise. In addition to this, AWS Athena is built on an older version of PrestoDB and it only supports a subset of the PrestoDB features.

An overview on AWS Athena limits

AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Most of the limitations associated with Athena are rather challenging to work around. Luckily, some are soft quotas, which you can request AWS to increase. One big issue is Athena’s restrictions on queries: Athena users can only submit one query at a time and can only run up to five queries simultaneously for each account by default.

AWS Athena Alternatives

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits: 

1. Athena DDL max query limit: 20 active DDL queries.

2. Athena DDL query timeout limit: The Athena DDL query timeout is 600 minutes.

3. Athena DML query limit: Athena only allows you to have 25 DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions by default.

4. Athena DML query timeout limit: The Athena DML query timeout limit is 30 minutes.

5. Athena query string length limit: The Athena query string hard limit is 262,144 bytes.

Ready To Work Without Limitations?

Get Started Today for Free With Ahana Cloud

AWS Athena partition limits

  1. Athena’s users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 per table and Glue’s limit is 1,000,000 partitions per table.
  2. A Create Table As Select (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation, you must manually chop up your data by running a series of INSERT INTO statements that each insert up to 100 partitions, as sketched below.
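
Here is a minimal sketch of that workaround, assuming a hypothetical source table events_raw partitioned by a dt column (the table, column, and date values are illustrative only):

-- Create the destination table with the first batch of at most 100 partitions
CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['dt']
) AS
SELECT event_id, payload, dt
FROM events_raw
WHERE dt BETWEEN '2022-01-01' AND '2022-04-10';

-- Add the remaining partitions in further batches of at most 100 per statement
INSERT INTO events_parquet
SELECT event_id, payload, dt
FROM events_raw
WHERE dt BETWEEN '2022-04-11' AND '2022-07-19';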

Athena database limits

AWS Athena also has the following S3 bucket limitations: 

1.    Amazon S3 bucket limit is 100* buckets per account by default – you can request to increase it up to 1,000 S3 buckets per account.           

2.    Athena restricts each account to 100* databases, and databases cannot include over 100* tables.

*Note, recently Athena has increased this to 10K databases per account and 200K tables per database.

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An Amazon Athena alternative is deploying your own PrestoDB cluster. Amazon Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won’t have AWS Athena limitations such as the concurrent query limits, database limits, table limits, partition limits, etc. Plus you’ll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation’s Presto Foundation. It has a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service, without the limitations of AWS Athena.

You can use Ahana to query and analyze AWS data lakes stored in Amazon S3, and many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering via a beautiful and easy-to-use console UI. Anyone at any knowledge level can use it with ease; there is zero configuration effort and no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Up next: AWS Athena Query Limits

Related Articles 

Athena vs Presto

Learn the differences between Presto and Ahana and understand the pros and cons.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

Best Athena Alternative

Discover the 4 most popular choices to replace Amazon Athena.

Data Lakehouse

What is AWS Redshift Spectrum?

What is Redshift Spectrum?

Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Spectrum allows you to do federated queries from within the Redshift SQL query editor to data in S3, while also being able to combine it with data in Redshift.

Since it shares a name with AWS Redshift, there is some confusion as to what AWS Redshift Spectrum is. To discuss that, however, it’s important to know what AWS Redshift is: namely, an Amazon data warehouse product that is based on PostgreSQL version 8.0.2.

Benefits of AWS Redshift Spectrum

When compared to a similar object-store SQL engine available from Amazon such as Athena, Redshift Spectrum has significantly higher and more consistent performance. Athena uses pooled resources, while Spectrum’s capacity is based on your Redshift cluster size and is, therefore, a known quantity.

Spectrum allows you to access your data lake files from within your Redshift data warehouse, without having to go through an ingestion process. This makes data management easier. This also reduces data latency since you aren’t waiting for ETL jobs to be written and processed.

With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift. This means there are no new tools to learn and it allows you to leverage your existing skillsets to query Redshift. Under the hood, Spectrum is breaking the user queries into filtered subsets that run concurrently. These can be distributed across thousands of nodes to enhance the performance and can be scaled to query exabytes of data. The data is then sent back to your Redshift cluster for final processing.
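
As a rough sketch of what this looks like in practice (the schema, database, table, bucket, and IAM role names below are hypothetical), you register an external schema backed by the Glue Data Catalog and define an external table over your S3 files:

-- Register an external schema that points at a Glue Data Catalog database
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3
CREATE EXTERNAL TABLE spectrum_sales.orders (
  order_id BIGINT,
  item_id BIGINT,
  quantity INT,
  order_ts TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/orders/';

Once the external table exists, it can be queried with the same SQL you already use for local Redshift tables.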

AWS Redshift Spectrum Performance & Price

AWS Redshift Spectrum is only going to be as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, the performance will be extremely slow. Connecting to a well-partitioned collection of column-based Parquet stores, on the other hand, will be much faster. Not having indexes on the object stores means that you really have to rely on the efficient organization of the files to get higher performance.

As to price, Spectrum follows the terabyte scan model that Amazon uses for a number of its products. You are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.

Redshift and Redshift Spectrum Use Case

An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your historical order history is contained in your Redshift data warehouse. However, real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make an order decision for particular items because there is a long lead time. Redshift knows what you have done historically, but that S3 data is only processed monthly into Redshift. With Spectrum, the query can combine what is in Redshift and join that with the Parquet files on S3 to get an up-to-the-minute view of order volume so a more informed decision can be made.
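
A query for that scenario could look roughly like the following, joining the warehouse’s historical orders with the streaming orders landing in S3 (it reuses the illustrative local table public.order_history and the external table spectrum_sales.orders sketched above):

-- Combine historical warehouse data with near-real-time data in the data lake
SELECT h.item_id,
       SUM(h.quantity) AS historical_units,
       SUM(s.quantity) AS streamed_units
FROM public.order_history h
LEFT JOIN spectrum_sales.orders s
  ON s.item_id = h.item_id
GROUP BY h.item_id
ORDER BY streamed_units DESC;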

Summary

Amazon Redshift Spectrum provides a layer of functionality on top of Redshift. It allows you to interact with object stores in AWS S3 without building a whole other tech stack. It makes sense for companies that are using Redshift and need to stay there, but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and going entirely to the data lake. Redshift Spectrum does not make sense for you if all your files are already in the data lake: Spectrum becomes very expensive as the data grows and offers no visibility into the queries. This is where a managed service like Ahana for Presto fits in.

Run SQL on your Data Lakehouse

At Ahana, we have made it very simple and user friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few mins.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Building a Data Lake: How to with Lake Formation on AWS

What is an AWS Lake Formation?

Briefly, AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach. AWS Lake Formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can’t be your administrator for your data lake, so the first thing you want to do is create a new user that has full admin rights. Go to IAM and create that user and give them AdministratorAccess capability. Next, to get started with building a data lake, create an S3 bucket and any data directories you are going to use if you don’t already have something configured. Do that in the S3 segment of AWS as you would normally. If you already have an S3 location setup, you can skip that step. In either case, we then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

Data Lake Formation

Now with your Lake Formation registered data sources, you can create a database from those sources in Lake Formation, and from there, create your Glue Crawlers as the next step of building a data lake. The crawler will take that database that you created, and go into the S3 bucket, read the directory structure and files to create your tables and fields within the database. Once you’ve run your Crawler, you’ll see the tables and fields reflected under “Tables”. The crawler creates a meta-data catalog that provides the descriptions of the underlying data that is then presented to other tools to access, such as AWS Quicksight and Ahana Presto. Amazon provides this diagram:

AWS Lake Formation
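
As a quick, hypothetical illustration of how that catalog is consumed downstream, assume a Presto catalog named glue that is backed by the Glue Data Catalog and a crawled table weblogs.page_views with a year partition column (all names are illustrative); a query against the crawled data could then be as simple as:

-- Query the table the crawler created, straight from Presto
SELECT status_code, count(*) AS hits
FROM glue.weblogs.page_views
WHERE year = '2022'
GROUP BY status_code
ORDER BY hits DESC;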

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location

Benefits of Building a Data Lake with AWS Lake Formation

Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

what is a data lake

Where the major advantages lie, however, is with the “LF-Tags” and “LF-Tag permissions”. This is where your granular security can be applied in a way that will greatly simplify your life. Leveraging Lake Formation, we have two ways to assign and manage permissions to our catalog resources: “Named” based access and “Tag” based access.

data lake permissions

Named-based access is what most people are familiar with. You select the principal, which can be an AWS user or group of users, and assign it access to a specific database or table. The Tag-based access control method uses Lake Formation tags, called “LF-Tags”. These are attributes that are assigned to data catalog resources, such as databases, tables, and even columns, and granted to principals in our AWS account to manage authorization to those resources. This is especially helpful in environments that are growing and/or changing rapidly, where policy management can be onerous. Tags are essentially key/value pairs that define these permissions:

  • Tags can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

AWS Lake Formation Use Cases

If we wanted to control access to an employee table for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

Building a Data Lake

We have resources “employees” and “sales”, each with multiple tables, with multiple named rows. In a conventional security model, you would give the HR group full access to the employees resource, but all of the corp group would only have access to the “details” table. What if you needed to give access to position.title and payroll.date to the corp group? We would simply add the corp group LF Tag to those fields in addition to the details table, and now they can read those specific fields out of the other two tables, in addition to everything they can read in the details table. The corp group LF Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date

If we were to control access by named resources, each named principal would have to be explicitly granted access to those databases and tables, and often there is no ability to control access by column, so that part wouldn’t even be possible at the data level.

Building a Data Lake: Summary

AWS Lake Formation really simplifies the process of building a data lake, whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.

What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Reporting and dashboarding diagram

The Open Data Lakehouse – a quick intro

Understanding the necessity of building a data lakehouse is critical in today’s data landscape. If you’re looking to get started with constructing a data lakehouse analytics stack, book time with an engineer to expedite the development process.

Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.


The cloud transformation has triggered the disaggregation of compute and storage, which brings cost benefits and the flexibility to store data coming from many sources and in many formats. All this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open source and open format technologies such as Presto and Hudi. In this blog you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.

What is an Open Data Lakehouse

The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in, including licensing, data formats, interfaces, and infrastructure.

Four key elements include:

Open source – The technologies on the stack we will be exploring for Open Data Lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community. 

Open formats – The stack also doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet, and others.

Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on. 

Open cloud – The stack is cloud agnostic and, with compute decoupled from storage, aligns natively with containers and can be run on any cloud.

Why Open Data Lakehouses?

Open data lakehouses allow consolidation of structured and unstructured data in a central repository, the open data lake, at a lower cost, and remove the complexity of running ETL, resulting in higher performance and reduced cost and time to run analytics.

  • Bringing compute to your data (decouple of compute and storage)
  • Flexibility at the governance/transaction layer
  • Flexibility and low cost to store structured and semi/unstructured data
  • Flexibility at every layer – pick and choose which technology works best for your workloads/use case

Open Data Lakehouse architecture

Now let’s dive into the stack itself and each of the layers. We’ll discuss what problems each layer solves for.

The next EDW is the Open Data Lakehouse. Learn the data lakehouse format.

BI/Application tools – Data Visualization, Data Science tools

Plug in your BI/analytical application tool of choice. The Open Data Lake Analytics stack supports the use of JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, etc. based on your use case and workload.

Presto – SQL Query Engine for the Data Lake

Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores. 

Presto is built for high-performance, interactive querying with in-memory execution.

Key characteristics include: 

  • High scalability from 1 to 1000s of workers
  • Flexibility to support a wide range of SQL use cases
  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
  • Federation of data sources, particularly data lakes, via Presto connectors (see the sketch after this list)
  • Seamless integration with existing SQL systems with ANSI SQL standard
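
As a minimal illustration of that federation (the catalog, schema, table, and column names below are hypothetical), a single Presto query can join data files in an S3-backed Hive catalog with an operational database:

-- Join S3 data (Hive connector) with PostgreSQL data in one query
SELECT c.segment,
       count(*)      AS orders,
       sum(o.amount) AS revenue
FROM hive.sales.orders o            -- e.g. Parquet files on S3
JOIN postgresql.public.customers c  -- relational database via the PostgreSQL connector
  ON o.customer_id = c.customer_id
GROUP BY c.segment;
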
deploying presto clusters

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into the data storage tier like S3.

Apache Hudi – Streaming Transactions in the Open Data Lake

One of the big drawbacks of traditional data warehouses is keeping the data updated. It requires building data marts/cubes and then doing constant ETL from source to destination mart, resulting in additional time, cost, and duplication of data. Similarly, data in the data lake needs to be updated and kept consistent without that operational overhead.

A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following: 

  • Ingesting incremental data
  • Change data capture, both inserts and deletions
  • Incremental data processing
  • ACID transactions

Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source transaction layer with storage abstraction for analytics, originally developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses the open file formats Parquet and Avro for data storage and internal table formats known as Copy-On-Write and Merge-On-Read.

It has built-in integration with Presto, so you can query Hudi datasets stored in those open file formats.

Hudi Data Management

Hudi has a table format based on a directory structure; a table has partitions, which are folders containing the data files for that partition. It has indexing capabilities to support fast upserts. Hudi has two table types that define how data is indexed and laid out, which in turn determines how the underlying data is exposed to queries.

Hudi data management

(Image source: Apache Hudi)

  • Copy-On-Write (COW): Data is stored in Parquet file format (columnar storage), and each new update creates a new version of files during a write. Updating an existing set of rows will result in a rewrite of the entire parquet files for the rows being updated.
  • Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.

Based on the two table types, Hudi provides three logical views for querying data from the data lake (see the sample queries after this list).

  • Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Queries see new data written to the table after a commit/compaction. This helps to build incremental data pipelines and analytics on top of them.
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline
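
As a sketch of how these views typically surface in Presto (assuming a Merge-On-Read table named trips that has been synced to the Hive metastore; Hudi usually registers a read-optimized table with an _ro suffix and a real-time table with an _rt suffix, and all names here are illustrative):

-- Read-optimized view: latest compacted, columnar data only
SELECT count(*) FROM hive.rides.trips_ro WHERE trip_date = date '2022-05-06';

-- Real-time view: merges columnar base files with row-based delta logs at query time
SELECT count(*) FROM hive.rides.trips_rt WHERE trip_date = date '2022-05-06';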

AWS S3 – The Data Lake

The data lake is the central location for storing structured, semi-structured, and unstructured data from disparate sources, in open formats, on object storage such as AWS S3.

Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.

Getting Started: How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3

Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.

Ingest your data into AWS S3 and query with Presto

Data can be ingested into the data lake from different sources such as Kafka and other databases. By introducing Hudi into the data pipeline, the needed Hudi tables will be created or updated, and the data will be stored in either Parquet or Avro format, based on the table type, in the S3 data lake. Later, BI tools and applications can query the data using Presto, which will reflect updated results as the data gets updated.

Conclusion:

The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.

The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, along with the transactional support and change data capture capabilities of Hudi, makes it a strong open source and open format solution for data lake analytics. A missing component, however, is data lake governance, which allows you to run queries on S3 more securely. AWS has recently introduced Lake Formation, a data governance solution for the data lake, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lakes with fine-grained access to data.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.

Data Lakehouse

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand, and it is compatible with a multitude of AWS tools and technologies. AWS Redshift is considered the cloud data warehouse of choice by many customers, but the pricing is not simple, since it tries to accommodate different use cases and customers. Let us try to understand the pricing details of Amazon Redshift.

Understanding Redshift cluster pricing

The Redshift cluster consists of multiple nodes allowing it to process data faster. This means Redshift performance depends on the node types and number of nodes. The node types can be dense compute nodes or Redshift managed storage nodes.

Dense Compute nodes: These nodes offer physical memory up to 244GB and storage capacity on SSD up to 2.5TB.

RA3 with managed storage nodes: These nodes have physical memory up to 384GB and storage capacity on SSD up to 128TB. Additionally, when storage runs out on the nodes, Redshift will offload the data into S3. The RA3 pricing below does not include the cost of managed storage.

Redshift pricing

Redshift spectrum pricing

Redshift Spectrum is a serverless offering that allows running SQL queries directly against an AWS S3 data lake. Redshift Spectrum is priced per terabyte of data scanned.

redshift spectrum pricing

Concurrency scaling

Amazon Redshift allows you to grab additional resources as needed and release them when they are not needed. For every day of typical usage, up to one hour of concurrency scaling is free. However, every second beyond that is charged for the additional resource usage.

Here is a pricing example as stated by Amazon Redshift. A 10-node DC2.8XL Redshift cluster in US-East costs $48 per hour. Consider a scenario where two transient clusters are utilized for 5 minutes beyond the free concurrency scaling credits. The per-second on-demand rate is $48 x 1/3600 = $0.013 per second. The additional cost for concurrency scaling in this case is $0.013 per second x 300 seconds x 2 transient clusters = $8.

Amazon Redshift managed storage (RMS) pricing

Managed storage comes with RA3 node types. Usage of managed storage is calculated hourly based on the total GB stored. Managed storage pricing does not include backup storage charges for automated and manual snapshots.

Amazon Redshift cost

Pricing example for managed storage pricing

100 GB stored for 15 days: 100 GB x 15 days x (24 hours/day) = 36,000 GB-hours

100 TB stored for 15 days: 100 TB x 1024 GB/TB x 15 days x (24 hours/day) = 36,864,000 GB-hours

Total usage in GB-hours: 36,000 GB-hours + 36,864,000 GB-hours = 36,900,000 GB-hours

Total usage in GB-months: 36,900,000 GB-hours / 720 hours per month = 51,250 GB-months

Total charges for the month: 51,250 GB-months x $0.024 per GB-month = $1,230

Limitations of Redshift Pricing

As you can see, Redshift offers only a few instance types, each with limited storage.

Customers can easily hit the ceiling on node storage, and Redshift managed storage becomes expensive as data grows.

Redshift Spectrum (the serverless option), at $5 per TB scanned, can be an expensive choice, and it removes the customer's ability to scale nodes up and down to meet their performance requirements.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced use cases in Redshift, or managing increasing cloud storage costs, check out Ahana. Ahana is a powerful managed service for Presto which provides SQL on S3. Unlike Redshift Spectrum, Ahana allows customers to choose the right instance type and scale up and down as needed, and it comes with a simple pricing model based on the number of compute instances.

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


On-Demand Presentation

From Lake to Shining Lakehouse, A New Era In Data

From data warehousing to data lakes, and now with so-called data lakehouses, we’re seeing an ever greater appreciation for the importance of architecture. The success of Snowflake proved that data warehousing is alive and well; but that’s not to say that data lakes aren’t viable. The key is to find a balance of both worlds, thus enabling the rapid analysis afforded by warehousing, and the strategic agility of explorative ad hoc queries that data lakes provide. During this episode of DM Radio you will learn from experts Raj K of General Dynamics Information Technology, and Wen Phan of Ahana.


Speakers

K Raj

Wen Phan

Eric Kavanaugh

PrestoDB on AWS

What is PrestoDB on AWS?

Tip: If you are looking to better understand PrestoDB on AWS, check out the free, downloadable ebook, Learning and Operating Presto. This ebook will break down what Presto is, how it started, and the best use cases.

To tackle this common question, what is PrestoDB on AWS, let’s first define Presto. PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. Presto was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. It was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables self-service ad-hoc analytics for its users on large amounts of data. With Presto, you can query data where it lives, including in Hive, Amazon S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

AWS and Presto are a powerful combination. If you want to run PrestoDB on AWS, it’s easy to spin up a managed Presto cluster through the Amazon Management Console, the AWS CLI, or the Amazon EMR API, and running the Presto CLI against an EMR cluster is straightforward.

You can also give Ahana Cloud a try. Ahana is a managed service for Presto that takes care of the devops for you and provides everything you need to build your SQL Data Lakehouse using Presto.

Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effective features of the cloud while allowing you to take advantage of Presto’s distributed query engine. 

How does PrestoDB on AWS Work?

This is another very common question. The quickest answer is that PrestoDB is the compute engine on top of the data storage of your SQL Data Lakehouse; in this case, the storage is AWS S3. See the image below for an overview.

PrestoDB on AWS
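To make this concrete, here is a minimal sketch of exposing Parquet files already sitting in S3 to Presto through the Hive connector. The bucket, schema, and column names are placeholders, and your metastore setup may differ:

-- Register an external table over existing Parquet files in S3 (names are placeholders)
CREATE TABLE hive.analytics.page_views (
    user_id   BIGINT,
    url       VARCHAR,
    viewed_at TIMESTAMP
)
WITH (
    external_location = 's3://my-datalake/page_views/',
    format = 'PARQUET'
);

-- Then query it like any other table
SELECT url, count(*) AS views
FROM hive.analytics.page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;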

There are some AWS services that work with PrestoDB on AWS, like Amazon EMR and Amazon Athena. These are managed services that handle the integration, testing, setup, configuration, and cluster tuning for you. Both are widely used, but each comes with some challenges, such as price performance and cost.

There are some differences when it comes to EMR Presto vs Athena. AWS EMR lets you provision as many compute instances as you want, within minutes. Amazon Athena lets you run Presto on the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune.

Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Amazon Athena vs Presto is a common comparison, and many users weigh the managed service against running PrestoDB themselves. Learn more about those challenges and why they’re moving to Ahana Cloud, SaaS for PrestoDB on AWS.

To get started with Presto for your SQL Data Lakehouse on AWS quickly, check out the services from Ahana Cloud. Ahana has two versions of their solution: a Full-Edition and a Free-Forever Community Edition. Each option has components of the SQL Lakehouse included, as well as support from Ahana. Explore Ahana’s managed service for PrestoDB.

Related Articles 

PrestoDB on Spark

Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads.

Price-Performance Ratio of AWS Athena Presto vs Ahana Cloud for PrestoDB

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project. The biggest difference between the two is that Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers.

Athena Query Limits | Comparing AWS Athena & PrestoDB

In this blog, we discuss AWS Athena vs Presto and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service, like AWS pricing.

Ahana Cloud for PrestoDB

What is Presto and How Does It Work?

What is Presto and how does It work?

How does PrestoDB work? PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives, across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is built on Java and can also integrate with other third-party data sources or infrastructure components.
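For example, a single federated query can join a table in a relational database with a table in the data lake. This is only a sketch, and the catalog, schema, and table names below are hypothetical:

-- Join CRM data in MySQL with orders stored in the data lake, in one Presto query
SELECT c.customer_id, c.name, sum(o.total) AS lifetime_value
FROM mysql.crm.customers AS c
JOIN hive.sales.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
ORDER BY lifetime_value DESC
LIMIT 10;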

Is Presto a database?

No, PrestoDB is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between Presto and other forks?

PrestoDB originated at Facebook and was built specifically for Facebook's needs. PrestoDB is backed by the Linux Foundation’s Presto Foundation and is the original Facebook open source project. Other versions are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

When it comes to memory, it is usually discussed in the context of the JVMs themselves: depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. PrestoDB itself, however, doesn’t use this memory to cache any data.
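As a rough illustration, the JVM heap for each node is set in etc/jvm.config (for example, a -Xmx16G line), while the limits Presto itself enforces on query memory go in etc/config.properties. The values below are arbitrary and should be sized for your own workload:

# etc/config.properties (excerpt): limits Presto enforces on query memory
query.max-memory=30GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=10GB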

How does Presto cache and store data?

Presto stores intermediate data during the execution of tasks in its buffer cache. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query other, disparate data sources.

What is the Presto query execution model?

The query execution model is split into a few different concepts: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the query engine, it parses and converts it into a query. When PrestoDB executes the query, it does so by breaking it up into multiple stages. Stages are then split up into tasks across the multiple workers. Think of tasks as the units that actually do the work and processing. Tasks use an Exchange to share data between tasks and the outputs of processes.
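You can see this breakdown for yourself with EXPLAIN, which prints the fragments (stages) Presto plans for a statement. The table below is hypothetical:

EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*) AS order_count
FROM hive.sales.orders
GROUP BY region;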

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, PrestoDB will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 
– Capturing data
– Storing data
– Analysis
– Search
– Sharing
– Transfer
– Visualization
– Querying
– Updating

Technologies in the big data space are used to analyze, extract and deal with data sets that are too large or complex to be dealt with by traditional data processing application software. 

Presto queries data. Other technologies in the space include Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

Presto data sources are sources that connect to PrestoDB and that you can query. There are a ton in the PrestoDB ecosystem including AWS S3, Redshift, MongoDB, and many more.

What Is Presto Hive? 

Presto Hive typically refers to using PrestoDB with a Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and are typically stored in an HDFS or S3-type system. The metadata is information about the data files and how they map to schemas and tables. This metadata is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, via the Hive connector, is able to access both of these components. One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.
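A minimal Hive catalog definition, assuming a Thrift metastore is reachable at the (placeholder) host below, looks roughly like this in etc/catalog/hive.properties:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083

Tables registered in that metastore can then be queried from Presto as hive.<schema>.<table>.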

Does Presto Use Spark?

PrestoDB and Spark are two different query engines. At a high level, Spark supports complex/long running queries while Presto is better for short interactive queries. This article provides a good high level overview comparing the two engines.

Does Presto Use YARN?

PrestoDB is not dependent on YARN as a resource manager. Instead it leverages a very similar architecture with dedicated Coordinator and Worker nodes that are not dependent on a Hadoop infrastructure to be able to run.

Autoscale your Presto cluster in Ahana Cloud

Autoscaling is now available in Ahana Cloud. This feature monitors the average CPU utilization of your worker nodes and scales out when it reaches the 75% threshold.

Best Practices for Resource Management in PrestoDB

Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources.

Presto vs Snowflake: Data Warehousing Comparisons

Data Lakehouse

Snowflake vs Presto

This article touches on several basic elements to compare Presto and Snowflake.

To start, let’s define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics. It’s well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database; you can’t store data in Presto, but you can use it as the compute engine for your data lakehouse. You can use Presto not just on the public cloud but also on private cloud infrastructures (on-premises or hosted).

Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Use cases: Snowflake vs. Presto

Snowflake is a cloud solution for traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads, i.e. for moving traditional batch-based reporting and dashboard-based analytics to the cloud. I discuss this limitation in the Scalability and Concurrency section.

Presto is not only a solution for reporting and dashboarding. With its connectors and in-place execution, platform teams can quickly provide access to the datasets analysts are interested in. Presto can also run queries in seconds, aggregate terabytes of data across multiple data sources, and run efficient ETL queries. With Presto, users can query data across many different data sources including databases, data lakes, and data lakehouses.

Open Source Or Vendor lock-in

Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in. 

Vendor Lock-In can lead to:

  • Excessive cost as you grow your data warehouse
  • Data ingested into the system is locked into the formats of a closed-source system
  • No community innovation, and no way to leverage other innovative technologies and services to process that same data

Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem. 

Open File Formats

Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into Snowflake's internal file format, but users miss out on the performance optimizations these open formats can bring to an engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge ones.

On the other hand, Presto users can run ad-hoc, real-time analytics, as well as deep learning, on those same source files without needing to copy them, so there’s more flexibility with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.

Open transaction format

Many organizations are adopting the data lakehouse architecture and augmenting their current data warehouse. This brings the need for a transaction manager layer, which can be provided by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats. Presto supports all of these table formats natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for the open data lakehouse. Moreover, the Snowflake data warehouse doesn’t support semi-structured/unstructured data workloads or AI/ML/data science workloads, whereas the data lakehouse does.

Data Ownership

While Snowflake did decouple storage and compute, they did not decouple data ownership. They still own the compute layer as well as the storage layer. This means users must ingest data into Snowflake in a proprietary format, creating yet another copy of the data and requiring users to move it out of their own environment. Users lose ownership of their data.

On the other hand, Presto is a truly disaggregated stack that allows you to run your queries in a federated manner without any need to move your data or create multiple copies. With Ahana, users can define Presto clusters, and orchestrate and manage them in their own AWS account using cross-account roles.

Scalability and Concurrency

With Snowflake you hit a limit on the maximum number of concurrent users on a single virtual warehouse. If you have more than eight concurrent users, you need to start another virtual warehouse. Query performance is good for simple queries; however, performance degrades as you apply more complex joins on large datasets, and the only options are limiting the data you query with Snowflake or adding more compute. Parallel writes also impact read operations, and the recommendation is to use separate virtual warehouses.

Presto is designed from the ground up for fast analytic queries against data sets of any size, has been proven on petabytes of data, and supports tens of concurrent queries (10-50) at a time.

Cost of Snowflake

Users think of Snowflake as an easy and low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large amounts of data and enterprise-grade, long-running queries can result in significant costs associated with Snowflake as it requires the addition of more virtual data warehouses which can rapidly escalate costs. Basic performance improvement features like Materialized Views come with additional costs. As Snowflake is not fully decoupled, data is copied and stored into Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, the users end up paying a higher cost to Snowflake than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.

Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database. 

Snowflake vs. Presto: In Summary

Snowflake is a well-known cloud data warehouse, but sometimes users need more than that – 

  1. Immediate data access as soon as it is written in a federated manner
  2. Eliminate lag associated with ETL migration when you can directly query from the source
  3. Flexible environment to run unstructured/ semi-structured or machine learning workloads
  4. Support for open file formats and storage standards to build open data lakehouse
  5. Open-source technologies to avoid vendor lock-in
  6. A cost-effective solution that is optimized for high concurrency and scalability.

Presto can address all these user needs in a more flexible, open-source, secure, scalable, and cost-effective way.

SaaS for Presto

If you want to use Presto, we’ve made it easy to get started on AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you run in containers on Amazon EKS, making the service highly scalable and available. Presto clusters are optimized and scale compute up and down as necessary, which helps companies achieve cost control. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation and address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and the data access and security policies you want to apply.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

AWS Athena vs AWS Glue: What Are The Differences?

Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).

You might already be familiar with Redshift basics. However, in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.

Redshift Data Warehouse Architecture and Main Components

Redshift internal architecture explained

As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse either using native AWS services such as Amazon AppFlow, or through a variety of 3rd-party tools such as Fivetran and Matillion. Many of these tools also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.

Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node, compute nodes that are divided into node slices, and databases. This design allows Redshift to dynamically allocate resources in order to efficiently answer queries.

Breaking Down the Redshift Cluster Components

  • The leader node is Redshift’s ‘brain’ and manages communications with external client programs. It also manages the internal communication between compute nodes. When a query is made, the leader node will parse it, compile the code and create an execution plan.
  • Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
  • The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
  • Intermediate results are sent back to the leader node. This then performs the final aggregation and sends the results to client applications via ODBC or JDBC. These would frequently be reporting and visualization tools such as Tableau or Amazon Quicksight, or internal software applications that read data from Redshift.
  • Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
  • Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.
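To see the plan the leader node produces for a query, you can prefix the statement with EXPLAIN; the table name below is hypothetical:

EXPLAIN
SELECT event_date, count(*) AS events
FROM web_events
GROUP BY event_date;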

This AWS presentation offers more details about Redshift’s internal architecture. It also provides a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum:

Additional Performance Features

In addition to these core components, Redshift has multiple built-in features meant to improve performance:

  • Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and indexing.
  • Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations. 
  • Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time. This ensures final aggregations are returned faster.
  • Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that could include multi-table joins and subqueries. 
  • Result caching: The results of certain types of queries can be stored in-memory on the leader node, which can also reduce query execution time.
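Several of these features surface directly in table DDL: the distribution key determines how rows are spread across node slices for MPP, and the sort key drives how the columnar blocks are ordered and skipped. A minimal, hypothetical example:

CREATE TABLE web_events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_date DATE,
    event_type VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_date);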

Redshift vs Traditional Data Warehouses

While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:

  • Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
  • Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
  • Serverless capabilities: Redshift Serverless can be used to automatically provision compute resources after a specific SQL query is made, further abstracting infrastructure management by removing the need to size your cluster in advance.

Redshift Costs and Performance

Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:

  • On-demand instances are charged by the hour, with no long-term commitment or upfront fees. 
  • Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time. 
  • Serverless instances are charged based on usage, so customers only pay for the capacity they consume.

The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets you might find yourself paying for very large clusters. 

Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query. 

Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.

Reducing Redshift Costs with a Lakehouse Architecture

We’ve worked with many companies who started out using Redshift. This was sufficient for their needs when they didn’t have much data, but they found it difficult and costly to scale as their needs evolved. 

Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.

At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse. This would leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.

Redshift Lakehouse Architecture Explained

In this architecture, companies can continue to use Redshift for workloads that require consistent performance such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data. It allows them to do so without having to constantly up or downsize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.
To learn more about what considerations you should be thinking about as you look at data warehouses or data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.

Still Choosing Between a Data Warehouse and a Data Lake?

Watch the webinar, Data Warehouse or Data Lake, which one do I choose? In this webinar, we’ll discuss the data landscape and why many companies are moving to an open data lakehouse.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand, but the pricing is not simple. This article breaks down how their pricing works.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Data Lakehouse

What Is Trino?

What is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine, which was forked from the original Presto project whose GitHub repo was called PrestoDB. As such, it was designed from the ground up for fast queries against any amount of data. It supports any type of data source, including relational and non-relational sources, via its connector architecture.

What is the history of Trino?

Trino is a hard fork of the widely popular open source Presto project, which started out at Facebook, running large-scale interactive analytic queries against a 300PB data lake using Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive. In November 2013, Facebook open sourced Presto under the Apache 2.0 license and made it available in the public GitHub repository named “prestodb”. In early 2019, the hard fork named Trino was started by the creators of Presto, who later became cofounders/CTOs of the commercial vendor Starburst. In the meantime, Presto became part of the openly governed Presto Foundation, hosted under the guidance and experience of The Linux Foundation. Trino has subsequently diverged from Presto. Many of the innovations the community is driving in Presto are not available outside of Presto. Today, it is Presto that runs at companies like Facebook, Uber, Twitter, and Alibaba.

Why is Trino so fast?

As a hard fork of the original Presto project, it carries with it some of the original elements that make Presto so fast, namely the in-memory execution architecture. Prior to Presto, distributed query engines such as Hive were designed to store intermediate results to disk.

How does Trino work?

It’s a distributed system that runs on Hadoop, and uses an architecture similar to massively parallel processing (MPP) databases. It has one coordinator node working with multiple worker nodes. Users submit SQL to the coordinator, which uses the query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It supports standard ANSI SQL, including complex queries, joins, aggregations, and outer joins.

What is Apache Trino?

Actually, this is a misnomer in that Trino is not a project hosted under the well-known Apache Software Foundation (ASF). Apache Incubator and top-level projects are subject to the naming convention “Apache [Project Name]”, an example of which is Apache Mesos. Instead, the Trino project, which is a hard fork of Presto, sits with a vendor-controlled non-profit called the Trino Software Foundation. It is not affiliated with any well-known project hosting organization like the ASF or The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which is what Trino is licensed under.

Is Trino OLAP?

It’s an open source distributed SQL query engine. It is a hard fork of the original Presto project created by Facebook. It lets developers run interactive analytics against large volumes of data. With Trino, organizations can easily use their existing SQL skills to query data without having to learn new complex languages. The architecture is quite similar to traditional online analytical processing (OLAP) systems using distributed computing architectures, in which one controller node coordinates multiple worker nodes.

What is the Trino Software Foundation?

The Trino Software Foundation is a non-profit corporation controlled by the cofounders of the commercial vendor Starburst. It hosts the open source Trino project, a hard fork of the Presto project, which is separate and hosted by the Linux Foundation. The Trino website has only two sentences about the foundation: “The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.” What is not mentioned is any form of charter or governance. These are table stakes for Linux Foundation projects, where project governance is central to the project.

What SQL does Trino use?

Just like the original Presto, Trino is built with a familiar SQL query interface that allows interactive SQL on many data sources. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations.

What Is A Trino database?

The distributed system runs on Hadoop/HDFS and other data sources. It uses a classic MPP (massively parallel processing) model. The Java-based system has a coordinator node (master) working in conjunction with a scalable set of worker nodes. Users send their SQL query through a client to the Trino coordinator, which plans and schedules a distributed query plan across all its worker nodes. Both Trino and Presto are SQL query engines and thus are not databases themselves. They do not store any data, but from a user perspective, Trino can appear as a database because it queries the connected data stores.

What is the difference between Presto and Trino?

There are technical innovations and differences between Presto and Trino that include:
– Presto is developed, tested, and runs at scale at Facebook, Uber, and Twitter
– Presto uses 6X less memory and repartitions 2X faster with project Aria
– “Presto on Spark” today can run massive batch ETL jobs.
– Presto today is 10X faster with project RaptorX, providing caching at multiple levels
– The Presto community is making Presto more reliable and scalable with multiple coordinators instead of the single point of failure of one coordinator node.  

Trino can query data where it is stored, without needing to move data into a separate warehouse or analytics database. Queries are executed in parallel using the memory of distributed worker machines, and most results return in seconds. Whereas Trino is a newer fork, Presto continues to be used by many well-known companies: Facebook, Uber, Twitter, AWS. Trino is a vendor-driven project, as it is hosted by a non-profit organization owned by the cofounders of the Trino vendor Starburst. In comparison, Presto is hosted by the Presto Foundation, a sub-foundation under The Linux Foundation. There are multiple vendors who support Presto, including the Presto as a Service (SaaS) offerings Ahana Cloud for Presto and AWS Athena, which is based on Presto, not Trino.

As the diagram below illustrates, Presto saves time by running queries in the memory of the worker machines: operations on intermediate datasets are performed in-memory, which is much faster than persisting them to disk, and data is shuffled amongst the workers as needed. This avoids writes to disk between stages. Hive persists intermediate data sets to disk; Presto executes tasks in-memory.

Whereas the pipelining approach between Presto and Trino is shared, Presto has a number of performance innovations that are not shared, such as caching. For more about the differences, see the April 2021 talk by Facebook at PrestoCon Day, which describes what they, along with others like Ahana, are doing to push the technology forward.

Trino is a distributed SQL query engine that is best used for running interactive analytic workloads on your data lakes and data sources. It is used for similar use cases to those the original Presto project was designed for. It allows you to query many different data sources, whether it’s HDFS, Postgres, MySQL, Elasticsearch, or an S3-based data lake. Trino is built on Java and can also integrate with other third-party data sources or infrastructure components.

Trino SQL

After the query is parsed, Trino processes the workload into multiple stages across workers. Computing is done in-memory with staged pipelines.

To make Trino extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. Because of this, it has a lot of connectors, including to non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, and Microsoft SQL Server. Like the original community-driven open source Presto project, the data is queried where it is stored, without the need to move it into a separate analytics system.

Want to hear more about Ahana, the easiest Presto managed service ever made? Learn more about Ahana Cloud.

Webinar On-Demand
Data Warehouse or Data Lake, which one do I choose?

(Hosted by Dataversity)

Today’s data-driven companies have a choice to make: where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al.) or the data lake (AWS S3 et al.). There are pros and cons to each approach. The data warehouse gives you strong data management with analytics, but it doesn’t handle semi-structured and unstructured data well, its storage and compute are tightly coupled, and it carries expensive vendor lock-in. Data lakes, on the other hand, allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.


Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.


In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.

Speaker

Ali LeClerc

Head of Community, Ahana


Ahana Announces New Presto Query Analyzer to Bring Instant Insights into Presto Clusters

Free-to-use Presto Query Analyzer by Ahana enables data platform teams to analyze Presto workloads and ensure top performance

San Mateo, Calif. – May 18, 2022 – Ahana, the only SaaS for Presto, today announced a new tool for Presto users called the Presto Query Analyzer. With the Presto Query Analyzer, data platform teams can get instant insights into their Presto clusters including query performance, bandwidth bottlenecks, and much more. The Presto Query Analyzer was built for the Presto community and is free to use.

Presto has become the SQL query engine of choice for the open data lakehouse. The open data lakehouse brings the reliability and performance of the data warehouse together with the flexibility and simplicity of the data lake, enabling data warehouse workloads to run alongside machine learning workloads. Presto on the open data lakehouse enables much better price performance as compared to expensive data warehousing solutions. As more companies are moving to an open data lakehouse approach with Presto as its engine, having more insights into query performance, workloads, resource consumption, and much more is critical.

“We built the Presto Query Analyzer to help data platform teams get deeper insights into their Presto clusters, and we are thrilled to be making this tool freely available to the broader Presto community,” said Steven Mih, Cofounder & CEO, Ahana. “As we see the growth and adoption of Presto continue to skyrocket, our mission is to help Presto users get started and be successful with the open source project. The Presto Query Analyzer will help teams get even more out of their Presto usage, and we look forward to doing even more for the community in the upcoming months.”

Key benefits of the Presto Query Analyzer include:

  • Understand query workloads: Break down queries by operators, CPU time, memory consumption, and bandwidth. Easily cross-reference queries for deep drill down.
  • Identify popular data: See which catalogs, schemas, tables, and columns are most and least frequently used, and by whom.
  • Monitor resource consumption: Track CPU and memory utilization across the users in a cluster.

The Presto Query Analyzer by Ahana is free to download and use. Download it to get started today.

More Resources

Free Download: Presto Query Analyzer by Ahana

Presto Query Analyzer sample report

Tweet this:  @AhanaIO announces #free #Presto Query Analyzer for instant insights into your Presto clusters https://bit.ly/3lo2rMM

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Data Lakehouse

What is Amazon Redshift Used For?

Introduction

Amazon Redshift is one of the most widely-used services in the AWS ecosystem, and is a familiar component in many cloud architectures. In this article, we’ll cover the key facts you need to know about this cloud data warehouse, and the use cases it is best suited for. We’ll also discuss the limitations and scenarios where you might want to consider alternatives.

What is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse offered by AWS, first introduced in 2012. Today Redshift is used by thousands of customers, typically for workloads ranging from hundreds of gigabytes to petabytes of data.

Redshift is based on PostgreSQL 8.0.2 and supports standard SQL for database operations. Under the hood, various optimizations are implemented to provide fast performance even at larger data scales. These include massively parallel processing (MPP) and read-optimized columnar storage.

What is a Redshift Cluster?

A Redshift cluster represents a group of nodes provisioned as resources for a specific data warehouse. Each cluster consists of a leader node and compute nodes. When a query is executed, Redshift’s MPP design means it automatically distributes the processing required to return the results of an SQL query between the available nodes.

The appropriate cluster size depends on the amount of data stored in your database, the number of queries being executed, and the desired performance.

Scaling and managing clusters can be done through the Redshift console, the AWS CLI, or programmatically through the Redshift Query API.
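For example, with the AWS CLI a cluster can be created and later resized. The identifiers, node type, and password below are placeholders, and this is only a sketch:

# Create a small two-node cluster
aws redshift create-cluster \
  --cluster-identifier demo-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 2 \
  --master-username awsuser \
  --master-user-password '<choose-a-strong-password>'

# Resize the cluster later as query volume grows
aws redshift resize-cluster \
  --cluster-identifier demo-cluster \
  --number-of-nodes 4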

What Makes Redshift Unique?

When Redshift was first launched, it represented a true paradigm shift from traditional data warehouses provided by the likes of Oracle and Teradata. As a fully managed service, Redshift allowed development teams to shift their focus away from infrastructure and toward core application development. The ability to add compute resources automatically with just a few clicks or lines of code, rather than having to set up and configure hardware, was revolutionary and allowed for much faster application development cycles.

Today, many modern cloud data warehouses offer similar linear scaling and infrastructure-as-a-service functionality. A few notable products include Snowflake and Google BigQuery. However, Redshift remains a very popular choice and is tightly integrated with other services in the AWS cloud ecosystem.

Amazon continues to improve Redshift, and in recent years has introduced federated query capabilities, serverless options, and AQUA (a hardware-accelerated cache).

Redshift Use Cases

Redshift’s Postgres roots mean it is optimized for online analytical processing (OLAP) and business intelligence (BI) – typically executing complex SQL queries on large volumes of data rather than transactional processing which focuses on efficiently retrieving and manipulating a single row.

Some common use cases for Redshift include:

  • Enterprise data warehouse: Even smaller organizations often work with data from multiple sources such as advertising, CRM, and customer support. Redshift can be used as a centralized repository that stores data from different sources in a unified schema and structure to create a single source of truth. This can then feed enterprise-wide reporting and analytics.
  • BI and analytics: Redshift’s fast query execution against terabyte-scale data makes it an excellent choice for business intelligence use cases. Redshift is often used as the underlying database for BI tools such as Tableau (which otherwise might struggle to perform when querying or joining larger datasets).
  • Embedded analytics and analytics as a service: Some organizations might choose to monetize the data they collect by exposing it to customers. Redshift’s data sharing, search, and aggregation capabilities make it viable for these scenarios, as it allows exposing only relevant subsets of data per customer while ensuring other databases, tables, or rows remain secure and private.
  • Production workloads: Redshift’s performance is consistent and predictable, as long as the cluster is adequately-resourced. This makes it a popular choice for data-driven applications, which might use data for reporting or perform calculations on it.
  • Change data capture and database migration: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexible analytical capabilities, or when migrating from legacy data warehouses.

Redshift Challenges and Limitations 

While Amazon Redshift is a powerful and versatile data warehouse, it still suffers from the limitations of any relational database, including:

  • Costs: Because storage and compute are coupled, Redshift costs can quickly grow very high. This is especially noticeable when working with larger datasets, or with streaming sources such as application logs.
  • Complex data ingestion: Unlike Amazon S3, Redshift does not support unstructured object storage. Data needs to be stored in tables with predefined schemas. This can often require complex ETL or ELT processes to be performed when data is written to Redshift. 
  • Access to historical data: Due to the above limiting factors, most organizations choose to store only a subset of raw data in Redshift, or limit the number of historical versions of the data that they retain. 
  • Vendor lock-in: Migrating data between relational databases is always a challenge due to the rigid schema and file formats used by each vendor. This can create significant vendor lock-in and make it difficult to use other tools to analyze or access data.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced Redshift use cases, or managing increasing cloud storage costs, check out Ahana. Ahana Cloud is a powerful managed service for Presto which provides SQL on S3. 

Start Running SQL on your Data Lakehouse

Go from 0 to Presto in 30 minutes and drop the limitations of the data warehouse

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

There can be some confusion with the difference between AWS Redshift Spectrum and AWS Athena. Learn more about the differences in this article.

ETL process diagram

ETL and ELT in Data Warehousing

What is the difference between ETL and ELT?

What is ETL used for?

If you’re looking to understand the differences between ETL and ELT, let’s start by explaining what they are. ETL, or Extract Transform Load, is when an ETL tool or a series of homegrown programs extracts data from one or more data sources, often relational databases, and performs transformation functions. Those transformations could be data cleansing, standardization, enrichment, etc. The data is then written (loaded) into a new repository, often a data warehouse.

In the ETL process, an ETL tool or series of programs extracts the data from different RDBMS source systems, and then transforms the data, by applying calculations, concatenations, etc., and then loads the data into the Data Warehouse system.

What is ETL

What is ELT used for?

ELT, or Extract Load Transform, turns the ETL process around a bit: you extract the raw data from the data source and load it directly into the destination, without any processing in between. The transformation is then done “in place” in the destination repository. Generally, the raw data is stored indefinitely, so various transformations and enrichments can be done by any users with access to it, using tools they are familiar with.
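As a simple sketch of what “transform in place” can look like, assuming the destination supports CREATE TABLE AS (the table and column names are hypothetical):

-- Raw data was loaded as-is; the cleanup happens inside the destination
CREATE TABLE analytics.orders_clean AS
SELECT
    order_id,
    lower(trim(customer_email)) AS customer_email,
    CAST(order_total AS DECIMAL(12,2)) AS order_total,
    order_date
FROM raw.orders
WHERE order_total IS NOT NULL;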

What is ELT?

Both are data integration styles and have much in common with their ultimate goals, but are implemented very differently. Knowing what they are, and understanding the ETL and ELT processes, let’s dive deeper into how they differ from one another.

What is the difference between ETL and ELT?

So how does ETL vs ELT break down?

Definition. ETL: Data is extracted from ‘n’ number of data sources, transformed in a separate process, then loaded into the destination repository. ELT: Data is extracted from ‘n’ number of data sources and loaded directly into the destination repository; transformation occurs inside the destination.

Transformation. ETL: Data is transformed within an intermediate processing step that is independent of extract and load. ELT: Data can be transformed on an ad-hoc basis during reads, or in batch and stored in another set of tables.

Code-Based Transformations. ETL: Primarily executed in the compute-intensive transformation process. ELT: Primarily executed in the database, but also done ad-hoc through analysis tools.

Data Lake Support. ETL: Only in the sense that the lake can be utilized as storage for the transformation step. ELT: Well oriented for the data lake.

Cost. ETL: Specialized servers for transformation can add significant costs. ELT: Object stores are very inexpensive, requiring no specialized servers.

Maintenance. ETL: Additional servers add to the overall maintenance burden. ELT: Fewer systems mean less to maintain.

Loading. ETL: Data has to be transformed prior to loading. ELT: Data is loaded directly into the destination system.

Maturity. ETL: ETL tools and methods have been around for decades and are well understood. ELT: Relatively new on the scene, with emerging standards and less experience.

Use Cases

Let’s take HIPAA-regulated healthcare data as an example of data that lends itself to ETL rather than ELT. The raw data contains a lot of sensitive information about patients that isn’t allowed to be shared, so you need to go through the transformation process prior to loading in order to remove that sensitive information. Say your analysts are trying to track cancer treatments for different types of cancer across a geographic region. You would scrub the data down in the transformation process to include treatment dates, location, cancer type, age, gender, etc., but remove any identifying information about the patient, as in the sketch below.
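The transformation step might look something like the following; the table and column names are made up for illustration:

-- Keep only de-identified fields before loading into the warehouse
SELECT
    treatment_date,
    facility_region,
    cancer_type,
    patient_age,
    patient_gender
FROM raw_patient_treatments;
-- Identifying columns such as name, date of birth, and address are deliberately left out.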

An ELT approach makes more sense with a data lake where you have lots of structured, semi-structured, and unstructured data, including high-velocity data where you are trying to make decisions in near real time. Consider an MMORPG where you want to offer incentives to players in a particular region who have performed a particular task. That data is probably coming in through a streaming platform such as Kafka, and analysts run transformation jobs on the fly to distill it down to the information needed to fuel the desired action.

Differences between ETL and ELT

Summary

In summary, the difference between ETL and ELT in data warehousing really comes down to how you are going to use the data as illustrated above. They satisfy very different use cases and require thoughtful planning and a good understanding of your environment and goals. If you’re exploring whether to use a data warehouse or a data lake, we have some resources that might be helpful. Check out our white paper on Unlocking the Business Value of the Data Lake which discusses the data lake approach in comparison to the data warehouse. 

Ready to Modernize Your Data Stack?

In this free whitepaper you’ll learn what the open data lakehouse is, and how it overcomes challenges of previous solutions. Get the key to unlocking lakehouse analytics.

Related Articles

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.

Data Lakehouse

Presto has evolved into a unified engine for SQL queries on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.

Presto’s Redshift connector allows conducting SQL queries on the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 

How to Run SQL Queries in Redshift with Presto

Step 1: Setup a Presto cluster with Kubernetes

Set up your own Presto cluster on Kubernetes using our Presto on Kubernetes tutorial, or use Ahana’s managed service for Presto.

Step 2: Set up an Amazon Redshift cluster

Create an Amazon Redshift cluster from the AWS Console and make sure it’s up and running with a dataset and tables as described here.

The screen below shows the Amazon Redshift cluster “redshift-presto-demo”.


Further, the JDBC URL from the cluster is required to set up the Redshift connector with Presto.

You can skip this section if you want to use your existing Redshift cluster. Just make sure your Redshift cluster is accessible from Presto, because AWS services are secure by default. Even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the target Redshift cluster can prevent inbound connections to the database cluster. In simple words, the security group settings of the Redshift database act as a firewall and can block inbound database connections over port 5439. Find the assigned security group and check its inbound rules.

If your Presto compute plane VPC and your data sources are in different VPCs, then you need to configure a VPC peering connection.

Step 3: Configure Presto Catalog for Amazon Redshift Connector

At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.

Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connector.name=redshift
connection-url=jdbc:postgresql://example.net:5439/database
connection-user=root
connection-password=secret

Here is what my catalog properties look like:

  my_redshift.properties: |
      connector.name=redshift
      connection-user=awsuser
      connection-password=admin1234
      connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev

Step 4: Check for available datasets, schemas, and tables, and run SQL queries with the Presto client to access the Redshift database

After successfully connecting to Amazon Redshift, you can connect to the Presto CLI and run the following queries to make sure that the Redshift catalog gets picked up. Then perform show schemas and show tables to understand the available data.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog my_redshift --schema <schema_name> --user <presto_username> --password

In the below example you can see that a new catalog for the Redshift database called “my_redshift” has been initialized.

presto> show catalogs;
   Catalog   
-------------
 ahana_hive  
 jmx         
 my_redshift 
 system      
 tpcds       
 tpch        
(6 rows)
 
Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Further, you can check all the schemas available in your Amazon Redshift cluster to work with from Presto.

presto> show schemas from my_redshift;
       Schema       
--------------------
 catalog_history    
 information_schema 
 pg_catalog         
 pg_internal        
 public             
(5 rows)
 
Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:01 [5 rows, 85B] [4 rows/s, 72B/s]

Here, I have used the sample data that comes with the Redshift cluster setup. I have chosen the schema “public”, which is part of the “dev” Redshift database.

presto> show tables from my_redshift.public;
  Table   
----------
 category 
 date     
 event    
 listing  
 sales    
 users    
 venue    
(7 rows)
 
Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:03 [7 rows, 151B] [2 rows/s, 56B/s]

Further, you can explore tables such as “sales”, as in the below example.

presto> select * from my_redshift.public.sales LIMIT 2;
 salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission |        saletime         
---------+--------+----------+---------+---------+--------+---------+-----------+------------+-------------------------
   33095 |  36572 |    30047 |     660 |    2903 |   1827 |       2 | 234.00    | 35.10      | 2008-01-01 01:41:06.000 
   88268 | 100813 |    45818 |     698 |    8649 |   1827 |       4 | 836.00    | 125.40     | 2007-12-31 23:26:20.000 
(2 rows)
 
Query 20210810_185527_00212_krtkp, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]

Following are some more complex queries you can run against sample data:

presto:public> -- Find top 10 buyers by quantity
            ->SELECT firstname, lastname, total_quantity 
            -> FROM   (SELECT buyerid, sum(qtysold) total_quantity
            ->         FROM  sales
            ->         GROUP BY buyerid
            ->         ORDER BY total_quantity desc limit 10) Q, users
            -> WHERE Q.buyerid = userid
            -> ORDER BY Q.total_quantity desc;
 firstname | lastname | total_quantity 
-----------+----------+----------------
 Jerry     | Nichols  |             67 
 Armando   | Lopez    |             64 
 Kameko    | Bowman   |             64 
 Kellie    | Savage   |             63 
 Belle     | Foreman  |             60 
 Penelope  | Merritt  |             60 
 Kadeem    | Blair    |             60 
 Rhona     | Sweet    |             60 
 Deborah   | Barber   |             60 
 Herrod    | Sparks   |             60 
(10 rows)
 
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
 
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
            -> SELECT eventname, total_price 
            -> FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
            ->        FROM (SELECT eventid, sum(pricepaid) total_price
            ->              FROM   sales
            ->              GROUP BY eventid)) Q, event E
            ->        WHERE Q.eventid = E.eventid
            ->        AND percentile = 1
            -> ORDER BY total_price desc;
      eventname       | total_price 
----------------------+-------------
 Adriana Lecouvreur   | 51846.00    
 Janet Jackson        | 51049.00    
 Phantom of the Opera | 50301.00    
 The Little Mermaid   | 49956.00    
 Citizen Cope         | 49823.00    
 Sevendust            | 48020.00    
 Electra              | 47883.00    
 Mary Poppins         | 46780.00    
 Live                 | 46661.00    
(9 rows)
 
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]

Step 5: Run SQL queries to join data between different systems like Redshift and Hive

Another great use case of Presto is data federation. In this example I will join an Apache Hive table with an Amazon Redshift table and run a JOIN query to access both tables from Presto.

Here, I have two catalogs: “ahana_hive” for the Hive database and “my_redshift” for Amazon Redshift. Each database has the my_redshift.public.users and ahana_hive.default.customer tables, respectively, within its schema.

The following simple SQL queries join these tables, the same way you would join two tables from the same database.

presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> select * from ahana_hive.default.customer x join my_redshift.public.users y on x.nationkey = y.userid;

Advanced SQL on Redshift

Understanding Redshift’s Limitations

Running SQL queries on Redshift has its advantages, but there are some shortcomings associated with Amazon Redshift. If you are looking for more information about Amazon Redshift, check out the pros and cons and some of the limitations of Redshift in more detail.


Start Running SQL Queries on your Data Lakehouse

We made it simple to run SQL queries on Presto in the cloud.
Get started with Ahana Cloud and start running SQL in a few mins.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse, permitting the execution of SQL queries, offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. Users can easily run SQL queries on Redshift, but there are some limitations.

Data Lakehouse

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

Before we begin: Redshift Spectrum vs Redshift

While the thrust of this article is an AWS Redshift Spectrum vs Athena comparison, there can be some confusion with the difference between AWS Redshift Spectrum and AWS Redshift. Very briefly, Redshift is the storage layer/data warehouse. Redshift Spectrum, on the other hand, is an extension to Redshift that is a query engine.


What is Amazon Athena?

Athena is Amazon’s standalone, serverless SQL query engine implementation of Presto. It is used to query data stored on Amazon S3. It is fully managed by Amazon; there is nothing to set up, manage, or configure. This also means that performance can be very inconsistent, as you have no dedicated compute resources.

What is Amazon Redshift Spectrum?

Redshift Spectrum is an extension of Amazon Redshift. It is a serverless query engine that can query both AWS S3 data and tabular data in Redshift using SQL. This enables you to join data stored in external object stores with data stored in Redshift to perform more advanced queries.
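
As a rough illustration of how that looks in practice (the schema, table, and IAM role names below are hypothetical), Spectrum exposes S3 data through an external schema that can then be joined with native Redshift tables:

-- Register an external schema over a Data Catalog database of S3 data
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Join S3-backed external data with a native Redshift table
SELECT c.customer_name, sum(o.amount) AS total_spend
FROM spectrum_sales.orders_s3 o
JOIN public.customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_name;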

Key Features & Differences: Redshift vs Athena

Athena and Redshift Spectrum offer similar functionality, namely, serverless query of S3 data using SQL. That makes them easy to manage. This also is more cost-effective as there is nothing to set up and you are only charged based on the amount of data scanned. S3 storage is significantly less expensive than a database on AWS for the same amount of data.

  • Pooled vs allocated resources: Both are serverless; however, Spectrum resources are allocated based on your Redshift cluster size, while Athena relies on non-dedicated, pooled resources.
  • Cluster management: Spectrum actually does need a bit of cluster management, but Athena is truly serverless.
  • Performance: Performance for Athena depends on your S3 optimization, while Spectrum, as previously noted, depends on your Redshift cluster resources and S3 optimization. If you need a specific query to run more quickly, then you can allocate additional compute resources to it.
  • Standalone vs feature: Redshift Spectrum runs in tandem with Amazon Redshift, while Athena is a standalone query engine for querying data stored in Amazon S3.
  • Consistency: Spectrum provides more consistency in query performance, while Athena’s performance can be inconsistent due to its pooled resources.
  • Query types: Athena is great for simpler interactive queries, while Spectrum is more oriented towards large, complex queries.
  • Pricing: The cost for both is the same. They run $5 per compressed terabyte scanned, however with Spectrum, you must also consider the Redshift compute costs.
  • Schema management: Both use AWS Glue for schema management, and while Athena is designed to work directly with Glue, Spectrum needs external tables to be configured for each Glue catalog schema.
  • Federated query capabilities: Both support federated queries.

Athena vs Redshift: Functionality

The functionality of each is very similar, namely using standard SQL to query the S3 object store. If you are working with Redshift, then Spectrum can join information in S3 with tables stored in Redshift directly. Athena also has a Redshift connector to allow for similar joins. However, if you are already using Redshift, it would likely make more sense to use Spectrum.

Athena vs Redshift: Integrations

Keep in mind that when working with S3 objects, these are not traditional databases, which means there are no indexes to be scanned or used for joins. If you are working with high-cardinality files and trying to join them, you will likely see very poor performance.

When connecting to data sources other than S3, Athena has a connector ecosystem to work with. This system provides a collection of sources that you can directly query with no copy required. Federated queries were added to Spectrum in 2020 and provide a similar capability with the added benefit of being able to perform transformations on the data and load it directly into Redshift tables.

AWS Athena vs Redshift: To Summarize

If you are already using Redshift, then Spectrum makes a lot of sense, but if you are just getting started with the cloud, then the Redshift ecosystem is likely overkill. AWS Athena is a good place to start if you want to test the waters at low cost and minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows.

At Ahana, many of our customers are previous Athena and/or Redshift users that saw challenges around price performance (Redshift) and concurrency/deployment control (Athena). Keep in mind that Athena and Redshift Spectrum carry the same $5 per terabyte scanned cost, while Ahana is priced purely on instance hours. Ahana provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Understanding AWS Athena Costs with Examples

What Is Amazon Athena? 

Since you’re reading this to understand Athena costs, you likely already know what it is, so we’ll just touch on it very briefly. Amazon Athena is a managed, serverless version of Presto. It provides a SQL query engine for analyzing unstructured data in AWS S3. Because there are no dedicated resources for the service, it will not perform in a consistent fashion, so it is best suited to cases where reliable speed and scalability are not particularly important: testing ideas, small use cases, and quick ad-hoc analysis.

How Much Does AWS Athena Cost?

An Athena query costs from $5 to $7 per terabyte scanned, depending on the region. Most materials you read will only quote the $5, but there are regions that cost $7, so keep that in mind. For our examples, we’ll use the $5 per terabyte as our base. There are no costs for failed queries, but any other charges such as the S3 storage will apply as usual for any service you are using.

AWS Athena Pricing Example

In this example, we have a screenshot from the Amazon Athena pricing calculator where we assume one query per working day, or 20 queries a month, each scanning 4TB of data. The cost per query works out as follows:

$5 per TB scanned * 4 TB scanned = $20 per query

So if we are doing that query 20 times per month, then we have 20 * $20 = $400 per month 


You can mitigate these costs by storing your data compressed, if that is an option for you. A very conservative 2:1 compression rate would cut your costs in half to just $200 per month. Now, if you were to store your data in a columnar format like ORC or Parquet, then you can reduce your costs even further by only scanning the columns you need, instead of the entire row every time. We’ll use the same 50% notion where we now only have to look at half our data, and now our cost is down to $100 per month.
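
For example, a one-time Athena CTAS statement (table and bucket names here are hypothetical) can convert row-based CSV data into compressed Parquet so that subsequent queries scan only the columns they need:

-- Convert a CSV-backed table to Parquet once...
CREATE TABLE my_db.events_parquet
WITH (format = 'PARQUET', external_location = 's3://my-bucket/events-parquet/')
AS SELECT * FROM my_db.events_csv;

-- ...then query only the columns you need, scanning far less data
SELECT event_type, count(*) AS events
FROM my_db.events_parquet
GROUP BY event_type;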

Let’s go ahead and try a larger example, which is not even a crazy big one if you are using the data lake and doing serious processing. Let’s say you have 20 queries per day, and you are working on 100TB of uncompressed, row-based data:


That’s right, $304,000 per month. Twenty queries per day isn’t even unrealistic if you have a few departments that want to run dashboard queries to get updates on various metrics.

Summary

While we learned the details of Athena pricing, we also saw how easy it would be to get hit with a giant bill unexpectedly. If you haven’t compressed your data or reformatted it to reduce those costs, and have just dumped a bunch of CSV or JSON files into S3, you can be in for a nasty surprise. The same goes for unleashing Athena connections on your data consumers without any controls, if they are firing off a lot of queries on a lot of data. It’s not hard to figure out what the cost will be for specific usage, and Amazon has provided the tools to do it.

If you’re an Athena user who’s not happy with costs, you’re not alone. We see many Athena users wanting more control over their deployment and in turn, costs. That’s where we can help – Ahana is SaaS for Presto (the same technology that Athena is running) that gives you more control over your deployment. Typically our customers see up to 5.5X price performance improvements on their queries as compared to Athena. 

You can learn more about how Ahana compares to AWS Athena in this comparison page.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article

Data Warehouse Architecture

5 Components of Data Warehouse Architecture


Tip: If you are struggling to get value from your data warehouse due to vendor lock-in or handling semi and unstructured data, set up a time to chat with an engineer about migrating to a Data Lakehouse.

What are the components of a data warehouse?

Most data warehouses will be built around a relational database system, either on-premise or in the cloud, where data is both stored and processed. Other components would include metadata management and an API connectivity layer allowing the warehouse to pull data from organizational sources and provide access to analytics and visualization tools.

A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.

The data warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load).

To learn more about the internal architecture of a data warehouse and its various components such as nodes and clusters, check out our previous article on Redshift data warehouse architecture.

The schema is made up of metadata (data about the data) so users could easily find what they were looking for. The data warehouse could also connect to many different data sources, so it became an easier way to manage all of a company’s data for analysis.

As data warehouse architecture evolved and grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Here’s a quick high level overview of the data warehouse architecture:

In this article we’ll look at the contextual requirements of data warehouse architecture, and the five components of a data warehouse. 

The 5 components of a data warehouse architecture are:
  1. ETL
  2. Metadata
  3. SQL Query Processing
  4. Data layer
  5. Governance/security

ETL 

As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works, but there are many different resources where you can learn about ETL.

Metadata

Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include the author, date, or location of an article, the creation date of a file, the size of a file, etc. Think of it like the column titles in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.
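
For instance, most SQL engines (including Presto) expose this metadata through information_schema, so you can see which tables and columns exist before querying them; the schema name here is just an example:

-- List the columns and data types available in a schema
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;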

SQL Query Processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.
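
A typical warehouse query an analyst might run looks something like the sketch below; the table and column names are hypothetical:

-- Monthly revenue by region from a fact table
SELECT region,
       date_trunc('month', order_date) AS month,
       sum(revenue) AS total_revenue
FROM sales_fact
GROUP BY region, date_trunc('month', order_date)
ORDER BY month, region;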

Data Layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.
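
A simple way to picture this is a view that exposes only the columns a particular team needs; this is just a minimal sketch with hypothetical names:

-- A "Sales mart" view that leaves HR-related data out entirely
CREATE VIEW sales_mart AS
SELECT order_id, customer_region, order_date, revenue
FROM sales_fact;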

Governance/Security

This is related to the data layer in that you need to be able to provide fine grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.
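
In SQL terms, that fine-grained access often ends up expressed as grants like the one below; the exact syntax varies by warehouse, and the role name is hypothetical:

-- Give sales analysts read-only access to their mart
GRANT SELECT ON sales_mart TO ROLE sales_analyst;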

Challenges with a Data Warehouse Architecture

Now that I’ve laid out the five key components of a data warehouse architecture, let’s discuss some of the challenges of the data warehouse. As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work. 

We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Open Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:


The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads. If you are looking to upgrade from data warehouse architecture, then developing an Open Data Lakehouse is the way to go.


If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse.


Related Articles

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understanding of what it is and how it’s used.

Data Warehouse Concepts for Beginners

A relational database that is designed for query and analysis rather than for transaction processing. Learn more here.

What is a Data Lakehouse Architecture?

Overview

The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse architecture is, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.

What is a Data Lakehouse?

First, it’s best to explain a Data Warehouse and a Data Lake.

Data Warehouse

A data warehouse is one central place where you can store specific, structured data. Most of the time that’s relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the Data Warehouse with very good price/performance. Using a data warehouse typically means you’re locked into that Data Warehouse’s proprietary formats – the trade off for the speed and price/performance is your data is ingested and locked into that warehouse, so you lose the flexibility of a more open solution.

Data Lake

On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular data lake storage options are AWS S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Data Lakes are widely popular because they are very cheap and easy to use – you can literally store an unlimited amount of any kind of data you want at a very low cost. However, the data lake doesn’t provide built-in mechanisms like query, analytics, etc. You need a query engine and data catalog on top of the data lake to query your data and make use of it (that’s where Ahana Cloud comes in, but more on that later).


Data Lakehouse

Now let’s look at the Data Lake vs the Lakehouse. A new data lakehouse architecture has emerged that takes the best of the Data Warehouse and the Data Lake. That means it’s open, flexible, has good price/performance, and can scale like the Data Lake, and it can also handle transactions and strong security like the Data Warehouse.

Data Lakehouse Architecture Explained

Here’s an example of a Data Lakehouse architecture:


You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools. 

There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.

Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in, which can provide that extreme performance to give you similar, if not better, price/performance for your queries.
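
As a small sketch of what “open formats plus a SQL engine” looks like in practice, here is a hypothetical Presto (Hive connector) table declared over Parquet files in S3; the bucket, schema, and column names are assumptions:

-- Declare a table over open-format Parquet files already in the lake
CREATE TABLE hive.lakehouse.page_views (
    user_id   varchar,
    url       varchar,
    view_time timestamp
)
WITH (format = 'PARQUET', external_location = 's3://my-data-lake/page_views/');

-- Any engine that understands Parquet can query the same files
SELECT url, count(*) AS views
FROM hive.lakehouse.page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;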

Building your Data Lakehouse with Ahana Cloud for Presto

At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, a really easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:


Ahana comes built-in with a data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without having to manage the overhead – Ahana takes care of it for you under the hood. The stack also includes and integrates with transaction managers like Apache Hudi, Delta Lake, and AWS Lake Formation.

We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.

Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.

What is an Open Data Lake in the Cloud?

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses.


Webinar On-Demand
An introduction to Ahana Cloud for Presto on AWS

The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake. At the heart of the data lakehouse is Presto – the open source SQL query engine for ad hoc analytics on your data. During this webinar we will share how to build an open data lakehouse with Presto and AWS S3 using Ahana Cloud.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. Ahana is a managed service for Presto that gives data platform teams of all sizes the power of SQL for their data lakehouse.

In this webinar we will cover:

  • What an Open Data Lakehouse is
  • How you can use Presto to underpin the lakehouse in AWS
  • A demo on how to get started building your Open Data Lakehouse in AWS

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana


Data Lakehouse

Enterprise Data Lake Formation & Architecture on AWS

What is an Enterprise Data Lake?

An enterprise data lake is simply a data lake for enterprise-wide sharing and storing of data. The key purpose of an enterprise data lake is to run analytics on it to unlock business insights from the stored data.

Why AWS Lake Formation for the Enterprise Data Lake?

The key purpose of an enterprise data lake is to run analytics to gain business insights. As part of that process, data governance becomes more important to secure access to data across different roles in the enterprise. AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for data lakes on AWS S3.

Enterprise Data Lake Formation & Architecture

Enterprise data platforms need a simpler, scalable, and centralized way to define and enforce access policies on their data lakes: a policy-based approach that allows data lake consumers to use the analytics service of their choice, best suited to the operations they want to perform on the data. Although using Amazon S3 bucket policies to manage access control is an option, it may not scale for enterprises as the number of combinations of access levels and users increases.


AWS Lake Formation allows enterprises to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). 

Benefits of Lake formation for Enterprise Data Lakes

  • One schema – shareable with no dependency on architecture
  • Share Lake Formation databases and tables with any AWS accounts
  • No Amazon S3 policy edits required
  • Recipients of the data can use an analytics service provider like Ahana to run analytics
  • No dependency between roles on how the data will be further shared
  • Centralized logging

AWS Enterprise Lake Formation: To Summarize

AWS Lake Formation has been integrated with AWS partners like Ahana Cloud, a managed service for SQL on data lakes. These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.

Related Articles

The Role of Blueprints in Lake Formation

A Lake Formation Blueprint allows you to easily stamp out and create workflows. Learn more about what it is in this article.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become very popular over the last year or so. Learn more about what it is and how it’s used.


Webinar On-Demand
How to build an Open Data Lakehouse Analytics stack

As more companies are leveraging the Data Lake to run their warehouse workloads, we’re seeing many companies move to an Open Data Lakehouse stack. The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake.

Join us for this webinar where we’ll show you how you can build an open data lakehouse stack. At the heart of this stack is Presto, the open source SQL query engine for the data lake, and the transaction manager / governance layer, which includes technologies like Apache Hudi, Delta Lake, and AWS Lake Formation.

You’ll Learn:

  • What an Open Data Lakehouse Analytics Stack is
  • How Presto, the de facto query engine for the data lakehouse, underpins that stack
  • How to get started building your open data lakehouse analytics stack today

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana


Data Lakehouse

How to Query Your JSON Data Using Amazon Athena

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV, and JSON files, which are typically either queried directly or transformed and loaded into a data warehouse. Athena allows you to extract data from, search for values in, and parse JSON data.
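
For example, if a JSON document is stored as a plain string column, JSON functions such as json_extract_scalar can pull individual values out of it; the table and column names below are hypothetical:

-- Pull a nested value out of a raw JSON string column
SELECT json_extract_scalar(payload, '$.cars.car2.make') AS car2_make
FROM raw_json_table;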



Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table; 
 name  | age | cars.car2
-------+-----+-----------
 Sam   |  45 | Toyota
 Sally |  21 | SAAB
 Bill  |  68 | Porsche

That is a pretty simple use case of  retrieving certain fields out of the JSON. The complexity was the cars column with the key/value pairs and we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) as t(cars)
 dealership   | models
--------------+---------
 Family Honda | Civic
 Family Honda | Accord
 Family Honda | Odyssey
 Family Honda | Brio
 Family Honda | Pilot

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the table with the unnested array column:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee_unnested)
 dealership   | employee_unnested
--------------+------------------------------------
 Family Honda | {name=Allan, dept=service, age=45}
 Family Honda | {name=Bill, dept=sales, age=52}
 Family Honda | {name=Karen, dept=finance, age=32}
 Family Honda | {name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

select dealership, employee_unnested.name, employee_unnested.dept, employee_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)
 dealership   | employee_unnested.name | employee_unnested.dept | employee_unnested.age
--------------+------------------------+------------------------+-----------------------
 Family Honda | Allan                  | service                |                    45
 Family Honda | Bill                   | sales                  |                    52
 Family Honda | Karen                  | finance                |                    32
 Family Honda | Terry                  | admin                  |                    27

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows.

At Ahana, many of our customers are previous AWS Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 per terabyte scanned, depending on the region. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.


You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Related Articles

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.

Tutorial: How to run SQL queries with Presto on BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Setup a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana’s managed service for Presto.

Step 2: Setup a Google BigQuery Project with Google Cloud Platform

Create a Google BigQuery project from Google Cloud Console and make sure it’s up and running with dataset and tables as described here.

The screen below shows the Google BigQuery project with the table “Flights”.


Step 3: Set up a key and download Google BigQuery credential JSON file.

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here

A sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x505/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro-Tip: Before you move to the next step, try using your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This makes sure your credentials have valid access rights, and helps isolate any issue with the credentials.

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. You can create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done by editing the config map to make sure it’s reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

Following are the catalog properties that need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=path/to/bigquery-credentials.json

Following are the sample entries for catalog yaml file:

bigquery.properties: |
    connector.name=bigquery
    bigquery.project-id=poised-journey-317806
    bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector:

  1. Load the content of the credential file as bigquery-credential.json in the Presto coordinator's configmap:

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  2. Add a new section of volumeMounts for the credential file in the coordinator's deployment file:

    kubectl edit deployment presto-coordinator -n <cluster_name> 

The following is a sample configuration that you can append at the end of the volumeMounts section in your coordinator's deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  3. Load the content of the credential file as bigquery-credential.json in the Presto worker's configmap:

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  4. Add a new section of volumeMounts for the credential file in the worker's deployment file:

kubectl edit deployment presto-worker -n <cluster_name> 

The following is a sample configuration that you can append at the end of the volumeMounts section in your worker's deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json

Step 6: Setup database connection with Apache Superset

Create your database connection URL to query from Superset with the syntax below:

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>


Step 7: Check for available datasets, schemas and tables, etc

After successfully connecting to the database from Superset, run the following queries and make sure that the bigquery catalog gets picked up. Then perform show schemas and show tables to understand the available data.

show catalogs;


show schemas from bigquery;


show tables from bigquery.rohan88;


Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT 1;


You can perform similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

The following example shows how you can join a Google BigQuery table with a Hive table from S3 and run SQL queries.


At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Related Articles

How do I query a data lake with Presto?

Learn how to get up and running with Presto.

What is Presto (and FAQ’s about Presto)

This article will explain what Presto is and what it’s used for.


Webinar On-Demand
Unlocking the Business Value of the Data Lake  

As more companies are moving to the cloud, choosing how and where to store their data remains a critical decision. While the data lake has quickly become a popular choice, the challenge lies in getting business value out of that data lake. To solve for that, we’re seeing a modern, open data lake analytics stack emerge. This stack includes open source, open formats, and open clouds, giving companies flexibility at each layer so they can harness the full potential of their data lake data.

During this webinar we’ll discuss how nearly three-fifths of organizations have gained competitive advantage from their data lake initiatives. That includes unleashing the intelligence-generating potential of a data lake that enables ad hoc data discovery and analytics in an open and flexible manner. We’ll cover:

  • The primary approaches for building an open data lake analytics stack, including where and how the data warehouse fits
  • The business benefits enabled by the technical advantages of this open data lake analytics stack
  • Why structured data processing and analytics accelerating capabilities are critical

Speakers

Matt Aslett

VP & Research Director, Ventana Research


Wen Phan

Director of Product, Ahana


What is an Open Data Lake in the Cloud?


Problems that necessitate a data lake

In today’s competitive landscape, more and more companies are leveraging their data to make better decisions, provide value to their customers, and improve their operations. This is obvious when you consider the environment in which these companies operate. Data-driven insights can help business and product leaders hone in on customer needs and find untapped opportunities through the development of evidence-based strategies. Analytics dashboards can also be presented to customers for added value.

Traditionally, insights are gleaned from rather small amounts of enterprise data, which is what you would expect – historical information about products, services, customers, and sales. But now, the modern business must deal with thousands of times more data, which encompasses more data formats and goes far beyond enterprise data. Some current examples include third-party data feeds, IoT sensor data, event data, and geospatial and other telemetry data.

The problem with having 1000s of times the data is that databases, and specifically data warehouses, can be very expensive when used to handle this amount. And to add to this, data warehouses are optimized to handle relational data with a well-defined structure and schema. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from them altogether. 

Data Lake insights

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture. 

The Open Data Lake in the cloud centers on S3-based object storage. In AWS, there can be many S3 buckets across an organization. In Google Cloud, the equivalent service is Google Cloud Storage (GCS), and in Microsoft Azure it is Azure Blob Storage. The data lake can store the relational data that typically comes from business apps, like the data warehouse stores. But the data lake also stores non-relational data from a variety of sources as mentioned above. The data lake can store structured, semi-structured, and/or unstructured data.

With all this data stored in the data lake, companies can run different types of analytics directly, such as SQL queries, real-time analytics, and AI/Machine Learning. A metadata catalog of the data enables the analytics of the non-relational data. 

Why Open for a Data Lake

As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats of a modern data infrastructure are open, such as Apache Parquet and ORC. They are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, it naturally follows to run open source engines like Presto on it. Ahana Cloud is a Presto managed service which makes it easy, secure, and cost-efficient to run SQL on the Open Data Lake.

If you want to learn more about why you should be thinking about building an Open Data Lake in the cloud, check out our free whitepaper on Unlocking the Business Value of the Data Lake – how open and flexible cloud services help provide value from data lakes.

Helpful Links

Best Practices for Resource Management in PrestoDB

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

Data Lakehouse

AWS Athena vs AWS Glue: What Are The Differences?

Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

What is AWS Athena?

AWS Athena is a serverless implementation of Presto, offered as an interactive query service that allows you to query structured or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but more of an ecosystem of tools to allow you to easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or via editing the script. The most commonly known components of Glue are Glue Metastore and Glue ETL. Glue Metastore is a serverless hive compatible metastore which can be used in lieu of your own managed Hive. Glue ETL on the other hand is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability and operationalization of the underlying Spark infrastructure. There are other services such as Glue Data Wrangler which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena to AWS Glue working with Athena is with the Glue Catalog. The Glue catalog is used as a central hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake formation, AI/ML etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics and Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. The table must have a property added to it called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena would be:

CREATE EXTERNAL TABLE sampleTable (
  col1 INT,
  col2 INT,
  str1 STRING
) STORED AS AVRO
TBLPROPERTIES (
  'classification'='avro')

  • Transforming data into a format that is better optimized for query performance in Athena, which will also impact cost. For example, converting a CSV or JSON file into Parquet.

Query S3 Using Athena & Glue

Now how about querying S3 data utilizing both Athena and Glue? There are a few steps to set it up. First, we’ll assume a simple CSV file with IoT data in it, such as the sample below:

Figure: sample CSV table of IoT data

We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’; the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';

From here, you can perform any query you want. You can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. As a data analyst using Athena, you are insulated from the details of the backend, while the data engineers can optimize the source data for speed and cost using Glue.
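The article above leans on Glue for the CSV-to-Parquet conversion, but Athena itself can also materialize a Parquet copy of a table with a CTAS statement. The sketch below is only an illustration of that alternative: it reuses the my_table table from the query above, and the output location is a hypothetical bucket path.

CREATE TABLE my_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/iot-parquet/'  -- hypothetical output path
) AS
SELECT * FROM my_table;

Once the Parquet copy exists, the same SELECT statement shown earlier runs against it unchanged, typically scanning far less data.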

AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will also increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena users who saw challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte of data scanned. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 

You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/ 

Best Practices for Resource Management in PrestoDB


Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources. Resource management in a distributed environment makes data more accessible and manages resources over a network of autonomous computers (i.e. a distributed system). Resource sharing is also the basis of resource management in a distributed system.

PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Written for the Hadoop ecosystem, PrestoDB is built to scale to tens of thousands of nodes and process petabytes of data. In order to be usable at a production scale, PrestoDB was built to serve thousands of queries to multiple users without bottlenecking and without “noisy neighbor” issues. PrestoDB makes use of resource groups in order to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups, as well as best practices and considerations to think about before setting up a production system with resource groups.

Getting Started

Presto has multiple “resources” that it can manage resource quotas for. The two main resources are CPU and memory. Additionally, there are more granular resource constraints that can be specified, such as concurrency, time, and cpuTime. All of this is done via a pretty ugly JSON configuration file, shown in the example below from the PrestoDB doc pages.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "jmxExport": true,
      "subGroups": [
        {
          "name": "data_definition",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            },
            {
              "name": "bi-${tool_name}",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 10,
              "maxQueued": 100,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 3,
                  "maxQueued": 10
                }
              ]
            }
          ]
        },
        {
          "name": "pipeline",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 45,
          "maxQueued": 100,
          "schedulingWeight": 1,
          "jmxExport": true,
          "subGroups": [
            {
              "name": "pipeline_${USER}",
              "softMemoryLimit": "50%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 100
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "bob",
      "group": "admin"
    },
    {
      "source": ".*pipeline.*",
      "queryType": "DATA_DEFINITION",
      "group": "global.data_definition"
    },
    {
      "source": ".*pipeline.*",
      "group": "global.pipeline.pipeline_${USER}"
    },
    {
      "source": "jdbc#(?<tool_name>.*)",
      "clientTags": ["hipri"],
      "group": "global.adhoc.bi-${tool_name}.${USER}"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ],
  "cpuQuotaPeriod": "1h"
}

Okay, so there is clearly a LOT going on here so let’s start with the basics and roll our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limitation.

Penalties

Presto doesn’t enforce these resource limits at execution time. Rather, Presto introduces the concept of a ‘penalty’ for users who exceed their resource specification. For example, if user ‘bob’ were to kick off a huge query that ended up taking vastly more CPU time than allotted, then ‘bob’ would incur a penalty, which translates to an amount of time that bob’s queries would be forced to wait in a queued state until they could be runnable again. To see this scenario in action, let’s split the cluster resources in half and see what happens when two users attempt to submit 5 queries each at the same time.

Resource Group Specifications

The example below is a resource specification of how to evenly distribute CPU resources between two different users.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1h"
}

The above resource config defines two main resource groups called ‘query1’ and ‘query2’. These groups will serve as buckets for the different queries/users. A few parameters are at work here:

  • hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
  • maxQueued sets the limit on how many queries can be queued
  • schedulingPolicy ‘fair’ determines how queries within the same group are prioritized

Kicking off a single query as each user shows no queuing, but subsequent queries will stay QUEUED until the first completes. This at least confirms the hardConcurrencyLimit setting. Testing with 6 queued queries also shows that maxQueued is working as intended.
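If you want to watch this behavior while testing, one simple approach (assuming your user can read Presto's system catalog) is to check query states from a separate session. A minimal sketch:

-- With the configuration above, each user should have at most one query in the
-- RUNNING state while the rest sit in QUEUED.
SELECT query_id, "user", state
FROM system.runtime.queries;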

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

Introducing the soft CPU limit will penalize any query that is caught using too much CPU time in a given CPU quota period. Currently this is set to 1 minute, and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, subsequent queries were queued for an inordinately long amount of time. Looking at the Presto source code shows the reasoning. The softCpuLimit and hardCpuLimit are based on a combination of total cores and the cpuQuotaPeriod. For example, on a 10-node cluster of r5.2xlarge instances, each Presto worker node has 8 vCPUs. That is a total of 80 vCPUs across the workers, which translates to 80 vCPU-minutes of CPU time in each 1-minute cpuQuotaPeriod. Therefore, the correct values are shown below.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

In testing, the above resource group spec results in two queries completing, using a total of 127m of CPU time. From there, all further queries block for about 2 minutes before they run again. This blocked time adds up because for every minute of cpuQuotaPeriod, each user is granted 40 minutes back against their penalty. Since the queries in the first minute exceeded the limit by 80+ minutes, it takes 2 cpuQuotaPeriods to bring the penalty back down to zero so queries can be submitted again.

Conclusion

The resource group implementation in Presto definitely has some room for improvement. The most obvious issue is that ad hoc users, who may not understand the cost of their query before execution, will be heavily penalized by the resource group until they submit only very low-cost queries. However, this mechanism does minimize the damage that a single user can do to a cluster over an extended duration, and it averages out in the long run. Overall, resource groups are better suited for scheduled workloads which depend on variable input data, so that a scheduled job doesn’t arbitrarily end up taking over a large chunk of resources. For resource partitioning between multiple users/teams, the best approach still seems to be to run and maintain multiple segregated Presto clusters.


Ready to get started with Presto? Check out our tutorial series where we cover the basics: Presto 101: Installing & Configuring Presto locally.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become popular over the last year as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse is, the common architecture, and how companies are using it in production today.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.


Querying Parquet Files using Amazon Athena

Parquet is one of the latest file formats with many advantages over some of the more commonly used formats like CSV and JSON. Specifically, Parquet’s speed and efficiency of storing large volumes of data in a columnar format are big advantages that have made it more widely used. It supports many optimizations and stores metadata around its internal contents to support fast lookups and searches by modern distributed querying/compute engines like PrestoDB, Spark, Drill, etc. Here are steps to quickly get set up to query your parquet files with a service like Amazon Athena.

How to query Parquet files

Parquet is a columnar storage format that is optimized for analytical querying. Data warehouses such as Redshift support Parquet for optimized performance. Parquet files stored on Amazon S3 can be queried directly by serverless query engines such as Amazon Athena or open source Presto using regular SQL.

Below we show a simple example of running such a query.

Prerequisites

  • Sample Parquet Data –  https://ahana-labs.s3.amazonaws.com/movielens/ratings/ratings.parquet
  • AWS Account and Role with access to below services:
    • AWS S3
    • AWS Glue (Optional but highly recommended)
    • AWS Athena

Setting up the Storage

For this example we will be querying the parquet files from AWS S3. To do this, we must first upload the sample data to an S3 bucket. 

Log in to your AWS account and select the S3 service in the Amazon Console.

  1. Click on Create Bucket
  2. Choose a name that is unique. For this example I chose ‘athena-parquet-<your-initials>’. S3 is a global service so try to include a unique identifier so that you don’t choose a bucket that has already been created. 
  3. Scroll to the bottom and click Create Bucket
  4. Click on your newly created bucket
  5. Create a folder in the S3 bucket called ‘test-data’
  6. Click on the newly created folder
  7. Choose Upload Data and upload your parquet file(s).

Running a Glue Crawler

Now that the data is in S3, we need to define the metadata for the file. This can be tedious and involve using a different reader program to read the parquet file to understand the various column field names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata auto-magically. To do this, first navigate to the AWS Glue service in the AWS Console.

  1. On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
  2. Click Add Crawler
  3. Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
  4. In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
  5. In Data Store under Include Path, type in the URL of your S3 bucket. It should be something like ‘s3://athena-parquet-<your-initials>/test-data/’.
  6. In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
  7. For Frequency leave the setting as default and choose Next
  8. For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
  9. Review and then choose Finish.
  10. AWS will prompt you if you would like to run the crawler. Choose Run it now or manually run the crawler by refreshing the page and selecting the crawler and choosing the action Run.
  11. Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.

Querying the Parquet file from AWS Athena

Now that the data and the metadata are created, we can use AWS Athena to query the parquet file. Choose the Athena service in the AWS Console.

  1. Choose Explore the Query Editor and it will take you to a page where you should immediately be able to see the query editor UI (Figure: Athena query editor).
  2. Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to ‘s3://athena-parquet-<your-initials>/test-results/’.
  3. Go back to the Editor and type the following statement: ‘SELECT * FROM test_data LIMIT 10;’ The table name will be based on the folder name you chose in the S3 storage step.
  4. The final result should look something like the screenshot below (Figure: Athena query editor with results).
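From here you can run regular analytic SQL against the Parquet data. As a rough example, a simple aggregation might look like the query below; the column names are assumed from the MovieLens ratings schema, so substitute whatever columns the Glue crawler actually inferred for your table.

-- Count how many ratings fall into each rating bucket
SELECT rating, count(*) AS num_ratings
FROM test_data
GROUP BY rating
ORDER BY rating;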

Conclusion

Some of these steps, like using Glue Crawlers, aren’t required but are a better approach for handling Parquet files where the schema definition is unknown. Athena itself is a pretty handy service for getting hands on with the files themselves but it does come with some limitations. 

Those limitations include concurrency limits, price performance impact, and no control of your deployment. Many companies are moving to a managed service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL on S3 needs.

Configuring RaptorX – multi-level caching with Presto


RaptorX Background and Context

Meta introduced a multi-level cache at PrestoCon 2021. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to help make it usable for the community by fixing a few open issues and by tuning and testing heavily with other workloads. This is a unique and very powerful feature only available in PrestoDB, and not in any other versions or forks of the Presto project.

Presto is a disaggregated compute-storage query engine, which helps customers and cloud providers scale compute and storage independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, as scanning huge amounts of data between the storage tier and the compute tier is IO-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to not perform any I/O at all; this means that memory utilization and caching structures are of utmost importance.

Let’s walk through the normal workflow of the Presto Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition info)
  2. The scheduler sends requests to remote storage to get a list of files and does the scheduling
  3. The worker node receives the list of files from the scheduler and sends a request to remote storage to open a file and read the file footer
  4. Based on the footer, Presto determines which data blocks or chunks need to be read from remote storage
  5. Once the workers have read the data, Presto performs computation on the leaf worker nodes (joins, aggregations) and shuffles results back to send the query results to the client.

That is a lot of RPC calls: not just to the Hive Metastore to get partition information, but also to remote storage to list files, schedule those files, open files, and then retrieve and read the data files. Each of these IO paths for the Hive connector is a potential bottleneck on query performance, and this is the reason we build a multi-layer cache intelligently, so that you can maximize the cache hit rate and boost your query performance.

RaptorX introduces a total of five types of caches plus a scheduler. This cache system is only applicable to the Hive connector.

Multi-layer Cache | Type | Affinity Scheduling | Benefits
Data IO | Local Disk | Required | Reduced query latency
Intermediate Result Set | Local Disk | Required | Reduced query latency and CPU utilization for aggregation queries
File Metadata | In-memory | Required | Reduced CPU utilization and query latency
Metastore | In-memory | N/A | Reduced query latency
File List | In-memory | N/A | Reduced query latency
Table: Summary of Presto multi-layer cache implementation

The rest of this article explains how you can configure and test the various layers of the RaptorX cache in your Presto cluster.

#1 Data(IO) cache

This cache makes use of a library built on the Alluxio LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker node that stores the data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB. It uses an LRU policy for evictions, and local disks are required in order to enable this cache. 

To enable this cache, the worker configuration needs to be updated with the properties below in etc/catalog/<catalog-name>.properties:

cache.enabled=true
cache.type=ALLUXIO
# Size the cache based on the local disk available
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

Also add the Alluxio property below to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto

#2 Fragment result set cache

This is an intermediate result set cache that lets you cache partially computed result sets on the worker’s local SSD drive. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage. 

Add the following properties in config.properties:

fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h

#3 Metastore cache

A Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of the cached metadata.

In order to enable the metastore cache, set the properties below in <catalog-name>.properties:

hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

#4 File List cache

A Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only in-memory cache.

Enable the file list cache by setting the properties below in catalog/<catalog-name>.properties:

# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*

#5 File metadata cache

This caches open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. This cache is useful not just for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache and is suitable for the ORC and Parquet file formats.

For ORC, it includes the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches the file and block level metadata.

In order to enable the file metadata cache, set the properties below in <catalog-name>.properties:

# For ORC metadata cache: 
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h

The <catalog-name> in the above configuration should be replaced by the catalog name that you are setting these properties in. For example, if the catalog properties file name is ahana_hive.properties, then it should be replaced with “ahana_hive”. 

#6 Affinity scheduler

With affinity scheduling, the Presto coordinator schedules requests that process certain data/files to the same Presto worker node to maximize cache hits. Sending requests for the same data consistently to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling. Hence, this is a must-have property that needs to be enabled in order to make the RaptorX Data IO, fragment result, and file metadata caches work. 

In order to enable the affinity scheduler, set the property below in the catalog properties file:

hive.node-selection-strategy=SOFT_AFFINITY

How can you test or debug your RaptorX cache setup with JMX metrics?

Each section below lists queries to run to exercise a cache, followed by queries against the JMX metrics to verify cache usage.

Note: If your catalog is not named ‘ahana_hive’, you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.

Data IO Cache

Queries to trigger Data IO cache usage

USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;

Queries to verify Data IO data cache usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages(of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';

Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE] 
Output layout: [count_3] Output partitioning: SINGLE [] Stage Execution 
Strategy: UNGROUPED_EXECUTION 
- Aggregate(PARTIAL) => [count_3:bigint] count_3 := "presto.default.count"(*) 
- TableScan[TableHandle {connectorId='hive', 
connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, 
analyzePartitionValues=Optional.empty}', 
layout='Optional[default.customer_orc{}]'}, gr Estimates: {rows: 150000 (0B), 
cpu: 0.00, memory: 0.00, network: 0.00} LAYOUT: default.customer_orc{}

Queries to trigger fragment result cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query Fragment Set Result cache JMX metrics.

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";

ORC metadata cache

Queries to trigger ORC cache usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query ORC Metadata cache JMX metrics

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";

Parquet metadata cache

Queries to trigger Parquet metadata cache

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;

Query Parquet Metadata cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";

File List cache

Query File List cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";

In addition to this, we have enabled these multi-layer caches in Presto for Ahana Cloud by adding S3 support as the external filesystem for the Data IO cache, along with more optimized scheduling and tooling to visualize cache usage. 

Figure: Multi-level Data Lake Caching with RaptorX

Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all these steps so that data platform users can enable Data Lake caching seamlessly with just one click. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our on-demand webinar where we share how you can build an Open Data Lake Analytics stack.


The Role of Blueprints in Lake Formation on AWS

Why does this matter?

There are two major steps to creating a Data Lakehouse on AWS: first, set up your S3-based data lake, and second, run analytical queries on that data lake. A popular SQL engine that you can use is Presto. This article is focused on the first step and how AWS Lake Formation Blueprints can make it easy and automated. Before you can run analytics to get insights, you need your data continuously pooling into your lake!

AWS Lake Formation helps with the time-consuming data wrangling involved in maintaining a data lake, making it simple and secure. Lake Formation includes the Workflows feature; a workflow encompasses a complex set of ETL jobs to load and update data. 

Figure: Lake Formation workflow

What is a Blueprint?

A Lake Formation Blueprint allows you to easily stamp out and create workflows. This is an automation capability within Lake Formation. There are three types: database snapshot, incremental database, and log file blueprints.

The database blueprints support automated ingestion from sources like MySQL, PostgreSQL, and SQL Server into the open data lake. It’s a point-and-click experience with simple forms in the AWS console.

A database snapshot blueprint does what it sounds like: it loads all the tables from a JDBC source into your lake. This is good when you want time-stamped end-of-period snapshots to compare later.

An incremental database blueprint also does what it sounds like, taking only the new data, or deltas, into the data lake. This is faster and keeps the latest data in your data lake. The incremental database blueprint uses bookmarks on columns to track each successive incremental run. 

The log file blueprint takes logs from various sources and loads them into the data lake. ELB logs, ALB logs, and CloudTrail logs are examples of popular log files that can be loaded in bulk. 

Summary and how about Ahana Cloud?

Getting data into your data lake is easy, automated, and consistent with AWS Lake Formation. Once you have your data ingested, you can use a managed service like Ahana Cloud for Presto to enable fast queries on your data lake to derive important insights for your users. Ahana Cloud has integrations with AWS Lake Formation governance and security policies. See that page here: https://ahana.io/aws-lake-formation 


Presto equivalent of mysql group_concat

As you may know, PrestoDB supports ANSI SQL and includes many functions familiar from other SQL dialects such as MySQL, making it easy to group and aggregate data in a variety of ways. However, not ALL functions in MySQL are supported by PrestoDB! 


Now, let us look at the really useful MySQL and MariaDB SQL function GROUP_CONCAT(). This function is used to concatenate data in column(s) from multiple rows into one field. It is an aggregate (GROUP BY) function which returns a string, assuming the group contains at least one non-NULL value (otherwise it returns NULL). GROUP_CONCAT() is an example of a function that is not yet supported by PrestoDB, and this is the error you will see if you try using it to get a list of customers that have ordered something along with their order priorities:

presto> use tpch.sf1;

presto:sf1> select custkey, GROUP_CONCAT(DISTINCT orderpriority ORDER BY orderpriority SEPARATOR ',') as OrderPriorities from orders GROUP BY custkey;

Query 20200925_105949_00013_68x9u failed: line 1:16: Function group_concat not registered

Is there a way to handle this? If so what’s the workaround? There is!

array_join() and array_agg() to the rescue! 

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority)),',') as OrderPriorities from orders group by custkey;
 custkey |                OrderPriorities                 
---------+------------------------------------------------
   69577 | 2-HIGH,1-URGENT,3-MEDIUM,5-LOW,4-NOT SPECIFIED 
   52156 | 4-NOT SPECIFIED,3-MEDIUM,1-URGENT,5-LOW,2-HIGH 
  108388 | 5-LOW,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,1-URGENT 
  111874 | 5-LOW,1-URGENT,2-HIGH,4-NOT SPECIFIED          
  108616 | 1-URGENT,5-LOW,4-NOT SPECIFIED,3-MEDIUM,2-HIGH 
(only the first 5 rows displayed) 

If you do not want distinct values (in other words, you want duplicates in your result set), then there is an easy solution: simply drop the array_distinct() function from your query:

presto:sf1> select custkey,array_join(array_agg(orderpriority),',') as OrderPriorities from orders group by custkey;
 custkey | OrderPriorities                             
---------+-------------------------------------------------------------------------------- 
   24499 | 5-LOW,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,2-HIGH,4-NOT SPECIFIED,3-MEDIUM,1-URGENT,2-HIGH,3-MEDIUM,1-URGENT,5-LOW,3-MEDIUM,4-NOT SPECIFIED,4-NOT SPECIFIED,4-NOT SPECIFIED,3-MEDIUM,3-MEDIUM,5-LOW,1-URGENT,1-URGENT,4-NOT SPECIFIE
   58279 | 4-NOT SPECIFIED,2-HIGH,5-LOW,1-URGENT,1-URGENT,5-LOW,5-LOW,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,5-LOW,3-MEDIUM,1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,5-LOW,5-LOW,3-MEDIUM,3-MEDIUM,1-URGENT,3-MEDIUM,2-HIGH,5-LOW
  142027 | 1-URGENT,2-HIGH,2-HIGH,1-URGENT,3-MEDIUM,1-URGENT,5-LOW,4-NOT SPECIFIED,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,2-HIGH,1-URGENT,3-MEDIUM,5-LOW,3-MEDIUM,4-NOT SPECIFIED,2-HIGH,1-URGENT,5-LOW,2-HIGH,5-LOW,1-URGENT,4-NOT SPECIFIED,2-HIG
   94169 | 1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,5-LOW,4-NOT SPECIFIED,2-HIGH,5-LOW,4-NOT SPECIFIED                                                                                                                                                                                                        
   31607 | 4-NOT SPECIFIED,2-HIGH,4-NOT SPECIFIED,2-HIGH,2-HIGH,5-LOW 

You can, of course, specify the separator character. In the examples shown above, I have used a comma as the separator.
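One difference from the original MySQL query is that GROUP_CONCAT() also sorted the values via its ORDER BY clause. PrestoDB supports an ORDER BY clause inside aggregate functions (in reasonably recent releases), so under that assumption a sketch of a fully equivalent query would be:

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority ORDER BY orderpriority)),',') as OrderPriorities from orders group by custkey;

Because array_distinct() keeps the first occurrence of each element, the sorted order produced inside array_agg() is preserved in the final string.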

It’s worth noting that, as with PrestoDB, there was no T-SQL equivalent of the MySQL GROUP_CONCAT() function in Microsoft SQL Server either. However, T-SQL now has the STRING_AGG() function, which is available from SQL Server 2017 onwards.

And hey, presto, you now have a working Presto equivalent of mysql group_concat.


Understanding the Presto equivalent of mysql group_concat – Now what?

If you are looking for more tips and tricks, here’s your next step. Check out our Answers section to learn more about PrestoDB, competitor reviews and comparisons, and more technical guides and resources to get you started.


Related Articles

What is Presto?

What’s Presto, how did it start, and what is it for? Ready to answer these questions? Take a deeper dive into understanding Presto. Learn what PrestoDB is, how it got started, and the benefits for Presto users.

How to Build a Data Lake Using Lake Formation on AWS

What is AWS Lake Formation? AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach. Learn more about AWS Lake Formation, including the pros and cons of Amazon Lake Formation.

Data Warehouse: A Comprehensive Guide

Looking to learn more about data warehouses? Start here for a deeper look. This article covers what a data warehouse is (a data repository), how it is typically used (for analytic systems and Business Intelligence tools), and the pros and cons of a data warehouse compared to a data lake.


Querying Amazon S3 Data Using AWS Athena

The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. In this article we’ll look more closely at AWS S3 and AWS Athena.

How Does AWS Athena work with Amazon S3

In AWS land, AWS S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.

The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.

How to query S3 using AWS Athena

The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for the purposes of testing it out, creating a new bucket is probably helpful). You’ll use Athena to query S3 buckets. Next, open up your AWS Management Console and go to the Athena home page. From there you have a few options for how to create a table; for this example just select the “Create table from S3 bucket data” option. 

From there, AWS has made it fairly easy to get up and running in a quick 4-step process where you’ll define the database, table name, and S3 folder where data for this table will come from. You’ll select the data format, define your columns, and then set up your partitions (if you have a lot of data). Briefly laid out:

  1. Set up your Database, Table, and Folder Names & Locations
  2. Choose the data format you’ll be querying
  3. Define your columns so Athena understands your data schema
  4. Set up your Data Partitions if needed

Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.
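To make this concrete, here is a hedged sketch of what the four steps above might produce for a simple CSV dataset. The bucket, table, and column names below are hypothetical; substitute your own.

-- Define a table over CSV files sitting in S3 (hypothetical schema and location)
CREATE EXTERNAL TABLE access_logs (
  request_time string,
  user_id      string,
  url          string,
  status_code  int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-athena-demo-bucket/logs/';

-- Then query it like any other SQL table
SELECT status_code, count(*) AS requests
FROM access_logs
GROUP BY status_code
ORDER BY requests DESC;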

What happens when AWS Athena hits its limits

While Athena is very easy to get up and running, it has known limitations that start impacting price performance as usage grows. Those include query limits, partition limits, inconsistent performance, and some others. It’s actually why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS. 

Here’s a quick comparison between the two offerings:

Figure: AWS Athena vs. Ahana Cloud comparison

Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.

We can help you benchmark Athena against Ahana Cloud. Get in touch with us today and let’s set up a call.

Related Articles


What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse


Ahana Announces New Security Capabilities to Bring Next Level of Security to the Data Lake

Advancements include multi-user support, deep integration with Apache Ranger, and audit support 

San Mateo, Calif. – February 23, 2022 – Ahana, the only SaaS for Presto, today announced significant new security features added to its Ahana Cloud for Presto managed service. They include multi-user support for Presto and Ahana, fine-grained access control for data lakes with deep Apache Ranger integration, and audit support for all access. These are in addition to the recently announced one-click integration with AWS Lake Formation, a service that makes it easy to set up a secure data lake in a matter of hours.

The data lake isn’t just the data storage it used to be. More companies are using the data lake to store business-critical data and running critical workloads on top of it, making security on that data lake even more important. With these latest security capabilities, Ahana is bringing an even more robust offering to the Open Data Lake Analytics stack with Presto at its core.

“From day one we’ve focused on building the next generation of open data lake analytics. To address the needs of today’s enterprises that leverage the data lake, we’re bringing even more advanced security features to Ahana Cloud,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “The challenge with data lake security is in its shared infrastructure, and as more data is shared across an organization and different workloads are run on the same data, companies need fine-grained security policies to ensure that data is accessed by the right people. With these new security features, Ahana Cloud will enable faster adoption of advanced analytics with data lakes with advanced security built in.”

“Over the past year, we’ve been thrilled with what we’ve been able to deliver to our customers. Powered by Ahana, our data platform enables us to remain lean, bringing data to consumers when they need it,” said Omar Alfarghaly, Head of Data Science, Cartona. “With advanced security and governance, we can ensure that the right people access the right data.”

New security features include:

  • Multi-user support for Presto: Data platform admins can now seamlessly manage users without complex authentication files and add or remove users for their Presto clusters. Unified user management is also extended across the Ahana platform and can be used across multiple Presto clusters. For example, a data analyst gets access to the analytics cluster but not to the data science cluster.
  • Multi-user support for Ahana: Multiple users are now supported in the Ahana platform. An admin can invite additional users via the Ahana console. This is important for growing data platform teams.
  • Apache Ranger support: Our open source plugin allows users to enable authorization in Ahana-managed Presto clusters with Apache Ranger for both the Hive Metastore or Glue Catalog queries, including fine-grained access control up to the column level across all clusters. In this newest release of the Ahana and Apache Ranger plug-in, all of the open source Presto and Apache Ranger work is now available in Ahana and it’s now incredibly easy to integrate through just a click of a button. With the Apache Ranger plugin, customers can easily add role-based authorization. Policies from Apache Ranger are also now cached in the plugin to enable little to no query time latency impact.  Previously, support for Apache Ranger was only available in open source using complicated config files.
  • Audit support: With extended Apache Ranger capabilities, Ahana customers can enable centralized auditing of user access on Ahana-managed Presto clusters for comprehensive visibility. For example, you can track when users request access to data and if those requests are approved or denied based on their permission levels.
  • AWS Lake Formation integration: Enforce AWS Lake Formation fine-grained data lake access controls with Ahana-managed Presto clusters.

“We’re seeing an increasing proportion of organizations using the cloud as their primary data lake platform to bring all of an enterprise’s raw structured and unstructured data together, realizing significant benefits such as creating a competitive advantage and helping lower operational costs,” said Matt Aslett, VP and Research Director, Ventana Research. “Capabilities such as governance mechanisms that allow for fine-grained access control remain important given the simplicity of the cloud. Innovations that allow for better data governance on the data lake, such as those Ahana has announced today, will help propel usage of more sophisticated use cases.”

Supporting Resources:

Tweet this:  @AhanaIO announces new security capabilities for the data lake #analytics #security #Presto https://bit.ly/3H0Hr7p

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

AWS Lake Formation vs AWS Glue – What are the differences?

Last updated: October 2022

As you start building your analytics stack in AWS, there are several AWS technologies to understand as you begin. In this article we’ll discuss two key technologies:

  • AWS Lake Formation for security and governance; and
  • AWS Glue, a data catalog and ETL service.

While both of these services are typically used to build, manage, and operationalize AWS data lakes, they fulfill completely different roles. AWS Lake Formation is built around AWS Glue, and both services share the same AWS Glue Data Catalog; however, Lake Formation provides a wider breadth of governance and data management functionality, whereas Glue is focused on ETL and data processing.

What is AWS Lake Formation? 

AWS Lake Formation makes it easier for you to build, secure, and manage data lakes. It provides a means to address some of the challenges around unstructured data lake storage – including security, access control, governance, and performance.

How it works: AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and ML services.

For AWS users who want to get governance on their data lake, AWS Lake Formation makes it easy to set up a secure data lake very quickly (in a matter of days). 

In order to provide better query performance when using services such as Athena or Presto, Lake Formation creates Glue workflows that ingest source tables, extract the data, and load it into the Amazon S3 data lake.

When should you use AWS Lake Formation? 

At its core, Lake Formation is built to simplify the process of moving your data to a data lake, cataloging the data, and making it available for querying. Typical scenarios where this comes into play include:

  • Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  • Add Authorization on your Data Lake  – You can centrally define and enforce security, governance, and auditing policies.
  • Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

To understand how this works in practice, check out our article on using Redshift Spectrum in Lake Formation.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake.  Using the AWS Glue Data Catalog, users can easily find and access data.

Glue ETL can be used to run managed Apache Spark jobs in order to prepare the data for analytics, perform transformations, compact data, and convert it into columnar formats such as Apache Parquet.

Read more: What’s the difference between Athena and Glue?

When should you use AWS Glue?

To make data in your data lake accessible, some type of data catalog is essential. Glue is often the default option as it’s well-integrated into the broader AWS ecosystem, although you could consider open-source alternatives such as Apache Iceberg. Glue ETL is one option to process data, where alternatives might include running your own Spark cluster on Amazon EMR or using Databricks.

Typical scenarios where you might use Glue include:

  • Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
  • Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake
  • Build ETL pipelines to ingest data into your S3 data lake. 

The data workflows initiated from an AWS Lake Formation blueprint are executed as AWS Glue jobs. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.

AWS Lake Formation vs AWS Glue: A Summary

AWS Lake Formation simplifies security and governance on the data lake, whereas AWS Glue simplifies metadata management and data discovery for data lake analytics. While both of these services are used as data lake building blocks, they are complementary. Glue provides the basic functionality needed to enable analytics, including data cataloging and ETL; Lake Formation offers a simplified way to manage your data lake, including the underlying Glue jobs.

Check out our community roundtable where we discuss how you can build simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack


Presto Platform Overview

The Presto platform is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. PrestoDB was developed from the ground up by the engineers at Meta. Currently, some of the world’s most well-known, innovative, and data-driven companies like Twitter, Uber, Walmart, and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook, for example, still uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

The Presto platform was designed and written from scratch to handle interactive analytics, and it approaches the speed of commercial data warehouses while scaling to the size of organizations like Airbnb or Twitter.

Presto allows users to effectively query data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, and even proprietary data stores. A single Presto query can combine data from multiple sources, which in turn allows for quick and accurate analytics across your entire organization. Presto is an in-memory, distributed, parallel system. 

Presto is targeted at data analysts and data scientists who expect response times ranging from sub-second to minutes. The Presto platform breaks the false choice between fast analytics on an expensive commercial solution and a slow “free” solution that requires excessive hardware. 
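To illustrate the point about combining sources, here is a minimal sketch of a federated query. The catalog, schema, and table names are hypothetical and assume a Hive catalog over a data lake plus a MySQL catalog over an operational database:

-- Join data-lake orders with customer records living in MySQL
SELECT c.customer_name, sum(o.order_total) AS lifetime_value
FROM hive.sales.orders o
JOIN mysql.crm.customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;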

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source, such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. 
  • Cost Based Query Optimizer and Execution Engine. Parser. Planner. Scheduler.
  • Drivers for connecting tools, including JDBC. The Presto-cli tool. The Presto Console. 

In terms of organization, the community-owned and community-driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. Presto software is released under the Apache License 2.0.

Curious about how you can get going with the Presto platform? Ahana offers a managed service for Presto in the cloud. You can get started for free today with the Ahana Community Edition or a free trial for the full edition. The Ahana Community Edition is a free forever version of the Ahana Cloud managed service.

What is an Open Data Lake in the Cloud?

Have you been hearing the term “Open Data Lakehouse” more often? Learn what the Open Data Lake in the cloud actually is, and how it’s a solution to the massive data problem. Many companies are adopting this approach because of its better price-performance, scale, and non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses including their strengths and weaknesses.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple. Amazon Redshift tries to accommodate different use cases, but the pricing model does not fit all users. Learn more about the pricing of Amazon Redshift.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product. Redshift is based on PostgreSQL version 8.0.2. Learn more about the pros and cons of Amazon Redshift.


Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need,  you can use S3 Select to filter this data at the source (i.e. S3). This reduces the amount of data that Amazon S3 transfers, which reduces the cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently Amazon S3 Select only works on objects stored in CSV, JSON, or Apache Parquet format. The stored objects can be compressed with GZIP or BZIP2 (for CSV and JSON objects only). The returned filtered results can be in CSV or JSON, and you can determine how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can perform S3 Select SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. 
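Whichever interface you use, the SQL expression you hand to S3 Select always selects from the S3Object alias, which represents the single object being queried. A minimal sketch, assuming a Parquet or JSON object that contains a numeric rating column (the column name is hypothetical):

SELECT s.rating FROM S3Object s WHERE s.rating >= 4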

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimalistic form of pushdown to the source, with limited support for the ANSI SQL dialect. Presto, on the other hand, is a comprehensive, ANSI SQL-compliant query engine that can work with various data sources. Here is a quick comparison table.

Comparison             | S3 Select                    | Presto
SQL Dialect            | Fairly limited               | Comprehensive
Data Format Support    | CSV, JSON, Parquet           | Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources           | S3 only                      | Various (over 26 open-source connectors)
Push-Down Capabilities | Limited to supported formats | Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed query service based on Presto. As such, the comparison between Athena and S3 Select is largely the same as outlined above. For a more detailed understanding of the differences between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled on your Hive catalog to push projection (SELECT) and predicate (WHERE) processing down to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default, and you should enable it in production only after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered by the query: filtering out a large number of rows should result in better performance. If the query doesn’t filter any data, pushdown may not add any value, and you will still be charged for the S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see whether it is a good fit. For more information on S3 Select request costs, see Amazon S3 Cloud Storage Pricing.
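
One lightweight way to run such an A/B comparison is to toggle the Hive session property for a representative query from the Presto CLI. This is only a sketch: the session property name comes from the Presto Hive connector, the catalog is assumed to be named hive, and the table and predicate are hypothetical.

  -- Baseline: S3 Select Pushdown off (the default)
  SET SESSION hive.s3_select_pushdown_enabled = false;
  SELECT count(*) FROM hive.weblogs.requests WHERE status_code = 500;

  -- Comparison: S3 Select Pushdown on
  SET SESSION hive.s3_select_pushdown_enabled = true;
  SELECT count(*) FROM hive.weblogs.requests WHERE status_code = 500;

Comparing elapsed time and data scanned for each run gives a first read on whether pushdown helps your workload.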

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth (For the best performance on AWS, your cluster is ideally colocated in the same region and the VPC is configured to use the S3 Gateway endpoint).
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, allowing you to enable or disable it on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections depending on your workload. A minimal configuration sketch follows.
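
As an illustration, a cluster-wide default could be set in the Hive catalog properties file. This is a sketch with placeholder values; only the two pushdown property names are taken from the text above and the Presto Hive connector documentation, and your connector name and metastore settings may differ.

  # etc/catalog/hive.properties (illustrative excerpt)
  connector.name=hive-hadoop2
  hive.metastore=glue                         # or a Thrift metastore, depending on your setup
  hive.s3select-pushdown.enabled=true         # cluster-wide default; can be overridden per query
  hive.s3select-pushdown.max-connections=500  # tune based on concurrency and workload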


What is AWS Lake Formation?

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

We’re seeing more companies move to the data lake because it’s flexible, cheaper, and much easier to use than a data warehouse. You’re not locked into proprietary formats, nor do you have to ingest all of your data into a proprietary technology. As more companies leverage the data lake, security becomes even more important: more people need access to that data, and you want to be able to control who sees what.

AWS Lake Formation can help address security on the data lake. For Amazon S3 users, it’s a seamless integration that allows you to get granular security policies in place on your data. AWS Lake Formation gives you three key capabilities:

  1. Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  2. Simplify security management – You can centrally define and enforce security, governance, and auditing policies.
  3. Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

If you’re currently using AWS S3 or planning to, we recommend looking at AWS Lake Formation as an easy way to get security policies in place on your data lake. As part of your stack, you’ll also need a query engine that will allow you to get analytics on your data lake. The most popular engine to do that is Presto, an open source SQL query engine built for the data lake.

At Ahana, we’ve made it easy to get started with this stack: AWS S3 + Presto + AWS Lake Formation. We provide SaaS for Presto with out of the box integrations with S3 and Lake Formation, so you can get a full data lake analytics stack up and running in a matter of hours.

[Figure: AWS Lake Formation diagram]

Check out our webinar where we share more about our integration with AWS Lake Formation and how you can actually enforce security policies across your organization.


How does Presto Work With LDAP?

What is LDAP?

To learn how Presto works with LDAP, let’s first cover what LDAP is. The Lightweight Directory Access Protocol (LDAP) is an open, vendor-neutral, industry-standard application protocol used for directory services authentication. With LDAP authentication, Presto validates the credentials users supply against an external LDAP directory before allowing them to communicate with the Presto server. 

Presto & LDAP

Presto can be configured to enable LDAP authentication over HTTPS for clients such as the Presto CLI or the JDBC and ODBC drivers. At present, only a simple LDAP authentication mechanism involving a username and password is supported. The Presto client sends a username and password to the coordinator, and the coordinator validates these credentials using an external LDAP service.

To enable LDAP authentication for Presto, the Presto coordinator configuration file needs to be updated with LDAP-related configurations. No changes are required to the worker configuration; only the communication from the clients to the coordinator is authenticated. However, if you want to secure the communication between Presto nodes then you should configure Secure Internal Communication with SSL/TLS.

Summary of Steps to Configure LDAP Authentication with Presto:

Step 1: Gather configuration details about your LDAP server

Presto requires Secure LDAP (LDAPS), so make sure you have TLS enabled on your LDAP server as well.

Step 2: Configure SSL/TLS on the Presto Coordinator

Access to the Presto coordinator must be through HTTPS when using LDAP authentication.

Step 3: Configure Presto Coordinator with config.properties for LDAP

Step 4: Create a Password Authenticator Configuration (etc/password-authenticator.properties) file on the coordinator

Step 5: Configure Client / Presto CLI with either a Java Keystore file or Java Truststore for its TLS configuration.

Step 6: Restart your Presto cluster and invoke the LDAP-enabled CLI with either the --keystore-* or --truststore-* properties (or both) to secure the TLS connection (see the configuration sketch below).

Reference: https://prestodb.io/docs/current/security/ldap.html
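
For orientation, here is a minimal sketch of what this configuration could look like. Hostnames, paths, and passwords are placeholders; the property names and CLI flags follow the Presto LDAP documentation referenced above, but verify them against your Presto version.

  # Step 3: etc/config.properties on the coordinator (excerpt)
  http-server.authentication.type=PASSWORD
  http-server.https.enabled=true
  http-server.https.port=8443
  http-server.https.keystore.path=/etc/presto/keystore.jks
  http-server.https.keystore.key=changeit

  # Step 4: etc/password-authenticator.properties on the coordinator
  password-authenticator.name=ldap
  ldap.url=ldaps://ldap-server.example.com:636
  ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com

  # Steps 5-6: invoking the CLI against the LDAP-enabled cluster
  presto --server https://presto-coordinator.example.com:8443 \
         --truststore-path /etc/presto/truststore.jks \
         --truststore-password changeit \
         --catalog hive --schema default \
         --user alice --password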

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our presentation with AWS on how to get started in 30min with Presto in the cloud.


What is Apache Ranger?

What is Apache Ranger? In a Nutshell

Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the data platform. It is an open-source authorization solution that provides access control and audit capabilities for big data platforms through centralized security administration.

Its open data governance model and plugin architecture enabled the extension of access control to projects beyond the Hadoop ecosystem, and the platform is widely supported by major cloud vendors such as AWS, Azure, and GCP. 

With the help of the Apache Ranger console, admins can easily manage centralized, fine-grained access control policies, including file, folder, database, table and column-level policies across all clusters. These policies can be defined at user level, role level or group level.

Apache Ranger Service Integration

Apache Ranger uses a plugin architecture to allow other services to integrate seamlessly with its authorization controls.


Figure: Simple sequence diagram showing how the Apache Ranger plugin enforces authorization policies with Presto Server.

Ranger also supports centralized auditing of user access and administrative actions, providing comprehensive visibility into sensitive data usage. A centralized audit store tracks all access requests in real time and supports multiple backends, including Elasticsearch and Solr.

Many companies today are looking to leverage the Open Data Lake Analytics stack, which is the open and flexible alternative to the data warehouse. In this stack, you have flexibility when it comes to your storage, compute, and security to get SQL on your data lake. With Ahana Cloud, the stack includes AWS S3, Presto, and, in this case, our Apache Ranger integration. 

Ahana Cloud for Presto and Apache Ranger

Ahana-managed Presto clusters can take advantage of the Ranger integration to enforce access control policies defined in Apache Ranger. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar, hosted with DZone, where we share how you can build an Open Data Lake Analytics stack.

Related Articles

What are the differences between Presto and Apache Drill?

Drill is an open source SQL query engine inspired by the paper “Dremel: Interactive Analysis of Web-Scale Datasets.” Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

What Is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called PrestoDB.

Benchmark Presto | Benchmarking Warehouse Workloads

Benchmark Presto

TPC-H Benchmark Presto

To learn how to benchmark Presto, let’s first start by covering the basics. Presto is an open source MPP query engine designed from the ground up for high performance with linear scaling. Businesses looking to solve their analytics workloads using Presto need to understand how to evaluate Presto performance, and this technical guide will help in that endeavor. Learn how to get started with running your own benchmark. 

To help users who would like to benchmark Presto, we have written a detailed, informative guide on how to set up your PrestoDB benchmark using Benchto. Benchto is an open source framework that provides an easy and manageable way to define, run, and analyze macro benchmarks in clustered environments.

Running a benchmark on PrestoDB can help you to identify things like: 

  • system resource requirements 
  • resource usage during various operations 
  • performance metrics for such operations
  • …and more, depending on your workload and use case

This technical guide provides an overview of TPC-H, the industry standard for benchmarking. It also explains in detail how to configure and use the open source Benchto tool to benchmark Presto. In addition, it shows an example of comparing results between two runs of an Ahana-managed Presto cluster, with and without caching enabled.
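
Before running a full Benchto suite, a quick sanity check is to run a TPC-H-style query against Presto’s built-in tpch connector, if it is configured on your cluster. This is only an illustrative sketch; the catalog name and scale-factor schema are assumptions, and the column names follow the tpch connector rather than the classic TPC-H prefixes.

  -- TPC-H Q1-style aggregation at scale factor 1 (schema "sf1" in the tpch catalog)
  SELECT
    returnflag,
    linestatus,
    sum(quantity)      AS sum_qty,
    sum(extendedprice) AS sum_base_price,
    count(*)           AS count_order
  FROM tpch.sf1.lineitem
  WHERE shipdate <= DATE '1998-09-02'
  GROUP BY returnflag, linestatus
  ORDER BY returnflag, linestatus;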


We hope you find this useful! Happy benchmarking.

AWS & Ahana Lake Formation

Webinar On-Demand
How to Build and Query Secure S3 Data Lakes with Ahana Cloud and AWS Lake Formation

AWS Lake Formation is a service that allows data platform users to set up a secure data lake in days. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and what data access and security policies you want to apply.

In this webinar, we’ll share more on the recently announced AWS Lake Formation and Ahana integration. The AWS & Ahana product teams will cover:

  • Quick overview of AWS Lake Formation & Ahana Cloud
  • The details of the integration
  • How data platform teams can seamlessly integrate Presto natively with AWS Glue, AWS Lake Formation and AWS S3 through a demo

Join AWS Solution Architect Gary Stafford and Ahana Principal Product Manager Wen Phan for this webinar where you’ll learn more about AWS Lake Formation from an AWS expert and get an insider look at how you can now build a secure S3 data lake with Presto and AWS Lake Formation.


Webinar Transcript

SPEAKERS

Ali LeClerc | Ahana, Wen Phan | Ahana, Gary Stafford | AWS

Ali LeClerc | Ahana 

All right I think we have folks joining, so thanks everyone for getting here bright and early, if you’re on the west coast, or if you’re on the East Coast your afternoon I guess we will get started here in just a few minutes.

Ali LeClerc | Ahana 

I’ll play some music to get people in the right mindset to learn about Lake Formation and Ahana Cloud for Presto. Wen, do you want to share the title slide of your slide deck are you going to start with something else? Up to you.

Wen Phan | Ahana 

I’ll bring it up in a second.

Ali LeClerc | Ahana 

Alright folks, thanks for joining. We’re going to just wait a few more minutes until we get things kicked off here, just to let people join, so give us a few minutes enjoy the music.

Ali LeClerc | Ahana 

Alright folks, so we’re just waiting a few more minutes letting people get logged in and join and we’ll get started here in just a few.

Ali LeClerc | Ahana 

All right. We are three minutes past the hour. So let’s go ahead and get started. Welcome folks to today’s Ahana webinar “How to Build and Secure AWS S3 Data Lakes with Ahana Cloud and AWS Lake Formation.” I’m Ali LeClerc, and I will be moderating today’s webinar. So before we get started, just a few housekeeping items. One is this session is recorded. So afterwards, you’ll get a link to both the recording and the slides. No need to take copious amounts of notes, you will get both the slides and the recording. Second is we did have an AWS speaker Gary Stafford, who will be joining us, he unfortunately had something come up last minute, but he will be joining as soon as he can finish that up. So you will have an AWS expert join. If you do have questions, please save those. And he will be available to take them later on. Last, like I just mentioned, we are doing Q&A at the end. So there’s a Q&A box, you can just pop your questions into that Q&A box at the bottom of your control panel. And again, we have allotted a bunch of time at the end of this webinar to take those questions. So with that, I want to introduce our speaker Wen Phan. Wen is our principal product manager at Ahana, has been working extensively with the AWS Lake Formation team to build out this integration and is an expert in all things Ahana Cloud and AWS Lake Formation. Before I turn things over to him to get started, I want to share or launch a poll, just to get an idea of the audience that we have on the webinar today. How familiar are you with Presto, with data lakes, and with Lake Formation? So if you could take just a few seconds to fill that in, that would be super appreciated. And we can kind of get a sense of who we have on today’s webinar. Wen is going to kind of tailor things on the fly based on the results here. So looks like good. We have some results coming in. Wen can you see this? Or do I need to end it for you to see it? Can you see any of the results?

Wen Phan | Ahana 

I cannot see. Okay, the results?

Ali LeClerc | Ahana 

No worries. So I’m going to wait we have 41% – 50% participation. I’m going to wait a few more seconds here. And then I will end the poll and show it looks like just to kind of give real time familiarity with Presto, most people 75% very little data lakes, I think it’s more spread across the board. 38% very little 44% have played around 90% using them today. Familiar already with Lake formation. 50% says very little. So it looks like most folks are fairly new to these concepts. And that is great to know. So I’ll just wait maybe a few more seconds here. Looks like we have 64% participation. Going up a little, do 10 more seconds. Give people a minute and a half of this and then I will end the poll here. We’re getting closer, we’re inching up. All righty. Cool. I’m going to end the poll. I’m going to share the results. So everybody can kind of see the audience makeup here. Alrighty. Cool. So with that, Wen, I will turn things over to you.

Wen Phan | Ahana 

Awesome. Thanks, Ali. Thanks, everyone for taking that poll that was very, very useful. Like Ali said, I’m a product manager here at Ahana. I’m really excited to be talking about Ahana Cloud and Lake Formation today. It’s been a project that I’ve been working on for several months now. And excited to have it released. So here’s the agenda, we’ll go through today. Pretty straightforward. We’ll start with some overview of AWS Lake Formation, what it is, then transition to what Ahana is, and then talk about the integration between Ahana Cloud for Presto and AWS Lake Formation. So let’s get into it, AWS Lake Formation. So this is actually an AWS slide. Like Ali mentioned, Gary had something come up, so I’ll go ahead and present it. The bottom line is everybody, and companies, want more value from their data. And what you see here on the screen are some of the trends that we’re seeing in terms of the data growing, coming from multiple sources, being very diverse. Images and text. It’s being democratized more throughout the organization, and more workloads are using the data. So traditional BI workloads are still there. But you’ll see a lot more machine learning data science type workloads. The paradigm that is emerging to support, this proliferation of data with low-cost storage, as well as allowing for multiple applications to consume it is the data lake essentially.

Today, folks that are building and securing data lakes, it’s taking a while, and this is what AWS is seeing. This is the impetus of why they built AWS Lake Formation. There are three kind of high level components to Lake Formation. The first one is to just streamline the process and make building data lakes a lot faster. So try to compress what used to take months down to days and providing tooling that can make it easier to move, store, update and catalog data. The second piece is the security piece. This is actually the cornerstone of what we’ll be demonstrating and talking about today. But how do you go about, securing – once you have your data in your data lake, how do you go about securing it? Enforcing policies and authorization model? And although data lake is very centralized, sharing the data across the organization, is very important. So another tenet of AWS Lake Formation is to actually make it quite easy or easier to discover and share your data.

So that’s a high level of Lake Formation. Now, we’ll go into Ahana and kind of why we went and built this and worked with AWS at a early stage to integrate with Lake Formation. So first, for those of you who don’t know, Ahana is the Presto company. And I think there are a few of you who are very new to Presto. So this is a single slide essentially giving a high level overview of what Presto is. Presto is a distributed query engine. It’s not a database, it is a way for us to allow you to access different data sources using ANSI SQL and querying it. The benefit of this distributed query nature is you can scale up and as you need it for the for the data. So that’s really the second point. Presto offers very low latency, a performance that can scale to a lot of large amounts of data. The third piece is Presto was also created in a pluggable architecture for connectors. And what this really translates to, is it supports many data sources. And one prominent use case for Presto, in addition to low latency interactive querying is federated querying or querying across data sources.

The final high-level kind of takeaway for Presto, it is open source, it was originally developed at Meta, aka Facebook, and it’s currently under the auspices of the Linux Foundation. And at the bottom of this slide, here are typical use cases of why organizations go ahead and deploy Presto, given the properties that I’ve kind of mentioned above. Here is a architecture diagram of Presto, I just saw a question it’s MPP. To answer that question.

Ali LeClerc | Ahana 

Can you repeat the question? So everybody knows what it was.

Wen Phan | Ahana 

Yeah, the question is your architecture MPP or SMP? It’s MPP. And this is the way it’s kind of laid out kind of, again, very high level. So, the bottom layer, you have a bunch of sources. And you can see it’s very, very diverse. We have everything from NoSQL type databases to typical relational databases, things in the cloud, streaming, Hadoop. And so Presto is kind of this query layer between your storage, wherever your data is, be able to query it. And at the top layer other consumers of the query engine, whether it be a BI tool, a visualization tool, a notebook. Today, I’ll be using a very simple CLI to access Presto, use a Presto engine to query the data on the data lake across multiple sources and get your results back. So this all sounds amazing. So today, if you were to use Presto and try to stand up Presto yourself, you’re running to potentially run to some of the challenges. And basically, you know, maintaining, managing, spinning up a Presto environment can still be complex today. First of all, it is open source. But if you were to just get the open-source bits, you still have to do a lot of legwork to get the remaining infrastructure to actually start querying. So you still need a catalog. I know some of you are new to data lakes, essentially, you have essentially files in some kind of file store. Before it used to be distributed file systems like HDFS Hadoop, today, the predominant one is S3, which is an object store. So you have a bunch of files, but those files really don’t really mean anything in terms of a query until you have some kind of catalog. So if you were to use Presto, at least the open source version, you still have to go figure out – well, what catalog am I going to use to map those files into some kind of relational entity, mental model for them for you to query? The other one is Presto, has been actually around for quite a while, and it was born of the Hadoop era, it has a ton of configurations. And so if you were to kind of spin this up, you’d have to go figure out what those configurations need to be, going to have to figure out the settings, there’s a lot of complexity there, and, tied to the configuration, you wouldn’t know how to really tune it. What’s good out of the box, might have poor out of the box performance. So all of these challenges, in addition to the proliferation of data lakes, is why Ahana was born and the impetus for our product, which is Ahana Cloud for Presto.

We aim to get you from zero to Presto, in 30 minutes or less. It is a managed cloud service, I will be using it today, you will be able to see it in action. But as a managed cloud service, there is no installation or configuration. We specifically designed this for data teams of all experience levels. In fact, a lot of our customers don’t have huge engineering teams and just really need an easy way of managing this infrastructure and providing the Presto query engine for their data practitioners. Unlike other solutions, we take away most of the complexity, but we still give you enough knobs to tune things, we allow you to select the number of workers, the size of the workers that you want, things like that. And obviously, we have many Presto experts within the company to assist our customers. So that’s just a little bit about Ahana Cloud for Presto, if you want to try it, it’s pretty simple. Just go to our website at that address above, like Ali said, you’ll get this recording, and you can go ahead to that site, and then you can sign up. You will need an AWS account. But if you have one, we can go ahead and provision the infrastructure in your account. And you can get up and running with your first Presto cluster pretty quickly. And a pause here, see if there’s another question.

Ali LeClerc | Ahana 

Looks like we have a few. What format is the RDBMS data stored in S3?

Wen Phan | Ahana 

Yeah, so we just talked about data. I would say the de facto standard, today’s Parquet. You can do any kind of delimited format, CSV, ORC files, things like that. And that then just depends on your reader to go ahead and interpret those files. And again, you have to structure that directory layout with your catalog to properly map those files to a table. And then you’ll have another entity called the database on top of the table. You’ll see some of that, well. I won’t go to that low level, but you’ll see the databases and tables when I show AWS Lake Formation integration.

Ali LeClerc | Ahana 

Great. And then, I just want to take a second, Gary actually was able to join. So welcome Gary. Gary is Solutions Architect at AWS and, obviously, knows a lot about Lake Formation. Great to have you on Gary. And he’s available for questions, if anybody has specific Lake Formation questions, so carry on Wen.

Wen Phan | Ahana 

Hey, Gary, thanks for joining. Okay, so try to really keep it tight. So, just quickly about Lake Formation, since many of you are new to it. And again, there are three pieces – making it easier to stand up the data lake, the security part, and the third part being the sharing. What we’re focused on primarily, with our integration, and you’ll see this, is the security part. How do we use Lake Formation as a centralized source of authorization information, essentially. So what are the benefits? Why did we build this integration? And what is the benefit? So first of all, many folks we’re seeing have invested in AWS as their data lake infrastructure of choice. S3 is huge. And a lot of folks are already using Glue today. Lake Formation leverages both Glue and AWS. So it’s, it’s a, it was a very natural decision for us seeing this particular trend. And so for folks that have already invested put into S3, and Glue, this is a basic native integration for you guys. So this is a picture of how it works. But essentially, you have your files stored in your data lake storage – parquet, CSV, or RC – the data catalog is mapping that to databases and tables, all of that good stuff. And then the thing that we’re going to be applying is Lake Formations access control. So you have these databases, you have these tables. And what we’ll see is can you control it can you control access to which user has access to which table? Actually will be able to see which users have access to which columns and which rows. And so that’s basically, the integration that we’ve built in. So someone – the data lake admin – will go ahead and not only define the schemas but define the access and Ahana for Presto will be able to take advantage of those policies that have been centrally defined.

We make this very easy to use, this is a core principle in our product as well, as I kind of alluded to at the beginning. We’re trying to really reduce complexity and make things easy to use and really democratize this capability. So doing this is very few clicks, and through a very simple UI. So today, if you were going to Ahana, and I’m going to show this with the live the live application, if we show you the screens. Essentially, it’s an extension of Glue, so you would have Glue, we have a single click called “Enabled AWS Lake Formation.” When you go ahead and click that, we make it very easy, we actually provide a CloudFormation template, or stack, that you can run that will go ahead and hook up Ahana, your Ahana Cloud for Presto, to your Lake Formation. And that’s it. The second thing that we do is you’ll notice that we have a bunch of users here. So, you have all these users. And then you can map them to essentially your IAM role, which are what the policies are tied to in Lake Formation. So, in Lake Formation, you’re going to create policies based on these roles. You can say, for example, the data admin can see everything, the HR analyst can only see tables in the HR database, whatever. But you have these users that then will be mapped to these roles. And once we know what that mapping is, when you log into presto, as these users, the policies tied to those roles are enforced in your queries. And I will show this. But again, the point here is we make it easy, right? There’s a simple user interface for you to go ahead and make the mapping. There’s a simple user interface where then for you to go ahead and enable the integration.

Wen Phan | Ahana 

Are we FedRAMP certified in AWS? At this moment, we are we are not. That is inbound requests that we have had, and that we are exploring, depending on, I think, the need. Today, we are not FedRAMP certified. Then the final piece is the fine-grained access control. So, leveraging Lake Formation. I mentioned this, you’re going to build your data lake, you’re going to have databases, you’re going to have tables. And you know, AWS Lake Formation has had database level security and table level security for quite some time we offer that. More recently, they’ve added more fine-grained access control. So not only can you control the database and the table you have access to, but also the columns and the specific roles you have access to. The role level one being just announced, a little over a month ago at the most recent re:Invent. We’re actually one of the earliest partners to go ahead and integrate with this feature that essentially just went GA. I’ll show this. Okay, so that was a lot of talking. I’m going to do a quick time check, we’re good on time. I’m going to pause here. Let me go see before I go into the demo, let me see what we have for other questions. Okay, great. I answered the FedRAMP one.

Ali LeClerc | Ahana 

Here’s one that came in – Can Presto integrate with AzureAD AWS SSO / AWS SSO for user management and SSO?

Wen Phan | Ahana 

Okay, so the specific AD question, I don’t know the answer to that. This is probably a two to level question. So, there’s, there’s Presto. Just native Presto that you get out of the box and how you can authenticate to that. And then there is the Ahana managed service. What I can say is single sign-on has been a request and we are working on providing more single sign on capabilities through our managed service. For the open-source Presto itself, I am not aware of any direct capability to AzureAD kind of integration there. If you are interested, I can definitely follow up with a more thorough answer. I think we have who asked that, if you actually are interested, feel free to email us and we can reach out to you.

Ali LeClerc | Ahana 

Thanks, Wen.

Wen Phan | Ahana 

Okay, so we’re going to do the demo. Before I get to the nitty gritty of demo. Let me give you some kind of overview and texture. So let me just orient you, everyone, to the application first, let’s go ahead and do that. So many of you are new to go move this new to Ahana. Once you have Ahana installed, this is what the UI looks like. It’s pretty simple, right? You can go ahead and create a cluster, you can name your cluster, whatever, [example] Acme. We have a few settings, and how large you want your instances, what kind of auto scaling you want. Like we mentioned out of the box, if you need a catalog, we can provide a catalog for you. You can create users so that users can log into this cluster, we have available ones here, you can always create a new one. Step one is create your cluster. And then we’ve separated the notion of a cluster from a data source. That way, you can have multiple clusters and reuse configuration that you have with your data source. For example, if you go to a data source, I could go ahead and create a glue data source. And as you select different data sources, you provide the configuration information specific to that data source. In my case, I’ll do a Lake Formation one. So, I’m going to Lake formation, you’ll select what region your Lake Formation services in. You can use Vanilla Glue as well, you don’t have to use Lake Formation, if you don’t want to use the fine-grained access control. If you want to, and you want to use your policies, you enable Lake Formation, and then you go ahead and run the CloudFormation script stack. And they’ll go ahead and do the integration for you. If you want to do it yourself, or you’re very capable, we do provide information about that in our documentation. So again, we try to make things easy, but we also try to be very transparent. If you want more control on you on your own. But that’s it. And then you can map the roles, as I mentioned before, and then you go ahead and add the data source. And it will go ahead and create the data source. In the interest of time, I’ve already done this.

You can see I have a bunch of data sources, I have some Netflix data on Postgres, it’s not really, real data, it’s just, it’s what we call it. We have another data source for MySQL, I have Vanilla Glue, and I have a Lake Formation one. I have a single cluster right now that’s been idle for some time for two hours called “Analysts.” Once it’s up, you can see by default has three workers. It’s scaled down to one, not a really big deal, because these queries I’m going to run aren’t going to be very, very large. This is actually a time saving feature. But once it’s up, you can connect it you’ll have an endpoint. And whatever tool you want, can connect via JDBC, or the endpoint, we have Superset built in. I’m going to go ahead and use a CLI. But that was just a high-level overview of the product, since folks probably are new to it. But pretty simple. The final thing is you can create your users. And you can see how many clusters your users are attached to. All right, so let’s go back to slideware for a minute and set the stage for what you’re going to see. We’re going to query some data, and we’re going to see the policies in Lake Formation in action.

I’ve set up some data so we can have a scenario that we can kind of follow along and see the various capabilities, the various fine grained access control in Lake Formation. So imagine we work for a company, we have different departments, sales department and HR department. And so let’s say the sales department has their own database. And inside there, they have transactions data, about sales transactions, you have information on the customers, and we have another database for human resources or HR to have employees. So here’s a here’s a sample of what the transaction data could look like. You have your timestamp, you have some customer ID, you have credit card number, you have perhaps the category by which that transaction was meant and you have whatever the amount for that transaction was. Customer data, you have the customer ID, which is just a primary key the ID – first name last name, gender, date of birth, where they live – again fictitious data, but will represent kind of representative of maybe some use cases that you’ll run into. And then HR, instead of customers, pretend you have another table with just your employees. Okay? All right. So let’s say I am the admin, and my name is Annie. And I want to log in, and I’m an admin. I should have access to everything, let’s go ahead and try this. So again, my my cluster is already up, I have the endpoints.

Wen Phan | Ahana 

I’m going to log in as Annie. And let’s take a look at what we see.  Presto has different terminology. And it might seem a little confusing. And I’ll go ahead and decode it for everyone, for those of you that are not familiar. Each connector essentially becomes what is called a catalog. Now, this is very different than a data catalog that we talk about. It’s just what they call it. In my case, the Lake Formation data source, that I created, is called LF for Lake Formation. I also called it LF, because I didn’t want to type as much, just to tie this back to what you are seeing. If we go back to here, you notice that the data source is called LF, and I’ve attached it to this cluster that I created, this analyst cluster that I created and attached it. So that’s why you see the catalog name as LF. So that’s great. And LF is attached to Lake Formation, which is has native integration to Glue and S3.  If I actually look at what databases, they’re called schemas in Presto, I have in LF, I should see the databases that I just showed you. So, you see them and you see, ignore the information schema that’s just kind of metadata information, you see sales, and you see HR. And I can actually take a look at what tables I have in the “sales database,” and I have customers and transactions. And you know, I’m an admin. So I should be able to see everything in the transactions table, for example. And I’ve set this policy already in Lake Formation. So I go here, and I should see all the tables, the same data that I showed you in the PowerPoint. So you see the transaction, the customer ID, the credit card number category, etc. So great, I’m an admin, I can do stuff.

Let me see some questions. What do you type to invoke Presto? Okay, so let’s, let’s be very specific for this question. So Presto is already up, right. So I’ve already provisioned this cluster through Ahana. So when I went and said, create cluster, this spun up all the nodes of the Presto cluster set up, configured it did the coordinator, all of that behind the scenes, that’s Presto. It’s a cluster, it’s a query engine distributed cluster. Then you Presto exposes endpoints, [inaudible] endpoint, a JDBC endpoint, that then you can have a client attached to them. Okay, you can have multiple clients. Most BI tools will be able to access this.

In this case, for the simplicity of this demo, I just use a CLI. So I would basically download the CLI, which is just another Java utility. So you need to have Java installed. And then you run the CLI with some parameters. So I have the CLI, it’s actually called Presto, that’s what the binary is, then I pass it some parameters. And I said, Okay, what’s the server? Here’s the endpoint. So it’s actually connecting from my local desktop to that cluster in AWS with this, but you can’t just access it, you need to provide some credentials. So I’m saying I’m going to authenticate myself with a password.

The user I want to access that cluster with is “Annie,” why is this valid? Well, this is valid, because when I created this cluster, When I created this cluster, I specified which users are available in that cluster. So I have Annie, Harry, I have Olivia, I have Oscar, I have Sally, I have Wally. Okay, so to again, just to summarize, I didn’t invoke Presto, from my desktop, my local machine, I’m just using a client, in this case, the Presto CLI to connect to than a cluster that I provisioned via Ahana Cloud – in the cloud. And I’m just accessing that. As part of that, that cluster is already configured to use Lake Formation. The first thing I did was log log in as Annie, and as we mentioned, Annie is my admin. And as an admin, she can access everything she has access to all the databases, all the tables, etc.

Wen Phan | Ahana 

Okay, so let’s do a much more another interesting case.  And let’s say instead of Annie, I log in as Sally, who is a sales analyst. As a platform owner, I know that Sally in order to do her job, all she needs to look at are transactions. Because let’s say she’s going to forecast what the sales are, or she’s going to do some analysis on what type of transactions have been good. So if we go back and look at the transactions table, this is what it looks like. Now, when I do this, though, I do notice that there’s credit card number, and I know that I don’t really want to expose a credit card number to my analysts, because they don’t need it for their work. So I’m going to go ahead – and also in this policy, for financial information – say, you know, any sales analysts, in this case, Sally, can only have access to the transactions table. And when she accesses the transactions table, she will not be able to see the credit card number. Okay. So let’s go see what this looks like. So instead of Annie, I’m going to log in as Sally. Let’s go ahead and just see what we did here. If we actually look at the data source, Annie got mapped to the role of data admin, so she can see everything. Sally is mapped to the role of “sales analysts,” and therefore can only do what a sales analyst is defined to do in Lake Formation. But the magic is it’s defined in Lake Formation. But Ahana Cloud for Presto can take advantage of that policy.

So I’m going to go ahead and log into Sally. Let’s first take a look at the databases that I can see, they’re called schemas in LF. So first thing you’ll notice is Sally does not see HR, because she doesn’t need to, and she has been restricted, so she can only see sales. Now let’s see what tables let’s see what tables Sally can see. Sally can only see transactions, Sally cannot actually see the customers table. But she doesn’t know this. She’s just running queries. And she’s saying, “Well, this is all I can see. And it’s what I need to do my job. So I’m okay with it.” So let’s actually try to query the transactions table now. So sales, LF sales, transactions. When I tried to do this, I actually get an Access Denied. Why? The reason I get an access denied here is because I cannot actually look at all the columns, I’ve been restricted, there’s only a subset of the columns that I can look at. As I mentioned, we are not able to see the credit card number. So when I tried to do a select star, I can’t really do a star because I can’t see the credit card number, we are making this an improvement where we won’t do an explicit deny. And we’ll just return the columns that you have access to. Otherwise, this can be a little annoying. But the end of the day, you can see the authorization being enforced. You have Presto, and it’s and the policies are being enforced, that are set in Lake Formation.

So now instead of doing a star and actually specifically, paste the columns I have access to – I can see it and I can do whatever I need to do. I can do a group by to see what categories are great. I can do a time series analysis on revenue, I’d get in and then do a forecast for the next three months, whatever I need to do as sales analyst. So that’s great. Okay, so I’m going to go ahead and log out. So let’s go back to this. So we know Sally’s world. So now let’s say you know, the marketing manager Ali here has to marketing analyst and there she’s got them responsible for different regions, and we want to get some demographics on our customers. So we have Wally. And if you look at the customer’s data, there’s a bunch of PII – first name, last name, date of birth. So couple of things, we can automatically say, You know what, they don’t need to see this PII, we’re going to go ahead and mask it with Lake Formation. Okay, and like I mentioned, you know, Ali’s kind of segments of her analysts to have different regions across the Pacific West Coast. So Wally is really responsible for only Washington. So we decided to say hey, on a need to know basis, you’re only really going to get rows back that are from customers that live in Washington. Alright, so let’s go ahead and do that. 

Wen Phan | Ahana 

I’m going to log in as Wally, and let’s go actually see the databases again, just to justice to see it and I’m just showing you different layers of the authorization. So Wally can see skills, not HR, well, let’s see what tables while he can see. So while he should only see customers – Wally can only see customers, he cannot see the transactions, because he’s been restricted to it. Let’s try the same thing again, select star from sales customers. And we expect this to air out why? Because again, PII data, we, we cannot do a star. We don’t allow first name, last name, date of birth, all of that, if I do this, and I go ahead and take the customers out. I’ll see the columns that I want, and I only see the rows that come from Washington, I technically did not have to select the state, I just want to prove that I’m only getting records from Washington.

Let’s try another analyst Olivia. And guess what, Olivia is responsible only for Oregon. So she’s basically up here to Wally. But she’s responsible for Oregon. So I’m going to go ahead and do the same query, which is saved and see what happens. So in this case, Olivia can only see Oregon. What you’re seeing here is basically the fine-grained access control, you’re seeing database restriction, you’re seeing table level restriction, you’re seeing column number restriction, and you’re seeing role level restriction. And you can do as many of these as, as you want. So we talked about Wally, and we know Olivia can only see Oregon, one more persona, actually two more personas, and then we’re done. I think you guys all get the point.

I think I’ve probably done enough sufficient proof that we can in fact enforce policies. So last one is just Harry who’s in HR. So if I actually log in as Harry. Harry should only be able to see the HR data set. So if I go Harry. And I show the tables. Well, first of all, let’s just again, just to be complete, I’m only going to say HR, I want to see the sales data. So you can it’s hairy one see the transactions he couldn’t. And then I can go ahead. And since I already know what the schema is, look at all the employees in this database. And I’ll see everything because I’m in HR, so I can see your personal information. And it doesn’t restrict me.

Okay. And the final thing is, what happens if I have a user that I haven’t mapped any policies to? So I actually have one user here, who is Oscar, and I actually didn’t give Oscar any policies whatsoever. So let me go ahead here. So notice that Oscar is in the cluster. But Oscar is not mapped to any role whatsoever. I go back to my cluster, I go here. Oscar, he is here. So Oscar is a valid user in the cluster. But Oscar has no role. And so by default, if you have no role we deny you access. That’s essentially what’s going to happen. But just to prove it. Oscar is using this cluster, show catalogs, you’ll see the LF? Well, let’s say I try to I try to see what’s in LF, what’s in that connector, Access Denied. Because there is no mapping, you can’t see anything, we’re not going to tell you anything. We’re going to tell you what databases are in there. No tables, nothing. So that’s the case where you know, it’s very secure, you don’t have explicit access, you don’t you don’t get any information. Okay, so I’ve been in demo mode for a while, just wanted to check if there’s any questions or chat. All right, none.

So. So let’s just do a summary of what we saw. And then kind of wrap it up for Q&A. We’re good on time, actually. And give you some information of where you can get more information if you want to, you want to dig in, deep.

So first, the review. So we had all these users, you see the roles, we saw a case where you have all access, you saw the case where you have no access. And I did a bunch of other demos where you saw different varying degrees of access, table, database, column role, all of that stuff. And so that’s what that’s what this really integration brings to folks that have a data lake today. You’ve gotten all your data there. Inside your data lake, you’ve decided that Presto is the way to go in terms of interactive querying, because it scales, it can manage all your data. And now you want a role that’s all your analysts or your data practitioners, but you want to do it in a secure way. And you want to enforce it and you want to do it in one place. Lake Formation doesn’t only integrate with Ahana it can integrate with other tools, within the AWS ecosystem. Sure, defining these policies in one place, and Ahana managed Presto clusters can take advantage of that.

There was a more A technical talk on this, if you’re interested in some of the technical details that we just presented at Presto Con, with my colleague, Jalpreet, who is the main engineer on this, as well as another representative from AWS, Roy. If you’re interested, go ahead and just Google this and go to YouTube. And you can go watch this. And they’ll give you more of the nitty gritty underneath the hood, if you’re interested in that. And that is all I have for plans, content.

Ali LeClerc | Ahana 

Wen what a fantastic demo, thanks for going through all of those. Fantastic. So I wanted to give Gary kind of a chance to share his perspective on the integration and his thoughts on you know what this kind of means, from the AWS point of view. So Gary, if you don’t mind putting on your video, that would be awesome. If you can just say hi to everyone and let you kind of share your thoughts.

Gary Stafford | AWS 

That’s much better than that corporate picture that was up there. Yeah, thank you. And I would also recommend as Wen said to view the PrestoCon video with Roy and Jalpreet. I think they go into a lot a lot of detail in respect to how the integration works under the covers. And also, maybe share two links Ali, I’ll paste them in there. One link, kind of what’s new with AWS Lake formation, Roy mentioned some of the new features that were announced, I’ll drop a link in there to let folks know what’s new, it’s a very actively developed project, there’s a lot of new features coming out. So I’ll share that link. And also, Jalpreet mentioned a lot of the API features. Lake Formation has a number of API’s, I’ll drop a link in there too, that discusses some of those available endpoints and API’s a little better. I’ll just I’ll share my personal vision. And I think of services like Amazon Event Bridge that has a partner integration, which makes it very easy for SaaS partners to integrate with customers on AWS platform, I think it’d be phenomenal at some point if Lake Formation progresses to that standpoint with some of the features that that Roy mentioned and Wen demonstrated today. Where partners like Ahana could integrate with Lake Formation, and get an out of the box data lake, a way to create a data lake, a way to secure a data lake and simply add their analytics engine with their special sauce on top of that, and not have to do that heavy lifting. And I hope that’s the direction that Lake Formation is headed in. I think that’ll be phenomenal to have a better integration story with our partners on AWS.

Ali LeClerc | Ahana 

Fantastic. Thanks, Gary. With that, we have a few questions that have come in. Again, if you do have a question, you can pop it into the Q&A box, or even the chat box. So Wen, I think this one’s for you, can you share a little bit more about the details on what happens with the enabling of the integration?

Wen Phan | Ahana 

Sure, I will answer this question in two ways. I will show you what we’re doing under the hood. So that you know, and kind of this API exchange. And this is a recent release. So let me go ahead and share my screen again. I think and whoever asked the question, if I didn’t answer the question, let me know. So when you go to the data source, like I mentioned, it’s pretty simple. And we make we do that on purpose. So when you enable Lake Formation, you can go ahead and launch this CloudFormation template, which will go ahead and do the integration. What does it actually doing under the hood? So first of all, this is actually a good time for me to introduce our documentation. If you go to ahana.io, all of this is documented. So you go to docs, Lake Formation is tightly coupled with Glue, go to manage data sources, you go to Glue, this will tell you walk you through it. And there’s a section here, that tells you if you didn’t want to use this, like you didn’t want to actually use the CloudFormation. Or you just simply want it to understand what is this really doing, you can go ahead and read about it. The essentially, like Roy mentioned, there’s a bunch of API’s, one of the API’s is this data lake settings API with Lake Formation. If you use the AWS CLI, you can actually see this, and you’ll get a response, what we’re doing is there’s a bunch of flags that you need to set, you have to allow Ahana Presto to actually do the filtering on your behalf. So we’re going to get the data, we’re going to look at the policies and block out anything you’re not supposed to see. And we also are a partner. So the service needs to know that this is a valid partner that is interacting with the Lake Formation service. So that’s all this is doing. You could do this all manually if you really wanted to with the CLI. We just take care of this for you, on your behalf. So that’s what’s going on to enable the integration. The second part, and again, this goes into a lot more detail in this talk is what’s actually happening under the hood. I’m just going to show a quick kind of slide for this. But essentially what’s happening is when you make a query, so you defined everything in AWS when you make a query, our service so in our case, we’re a third party application, we go ahead and talk to Lake Formation, you set this up, we go talk for Lake Formation, we get temporary credentials. And then we know what the policies are. And we are able to access only the data that you’re allowed to see. And then we process it with a query. And then you see kind of in the client, in my case, that’s what you saw in the CLI.

Ali LeClerc | Ahana 

Cool, thanks Wen, thorough answer. Next question that came in is this is this product, a competitor to Redshift? I’m assuming when you say product, do you mean Ahana? But maybe you can talk about both Ahana and Presto Wen?

Wen Phan | Ahana 

Yeah, I mean, it all comes down to your use case. So Redshift is kind of more like a data warehouse. And that’s great. It has its own use cases. And again, Presto can connect to Redshift. So it depends on what you want. I mean, Presto can talk to data lake. So if you have use cases that make more sense on a data lake – Presto, is one way to access it. And actually, if you have use cases that need to span both the data lake and Redshift, Presto can federate that query as well. So it’s just another piece in the ecosystem. I don’t necessarily think it’s a competitor, I think it’s, as with many things, what’s your what’s your use-case and pick the right tool for your use-case.

Ali LeClerc | Ahana 

Great. I think you just mentioned something around Glue, Wen. So somebody asked: do I need to use Glue for my catalog if I’m using Lake Formation with Ahana Cloud?

Wen Phan | Ahana 

Yes, you do. Yes, you do. It’s a tightly coupled AWS stack, which works very well. And so you do have to use Glue.

Ali LeClerc | Ahana 

All right. So I think we’ve answered a ton of questions along the way, as well as just now. If there are no more, and it looks like no more have come in, then I think we can probably wrap up here. Any last parting thoughts, Wen and Gary, before we say goodbye to everybody? On that note, I’m going to post our link in here. I don’t know if Wen mentioned it, maybe he did: we have a 14-day free trial. No commitment, you can check out Ahana Cloud for Presto on AWS free for 14 days, play around with it, and get started with Lake Formation. If you’re interested in learning more, we’ll make sure to put you in touch with Wen, who again is the local expert on this at Ahana. And Gary, of course, is always able to help as well. So feel free to check out our 14-day free trial. And with that, I think that’s it. All right, everyone. Thanks Wen, fantastic demo, fantastic presentation. Appreciate it. Gary, thanks for being available, and for all of your support in getting this integration off the ground and into the hands of our customers. Thanks, everybody, for joining and for sticking with us till the end. You’ll get a link to the recording and the slides, and we’ll see you next time.

Speakers

Gary Stafford

Solutions Architect, AWS


Wen Phan

Principal Product Manager, Ahana


Ahana Responds to Growing Demand for its SaaS for Presto on AWS with Appointment of Chief Revenue Officer

Enterprise Sales Exec Andy Sacks Appointed to Strengthen Go-to-Market Team

San Mateo, Calif. – January 11, 2022 Ahana, the only SaaS for Presto, today announced the appointment of Andy Sacks as Chief Revenue Officer, reporting to Cofounder and CEO Steven Mih. In this role, Andy will lead Ahana Cloud’s global revenue strategy. With over 20 years of enterprise experience, Andy brings expertise in developing significant direct and indirect routes to market across both pre and post sales organizations.

Ahana Cloud for Presto is a cloud-native managed service for AWS that gives customers complete control, better price-performance, and total visibility of Presto clusters and their connected data sources. “We’ve seen rapidly growing demand for our Presto managed service offering which brings SQL to AWS S3, allowing for interactive, ad hoc analytics on the data lake,” said Mih. “As the next step, we are committed to building a world-class Go-To-Market team with Andy at the helm to run the sales organization. His strong background building enterprise sales organizations, as well as his deep experience in the open source space, makes him the ideal choice.”

“I am excited to join Ahana, the only company that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Sacks. “I am looking forward to leveraging my experiences to help drive Ahana’s growth through innovative Presto use cases for customers without the complexities of managing cloud deployments.”

Prior to Ahana, Andy spent several years as an Executive Vice President of Sales. Most recently at Alloy Technologies, and prior to that at Imply Data and GridGain Systems, he developed and led each company’s global Sales organization, while posting triple digit growth year over year. At both Imply and GridGain, he created sales organizations from scratch. Prior to GridGain, he spent over six years at Red Hat, where he joined as part of the JBoss acquisition. There he developed and led strategic sales teams while delivering substantial revenue to the company. Prior to Red Hat, he held sales leadership roles at Bluestone Software (acquired by HP), RightWorks (acquired by i2) and Inktomi (acquired by Yahoo! and Verity), where he was instrumental in developing the company’s Partner Sales organization. Andy holds a Bachelor of Science degree in Computer Science from California State University, Sacramento.

Supporting Resources

Download a head shot of Andy Sacks https://ahana.io/wp-content/uploads/2022/01/Andy-Sacks.jpg 

Tweet this:  @AhanaIO bolsters Go-To-Market team adding Chief Revenue Officer Andy Sacks #CRO #newhire #executiveteam https://bit.ly/3zJCvBL

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Cofounders Make Data Predictions for 2022

Open Data Lake Analytics Stack, Open Source, Data Engineering and More SaaS and Containers Top the List

San Mateo, Calif. – January 5, 2022 Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Executive Officer, Steven Mih, predict major developments in cloud, data analytics, databases and data warehousing in 2022.

The COVID-19 pandemic continues to propel businesses to make strategic data-driven shifts. Today more companies are augmenting the traditional cloud data warehouse with cloud data lakes for much greater flexibility and affordability. Combined with more Analytics and AI applications, powerful, cloud-native open source technologies are empowering data platform teams to analyze that data faster, easier and more cost-effectively in SaaS environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2022:

  • OpenFlake – the Open Data Lake for Warehouse Workloads: Data warehouses like Snowflake are the new Teradata with proprietary formats. 2022 will be about the Open Data Lake Analytics stack that allows for open formats, open source, open cloud and no vendor lock-in.
  • More Open Source Behind Analytics & AI – As the momentum behind the Open Data Lake Analytics stack to power Analytics & AI applications continues to grow, we’ll see a bigger focus on leveraging Open Source to address flexibility and cost limitations from traditional enterprise data warehouses. Open source cloud-native technologies like Presto, Apache Spark, Superset, and Hudi will power AI platforms at a larger scale, opening up new use cases and workloads.
  • Database Engineering is Cool Again – With the rise of the Data Lake tide, 2022 will make database engineering cool again. The database benchmarking wars will be back and the database engineers who can build a data lake stack with data warehousing capabilities (transactions, security) but without the compromises (lock-in, cost) will win. 
  • A Post-Pandemic Data-Driven Strategic Shift to Out-Of-The-Box Solutions – The pandemic has brought about massive change across every industry and the successful “pandemic” companies were able to pivot from their traditional business model. In 2022 we’ll see less time spent on managing complex, distributed systems and more time focused on delivering business-driven innovation. That means more out-of-the-box cloud solution providers that reduce cloud complexities so companies can focus on delivering value to their customers.
  • More SaaS, More Containers – When it comes to 2022, abstracting the complexities of infrastructure will be the name of the game. Containers provide scalability, portability, extensibility and availability advantages, and technologies like Kubernetes alleviate the pain around building, delivering, and scaling containerized apps. As the SaaS space continues to explode, we’ll see even more innovation in the container space. 

Steven Mih, Co-founder and Chief Executive Officer, outlines a major trend he sees on the horizon in 2022:

  • Investment & Adoption of Managed Services for Open Source Will Soar – More companies will adopt managed services for open source in 2022 as more cloud-native open source technologies become mainstream (Spark, Kafka, Presto, Hudi, Superset). Open source companies offering easier-to-use, managed service versions of installed software enable companies to take advantage of these powerful systems without the resource overhead so they can focus on business-driven innovation.

Tweet this: @AhanaIO announces 2022 #Data Predictions #cloud #opensource #analytics https://bit.ly/3pT0KtZ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana and Presto Praised for Technology Innovation and Leadership in Open Source, Big Data and Data Analytics with Recent Industry Awards

San Mateo, Calif. – December 15, 2021 Ahana, the only SaaS for Presto, today announced the addition of many new industry accolades in 2H 2021. Presto, originally created by Meta (Facebook) who open sourced and donated the project to Linux Foundation’s Presto Foundation, is the SQL query engine for the data lake. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data. 

Recent award recognitions, include:

  • 2021 BIG Awards for Business, “Start-up of the Year” –  Ahana is recognized by the Business Intelligence Group as a winner of the 2021 BIG Awards for Business Program in the Start-up of the Year category as a company leading its respective industry.
  • CRN, “Emerging Vendors for 2021” – As part of CRN’s Emerging Vendors for 2021, here are 17 hot big data startups, founded in 2015 or later, that solution providers should be aware of. Ahana is listed for its cloud-native managed service for the Presto distributed SQL query engine for Amazon Web Services.
  • CRN, “2021 Tech Innovator Awards” – From among 373 applicants, CRN staff selected products spanning the IT industry—including in cloud, infrastructure, security, software and devices—that offer both strong differentiation and major partner opportunities. Ahana Cloud for Presto was named a finalist in the Big Data category. 
  • DBTA, “Trend Setting Products in Data and Information Management for 2022” – These products, platforms and services range from long-established offerings that are evolving to meet the needs of their loyal constituents to breakthrough technologies that may only be in the early stages of adoption. However, the common element for all is that they represent a commitment to innovation and seek to provide organizations with tools to address changing market requirements. Ahana is included in this list of most significant products. 
  • Infoworld, “The Best Open Source Software of 2021” – InfoWorld’s 2021 Bossie Awards recognize the year’s best open source software for software development, devops, data analytics, and machine learning. Presto, an open source, distributed SQL engine for online analytical processing that runs in clusters, is recognized with a prestigious Bossie award this year. The Presto Foundation oversees the development of Presto. Meta, Uber, Twitter, and Alibaba founded the Presto Foundation and Ahana is a member.  
  • InsideBIGDATA, “IMPACT 50 List for Q3 and Q4 2021” – Ahana earned an Honorable Mention for both of the last two quarters of the year as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 
  • Solutions Review, “Coolest Data Analytics and Business Intelligence CEOs of 2021” – This list of the coolest data analytics CEOs which includes Ahana’s Cofounder and CEO Steven Mih is based on a number of factors, including the company’s market share, growth trajectory, and the impact each individual has had on its presence in what is becoming the most competitive global software market. One thing that stands out is the diversity of skills that these chief executives bring to the table, each with a unique perspective that allows their company to thrive. 
  • Solutions Review, “6 Data Analytics and BI Vendors to Watch in 2022” – This list is an annual listing of solution providers Solutions Review believes are worth monitoring, which includes Ahana. Companies are commonly included if they demonstrate a product roadmap aligning with Solutions Review’s meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication.

“We are proud that Ahana’s managed service for Presto has been recognized by top industry publications as a solution that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Steven Mih, cofounder and CEO, Ahana. “In less than a year, Ahana’s innovation has been proven with innovative use cases delivering interactive, ad-hoc analytics with Presto without having to worry about the complexities of managing cloud deployments.”

Tweet this:  @AhanaIO praised for technology innovation and leadership with new industry #awards @CRN @DBTA @BigDataQtrly @insideBigData @Infoworld @SolutionsReview #Presto #OpenSource #Analytics #Cloud #DataManagement https://bit.ly/3ESDnWy 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Cloud for Presto Delivers Deep Integration with AWS Lake Formation Through Participation in Launch Program

Integration enables data platform teams to seamlessly integrate Presto with their existing AWS data services while providing granular security for data

San Mateo, Calif. – December 9, 2021 Ahana, the only SaaS for Presto, today announced Ahana Cloud for Presto’s deep integration with AWS Lake Formation, an Amazon Web Services, Inc. (AWS) service that makes it easy to set up a secure data lake, manage security, and provide self-service access to data with Amazon Simple Storage Service (Amazon S3). As an early partner in the launch program, this integration allows data platform teams to quickly set up a secure data lake and run ad hoc analytics on that data lake with Presto, the de facto SQL query engine for data lakes.

Amazon S3 has quickly become the de facto storage for the cloud, widely used as a data lake. As more data is stored in the data lake, query engines like Presto can directly query the data lake for analytics, opening up a broader set of Structured Query Language (SQL) use cases including reporting and dashboarding, data science, and more. Security of all this data is paramount because unlike databases, data lakes do not have built-in security and the same data can be used across multiple compute engines and technologies. This is what AWS Lake Formation solves for.

AWS Lake Formation enables users to set up a secure data lake in days. It simplifies the security on the data lake, allowing users to centrally define security, governance, and auditing policies in one place, reducing the effort in configuring policies across services and providing consistent enforcement and compliance. With this integration, AWS users can integrate Presto natively with AWS Glue, AWS Lake Formation and Amazon S3, seamlessly bringing Presto to their existing AWS stack. In addition to Presto, data platform teams will get unified governance on the data lake for many other compute engines like Apache Spark and ETL-focused managed services in addition to the already supported AWS native services like Amazon Redshift and Amazon EMR.

“We are thrilled to announce our work with AWS Lake Formation, allowing AWS Lake Formation users seamless access to Presto on their data lake,” said Dipti Borkar, Cofounder and Chief Product Officer at Ahana. “Ahana Cloud for Presto coupled with AWS Lake Formation gives customers the ability to stand up a fully secure data lake with Presto on top in a matter of hours, decreasing time to value without compromising security for today’s data platform team. We look forward to opening up even more use cases on the secure data lake with Ahana Cloud for Presto and AWS Lake Formation.”

The Ahana Cloud and AWS Lake Formation integration has already opened up new use cases for customers. One use case centers around making Presto accessible to internal data practitioners like data engineers and data scientists, who can then in turn develop downstream artifacts (e.g. models, dashboards). Another use case is exposing the data platform to external clients, which is how Ahana customer Metropolis is leveraging the integration. In Metropolis’ case, they can provide their external customers transparency into internal operational data and metrics, enabling them to provide an exceptional customer experience.

“Our business relies on providing analytics across a range of data sources for our clients, so it’s critical that we provide both a transparent and secure experience for them,” said Ameer Elkordy, Lead Data Engineer at Metropolis. “We use Amazon S3 as our data lake and Ahana Cloud for Presto for ad hoc queries on that data lake. Now, with the Ahana and AWS Lake Formation integration, we get even more granular security with data access control that’s easy to configure and native to our AWS stack. This allows us to scale analytics out to our teams without worrying about security concerns.”

Ahana Cloud for Presto on AWS Lake Formation is available today. You can learn more and get started at https://ahana.io/aws-lake-formation

Supporting Resources:

TWEET THIS: @Ahana Cloud for #Presto delivers deep integration with AWS Lake Formation  #OpenSource #Analytics #Cloud https://bit.ly/3Ix9L35

About Ahana

Ahana, the only SaaS for Presto, offers a managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the Ahana Cloud for Presto integration with AWS Lake Formation


We’re excited to announce that Ahana Cloud for Presto now integrates with AWS Lake Formation, including support for the recent general availability of row-level security.

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services. Over the past several months we’ve worked closely with the AWS Lake Formation team to bring Lake Formation capabilities to Presto on AWS.  Further, we’re grateful to our customers who were willing to preview early versions of our integration.

Today, Ahana Cloud for Presto allows customers to use Presto to query their data protected with AWS Lake Formation fine-grained permissions with a few clicks.  Our customers can bring Presto to their existing AWS stack and scale their data teams without compromising security.  We’re thrilled that the easiest managed service for Presto on AWS just got easier and more secure.

Here’s a quick video tutorial that shows you how easy it is to get started with AWS Lake Formation and Ahana:

Additionally, we’ve put together a list of resources where you can learn more about the integration.

What’s Next?

If you’re ready to get started with AWS Lake Formation and Ahana Cloud, head over to our account sign up page where you can start with a free 14-day trial of Ahana Cloud. You can also drop us a note at product@ahana.io and we can help get you started. Happy building!


Advanced SQL Tutorial

Advanced SQL: JSON

Advanced SQL queries with JSON

Presto has a wide range of JSON functions supporting advanced SQL queries. Consider this JSON test input data (represented in the query using the VALUES function), which contains three key/value elements. The key is “name” and the value is a dog breed. If we want to select the first (0th) key/value pair, we would write:

SELECT json_extract(v, '$.dogs') AS all_json, 
  json_extract(v, '$.dogs[0].name') AS name_json, 
  json_extract_scalar(v, '$.dogs[0].name') AS name_scalar 
FROM 
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
 
                         all_json                         | name_json | name_scalar 
----------------------------------------------------------+-----------+-------------
 [{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle"  | Beagle      
(1 row)
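
Beyond json_extract, Presto has other JSON helpers that are often useful, such as json_array_length and casting JSON to native Presto types. Here is a quick sketch on the same test data (the column aliases are just illustrative):

SELECT json_array_length(json_extract(v, '$.dogs')) AS num_dogs,
  CAST(json_extract(v, '$.dogs') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS dogs_as_array
FROM 
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);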

All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html 

Advanced SQL: Arrays, Un-nesting, and Lambda functions 

Consider the following array of test data elements, and a simple query to multiply each element by 2:

SELECT elements,
    ARRAY(SELECT v * 2
          FROM UNNEST(elements) AS v) AS my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
 
    elements    | my_result
----------------+---------------------
 [1, 2]         | [2, 4]
 [1, 3, 9]      | [2, 6, 18]
 [1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)

The above advanced SQL query is an example of nested relational algebra, which provides a fairly elegant and unified way to query and manipulate nested data.

Now here’s the same query, but written using a lambda expression. Why use lambda expressions? They make advanced SQL querying of nested data less complex and the code simpler to read, develop, and debug, especially when the logic gets more complicated:

SELECT elements, 
transform(elements, v -> v * 2) as my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);

Both queries return the same result. The transform function and the “x -> y” notation simply mean “do y to my variable x”.
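
Lambda expressions also work with Presto’s other higher-order array functions, such as filter and reduce. Here’s a rough sketch on the same test data (the aliases are just illustrative):

SELECT elements,
  filter(elements, v -> v > 2) AS values_over_two,
  reduce(elements, 0, (s, v) -> s + v, s -> s) AS total
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);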

To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html 

Advanced SQL: Counting Distinct Values

Running a count(distinct xxx) function is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines. The Presto CLI will even display a warning reminding you of this.

A useful alternative is the approx_distinct function, which uses a different algorithm (HyperLogLog) to estimate the number of distinct values. The result is an approximation, and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (though it could be higher with unusual data).

Here’s an example comparing count(distinct) and approx_distinct on a table containing 160.7 million rows. Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):

presto:amazon> select count(distinct product_id) from review;
 
  _col0   
----------
 21460962 
(1 row)
 
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
 
 
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
 
presto:amazon> select approx_distinct(product_id) from review;
  _col0   
----------
 21567368 
(1 row)
 
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]
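
If you need tighter or looser accuracy, approx_distinct also accepts a maximum standard error as a second argument (per the Presto documentation, between 0.0040625 and 0.26). For example, against the same table:

presto:amazon> select approx_distinct(product_id, 0.01) from review;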

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.


Ahana Joins Leading Members of the Presto® Community at PrestoCon as Platinum Sponsor, Will Share Use Cases and Technology Innovations

Ahana to deliver keynote and co-present sessions with Uber, Intel and AWS; Ahana customers to present case studies

San Mateo, Calif. – December 1, 2021 Ahana, the only SaaS for Presto, announced today their participation at PrestoCon, a day dedicated to the open source Presto project taking place on Thursday, December 9, 2021. Presto was originally created by Facebook who open sourced and donated the project to Linux Foundation’s Presto Foundation. Since then it has massively grown in popularity with data platform teams of all sizes.

PrestoCon is a day-long event for the PrestoDB community by the PrestoDB community that will showcase more of the innovation within the Presto open source project as well as real-world use cases. In addition to being the platinum sponsor of the event, Ahana will be participating in 5 sessions and Ahana customer Adroitts will also be presenting their Presto use case. Ahana and Intel will also jointly be presenting on the next-generation Presto which includes the native C++ worker.

“PrestoCon is the marquee event for the Presto community, showcasing the latest development and use cases in Presto,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana, Program Chair of PrestoCon and Chair of the Presto Foundation Outreach Committee. “In addition to contributors from Meta, Uber, Bytedance (TikTok) and Twitter sharing their work, we’re excited to highlight more within the Presto ecosystem including contributions like Databricks’ delta lake connector for Presto, Twitter’s Presto Iceberg Connector, and Presto on Spark. Together with our customers like Adroitts, Ahana will be presenting the latest technology innovations including governance on data lakes with Apache Ranger and AWS Lake Formation. We look forward to the best PrestoCon to date.”

“PrestoCon continues to be the showcase event for the Presto community, and we look forward to building on the success of this event over the past year to share even more innovation and use of the open source project with the larger community,” said Chris Aniszczyk, Vice President, Developer Relations, The Linux Foundation. “Presto Foundation continues to focus on community adoption, and PrestoCon is a big part of that in helping bring the Presto community together for a day of deep learning and connecting.”

“As members of the Presto Foundation focused on driving innovation within the Presto open source project, we’re looking forward to sharing our work on the new PrestoDB C++ execution engine with the community at this year’s PrestoCon,” said Arijit Bandyopadhyay, CTO of Enterprise Analytics & AI, Head of Strategy – Cloud and Enterprise, Data Platforms Group, Intel. “Through collaboration with other Presto leaders Ahana, Bytedance, and Meta on this project, we’ve been able to innovate at a much faster pace to bring a better and faster Presto to the community.”

Ahana Customers Speaking at PrestoCon

Ahana Sessions at PrestoCon

  • Authoring Presto with AWS Lake Formation by Jalpreet Singh Nanda, software engineer, Ahana and Roy Hasson, Principal Product Manager, Amazon 
  • Updates from the New PrestoDB C++ Execution Engine by Deepak Majeti, principal engineer, Ahana and Dave Cohen, senior principal engineer, Intel.
  • Presto Authorization with Apache Ranger by Reetika Agrawal, software engineer, Ahana
  • Top 10 Presto Features for the Cloud by Dipti Borkar, cofounder & CPO, Ahana

Additionally, industry leaders Bytedance (TikTok), Databricks, Meta, Uber, Tencent, and Twitter will be sharing the latest innovation in the Presto project, including Presto Iceberg Connector, Presto on Velox, Presto on Kafka, new Materialized View in Presto, Data Lake Connectors for Presto, Presto on Elastic Capacity, and Presto Authorization with Apache Ranger.

View all the sessions in the full program schedule.  

PrestoCon is a free virtual event and registration is open

Other Resources

Tweet this: @AhanaIO announces its participation in #PrestoCon #cloud #opensource #analytics #presto https://bit.ly/3l50AwJ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is Presto on Spark?

Presto Queries

Overview

Presto was originally designed to run interactive queries against data warehouses. However, it has now evolved into a unified SQL engine for open data lake analytics, covering both interactive and batch workloads. Popular workloads on data lakes include:

1. Reporting and dashboarding

This includes serving custom reporting for both internal and external developers for business insights, and many organizations also use Presto for interactive A/B testing analytics. A defining characteristic of this use case is a requirement for low latency: tens to hundreds of milliseconds at very high QPS. Not surprisingly, this use case uses Presto almost exclusively; it is what Presto was designed for.

2. Data science with SQL notebooks

This use case is one of ad hoc analysis and typically needs moderate latency, ranging from seconds to minutes. These are the queries of data scientists and business analysts who want to perform compact ad hoc analysis to understand product usage, for example user trends and how to improve the product. The QPS is relatively lower because users have to manually initiate these queries.

3. Batch processing for large data pipelines

These are scheduled jobs that run every day, hour, or whenever the data is ready. They often contain queries over very large volumes of data; latency can be up to tens of hours, and processing can range from CPU days to CPU years over terabytes to petabytes of data.

Presto works exceptionally well for ad hoc or interactive queries today, and even some batch queries, with the constraint that the entire query must fit in memory and run quickly enough that fault tolerance is not required. Most ETL batch workloads that don’t fit in this box run on “very big data” compute engines like Apache Spark. Having multiple compute engines with different SQL dialects and APIs makes managing and scaling these workloads complicated for data platform teams. Hence, Facebook decided to simplify things and build Presto on Spark as the path to further scale Presto. Before we get into Presto on Spark, let me explain a bit more about the architecture of each of these two popular engines.

Presto’s Architecture

Presto Architecture

Presto is designed for low latency and follows the classic MPP architecture; it uses in-memory streaming shuffle to achieve low latency. Presto has a single shared coordinator per cluster with an associated pool of workers. Presto tries to schedule as many queries as possible on the same Presto worker (shared executor), in order to support multi-tenancy.

This architecture provides very low latency scheduling of tasks and allows concurrent processing of multiple stages of a query, but the tradeoff is that the coordinator is a single point of failure and a bottleneck, and queries are poorly isolated across the cluster.

Additionally, streaming shuffle does not allow for much fault tolerance, further impacting the reliability of long-running queries.

Spark’s Architecture

Spark Architecture

On the other hand, Apache Spark was designed for scalability from the very beginning, and it implements a MapReduce architecture. Shuffle is fully materialized to disk between stages of execution, with the capability to preempt or restart any task. Spark maintains an isolated Driver to coordinate each query and runs tasks in isolated containers scheduled on demand. These differences improve reliability and reduce overall operational overhead.

Why Presto alone isn’t a good fit for batch workloads

Scaling an MPP-architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1]. To illustrate, let’s examine the aggregation query below. Essentially this query goes over the orders table in TPC-H and aggregates, grouping on custkey and summing totalprice. Presto leverages in-memory shuffle: after reading the data and doing partial aggregation for each key on every worker, it executes a shuffle on custkey.
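
The query being described looks roughly like this (a sketch using the standard TPC-H column names custkey and totalprice):

SELECT custkey,
  sum(totalprice) AS total_price
FROM orders
GROUP BY custkey;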

Parallel Processing

Doing an in-memory shuffle means the producer buffers data in memory and waits for the data to be fetched by the consumer. As a result, we have to execute all the tasks before and after the exchange at the same time; thinking in MapReduce terms, all the mappers and the reducers have to run concurrently. This makes in-memory shuffle an all-or-nothing execution model.

This makes scheduling inflexible, and scaling the query size becomes more difficult because everything is running concurrently. In the aggregation phase, the query may exceed the memory limit because everything has to be held in memory in hash tables in order to track each group (custkey).

Additionally, we are limited by the size of the cluster in how many nodes we can hash-partition the data across to avoid having to fit it all in memory. Using distributed disk (Presto-on-Spark, Presto Unlimited), we can partition the data further and are only limited by the number of open files, and even that is a limit that can be scaled quite a bit by a shuffle service.

For these reasons, Presto is difficult to scale to very large and complex batch pipelines. Such pipelines remain running for hours, all to join and aggregate over a huge amount of data. This motivated the development of Presto Unlimited, which adapts Presto’s MPP design to large ETL workloads and improves the user experience at scale.

Presto Unlimited

While Presto Unlimited solved part of the problem by allowing shuffle to be partitioned over distributed disk, it didn’t fully solve fault tolerance. Additionally, it did nothing to improve isolation and resource management.

Presto on Spark

Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library with Spark’s RDD API used to manage execution of Presto’s embedded evaluation. This is similar to how Google chose to embed F1 Query inside their MapReduce framework.

The high-level goal is to bring a fully disaggregated shuffle to Presto’s MPP runtime, and we achieved this by adding a materialization step right after the shuffle. The materialized shuffle is modeled as a temporary partitioned table, which brings more flexible execution after the shuffle and allows partition-level retries. With Presto on Spark, we can do a fully disaggregated shuffle on custkey for the above query on both the mapper and reducer side; this means all mappers and reducers can be independently scheduled and are independently retriable.

Presto on Spark

Presto On Spark at Intuit

Superglue is a homegrown tool at Intuit that helps users build, manage and monitor data pipelines. Superglue was built to democratize data for analysts and data scientists. Superglue minimizes time spent developing and debugging data pipelines, and maximizes time spent on building business insights and AI/ML.

Many analysts at Intuit use Presto (AWS Athena) to explore data in the data lake/S3. These analysts would spend several hours converting exploration SQL written for Presto into Spark SQL to operationalize/schedule it as data pipelines in Superglue. To minimize SQL dialect conversion issues and the associated productivity loss for analysts, the Intuit team started to explore various options including query translation, query virtualization, and Presto on Spark. After a quick POC, Intuit decided to go with Presto on Spark, because it leverages Presto’s compiler/evaluation as a library (no query conversion is required) along with Spark’s scalable data processing capabilities.

Presto on Spark is now in production at Intuit. Within three months, hundreds of critical pipelines comprising thousands of jobs were running on Presto on Spark via Superglue.

Presto on Spark runs as a library that is submitted with spark-submit or a Jar Task on the Spark cluster. Scheduled batch data pipelines are launched on ephemeral clusters to take advantage of resource isolation, manage cost, and minimize operational overhead. DDL statements are executed against Hive and DML statements are executed against Presto. This enables analysts to write Hive-compatible DDL, and the user experience remains unchanged; a rough sketch of that split is shown below.
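
As an illustration of that split (the schema and table names here are hypothetical), a pipeline step might create its target table with Hive-compatible DDL and then populate it with a query executed by Presto on Spark:

-- DDL: Hive-compatible table definition (hypothetical names)
CREATE TABLE IF NOT EXISTS analytics.daily_order_totals (
  orderdate DATE,
  total_price DOUBLE
);

-- DML: the pipeline query itself, executed by Presto on Spark
INSERT INTO analytics.daily_order_totals
SELECT orderdate, sum(totalprice) AS total_price
FROM orders
GROUP BY orderdate;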

This solution helped enable a performant and scalable platform with seamless end-to-end experience for analysts to explore and process data. It thereby improved analysts’ productivity and empowered them to deliver insights at high speed.

When To Use Presto on Spark

Spark is the tool of choice across the industry for running large-scale, complex batch ETL pipelines. Presto on Spark heavily benefits pipelines written in Presto that operate on terabytes/petabytes of data, taking advantage of Spark’s large-scale processing capabilities. The biggest win is that no query conversion is required, and you can leverage Spark for:

  • Scaling to larger data volumes
  • Scaling Presto’s resource management to larger clusters
  • Increasing reliability and elasticity of Presto as a compute engine

Why Presto on Spark matters

We tried to achieve the following to adapt ‘Presto on Spark’ to Internet-scale batch workloads [2]:

  • Fully disaggregated shuffles
  • Isolated executors
  • Presto resource management, Different Scheduler, Speculative Execution, etc.

A unified option for batch data processing and ad hoc analysis is important for creating the experience of queries that scale, without requiring rewrites between different SQL dialects. We believe this is only a first step towards more confluence between the Spark and Presto communities, and a major step towards a unified SQL experience between interactive and batch use cases. Today, Internet giants like Facebook have moved over to Presto on Spark, and we have seen many organizations, including Intuit, start running their complex data pipelines in production with Presto on Spark.

“Presto on Spark” is one of the most active development areas in Presto. Feel free to check it out and please give it a star! If you have any questions, ask in the PrestoDB Slack channel.

Reference

[1] MapReduce: Simplified Data Processing on Large Clusters 

[2] Presto-on-Spark: A Tale of Two Computation Engines


Ahana Achieves AWS Data & Analytics ISV Competency Status

AWS ISV Technology Partner demonstrates AWS technical expertise 

and proven customer success

San Mateo, Calif. – November 10, 2021 Ahana, the Presto company, today announced that it has achieved Amazon Web Services (AWS) Data & Analytics ISV Competency status. This designation recognizes that Ahana has demonstrated technical proficiency and proven success in helping customers evaluate and use the tools, techniques, and technologies of working with data productively, at any scale, to successfully achieve their data and analytics goals on AWS.

Achieving the AWS Data & Analytics ISV Competency differentiates Ahana as an AWS ISV Partner in the AWS Partner Network (APN) that possesses deep domain expertise in data analytics platforms based on the open source Presto SQL distributed query engine, having developed innovative technology and solutions that leverage AWS services.

AWS enables scalable, flexible, and cost-effective solutions from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise. 

“Ahana is proud to achieve the AWS Data & Analytics ISV Competency, which adds to our AWS Global Startups and AWS ISV Accelerate Partner status,” said Steven Mih, Co-Founder and CEO at Ahana. “Our team is dedicated to helping companies bring SQL to their AWS S3 data lake for faster time-to-insights by leveraging the agility, breadth of services, and pace of innovation that AWS provides.”

TWEET THIS: @Ahana Cloud for #Presto achieves AWS Data and #Analytics Competency Status #OpenSource #Cloud https://bit.ly/3EZXpy1

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Tutorial: How to define SQL functions with Presto across all connectors


Presto is the open source SQL query engine for data lakes. It supports many native functions, which are usually sufficient for most use cases. However, there may be a corner case where you need to implement your own function. To simplify this, Presto allows users to define expressions as SQL functions. These are dynamic functions separated from the Presto source code, managed by a function namespace manager that you can set up with a MySQL database. In fact, this is one of the most widely used features of Presto at Facebook, with thousands of functions defined.

Function Namespace Manager

A function namespace is a special catalog.schema that stores functions, in a format like mysql.test. Each catalog.schema can be a function namespace. A function namespace manager is a plugin that manages a set of these function catalog schemas. A catalog can be mapped to a connector in Presto (a connector for functions, with no tables or views), which allows the Presto engine to perform actions such as creating, altering, and deleting functions.

This user-defined function management is separated from the connector API for flexibility, so these SQL functions can be used across all connectors. Further, a query is guaranteed to use the same version of a function throughout its execution, and any modification to functions is versioned.

Implementation

Today, the function namespace manager is implemented with the help of MySQL, so users need a running MySQL service to initialize the MySQL-based function namespace manager.

Step 1: Provision a MySQL server and note its JDBC URL for further access.

Suppose the MySQL server can be reached at localhost:3306; an example database URL would be:

jdbc:mysql://localhost:3306/presto?user=root&password=password

Step 2: Create a database in MySQL to store the function namespace manager’s data

 CREATE DATABASE presto;
 USE presto;

Step 3: Configure Presto [2]

Create the function namespace manager configuration under etc/function-namespace/mysql.properties:

function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306/presto?user=root&password=password
function-namespaces-table-name=function_namespaces
functions-table-name=sql_functions

And restart the Presto Service.

Step 4: Create a new function namespace

Once the Presto server is started, we will see the tables below under the presto database (which is being used to manage function namespaces) in MySQL:

mysql> show tables;
+---------------------+
| Tables_in_presto    |
+---------------------+
| enum_types          |
| function_namespaces |
| sql_functions       |
+---------------------+
3 rows in set (0.00 sec)

To create a new function namespace “ahana.default”, insert into the function_namespaces table:

INSERT INTO function_namespaces (catalog_name, schema_name)
    VALUES('ahana', 'default');

Step 5: Create a function and query from Presto [1]


Here is a simple example of a SQL function for cosecant:

presto> CREATE OR REPLACE FUNCTION ahana.default.cosec(x double)
RETURNS double
COMMENT 'Cosecant trigonometric function'
LANGUAGE SQL
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN 1 / sin(x);

More examples can be found at https://prestodb.io/docs/current/sql/create-function.html#examples [1]

Step 6: Use the newly created function in a SQL query


Users must reference the fully qualified function name when calling it in a SQL query.

Following is an example of using the cosec SQL function in a query.

presto> select ahana.default.cosec (50) as Cosec_value;
     Cosec_value     
---------------------
 -3.8113408578721053 
(1 row)

Query 20211103_211533_00002_ajuyv, FINISHED, 1 node
Splits: 33 total, 33 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Here is another simple example: creating an EpochTimeToLocalDate function under the ahana.default function namespace to convert Unix time to a local timestamp.

presto> CREATE FUNCTION ahana.default.EpochTimeToLocalDate (x bigint) 
     -> RETURNS timestamp 
     -> LANGUAGE SQL 
     -> DETERMINISTIC RETURNS NULL ON NULL INPUT 
     -> RETURN from_unixtime (x);
CREATE FUNCTION

presto> select ahana.default.EpochTimeToLocalDate(1629837828) as date;
          date           
-------------------------
 2021-08-24 13:43:48.000 
(1 row)
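
Once created, SQL functions live in the function namespace and can be managed with regular DDL as well. A quick sketch, reusing the cosec function defined above:

presto> SHOW FUNCTIONS;

presto> DROP FUNCTION ahana.default.cosec(double);

SHOW FUNCTIONS lists the SQL functions you have created alongside the built-ins, and DROP FUNCTION removes a function from its namespace (see the CREATE FUNCTION documentation linked above for the full DDL syntax).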

Note

The function-namespaces-table-name property (the name of the table that stores all the function namespaces managed by this manager) can be used if there is a need to instantiate multiple function namespace managers. Otherwise, you can create functions in a single function namespace manager and use them across all databases and connectors. [2]

At Ahana, we have simplified all of these steps: the MySQL container, schema, database, tables, and additional configuration required to manage functions. Data platform users just need to create their own SQL functions and use them in SQL queries; there is no need to worry about provisioning and managing additional MySQL servers.

Future Roadmap

Remote function support with a remote UDF Thrift API

This will allow you to run arbitrary functions that are either not safe or not possible to run within the worker JVM: unreliable Java functions, C++, Python, etc.

References

[1] DDL Syntax to use FUNCTIONS

[2] Function Namespace Manager Documentation

Ahana Cofounder Will Present Session At Next Gen Big Data Platforms Meetup hosted by LinkedIn About Open Data Lake Analytics


San Mateo, Calif. – November 2, 2021 — Ahana, the Presto company, today announced that its Cofounder and Chief Product Officer Dipti Borkar will present a session at Next Gen Big Data Platforms Meetup hosted by LinkedIn about open data lake analytics. The event is being held on Wednesday, November 10, 2021.

Session Title: “Unlock the Value of Data with Open Data Lake Analytics.”

Session Time: Wednesday, November 10 at 4:10 pm PT / 7:10 pm ET

Session Presenter: Ahana Cofounder and Chief Product Officer and Presto Foundation Chairperson, Outreach Team, Dipti Borkar

Session Details: Favored for its affordability, data lake storage is becoming standard practice as data volumes continue to grow. Data platform teams are increasingly looking at data lakes and building advanced analytical stacks around them with open source and open formats to future-proof their platforms. This meetup will help you gain clarity around the choices available for data analytics and the next generation of the analytics stack with open data lakes. The presentation will cover the generations of analytics, selecting data lakes vs. data warehouses, how these approaches differ from the Hadoop generation, why open matters, use cases and workloads for data lakes, and an intro to the data lakehouse stack.

To register for the Next Gen Big Data Platforms Meetup, please go to the event registration page to purchase a registration.

TWEET THIS: @Ahana to present at Next Gen Big Data Platforms Meetup about Open Data Lake Analytics #Presto #OpenSource #Analytics #Cloud https://bit.ly/3vXKl8S

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Presto SQL Engine

How to Manage Presto Queries Running Slow

Presto queries can run slowly for a few different reasons. Below we’ll share some diagnostic steps and Presto tuning tips, as well as possible solutions to address the most common issues and improve performance.

Troubleshoot Presto Queries

  1. How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analyzing, planning and optimizing queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note: Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help. You may also run into issues if there are many concurrent Presto users.
  2. Have you configured Presto’s memory usage correctly? It is often necessary to change the default memory configuration based on your cluster’s capacity. The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
  3. What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing instances with higher network IO for the workers; on AWS, for example, you can compare each instance type’s “network performance” rating.
  4. Optimize your metadata / data catalog: Using Presto’s Hive connector for your metastore, like many users do, means practically every query will access the Hive metastore for table and partition details, etc. During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider:
    • Set up multiple catalogs. Configure PrestoDB to use multiple thrift metastore endpoints – Presto’s Hive connector supports configuring multiple Hive metastore endpoints which are tried in round-robin by the coordinator. See https://prestodb.io/docs/current/connector/hive.html
    • Enable Hive metastore caching, and carefully tweak the cache eviction configurations and TTLs to suit your data refresh policies.
  5. Do you have a separate coordinator node? When running Presto queries, keep in mind you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes, but it’s obviously not optimal in terms of performance. It is nearly always recommended to have the coordinator running on a separate node from the workers for anything other than sandbox use. Tip: check each node’s Presto etc/config.properties file to determine which one is the coordinator (look for coordinator=true).
  6. Is memory exhausted? If so, this will delay your Presto queries and affect performance. Presto uses an in-memory, pipelined processing architecture, and its operation depends on the available JVM heap, which in turn depends on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running on.
    • The workers can be memory hungry when processing very large Presto queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to using a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload, the resources available, etc. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. 
    • The good news is there is memory information available in at least two places:
      • Presto’s built-in JMX catalog can help you monitor memory usage with various counters (see the example query after this list). Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking
      • There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.
  7. When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can solve all sorts of issues, including memory leaks and garbage collection overhead, which in turn can speed up your Presto queries.
  8. Is your Presto cluster configured for autoscaling based on CPU usage? If so, check that the configuration is what you expect it to be; a misbehaving autoscaling policy can directly affect the performance of your Presto queries.
  9. Do IO and CPU utilization look balanced? Check CPU usage on the Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads can be increased, or that the number of splits in a batch is not high enough.
  10. Have you checked your data volumes recently? An obvious one to check, but data volumes can grow in fits and starts, and peaks sometimes occur unexpectedly. Your Presto queries may simply be taking longer because there is x% more data than last month.
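
To illustrate the JMX route from point 6, here is a minimal sketch of inspecting memory counters from the Presto CLI. The MBean and table names shown are assumptions that can vary by Presto version, so list the available JMX tables first and adjust the query accordingly:

presto> SHOW TABLES FROM jmx.current LIKE '%memorypool%';
presto> SELECT * FROM jmx.current."com.facebook.presto.memory:type=memorypool,name=general";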

There are other configuration settings for task concurrency, initial splits per node, join strategy, driver tasks and more. PrestoDB has around 82 system configuration properties and 50+ Hive configuration settings that users can tweak, many at the query level. These are for advanced users, and covering them falls outside the scope of this article; making careless alterations here can just as easily slow your Presto queries down. More information can be found in the PrestoDB documentation.
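
For illustration only, here are a few of these settings as they might appear in a coordinator’s etc/config.properties file. The property names exist in PrestoDB, but the values below are placeholders rather than recommendations – validate them against the documentation for your version:

task.concurrency=16
node-scheduler.max-splits-per-node=100
join-distribution-type=AUTOMATIC
query.max-memory=50GB
query.max-memory-per-node=1GB

Many of these can also be overridden per query as session properties, for example: SET SESSION join_distribution_type = 'PARTITIONED';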

As you can tell, there’s a lot to configure and tune when it comes to addressing Presto performance issues. To make it easier, you can use Ahana Cloud, SaaS for Presto. It’s available in the AWS Marketplace and is pay as you go. Check out our free trial at https://ahana.io/sign-up


Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.

Presto 105: Running Presto with AWS Glue as catalog on your Laptop

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto 104: Running Presto with Hive Metastore

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.

We mentioned in the Presto 104 tutorial why we are using a catalog. Just to recap, Presto is a disaggregated database engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog, where the database schema resides, is a catalog. Two popular catalogs have emerged – the Hive Metastore and the AWS Glue catalog.

What is AWS Glue?

AWS Glue is a serverless data integration service provided by AWS that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog does the mapping between the database tables and columns and the objects or files that reside in the data lake – these could be files or immutable objects in AWS S3.

In this tutorial, we will focus on using Presto with the AWS Glue on your laptop.   

This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.

Implementation steps

Step 1: 

Create a docker network namespace, so that containers could communicate with each other using the network namespace.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2: 

Ahana has developed a sandbox for prestodb that can be downloaded from Docker Hub. Use the command below to download the prestodb sandbox, which comes with all the packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the prestodb sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to be running on this container, without a worker. So let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file so that this instance runs only as a Presto coordinator, with the worker service disabled.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1 ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2 ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3 ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081/8082/8083 respectively for each worker, and finally set discovery.uri to point to the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install the aws-cli and configure AWS Glue on the coordinator and worker containers.

# yum install -y aws-cli

Step 9: 

Create a glue user and attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies to it.

aws iam create-user --user-name glueuser
{
    "User": {
        "Path": "/",
        "UserName": "glueuser",
        "UserId": "AXXXXXXXXXXXXXXXX",
        "Arn": "arn:aws:iam::XXXXXXXXXX:user/glueuser",
        "CreateDate": "2021-10-07T01:07:28+00:00"
    }
}

aws iam list-policies | grep AmazonS3FullAccess
            "PolicyName": "AmazonS3FullAccess",
            "Arn": "arn:aws:iam::aws:policy/AmazonS3FullAccess",

aws iam list-policies | grep AWSGlueConsoleFullAccess
            "PolicyName": "AWSGlueConsoleFullAccess",
            "Arn": "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

Step 10:

Create access key

% aws iam create-access-key --user-name glueuser
{
   "AccessKey": {
       "UserName": "glueuser",
        "AccessKeyId": "XXXXXXXXXXXXXXXXXX", 
       "Status": "Active",
        "SecretAccessKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "CreateDate": "2021-10-13T01:50:45+00:00"
    }
}

Step 11:

Run aws configure and enter the access and secret key configured.

aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXX
AWS Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]:

Step 12:

Create the /opt/presto-server/etc/catalog/glue.properties file to add the AWS Glue properties to Presto. This file needs to be added on both the coordinator and the worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.

connector.name=hive-hadoop2
hive.metastore=glue
hive.non-managed-table-writes-enabled=true
hive.metastore.glue.region=us-east-2
hive.metastore.glue.aws-access-key=<your AWS key>
hive.metastore.glue.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use glue as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog glue

Step 15:

Create a schema using S3 location.

presto:default> create schema glue.demo with (location= 's3://Your_Bucket_Name/demo');
CREATE SCHEMA
presto:default> use demo;

Step 16:

Create table under glue.demo schema

presto:demo> create table glue.demo.part with (format='parquet') AS select * from tpch.tiny.part;
CREATE TABLE: 2000 rows
    
Query 20211013_034514_00009_6hkhg, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:06 [2K rows, 0B] [343 rows/s, 0B/s]

Step 17:

Run select statement on the newly created table.

presto:demo> select * from glue.demo.part limit 10; 
partkey |                   name                   |      mfgr      |  brand
---------+------------------------------------------+----------------+---------
       1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13
       2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13
       3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42
       4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34
       5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32
       6 | bisque cornflower lawn forest magenta    | Manufacturer#2 | Brand#24
       7 | moccasin green thistle khaki floral      | Manufacturer#1 | Brand#11
       8 | misty lace thistle snow royal            | Manufacturer#4 | Brand#44
       9 | thistle dim navajo dark gainsboro        | Manufacturer#4 | Brand#43
      10 | linen pink saddle puff powder            | Manufacturer#5 | Brand#54

Summary

In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.


0 to Presto in 30 minutes with AWS & Ahana Cloud

On-Demand Webinar

Data lakes are widely used and have become extremely affordable, especially with the advent of technologies like AWS S3. During this webinar, Gary Stafford, Solutions Architect at AWS, and Dipti Borkar, Cofounder & CPO at Ahana, will share how to build an open data lake stack with Presto and AWS S3.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. 

In this webinar, you’ll learn:

  • What an Open Data Lake Analytics stack is
  • How you can use Presto to underpin that stack in AWS
  • A demo on how to get started building your Open Data Lake Analytics stack in AWS

Speakers

Gary Stafford

Solutions Architect, AWS

Gary Stafford, AWS

Dipti Borkar

Cofounder & CPO, Ahana

Dipti Borkar, Ahana

Webinar On-Demand
How to Build an Open Data Lake Analytics Stack

While data lakes are widely used and extremely affordable, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake?

The answer is the Open Data Lake Analytics stack. In this webinar, we’ll discuss how to build this stack using 4 key components: open source technologies, open formats, open interfaces & open cloud. Additionally, you’ll learn why open source Presto has become the de facto query engine for the data lake, enabling ad hoc data discovery using SQL.

You’ll learn:

• What an Open Data Lake Analytics Stack is

• How Presto, the de facto query engine for the data lake, underpins that stack

• How to get started building your open data lake analytics stack today

Speaker

Dipti Borkar

Cofounder & CPO, Ahana

Dipti Borkar

Presto 104: Running Presto with Hive Metastore on your Laptop

Introduction

This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.

Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog, where the database schema resides, lives in what is called a catalog. Two popular catalogs have emerged. From the Hadoop world, the Hive Metastore continues to be widely used; note this is different from the Hive query engine – it is the system catalog, where information about table schemas and their locations lives. In AWS, the Glue catalog is also very popular.

In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.   

What is the Hive Metastore?

The Hive Metastore is the mapping between the database tables and columns and the objects or files that reside in the data lake. This could be a file system when using HDFS, or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workloads, Ahana Cloud, which provides Presto as a managed service along with a Hive Metastore, is a good choice if you are looking for an easy and performant solution for SQL on AWS S3.

Presto 104

Implementation steps

Step 1

Create a docker network namespace, so that containers could communicate with each other using the network namespace.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for prestodb that can be downloaded from Docker Hub. Use the command below to download the prestodb sandbox, which comes with all the packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the prestodb sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to be running on this container, without a worker. So let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file so that this instance runs only as a Presto coordinator, with the worker service disabled.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3  ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081/8082/8083 respectively for each worker, and finally set discovery.uri to point to the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install and configure Hive on the coordinator container.

Install wget, procps, tar and less:

# yum install -y wget procps tar less

Step 9:

Download and install the Hive and Hadoop packages, and set the HOME and PATH variables for JAVA, HIVE and HADOOP.

#HIVE_BIN=https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
#HADOOP_BIN=https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz


#wget --quiet ${HIVE_BIN}
#wget --quiet ${HADOOP_BIN}


#tar -xf apache-hive-3.1.2-bin.tar.gz -C /opt
#tar -xf hadoop-3.3.1.tar.gz -C /opt
#mv /opt/apache-hive-3.1.2-bin /opt/hive
#mv /opt/hadoop-3.3.1 /opt/hadoop


#export JAVA_HOME=/usr
#export HIVE_HOME=/opt/hive
#export HADOOP_HOME=/opt/hadoop
#export PATH=$PATH:${HADOOP_HOME}:${HADOOP_HOME}/bin:$HIVE_HOME:/bin:.
#cd /opt/hive

Step 10:

Download additional jars needed to run with S3

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.10.6/aws-java-sdk-core-1.10.6.jar

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar

#wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.jar

#cp aws-java-sdk-core-1.10.6.jar /opt/hadoop/share/hadoop/tools/lib/
#cp aws-java-sdk-s3-1.10.6.jar  /opt/hadoop/share/hadoop/tools/lib/
#cp hadoop-aws-2.8.4.jar  /opt/hadoop/share/hadoop/tools/lib/

echo "export
HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.ja

r:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3
1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar" >>/opt/hive/conf/hive-env.sh

Step 11:

Configure and start hive

cp /opt/hive/conf/hive-default.xml.template /opt/hive/conf/hive-site.xml
mkdir -p /opt/hive/hcatalog/var/log
bin/schematool -dbType derby -initSchema
bin/hcatalog/sbin/hcat_server.sh start

Step 12:

Create the /opt/presto-server/etc/catalog/hive.properties file to add the Hive endpoint to Presto. This file needs to be added on both the coordinator and the worker containers.

If you choose to validate using an AWS S3 bucket, provide the security credentials for it.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Key>
hive.s3.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use hive as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog hive

Step 15:

Create schema using local or S3 location.

presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;

If you have access to an S3 bucket, then use the following create command with S3 as the destination:

presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;

Step 16:

Hive has the option to create two types of tables:

  • Managed tables 
  • External tables

Managed tables are tightly coupled with the data at the destination, which means that if you delete the table, the associated data is also deleted.

External tables are loosely coupled with the data: the table only maintains a pointer to the data, so deleting the table does not delete the data at the external location.

Transactional semantics (ACID) are only supported on managed tables.
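
For contrast, here is a minimal sketch of creating an external table through Presto’s Hive connector. The column list and the S3 path are placeholders, and the example assumes the hive catalog configured earlier in this tutorial:

presto:tpch> create table hive.tpch.lineitem_ext (orderkey bigint, quantity double) with (format='PARQUET', external_location='s3a://bucket_name/lineitem/');

Dropping lineitem_ext afterwards would remove only the table definition; the Parquet files under the external location would remain.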

We will create a managed table under the hive.tpch schema:

presto:tpch> create table hive.tpch.lineitem with (format='PARQUET') AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows
Query 20210921_051649_00015_uvkq7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1:48 [6M rows, 0B] [55.4K rows/s, 0B/s]

Step 17:

Run desc on the table to see its columns.

presto> desc hive.tpch.lineitem     
-> ;    
Column     |    Type     | Extra | Comment
---------------+-------------+-------+--------- 
orderkey      | bigint      |       | 
partkey       | bigint      |       | 
suppkey       | bigint      |       | 
linenumber    | integer     |       | 
quantity      | double      |       | 
extendedprice | double      |       | 
discount      | double      |       | 
tax           | double      |       | 
returnflag    | varchar(1)  |       | 
linestatus    | varchar(1)  |       | 
shipdate      | date        |       | 
commitdate    | date        |       | 
receiptdate   | date        |       | 
shipinstruct  | varchar(25) |       | 
shipmode      | varchar(10) |       | 
comment       | varchar(44) |       |
(16 rows)
Query 20210922_224518_00002_mfm8x, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:08 [16 rows, 1.04KB] [1 rows/s, 129B/s]

Summary

In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.


Webinar On-Demand
Unlocking the Value of Your Data Lake

Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.

During this webinar, Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.

Dipti will cover:

  • Open Data Lake analytics – what it is and what use cases it supports
  • Why companies are moving to an open data lake analytics approach
  • Why the open source data lake query engine Presto is critical to this approach

Speaker

Dipti Borkar

Cofounder & CPO, Ahana

Dipti Borkar

Connect Superset to Presto

Presto with Superset

This blog post will provide you with an understanding of how to connect Superset to Presto.

TL;DR

Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.

Superset and SQLAlchemy

Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer to relational data sources. Superset uses a consistent SQLAlchemy URI as a connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.

Apache superset

SQLAlchemy defines a dialect as the system it uses to communicate with the specifics of various databases (e.g. the flavor of SQL) and DB-APIs, the low-level Python APIs used to talk to specific relational data sources. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver to connect to Presto. It is possible for a single dialect to choose between multiple DB-API drivers. For example, the PostgreSQL dialect can support the following DB-API drivers: psycopg2, pg8000, psycopg2cffi, and pygresql. Typically, a single DB-API driver is set as the default for a dialect and used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.

The term database can be confusing since it is heavily overloaded. In a typical scenario, a given data source, such as PostgreSQL, has multiple logical groupings of tables, which are called “databases”. In a way, these “databases” provide namespaces for tables; identically named tables can exist in two different “databases” without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.

In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.

superset@localhost:superset> \l

+-----------+----------+----------+------------+------------+-----------------------+
| Name      | Owner    | Encoding | Collate    | Ctype      | Access privileges     |
|-----------+----------+----------+------------+------------+-----------------------|
| postgres  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| superset  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| template0 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
| template1 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
+-----------+----------+----------+------------+------------+-----------------------+

We can look into the superset database and see the tables in that database.

The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that is referred to in a particular dialect.

superset@localhost:superset> \c superset

You are now connected to database "superset" as user "superset"

+--------+----------------------------+-------+----------+
| Schema | Name                       | Type  | Owner    |
|--------+----------------------------+-------+----------|
| public | Clean                      | table | superset |
| public | FCC 2018 Survey            | table | superset |
| public | ab_permission              | table | superset |
| public | ab_permission_view         | table | superset |
| public | ab_permission_view_role    | table | superset |
| public | ab_register_user           | table | superset |
| public | ab_role                    | table | superset |
| public | ab_user                    | table | superset |
| public | ab_user_role               | table | superset |
| public | ab_view_menu               | table | superset |
| public | access_request             | table | superset |
| public | alembic_version            | table | superset |
| public | alert_logs                 | table | superset |
| public | alert_owner                | table | superset |
| public | alerts                     | table | superset |
| public | annotation                 | table | superset |
| public | annotation_layer           | table | superset |
| public | bart_lines                 | table | superset |
| public | birth_france_by_region     | table | superset |
| public | birth_names                | table | superset |
| public | cache_keys                 | table | superset |
| public | channel_members            | table | superset |
| public | channels                   | table | superset |
| public | cleaned_sales_data         | table | superset |
| public | clusters                   | table | superset |
| public | columns                    | table | superset |
| public | covid_vaccines             | table | superset |
:

With an understanding of dialects, drivers, and databases under our belt, let’s solidify it with a few examples. Let’s assume we want to create a Superset database to a PostgreSQL data source and particular PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432 and we will log in as sonny (password is foobar). Here are three SQLAlchemy URIs we could use (actually inspired from the SQLAlchemy documentation):

  1. postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
  2. postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
  3. postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.

Superset lists its recommended Python packages for database drivers in the public documentation.

Presto Catalogs

Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.

In Presto, the equivalent notion of a “database” (i.e. logical collection of tables) is called a schema. Access to a specific schema (“database”) in a data source, is defined in a catalog.

As an example, the listing below is the equivalent catalog configuration to connect to the example mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table would be select * from postgresql.mydatabase.Clean.

connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar

Prestodb

Superset to Presto

Going back to Superset, to create a Superset database to connect to Presto, we specify the Presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we need to provide (and authenticate against) are the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto, through its catalog configuration, authenticates to the backing data source with the appropriate credentials (e.g. sonny and foobar). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>

what is apache superset

The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.
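
Putting this together with the PostgreSQL catalog example from earlier, a Superset SQLAlchemy URI for Presto might look like the following; the username, password, and hostname are placeholders for your own Presto credentials and coordinator address:

presto://presto_user:presto_password@presto-coordinator.example.com:8080/postgresql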

New Superset Database Connection UI

In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.

Try It Out!

In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto Tutorial 103: PrestoDB cluster on GCP

Introduction

This tutorial is Part III of our Getting started with PrestoDB series. As a reminder, Prestodb is an open source distributed SQL query engine. In tutorial 102 we covered how to run a three node prestodb cluster on a laptop. In this tutorial, we’ll show you how to run a prestodb cluster in a GCP environment using VM instances and GKE containers.

Environment

This guide was developed on GCP VM instances and GKE containers.

Presto on GCP with VMs

Implementation steps for prestodb on vm instances

Step 1: Create a GCP VM instance using the CREATE INSTANCE tab and name it presto-coordinator. Next, create three more VM instances named presto-worker1, presto-worker2 and presto-worker3 respectively.


Step 2: By default GCP blocks all network ports, so prestodb will need ports 8080-8083 enabled. Use the firewall rules tab to enable them.


Step 3: 

Install JAVA and python.

Step 4:

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  721M  100  721M    0     0   245M      0  0:00:02  0:00:02 --:--:--  245M
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 12.7M  100 12.7M    0     0  15.2M      0 --:--:-- --:--:-- --:--:-- 15.1M
user@presto-coordinator-1:~$

Step 5:

Use gunzip and tar to unzip and untar the presto-server

user@presto-coordinator-1:~$gunzip presto-server-0.235.1.tar.gz ;tar -xf presto-server-0.235.1.tar

Step 6: (optional)

Rename the directory without version number

user@presto-coordinator-1:~$ mv presto-server-0.235.1 presto-server

Step 7:  

Create etc, etc/catalog and data directories

user@presto-coordinator-1:~/presto-server$ mkdir etc etc/catalog data

Step 8:

Define the etc/node.properties, etc/config.properties, etc/jvm.config, etc/log.properties and etc/catalog/jmx.properties files as below for the Presto coordinator server.

user@presto-coordinator-1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/user/presto-server/data

user@presto-coordinator-1:~/presto-server$ cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

user@presto-coordinator-1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-coordinator-1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-coordinator-1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step: 9 

Check the cluster UI status. It should  show the Active worker count at 0 since we enabled only the coordinator.


Step 10: 

Repeat steps 1 to 8 on the remaining 3 vm instances which will act as worker nodes.

On the configuration step for worker nodes, set coordinator to false and http-server.http.port to 8081, 8082 and 8083 for worker1, worker2 and worker3 respectively.

Also make sure node.id and http-server.http.port are different for each worker node.

user@presto-worker1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffffd
node.data-dir=/home/user/presto-server/data
user@presto-worker1:~/presto-server$ cat etc/config.properties
coordinator=false
http-server.http.port=8083
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://presto-coordinator-1:8080

user@presto-worker1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-worker1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-worker1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step 11: 

Check the cluster status; it should now reflect the three worker nodes as part of the prestodb cluster.


Step 12:

Verify the prestodb environment by running the prestodb CLI with simple JMX query

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator


Implementation steps for Prestodb on GKE containers

Step 1:

Go to the Google cloud Console and activate the cloud console window


Step 2:

Create an artifacts repository using the command below, replacing REGION with the region in which you would prefer to create the repository.

gcloud artifacts repositories create ahana-prestodb \
   --repository-format=docker \
   --location=REGION \
   --description="Docker repository"

Step 3:

Create the container cluster by using the gcloud command: 

user@cloudshell:~ (weighty-list-324021)$ gcloud config set compute/zone us-central1-c
Updated property [compute/zone].

user@cloudshell:~ (weighty-list-324021)$ gcloud container clusters create prestodb-cluster01

Creating cluster prestodb-cluster01 in us-central1-c…done.
Created 
.
.
.

kubeconfig entry generated for prestodb-cluster01.
NAME                LOCATION       MASTER_VERSION   MASTER_IP     MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
prestodb-cluster01  us-central1-c  1.20.8-gke.2100  34.72.76.205  e2-medium     1.20.8-gke.2100  3          RUNNING
user@cloudshell:~ (weighty-list-324021)$

Step 4:

After container cluster creation, run the following command to see the cluster’s three nodes

user@cloudshell:~ (weighty-list-324021)$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-prestodb-cluster01-default-pool-34d21367-25cw   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-7w90   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-mwrn   Ready    <none>   7m53s   v1.20.8-gke.2100
user@cloudshell:~ (weighty-list-324021)$

Step 5:

Pull the prestodb docker image 

user@cloudshell:~ (weighty-list-324021)$ docker pull ahanaio/prestodb-sandbox

Step 6:

Run ahanaio/prestodb-sandbox locally in the shell and create an image named coordinator, which will later be deployed on the container cluster.

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name coordinator ahanaio/prestodb-sandbox
391aa2201e4602105f319a2be7d34f98ed4a562467e83231913897a14c873fd0

Step 7:

Edit the etc/config.properties file inside the container and set the node-scheduler.include-coordinator property to false. Now restart the coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker exec -i -t coordinator bash                                                                                                                       
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker restart coordinator
coordinator

Step 8:

Now do a docker commit and create a tag called coordinator based on the image ID; this will create a new local image called coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker commit coordinator
Sha256:46ab5129fe8a430f7c6f42e43db5e56ccdf775b48df9228440ba2a0b9a68174c

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED          SIZE
<none>                     <none>    46ab5129fe8a   15 seconds ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago     1.81GB

user @cloudshell:~ (weighty-list-324021)$ docker tag 46ab5129fe8a coordinator

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED              SIZE
coordinator                latest    46ab5129fe8a   About a minute ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago         1.81GB

Step 9:

Create tag with artifacts path and copy it over to artifacts location

user@cloudshell:~ docker tag coordinator:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

user@cloudshell:~ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

Step 10:

Deploy the coordinator into the cloud container using the below kubectl commands.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment coordinator --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1
deployment.apps/coordinator created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment coordinator --name=presto-coordinator --type=LoadBalancer --port 8080 --target-port 8080
service/presto-coordinator exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)          AGE
kubernetes           ClusterIP      10.7.240.1    <none>          443/TCP          41m
presto-coordinator   LoadBalancer   10.7.248.10   35.239.88.127   8080:30096/TCP   92s

Step 11:

Copy the external IP into a browser and check the status.


Step 12:

Now to deploy worker1 into the GKE container, again start a local instance named worker1 using the docker run command.

user@cloudshell:~ docker run -d -p 8080:8080 -it --name worker1 coordinator
1d30cf4094eba477ab40d84ae64729e14de992ac1fa1e5a66e35ae553964b44b
user@cloudshell:~

Step 13:

Edit worker1 config.properties inside the worker1 container to set coordinator to false and http-server.http.port to 8081. Also the discovery.uri should point to the coordinator container running inside the GKE container.

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker1  bash                                                                                                                             
bash-4.2# vi etc/config.properties
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://presto-coordinator01:8080

Step 14:

Stop the local worker1 container, commit the worker1 as image and tag it as worker1 image

user@cloudshell:~ (weighty-list-324021)$ docker stop worker1
worker1
user@cloudshell:~ (weighty-list-324021)$ docker commit worker1
sha256:cf62091eb03702af9bc05860dc2c58644fce49ceb6a929eb6c558cfe3e7d9abf
ram@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
<none>                                                                <none>    cf62091eb037   6 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag cf62091eb037 worker1:latest
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
worker1                                                               latest    cf62091eb037   2 minutes ago   1.81GB

Step 15:

Push the worker1 image into google artifacts location

user@cloudshell:~ (weighty-list-324021)$ docker tag worker1:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1]
b12c3306c4a9: Pushed
.
.
v1: digest: sha256:fe7db4aa7c9ee04634e079667828577ec4d2681d5ac0febef3ab60984eaff3e0 size: 2201

Step 16:

Deploy and expose the worker1 from the artifacts location into the google cloud container using this kubectl command.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker01 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
deployment.apps/presto-worker01 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker01 --name=presto-worker01 --type=LoadBalancer --port 8081 --target-port 8081
service/presto-worker01 exposed

Step 17:

Check presto UI for successful deployment of worker1


Step 18:

Repeat steps 12 to steps 17 to deploy worker2 inside GKE container:

  • deploy an Ahana local instance using docker and name it worker2,
  • then edit the etc/config.properties file inside the worker2 container to set coordinator to false, the port to 8082, and discovery.uri to the coordinator container name,
  • shut down the instance, then commit that instance and create a docker image named worker2,
  • push that worker2 image to the Google artifacts location,
  • use kubectl commands to deploy and expose the worker2 instance inside a Google container,
  • check the prestodb UI for successful deployment of worker2.
user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker2 worker1
32ace8d22688901c9fa7b406fe94dc409eaf3abfd97229ab3df69ffaac00185d
user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker2 bash
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8082
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker commit worker2
sha256:08c0322959537c74f91a6ccbdf78d0876f66df21872ff7b82217693dc3d4ca1e
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    08c032295953   11 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 08c032295953 worker2:latest

user@cloudshell:~ (weighty-list-324021)$ docker commit worker2
Sha256:b1272b5e824fdebcfd7d434fab7580bb8660cbe29aec8912c24d3e900fa5da11

user@cloudshell:~ (weighty-list-324021)$ docker tag worker2:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2]
aae10636ecc3: Pushed
.
.
v1: digest: sha256:103c3fb05004d2ae46e9f6feee87644cb681a23e7cb1cbcf067616fb1c50cf9e size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker02 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
deployment.apps/presto-worker02 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker02 --name=presto-worker02 --type=LoadBalancer --port 8082 --target-port 8082
service/presto-worker02 exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE
kubernetes             ClusterIP      10.7.240.1     <none>           443/TCP          3h35m
presto-coordinator01   LoadBalancer   10.7.241.37    130.211.208.47   8080:32413/TCP   49m
presto-worker01        LoadBalancer   10.7.255.27    34.132.29.202    8081:31224/TCP   9m15s
presto-worker02        LoadBalancer   10.7.254.137   35.239.88.127    8082:31020/TCP   39s

Step 19:

Repeat steps 12 to 18 to provision worker3 inside the Google cloud container.

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker3 worker1
6d78e9db0c72f2a112049a677d426b7fa8640e8c1d3aa408a17321bb9353c545

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker3 bash                                                                                                                              
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8083
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
Exit

user@cloudshell:~ (weighty-list-324021)$ docker commit worker3
sha256:689f39b35b03426efde0d53c16909083a2649c7722db3dabb57ff0c854334c06
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    689f39b35b03   25 seconds ago   1.81GB
ahanaio/prestodb-sandbox                                                latest    76919cf0f33a   37 hours ago     1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 689f39b35b03 worker3:latest

user@cloudshell:~ (weighty-list-324021)$ docker tag worker3:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3]
b887f13ace4e: Pushed
.
.
v1: digest: sha256:056a379b00b0d43a0a5877ccf49f690d5f945c0512ca51e61222bd537336491b size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker03 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
deployment.apps/presto-worker03 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker03 --name=presto-worker03 --type=LoadBalancer --port 8083 --target-port 8083
service/presto-worker03 exposed




Step 20:

Verify the prestodb environment by running the prestodb CLI with simple JMX query

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator

Summary

In this tutorial you learned how to  provision and run prestodb inside Google VM instances and on GKE containers. Now you should be able to validate the functional aspects of prestodb. 

If you want to run production Presto workloads at scale and performance, check out https://www.ahana.io which provides a managed service for Presto.

Presto functions: mathematical, operators and aggregate

Data Lakehouse

Presto offers several classes of mathematical functions that operate on single values and mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operate on a set of values to compute a single result.

The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category and we’ll discuss them separately. The statistical functions are quite sparse with two functions that compute the lower and upper bound of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g. sin, cos, tan, etc.). The floating point functions are really functions that handle not-a-number and infinite use cases.

The mathematical subcategory further falls into another layer of classification:

  1. Functions that perform coarser approximation, such as rounding and truncation: abs, ceiling (ceil), floor, round, sign, truncate
  2. Conversions: degrees, radians, from_base, to_base
  3. Exponents, logarithms, roots: exp, ln, log2, log10, power (pow), cbrt, sqrt
  4. Convenient constants, such as pi(), e(), random (rand)
  5. Cumulative distribution functions (and inverses): binomial_cdf, inverse_binomial_cdf, cauchy_cdf, inverse_cauchy_cdf, chi_squared_cdf, inverse_chi_squared_cdf, normal_cdf, inverse_normal_cdf, poisson_cdf, inverse_poisson_cdf, weibull_cdf, inverse_weibull_cdf, beta_cdf, inverse_beta_cdf, width_bucket
  6. Miscellaneous: mod, cosine_similarity

The mathematical operators are basic arithmetic operators, such as addition (+), subtraction (-), multiplication (*), and modulus (%).

Let’s apply these mathematical functions in an example. In the following query, we have a floating-point column x to which we apply several mathematical functions representative of the subcategories discussed previously, including: radians (conversion), natural log, the Normal CDF, modulo, a random number, and operators.

select
  x,
  radians(x) as radians_x,              /* convert to radians */
  ln(x) as ln_x,                        /* natural log */
  normal_cdf(0, 30, x) as normal_cdf_x, /* Normal CDF */
  mod(x, 2) as mod_x_2,                 /* Modulo 2 */
  random() as r,                        /* Random number */
  3 * ((x / 2) + 2) as formula          /* Formula using operators */
from
  example;

The following is the output of the above query, with some rounding for ease of viewing.

Output of query with rounding

So far, we have seen that mathematical functions, as they are classified in Presto, operate on single values: given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.

Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategory separately.

The bitwise aggregate functions are two functions that return the bitwise AND and bitwise OR of all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are standard summary statistic functions you would expect, such as stddev, variance, kurtosis, and skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive modelling and model binary differential entropy, respectively.
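As a quick illustration, here is a minimal sketch applying a few of the statistical aggregates to the same hypothetical example table with the floating-point column x used throughout this post:

select
  stddev(x) as stddev_x,
  variance(x) as variance_x,
  skewness(x) as skewness_x,
  kurtosis(x) as kurtosis_x
from
  example;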

The general subcategory further falls into another layer of classification:

  1. Common summarizations: count, count_if, min, max, min_by, max_by, sum, avg, geometric_mean, checksum
  2. Boolean tests: bool_or, bool_and, every
  3. Data structure consolidation: array_agg, set_agg, set_union
  4. Miscellaneous: reduce_agg, arbitrary

Again, let’s apply these aggregate functions in a few representative examples. In the following query, we apply some basic aggregations to our floating-point column x.

select
	sum(x) as sum_x,
	count(x) as count_x,
	min(x) as min_x,
	max(x) as max_x,
	avg(x) as avg_x,
	checksum(x) as ckh_x
from
	example;

The following is the output of the above query.

SQL query output

In the following query, we showcase a boolean test with the bool_or function. We know that the natural log returns NaN for negative values of x. So, if we apply the is_nan check, we expect is_nan(x) to always be false, but is_nan(ln(x)) to occasionally be true. Finally, if we apply the bool_or aggregation to our is_nan results, we expect the column derived from x to be false (i.e. no true values at all) and the column derived from ln(x) to be true (i.e. at least one true value). The following query and accompanying result illustrate this.

with nan_test as (
	select
		is_nan(x) as is_nan_x,
		is_nan(ln(x)) as is_nan_ln_x
	from
		example
)
select
	bool_or(is_nan_x) as any_nan_x_true,
	bool_or(is_nan_ln_x) as any_nan_ln_x_true
from
	nan_test;

Presto SQL query output

This final example illustrates data structure consolidation, taking the x and radians(x) columns and creating a single row with a map data structure.

with rad as (select x, radians(x) as rad_x from example)
select map_agg(x, rad_x) from rad;

query output

The approximate aggregate functions provide approximate results for aggregations over large data sets, such as distinct values (approx_distinct), percentiles (approx_percentile), and histograms (numeric_histogram). In fact, we have a short answer post on how to use the approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog and KHyperLogLog.
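Here is a minimal sketch against the same example table; the percentile value and bucket count are arbitrary choices for illustration:

select
  approx_distinct(x) as approx_distinct_x,
  approx_percentile(x, 0.5) as approx_median_x,
  numeric_histogram(10, x) as histogram_x
from
  example;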

A natural extension to aggregate functions are window functions, which perform calculations across rows of a query result. In fact, all aggregate functions can be used as window functions by adding an OVER clause. One popular application of window functions is time-series analysis. In particular, the lag window function is quite useful. We have a short answer post on how to use the lag window function and to compute differences in dates using the lag window function.
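For instance, here is a minimal sketch that reuses the sum and avg aggregates as window functions over the example table; the empty OVER () clause makes the whole result set the window, so each row carries the overall total and average alongside its own value:

select
  x,
  sum(x) over () as sum_x_all_rows,
  avg(x) over () as avg_x_all_rows
from
  example;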

This short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operators and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Ahana Cofounder Will Co-lead Session At OSPOCon 2021 About Presto SQL Query Engine

San Mateo, Calif. – September 21, 2021 — Ahana, the SaaS for Presto company, today announced that its Cofounder and Chief Product Officer Dipti Borkar will co-lead a session with Facebook Software Engineer Tim Meehan at OSPOCon 2021 about Presto, the Facebook-born open source high performance, distributed SQL query engine. The event is being held September 27-30 in Seattle, WA and Virtual.

Session Title: “Presto – Today and Beyond – The Open Source SQL Engine for Querying All Data Lakes.”

Session Time: Wednesday, September 29 at 3:55pm – 4:45pm PT

Session Presenters: Ahana Cofounder and Chief Product Officer and Presto Foundation Chairperson, Outreach Team, Dipti Borkar; and Facebook Software Engineer and Presto Foundation Chairperson, Technical Steering Committee, Tim Meehan. 

Session Details: Born at Facebook, Presto is an open source high performance, distributed SQL query engine. With the disaggregation of storage and compute, Presto was created to simplify querying of all data lakes – cloud data lakes like S3 and on premise data lakes like HDFS. Presto’s high performance and flexibility has made it a very popular choice for interactive query workloads on large Hadoop-based clusters as well as AWS S3, Google Cloud Storage and Azure blob store. Today it has grown to support many users and use cases including ad hoc query, data lake house analytics, and federated querying. This session will give an overview on Presto including architecture and how it works, the problems it solves, and most common use cases. Dipti and Tim will also share the latest innovation in the project as well as the future roadmap.

To register for Open Source Summit + Embedded Linux Conference + OSPOCon 2021, please go to the event registration page to purchase a registration.

TWEET THIS: @Ahana to present at #OSPOCon 2021 about Presto #Presto #OpenSource #Analytics #Cloud https://bit.ly/3AnfAMl

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the workload profile feature in Ahana Cloud

Ahana Cloud for Presto is the first fully integrated, cloud-native managed service that simplifies the work of cloud and data platform teams. With the managed Presto service, we provide a lot of tuned configurations out of the box for Ahana customers.

We’re excited to announce that the workload profile feature is now available on Ahana Cloud. With this release, users can create a cluster with a validated set of configurations that suits the type of workloads or queries users plan to run.

Today, Presto clusters are configured with default properties that work well for generic workloads. However, achieving workload-specific tuning and resource allocation requires a good understanding of Presto’s resource consumption for that workload. Further, to change or add any property, we first update the configuration file and then restart the cluster. This means data platform users spend days, and in many cases weeks, iterating, evaluating, and experimenting with config tuning to reach the ideal configuration for their workloads. To solve this pain point and deliver predictable performance at scale, Ahana Cloud lets users select a tuned set of properties for their desired workloads with a single click.

Here is the short demo of creating Presto Cluster with a workload profile:

Concurrent queries are simply the number of queries executing at the same time in a given cluster. To simplify this experience we have classified workloads based on the number of concurrent queries and curated a set of tuned session properties for each profile.

Low concurrency is useful for clusters that run a limited number of queries or a few large, complex queries. It also supports bigger and heavier ETL jobs.

High concurrency is better for running multiple queries at the same time. For example, dashboard and reporting queries or A/B testing analytics, etc.

This setting can be changed after the cluster has been created, and a cluster restart is not required. However, the change will only apply to new queries. The following is a short demo of how you can change these profiles on running clusters.

This feature is the beginning of auto-tune capabilities for workload management. We are continuously innovating Ahana Cloud for our customers and to deliver a seamless, easy experience for data teams looking to leverage the power of Presto. Please give it a try and log in to the Ahana Cloud console to get started. We have a free trial as well that you can sign up for today.


Webinar On-Demand
Presto on AWS: Exploring different Presto services

Presto is a widely adopted distributed SQL engine for data lake analytics. Running Presto in the cloud comes with many benefits – performance, price, and scale are just a few. To run Presto on AWS, there are a few services you can use to do that: EMR Presto, Amazon Athena, and Ahana Cloud.

In this webinar, Asif will discuss these 3 approaches, the pros and cons of each, and how to determine which service is best for your use case. He’ll cover:

  • Quick overview of EMR Presto, Athena, and Ahana
  • Benefits and limitations of each
  • How to pick the best approach based on your needs

If you’re using or evaluating Presto today, register to learn more about running Presto in the cloud.

Speaker

Asif Kazi
Principal Solutions Engineer, Ahana


Ahana Joins AWS ISV Accelerate Program to Expand Access to Its Presto Managed Service for Fast SQL on Amazon S3 Data Lakes


Ahana also selected into the invite-only AWS Global Startup Program

San Mateo, Calif. – September 14, 2021 — Ahana, the Presto company, today announced it has been accepted into the AWS ISV Accelerate Program. Ahana Cloud for Presto was launched in AWS Marketplace in December. As a member of the AWS ISV Accelerate Program, Ahana will be able to drive new business and accelerate sales cycles by co-selling with AWS Account Managers who are the trusted advisors in most cases.

Ahana has also been selected into the AWS Global Startup Program, an invite-only, go-to-market program built to support startups that have raised institutional funding, achieved product-market fit, and are ready to scale.

“Traditional warehouses were not designed to hold all of a company’s data. Couple that with rising compute and storage costs associated with the warehouse, and many companies are looking for an alternative,” said Steven Mih, Cofounder and CEO, Ahana, “Open source Presto is that alternative, making it easy to run SQL queries on Amazon S3 data lakes at a much lower price-point. Using Presto as a managed service for open data lake analytics makes it easy to use SQL on Amazon S3, freeing up data platform teams for mission critical, value-add work. Ahana’s acceptance into the AWS ISV Accelerate Program and AWS Global Startup Program will allow us to be better aligned with AWS Account Managers who work closely with AWS customers, to drive adoption of Ahana Cloud for Presto and help more organizations accelerate their time to insights.”

Securonix uses Ahana Cloud for Presto for SQL analytics on their Amazon S3 data lake. They are pulling in billions of events per day, and that data needs to be searched for threats. With Ahana Cloud on AWS, Securonix customers can identify threats in real-time at a reasonable price.

“Before Presto we were using a Hadoop cluster, and the challenge was on scale…not only was it expensive but the scaling factors were not linear,” said Derrick Harcey, Chief Architect at Securonix. “The Presto engine was designed for scale, and it’s feature-built just for a query engine. Ahana Cloud on AWS made it easy for us to use Presto in the cloud.”

Ahana’s acceptance into the AWS ISV Accelerate Program enables the company to meet customer needs through collaboration with the AWS Sales organization. Collaboration with the AWS Sales team enables Ahana to provide better outcomes to customers.

Supporting Resources

Learn more about the AWS ISV Accelerate Program

TWEET THIS: @Ahana joins AWS ISV Accelerate Program, AWS Global Startup Program  https://bit.ly/3A9JdR0 #Presto #OpenSource #Analytics #Cloud

About Ahana

Ahana, the Presto company, offers managed service for Presto with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter, and thousands more, is a standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana 101: An introduction to Ahana Cloud for Presto on AWS, SaaS for Presto on AWS

Webinar On-Demand

Presto is the fastest growing query engine used by companies like Facebook, Uber, Twitter and many more. While powerful, Presto can be complicated to run on your own especially if you’re a smaller team that may not have the skillset.

That’s where Ahana comes in. Ahana Cloud is SaaS for Presto, giving teams of all sizes the power to deploy and manage Presto on AWS. Ahana takes care of hundreds of deployment and management configurations of Presto including attaching/detaching external data sources, configuration parameters, tuning, and much more.

In this webinar Ram will discuss why companies are using Ahana Cloud for their Presto deployments and give an overview of Ahana including:

  • The Ahana SaaS console
  • How easy it is to add data sources like AWS S3 and integrate catalogs like Hive
  • Features like Data Lake Caching for 5x performance and autoscaling

Speaker

Ram Upendran
Technical Product Marketing Manager, Ahana


Presto 101: An introduction to open source Presto

Webinar On-Demand

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Dipti will introduce the Presto technology and share why it’s becoming so popular – in fact, companies like Facebook, Uber, Twitter, Alibaba, and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We’ll also show a quick demo of getting Presto running in AWS.

Speaker

Dipti Borkar
Cofounder and Chief Product Officer, Ahana


What is a Presto lag example?

Data Lakehouse

The Presto lag function is a window function that returns the value at a given offset before the current row in a window. One common use case for the lag function is time series analysis, such as autocorrelation.

Figure 1 shows the advert table of sales and advertising expenditure from Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons: New York. The advert column is the monthly advertising expenditure, and the sales column is the monthly sales volume.


A simple analysis could be to track the difference between the current month’s sales volume and the previous one, which is shown in Figure 2. The lag_1_sales column is a single period lagged value of the sales column, and the diff column is the difference between sales and lag_1_sales. To generate the table in Figure 2, we can use the lag function and the following query:

select
  advert,
  sales,
  lag_1_sales,
  round(sales - lag_1_sales,2) as diff
from (
  select
    advert,
    sales,
    lag(sales, 1) over(range unbounded preceding) as lag_1_sales
  from advert
);

The subquery uses the lag function to get the value of the sales column one period before the current row, where the OVER clause syntax specifies the window. The main query then computes the diff column. Here are a couple of additional useful notes about the lag function:

  1. You can change the offset with the second argument lag(x, OFFSET), where OFFSET is any scalar expression. Offsets start at 0, which is the current row; the default offset is 1 (the previous row).
  2. By default, if an offset value is null or outside the specified window, a NULL value is used. We can see this in the first row of the table in Figure 2. However, the value to use in these cases is configurable with an optional third argument lag(x, OFFSET, DEFAULT_VALUE), where DEFAULT_VALUE is the desired value. A short sketch combining both options follows this list.
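Here is a minimal sketch against the same advert table, using a two-period offset and a default of 0.0 instead of NULL; the window specification simply mirrors the query above:

select
  advert,
  sales,
  lag(sales, 2, 0.0) over (range unbounded preceding) as lag_2_sales_or_zero
from advert;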

A closely related function is the lead function, which returns the value at an offset after the current row.

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.


SQL on the Data Lake, Using open source Presto to unlock the value of your data lake

Webinar On-Demand

While data lakes are widely used and have become extremely affordable as data volumes have grown, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine, or more commonly the SQL engine, that runs on top of a data lake.

In this webinar, Dipti will discuss why open source Presto has quickly become the de-facto query engine for the data lake. Presto enables ad hoc data discovery where you can use SQL to run queries whenever you want, wherever your data resides. With Presto, you can unlock the value of your data lake.

She will cover:

  • An overview of Presto and why it emerged as the best engine for the data lake
  • How to use Presto to run ad hoc queries on your data lake
  • How you can get started with Presto on AWS S3 today

Speaker

Dipti Borkar
Cofounder and Chief Product Officer, Ahana


How do I get the date_diff from previous rows?

To find the difference in time between consecutive dates in a result set, Presto offers window functions. Take the example table below which contains sample data of users who watched movies.

Example:

select * from movies.ratings_csv limit 10;

Sample query output (screenshot)

select userid, date_diff('day', timestamp, lag(timestamp) over (partition by userid order by timestamp desc)) as timediff from ratings_csv order by userid desc limit 10;

Query output (screenshot)

The lag(x, OFFSET) function fetches the value of column x at OFFSET rows before the current row, and date_diff then calculates the difference between the two timestamps. When no offset is provided, the default is 1 (the previous row). Notice that the first row in timediff is NULL because there is no previous row.
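If the NULL in the first row is undesirable, one option is to wrap the difference in coalesce so rows with no previous value default to 0. This is a minimal sketch against the same ratings_csv table, with column names assumed from the query above:

select
  userid,
  coalesce(
    date_diff('day', timestamp, lag(timestamp) over (partition by userid order by timestamp desc)),
    0
  ) as timediff
from ratings_csv
order by userid desc
limit 10;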


Webinar On-Demand
Data Warehouse or Data Lake, which one do I use?

Slides

Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). 

There are pros and cons to each approach. While data warehouses give you strong data management with analytics, they don’t handle semi-structured and unstructured data well, they tightly couple storage and compute, and they often come with expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.

In this webinar, you’ll hear from industry analyst John Santaferraro and Ahana cofounder and CPO Dipti Borkar who will discuss the data landscape and how many companies are thinking about their data warehouse/data lake strategy. They’ll share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lake.


Webinar Transcript

SPEAKERS

John Santaferraro | Industry Analyst, Dipti Borkar | CPO & Co-Founder Ahana, Ali LeClerc | Moderator, Ahana

Ali LeClerc | Ahana 

Hi, everybody, welcome to today’s webinar, Data Warehouse or Data Lake, or which one do I use? My name is Ali, and I will be moderating the event today. Before we get started, and I introduce our wonderful speakers, just a few housekeeping items, one, this session is being recorded. If you miss any parts of it, if you join late, you’ll get a link to both the recording and the slides that we are going through today. Second, we have allotted some time for questions at the end. So please feel free to pop in your questions, there is a questions tab in your GoToWebinar panel. You can go ahead, ask away during the session itself, and we’ll get to them at the end.

So, without further ado, I am pleased to introduce our two speakers, we have John Santaferraro and Dipti Borkar. John is an industry analyst who has been doing this for over 26 years and has a ton of experience in the space. So, looking forward to hearing his perspective. And then we’re also joined by Dipti. Dipti Borkar is the co-founder and CPO of Ahana and has a ton of experience in relational and non-relational database engines as well as the analytics market.

Today they’re going to be talking about data warehouse or data lake. So, with that, John, I will throw things over to you, please take it away.

John Santaferraro | Industry Analyst 

Awesome. Really excited to be here. This topic is top of mind, I think for everyone, we’re going to take a little look at history, you know, where did data lakes and data warehouses start? How have they been modernized? What’s going on in this world where they seem to be merging together? And then give you some guidance on what are use cases for these modern platforms? How do you know, how do you choose a data lake or a data warehouse? Which one do you choose? So, we’re going to provide some criteria for that, we’re going to look at the Uber technical case study and answer any questions that you guys have.

So, I’m going to jump right in I actually got into the data warehousing world all the way back in 1995. I co-founded a data warehouse startup company and eventually that sold to Teradata. Right now, I’m thinking back to those times. And really, the whole decade after that the traditional data warehouse was it was a relational database. Typically, with a columnar structure, although some of the original data warehouses didn’t have that, they had in database analytics for performance focused really only on structured data. The data had to be modeled. And data modeling, for a lot of folks was an endless task. And there was the whole ETL process was 70% of every project, extracting from all your source systems, transforming it, loading it into the data warehouse. There was primarily SQL access, and these data warehouses tended to be a few sources, one or two outputs, but they were expensive, slow, difficult to manage. They provided access to limited data. So, there were a lot of challenges, a lot of benefit as well, but a lot of challenges with the traditional data warehouses. So, the data lakes came along and initially Hadoop, you’ll remember this, was going to replace the data warehouse, right?

I remember reading articles about how the data warehouse is dead. This was the introduction of Hadoop with its file system data storage, suddenly, it was inexpensive to load data into the data lake. So, all data went in there, including semi-structured data, unstructured data, it was all about ingestion of data motored in [inaudible] the structure, once it had been loaded. Don’t throw anything out. Primary use cases were for discovery, text analytics, data science. Although there was some SQL access, initially, notebooks and Python, and other languages became the primary way to access. These data lakes were less expensive, but there was limited performance on certain kinds of complex analytics. Most of the analytics folks focused on unstructured data. There was limited SQL access, and they tended to be difficult to govern. Hadoop initially didn’t have all of the enterprise capabilities.

You know, Dipti your around through a lot of that what are your some of your memories about data lakes when they first showed up on the scene?

Dipti Borkar | Ahana 

Yeah. It’s great to do this with you, John, we’ve been at this for a while. I started my career in traditional databases as well, DB2 distributed, core storage and indexing kernel engineering. And we saw this entire movement of Hadoop. What it helped with is, in some ways, the separation of storage and compute. For the first time, it really separated the two layers where storage was HDFS, and then compute went through many generations. Even just in the Hadoop timeframe there was MapReduce, Hive, variations of Hive and so on. But what happened is, you know, I feel like the companies, the leaders that were driving Hadoop, never really simplified it for the platform teams.

Technology is great to build, but if it’s complicated, and if it takes a long time to get value from, no matter how exciting it is, it doesn’t serve its purpose. And that was the biggest challenge with Hadoop, there were 70 different projects, that took six to nine months to integrate into, and to see real value or insights from the data in HDFS, and many of these projects didn’t go well. And that’s why over time, people struggled with it, we were, we’ll talk a little bit about cloud and how the cloud migration is playing such a big role in the in the modernization of these data lakes. So just some perspectives there.

John Santaferraro | Industry Analyst 

Yeah, you know, you just you reminded me as well, the other positive thing is that I think that we had seen open source as an operating system. And with the introduction of Hadoop, there was a massive uptake of acceptance and adoption around open source technology as well. So that was another real positive during that time. So, what we’ve what we’ve seen since the inception of the data warehouse, and you know, the incursion of Hadoop into the marketplace and the data lake, we’ve seen a very rapid modernization of those platforms driven by three things.

One is digital transformation, everything has moved to digital now, especially, massive uptake of mobile technology, internet technology is way more interactive and engaging than when it used to be informational, and tons of more data and data types. Along with that, there is an increasing need for engagement with customers and real time engagement with employees, engagement with partners, everything is moving closer and closer to either just in time or real time. And so that’s created the need to be able to respond quickly to business events of any kind. And I think third, we’re really on the cusp of seeing everything automated.

Obviously, we in the world of robotics, there are there are massive manufacturing plants, where everything is now automated. And that’s being transferred over to this world of robotic process automation. In order to automate everything that requires incredible intelligence delivered to machines, and sensors, and all kinds of, you know, every kind of device that you can imagine on the internet of things in order to automate everything. And so, these, these trends have really pushed us to the modernization of both the data warehouse and the data lake.

And interestingly enough, you can look at the slide that I popped up, but modernization is happening in all of these different areas, in both the data warehouse and the data lake. The most modern of both are cloud first. There is a there was a move to in-memory capabilities. On the Data Warehouse side, they’re now bringing in more complex data types that were typically only handled on the data lake and the modern data lake is bringing in columnar data types and with great performance. Now both have the separation of compute and storage. So, you can read the rest of them here. The interesting thing about the modernization movement is that that both the data warehouse and the data lake are being modernized.

What trends are you seeing in modernization, Dipti? I kind of tend to approach this at a high level looking at capabilities. I know you see the technology underneath and go deep in that direction. What’s your take on this?

Dipti Borkar | Ahana 

Yeah, absolutely. I mean, cloud first is really important. There’s a lot of companies that are increasingly just in the cloud, many are born in the cloud, like Ahana, but also their entire infrastructure is in the cloud. It could be multiple clouds, it could be a single cloud. That’s one of the you know, one of the aspects. The other aspect is within on clouds. containerization. A very, very big trend. A few years ago, Kubernetes wasn’t as stable. And so now today, the way we’ve built Ahana as cloud first, and it runs completely on Kubernetes. And it’s completely containerized. To help with the flexibility of the cloud and the availability, the scalability and leveraging some of those aspects. I think the other aspect is open formats.

Open formats are starting to play a big role. With the data lake, and I call it open data lakes, for a variety of reasons. Open formats is a big part of it. Formats like Apache ORC, Apache Parquet, they can be consumed by many different engines. In one way, you’re not locked into a specific technology, you can actually move from one engine to another, because many of them support it. Spark supports it, Presto supports it, TensorFlow just added some support as well. With a data lake, you can have open formats, which are highly performant, and have multiple types of processing on top of it. So, these are some of the trends that I’m seeing, broadly, on the data lake side.

And, of course, the data warehouses are trying to expand and extend themselves to the data lake as well. But what happens is, when you have a core path, a critical path for any product, it’s built for a specific format, or a specific type of data. We’ve seen that with data warehouses, most of the time it’s proprietary formats. And S3, and these cloud formats, might be an extension. And for data lakes, the data lake engines, are actually built for the open formats, and not for some of these proprietary formats.

These are some of the decisions and considerations that users need to think about in terms of what’s important for them. What kind of information they want to store – overtime, historical data – in their in their data lake or data warehouse? And how open do they want it to be?

John Santaferraro | Industry Analyst 

Yeah, I think you bring up a good point too, in that the modernization of the data lake has really opened up the opportunity for storage options. And specifically, lower cost storage options and storage tiering. So that in that environment, customers can choose where they want to store their data. If they need high performance analytics, then it goes in optimized storage of some kind. If what they need is massive amount of data, then they can store still in file systems. But the object storage, simple storage options are much less costly, and I think we’re I think we’re actually moving towards a world where, at some point in the future, companies will be able to store their data inexpensively, in one place, in one format, and use it endless number of times.

I think that’s the direction that things are going as we look at modernization.

Dipti Borkar | Ahana 

Absolutely. Look at S3, as the cloud, the most widely used, cloud store, it’s 15 years in the making. So, trillions of objects that are in S3. And now that it’s ubiquitous, and it’s so cheap, most of the data is landing there. So that’s the first place the data lands. And users are thinking about okay, once it lands there, what can I do with it? Do I have to move it into another system for analysis? And that might be the case as you said, there will be extremely low latency requirements, in some cases, where it might need to be in a warehouse.

Or it might need to be – of course, you know, operational systems will always be there – here we’re talking about analytics. And what other processing can I run directly on top of S3 and on top of these objects? Without moving the data around. So that I get the cost benefits, which, which AWS has driven through, it’s very cheap to store data now, and so can I have compute on top? Essentially to do structured analysis to do semi-structured analysis, or even unstructured analysis with machine learning, deep learning and so on.

S3 and the cloud migration, I would say, has played a massive role in, in this in the adoption of data lakes, and the move towards the modern data lake that you have here.

John Santaferraro | Industry Analyst 

So, you at Ahana, you guys talk about this move from data to insight and the idea of the SQL query engine. Do you want to walk us through this Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I touched on some of the different types of processing that’s possible on top of data lakes. One of those workloads is the SQL workload. Data warehouses and data lakes are sitting next to each other. In the data warehouse, obviously you have your storage and your compute. And typically, these are in the tens of terabytes; most data warehouses tend to be along that dimension of scale. But as the amount of data has increased, and the types of information have increased, some of them are contributing a lot more data: IoT data, device data, third party data, behavioral data. It used to be just enterprise data, orders and line items; when you look at a benchmark like TPC-DS, it’s very, very simple, it’s enterprise data. But now we have a lot more data. And that is leading to all of this information going into the data lake. And the terabytes are now becoming petabytes, even for small companies. So that’s where the cost factor becomes very, very important.

Lower costs are what users are looking for, with infrastructure for workloads on top of that. Presto has come up as one of the really great engines for SQL processing on top of data lakes. It can also query other data sources like MySQL on RDS, and so on. But Presto came out of Facebook as a replacement for Hive, which was essentially built for the data lake. And so reporting and dashboarding are great use cases on top of Presto, interactive use cases, I would say. There’s also an ad hoc querying use case that’s increasing. Most often, we’re seeing this with SQL notebooks – with Jupyter or Zeppelin and others – and then there are also data transformation workloads that run on top of the data lakes. Presto is good for that. But there are other engines like Spark, for example, that actually do a great job.

They’re built for ETL, or in-data-lake transformation, and they play a big role in these workloads that run on top of the data lake. So, what we’re seeing as we talk to users, data platform teams, data platform engineers, is that there are a couple of paths. If they are pre-warehouse, and I call them kind of pre-warehouse users, they’re not on a data warehouse yet. They’re still perhaps running Tableau or Looker on MySQL or Postgres, and you now have a choice, for the first time, where you can actually run some of these workloads on a data lake for slightly lower costs, because data warehouses could be cost prohibitive. Another approach is to augment the data warehouse. So you start off with a data warehouse, you might have some critical data in there where you need very low latencies. It is a tightly coupled system, and so you’re going to get good performance. If you don’t need extreme performance, if you don’t have that as a criterion, the data lake option becomes a very real option today, because the complexity of Hadoop has sort of disappeared.

And there are now much simpler solutions that exist from a transformation perspective, like managed services for Spark, as well as, from an interactive querying and ad hoc [inaudible] perspective, managed services for Presto. That’s what we’re seeing: in some cases users may skip the warehouse, and in some cases they may augment it and have some data in a warehouse and some data in a data lake. Thoughts on that, John?

John Santaferraro | Industry Analyst 

Yeah, I mean, just to confirm, I think the diagram that you are presenting here shows Presto kind of above the cloud data lake. But there could be another version of this. If somebody has a data warehouse, and they don’t want to rip and replace and go to an open source data warehouse, Presto sits above both the data lake and the traditional data warehouse. So it can unify access for those same tools above it. SQL access for both the data lake, the open source data warehouse and the columnar data warehouse, isn’t that correct?

Dipti Borkar | Ahana 

Absolutely. And I think we’re at a point where we just have to accept that there is a proliferation of databases and data sources, and there will be many. There is a time element where, you know, not all the data may be in the data lake. And so for those use cases, federation, or querying across data sources, is how you can correlate data across different data sources. So if the data is in, not just a data warehouse, but let’s say Elasticsearch, where you’re doing some analytics, and it has not landed in the data lake yet, or it’s in a data warehouse and has not landed in the data lake yet and the pipelines are still running, you can have a query that runs across both, and have unified access to your data lake as well as your database, your data warehouse, or semi-structured system of record as well.

John Santaferraro | Industry Analyst 

Awesome. Yes. So one of the things that I want to do for you as an audience, is help you understand what kinds of use cases are best for which platform. The modern data lake and the modern data warehouse. Because it is not one size fits all. And so what you see here is actually very similar things on both sides, very similar use cases. But I tried to rank them in order. And at this point, this is not researched. I do research as well. But this happens to be based on my expertise in this area. And, Dipti, feel free to jump in and agree or disagree. Guess what, hey, we’re on a webinar, but we can disagree, right? So on the data lake side, again, one through eight, high performance data intensive kinds of workloads.

We’re going to talk about a story where there is hundreds of petabytes of information and looking at exabyte scale in the next, probably in the next year or so. That is definitely going to happen on the modern data lake not on the modern data warehouse. The data warehouse on the other side, the modern data warehouse, super high compute intensive kinds of workloads with complex analytics, and many joins, may still work better on the modern data warehouse, both have lower cost storage. Again, back to the data lake side, massive scale, well suitable for many different kinds of data types – structured and unstructured – diversity of the kinds of analytics that you want to run.

And then as you get down toward the bottom, you know, things like high concurrency of analytics, you can see it’s up higher on the right-hand side, where you, again, with a columnar database, or you may be able to support higher levels of concurrency. Now, all of this is moving, by the way, because the modern data lakes are working on “how do I drive up concurrency?” They know they’ve got to do that. I would say that because databases have been around a little bit longer, some of the modern data warehouses have more built-in enterprise capabilities. Things like governance and, and other capabilities. But guess what? All of that is rising on the modern data lake side.

So, from my perspective, this is this is my best guess based on 26 years of experience in this industry. All of this is a moving target because things are constantly changing. Dipti, jump in and you certainly don’t have to agree with me. Think of this as a straw man. What’s your take on use cases for the, for these two worlds? Modern data lakes and modern data warehouses?

Dipti Borkar | Ahana 

Yeah, absolutely. I’m trying to figure out where I disagree, John. But in terms of the criteria, these are some of the criteria that our users come with and say, “Look, we are, we are looking at a modern platform for analytics. We have certain criteria, we want to future proof it.” Future proofing is becoming important, because these, these are important decisions that you make for your data platform. You don’t change your data platform every other day.

A lot of these decisions are thought through very carefully, the criteria are weighed. And there are different tools for different sets of criteria. In terms of data lakes, I would say that the cost aspects and the scale aspects are probably the driving factors for the adoption of data lakes. High performance, I think, tends to be more data intensive, you’re right there. You can also run, obviously, a lot of high complexity queries as well on data lakes. Take Presto, as an example of a query engine: you can still run fairly complicated queries.

However, to your point, John, there is a lot of state of the art in the database world, 50 years of research on complex joins, optimizers, and optimizations in general, that we are actually working on to make the data lake stronger and get it on par with the data warehouse. Depending on the kind of queries that are run, what we’re seeing is that simple queries, you know, with predicate pushdown, simple predicates, etc., run really great on the lake. There might be areas where the optimizer may not be as capable of figuring out what is the right way to reorder joins, for example, where there’s work that’s going on. So I think that most of these are in line with what we’re seeing from a user perspective. The other thing that I would add is the open aspect of it. Most of the data lakes, the technologies, have emerged from internet companies. And the best part is that they open sourced it. So that has benefited all the users that now have the ability to run Presto or Spark or other things.

But from a warehouse perspective, it’s still very closed; there isn’t actually a good open source data warehouse. And as platform teams get more mature and more skilled, they are looking at ways to interact and contribute back and say, “hey, you know, this feature doesn’t exist yet. Do I wait for a vendor to build it three years from now? Or do I have the ability to contribute back?” And that’s where the open source aspect that you brought up earlier starts to play a bigger role, which is not on this list, but it’s also starting to be a big part of decision making as users and platform teams look at data lakes. They want the ability to contribute back, or at least not get locked in to some extent, and have multiple vendors or multiple people and organizations working on it together so that the technology improves. They have options and they can keep their options open.

John Santaferraro | Industry Analyst 

Yeah, yeah, great, great input. The, you know, the other trend that I’m seeing, Dipti, is the merging of the cloud data warehouse and cloud data lake. And those two worlds coming together. And I and I think that’s driven largely by customer demands, I think that there are still a lot of companies that are running a data warehouse, and they have a data lake. As we’ve talked about the modernization of both of those, and even similarities now, between them that weren’t there 10 years ago, there is a merging of the cloud data warehouses.

Customers don’t want to have to manage two different platforms, with two different sets of resources, two different sets of skill sets. It’s too much and so they want to move from two platforms to one, from two resource types to one, from self-managed, to fully managed, from complex queries joins trying to understand intelligence that requires both the data lake and the data warehouse to a simple way to be able to ask questions of both at the same time. And as a result of that, from disparate to connected intelligence, where I don’t have a separate set of intelligence, that I get out of my data warehouse and a separate set that comes out of the data lake, I have all of my data and I can amplify my insight by being able to run queries across both of those, or in a single platform, that that is able to do the work of what used to be done on the two platforms.

I’m seeing this happen from three different directions. One of them is that traditional data warehouse companies are trying to bring in more complex data types, and provide support for discovery kinds of workloads and data science. On the data lake side, great progress has been made with what you [inaudible] the Open Data Warehouse. Where you can now be able to analyze ORC and parquet files, columnar files. In the same way that you would analyze things on a on a columnar database. So there those two. And then the third, which, go Ahana, is this idea of why not? Why not take SQL, the lingua franca of all analytics, the most common language of all analytics still on the planet today, where there’s the most resources possible, and be able to run distributed queries across both, data lakes and data warehouses and bring the two worlds together. I think that I think this is a this is the direction that things are going, and Dipti, this is, this is where – kudos to Ahana for, you know, for really commercializing and providing support for and bringing into the cloud, all of the capabilities of Presto.

This is not the Ahana version of why I think this is a good idea. This is the John version. SQL access means you leverage this vast number of resources, and every company in the world, both on the technical and the business side, as people who understands and write SQL, better insight, because you’re now looking at data in the data lake in the data warehouse. Unified analytics, which means you can support more of your business use cases, with a distributed query engine. Distributed query engines means that you get to leverage your existing investment in platforms with limitless scale and for all data types. So this is this is my version of the capabilities.

Any thoughts you have on this, Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I think that these two spaces are converging, right? There’s the big convergence that’s happening. The way I see from an architecture and technology perspective, is which one do you want to bet on for the future? Where is the bulk of your data? What is what is your primary path that you want to optimize for? The reason that that’s important is, that’s what that’s that will tell you where most of your data lives. Is it 80% in the warehouse? Is it 80% in the lake? And that’s an important decision. If, and this is obviously driven by the requirements, the business requirements that you have, what we’re seeing is that, you know, you have for some, some reports or dashboards where you need really kind of very, very high, high-performance access, the data warehouse would be a good fit.

But there is an emerging trend of different kinds of analysis, some of it which we don’t even know yet, that’s emerging. And having that on in a lake, and consolidating in a lake, gives you the ability to run these future proof kind of engines, platforms, whatever tools that come out on the lake. Because the technologies that are being built on – innovation, a lot more innovation is kind of happening on the lake side of it because of the cost profile of S3, GCS and others. That becomes the fundamental decision.

The next part and the good part is, even if you choose one way or the other, and I will have a bias for you towards the lake, because I think that’s where – I was on the warehouse, I’ve spent many years of my life on the warehouse – but from a future perspective, the next 10 years of analytics, I see that on the data lake. Either one you pick, the good part is you do have a layer on top that can abstract that and can give you access across both. And so you have now the ability which didn’t exist before, to actually query across multiple data sources.

Typically, we’re seeing that it’s the data lake, most people have a data lake, and then they want to query maybe one or two other sources. That’s the use case that we’re seeing. In addition, the cloud, you know, you talked about cloud and full service, is becoming a big, a big, big criteria for users, because installing tuning, then kind of ingesting data, running performance benchmarks, tuning some more, that phase of three to six to nine months of your blog, running POCs is not helping anyone.

Frankly, it doesn’t help the vendors either, because we want to create value for customers as soon as possible, right. And so with these managed services, what we’ve done with Ahana is we’ve taken a three- or six-month process of installing and tuning down to a 30-minute process where you can actually run SQL on S3 and get started in 30 minutes. This is in your environment, on your S3, using your catalog; it might be AWS Glue, or it might be a Hive metastore. And that is progress from where we were. And so the data platform team can create value for their analysts, the data scientists, the data engineers a lot sooner than with some of these other installed products.

So I see it as a few different dimensions, figure out your requirements, and then try to understand how much time you want to spend on the operational aspects of it. Increasingly, fully managed services are being picked because of the lower operational costs, and the faster time to insight from a data perspective.

John Santaferraro | Industry Analyst 

Great. So the other thing I want to leave you with as an audience is some considerations for any unified analytics decision. There are eight areas here to drill down into, I’m not going to go deep into these, but I want to provide this for you. So you can be thinking about eight areas of consideration as you’re choosing a unified analytics solution.

From a data perspective, what is the breadth of data that can be covered by this particular approach, in terms of unified analytics, moving forward? Looking at support for a broad range of different types of analytics, not just SQL, but Python, Notebook, search, anything that that enhances your analytic capabilities and broadens them, you want to make sure that your solution supports a broad set of users on a single platform. Everybody from the engineer to the business, and the analyst, and the scientist in between. It’s got to be cloud. In my opinion, Cloud is the future. Does the platform support enterprise requirements? All of the business requirements is it cost efficient from a from a business perspective? And then drilling down into the cloud. Looking at things like elasticity, which is automation. Scalability, mobility, because everything’s going mobile and [inaudible]? Am I able to do this as I expand to new regions?

In terms of drilling down on the enterprise – looking at security, privacy, governance, unification for the business is there – Does it support business semantics for my organization, and the logic that I want to include in it? Either in the product or on a layer above right either in some cases, it’s going to be through partners. Is it is it going to allow me to create measurable value for my organization and optimize? Create more value over time and then finally, in terms of costs, is it going to allow me to forecast my cost accurately? Contain costs over time? Looking at things like chargeback and scale, cost at scale. As this thing grows, anybody that’s doing analytics, that analytics program is growing.

So it’s got to be able to scale without just multiplying and creating incremental costs as you grow.

Dipti Borkar | Ahana 

One more thing I would add to costs, John, is the starting cost. So the initial cost to even try it out and get started. This is important, because even the way platform teams evaluate products and technologies is changing. And they will want the ability to have a pay-as-you-go model. We’re seeing that be quite useful for them, because sometimes you don’t know until you’ve tried it out for a period of time.

What the cloud is enabling is a pay-as-you-go model, so you only pay for what you use. It’s a consumption-based model; it might be compute hours, it might be storage, and different vendors do it in different ways, but it is important. Make sure you have that option, because it gives you the flexibility to try things out in parallel, and you don’t have to pay an exorbitant starting cost just to evaluate a technology. The cloud now makes that option available.

John Santaferraro | Industry Analyst 

Yeah, good point, Dipti. So I had the privilege of interviewing Uber, both a user and a developer of Presto, and what an incredible story. I was blown away. First of all, the hyperscale of analytics: analytics is core to everything that Uber does. The hyperscale was amazing – 10,000 cities – and I’m just going to say it all, even though it’s right there in front of you to read, because it’s amazing. 18+ million trips every single day. They now have 256 petabytes of data, and they’re adding 35 petabytes of new data every day. They’re going to go to exabytes. They have 12,000 monthly active users of analytics running more than 400,000 queries every single day. And all of that is running on Presto. They have all the enterprise readiness capabilities: automation, workload management, running complex queries, security. It’s an amazing story. Dipti, you know this story well. What stands out to you about Uber, not just their use of Presto, but their development of it as well?

Dipti Borkar | Ahana 

Yeah, absolutely. It’s an incredible story, and there are many other incredible stories like it where Presto is being used at scale. If we refer back to your earlier chart about scale, where the data lake fits in and where the data warehouse fits in, you probably would not be able to do this with a data warehouse. In fact, they migrated off a data warehouse – it was Vertica, I believe – to Presto, and they’ve completed that migration. And not just that: they have other databases that sit next to Presto that Presto also queries.

So this is as perfect a use case as any for the unified analytics slide you presented earlier. Not only is it running on a data lake with petabytes and petabytes of information, it’s also abstracting and unifying across a couple of different systems, and Presto is being used for both. It is the de facto query engine for the lake, and it helps in cases where you need to do a join or correlation across a couple of different databases. The other thing I’d add here is that not everybody is at Uber scale.

How many internet companies are there? But what we’re seeing is that users and platform teams throw away a lot of data, and don’t store it, because of the cost implications of warehouses. The traditional warehouses, and also the cloud warehouses, can roughly double the cost: you have the data in your lake, but you also have to ingest it into another warehouse, so you’re duplicating the storage cost and paying quite a bit more for the warehouse. Instead of throwing away the data because it’s cost prohibitive, that’s where the lake helps. Store it in S3; you don’t have to put compute on it today.

But tomorrow, let’s say that data starts to become more interesting. You can very easily convert it to Parquet or another format – Presto can query JSON and many other formats – and query it with Presto on top from an analytics perspective, and correlate it with other data you have in S3. So instead of aggregating away and losing data, treat data as an asset; most businesses are thinking about it that way. It isn’t on your balance sheet yet, but there will be a time when you actually weigh the importance of the data you have.

If you have the ability to store all this data now, because it is cheap – you can use Glacier storage, S3; AWS has really great [inaudible] where many different tiers of storage are possible – that is a starting point. That way you keep the option of building a very powerful lake on top of that data, if and when you choose to. So, just a few thoughts on that.
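To make that point concrete, here is a minimal sketch of what storing raw data in S3 and querying it later with Presto can look like, using the Hive connector. The catalog, schema, table, and column names, and the S3 path, are hypothetical and only for illustration:

-- Register raw JSON files that already sit in S3 as an external table
CREATE TABLE hive.events.raw_events (
    event_time  TIMESTAMP,
    user_id     VARCHAR,
    event_type  VARCHAR
)
WITH (
    format = 'JSON',
    external_location = 's3://example-data-lake/raw/events/'
);

-- Later, rewrite the interesting data as Parquet for faster, cheaper scans
CREATE TABLE hive.events.events_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM hive.events.raw_events;

-- And query it like any other table
SELECT event_type, count(*) AS event_count
FROM hive.events.events_parquet
GROUP BY event_type;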

John Santaferraro | Industry Analyst 

Yeah, I think the other thing I was impressed with, and this is relevant to a company of any size, is the breadth of use cases they’re able to run on Presto. They’re doing ETL, data science, exploration, OLAP, and federated queries all on this single platform. And they really are contributing back to the Presto open-source code: pushing real-time capabilities with its connection to Pinot, sampling – being able to run queries on a sample of the data automatically – and writing more optimizations to increase performance. You probably are intimately involved in the open-source projects listed here as well.

So I think it bodes well for the future of Presto and for the future of Ahana.

Dipti Borkar | Ahana 

Yeah, it’s incredible to be partnering in a community-driven project. Presto is part of the Linux Foundation, and so it’s a community-driven project – Facebook, Uber, Twitter, Alibaba founded it – and Ahana is a very early member of the project.

We contribute back, and we work together. Project Aria, for example, which you see here, came out of Facebook for optimizing ORC; we are working on Aria for Parquet, since Parquet is a popular format that Uber, Facebook, and other users can benefit from as well. There are other projects too, for example the multiple-coordinator project. Presto initially had just one coordinator, and now there’s an alpha available with multiple coordinators, which extends Presto’s scale even further and reduces scheduling limitations. We were already talking about thousands of nodes, but in case you need more, it can go even beyond that. These are important.

These are important innovations. The performance and scale dimensions tend to come from Facebook and Uber, and we are also working on some performance areas. But the enterprise aspects – security, governance, high availability, cloud readiness – those are what Ahana is focused on and bringing to the community as well. We have a second-half roadmap for Presto, and I’m excited to see how that comes along.

John Santaferraro | Industry Analyst 

Awesome. So we started this session by talking about the complexity of Hadoop and open source when it was first launched. And quite frankly, nobody wants to manage 1,000 nodes of Presto, unless you’re Ahana, maybe. So let’s talk about Ahana. What have you done to simplify the use of Presto and make it immediately available for anybody who wants to use it? What’s going on with Ahana?

Dipti Borkar | Ahana 

Yeah, absolutely. And maybe, Ali, if I can share a couple of slides, I’ll bring up what that looks like in a minute. Okay. John, do you see the screen? Alright? Yes, I do. Okay, great. So Ahana is essentially a SaaS platform for Presto. We’ve built it to be fully integrated, cloud native, and fully managed, which gives you the best of both worlds. It gives you visibility into your clusters, the number of nodes, and things like that, but it’s also built to be very, very easy, so you don’t have to worry about installing, configuring, or tuning a variety of things.

How it works is pretty straightforward. You go in and sign up for Ahana and create an account. Next, we create a compute plane in the user’s account – in your account – and set up the environment for you. This is a one-time step that takes about 20 to 30 minutes: bringing up your Kubernetes cluster, setting up your VPC, and configuring your entire environment, all the way from the networking at the top to the operating system below. From that point, you’re ready to create any number of Presto clusters, and it’s a single pane of glass that allows you to create different clusters for different purposes.

You might have an interactive workload on one cluster and a transformation workload on another cluster, and you can scale and manage them independently. So it’s really straightforward and easy to get started. All of this is also available through the AWS Marketplace. We’re an AWS-first company, and the product is available pay as you go, so we only charge for the Presto usage you have on an hourly basis. That’s really how it works.

At a high level, to summarize some of the important aspects of the platform, one of the key decisions we made is: do you bring data to compute, or do you take compute and move it to the data? I thought about it from a user perspective, and this was an important design decision. Data is incredibly valuable, as I said earlier, and users don’t want to move it out of their environment. Snowflake and other data warehouses are doing incredibly well, but if users had a choice, they would keep the data in their own environment. So what we’ve done is run everything that touches data – the Presto clusters, the Hive metastore, even Superset (we include an instance of Superset that provides an admin console for Ahana) – in the user’s environment and the user’s VPC. None of this information ever crosses over to the Ahana SaaS, and that’s very important.

From a governance perspective, there are increasingly a lot of GDPR requirements and so on, and that’s the way it’s designed at a high level. Of course, as you mentioned, John, we connect to the data lake; that’s our primary path. About 80% of the workloads we see are on S3, but 5% to 10% might be on some of the other data sources. You can federate across RDS or MySQL, a Redshift data warehouse, Elasticsearch, and others, and we have first-class integrations with Glue. Again, it’s very easy to integrate: you can bring your own catalog, or you can have one created with a click of a button in Ahana.

You can bring your own tools on top; it’s standard JDBC and ODBC. As you said, SQL is the lingua franca, and Presto is ANSI SQL. That makes it very easy to get started with any tool on top and to integrate it into your environment.
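To make the federation point concrete, here is a minimal sketch of a single Presto query that joins data on S3 with an operational database. The catalog, schema, table, and column names are hypothetical and only for illustration:

-- Join fact data on S3 (hive catalog, registered in Glue or a Hive metastore)
-- with a dimension table living in MySQL (mysql catalog), in one query
SELECT c.customer_name,
       sum(o.order_total) AS lifetime_value
FROM hive.sales.orders AS o
JOIN mysql.crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 10;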

So that’s a little bit about Ahana. And I think that might bring us to the end of our discussion here.

Ali LeClerc | Ahana 

Great. Well, thank you, Dipti and John. What a fantastic discussion. I hope everybody got a good overview of data lakes, data warehouses, what’s going on in the market, and how to make a decision on which way to go. So, we have a bunch of questions. I don’t think we’re going to have enough time to get to all of them, so I’m going to ask some of the more popular ones that have kept popping up.

So, first up is a question about Presto. Dipti, this one is probably for you: is Presto a means of data virtualization?

Dipti Borkar | Ahana 

Yeah, that’s a good question. Presto was built as a data lake engine, but given its pluggable architecture, it is also able to support what people call virtualization. I would say virtualization is an overloaded term; it means many things. But if it means accessing different data sources, then yes, Presto is capable of doing that, as you just saw in my last slide.

Ali LeClerc | Ahana 

Great. And by the way, folks, we do have another webinar; this is the first webinar in our series. Next week, we’ll be going into more detail on how you can actually do SQL on the data lake. I highly recommend checking that out if you’re interested in learning more and going a bit deeper. I dropped the link to register in the chat box, so feel free to do that.

So, a question, I think, for both of you. Dipti, earlier you touched on this idea of augmenting the data warehouse versus perhaps skipping the data warehouse altogether, and Dipti and John, I think you each bring a different perspective to that. What are you seeing in the market? Are people facing that decision, or is it leaning one way or the other? What’s going on around augmenting versus skipping?

John Santaferraro | Industry Analyst 

One of the trends I’m seeing is that when data originates in the cloud, it tends to stay in the cloud, and it tends to move to a modern architecture. In truly digital instances, rather than taking digital data and trying to get it back into a legacy or traditional data warehouse, organizations are almost always putting it into a data lake and into what I love that you term, Dipti, the Open Data Warehouse, using those open formats.

That said, people continue to migrate to the cloud. When I was at EMA, we saw that approximately 53% of data was already in the cloud, but that means 47% of the data is still on premises. If the data is already there, and in a database, that migration may or may not make sense; you have to weigh the value. And oftentimes the value is having it all in a single unified analytics warehouse.

Dipti Borkar | Ahana 

Right. What I would say is that it depends on whether you’re in the cloud or on premises. Most of our discussion has been about the cloud, because we are forward-looking people, forward thinkers, but the truth is there really is a lot of data on premises. On premises, what we’re seeing is that it will almost always be augment.

Most folks will have a warehouse, whether it’s Vertica, Teradata, DB2, or Oracle, and they might have an HDFS-based, Hadoop-style system on the side. That would be augment; that’s more traditional. In the cloud, I think we’re seeing both. Users who have been on the warehouse are choosing to augment rather than migrate off completely, and I think that is the right thing to do. You do want a period of time – and when I say a period, I mean years – because if you have a very mature warehouse, it will take some time to migrate that workload over to the lake. So new workloads will land on the lake, and old workloads will slowly migrate off. That’s the way we see it: really augment, for a period of time.

You know, I often joke that mainframes are still around, so warehouses aren’t going anywhere; that’s the argument. Now, the pre-warehouse users, the ones who don’t have a warehouse yet, are choosing to skip it, and I would say that percentage will continue to increase. I’m seeing that about 20-30% are choosing to skip the warehouse, and that will only increase as more capabilities get built on the lake. Transactionality is very early right now, and governance is just starting to get to column-level and row-level controls, filtering, masking, and so on. So there’s some work to be done.

We have our work cut out for us on the lake. I see it as a three-to-five-year period over which this will keep moving, and more and more users will end up skipping the warehouse and going to the lake. But today it depends on the use case; for the simpler use cases, we are seeing about 20-30% go directly to the lake.

Ali LeClerc | Ahana

Wonderful. So, with that, I think we are over time now. We appreciate everybody who stuck around and stayed a few minutes past the hour, and we hope that you enjoyed the topic. John, Dipti – what a fantastic conversation. Thanks for sharing your insights into this topic. With that, everybody, thank you. Thanks for staying with us; we hope to see you next week and see you next time. Thank you.

Speakers

John Santaferraro
Industry Analyst

Dipti Borkar
Cofounder & CPO, Ahana

How do I use the approx_percentile function in Presto?

The Presto approx_percentile function is one of the approximate aggregate functions; it returns an approximate percentile for a set of values (e.g. a column). In this short article, we explain how to use the approx_percentile function.

What is a percentile?

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a well-known, and arguably the most famous, distribution: the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and a standard deviation of 10.

If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less than the mean and half would be above the mean. Hence the mean, or 100 in this case, would be the 50th percentile of the data. It turns out that the 90th percentile would be approximately 112.82 (that is, 100 + 1.2816 × 10, where 1.2816 is the 90th-percentile z-score of the standard normal distribution); this means that 90% of the data points are less than 112.82.

approx_percentile by example

To solidify our understanding of this and the approx_percentile function, we’ve created a few tables to use as examples:

presto:default> show tables;
    Table
-------------
 dummy
 norm_0_1
 norm_100_10
 norm_all
(4 rows)
 Table       | Description                                                                        | Number of Rows
-------------+------------------------------------------------------------------------------------+----------------
 dummy       | Single-column, 100-row table of all ones except for a single value of 100.        | 100
 norm_0_1    | Samples from a normal distribution with mean of 0 and standard deviation of 1.    | 5000
 norm_100_10 | Samples from a normal distribution with mean of 100 and standard deviation of 10. | 5000
 norm_all    | Coalescence of all normal distribution tables.                                     | 10000

Table 1

The approx_percentile function has eight type signatures. You are encouraged to review the Presto public documentation for all of the function variants and their official descriptions. The set of values (e.g. a column) is a required parameter and is always the first argument.
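As a quick illustration, here is a minimal sketch of the most common two-argument form, run against the norm_100_10 table above. The column name value is assumed for illustration; substitute the actual column name in your table:

-- Approximate 50th and 90th percentiles of the sampled values
SELECT approx_percentile(value, 0.5) AS p50,  -- should be close to the mean, ~100
       approx_percentile(value, 0.9) AS p90   -- should be close to ~112.82
FROM norm_100_10;

Because the function is approximate, the results will be close to, but not exactly equal to, the true percentiles.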

Another required parameter is the percentage parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The percentage parameter can either be the sec